
Sorting columns and adding new variables using setdiff in R

I am a researcher and R novice working with a large dataset made up of many small Excel files that need to be read together. I have imported these into R using the read_excel functions and combined them into one large table, but I am running into issues formatting the data appropriately for analysis.

As a brief description of the dataset: I have different subject IDs who were tested on two different days and who were exposed to different terms in different conditions. Basically, I have a variable "Term", a variable "Condition", and a variable "Origin". The Origin variable is a result of using rbindlist(idcol = "Origin", fill=T), so each value is a (long) file path, which I have redacted below except for the identifying information: the first part "SUBXX" represents my subject number and the second part "S0X" represents the day (either 1 or 2) they were tested on. See a small example dataset below:

df <- data.frame(Origin = c(
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx",
  "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx"))
df$Term <- c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow", "Dog", "Bug", "Cow", "Mouse", "Bat", "Cat")
df$Condition <- c("M", "L", "L", "L", "M", "M", "L", "L", "M", "M", "M", "M")

> df
   Origin(shortened) Term  Condition
1  SUB01_S01         Owl   M
2  SUB01_S01         Dog   L
3  SUB01_S01         Rat   L
4  SUB01_S02         Fox   L
5  SUB01_S02         Cat   M
6  SUB01_S02         Cow   M
7  SUB02_S01         Dog   L
8  SUB02_S01         Bug   L
9  SUB02_S01         Cow   M
10 SUB02_S02         Mouse M
11 SUB02_S02         Bat   M
12 SUB02_S02         Cat   M

Lastly, I have a list of all possible terms:

termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")

What I want to do is to 1) order the dataframe so that the terms appear in the same order as the termList, and 2) add the terms that do not appear for each participant, noting their Day as 0 and their condition as "U". Additionally, I want to replace the "origin" column with two separate columns, one containing participant ID and the other containing day.

Desired result:

   ParNum Term  Condition Day
1  1      Cat   M         2
2  1      Dog   L         1
3  1      Cow   M         2
4  1      Rat   L         1
5  1      Fox   L         2
6  1      Bug   U         0
7  1      Owl   M         1
8  1      Bat   U         0
9  1      Mouse U         0
10 1      Bear  U         0
11 2      Cat   M         2
12 2      Dog   L         1
13 2      Cow   M         1
14 2      Rat   U         0
15 2      Fox   U         0
16 2      Bug   L         1
17 2      Owl   U         0
18 2      Bat   M         2
19 2      Mouse M         2
20 2      Bear  U         0
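
For the last part (splitting Origin into a participant ID and a day), one option is a regular expression on the file name. This is only a minimal base-R sketch: it assumes every file name follows the SUBxxSxx pattern shown in the example, and the shortened paths and the names `origin`, `file_name`, `pattern`, `par_num` and `day` are made up for illustration.

```r
# Hypothetical, shortened paths that follow the SUBxxSxx naming pattern
origin <- c("C:/Users/xxxxx/SUB01S01xxxx.xlsx",
            "C:/Users/xxxxx/SUB02S02xxxx.xlsx")

# basename() drops the directory part, leaving e.g. "SUB01S01xxxx.xlsx"
file_name <- basename(origin)

# Capture the two digits after "SUB" and the two digits after "S"
pattern <- "^SUB([0-9]{2})S([0-9]{2}).*$"
par_num <- as.integer(sub(pattern, "\\1", file_name))
day     <- as.integer(sub(pattern, "\\2", file_name))

data.frame(ParNum = par_num, Day = day)
```

Because the pattern is anchored to the file name rather than to character positions, it should not depend on how long the directory part of the path is.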

I am not a CompSci person so I usually build my way up from little problems to the larger ones. Starting small, I tried to use R's inbuilt apply() functions, as well as setdiff, to find the concepts which don't appear for each SubID. The following code:

df %>%
  group_by(Origin) %>%
  tapply(setdiff(termList, df$Term))

only returned a single 1, which is confusing. Shouldn't setdiff() return a character vector (i.e. whatever term is missing)? Trying the other options, lapply() and sapply(), both returned the message "object 'Bear' of mode 'function' was not found".
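
One thing worth noting about that attempt: `df$Term` inside the call is always the full, ungrouped column, because group_by() only affects dplyr verbs. A per-group setdiff() could look roughly like the sketch below. It uses base R's split() and lapply() rather than the pipeline above, and the shortened Origin labels and the name `missing_terms` are hypothetical.

```r
# Small stand-in for the data; the short Origin labels are hypothetical
df <- data.frame(
  Origin = c("SUB01S01", "SUB01S01", "SUB02S01"),
  Term   = c("Owl", "Dog", "Cat"),
  stringsAsFactors = FALSE
)
termList <- c("Cat", "Dog", "Owl", "Bear")

# split() gives one character vector of terms per Origin;
# lapply() then runs setdiff() separately on each of them
missing_terms <- lapply(split(df$Term, df$Origin),
                        function(terms) setdiff(termList, terms))
missing_terms
```

The result is a named list with one character vector of missing terms per Origin.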

I also attempted a for loop, again by starting small and just trying to find the missing terms for each SubId. The following:

mismatch <- character()
for (i in df$Origin) {
  mismatch <- setdiff(termList, tbl$origin)
}

Returned

 [1] "Cat"   "Dog"   "Cow"   "Rat"   "Fox"   "Bug"   "Owl"   "Bat"   "Mouse"
[10] "Bear"

But I was expecting a subset of terms for each SubID. Could anyone give any advice?
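
Looking at the loop again, it appears to compare termList against the origin column (file paths) rather than the terms, never uses `i`, and overwrites `mismatch` on every pass, which would explain getting every term back. A minimal sketch of a loop that does keep one result per Origin is below, assuming the same column names as the example data; the shortened Origin labels are hypothetical.

```r
# Small stand-in for the data; the short Origin labels are hypothetical
df <- data.frame(
  Origin = c("SUB01S01", "SUB01S01", "SUB02S01"),
  Term   = c("Owl", "Dog", "Cat"),
  stringsAsFactors = FALSE
)
termList <- c("Cat", "Dog", "Owl", "Bear")

mismatch <- list()
for (i in unique(df$Origin)) {
  # Compare against the terms of this Origin only, and store the
  # result under its own name instead of overwriting one variable
  mismatch[[i]] <- setdiff(termList, df$Term[df$Origin == i])
}
mismatch
```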

EDIT: I used the solution proposed by Edward below, namely:

#3 replace the "origin" column with two separate columns, one
# containing participant ID and the other containing day.
separate_wider_position(df, Origin,
                        widths=c(69, ParNum=2, 1, Day=2, 9)) |>
  mutate(Term=factor(Term, levels=termList),
         Day=as.numeric(Day)) |>
#2 add the terms that do not appear for each participant, noting
# their Day as 0 and their condition as "U".
  complete(ParNum, Term, fill = list(Day=0, Condition="U")) |>
#1 order the dataframe so that the terms appear in the same order
# as the termList,
  arrange(ParNum, Term)

Which works. However, there is one other problem I forgot to mention: in my full dataset, each concept appears twice in the spreadsheet (same condition each time). So the sorted list using the above method doubles any concept which isn't in condition "U", like so:

   ParNum Term    Day Condition
   <chr>  <fct> <dbl> <chr>
 1 1      Cat       0 U
 2 1      Dog       1 L
 3 1      Dog       1 L
 4 1      Cow       0 U
 5 1      Rat       1 L
 6 1      Rat       1 L
 7 1      Fox       2 L
 8 1      Fox       2 L
 9 1      Bug       0 U
10 1      Owl       1 M
11 1      Owl       1 M
12 1      Bat       0 U
13 1      Mouse     0 U
14 1      Bear      0 U

There is no reason for me to retain these doubles so I'd just like to get rid of them. Is such a thing possible?
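
Since the doubled rows are identical in every column, dropping exact duplicates may be enough. Below is a minimal base-R sketch using unique(); the data frame `res` is a hypothetical stand-in holding the first few rows of the output above, not the full result.

```r
# Stand-in for the completed table; values copied from the first rows above
res <- data.frame(
  ParNum    = c("1", "1", "1", "1"),
  Term      = c("Cat", "Dog", "Dog", "Cow"),
  Day       = c(0, 1, 1, 0),
  Condition = c("U", "L", "L", "U"),
  stringsAsFactors = FALSE
)

# unique() on a data frame keeps the first copy of each fully identical row
deduped <- unique(res)
deduped
```

Inside a dplyr pipeline, distinct() called with no column arguments does the same job, so it could presumably be added as one more step in the chain from the EDIT.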

