I am a researcher and R novice working with a large dataset made up of many small Excel files that need to be read together. I have imported these into R using read_excel() and combined them into one large table, but I am running into issues formatting the data appropriately for analysis.
To describe the dataset briefly: each subject was tested on two different days and was exposed to different terms in different conditions. I have a variable "Term", a variable "Condition", and a variable "Origin". The Origin variable comes from using rbindlist(idcol = "Origin", fill = TRUE), so each value is a (long) file path, which I have redacted below except for the identifying information: the first part, "SUBXX", is the subject number, and the second part, "S0X", is the day (either 1 or 2) they were tested on. See a small example dataset below:
    df <- data.frame(Origin = c(
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB01S02xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S01xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx",
      "C:/Users/xxxxx/xxxxxxxx/xxxxxxx/xxxx/xxxxxxxxxxxx/xxxxxxxxxxxxxxx/SUB02S02xxxx.xlsx"))
    df$Term <- c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow", "Dog", "Bug", "Cow", "Mouse", "Bat", "Cat")
    df$Condition <- c("M", "L", "L", "L", "M", "M", "L", "L", "M", "M", "M", "M")

    > df
       Origin(shortened) Term  Condition
    1  SUB01_S01         Owl   M
    2  SUB01_S01         Dog   L
    3  SUB01_S01         Rat   L
    4  SUB01_S02         Fox   L
    5  SUB01_S02         Cat   M
    6  SUB01_S02         Cow   M
    7  SUB02_S01         Dog   L
    8  SUB02_S01         Bug   L
    9  SUB02_S01         Cow   M
    10 SUB02_S02         Mouse M
    11 SUB02_S02         Bat   M
    12 SUB02_S02         Cat   M
Lastly, I have a list of all possible terms:

    termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")
What I want to do is 1) order the data frame so that the terms appear in the same order as in termList, and 2) add the terms that do not appear for each participant, setting their Day to 0 and their Condition to "U". Additionally, I want to replace the "Origin" column with two separate columns, one containing the participant ID and the other containing the day.
Desired result:
       ParNum Term  Condition Day
    1  1      Cat   M         2
    2  1      Dog   L         1
    3  1      Cow   M         2
    4  1      Rat   L         1
    5  1      Fox   L         2
    6  1      Bug   U         0
    7  1      Owl   M         1
    8  1      Bat   U         0
    9  1      Mouse U         0
    10 1      Bear  U         0
    11 2      Cat   M         2
    12 2      Dog   L         1
    13 2      Cow   M         1
    14 2      Rat   U         0
    15 2      Fox   U         0
    16 2      Bug   L         1
    17 2      Owl   U         0
    18 2      Bat   M         2
    19 2      Mouse M         2
    20 2      Bear  U         0
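For the Origin split, one possible base R sketch uses a regular expression on the file name. The shortened paths and the names `origin`, `file_name`, `pattern`, `ParNum` and `Day` are illustrative only, and the sketch assumes every file name starts with `SUB<digits>S<digits>`.

```r
# Hypothetical, shortened paths standing in for the real (redacted) ones
origin <- c("C:/Users/xxxxx/SUB01S01xxxx.xlsx",
            "C:/Users/xxxxx/SUB02S02xxxx.xlsx")

# Work on the file name only, so folder names cannot interfere
file_name <- basename(origin)

# Capture the digits after "SUB" (participant) and after "S" (day)
pattern <- "^SUB([0-9]+)S([0-9]+).*$"
ParNum <- as.integer(sub(pattern, "\\1", file_name))
Day    <- as.integer(sub(pattern, "\\2", file_name))

data.frame(ParNum, Day)
```

Because this keys on the file name pattern rather than on character positions, it should keep working if the folder part of the path changes length.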
I am not a CompSci person so I usually build my way up from little problems to the larger ones. Starting small, I tried to use R's inbuilt apply() functions, as well as setdiff, to find the concepts which don't appear for each SubID. The following code:
    df %>% group_by(Origin) %>% tapply(setdiff(termList, df$Term))
only returned a single 1, which is confusing. Shouldn't setdiff() return a character vector (i.e. whichever terms are missing)? Trying the other options, lapply() and sapply(), both returned the message "object 'Bear' of mode 'function' was not found".
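One thing worth noting about that attempt: `df$Term` always refers to the whole, ungrouped column, so `group_by()` has no effect on what `setdiff()` sees. A minimal base R sketch of a per-subject comparison follows; the shortened Origin labels are hypothetical stand-ins for the full paths, and `subject` and `missing_terms` are illustrative names.

```r
termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")

# Toy data with shortened Origin labels (stand-ins for the full file paths)
df <- data.frame(
  Origin = rep(c("SUB01S01", "SUB01S02", "SUB02S01", "SUB02S02"), each = 3),
  Term   = c("Owl", "Dog", "Rat", "Fox", "Cat", "Cow",
             "Dog", "Bug", "Cow", "Mouse", "Bat", "Cat"),
  stringsAsFactors = FALSE
)

# Subject ID = the digits right after "SUB"
subject <- sub(".*SUB([0-9]+)S[0-9]+.*", "\\1", df$Origin)

# Split the terms by subject, then take the set difference within each group
missing_terms <- lapply(split(df$Term, subject),
                        function(seen) setdiff(termList, seen))
missing_terms
```

The result is a named list with one character vector of unseen terms per subject, in `termList` order.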
I also attempted a for loop, again by starting small and just trying to find the missing terms for each SubId. The following:
    mismatch <- character()
    for (i in df$Origin) {
      mismatch <- setdiff(termList, tbl$origin)
    }
Returned
    [1] "Cat"   "Dog"   "Cow"   "Rat"   "Fox"   "Bug"   "Owl"   "Bat"   "Mouse"
    [10] "Bear"
But I was expecting a subset of terms for each SubID. Could anyone give any advice?
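Looking at the loop as written, the loop variable `i` is never used in the body, the comparison is against `tbl$origin` rather than the terms, and `mismatch` is overwritten on every pass, so each iteration computes the same thing. A hedged sketch of a loop that keeps one result per subject is below; the vectors `subject` and `term` are hypothetical toy data, not the real table.

```r
termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")

# Hypothetical toy vectors: one subject label and one term per row
subject <- c("01", "01", "01", "02", "02")
term    <- c("Owl", "Dog", "Rat", "Dog", "Bug")

# Store one character vector per subject instead of overwriting a single object
mismatch <- list()
for (id in unique(subject)) {
  seen <- term[subject == id]                # terms this subject actually saw
  mismatch[[id]] <- setdiff(termList, seen)  # terms this subject never saw
}
mismatch
```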
EDIT: I used the solution proposed by Edward below, namely:
    library(dplyr)
    library(tidyr)

    #3 replace the "origin" column with two separate columns, one
    #  containing participant ID and the other containing day.
    separate_wider_position(df, Origin, widths=c(69, ParNum=2, 1, Day=2, 9)) |>
      mutate(Term=factor(Term, levels=termList),
             Day=as.numeric(Day)) |>
    #2 add the terms that do not appear for each participant, noting
    #  their Day as 0 and their condition as "U".
      complete(ParNum, Term, fill = list(Day=0, Condition="U")) |>
    #1 order the dataframe so that the terms appear in the same order
    #  as the termList,
      arrange(ParNum, Term)
Which works. However, there is one other problem I forgot to mention: in my full dataset, each concept appears twice in the spreadsheet (same condition each time). So the sorted list using the above method doubles any concept which isn't in condition "U", like so:
       ParNum Term    Day Condition
       <chr>  <fct> <dbl> <chr>
     1 1      Cat       0 U
     2 1      Dog       1 L
     3 1      Dog       1 L
     4 1      Cow       0 U
     5 1      Rat       1 L
     6 1      Rat       1 L
     7 1      Fox       2 L
     8 1      Fox       2 L
     9 1      Bug       0 U
    10 1      Owl       1 M
    11 1      Owl       1 M
    12 1      Bat       0 U
    13 1      Mouse     0 U
    14 1      Bear      0 U
There is no reason for me to retain these doubles so I'd just like to get rid of them. Is such a thing possible?
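One way this might be done is to drop the repeated rows with dplyr's `distinct()` before calling `complete()`. The sketch below is an assumption-laden toy: `parsed` is a hypothetical data frame in which Origin has already been split into ParNum and Day, and every row is duplicated to mimic the full dataset.

```r
library(dplyr)
library(tidyr)

termList <- c("Cat", "Dog", "Cow", "Rat", "Fox", "Bug", "Owl", "Bat", "Mouse", "Bear")

# Hypothetical toy data: each term appears twice, same condition each time
parsed <- data.frame(
  ParNum    = c("01", "01", "01", "01"),
  Day       = c(1, 1, 2, 2),
  Term      = c("Dog", "Dog", "Fox", "Fox"),
  Condition = c("L", "L", "L", "L")
)

result <- parsed |>
  # keep the first row for each participant/term pair
  distinct(ParNum, Term, .keep_all = TRUE) |>
  mutate(Term = factor(Term, levels = termList)) |>
  # add the unseen terms with Day 0 and Condition "U"
  complete(ParNum, Term, fill = list(Day = 0, Condition = "U")) |>
  arrange(ParNum, Term)

result
```

Using `distinct(ParNum, Term, .keep_all = TRUE)` rather than a bare `distinct()` is a design choice: it still collapses the pairs if some other column in the full table happens to differ between the two copies.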