Abstract
In this vignette, we use the R package groupdata2 to automatically detect and create groups in a dataset using the l_starts method.
groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data.
For a more extensive description of groupdata2, please see Description of groupdata2.
Contact author at r-pkgs@ludvigolsen.dk
In this vignette, we will use the l_starts method with group() to transfer information from one dataset to another. We will use the automatic option, where group() finds the group starts by itself.
library(groupdata2)
library(dplyr) # %>%
library(knitr) # kable
Three participants were asked to solve a task. They had to take turns but could go for multiple runs of the task before taking a break and letting the next participant take over. They had two turns (called sessions) each, so there were six sessions in total with multiple runs per session. A team of experts rated how well the participant did across the entire session. If a participant had some bad runs, they therefore had to choose between saving energy for their other session and trying to improve the rating of the current one.
For each run of the task, we recorded how many errors the participant made.
df_observations <- data.frame(
  "run" = 1:30,
  "participant" = c(
    1, 1, 1, 1,
    2, 2, 2, 2, 2, 2,
    3, 3, 3, 3,
    1, 1, 1, 1, 1, 1, 1,
    2, 2, 2,
    3, 3, 3, 3, 3, 3),
  "errors" = c(
    3, 2, 5, 3,
    0, 0, 1, 1, 0, 1,
    6, 4, 3, 1,
    2, 1, 3, 2, 1, 1, 0,
    0, 0, 1,
    3, 3, 4, 2, 2, 1)
)

# Show the first 20 rows of data frame
df_observations %>% head(20) %>% kable()
run | participant | errors |
---|---|---|
1 | 1 | 3 |
2 | 1 | 2 |
3 | 1 | 5 |
4 | 1 | 3 |
5 | 2 | 0 |
6 | 2 | 0 |
7 | 2 | 1 |
8 | 2 | 1 |
9 | 2 | 0 |
10 | 2 | 1 |
11 | 3 | 6 |
12 | 3 | 4 |
13 | 3 | 3 |
14 | 3 | 1 |
15 | 1 | 2 |
16 | 1 | 1 |
17 | 1 | 3 |
18 | 1 | 2 |
19 | 1 | 1 |
20 | 1 | 1 |
df_ratings <- data.frame(
  "session" = c(1:6),
  "rating" = c(3, 8, 2, 5, 9, 4)
)

df_ratings %>% kable()
session | rating |
---|---|
1 | 3 |
2 | 8 |
3 | 2 |
4 | 5 |
5 | 9 |
6 | 4 |
We would like to get the expert ratings into the data frame with observations. For this, we will first create a session column and then get the ratings for the sessions.
As the participants had differing numbers of runs, we must start a new session group whenever the participant column changes. This can be done with group() using the l_starts method. This method takes group start values, finds those values in a specified column, and creates groups that begin at the start values. To show this, let’s try it out with some manually entered start values before having group() find them automatically.
group(
  data = df_observations,
  n = c(1, 2, 3, 1, 2, 3), # Starting values
  method = 'l_starts',
  starts_col = 'participant',
  col_name = 'session'
) %>%
  kable()
run | participant | errors | session |
---|---|---|---|
1 | 1 | 3 | 1 |
2 | 1 | 2 | 1 |
3 | 1 | 5 | 1 |
4 | 1 | 3 | 1 |
5 | 2 | 0 | 2 |
6 | 2 | 0 | 2 |
7 | 2 | 1 | 2 |
8 | 2 | 1 | 2 |
9 | 2 | 0 | 2 |
10 | 2 | 1 | 2 |
11 | 3 | 6 | 3 |
12 | 3 | 4 | 3 |
13 | 3 | 3 | 3 |
14 | 3 | 1 | 3 |
15 | 1 | 2 | 4 |
16 | 1 | 1 | 4 |
17 | 1 | 3 | 4 |
18 | 1 | 2 | 4 |
19 | 1 | 1 | 4 |
20 | 1 | 1 | 4 |
21 | 1 | 0 | 4 |
22 | 2 | 0 | 5 |
23 | 2 | 0 | 5 |
24 | 2 | 1 | 5 |
25 | 3 | 3 | 6 |
26 | 3 | 3 | 6 |
27 | 3 | 4 | 6 |
28 | 3 | 2 | 6 |
29 | 3 | 2 | 6 |
30 | 3 | 1 | 6 |
Note how each session only has observations from a single participant.
group() went through the participant column and found one value from n at a time. When it encountered the value, it noted down the row index and continued down the column searching for the next value in n. In the end, it started groups at the found row indices from top to bottom.
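The search described above can be sketched in base R. This only illustrates the idea and is not the code groupdata2 runs internally; the names start_values, start_rows and pos are made up for the example.

```r
# Participant column from df_observations
participant <- c(
  1, 1, 1, 1,
  2, 2, 2, 2, 2, 2,
  3, 3, 3, 3,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 2,
  3, 3, 3, 3, 3, 3
)
start_values <- c(1, 2, 3, 1, 2, 3) # Same values as passed to n

start_rows <- integer(0)
pos <- 1 # Row to resume the search from

for (value in start_values) {
  # Rows that have not been searched yet
  remaining <- seq_len(length(participant) - pos + 1) + pos - 1
  # First remaining row holding the start value (NA if there is none)
  hit <- remaining[participant[remaining] == value][1]
  if (is.na(hit)) stop("Start value not found: ", value)
  start_rows <- c(start_rows, hit)
  pos <- hit + 1
}

start_rows

# A row belongs to the group of the latest start at or before it
session <- findInterval(seq_along(participant), start_rows)
session
```

The start rows found this way (1, 5, 11, 15, 22 and 25) are the rows where the session column changes in the table above.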
Since our data has the same value in the participant column for the entire session, we can actually get group() to find these group starts automatically. It will go through the given column and whenever it encounters a new value, i.e. one that is different from the previous row, it starts a new group.
df_observations <- group(
  data = df_observations,
  n = 'auto',
  method = 'l_starts',
  starts_col = 'participant',
  col_name = 'session'
)

df_observations %>%
  kable()
run | participant | errors | session |
---|---|---|---|
1 | 1 | 3 | 1 |
2 | 1 | 2 | 1 |
3 | 1 | 5 | 1 |
4 | 1 | 3 | 1 |
5 | 2 | 0 | 2 |
6 | 2 | 0 | 2 |
7 | 2 | 1 | 2 |
8 | 2 | 1 | 2 |
9 | 2 | 0 | 2 |
10 | 2 | 1 | 2 |
11 | 3 | 6 | 3 |
12 | 3 | 4 | 3 |
13 | 3 | 3 | 3 |
14 | 3 | 1 | 3 |
15 | 1 | 2 | 4 |
16 | 1 | 1 | 4 |
17 | 1 | 3 | 4 |
18 | 1 | 2 | 4 |
19 | 1 | 1 | 4 |
20 | 1 | 1 | 4 |
21 | 1 | 0 | 4 |
22 | 2 | 0 | 5 |
23 | 2 | 0 | 5 |
24 | 2 | 1 | 5 |
25 | 3 | 3 | 6 |
26 | 3 | 3 | 6 |
27 | 3 | 4 | 6 |
28 | 3 | 2 | 6 |
29 | 3 | 2 | 6 |
30 | 3 | 1 | 6 |
And it works! :)
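To see why comparing each row to the previous one is enough here, the same detection can be written in a few lines of base R. This is a rough sketch of the idea, not the package's own implementation; is_start is a name chosen for the example.

```r
# Participant column from df_observations
participant <- c(
  1, 1, 1, 1,
  2, 2, 2, 2, 2, 2,
  3, 3, 3, 3,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 2,
  3, 3, 3, 3, 3, 3
)

# TRUE where the value differs from the previous row.
# The first row has no previous row, so it always starts a group.
is_start <- c(TRUE, participant[-1] != participant[-length(participant)])

# Row indices of the group starts
which(is_start)

# Counting the starts seen so far gives the group (session) number
session <- cumsum(is_start)
session
```

Because each new start increases the running count by one, participant 1's second turn becomes session 4 rather than being merged with session 1.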
If you just want to find the group starts, you can use the find_starts() function. Alternatively, the differs_from_previous() function allows setting a threshold for how much the value must differ from the previous value.
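The threshold idea can be illustrated with base R on the errors from the first ten runs. This is not differs_from_previous() itself: the inclusive comparison and the choice to count changes in both directions are assumptions made for the sketch, so see ?differs_from_previous for the function's actual arguments and behaviour.

```r
# Errors from the first 10 runs in df_observations
errors <- c(3, 2, 5, 3, 0, 0, 1, 1, 0, 1)

# Hypothetical threshold: flag rows that moved at least 2 errors
# up or down relative to the previous row
threshold <- 2

# Difference between each row and the row before it
change <- diff(errors)

# diff() drops the first row, so shift the indices by one
big_change_rows <- which(abs(change) >= threshold) + 1
big_change_rows
```

With a threshold of 2, small fluctuations such as going from 0 to 1 error are ignored, while the jump from 2 to 5 errors at run 3 is flagged.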
Now that we have the session information, we can transfer the ratings from the ratings data frame.
df_merged <- merge(df_observations, df_ratings, by = 'session')

# Show head of df_merged
df_merged %>% head(15) %>% kable()
session | run | participant | errors | rating |
---|---|---|---|---|
1 | 1 | 1 | 3 | 3 |
1 | 2 | 1 | 2 | 3 |
1 | 3 | 1 | 5 | 3 |
1 | 4 | 1 | 3 | 3 |
2 | 5 | 2 | 0 | 8 |
2 | 6 | 2 | 0 | 8 |
2 | 7 | 2 | 1 | 8 |
2 | 8 | 2 | 1 | 8 |
2 | 9 | 2 | 0 | 8 |
2 | 10 | 2 | 1 | 8 |
3 | 11 | 3 | 6 | 2 |
3 | 12 | 3 | 4 | 2 |
3 | 13 | 3 | 3 | 2 |
3 | 14 | 3 | 1 | 2 |
4 | 15 | 1 | 2 | 5 |
Now, we can find the average number of errors per session and see if they correlate with the experts’ ratings.
avg_errors <- df_merged %>%
  group_by(session) %>%
  dplyr::summarize("avg_errors" = mean(errors))

avg_errors %>% kable()
session | avg_errors |
---|---|
1 | 3.2500000 |
2 | 0.5000000 |
3 | 3.5000000 |
4 | 1.4285714 |
5 | 0.3333333 |
6 | 2.5000000 |
Let’s transfer the averages back to the merged data frame. Once again, we use merge(). Note that you may prefer to use one of the join functions from dplyr instead (e.g. dplyr::left_join()).
Since we have just one rating and one average error count per session, we extract the first row of each session and remove the original error count column.
df_summarized <- merge(df_merged, avg_errors, by = 'session') %>%
  group_by(session) %>% # For each session
  filter(row_number() == 1) %>% # Get first row
  select(-errors) # Remove errors column as we use avg_errors now

df_summarized %>% kable()
session | run | participant | rating | avg_errors |
---|---|---|---|---|
1 | 1 | 1 | 3 | 3.2500000 |
2 | 5 | 2 | 8 | 0.5000000 |
3 | 11 | 3 | 2 | 3.5000000 |
4 | 15 | 1 | 5 | 1.4285714 |
5 | 22 | 2 | 9 | 0.3333333 |
6 | 25 | 3 | 4 | 2.5000000 |
We have 1 row per session with the participant, the rating and the average number of errors. If we wanted to know how many runs a session contained, we could extract it from the run column.
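For readers who prefer the dplyr join functions, the merge-and-summarize steps could be combined along the following lines. This is a sketch on small stand-in data with made-up values, not the data frames from this vignette, and the n_runs column is an addition that shows one way to count the runs per session.

```r
library(dplyr)

# Stand-in data (hypothetical values, not the vignette's data)
observations <- data.frame(
  session = c(1, 1, 2, 2, 2),
  errors = c(3, 1, 0, 2, 1)
)
ratings <- data.frame(
  session = c(1, 2),
  rating = c(3, 8)
)

summarized <- observations %>%
  left_join(ratings, by = "session") %>% # Add the rating to every run
  group_by(session) %>%
  summarize(
    rating = first(rating),    # The rating is constant within a session
    n_runs = n(),              # Number of runs in the session
    avg_errors = mean(errors), # Average errors per run
    .groups = "drop"
  )

summarized
```

Unlike merge(), which sorts the result by the key column by default, left_join() keeps the row order of the left data frame. That makes no difference here but can matter when the observations are not already sorted by session.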
Let’s check if there’s a correlation between ratings and average errors.
cor(df_summarized$rating, df_summarized$avg_errors)
#> [1] -0.9739425
It seems they are highly negatively correlated, so sessions with fewer errors on average received higher ratings and vice versa.
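With only six sessions, a single coefficient says little about how certain the relationship is. As an optional extra, base R's cor.test() also reports a confidence interval and a p-value. The vectors below simply retype the values from the summary table above so the snippet runs on its own.

```r
# Ratings and average errors per session, copied from the summary table
rating <- c(3, 8, 2, 5, 9, 4)
avg_errors <- c(3.25, 0.5, 3.5, 10 / 7, 1 / 3, 2.5)

# Pearson correlation test
result <- cor.test(rating, avg_errors)

result$estimate # Correlation coefficient
result$conf.int # 95% confidence interval
result$p.value  # p-value for the null hypothesis of no correlation
```

Keep in mind that this is toy data, so the result only illustrates the workflow.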
Well done, you made it to the end of this introduction to groupdata2! If you want to know more about the various methods and arguments, you can read the Description of groupdata2.
If you have any questions or comments on this vignette (tutorial) or groupdata2, please send them to me at r-pkgs@ludvigolsen.dk, or open an issue on the GitHub page https://github.com/LudvigOlsen/groupdata2 so I can make improvements.