Version Note: Up-to-date with v0.3.0
library(psycModel)
library(dplyr)
This article briefly introduces dplyr::select and how it is applied in this package. In the first section, I briefly describe the dplyr::select syntax for R beginners. If you are already familiar with the dplyr::select syntax, you can skip to the next section, where I describe how to apply the syntax in this package.
dplyr::select (abbreviated as select hereafter) is an extremely powerful function in R. It allows you to subset columns with a set of syntax that is also known as the select syntax / semantics in the R community. A side note: with the introduction of the dplyr::across function, the select syntax can also be applied inside dplyr::mutate and dplyr::filter, which makes these two already powerful functions even more powerful. I will first introduce the usage of :, c(), and -. Then, I will discuss how to use everything, starts_with, ends_with, contains, and where. This is not an exhaustive list of the select syntax, but these are the most relevant ones. If you want to learn more, I encourage you to check the dplyr vignettes or just google it. There are tons of articles that discuss this in detail.
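As a minimal sketch of the side note about dplyr::across, the example below uses a select helper inside mutate. The scores data frame and its column names are made up for illustration and are not part of this package.

```r
library(dplyr)

# Hypothetical toy data, used only for this illustration
scores <- data.frame(id = 1:3, test_a = c(1, 2, 3), test_b = c(4, 5, 6))

# across() takes select syntax as its first argument:
# multiply every column whose name starts with "test" by 10
scaled <- scores %>% mutate(across(starts_with("test"), ~ .x * 10))
scaled
```

The id column is left untouched because it is not matched by starts_with("test").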
I am going to use the iris dataset for the demonstration. Let’s take a quick peek at the dataset.
iris %>% head() # head() shows the first 6 rows of the data frame
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
If you want to select the first 3 columns, you can use : to do that, either by column position or by column name.
iris %>% select(1:3) %>% head(1)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
iris %>% select(Sepal.Length:Petal.Length) %>% head(1)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
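The short check below is a sketch confirming that the positional and the name-based range select the same columns. It also shows that a reversed range returns the columns in reversed order.

```r
library(dplyr)

by_position <- iris %>% select(1:3)
by_name <- iris %>% select(Sepal.Length:Petal.Length)

# Both forms produce the same data frame
same_result <- identical(by_position, by_name)
same_result

# A range written from right to left reverses the column order
reversed_names <- iris %>% select(Petal.Length:Sepal.Length) %>% names()
reversed_names
```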
Next, if you want to combine selections, you can use c(). For example, to select the 1st, 3rd, and 4th columns, you can do it like this:
iris %>% select(c(1, 3:4)) %>% head(1)
Sepal.Length Petal.Length Petal.Width
1 5.1 1.4 0.2
iris %>% select(Sepal.Length, Petal.Length:Petal.Width) %>% head(1)
Sepal.Length Petal.Length Petal.Width
1 5.1 1.4 0.2
Finally, if you want to exclude a column from the selection, you can use -. For example, to select all columns except the 3rd column, you can do it like this:
iris %>% select(1:5, -3) %>% head(1)
Sepal.Length Sepal.Width Petal.Width Species
1 5.1 3.5 0.2 setosa
iris %>% select(Sepal.Length:Species, -Petal.Length) %>% head(1)
Sepal.Length Sepal.Width Petal.Width Species
1 5.1 3.5 0.2 setosa
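When the first expression passed to select is a negative one, select starts from all columns, so the full range does not have to be spelled out. The sketch below compares the two forms and also drops two columns at once.

```r
library(dplyr)

explicit <- iris %>% select(1:5, -3)
shorthand <- iris %>% select(-Petal.Length)

# Both forms keep every column except Petal.Length
same_result <- identical(explicit, shorthand)
same_result

# Wrap several names in c() to drop more than one column
remaining <- iris %>% select(-c(Petal.Length, Petal.Width)) %>% names()
remaining
```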
OK. Now you understand the basic usage. Let’s get to something a little more advanced. First, let’s talk about my favorite, which is everything. As the name implies, it selects all the variables in the data frame. Inside the select function, it is usually used in combination with c(). However, it is very powerful in other use cases like the one in this package. For example, if you want to fit a linear regression with all the variables, you can use everything (a more detailed discussion is presented in the next section).
# select all columns
iris %>% select(everything()) %>% head(1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
# select everything except Sepal.Width
iris %>% select(c(everything(),-Sepal.Width)) %>% head(1)
Sepal.Length Petal.Length Petal.Width Species
1 5.1 1.4 0.2 setosa
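Another common use of everything inside select is reordering columns: name the columns you want first, then let everything fill in the rest. A minimal sketch:

```r
library(dplyr)

# Move Species to the front; everything() adds the remaining columns
# in their original order without duplicating Species
reordered <- iris %>% select(Species, everything())
names(reordered)
```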
Next, we can talk about starts_with. starts_with selects all columns whose names start with a specified string. For example, to select all columns that start with Sepal, we can do something like this:
iris %>% select(starts_with('Sepal')) %>% head(1)
Sepal.Length Sepal.Width
1 5.1 3.5
Similar to starts_with, ends_with selects all columns whose names end with a specified string. For example, we want to select all columns that end with Width.
iris %>% select(ends_with('Width')) %>% head(1)
Sepal.Width Petal.Width
1 3.5 0.2
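These helpers can also be combined with & (both conditions must hold) and | (either condition may hold). The sketch below assumes a version of dplyr / tidyselect recent enough to support these operators inside select.

```r
library(dplyr)

# Columns that start with "Petal" AND end with "Width"
both <- iris %>% select(starts_with("Petal") & ends_with("Width")) %>% names()
both

# Columns that start with "Petal" OR end with "Width"
either <- iris %>% select(starts_with("Petal") | ends_with("Width")) %>% names()
either
```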
Next, we are going to talk about contains. As the name implies, it selects all columns whose names contain a specified string.
iris %>% select(contains('Sepal')) %>% head(1) # same as starts_with
Sepal.Length Sepal.Width
1 5.1 3.5
iris %>% select(contains('Width')) %>% head(1) # same as ends_with
Sepal.Width Petal.Width
1 3.5 0.2
iris %>% select(contains('.')) %>% head(1) # columns whose names contain "." are selected
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
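Two details are worth a quick sketch here. contains matches its argument literally, so "." means an actual dot, whereas the related helper matches treats its argument as a regular expression. Both helpers ignore case by default.

```r
library(dplyr)

# Literal match: only names that contain an actual "."
literal_dot <- iris %>% select(contains(".")) %>% names()
literal_dot

# Regular expression: "." matches any character, so every column is selected
regex_dot <- iris %>% select(matches(".")) %>% names()
regex_dot

# Matching ignores case by default
lower_case <- iris %>% select(contains("sepal")) %>% names()
lower_case

# With ignore.case = FALSE, the lowercase pattern matches nothing
case_sensitive <- iris %>% select(contains("sepal", ignore.case = FALSE)) %>% names()
case_sensitive
```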
Finally, we are going to conclude this section with where. where is not used alone. It is usually paired with a function that returns TRUE or FALSE. I think the most common use case for this package is pairing it with is.numeric. where(is.numeric) will select all numeric variables. A little tip: you need to pass is.numeric instead of is.numeric(). I will not go into the details of why because that is beyond the scope of this article. It requires a slightly more advanced understanding of how functions work in R. If you have that, you wouldn’t be reading this article anyway.
iris %>% select(where(is.numeric)) %>% head(1)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
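where is not limited to is.numeric. Any function that takes a column and returns a single TRUE or FALSE will do, including one you write yourself. In the sketch below, is_large_numeric is a made-up helper name used only for illustration.

```r
library(dplyr)

# Any predicate works, e.g. select the factor columns
factor_columns <- iris %>% select(where(is.factor)) %>% names()
factor_columns

# Hypothetical custom predicate: numeric columns with a mean above 3.5.
# The function itself is passed to where(), without parentheses.
is_large_numeric <- function(x) is.numeric(x) && mean(x) > 3.5
large_columns <- iris %>% select(where(is_large_numeric)) %>% names()
large_columns
```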
First, I will demonstrate the usage with linear regression. I will first create a data frame. You don’t need to know anything about how this data frame is created. Just know that it has 1 DV / outcome / response variable (i.e., y) and 5 IVs / predictor variables (i.e., x1 to x5).
set.seed(1)
test_data = data.frame(y = rnorm(n = 100,mean = 2,sd = 3),
                       x1 = rnorm(n = 100,mean = 1.5, sd = 4),
                       x2 = rnorm(n = 100,mean = 1.7, sd = 4),
                       x3 = rnorm(n = 100,mean = 1.5, sd = 4),
                       x4 = rnorm(n = 100,mean = 2, sd = 4),
                       x5 = rnorm(n = 100,mean = 1.5, sd = 4))
Ok, let’s fit that linear regression now.
# Without this package:
model1 = lm(data = test_data, formula = y ~ x1 + x2 + x3 + x4 + x5)
# With this package:
model2 = lm_model(data = test_data,
                  response_variable = y,
                  predictor_variable = c(everything(),-y))
Fitting Model with lm:
Formula = y ~ x1 + x2 + x3 + x4 + x5
This is already a step up from the basic lm() function. We can make it even simpler by just passing everything(). The function is designed to remove the response variable from the predictor variables (if selected) automatically. The following model3 is the same as model2:
model3 = lm_model(data = test_data,
                  response_variable = y,
                  predictor_variable = everything())
Fitting Model with lm:
Formula = y ~ x1 + x2 + x3 + x4 + x5
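To give a rough idea of how select syntax can be turned into a model formula, here is a minimal sketch. It is not this package’s actual implementation: build_formula and demo_data are hypothetical names, and the sketch only assumes dplyr::select and base R’s reformulate.

```r
library(dplyr)

# Hypothetical helper, for illustration only
build_formula <- function(data, response, predictors) {
  # Resolve the select syntax to plain column names
  response_name <- names(select(data, {{ response }}))
  predictor_names <- names(select(data, {{ predictors }}))
  # Drop the response if the predictor selection happened to include it
  predictor_names <- setdiff(predictor_names, response_name)
  # Assemble response ~ predictor1 + predictor2 + ...
  reformulate(predictor_names, response = response_name)
}

# Made-up toy data
demo_data <- data.frame(y = c(1, 2, 3), x1 = c(2, 4, 6), x2 = c(1, 0, 1))

formula_all <- build_formula(demo_data, y, everything())
formula_all
```

Because the response is removed with setdiff, passing everything() and passing c(everything(), -y) produce the same formula in this sketch.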
The same logic applies to all other functions in this package. The description of every argument that supports the dplyr::select syntax ends with “support dplyr::select syntax”.
That’s it for this brief introduction. If you want to learn more about this package, I encourage you to check out this article or use vignette('quick-introduction') if you are in RStudio.