A Guide to Working with Quantities

R Base

In this section, we consider all the methods and functions included in the default packages, i.e., those that are automatically installed along with any R distribution:

rownames(installed.packages(priority="base"))

#>  [1] "base"      "compiler"  "datasets"  "graphics"  "grDevices" "grid"     
#>  [7] "methods"   "parallel"  "splines"   "stats"     "stats4"    "tcltk"    
#> [13] "tools"     "utils"
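
The examples below operate on iris.q, a version of the iris data set with units and uncertainties attached, whose construction is not shown in this section. The following is a minimal sketch of how such an object could be built. It assumes centimetres and a 5% relative uncertainty for every measurement, an assumption that matches the printed outputs (e.g., 5.1 with an error of 0.255) but is not confirmed here:

```r
library(quantities)

# Assumption: all four measurements are in cm with a 5% relative uncertainty
iris.q <- iris
for (col in names(iris.q)[1:4])
  iris.q[[col]] <- set_quantities(iris[[col]], cm, iris[[col]] * 0.05)

head(iris.q, 3)
```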

Row Subsetting

Quantities objects have all the subsetting methods defined ([, [[, [<-, [[<-). Therefore they can be used in the same way as with plain numeric vectors, and in conjunction with which and other functions to perform subsetting. The subset function is very handy too and achieves the same result:

iris.q[which(iris.q$Sepal.Length > set_quantities(7.5, cm)), ]
#> Warning: In '>' : boolean operators not defined for 'errors' objects,
#> uncertainty dropped
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 106  7.6(4) [cm] 3.0(2) [cm]  6.6(3) [cm] 2.1(1) [cm] virginica
#> 118  7.7(4) [cm] 3.8(2) [cm]  6.7(3) [cm] 2.2(1) [cm] virginica
#> 119  7.7(4) [cm] 2.6(1) [cm]  6.9(3) [cm] 2.3(1) [cm] virginica
#> 123  7.7(4) [cm] 2.8(1) [cm]  6.7(3) [cm] 2.0(1) [cm] virginica
#> 132  7.9(4) [cm] 3.8(2) [cm]  6.4(3) [cm] 2.0(1) [cm] virginica
#> 136  7.7(4) [cm] 3.0(2) [cm]  6.1(3) [cm] 2.3(1) [cm] virginica
subset(iris.q, Sepal.Length > set_quantities(7.5, cm))
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 106  7.6(4) [cm] 3.0(2) [cm]  6.6(3) [cm] 2.1(1) [cm] virginica
#> 118  7.7(4) [cm] 3.8(2) [cm]  6.7(3) [cm] 2.2(1) [cm] virginica
#> 119  7.7(4) [cm] 2.6(1) [cm]  6.9(3) [cm] 2.3(1) [cm] virginica
#> 123  7.7(4) [cm] 2.8(1) [cm]  6.7(3) [cm] 2.0(1) [cm] virginica
#> 132  7.9(4) [cm] 3.8(2) [cm]  6.4(3) [cm] 2.0(1) [cm] virginica
#> 136  7.7(4) [cm] 3.0(2) [cm]  6.1(3) [cm] 2.3(1) [cm] virginica

Note that another quantities object is defined for the comparison. This is needed because a bare number carries no units and therefore cannot be compared with a quantity expressed in cm. Also note that the first line throws a warning telling us that the uncertainty was dropped for this operation. This kind of warning is issued only once, which is why the subsequent call to subset runs silently.

Row Ordering

The sort function, as its name suggests, sorts vectors, and it is compatible with quantities:

iris.q$Sepal.Length[1:5]
#> Units: [cm]
#> Errors: 0.255 0.245 0.235 0.230 0.250
#> [1] 5.1 4.9 4.7 4.6 5.0
sort(iris.q$Sepal.Length[1:5])
#> Units: [cm]
#> Errors: 0.230 0.235 0.245 0.250 0.255
#> [1] 4.6 4.7 4.9 5.0 5.1

More generally, the order function can be used for data frame ordering:

head(iris.q[order(iris.q$Sepal.Length), ])
#>    Sepal.Length Sepal.Width Petal.Length   Petal.Width Species
#> 14  4.3(2) [cm] 3.0(2) [cm] 1.10(6) [cm] 0.100(5) [cm]  setosa
#> 9   4.4(2) [cm] 2.9(1) [cm] 1.40(7) [cm]  0.20(1) [cm]  setosa
#> 39  4.4(2) [cm] 3.0(2) [cm] 1.30(6) [cm]  0.20(1) [cm]  setosa
#> 43  4.4(2) [cm] 3.2(2) [cm] 1.30(6) [cm]  0.20(1) [cm]  setosa
#> 42  4.5(2) [cm] 2.3(1) [cm] 1.30(6) [cm]  0.30(2) [cm]  setosa
#> 4   4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm]  0.20(1) [cm]  setosa
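
As a complement, here is a small sketch of descending and multi-key ordering with order. It uses the plain iris data set to isolate the base mechanics; that the same calls carry over unchanged to iris.q is an assumption based on the compatibility described above:

```r
# Descending order on a single key
by_length <- iris[order(iris$Sepal.Length, decreasing = TRUE), ]

# Two keys: species first, then descending sepal length within each species
by_species <- iris[order(iris$Species, -iris$Sepal.Length), ]

head(by_length, 3)
head(by_species, 3)
```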

Column Transformation

The transform function can modify variables in a data frame or create new ones. The within function provides a similar but more flexible approach. Both are fully compatible with quantities:

head(within(iris.q, {
  Sepal.Area <- Sepal.Length * Sepal.Width
  Petal.Area <- Petal.Length * Petal.Width
  rm(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
}))
#>   Species     Petal.Area   Sepal.Area
#> 1  setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 2  setosa 0.28(2) [cm^2] 15(1) [cm^2]
#> 3  setosa 0.26(2) [cm^2] 15(1) [cm^2]
#> 4  setosa 0.30(2) [cm^2] 14(1) [cm^2]
#> 5  setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 6  setosa 0.68(5) [cm^2] 21(1) [cm^2]
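
To make the difference in flexibility concrete, the following sketch contrasts both functions on the plain iris data set (used here only to keep the example dependency-free; Area.Ratio is a made-up column for illustration). With transform, every new column is computed from the original columns, whereas the statements inside within run in sequence and can build on each other:

```r
# transform(): new columns are computed from the original columns only
areas <- transform(iris,
  Sepal.Area = Sepal.Length * Sepal.Width,
  Petal.Area = Petal.Length * Petal.Width
)

# within(): statements run in order, so Area.Ratio can use the new columns
ratios <- within(iris, {
  Sepal.Area <- Sepal.Length * Sepal.Width
  Petal.Area <- Petal.Length * Petal.Width
  Area.Ratio <- Petal.Area / Sepal.Area
})

head(areas, 3)
head(ratios, 3)
```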

Row Aggregation

Row aggregation is the process of summarising data based on some grouping variable(s). There are several ways of working with data split by factors in R base, and, although they tend to preserve classes, they are generally not very kind to other metadata (i.e., attributes) by default.

In the following example, the average Sepal.Length is computed per Species, but the metadata gets dropped:

tapply(iris.q$Sepal.Length, iris.q$Species, mean)
#>     setosa versicolor  virginica 
#>      5.006      5.936      6.588

Many of these functions include a simplify parameter which, if set to FALSE, preserves quantities metadata:

(sepal.length.agg <- 
   tapply(iris.q$Sepal.Length, iris.q$Species, mean, simplify=FALSE))
#> $setosa
#> 5.0(3) [cm]
#> 
#> $versicolor
#> 5.9(3) [cm]
#> 
#> $virginica
#> 6.6(3) [cm]

The only drawback is that the result is a list, and such a list must be unlisted with care. Otherwise, the metadata gets dropped again:

# drops quantities
unlist(sepal.length.agg)
#>     setosa versicolor  virginica 
#>      5.006      5.936      6.588

# preserves quantities
do.call(c, sepal.length.agg)
#> Units: [cm]
#> Errors: 0.2503 0.2968 0.3294
#>     setosa versicolor  virginica 
#>      5.006      5.936      6.588
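
The same behaviour can be observed with other classed vectors from base R, which suggests that it is not specific to quantities. As an analogy (not part of the original example), the sketch below uses Date objects: unlist does not dispatch on the class of the elements and returns bare numbers, while c is a generic that keeps the class:

```r
dates <- list(a = as.Date("2021-02-21"), b = as.Date("2021-02-22"))

flat <- unlist(dates)      # the class attribute is lost: plain numbers
kept <- do.call(c, dates)  # c() dispatches on Date and keeps the class

class(flat)
class(kept)
```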

The by function is an object-oriented wrapper for tapply applied to data frames which also provides a simplify parameter. A more convenient way of working with summary statistics is the aggregate generic, from the stats namespace. Although there is an aggregate.data.frame method, the aggregate.formula method offers a more intuitive interface to it. Again, it is necessary to set simplify=FALSE to keep quantities:

(iris.q.agg <- aggregate(. ~ Species, data = iris.q, mean, simplify=FALSE))
#>      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     setosa        5.006       3.428        1.462       0.246
#> 2 versicolor        5.936        2.77         4.26       1.326
#> 3  virginica        6.588       2.974        5.552       2.026

Apparently, the output has no metadata associated, but what really happens is that the resulting columns are lists:

class(iris.q.agg$Sepal.Length)
#> [1] "list"

Therefore, as in the tapply/by case, they must be unlisted with care to still preserve the metadata:

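unlist_quantities <- function(x) {
  stopifnot(is.list(x) || is.data.frame(x))
  
  # combine the elements with c() so that units and errors are kept;
  # anything that does not hold quantities is returned untouched
  combine <- function(x) {
    if (inherits(x[[1]], c("quantities", "units", "errors")))
      do.call(c, x)
    else x
  }
  
  # data frames are processed column by column, plain lists as a whole
  if (is.data.frame(x))
    as.data.frame(lapply(x, combine), col.names=colnames(x))
  else combine(x)
}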

unlist_quantities(iris.q.agg)
#>      Species Sepal.Length Sepal.Width Petal.Length  Petal.Width
#> 1     setosa  5.0(3) [cm] 3.4(2) [cm] 1.46(7) [cm] 0.25(1) [cm]
#> 2 versicolor  5.9(3) [cm] 2.8(1) [cm]  4.3(2) [cm] 1.33(7) [cm]
#> 3  virginica  6.6(3) [cm] 3.0(1) [cm]  5.6(3) [cm]  2.0(1) [cm]

And this method works for the tapply/by case too:

unlist_quantities(sepal.length.agg)
#> Units: [cm]
#> Errors: 0.2503 0.2968 0.3294
#>     setosa versicolor  virginica 
#>      5.006      5.936      6.588
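
The by function mentioned above is not demonstrated in this section. A minimal sketch of its interface follows, using the plain iris data set so that it runs without further dependencies; the behaviour with iris.q is assumed to mirror the tapply case:

```r
# FUN receives the data frame subset for each group
means <- by(iris, iris$Species, function(d) mean(d$Sepal.Length))

# With simplify = FALSE the result is kept as a list, one element per group
means_list <- by(iris, iris$Species, function(d) mean(d$Sepal.Length),
                 simplify = FALSE)

means
do.call(c, means_list)
```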

Column Joining

Joining data frames by common columns can be done with the merge generic. Such operations are based on appending columns, which may be subset or replicated to fit the length of the merged observations. Therefore, quantities should be preserved in all cases. In the following example, we generate a data frame with the height per species and then merge it with the main data set:

height <- data.frame(
  Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
  Species = c("setosa", "virginica", "versicolor")
)

head(merge(iris.q, height))
#>   Species Sepal.Length Sepal.Width Petal.Length  Petal.Width      Height
#> 1  setosa  5.1(3) [cm] 3.5(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 2  setosa  4.9(2) [cm] 3.0(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 3  setosa  4.7(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm] 60(40) [cm]
#> 4  setosa  4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm] 60(40) [cm]
#> 5  setosa  5.0(2) [cm] 3.6(2) [cm] 1.40(7) [cm] 0.20(1) [cm] 60(40) [cm]
#> 6  setosa  5.4(3) [cm] 3.9(2) [cm] 1.70(8) [cm] 0.40(2) [cm] 60(40) [cm]
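
The replication and subsetting of columns mentioned above can be sketched with plain numbers (the heights below are hypothetical values without units or uncertainty, chosen only for illustration). Each of the three heights is replicated to fit the 150 observations, and removing one species from the lookup table subsets the result, because merge keeps only matching rows by default:

```r
height <- data.frame(
  Height = c(55, 60, 45),  # hypothetical plain heights, in cm
  Species = c("setosa", "virginica", "versicolor")
)

# Each height is replicated for the 50 rows of its species
merged <- merge(iris, height, by = "Species")

# Without setosa in the lookup table, its 50 rows are dropped
partial <- merge(iris, height[-1, ], by = "Species")

nrow(merged)
nrow(partial)
```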

(Un)Pivoting

The reshape function, from the stats namespace, provides an interface for both pivoting and unpivoting (i.e., tidying data). In the case of the iris data set, we would say that it is in the wide format, because each row has more than one observation.

This function has a quite peculiar nomenclature. First of all, the unpivoting operation is accessed by providing the argument direction="long". We need to define the varying columns (columns to unpivot), as character or indices, and they are unpivoted based on their names. By default, the separator sep="." is used, which means that Sepal.Width will be broken down into Sepal and Width, and the former will be unpivoted with the latter as grouping variable. We can specify the name of the grouping variable with the timevar argument.
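
The name splitting described above can be mimicked by hand. The sketch below is only an illustration of that step (it does not call reshape, and the labels v.name and time are our own): each column name is split at the separator, the first part becoming the name of an unpivoted column and the second part a value of the grouping variable:

```r
# Split each wide column name at the "." separator
parts <- do.call(rbind, strsplit(names(iris)[1:4], ".", fixed = TRUE))
colnames(parts) <- c("v.name", "time")

parts
unique(parts[, "v.name"])  # columns produced by the unpivoting
unique(parts[, "time"])    # values of the grouping variable
```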

Putting everything together, this is how to unpivot the data set by the dimension (which we will call dim) of the petal/sepal:

long.1 <- reshape(iris.q, varying=1:4, timevar="dim", idvar="dim.id", direction="long")
head(long.1)
#>          Species    dim       Sepal        Petal dim.id
#> 1.Length  setosa Length 5.1(3) [cm] 1.40(7) [cm]      1
#> 2.Length  setosa Length 4.9(2) [cm] 1.40(7) [cm]      2
#> 3.Length  setosa Length 4.7(2) [cm] 1.30(6) [cm]      3
#> 4.Length  setosa Length 4.6(2) [cm] 1.50(8) [cm]      4
#> 5.Length  setosa Length 5.0(2) [cm] 1.40(7) [cm]      5
#> 6.Length  setosa Length 5.4(3) [cm] 1.70(8) [cm]      6

Note that the unpivoting also generates an index to identify multiple records from the same group. We have changed the name of that identifier to dim.id (just id by default).

We can further unpivot sepal and petal as the part of the flower. First, we need to prepend a common identifier to columns 3 and 4, which are to be unpivoted:

names(long.1)[3:4] <- paste0("value.", names(long.1)[3:4])
long.2 <- reshape(long.1, varying=3:4, timevar="part", idvar="part.id", direction="long")
head(long.2)
#>         Species    dim dim.id  part       value part.id
#> 1.Sepal  setosa Length      1 Sepal 5.1(3) [cm]       1
#> 2.Sepal  setosa Length      2 Sepal 4.9(2) [cm]       2
#> 3.Sepal  setosa Length      3 Sepal 4.7(2) [cm]       3
#> 4.Sepal  setosa Length      4 Sepal 4.6(2) [cm]       4
#> 5.Sepal  setosa Length      5 Sepal 5.0(2) [cm]       5
#> 6.Sepal  setosa Length      6 Sepal 5.4(3) [cm]       6

And the final result has one tidy observation per row.

The pivoting operation can be accessed by providing the argument direction="wide". The process is almost symmetrical, but we need to specify v.names, as character, instead of varying columns. First, we can pivot by flower part:

wide.1 <- reshape(long.2, v.names="value", timevar="part", idvar="part.id", direction="wide")
head(wide.1)
#>         Species    dim dim.id part.id value.Sepal  value.Petal
#> 1.Sepal  setosa Length      1       1 5.1(3) [cm] 1.40(7) [cm]
#> 2.Sepal  setosa Length      2       2 4.9(2) [cm] 1.40(7) [cm]
#> 3.Sepal  setosa Length      3       3 4.7(2) [cm] 1.30(6) [cm]
#> 4.Sepal  setosa Length      4       4 4.6(2) [cm] 1.50(8) [cm]
#> 5.Sepal  setosa Length      5       5 5.0(2) [cm] 1.40(7) [cm]
#> 6.Sepal  setosa Length      6       6 5.4(3) [cm] 1.70(8) [cm]

Then, we remove "value." from the column names and pivot by dimension (note that indices are removed to match the initial data frame):

names(wide.1)[5:6] <- sub("value\\.", "", names(wide.1)[5:6])
wide.2 <- reshape(wide.1, v.names=c("Sepal", "Petal"), timevar="dim", idvar="dim.id", direction="wide")
#> Warning in reshapeWide(data, idvar = idvar, timevar = timevar, varying =
#> varying, : some constant variables (part.id) are really varying
wide.2$dim.id <- NULL
wide.2$part.id <- NULL
head(wide.2)
#>         Species Sepal.Length Petal.Length Sepal.Width  Petal.Width
#> 1.Sepal  setosa  5.1(3) [cm] 1.40(7) [cm] 3.5(2) [cm] 0.20(1) [cm]
#> 2.Sepal  setosa  4.9(2) [cm] 1.40(7) [cm] 3.0(2) [cm] 0.20(1) [cm]
#> 3.Sepal  setosa  4.7(2) [cm] 1.30(6) [cm] 3.2(2) [cm] 0.20(1) [cm]
#> 4.Sepal  setosa  4.6(2) [cm] 1.50(8) [cm] 3.1(2) [cm] 0.20(1) [cm]
#> 5.Sepal  setosa  5.0(2) [cm] 1.40(7) [cm] 3.6(2) [cm] 0.20(1) [cm]
#> 6.Sepal  setosa  5.4(3) [cm] 1.70(8) [cm] 3.9(2) [cm] 0.40(2) [cm]

We have seen that quantities have been correctly preserved through the whole process. Finally, we can check whether both data frames are identical. Given that the order of the columns has changed, we can simply check this column by column and then put everything together:

all(sapply(colnames(iris.q), function(col) all(iris.q[[col]] == wide.2[[col]])))
#> [1] TRUE

Tidyverse

The core tidyverse includes the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr and forcats. This section covers use cases for dplyr (everything except for pivoting and unpivoting) and tidyr (for pivoting and unpivoting).

library(dplyr); packageVersion("dplyr")
#> [1] '1.0.4'
library(tidyr); packageVersion("tidyr")
#> [1] '1.1.2'

Since dplyr 1.0.0, as we will see, there is enhanced support for custom S3 classes thanks to the new implementation based on vctrs >= 0.3.0. Packages units >= 0.6-7, errors >= 0.3.4 and quantities >= 0.1.5 add support for this approach.

Row Subsetting

The filter generic finds observations where conditions hold. The main difference from base subsetting with [ is that rows for which a condition evaluates to NA are dropped. As in the base case, another quantities object must be defined for the comparison:

iris.q %>%
  filter(Sepal.Length > set_quantities(7.5, cm)) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1  7.6(4) [cm] 3.0(2) [cm]  6.6(3) [cm] 2.1(1) [cm] virginica
#> 2  7.7(4) [cm] 3.8(2) [cm]  6.7(3) [cm] 2.2(1) [cm] virginica
#> 3  7.7(4) [cm] 2.6(1) [cm]  6.9(3) [cm] 2.3(1) [cm] virginica
#> 4  7.7(4) [cm] 2.8(1) [cm]  6.7(3) [cm] 2.0(1) [cm] virginica
#> 5  7.9(4) [cm] 3.8(2) [cm]  6.4(3) [cm] 2.0(1) [cm] virginica
#> 6  7.7(4) [cm] 3.0(2) [cm]  6.1(3) [cm] 2.3(1) [cm] virginica

There are also three scoped variants available (filter_all, filter_if, filter_at) and a subsetting function by row number called slice. All of them preserve quantities.
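
The NA handling mentioned at the beginning of this section can be illustrated with base R alone. In this sketch (a toy data frame of our own, not from the iris data), logical indexing with [ returns a row full of NA where the condition is NA, whereas which and subset discard it, which is the behaviour that filter follows:

```r
df <- data.frame(id = 1:4, x = c(1, NA, 3, 4))

# Logical indexing keeps a row of NAs where the condition is NA
by_logical <- df[df$x > 2, ]

# which() and subset() only keep rows where the condition is TRUE
by_which <- df[which(df$x > 2), ]
by_subset <- subset(df, x > 2)

nrow(by_logical)
nrow(by_which)
nrow(by_subset)
```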

Row Ordering

The arrange generic sorts variables in a straightforward way, and it is compatible with quantities:

iris.q %>%
  arrange(Sepal.Length) %>%
  head()
#>   Sepal.Length Sepal.Width Petal.Length   Petal.Width Species
#> 1  4.3(2) [cm] 3.0(2) [cm] 1.10(6) [cm] 0.100(5) [cm]  setosa
#> 2  4.4(2) [cm] 2.9(1) [cm] 1.40(7) [cm]  0.20(1) [cm]  setosa
#> 3  4.4(2) [cm] 3.0(2) [cm] 1.30(6) [cm]  0.20(1) [cm]  setosa
#> 4  4.4(2) [cm] 3.2(2) [cm] 1.30(6) [cm]  0.20(1) [cm]  setosa
#> 5  4.5(2) [cm] 2.3(1) [cm] 1.30(6) [cm]  0.30(2) [cm]  setosa
#> 6  4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm]  0.20(1) [cm]  setosa

The desc function can be applied to individual variables to arrange in descending order.
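
A short sketch of desc in use follows. It relies on the plain iris data set to keep the example free of other dependencies; applying it to iris.q in the same way is assumed to work as stated above:

```r
library(dplyr)

# Largest sepal lengths first
top <- iris %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

# Mixed directions: species ascending, sepal length descending within species
mixed <- iris %>%
  arrange(Species, desc(Sepal.Length)) %>%
  head(3)

top
mixed
```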

Column Transformation

There are two generics for column transformations: mutate modifies or adds new variables preserving the existing ones, while transmute drops the existing variables. The syntax is very similar to base functions transform and within, and equally compatible with quantities:

iris.q %>%
  transmute(
    Species = Species,
    Petal.Area = Petal.Length * Petal.Width,
    Sepal.Area = Sepal.Length * Sepal.Width
  ) %>%
  head()
#>   Species     Petal.Area   Sepal.Area
#> 1  setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 2  setosa 0.28(2) [cm^2] 15(1) [cm^2]
#> 3  setosa 0.26(2) [cm^2] 15(1) [cm^2]
#> 4  setosa 0.30(2) [cm^2] 14(1) [cm^2]
#> 5  setosa 0.28(2) [cm^2] 18(1) [cm^2]
#> 6  setosa 0.68(5) [cm^2] 21(1) [cm^2]
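
For comparison, this is a sketch of mutate, which keeps the existing variables. It uses the plain iris data set, and Area.Share is a made-up column that only serves to show that later expressions can refer to columns created earlier in the same call:

```r
library(dplyr)

areas <- iris %>%
  mutate(
    Sepal.Area = Sepal.Length * Sepal.Width,
    Area.Share = Sepal.Area / sum(Sepal.Area)  # uses the column defined above
  )

head(areas, 3)
```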

Row Aggregation

dplyr breaks down aggregation operations into two distinct parts: grouping (with group_by) and summarising (using summarise and others). As of dplyr 1.0.0, operations on aggregated data are fully compatible with quantities and, compared to base methods, no fancy unlisting is required:

iris.q %>%
  group_by(Species) %>%
  summarise_all(mean)
#> # A tibble: 3 x 5
#>   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
#> * <fct>        (err) [cm]  (err) [cm]   (err) [cm]  (err) [cm]
#> 1 setosa           5.0(3)      3.4(2)      1.46(7)     0.25(1)
#> 2 versicolor       5.9(3)      2.8(1)       4.3(2)     1.33(7)
#> 3 virginica        6.6(3)      3.0(1)       5.6(3)      2.0(1)
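
Since dplyr 1.0.0, the across helper is available as an alternative to scoped verbs such as summarise_all. The sketch below shows the equivalent call on the plain iris data set; that it preserves quantities in the same way is an assumption, given that it goes through the same summarise machinery:

```r
library(dplyr)

means <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))  # grouping columns are excluded

means
```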

Column Joining

Several verbs are provided for different types of joins, such as inner_join, left_join, right_join or full_join. Internally, they use the same grouping mechanism as summaries. Therefore, as of dplyr 1.0.0, these are fully compatible with quantities too:

iris.q %>%
  left_join(data.frame(
    Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
    Species = c("setosa", "virginica", "versicolor")
  )) %>%
  head()
#> Joining, by = "Species"
#>   Sepal.Length Sepal.Width Petal.Length  Petal.Width Species      Height
#> 1  5.1(3) [cm] 3.5(2) [cm] 1.40(7) [cm] 0.20(1) [cm]  setosa 60(40) [cm]
#> 2  4.9(2) [cm] 3.0(2) [cm] 1.40(7) [cm] 0.20(1) [cm]  setosa 60(40) [cm]
#> 3  4.7(2) [cm] 3.2(2) [cm] 1.30(6) [cm] 0.20(1) [cm]  setosa 60(40) [cm]
#> 4  4.6(2) [cm] 3.1(2) [cm] 1.50(8) [cm] 0.20(1) [cm]  setosa 60(40) [cm]
#> 5  5.0(2) [cm] 3.6(2) [cm] 1.40(7) [cm] 0.20(1) [cm]  setosa 60(40) [cm]
#> 6  5.4(3) [cm] 3.9(2) [cm] 1.70(8) [cm] 0.40(2) [cm]  setosa 60(40) [cm]

The only difference with base merge here is that dplyr does not reorder columns with respect to the left-hand side.
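
The join column can also be stated explicitly through the by argument, which avoids the informative message about the guessed key. A minimal sketch with plain numbers (the heights are hypothetical values without units or uncertainty):

```r
library(dplyr)

height <- data.frame(
  Height = c(55, 60, 45),  # hypothetical plain heights, in cm
  Species = c("setosa", "virginica", "versicolor")
)

# Columns of the left-hand side keep their order; Height is appended last
joined <- left_join(iris, height, by = "Species")

head(joined, 3)
```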

(Un)Pivoting

Finally, pivoting and unpivoting are handled by a separate package, tidyr. Historically, this was managed using the verbs spread (pivot) and gather (unpivot). These verbs, which are not compatible with quantities, are superseded and no longer under active development.

Instead, there are new and more straightforward verbs for (un)pivoting data frames called pivot_wider (equivalent to spread) and pivot_longer (equivalent to gather). These verbs do make use of the new approach brought by vctrs and therefore are fully compatible with quantities.

Compared to base R, the unpivoting operation is substantially more straightforward. In the next example, we directly merge the four columns of interest into the value column, and the corresponding column names are gathered into the name column. Such a column is then separated into flower part (sepal, petal) and dim (length, width):

iris.q %>%
  pivot_longer(1:4) %>%
  separate(name, c("part", "dim")) %>%
  head()
#> # A tibble: 6 x 4
#>   Species part  dim         value
#>   <fct>   <chr> <chr>  (err) [cm]
#> 1 setosa  Sepal Length     5.1(3)
#> 2 setosa  Sepal Width      3.5(2)
#> 3 setosa  Petal Length    1.40(7)
#> 4 setosa  Petal Width     0.20(1)
#> 5 setosa  Sepal Length     4.9(2)
#> 6 setosa  Sepal Width      3.0(2)
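
The separate step can also be folded into pivot_longer through its names_to and names_sep arguments. The following sketch shows this variant on the plain iris data set; we assume, but have not verified here, that it behaves the same way with iris.q:

```r
library(tidyr)

# Split the column names at the "." while unpivoting
long <- pivot_longer(
  iris, 1:4,
  names_to = c("part", "dim"),
  names_sep = "\\."
)

head(long)
```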

In the following example, we first unpivot the original data set, then we assign quantities and try to pivot it to obtain iris.q back, and it just works:

iris %>%
  # first gather, with row numbers as row_id
  mutate(row_id = 1:n()) %>%
  pivot_longer(1:4) %>%
  # assign quantities
  mutate(value = set_quantities(value, cm, value * 0.05)) %>%
  # now spread and remove the row_id
  pivot_wider() %>%
  select(-row_id) %>%
  head()
#> # A tibble: 6 x 5
#>   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   <fct>     (err) [cm]  (err) [cm]   (err) [cm]  (err) [cm]
#> 1 setosa        5.1(3)      3.5(2)      1.40(7)     0.20(1)
#> 2 setosa        4.9(2)      3.0(2)      1.40(7)     0.20(1)
#> 3 setosa        4.7(2)      3.2(2)      1.30(6)     0.20(1)
#> 4 setosa        4.6(2)      3.1(2)      1.50(8)     0.20(1)
#> 5 setosa        5.0(2)      3.6(2)      1.40(7)     0.20(1)
#> 6 setosa        5.4(3)      3.9(2)      1.70(8)     0.40(2)

A Guide to Working with Quantities

Iñaki Ucar

2021-02-21
