To purrr or not to purrr

In this blog post I explore the purrr package (a member of the tidyverse collection) and its use within a data scientist’s code. I aim to present the case for the purrr functions and, through examples, compare them with base R functionality. To do this, we will concentrate on two typical coding scenarios in base R: 1) loops and 2) the suite of apply functions, and then compare them with their counterpart map functions in the purrr package.

However, before I start, I want to make it clear that I do sympathise with those of you whose first reaction to purrr is “but I can do all this stuff in base R”. Putting that aside, the obvious first obstacle to overcome is to let go of the notion that “if it’s not broken, why change it” and open our ‘coding’ minds to change. At the very least, I hope you agree with me that the silver lining of this kind of exercise is to satisfy one’s curiosity about the purrr package and maybe learn something new!

Let us first briefly describe the concept of functional programming (FP) in case you are not familiar with it.

Functional programming (FP)

R is a functional programming language, which means that a user of R has the necessary tools to create and manipulate functions. There is no need to go into too much depth here; it suffices to know that FP is the process of writing code in a structured way and, through functions, removing code duplication and redundancy. In effect, computations or evaluations are treated as mathematical functions and the output of a function depends only on the values of its inputs – known as arguments. FP ensures that any side-effects, such as changes in state, do not affect the expected output, so that if you call the same function twice with the same arguments it returns the same output.
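To make this concrete, here is a quick illustrative sketch (the function names are my own) of the difference between a pure function and one that relies on external state:

# A pure function: the output depends only on its argument,
# so repeated calls with the same input always give the same result
add_one <- function(x) x + 1
add_one(5) # always 6

# An impure function: the result depends on (and changes) a global variable,
# so two calls with the same argument give different results
counter <- 0
add_one_impure <- function(x) {
  counter <<- counter + 1 # side-effect: modifies external state
  x + counter
}
add_one_impure(5) # 6
add_one_impure(5) # 7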

For those interested in finding out more, I suggest reading Hadley Wickham’s Functional Programming chapter in the “Advanced R” book. The companion website can be found at: http://adv-r.had.co.nz/

The purrr package, which forms part of the tidyverse ecosystem of packages, further enhances the functional programming aspect of R. It allows the user to write functional code with less friction in a complete and consistent manner. The purrr functions can be used, among other things, to replace loops and the suite of apply functions.

Let’s talk about loops

The motivation behind the examples we are going to look at involves iterating in R across various scenarios: for example, iterating over the elements of a vector or list, or over the rows or columns of a matrix… the list (pun intended) can go on and on!

One of the first things that one gets very excited to ‘play’ with when learning to use R – at least that was the case for me – is loops! Lots of loops: elaborate, complex… dare I say never-ending infinite loops (cue hysterical laughter emoji). Joking aside, a loop is usually the default answer to a problem that involves iteration of some sort, as I demonstrate below.

# Create a vector of the mean values of all the columns of the mtcars dataset
# The long repetitive way
mean_vec <- c(mean(mtcars$mpg),mean(mtcars$cyl),mean(mtcars$disp),mean(mtcars$hp),
              mean(mtcars$drat),mean(mtcars$wt),mean(mtcars$qsec),mean(mtcars$vs),
              mean(mtcars$am),mean(mtcars$gear),mean(mtcars$carb))
mean_vec
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# The loop way
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

The resulting vectors are the same and the difference in speed (milliseconds) is negligible. I hope we can all agree that the long way is definitely not advisable – it is bad coding practice, to say nothing of how frustrating (and error-prone) all that copy/pasting is. Having said that, I am sure there are other ways to do this – I demonstrate one later using lapply – but my aim here was to show the benefit of using a for loop in base R for an iteration problem.

Now imagine if in the above example I wanted to calculate the variance of each column as well…

# Create two vectors of the mean and variance of all the columns of the mtcars dataset

# For mean
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# For variance
var_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
var_vec_loop
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

# Or combine both calculations in one loop
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500
var_vec_loop
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Now let us assume that we want to create these vectors not just for the mtcars dataset but for other datasets as well. We could in theory copy/paste the for loops and just change the dataset we supply in the loop, but one should agree that this is repetitive and could result in mistakes. Instead, we can generalise this into functions. This is where FP comes into play.

# Create two functions that return the mean and variance of the columns of a dataset

# For mean
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}
col_mean(mtcars)
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# For variance
col_variance <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- var(df[[i]])
  }
  output
}
col_variance(mtcars)
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Why not take this one step further and take full advantage of R’s functional programming tools by creating a function that takes another function as an argument! Yes, you read that correctly… a function within a function!

Why would we want to do that? Well, the code for the two functions above, as clean as it might look, is still repetitive: the only real difference between col_mean and col_variance is the mathematical function being called. So why not generalise this further?

# Create a function that returns a computational value (such as mean or variance)
# for a given dataset

col_calculation <- function(df,fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}
col_calculation(mtcars,mean)
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500
col_calculation(mtcars,var)
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Did someone say apply?

I mentioned earlier that an alternative way to solve the problem is to use the apply function (or the suite of apply functions such as lapply, sapply, vapply, etc.). In fact, these are what we call Higher Order Functions – similar to what we did earlier, they are functions that can take other functions as arguments.

The benefit of using higher order functions instead of a for loop is that they allow us to think about what code we are executing at a higher level. Think of it as: “apply this to that” rather than “take the first item, do this, take the next item, do this…”

I must admit that it might take a little while to get used to at first, but there is definitely a sense of pride when you can improve your code by eliminating for loops and replacing them with apply-type functions.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
lapply(mtcars,mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725
sapply(mtcars,mean) %>% head # Returns a vector
       mpg        cyl       disp         hp       drat         wt 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250

Once again, speed of execution is not the issue – nor is the common misconception that loops are slow compared to apply functions. As a matter of fact, the main argument in favour of using lapply, or any of the purrr functions as we will see later, is the pure simplicity and readability of the code. Full stop.

Enter the purrr

The best place to start when exploring the purrr package is the map function. The reader will notice that the map functions are used in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the type of output – unlike some cases when using, for example, sapply, as I demonstrate later on.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
map(mtcars,mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725
map_dbl(mtcars,mean) %>% head # Returns a vector - of class double
       mpg        cyl       disp         hp       drat         wt 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250

Let us introduce the iris dataset, with a slight modification, in order to demonstrate the inconsistency that can sometimes occur when using the sapply function. This can cause issues in code and introduce mystery bugs that are hard to spot.

# Modify iris dataset
iris_mod <- iris
iris_mod$Species <- ordered(iris_mod$Species) # Ordered factor levels
class(iris_mod$Species) # Note: The ordered function changes the class
[1] "ordered" "factor"

# Extract class of every column in iris_mod
sapply(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"
sapply(iris_mod[1:3], class) %>% str # Returns a character vector!?!? - Note: inconsistent object type
 Named chr [1:3] "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Petal.Length"

Since map returns a list by default, one can be sure that an object of the same class is always returned, without any unexpected (and unwanted) surprises. This is in line with the consistency of FP.

# Extract class of every column in iris_mod
map(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"
map(iris_mod[1:3], class) %>% str # Returns a list of the results
List of 3
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"

To further demonstrate the consistency of the purrr package in this type of setting, the map_*() functions (see below) can be used to return a vector of the expected type, otherwise you get an informative error.

  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.
# Extract class of every column in iris_mod
map_chr(iris_mod[1:4], class) %>% str # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
map_chr(iris_mod, class) %>% str # Returns a meaningful error
Error: Result 5 is not a length 1 atomic vector

# As opposed to the equivalent base R function vapply
vapply(iris_mod[1:4], class, character(1)) %>% str  # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
vapply(iris_mod, class, character(1)) %>% str  # Returns a possibly harder to understand error
Error in vapply(iris_mod, class, character(1)): values must be length 1,
 but FUN(X[[5]]) result is length 2

It is worth noting that if the user does not wish to rely on tidyverse dependencies they can always use base R functions but need to be extra careful of the potential inconsistencies that might arise.

Multiple arguments and neat tricks

If we want to apply a function to multiple vector arguments, we have the option of mapply from base R or map2 from purrr.

# Create random normal values from a list of means and a list of standard deviations
mu <- list(10, 100, -100)
sigma <- list(0.01, 1, 10)

mapply(rnorm, n=5, mu, sigma, SIMPLIFY = FALSE) # I need SIMPLIFY = FALSE because otherwise I get a matrix
[[1]]
[1] 10.002750 10.001843  9.998684 10.008720  9.994432

[[2]]
[1] 100.54979  99.64918 100.00214 102.98765  98.49432

[[3]]
[1]  -82.98467  -99.05069  -95.48636  -97.43427 -110.02194

map2(mu, sigma, rnorm, n = 5)
[[1]]
[1] 10.00658 10.00005 10.00921 10.02296 10.00840

[[2]]
[1]  98.92438 100.86043 100.20079  97.02832  99.88593

[[3]]
[1] -113.32003  -94.37817  -86.16424  -97.80301 -105.86208

The same idea extends to more than two arguments – not just the two in the example above – and that is where the pmap function comes in.
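As a quick sketch (with made-up argument values), the rnorm example above can be extended so that n varies as well, by supplying all the arguments as a named list:

# pmap matches the names of the list elements to the arguments of rnorm
n <- list(1, 3, 5)
pmap(list(n = n, mean = mu, sd = sigma), rnorm) # Returns a list of 3 vectors of lengths 1, 3 and 5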

I also thought I would share a couple of neat tricks that one can use with the map functions.

  1. Say you want to fit a linear model for every cylinder type in the mtcars dataset. You can avoid code duplication and do it as follows:
# Split the mtcars dataset by cylinder value and then fit a simple lm to each subset
models <- mtcars %>% 
  split(.$cyl) %>% # Split into a list of 3 data frames, one per cylinder value
  map(function(df) lm(mpg ~ wt, data = df)) # Fit a linear model to each data frame
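One could then continue in the same pipeline style – for example, pulling out the R-squared of each fitted model (a sketch of a common follow-up, not part of the original example):

# Summarise each model and extract the named "r.squared" element as a double vector
models %>% 
  map(summary) %>% 
  map_dbl("r.squared")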
  2. Say we are using a function, such as sqrt (calculate the square root), on a list that contains a non-numeric element. The base R function lapply throws an error and stops execution without telling us which element caused the problem. The safely function of purrr completes execution and lets the user identify what caused the error.
x <- list(1, 2, 3, "e", 5)

# Base R
lapply(x, sqrt)
Error in FUN(X[[i]], ...): non-numeric argument to mathematical function

# purrr package
safe_sqrt <- safely(sqrt)
safe_result_list <- map(x, safe_sqrt) %>% transpose
safe_result_list$result
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
NULL

[[5]]
[1] 2.236068
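For completeness, after transpose the corresponding error objects are collected in the error component, so the failing element can be identified directly:

# Successful elements have a NULL entry; the fourth element (the "e")
# holds the captured error condition
safe_result_list$error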

Conclusion

Overall, I think it is fair to say that using higher order functions in R is a great way to improve one’s code. With that in mind, my closing remark for this blog post is simply to reiterate the benefits of using the purrr package. That is:

  • The output is consistent.
  • The code is easier to read and write.

If you enjoyed learning about purrr, then you can join us at our purrr workshop at this year’s EARL London – early bird tickets are available now!