In this blog post I explore the purrr package (a member of the tidyverse collection) and its use within a data scientist’s code. I aim to present the case for using the purrr functions and, through the use of examples, compare them with base R functionality. To do this, we will concentrate on two typical coding scenarios in base R, 1) loops and 2) the suite of apply functions, and then compare them with their counterpart map functions in the purrr package.
However, before I start, I want to make it clear that I do sympathise with those of you whose first reaction to purrr is “but I can do all this stuff in base R”. Putting that aside, the obvious first obstacle for us to overcome is to lose the notion of “if it’s not broken why change it” and open our ‘coding’ minds to change. At the very least, I hope you agree with me that the silver lining of this kind of exercise is to satisfy one’s curiosity about the purrr package and maybe learn something new!
Let us first briefly describe the concept of functional programming (FP) in case you are not familiar with it.
Functional programming (FP)
R is a functional programming language which means that a user of R has the necessary tools to create and manipulate functions. There is no need to go into too much depth here but it suffices to know that FP is the process of writing code in a structured way and through functions remove code duplications and redundancies. In effect, computations or evaluations are treated as mathematical functions and the output of a function only depends on the values of its inputs – known as arguments. FP ensures that any side-effects such as changes in state do not affect the expected output such that if you call the same function twice with the same arguments the function returns the same output.
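To make the "same arguments, same output" idea concrete, here is a minimal base R sketch. The function names add_tax and add_tax_and_count and the counter variable are made up for illustration and are not part of the original post.

```r
# Pure function: the result depends only on its arguments.
add_tax <- function(price, rate) {
  price * (1 + rate)
}

# Impure function: the result also depends on (and changes) outside state.
counter <- 0
add_tax_and_count <- function(price, rate) {
  counter <<- counter + 1          # side effect: modifies a global variable
  price * (1 + rate) + counter
}

pure_first  <- add_tax(100, 0.2)
pure_second <- add_tax(100, 0.2)
identical(pure_first, pure_second)       # TRUE

impure_first  <- add_tax_and_count(100, 0.2)
impure_second <- add_tax_and_count(100, 0.2)
identical(impure_first, impure_second)   # FALSE
```

The second function returns a different value on each call even though the arguments are identical, which is exactly the kind of behaviour FP tries to avoid.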
For those who are interested in finding out more, I suggest reading Hadley Wickham’s Functional Programming chapter in the “Advanced R” book. The companion website for this can be found at: http://adv-r.had.co.nz/
The purrr package, which forms part of the tidyverse ecosystem of packages, further enhances the functional programming aspect of R. It allows the user to write functional code with less friction in a complete and consistent manner. The purrr functions can be used, among other things, to replace loops and the suite of apply functions.
Let’s talk about loops
The motivation behind the examples we are going to look at involves iterating in R for various scenarios. For example, iterate over elements of a vector or list, iterate over rows or columns of a matrix … the list (pun intended) can go on and on!
One of the first things that one gets very excited to ‘play’ with when learning to use R (at least that was the case for me) is loops! Lots of loops, elaborate, complex… dare I say never-ending infinite loops (cue hysterical laughter emoji). Joking aside, a loop is usually the default answer to a problem that involves iteration of some sort, as I demonstrate below.
# Create a vector of the mean values of all the columns of the mtcars dataset
# The long repetitive way
mean_vec <- c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp), mean(mtcars$hp),
              mean(mtcars$drat), mean(mtcars$wt), mean(mtcars$qsec), mean(mtcars$vs),
              mean(mtcars$am), mean(mtcars$gear), mean(mtcars$carb))
mean_vec
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500

# The loop way
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500
The resulting vectors are the same and the difference in speed (milliseconds) is negligible. I hope that we can all agree that the long way is definitely not advised and is actually bad coding practice, let alone the frustrating and error-prone task of copy/pasting. Having said that, I am sure there are other ways to do this (I demonstrate one later using lapply), but my aim was to show the benefit of using a for loop in base R for an iteration problem.
Now imagine if in the above example I wanted to calculate the variance of each column as well…
# Create two vectors of the mean and variance of all the columns of the mtcars dataset
# For mean
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500

# For variance
var_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
var_vec_loop
[1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
[6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

# Or combine both calculations in one loop
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
mean_vec_loop
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500
var_vec_loop
[1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
[6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00
Now let us assume that we want to create these vectors not just for the mtcars dataset but for other datasets as well. We could in theory copy/paste the for loops and just change the dataset we supply in the loop, but one should agree that this is repetitive and could result in mistakes. Instead we can generalise this into functions. This is where FP comes into play.
# Create two functions that return the mean and variance of the columns of a dataset
# For mean
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}
col_mean(mtcars)
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500

# For variance
col_variance <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- var(df[[i]])
  }
  output
}
col_variance(mtcars)
[1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
[6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00
Why not take this one step further and take full advantage of R’s functional programming tools by creating a function that takes as an argument a function! Yes, you read it correctly… a function within a function!
Why do we want to do that? Well, the code for the two functions above, as clean as it might look, is still repetitive and the only real difference between col_mean and col_variance is the mathematical function that we are calling. So why not generalise this further?
# Create a function that returns a computational value (such as mean or variance)
# for a given dataset
col_calculation <- function(df, fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}
col_calculation(mtcars, mean)
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500
col_calculation(mtcars, var)
[1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
[6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00
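Because the calculation is now just an argument, the same helper should work with any function that reduces a column to a single number, including one written on the spot. The sketch below repeats the definition of col_calculation so that it runs on its own; the choice of median and of a range calculation is mine, purely for illustration.

```r
# Same helper as above, repeated so this snippet is self-contained
col_calculation <- function(df, fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}

# Pass a different built-in function
col_medians <- col_calculation(mtcars, median)

# Pass an anonymous function: the range (max minus min) of each column
col_ranges <- col_calculation(mtcars, function(x) max(x) - min(x))
```

No new loop had to be written for either calculation, which is the practical payoff of passing functions as arguments.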
Did someone say apply?
I mentioned earlier that an alternative way to solve the problem is to use the apply function (or the suite of apply functions such as lapply, sapply, vapply, etc). In fact, these functions are what we call Higher Order Functions. Similar to what we did earlier, these are functions that can take other functions as an argument.
The benefit of using higher order functions instead of a for loop is that they allow us to think about what code we are executing at a higher level. Think of it as: “apply this to that” rather than “take the first item, do this, take the next item, do this…”
I must admit that at first it might take a little while to get used to, but there is definitely a sense of pride when you can improve your code by eliminating for loops and replacing them with apply-type functions.
# Create a list/vector of the mean values of all the columns of the mtcars dataset
library(magrittr) # Provides the %>% pipe used below

lapply(mtcars, mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

sapply(mtcars, mean) %>% head # Returns a vector
       mpg        cyl       disp         hp       drat         wt
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
Once again, speed of execution is not the issue and neither is the common misconception about loops being slow compared to apply functions. As a matter of fact, the main argument in favour of using lapply or any of the purrr functions, as we will see later, is the pure simplicity and readability of the code. Full stop.
Enter the purrr
The best place to start when exploring the purrr package is the map function. The reader will notice that the map functions are used in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the type of the output, as opposed to some cases when using, for example, sapply, as I demonstrate later on.
# Create a list/vector of the mean values of all the columns of the mtcars dataset
library(purrr) # Provides map(), map_dbl() and re-exports %>%

map(mtcars, mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725

map_dbl(mtcars, mean) %>% head # Returns a vector - of class double
       mpg        cyl       disp         hp       drat         wt
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
Let us introduce the iris dataset with a slight modification in order to demonstrate the inconsistency that can sometimes occur when using the sapply function. This can often cause issues with the code and introduce mystery bugs that are hard to spot.
# Modify iris dataset
iris_mod <- iris
iris_mod$Species <- ordered(iris_mod$Species) # Ordered factor levels
class(iris_mod$Species) # Note: The ordered function changes the class
[1] "ordered" "factor"

# Extract class of every column in iris_mod
sapply(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"

sapply(iris_mod[1:3], class) %>% str # Returns a character vector!?!? - Note: inconsistent object type
 Named chr [1:3] "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Petal.Length"
Since map always returns a list, one can ensure that an object of the same class is returned without any unexpected (and unwanted) surprises. This is in line with FP consistency.
# Extract class of every column in iris_mod
map(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"

map(iris_mod[1:3], class) %>% str # Returns a list of the results
List of 3
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
To further demonstrate the consistency of the purrr package in this type of setting, the map_*() functions (see below) can be used to return a vector of the expected type; otherwise you get an informative error.
- map_lgl() makes a logical vector.
- map_int() makes an integer vector.
- map_dbl() makes a double vector.
- map_chr() makes a character vector.
# Extract class of every column in iris_mod
map_chr(iris_mod[1:4], class) %>% str # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

map_chr(iris_mod, class) %>% str # Returns a meaningful error
Error: Result 5 is not a length 1 atomic vector

# As opposed to the equivalent base R function vapply
vapply(iris_mod[1:4], class, character(1)) %>% str # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"

vapply(iris_mod, class, character(1)) %>% str # Returns a possibly harder to understand error
Error in vapply(iris_mod, class, character(1)): values must be length 1, but FUN(X[[5]]) result is length 2
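The other typed variants can be used in the same way. As a small sketch (the variable names is_num, n_unique and means are mine, and the unmodified iris dataset is used so the snippet runs on its own), each call below is expected to return an atomic vector of the type its suffix promises:

```r
library(purrr)

# map_lgl(): is each column numeric? Returns a logical vector
is_num <- map_lgl(iris, is.numeric)

# map_int(): number of distinct values per column. Returns an integer vector
n_unique <- map_int(iris, function(col) length(unique(col)))

# map_dbl(): mean of the numeric columns only. Returns a double vector
means <- map_dbl(iris[is_num], mean)
```

Picking the variant up front documents the type you expect, so a mismatch surfaces as an error at the call rather than as a surprise further down the script.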
It is worth noting that if the user does not wish to rely on tidyverse dependencies they can always use base R functions but need to be extra careful of the potential inconsistencies that might arise.
Multiple arguments and neat tricks
In case we want to apply a function to multiple vector arguments, we have the option of mapply from base R or map2 from purrr.
# Create random normal values from a list of means and a list of standard deviations
mu <- list(10, 100, -100)
sigma <- list(0.01, 1, 10)

mapply(rnorm, n = 5, mu, sigma, SIMPLIFY = FALSE) # I need SIMPLIFY = FALSE because otherwise I get a matrix
[[1]]
[1] 10.002750 10.001843 9.998684 10.008720 9.994432

[[2]]
[1] 100.54979 99.64918 100.00214 102.98765 98.49432

[[3]]
[1] -82.98467 -99.05069 -95.48636 -97.43427 -110.02194

map2(mu, sigma, rnorm, n = 5)
[[1]]
[1] 10.00658 10.00005 10.00921 10.02296 10.00840

[[2]]
[1] 98.92438 100.86043 100.20079 97.02832 99.88593

[[3]]
[1] -113.32003 -94.37817 -86.16424 -97.80301 -105.86208
The same idea extends beyond the two arguments used in the example above, and that is where the pmap function comes in: it takes a list of vectors, one per argument.
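As a minimal sketch of pmap, assuming we also want the sample size to vary, the three arguments of rnorm can be supplied as one named list. The params and samples names are made up for illustration, and set.seed is only there to make repeated runs reproducible.

```r
library(purrr)

set.seed(42) # only so that repeated runs give the same draws

# One list element per rnorm() argument; names are matched to argument names
params <- list(
  n    = list(3, 5, 7),
  mean = list(10, 100, -100),
  sd   = list(0.01, 1, 10)
)

# pmap() calls rnorm(n = ..., mean = ..., sd = ...) once per position
samples <- pmap(params, rnorm)

lengths(samples) # 3 5 7
```

Naming the list elements after the function's arguments keeps the call readable and avoids relying on argument order.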
I also thought of sharing a couple of neat tricks that one can use with the map function.
- Say you want to fit a linear model for every cylinder type in the mtcars dataset. You can avoid code duplication and do it as follows:
# Split mtcars dataset by cylinder values and then fit a simple lm
models <- mtcars %>%
  split(.$cyl) %>% # Split by cylinder into 3 lists
  map(function(df) lm(mpg ~ wt, data = df)) # Fit linear model for each list
- Say we are using a function, such as sqrt (calculate square root), on a list that contains a non-numeric element. The base R function lapply throws an error and stops, without indicating which element caused it. The safely function of purrr completes execution and the user can identify what caused the error.
x <- list(1, 2, 3, "e", 5)

# Base R
lapply(x, sqrt)
Error in FUN(X[[i]], ...): non-numeric argument to mathematical function

# purrr package
safe_sqrt <- safely(sqrt)
safe_result_list <- map(x, safe_sqrt) %>% transpose
safe_result_list$result
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
NULL

[[5]]
[1] 2.236068
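To actually identify the culprit, the captured errors can be inspected as well. The sketch below is one possible way to do it: instead of transpose it pulls out the error component of each result with map, and the names results, errors and failed are mine.

```r
library(purrr)

x <- list(1, 2, 3, "e", 5)
safe_sqrt <- safely(sqrt)

# Each element of results is a list with a result and an error component
results <- map(x, safe_sqrt)

# Extract the error component; it is NULL wherever the call succeeded
errors <- map(results, "error")

# Positions where an error was captured
failed <- which(!map_lgl(errors, is.null))
failed # 4

# The offending input and the message that was captured for it
x[failed]
conditionMessage(errors[[failed]])
```

Here the fourth element, the string "e", is flagged, while the remaining square roots are still available in the result components.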
Conclusion
Overall, I think it is fair to say that using higher order functions in R is a great way to improve one’s code. With that in mind, my closing remark for this blog post is simply to reiterate the benefits of using the purrr package. That is:
- The output is consistent.
- The code is easier to read and write.
If you enjoyed learning about purrr, then you can join us at our purrr workshop at this year’s EARL London. Early bird tickets are available now!