Can you tell us about your upcoming keynote at EARL and what the key take-home messages will be for delegates?
I’m going to talk about functional programming, which I think is one of the most important programming techniques used with R. It’s not something you need on day one as a data scientist, but it gives you some really powerful tools for repeating the same action again and again with code. It takes a little while to get your head around, but recently, because I’ve been working on the second edition of Advanced R, I’ve prepared some diagrams that make it easier to understand. So the take-home message will be: use more functional programming, because it will make your life easier!
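To make that concrete, here is a minimal sketch of the pattern in R using purrr’s map functions – my illustration, not a slide from the talk:

```r
library(purrr)

# The functional-programming move: instead of copying and pasting a loop
# for every column, express "apply this function to each element" once.
col_means <- map_dbl(mtcars, mean)  # mean of every column of mtcars
col_sds   <- map_dbl(mtcars, sd)    # same pattern, different function
```

The repetition lives inside `map_dbl()`, so the only thing that changes between the two lines is the function being applied.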
Writer, developer, analyst, educator or speaker – what do you enjoy most? Why?
The two things that motivate me most in the work that I do are, first, the intellectual joy of understanding how you can take a big problem and break it up into small pieces that can be combined together in different ways – for example, the algebras and grammars of tools like dplyr and ggplot2. I’ve been working a lot in Advanced R to understand how the bits and pieces of R fit together, which I find really enjoyable. The other thing I really enjoy is hearing from people who have done cool stuff with the things I’ve worked on, stuff that has made their lives easier – whether that’s on Twitter or in person. Those are the two things I really enjoy, and pretty much everything else comes out of that. Educating, for example, is just helping other people understand how I’ve broken down a problem, and sharing it in ways that they can understand too.
What is your preferred industry or sector for data, analysis and applying the tools that you have developed?
I don’t do much data analysis myself anymore, so when I do, it’s normally data related to me in some way – for example, data on RStudio packages. I do enjoy challenges like figuring out how to get data from a web API and turning it into something useful, but the domain of my analyses is, very broadly, data science topics.
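As a hedged sketch of what that kind of API work typically looks like in R – the endpoint below is made up for illustration:

```r
library(httr)
library(jsonlite)

# Request JSON from a (hypothetical) web API...
resp <- GET("https://api.example.com/packages", query = list(page = 1))
stop_for_status(resp)  # fail loudly if the HTTP request went wrong

# ...then parse it into R structures you can actually work with.
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(parsed)  # inspect the shape before tidying into a data frame
```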
When developing what are your go-to resources?
I still use StackOverflow quite a bit, and Google in general. I also do quite a bit of comparative reading across different programming languages – seeing what’s going on elsewhere, the techniques being used and how practices are evolving – which is very helpful.
Is there anything that has surprised you about how any of the tools you’ve created has been used by others?
I used to, but I’ve lost my capacity to be surprised now, just because the diversity of uses is crazy. I guess what’s most notable now is when someone uses one of my tools to commit academic fraud – there have been some well-publicised examples. Otherwise, people are using R and data to understand pretty much every aspect of the world, which is really neat.
What are the biggest changes that you see between data science now and when you started?
I think the biggest difference is that there’s a term for it – data science. I think it’s been useful to have that term, rather than just Applied Statistician or Data Analyst, because data science is becoming different from what those roles have traditionally been. It’s different from data analysis because data science uses programming heavily, and it’s different from statistics because there’s a much greater emphasis on correct data import and data engineering, and the goal may be to eventually turn the data analysis into a product, a web app or something other than a standard report.
Where do you currently perceive the biggest bottlenecks in data science to be?
I think there are still a lot of bottlenecks in getting high-quality data, and that’s what most people currently struggle with. Another bottleneck is helping people learn about all the great tools that are available – understanding what their options are, where the tools live and what they should be learning. There are still plenty of smaller things to improve in data manipulation, data visualization and tidying, but by and large it feels to me like all the big pieces are there; now it’s more about getting everything polished and working together really well. But getting data to a place where you can even start an analysis can still be really frustrating, so a major bottleneck is the whole pipeline that occurs before the data arrives in R.
What topic would you like to be presenting on in a data science conference a year from now?
I think one thing I’m going to be talking more about next year is this vctrs package that I’ve been working on. The package provides tools for handling object types in R – for managing the types of inputs a function expects and the type of output it produces. My motivation is partly that there are a lot of inconsistencies in the tidyverse and base R that vctrs aims to fix. I think of this as part of my mental model: when I read R code, there’s a simplified R interpreter in my head which mainly focuses on the types of objects and predicts whether some code is going to work at all or whether it’s going to fail. So part of the motivation behind this package is thinking about how to get that out of my head and into the heads of other people, so they can write well-functioning and predictable R code.
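A small example of the kind of inconsistency involved, and how vctrs approaches it – a sketch using vctrs’ public functions, with output shown in comments:

```r
library(vctrs)

# Base R's c() silently coerces mixed types to character:
c("a", 1)  #> [1] "a" "1"

# vec_c() refuses, because character and double share no common type:
# vec_c("a", 1)  #> Error: Can't combine <character> and <double>

# vec_ptype_show() reports the common type vctrs would settle on:
vec_ptype_show(integer(), double())  # the richer type, double
```

This is exactly the “simplified interpreter” idea: given the input types, you can predict the output type, or predict that the call will fail.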
What do you hope to see out of RStudio in the next 5 years?
Generally, I want RStudio to keep strengthening the connections between R and other programming languages. There’s an RStudio version 1.2 coming out which has a bunch of features that make it easy to use SQL, Stan and Python from RStudio. There’s also the collaborative work we do with Wes McKinney and Ursa Labs – I think we’re just going to see more and more of that, because data scientists are working in bigger and bigger teams on a fundamentally collaborative activity, so making it as easy as possible to get data in and out of R is a big win for everyone.
I’m also excited to see the work that Max Kuhn has been doing on tidy modelling. I think the idea is really appealing because it gives modelling an API that is very similar to the tidyverse. But the thing that’s really neat about this work is that it takes inspiration from dplyr to separate the expression of a model from its computation, so you can express the model once and fit it in R, Spark, TensorFlow, Stan or whatever. The R language is really well suited to this kind of tool, where the computation is easily described in R and then executed somewhere more suitable for high performance.
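A hedged sketch of that separation using the parsnip package from tidymodels (which engines are actually available depends on what you have installed):

```r
library(parsnip)

# Express the model once, independent of how it will be computed:
spec <- set_engine(linear_reg(), "lm")

# Fit it with base R's lm()...
fitted <- fit(spec, mpg ~ wt + hp, data = mtcars)

# ...or retarget the identical specification at Spark instead:
spec_spark <- set_engine(linear_reg(), "spark")
```

The model definition never mentions the engine’s own API, which is what makes swapping computational backends cheap.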