Dr. Gentleman’s work at 23andMe focuses on exploring how human genetic and trait data in the 23andMe database can be used to identify new therapies for disease. Dr. Gentleman is also recognised as one of the originators of the R programming language and has been awarded the Benjamin Franklin Award, a recognition for Open Access in the Life Sciences presented by the Bioinformatics Organisation. His keynote will focus on the history of R and some thoughts on data science.
Dr Robert Gentleman, it is an honour to have you as a keynote speaker at our EARL conference in Houston. We are intrigued to hear more about your career to date and how your work around open access to scientific data has helped shape valuable research worldwide…
Amongst your significant achievements to date has been the development of the R programming language alongside fellow statistician Ross Ihaka at the University of Auckland in the mid-1990s. What prompted you to develop a new statistical programming language?
Ross and I had a lot of familiarity with S (from Bell Labs) and at that time (my recollection is) there were a lot more languages for Statistics around. We were interested in how languages were used for data analysis. Both Ross and I had some experience with Lisp and Scheme and at that time some of the work in computer science was showing how one could easily write interpreters for different types of languages, largely based on simple Scheme prototypes. We liked a lot about S, but there were a few places where we thought that different design decisions might provide improved functionality. So we wrote a simple Scheme interpreter and then gradually modified it into the core of the R language. As we went forward we added all sorts of different capabilities and found a large number of great collaborators.
As we made some progress we found that there were others around the world who were also interested in developing a system like R. And luckily at just about that time, the internet became reliable and tools evolved that really helped support a distributed software development process. That group of collaborators became R Core and then later formed the nucleus for the R Foundation.
Probably the most important development was CRAN and some of the important tools that were developed to support the widespread creation of packages. Really a very large part of the success of R is due to the ability of any scientist to write a package containing code to carry out an analysis and to share that.
In 2008 you were awarded the Benjamin Franklin Award for your contribution to open access research. What areas of your research contributed towards being awarded this prestigious accolade?
I believe that my work on R was important, but perhaps more important for that award was the creation of the Bioconductor Project, together with a number of really great colleagues. Our paper in Genome Biology describes both those involved and what we did.
In your opinion, how has the application of open source R for big data analysis, predictive modelling, data science and visualisation evolved since its inception?
In too many ways for me to do a good job of really describing. As I said above, a key factor is the existence of CRAN (and the Bioconductor package repository), where there are a vast number of packages. UseRs can easily get a package to try out just about any idea. Those packages are not always well written, or well supported, but they provide a simple, fast way to try ideas out. And mostly the packages are of high quality and the developers are often very happy to discuss ideas with users. The community aspect is important. And R has become increasingly performant.
Your work today involves combining bioinformatics and computational drug discovery at 23andMe. What led to your transition to drug discovery, or was it a natural progression?
I had worked in two cancer centers, including the Dana-Farber in Boston and the Fred Hutchinson in Seattle. When I was on the faculty at Harvard, and then the Fred Hutchinson in Seattle, I was developing a computational biology department. The science is fantastic at both institutions and I learned a lot about cancer and how we could begin to use computational methods to explore and understand some of the computational molecular biology that is important.
But I also became convinced that making new drugs was something that would happen in a drug company. I wanted to see how computational methods could help lead to better and faster target discovery. When I was approached by Genentech it seemed like a great opportunity – and it was. Genentech is a fantastic company; I spent almost six years there and learned a huge amount.
As things progressed, I became convinced that using human genetics to do drug discovery was likely to be more effective than any other strategy that I was aware of. And when the opportunity came to join 23andMe I took it. And 23andMe is also a great company. We are in the early stages of drug discovery still, but I am very excited about the progress we are making and the team I am working with.
How is data science being used to improve health and accelerate the discovery of therapies?
If we are using a very broad definition (by training I am a statistician, and still think that careful hypothesis-driven research is essential to much discovery) of data science – it is essential. Better models, more data and careful analysis have yielded many breakthroughs.
Perhaps a different question is ‘where are the problems’? And for me, the biggest problem is that I am not sure there is as much appreciation of the impact of bias as there should be. Big data is great – but it really only addresses the variance problem. Bias is different: it is harder to discover and its effects can be substantial. Perhaps, put another way, the question is just how generalizable the results are.
In addition to the development of the R programming language, what have been your proudest career achievements that you’d like to share?
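To make the variance-versus-bias point concrete, here is a minimal sketch in Python. It is an illustration added for this write-up, not something taken from Dr Gentleman's work or talk. It assumes a hypothetical data source whose observations are shifted by a fixed offset of 0.5 from a true mean of 0; the function names and all of the numbers are invented for the example.

```python
import random
import statistics

TRUE_MEAN = 0.0   # the quantity we want to estimate (assumed for illustration)
BIAS = 0.5        # hypothetical systematic offset in how the data are collected
NOISE_SD = 1.0    # random noise in each individual observation


def biased_estimate(n, rng):
    """Mean of n noisy observations drawn from a systematically shifted source."""
    draws = [rng.gauss(TRUE_MEAN + BIAS, NOISE_SD) for _ in range(n)]
    return statistics.fmean(draws)


def summarize(n, reps=100, seed=1):
    """Repeat the study `reps` times; return (average estimate, spread of estimates)."""
    rng = random.Random(seed)
    estimates = [biased_estimate(n, rng) for _ in range(reps)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Compare a modest sample with one a hundred times larger.
results = {n: summarize(n) for n in (100, 10_000)}
for n, (avg, spread) in results.items():
    print(f"n={n:>6}: average estimate {avg:.3f}, spread {spread:.3f}")
```

In this sketch the spread of the estimates falls roughly as one over the square root of the sample size, while the average estimate stays close to 0.5 rather than the true value of 0. Collecting more data from the same shifted source tightens the answer without making it any more correct.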
The Bioconductor Project, working with my graduate students and post-docs and pretty much anytime I gave someone good advice.
Can you tell us about what to expect from your keynote talk and what might be the key take-home messages for our EARL delegates?
I hope an appreciation of why it is important to get involved in developing software systems and tools. And I hope some things to think about when approaching large-scale data analysis projects.
Inspired by the work of Dr Robert Gentleman, what questions would you like to ask? Tickets to EARL Houston are still available. Find out more and get tickets here.