EARLy bird catches the worm!

Two weeks ago we held the most successful EARL London conference in its 5-year history, and I had the pleasure of attending both days of talks. Now I must admit, as a Python user, I did feel a little bit like I was being dragged along to an event where everyone would be talking about the latest R packages for customising RMarkdown and Shiny applications (… and there was a little bit of that – I’m pretty sure I heard someone joke that it should be called the Shiny conference).

However, I was pleasantly surprised to find a diverse forum of passionate and inspiring data scientists from a wide range of specialisations (and countries!), each with unique personal insights to share. Although the conference was R focused, the concepts that were discussed are universally applicable across the Data Science profession, and I learned a great deal from attending these talks. If you weren’t fortunate enough to attend or would like a refresher, here are my top 5 takeaways from the conference (you can find the slides for all the talks here; click on a speaker’s image to open them):

1. Business decisions should lead Data Science

Steven Wilkins, Edwina Dunn, Rich Pugh

For data to have a positive impact within an organisation, data science projects need to be defined according to the challenges facing the business and the important decisions that the business needs to make. There’s no use building a model to describe past behaviour or predict future sales if this can’t be translated into action. I’ve heard this from Rich a thousand times since I’ve been at Mango Solutions, but hearing Steven Wilkins describe how this allowed Hiscox to successfully deliver business value from analytics really drove the point home for me. Similarly, Edwina Dunn demonstrated that the organisations which take the world by storm (e.g. Netflix, Amazon, Uber and Airbnb) are those which first and foremost identify customer needs and then use data to meet those needs.

2. Communication drives change within organisations

Rich Pugh, Edwina Dunn, Leanne Fitzpatrick, Steven Wilkins

However, even the best-run analytics projects won’t have any impact if the organisation does not value the insights they deliver. People are at the heart of the business, and organisations need to undergo a cultural shift if they want data to drive their decision making. An organisation can only become truly data-driven if all of its members can see the value of making decisions based on data rather than intuition. Obviously, an important part of data science is the ability to communicate insights to external stakeholders, by means of storytelling and visualisations. However, communication within an organisation is just as important for instilling this much-needed cultural change.

3. Setting up frameworks streamlines productivity

Leanne Fitzpatrick, Steven Wilkins, Garrett Grolemund, Scott Finnie & Nick Forrester, George Cushen

Taking the time to set up frameworks ensures that company vision can be translated into day-to-day productivity. In reference to point 1, setting up a framework for prototyping data science projects allows rapid evaluation of their potential impact on the business. Similarly, a consistent framework should be applied to communication within organisations, such as establishing how to educate the business to promote cultural change, or in the form of documentation and code reviews for developers.

On the technical side, pre-defined frameworks should also be used to bridge the gap between modelling and deployment. Leanne Fitzpatrick’s presentation demonstrated how the use of Docker images, YAML, project templates and engineer-defined test frameworks minimises unnecessary back and forth between data scientists and data engineers and therefore can streamline productivity. To enable this, however, it is important to teach modellers the importance of keeping production in mind during development, and to teach model requirements to data engineers, which hugely improved collaboration at Hymans according to Scott Finnie & Nick Forrester.

In the same vein, I was really intrigued by the flexibility of RMarkdown for creating re-usable templates. Garrett Grolemund from RStudio mentioned that we are currently experiencing a reproducibility crisis, in which the validity of scientific studies is called into question by the fact that most of their results are not reproducible. Using a tool such as RMarkdown to publish the code used in statistical studies makes sharing and reviewing code much simpler, and minimises the risk of oversight. Similarly, RMarkdown seems to be a valuable tool for documentation and can even become a simple way of creating project websites, when combined with R packages such as George Cushen’s Kickstart-R.

4. Interpretability beats complexity (sometimes)

Kasia Kulma, Wojtek Kostelecki, Jeremy Horne, Jo-fai Chow

Stakeholders might not always be willing to trust models, and might prefer to fall back on their own experience. Therefore, being able to clearly interpret modelling results is essential to engage people and drive decision-making. One way of addressing this concern is to use simple models such as linear regression or logistic regression for time-series econometrics and market attribution, as demonstrated by Wojtek Kostelecki. The advantage of these is that we can assess the individual contribution of variables to the model, and therefore clearly quantify their impact on the business.
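To make the point about individual contributions concrete, here is a minimal sketch in plain Python. It is not taken from Wojtek's talk: the two channels (TV and online), the spend figures and the helper `fit_ols` are all made up for illustration, and the sales are generated from a known formula so that the fitted coefficients can be checked against it. In practice you would reach for a statistics library rather than solving the normal equations by hand.

```python
# Illustrative sketch only: channel names and figures are invented for this example.

def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical weekly spend per channel: (TV, online)
spend = [(10, 5), (20, 3), (15, 8), (30, 2), (25, 10), (5, 7)]
# Synthetic sales built from a known rule, so we can see the fit recover it
sales = [100 + 3.0 * tv + 1.5 * online for tv, online in spend]

intercept, tv_coef, online_coef = fit_ols(spend, sales)

avg_tv = sum(tv for tv, _ in spend) / len(spend)
avg_online = sum(online for _, online in spend) / len(spend)

print(f"baseline sales: {intercept:.1f}")
print(f"TV: {tv_coef:.2f} per unit of spend, {tv_coef * avg_tv:.1f} in an average week")
print(f"online: {online_coef:.2f} per unit of spend, {online_coef * avg_online:.1f} in an average week")
```

Each coefficient reads directly as "extra sales per unit of spend on this channel", which is exactly the kind of statement a stakeholder can act on.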

However, there are some cases where a more sophisticated model should be favoured over a simple one. Jeremy Horne’s example of customer segmentation proved that we aren’t always able to implement geo-demographic rules to help identify which customers are likely to engage with the business. “This is the reason why we use sophisticated machine learning models”, since they are better able to distinguish between different people from the same socio-demographic group, for example. This links back to Edwina Dunn’s mention of how customers should no longer be categorised by their profession or geo-demographics, but by their passions and interests.

Nevertheless, ‘trusting the model’ is a double-edged sword, and there are some serious ethical issues to consider, especially when dealing with sensitive personal information. I’m also pretty sure I heard the word ‘GDPR’ mentioned at every talk I attended. But fear not, here comes LIME to the rescue! Kasia Kulma explained how Local Interpretable Model-Agnostic Explanations (say that 5 times fast) allow modellers to sanity check their models by giving interpretable explanations as to why a model predicted a certain result. By extension, this can help prevent bias and discrimination, and help avoid exploitative marketing.
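For the curious, the core idea behind LIME can be sketched in a few lines of plain Python. To be clear, this is my own heavily simplified illustration and not the `lime` package or anything shown in Kasia's talk: the `black_box` function, the `explain_locally` helper and all parameter values are assumptions. The sketch perturbs one instance, asks the opaque model for predictions, weights each perturbed point by how close it is to the original, and fits a weighted linear model whose coefficients act as the local explanation.

```python
import math
import random

def black_box(x1, x2):
    """Stand-in for an opaque model (hypothetical, chosen so the local slopes are known)."""
    return x1 ** 2 + 3 * x2

def solve(A):
    """Solve an augmented linear system [M | v] by Gauss-Jordan elimination."""
    k = len(A)
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def explain_locally(model, instance, n_samples=500, scale=0.1, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    X, y, w = [], [], []
    for _ in range(n_samples):
        # Perturb the instance with small Gaussian noise
        offsets = [rng.gauss(0, scale) for _ in instance]
        point = [v + d for v, d in zip(instance, offsets)]
        dist2 = sum(d * d for d in offsets)
        X.append([1.0] + offsets)          # intercept + offsets from the instance
        y.append(model(*point))            # query the black box
        w.append(math.exp(-dist2 / (2 * scale ** 2)))  # closer points count more
    k = len(X[0])
    # Weighted normal equations: (X'WX) b = X'Wy
    A = [[sum(wi * x[i] * x[j] for x, wi in zip(X, w)) for j in range(k)]
         + [sum(wi * x[i] * yi for x, yi, wi in zip(X, y, w))]
         for i in range(k)]
    return solve(A)[1:]  # drop the intercept, keep one weight per feature

weights = explain_locally(black_box, [2.0, 1.0])
for name, weight in zip(["x1", "x2"], weights):
    print(f"{name}: local effect of about {weight:.2f}")
```

Around the point (2, 1), the true local slopes of this toy model are 4 and 3, so the surrogate's weights should land close to those values. The real method adds more machinery, but the principle of explaining one prediction with a simple local model is the same.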

5. R and Python can learn from each other

David Smith (during the panellist debate)

Now comes the fiery debate. Python or R? Call me controversial, but how about both? This was one of the more intriguing ideas that I heard, and it came out of a question during the engaging panellist debate about the R and data science community. What this conference has demonstrated to me is that R is undergoing a massive transformation from the simple statistical tool it once was to a fully-fledged programming language which even has tools for production! Not only this, but it has the advantage of being a domain-specific language, which results in a very tight-knit community; that seemed to be the general consensus amongst the panel.

However, there are still a few things R can learn from Python, namely its vast array of tools for transitioning from modelling to deployment. It does seem like R is making steady progress in this regard, with tools such as Plumber to create REST APIs, Shiny Server for serving Shiny web apps online and RStudio Connect to tie these all together with RMarkdown and dashboards. Similarly, machine learning frameworks and cloud services which were more Python focused are now available in R. Keras, for example, provides a nice way to use TensorFlow from R, and there are many R packages available for deploying those models to production servers, as mentioned by Andrie de Vries.

Conversely, Python could learn from R in its approach to data analysis. David Smith remarked that there is a tendency within the Python world to have a model-centric approach to data science. This is also something that I have personally noticed. Whereas R is historically embedded in statistics, and therefore brings many tools for exploratory data analysis, this seems to take a back seat in the Python world. This tendency is exacerbated by popular Python machine learning frameworks such as scikit-learn and TensorFlow, which seem to encourage throwing whole datasets into the model and expecting the algorithm to select significant features for us. Python needs to learn from R tools such as ggplot2, Shiny and the tidyverse, which make it easier to interactively explore datasets.

Another part of the conference I really enjoyed was the lightning talks, which proved how challenging it can be to effectively pitch an idea within a single 10-minute presentation! As a result here are my…

Lightning takeaways!

“Companies should focus on what data they need, not the data they have.” (Edwina Dunn – Starcount)
“Don’t give in to the hype” (Andrie de Vries – RStudio)
“Trust the model” (Jeremy Horne – MC&C Media)
“h2o + Spark = hot” (Paul Swiontkowski – Microsoft)
“Shiny dashboards are cool” (Literally everyone at EARL)

I’m sorry to all the speakers I haven’t mentioned. I heard great things about all the talks, but these are all I could attend!

Finally, my personal highlight of the conference was the unlimited free drinks – er I mean, getting the opportunity to talk to so many knowledgeable and approachable people from such a wide range of fields! It really was a pleasure meeting and learning from all of you.

If you enjoyed this post, be sure to join us at LondonR at Balls Brothers on Tuesday 25th September, where other Mangoes will share their experience of the conference, in addition to the usual workshops, talks and networking drinks.

If you live in the US, or happen to be visiting this November, then come join us at one of our EARL 2018 US Roadshow events: EARL Seattle (WA) on 7th November, EARL Houston (TX) on 9th November, and EARL Boston (MA) on 13th November. Our highlights of the EARL London conference will be online soon.
