Open Data Science Conference (ODSC) Europe 2019 – a junior data scientist’s perspective


Thanks to the University of Bath and Mango, I got the chance to volunteer and attend the Open Data Science Conference (ODSC) Europe 2019. This year the event saw around 1000 data-driven individuals flock to London for four days filled with workshops, tutorials and talks. My time there provided me with plenty to think about, so here are some of my personal highlights!

Michael Wooldridge, PhD – The Future is Multiagent!

Having done research into Artificial Intelligence for 30 years, the last seven of those as the head of the University of Oxford's Computer Science department, Michael Wooldridge knows what he's talking about. He shared with us his vision for the future of AI: multiagent systems! Think Siri talking to Siri, or Tesla talking to Tesla. With self-interested computational systems becoming more common, it will eventually be vital that they can figure out not just what's best for themselves, but also what's best for their whole ecosystem. The example Michael gave involved two supermarket stock-taking robots who needed to cross paths to complete their respective tasks. Without knowledge of each other, these two robots would impede each other simply by getting in each other's way. Much in the same way that we communicate to complete tasks, these machines need the ability to communicate with each other and find an optimal solution that suits them both. It's easy to see how this small problem could be scaled up, for example to autonomous cars on a road, or to an automated meeting scheduler (which Michael is very keen for someone to develop). This is definitely an area I look forward to seeing advance, and maybe even becoming a part of our everyday lives!

Cassie Kozyrkov, PhD – Making Data Science Useful

Cassie tried to make our lives as Data Scientists easier by splitting the job into smaller, better defined roles (e.g. Analyst, Statistician, Machine Learning Engineer) and pointing out that being able to excel at all of these is near enough impossible! These roles require vastly different skill sets, one of the most varied being the way of thinking. For example, analysts are there to explore the data and formulate questions, which the Statisticians are then there to rigorously test and answer. This leads to the revelation that "Data Science is a team sport!" I personally couldn't agree more, especially as someone just starting out in the field. Having the opportunity to work alongside other Data Scientists during my time at Mango has helped me to develop and has taught me so much more than working independently would have. I think Cassie's main message was that you don't have to know everything; as long as you've got the passion for data and the drive to let that data inform better decisions, you can go a long way.

Ian Ozsvald – Tools for High Performance Python

Imagine being able to run code that was taking 90 minutes in just 30 seconds! Well, Ian taught us how to do just that using some Python cleverness (and it didn't even seem that hard!). His first tip was to find out where your code is slow – otherwise how will you know where to speed it up? Profilers are a great tool for this – check out Robert Kern's line_profiler for line-by-line profiling.
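To give a flavour of what that looks like, here is a minimal sketch of using line_profiler on an invented toy function (this is my own example, not one from Ian's talk):

```python
# profile_demo.py
# The @profile decorator is injected by kernprof at runtime, so this script
# should be run with:  kernprof -l -v profile_demo.py
# (running it with plain `python` would raise a NameError for `profile`).

@profile
def slow_sum(n):
    total = 0
    for i in range(n):        # the profiler will show most time spent here
        total += i ** 2
    return total

if __name__ == "__main__":
    slow_sum(1_000_000)
```

The output lists the time spent on every line of the decorated function, so you can see exactly where your effort should go.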

Once you’ve figured out where to focus your effort, it’s time for speeding up. Ian really nicely talked us through his approach, starting with simple changes such as swapping out your for loops for Pandas apply, then using the raw argument. You’ll start looking at utilising the multiple cores on your device with Swifter and Dask before you know it. But beware, there’s a pay off between time spent re-factoring code to speed things up and the gains that you’re actually making, which for me, was definitely the take home message from Ian’s talk. As someone with limited experience in Python, there was still a lot of technical takeaways for me both in terms of technology when I do pick up Python (Dask and Numba), and techniques that I can give a go next time my R code takes forever.
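As a rough illustration of that progression, here is a small sketch; the DataFrame and the doubling operation are invented placeholders rather than Ian's actual benchmark:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000_000)})

# 1. Row-wise apply: convenient but slow (a Python-level loop under the hood).
slow = df.apply(lambda row: row["x"] * 2, axis=1)

# 2. apply with raw=True passes NumPy arrays instead of Series objects,
#    skipping a lot of per-row overhead.
faster = df.apply(lambda arr: arr[0] * 2, axis=1, raw=True)

# 3. Fully vectorised NumPy is usually the biggest single win.
fastest = df["x"].to_numpy() * 2

# For work that genuinely can't be vectorised, libraries such as Swifter and
# Dask can spread the computation across all of your cores, e.g. (assuming
# swifter is installed):
#   import swifter
#   result = df["x"].swifter.apply(some_expensive_function)
```

Each step buys more speed but also costs more refactoring time, which is exactly the trade-off Ian warned about.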

Rishabh Mehrotra, PhD – Multi-stakeholder Machine Learning for Marketplaces

Rishabh, a Senior Research Scientist at Spotify, gave a very interesting talk about how important it is to consider all of your stakeholders whenever you use methods like Machine Learning within your company. Take a company like Just Eat: they could easily optimise just for their app users (the people ordering the food). However, there are two other key stakeholders: the delivery drivers and the restaurants themselves. Optimising only for the end user may put too much pressure on drivers and restaurants, causing them to leave the platform – which reduces the options available to customers and may push them towards another service. The first step in this process is recognising who your stakeholders are, then thinking about how they interact with each other. Traditional recommender engines may not meet all stakeholders' needs; in Spotify's case there is a trade-off between relevance, fairness and satisfaction. Showing all users the most relevant artists may keep the majority of streamers happy, but is this fair to new artists trying to make an impact? Yet if we present too many of these less relevant artists, will that have a negative effect on the user's satisfaction? Spotify has certainly got a lot of interesting problems, and if you want any more information on Rishabh's work then check out his website.
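To make the relevance/fairness tension concrete, here is a toy sketch of a blended ranking score – my own invented illustration, not Spotify's actual system, with made-up weights and data:

```python
from dataclasses import dataclass

@dataclass
class Track:
    name: str
    relevance: float   # how well it matches the listener's taste (0-1)
    exposure: float    # how much exposure the artist already has (0-1)

def blended_score(track: Track, fairness_weight: float = 0.3) -> float:
    # Reward relevant tracks, but give a bonus to under-exposed artists.
    fairness_bonus = 1.0 - track.exposure
    return (1 - fairness_weight) * track.relevance + fairness_weight * fairness_bonus

candidates = [
    Track("established hit", relevance=0.9, exposure=0.95),
    Track("new artist single", relevance=0.7, exposure=0.1),
]

for track in sorted(candidates, key=blended_score, reverse=True):
    print(track.name, round(blended_score(track), 3))
```

Turning the fairness weight up promotes new artists at some cost to pure relevance; turning it down does the opposite – which is exactly the balancing act Rishabh described.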

Sudha Subramanian – Identifying Heart Disease Risk Factors from Clinical Notes

Natural Language Processing (NLP) was a frequent topic at this year's ODSC Europe, and I found Sudha's case study to be a really good example of using NLP in practice. Her problem was that she wanted to use the clinical notes scribbled by practitioners to help identify the presence of heart disease risk factors in patients. Of course, notes from a GP come in widely different forms – different GPs will use different abbreviations to mean the same thing, for example. BERT was a common theme through all of the NLP talks: a method of pre-training language representations that has produced some very impressive results. It utilises dynamic word embeddings – each word is converted from text to a vector representation, and that vector takes into account the context the word is being used in by looking at the surrounding text. In Sudha's case BERT worked really well, outperforming human annotators when it came to identifying the common risk factors for heart disease. This work reminded me of some of the 'Data for good' lightning talks we saw at EARL back in September. Sudha's work will allow common risk factors for heart disease to be identified sooner after an appointment, so they can be acted upon faster, giving patients the best possible chance of changing their lifestyle and preventing the disease.
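For anyone curious what those contextual embeddings look like in code, here is a minimal sketch using the Hugging Face transformers library – this is just my own illustration of the idea, not the pipeline Sudha used, and the clinical-style sentences are invented:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "Patient reports chest pain after exercise.",
    "The pain in his chest worsened overnight.",
]

for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Each token gets a 768-dimensional vector that depends on its context,
    # so "pain" has a different representation in each sentence.
    token_vectors = outputs.last_hidden_state[0]
    print(text, token_vectors.shape)
```

Because the vectors shift with the surrounding words, the model can cope with the varied phrasing and abbreviations you find in real clinical notes far better than a fixed, one-vector-per-word approach.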

Of course, these weren’t the only talks that I got to see, and choosing my favourites to mention was an incredibly tricky task. If you want to check out any of the talks then take a look at the ODSC Europe 2019 website.

Finally, another thank you to Mango, the University of Bath and the ODSC conference for the chance to attend and help out!

Author: Jack Talboys, Data Scientist, Mango Solutions

If you have any requirements regarding the use or implementation of R or Python for your business, then contact us at [email protected] or go to our contact us page.