This summer, Mango took on three interns, Chris, Ellena and Lizzi, all maths students at different stages of their university careers. To provide insight into what it’s like to work on a data science project, Mango set up a three-day mini-project. The brief was to analyse data from the 2018 Cost of Edinburgh survey.
The Cost of Edinburgh project was founded in 2017 by director and producer Claire Stone. The survey was designed in collaboration with, and with support from, the Fringe Society. It ran on SurveyMonkey in 2018 with the goal of collecting 100-150 responses from the wide range of people involved with performing at the Edinburgh Fringe Festival that year. There were three elements to the scope of the survey:
- Collect demographic data in order to explore the impact of the costs of attending the Fringe on diversity;
- Collect data on production income versus costs over multiple years, in terms of venue costs, accommodation and travel;
- Obtain qualitative responses on the financial and wellbeing costs of attending the Fringe.
The survey aimed to determine which performers attend the Fringe, what stops people from attending, and whether it’s becoming more expensive to perform at the festival. 368 people responded to the survey, which comprised 22 questions in three main sections: demographics, quantitative questions on costs and income, and qualitative questions on costs and wellbeing. In this post, Chris and Lizzi share their experiences of their first data science project.
---
As is usually the case with real-world data, these weren’t ready to be analysed out of the box. At first look we could see we didn’t have tidy data in which each column contains the answer to a different question. There were lots of gaps in the spreadsheet, but not all of them were due to missing data.
- Questions that required the respondent to choose one answer but included an ‘Other’ category corresponded to two columns in the spreadsheet: one containing the chosen answer, and one containing free text if the chosen answer was ‘Other’ and empty otherwise (see the sketch after this list).
- Questions for which respondents could choose multiple answers had one column per answer. Each column contained the answer if chosen and was empty if not.
- For the quantitative cost and income questions, respondents could fill in details for up to ten productions, or years of attending the Fringe. A question asking for five pieces of information therefore corresponded to 50 columns per respondent, many of which were empty.
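As a rough illustration, here is a minimal sketch of how the first two patterns might be tidied with dplyr and tidyr; all column names and values are invented, not taken from the real survey:

```r
library(dplyr)
library(tidyr)

# Toy version of the raw layout; names and values are made up
raw <- tibble(
  id            = 1:3,
  role          = c("Performer", "Other", "Producer"),
  role_other    = c(NA, "Stage manager", NA),
  genre_comedy  = c("Comedy", NA, "Comedy"),
  genre_theatre = c(NA, "Theatre", "Theatre")
)

tidy <- raw %>%
  # Fold the free-text 'Other' answer back into the main answer column
  mutate(role = coalesce(na_if(role, "Other"), role_other)) %>%
  select(-role_other) %>%
  # Turn one-column-per-answer into one row per chosen answer
  pivot_longer(starts_with("genre_"), names_to = "question",
               values_to = "genre", values_drop_na = TRUE) %>%
  select(-question)
```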
Chris’s thoughts
After being told that Nick had had a “quick, preliminary look at the data” and discussing his findings, we decided to split the data into two sections, demographics and financial, with the idea that any spare time at the end would be spent on the more qualitative feedback questions. Since the financial questions contained the more complicated data, it was decided that both Ellena and I would tackle them: Ellena took the “costs” questions and I took the “income” questions.
Now that the jobs were split into manageable chunks, we could start appraising which questions we wanted to answer with the data. Looking more carefully at the data we’d been given, it was clear that a lot of the answers were categorical, and hence bar charts seemed like an obvious option. It would have been really nice to have continuous data, but I can understand why people would be uncomfortable answering a survey in that level of personal detail. Having the opportunity to see the project evolve from the beginning to the point where we had specific questions to answer was a really positive experience; by this point, I felt as if I’d learnt so much already. Here is a histogram of the annual income of Edinburgh Fringe performers from their work in the arts.
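As an illustration of the sort of plot involved, here is a minimal ggplot2 sketch, using invented income bands and counts rather than the real survey numbers:

```r
library(ggplot2)

# Invented income bands and counts, just to show the shape of the code
income <- data.frame(
  band = factor(c("Under £5k", "£5k-£10k", "£10k-£20k", "Over £20k"),
                levels = c("Under £5k", "£5k-£10k", "£10k-£20k", "Over £20k")),
  n = c(120, 90, 70, 40)
)

ggplot(income, aes(x = band, y = n)) +
  geom_col() +
  labs(x = "Annual income from work in the arts", y = "Respondents")
```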
We used an agile development methodology: a creative, sped-up version of the scrum method. Scrum organises work into short bursts with defined targets, called sprints, interspersed with short meetings called stand-ups (so named because you’re not allowed to sit down). This was my first introduction to a professional workflow, and it’s given me an insight into how companies might manage work. Sprints are usually days or weeks in length, but because of our three-day deadline we adapted the strategy, splitting each day into two parts and holding two stand-ups per day.
We spent the next two days transforming the data into a usable form and creating graphs that showed some patterns clearly. However, for a lot of the financial data we didn’t have a large enough sample size to perform statistical tests with a high level of certainty, which left the output feeling more descriptive than conclusive. This didn’t stop me from learning a huge amount, though. I was introduced to the tidyverse, a collection of R packages designed to work together, and then it was just a matter of coaxing me out of my `for` loop ways and towards `group_by()` instead (see the sketch below). There was a lot of coding to do and I feel that this project has really developed my R skills. I mainly code in Python and, before this, my only history with R was one year’s worth of academic use at university. This was a whole new experience, both in the level of exposure and in the impromptu lessons every few hours, my favourite being an enthusiastic introduction to regex by Nick, who taught me that any problem can be solved by regex.
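As a small example of the shift in style, here is the `group_by()` idiom on some invented data; where a `for` loop would visit each category and accumulate totals by hand, dplyr does the split-apply-combine in one pipeline:

```r
library(dplyr)

# Invented income data: one row per respondent
incomes <- tibble(
  role   = c("Performer", "Producer", "Performer", "Technician"),
  income = c(5000, 12000, 8000, 9000)
)

# No `for` loop required: group, then summarise each group at once
incomes %>%
  group_by(role) %>%
  summarise(mean_income = mean(income, na.rm = TRUE))
```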
This project made me appreciate the need for good-quality data in data science, and how much of a project is spent cleaning and pre-processing compared with actually performing statistical tests and generating suave ggplots. There were a lot of firsts for me, like using ggplot2, dplyr and git in a professional environment.
Lizzi’s thoughts
Three days isn’t very long to take a piece of work from start to finish, and in particular it doesn’t allow for much thinking time. We had to decide which questions were achievable to address in the time frame and divide the tasks so that we weren’t all working on the same thing. My job was to look at the demographic data. I didn’t produce any ground-breaking research, but I did produce a bunch of pretty pictures, discovering some new plotting packages and practising some useful skills along the way.
Firstly, I learnt what a waffle plot is: a way of displaying categorical data to show parts-to-whole contributions. Plots like the example below, which represents the self-reported annual income of Edinburgh Fringe performers from their work in the arts, can be created easily in R using the waffle package. The most time-consuming task in creating such a plot is ensuring the factor levels are in the desired order; the forcats package came in useful for this.
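A minimal sketch of such a plot, with invented bands and counts standing in for the real survey data:

```r
library(waffle)
library(forcats)

# Invented bands, deliberately in the wrong order to show the fiddly part
bands <- factor(c("£10k-£20k", "Under £5k", "Over £20k", "£5k-£10k"))
bands <- fct_relevel(bands, "Under £5k", "£5k-£10k", "£10k-£20k", "Over £20k")

# Invented counts per band; one square represents ten respondents
counts <- setNames(c(12, 9, 7, 4), levels(bands))
waffle(counts, rows = 4,
       title = "Annual income from work in the arts (1 square = 10 people)")
```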
The leaflet package enables you to easily create zoomable maps, using images from OpenStreetMap by default. Once again, the most time-consuming part was getting the data into the right format. Location data were a mix of UK postcodes, US zip codes, towns, countries and sets of four digits that looked like they might be Australian postcodes. Using a Google API through the mapsapi package, plus a bit of a helping hand so that Bristol didn’t turn into a NASCAR track in Tennessee, I could convert these data into longitude and latitude coordinates for plotting. The mapsapi package can also calculate distances between places, but because it uses Google Maps driving directions it only worked for locations in Great Britain. Instead, to determine how far performers travel to get to the Fringe, I resorted to calculating the geodesic distance between pairs of longitudes and latitudes using the gdist function from the Imap package (sketched below).
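Here is a sketch of both steps with hand-picked coordinates; the real pipeline geocoded the survey’s free-text locations first:

```r
library(leaflet)
library(Imap)

# Hand-picked example locations standing in for the geocoded survey answers
homes <- data.frame(
  place = c("Bristol", "London", "Glasgow"),
  lng   = c(-2.5879, -0.1276, -4.2518),
  lat   = c(51.4545, 51.5072, 55.8642)
)

# Zoomable map using OpenStreetMap tiles (leaflet's default)
leaflet(homes) %>%
  addTiles() %>%
  addMarkers(lng = ~lng, lat = ~lat, popup = ~place)

# Geodesic distance in miles from each location to Edinburgh
sapply(seq_len(nrow(homes)), function(i) {
  gdist(lon.1 = homes$lng[i], lat.1 = homes$lat[i],
        lon.2 = -3.1883, lat.2 = 55.9533, units = "miles")
})
```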
This project was also a good opportunity to practise using the tidyverse to manipulate data. I learnt base R at university and only came across the %>% pipe on a placement last year, when working with somebody else’s code. Data manipulation in base R is currently second nature to me, and doing things the tidyverse way requires much more thought, so this project was a step towards becoming less old-fashioned and shifting towards the tidyverse.
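The difference is easiest to see side by side; a trivial example, with made-up venue names:

```r
library(magrittr)

venues <- c("Pleasance", "Assembly", "Underbelly", "Assembly", "Pleasance")

# Base R: nested calls read inside-out
head(sort(unique(venues)), 2)

# The pipe: the same steps read left to right
venues %>% unique() %>% sort() %>% head(2)
```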
This was my first experience of using version control properly for a collaborative project. I use version control for my PhD work, but purely in a way that means I don’t lose everything if my laptop gets lost, stolen or broken. I’m sure this will send shivers down some spines, but my commits are along the lines of “all changes on laptop” or “all changes from desktop” as I update between working at home and at uni, often after a few days. I’ve learnt that proper use of Git means making a branch on which to carry out a particular piece of work without interfering with the master branch. It also means committing regularly, with informative messages. Once work on a branch is finished, you submit a merge request (also known as a pull request), which prompts somebody else to review the code and, if they’re satisfied with it, press the big green button to merge the branch into master. It was also important to keep the repository tidy and well structured so that it made sense to others and not just to me.
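In practice the rhythm looked something like this (branch and file names here are invented):

```sh
git checkout -b demographics-plots     # a branch for one piece of work
git add demographics.R
git commit -m "Add waffle plot of income bands"   # small, informative commits
git push -u origin demographics-plots
# ...then open a merge (pull) request so a colleague can review before merging
```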
The output of the project was an R Markdown document rendered to HTML. We brought our work together by making a skeleton Markdown document and importing each person’s work as a child document. Once we had worked out which packages each of us had installed, and made the sourcing of files and reading in of data consistent, the document knitted together smoothly.
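The skeleton pulls each piece in with knitr’s `child` chunk option; a sketch with invented file names:

````markdown
```{r, child = c("chris_income.Rmd", "ellena_costs.Rmd", "lizzi_demographics.Rmd")}
```
````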
As well as the coding side of things, I learnt a bit about project management methodology. In the initial messages about the project we were told it would be carried out using scrum methodology; a quick google reassured me that no rugby was going to be involved. As Chris mentioned, we had 15-minute stand-ups (nothing to do with comedians) each morning and afternoon. The purpose of these meetings was to quickly catch each other up on what we had been working on, what we were going to do next, and whether there were any blockers to getting the work done. The latter was particularly important given the small time frame.
In summary
This project resembled the exploratory phase of a full project and was perhaps a bit too short to produce a completely polished deliverable. However, we all learnt something, improved our R skills, and had an enjoyable and interesting experience of working on a data science project.