In between a rock and a conditional join


Joining two datasets is a common action we perform in our analyses. Almost every language has a solution for this task: R has the built-in merge function and the family of join functions in the dplyr package, SQL has the JOIN operation and Python has the merge function from the pandas package. Without a doubt these cover a wide variety of use cases, but there's always that one exception, that one use case that isn't covered by the obvious way of doing things.

In my case this is joining two datasets based on a conditional statement. So instead of requiring specific columns in both datasets to be equal to each other, I want to compare them on something other than equality (e.g. larger than). The following example should hopefully make things clearer.

myData <- data.frame(Record = seq(5), SomeValue=c(10, 8, 14, 6, 2))
myData
##   Record SomeValue
## 1      1        10
## 2      2         8
## 3      3        14
## 4      4         6
## 5      5         2

The above dataset, myData, is the dataset to which I want to add values from the following dataset:

linkTable <- data.frame(ValueOfInterest = letters[1:3], LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))
linkTable
##   ValueOfInterest LowerBound UpperBound
## 1               a          1          3
## 2               b          4          5
## 3               c         10         16

This second dataset, linkTable, contains the information to be added to myData. You may notice the two datasets have no columns in common. That is because I want to join the data based on the condition that SomeValue is between LowerBound and UpperBound. This may seem like an artificial (and perhaps trivial) example, but just imagine SomeValue to be a date or a zip code. Then imagine LowerBound and UpperBound to be the bounds of a specific time period or geographical region, respectively.

In Mango’s R training courses one of the most important lessons we teach our participants is that how you obtain the answer is just as important as the answer itself. So I’ll try to convey that here too, instead of just giving you the answer.

Helping you help yourself

So the first step in finding the answer is to explore R’s comprehensive help system and documentation. Since we’re talking about joins, it’s only natural to look at the documentation of the merge function and the join functions from the dplyr package. Unfortunately, both only let you supply columns that are compared to each other based on equality. However, the documentation for the merge function does mention that when no columns are given, the function performs a Cartesian product. That’s just a seriously cool way of saying every row from myData is joined with every row from linkTable. It might not solve the task, but it does give me the following idea:

library(dplyr)

# Attempt #1: do a Cartesian product and then filter the relevant rows
merge(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(-LowerBound, -UpperBound)
##   Record SomeValue ValueOfInterest
## 1      5         2               a
## 2      1        10               c
## 3      3        14               c

You can do the above in dplyr as well but I’ll leave that as an exercise. The more important question is: what is wrong with the above answer? You may notice that we’re missing records 2 and 4, because they didn’t satisfy the filtering condition. If we wanted to add them back in we would have to do another join (sketched below). Something you won’t notice with these small example datasets is that a Cartesian product is an expensive operation; combining all the records of two datasets can result in an explosion of rows.
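For completeness, here is a minimal sketch of that repair step; the intermediate name matchedRows is mine, purely for illustration. Keep only the matching rows, then left-join them back onto myData:

# keep the matches from the Cartesian product
matchedRows <- merge(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(Record, ValueOfInterest)

# join the matches back so records 2 and 4 reappear with NA
left_join(myData, matchedRows, by = "Record")

This should return all five records, with NA as the ValueOfInterest for records 2 and 4, but it still pays the cost of the full Cartesian product.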

(Sometimes) a SQL is better than the original

When neither the built-in functions nor the functions from packages you know solve the problem, the next step is to expand the search. You can resort directly to your favourite search engine (which will inevitably redirect you to Stack Overflow), but it helps to first narrow the search by thinking about any possible clues. For me that clue was that joins are an important part of SQL, so I searched for a SQL solution that works in R.

The above search directed me to the excellent sqldf package. This package allows you to write SQL queries and execute them using data.frames instead of tables in a database. I can thus write a SQL JOIN query with a BETWEEN clause and apply it to my two tables.

library(sqldf)

# Attempt #2: execute a SQL query
sqldf('SELECT Record, SomeValue, ValueOfInterest
       FROM myData
       LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound AND UpperBound')
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Marvellous! That gives me exactly the result I want and with little to no extra effort. The sqldf package takes the data.frames and creates corresponding tables in a temporary database (SQLite by default). It then executes the query and returns a data.frame. Even though the package isn’t built for performance it handles itself quite well, even with large datasets. The only disadvantage I can think of is that you must know a bit of SQL.

So now that I have found the answer I can continue with the next step in the analysis. That would’ve been the right thing to do but then curiosity got the better of me and I continued to find other solutions. For completeness I have listed some of these solutions below.

Fuzzy wuzzy join

If you widen the search for a solution you will (eventually, via various GitHub issues and Stack Overflow questions) come across the fuzzyjoin package. If you’re looking for flexible ways to join two data.frames then look no further. The package has a few ready-to-use solutions for a number of use cases: matching on equality with a tolerance (difference_inner_join), string matching (stringdist_inner_join), matching on Euclidean distance (distance_inner_join) and many more. For my use case I will use the more generic fuzzy_left_join, which allows for one or more matching functions.

library(fuzzyjoin)

# Attempt #3: use the fuzzyjoin package
fuzzy_left_join(myData, linkTable,
                by = c("SomeValue" = "LowerBound", "SomeValue" = "UpperBound"),
                match_fun = list(`>=`, `<=`)) %>%
  select(Record, SomeValue, ValueOfInterest)
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Again, this is exactly what we’re looking for. Compared to the SQL alternative it takes a little more time to figure out what is going on, but that is a minor disadvantage. On the other hand, there is now no need to go back and forth with a database backend. I haven’t checked what the performance differences are; that is a little out of scope for this post, though a sketch of how you could check follows below.
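If you did want to compare them, a minimal sketch using the microbenchmark package might look like the following. Bear in mind that meaningful differences will only show up on much larger datasets than the toy ones used here:

library(microbenchmark)

# rough timing sketch; only informative on larger data
microbenchmark(
  cartesian = merge(myData, linkTable) %>%
    filter(SomeValue >= LowerBound, SomeValue <= UpperBound),
  sql = sqldf('SELECT Record, SomeValue, ValueOfInterest
               FROM myData
               LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound AND UpperBound'),
  fuzzy = fuzzy_left_join(myData, linkTable,
                          by = c("SomeValue" = "LowerBound", "SomeValue" = "UpperBound"),
                          match_fun = list(`>=`, `<=`)),
  times = 10
)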

If not dplyr then data.table

I know it can be slightly annoying when someone answers your question about dplyr by saying it can be done in data.table, but it’s always good to keep an open mind, especially when one solves a task the other can’t (yet). It doesn’t take much effort to convert a data.frame to a data.table. From there we can use the foverlaps function to do a non-equi join (as it is referred to in data.table-speak).

library(data.table)

# Attempt #4: use the data.table package
myDataDT <- data.table(myData)
# foverlaps() matches intervals on both sides, so duplicate SomeValue
# to give myData a zero-width interval to match against
myDataDT[, SomeValueHelp := SomeValue]
linkTableDT <- data.table(linkTable)
setkey(linkTableDT, LowerBound, UpperBound)

result <- foverlaps(myDataDT, linkTableDT, by.x = c('SomeValue', 'SomeValueHelp'),
                    by.y = c('LowerBound', 'UpperBound'))
result[, .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

Ok so I’m not very well versed in the data.table way of doing things. I’m sure there is a less verbose way but this will do for now. If you know the magical spell please let me know (through the links provided at the end).

Update 6-Feb-2018
Stefan Fritsch provided the following (less verbose) way of doing it with data.table:

linkTableDT[myDataDT, on = .(LowerBound <= SomeValue, UpperBound >= SomeValue),
            .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

The pythonic way

Now that we’ve strayed this far from the tidyverse, we might as well go all the way. During my search I also encountered an interesting Python solution. It involves pandas and some matrix multiplication, and works as follows (yes, you can run Python code in an RMarkdown document).

import pandas as pd

# Attempt #5: use Python and the pandas package
# create the pandas DataFrames (kind of like R data.frames)
myDataDF = pd.DataFrame({'Record': range(1, 6), 'SomeValue': [10, 8, 14, 6, 2]})
linkTableDF = pd.DataFrame({'ValueOfInterest': ['a', 'b', 'c'],
                            'LowerBound': [1, 4, 10],
                            'UpperBound': [3, 5, 16]})
# set the index of the linkTable (kind of like setting row names)
linkTableDF = linkTableDF.set_index('ValueOfInterest')
# now apply a function to each row of the linkTable
# this function checks which of the values in myData are between the upper
# and lower bound of a specific row, thus returning 5 values (length of myData)
mask = linkTableDF.apply(lambda r: myDataDF.SomeValue.between(r['LowerBound'],
                                                              r['UpperBound']), axis=1)
# mask is a 3 (length of linkTable) by 5 matrix of True/False values
# by transposing it we get the row names (the ValueOfInterest) as the column names
mask = mask.T
# we can then matrix multiply mask with its column names: each row picks up
# the label of the column whose condition is True
myDataDF['ValueOfInterest'] = mask.dot(mask.columns)
print(myDataDF)
##    Record  SomeValue ValueOfInterest
## 0       1         10               c
## 1       2          8
## 2       3         14               c
## 3       4          6
## 4       5          2               a

This is a nice way of doing it in Python but it’s definitely not as readable as the sqldf or fuzzyjoin alternatives. I for one had to blink at it a couple of times before I understood this witchcraft. I didn’t search extensively for a solution in Python so this may actually not be the right way of doing it. If you know of a better solution let me know via the links below.

Have no fear, the tidyverse is here

As you search for solutions to your own tasks you will undoubtedly come across many Stack Overflow questions and GitHub issues. Hopefully they will provide the answer to your question, or at least guide you to one. When they do, don’t forget to upvote or leave a friendly comment. When they don’t, do not despair but see it as a challenge to contribute your own solution. In my case the issue had already been reported and the dplyr developers are on it. I look forward to trying out their solution in the near future.
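For readers arriving here later: dplyr eventually grew native support for this in version 1.1.0, which added join_by() with inequality conditions. A minimal sketch of what the native solution looks like:

library(dplyr)  # join_by() requires dplyr 1.1.0 or later

left_join(myData, linkTable,
          join_by(SomeValue >= LowerBound, SomeValue <= UpperBound)) %>%
  select(Record, SomeValue, ValueOfInterest)

This should give the same five-row result as the sqldf and fuzzyjoin attempts.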

The code for this post is available on GitHub. I welcome any feedback; please let me know via Twitter or GitHub.