FAQ: How to overcome the limitations of a JOIN

In this video, Nate Lanier, Ikigai's Senior Data Scientist, answers one of the more frequent questions that we get from our clients: how to overcome the limitations of a JOIN.

Watch Video

Transcript

Typically most organizations store their data across a number of different tables and oftentimes it's necessary to combine these data sources into one mastered data set. There's a lot of different approaches that people use to combine data sets and one of the most common is just the standard join. So I wanted to take a few minutes to discuss some of the drawbacks that can be encountered with a join. They require a common key between data sets which must be an exact match and so if such a key doesn't exist or if there are issues with typos or inconsistencies and how the data has been entered, the performance can oftentimes be very poor. Let's just take a quick look at what a data set might look like that you'd want to perform a join with.

Here's just a very small example data set. We have table one, we have table two, and if you just compare the records we have James Davis and Jay Davis which looks like it could be a match; Sarah Brown and Sarah Brown which looks like a match; John Smith and Jonathan Smith which looks like a match; and then Frank Brown which has no corresponding record in the top data set. So the goal here would be to combine these two tables so we have one master data set that has both social security number and income.

If we were just going to join on the last name, we might have some results that look like this:

- James Davis if we're just using last name would match with Jay Davis and the bottom table and we have social security number and income and everything looks pretty good

- we move on to matching with Sarah Brown we start to run into a few issues here so Sarah Brown would match with Sarah Brown based on last name but Sarah Brown would also match with Frank Brown if we're just using last name as the identifier. So as a result we have some errors in our final table where we have this uh incorrect match.

- then John Smith will match with Jonathan Smith correctly and so as you can see we have these issues in our final table where we have incorrect data and one can imagine as you know the size of the data it gets greater for using you know tens of thousands or hundreds of thousands of rows or even millions of rows how this problem could be sort of exacerbated throughout our final data set

So then the next option might be trying to approach it by using the first name and last name. We've run into issues here as well so if we're looking for an exact match on first name and last name James Davis would not match with Jay Davis even though a human might look at this and think that it should be a match.

Sarah Brown actually works out well so we have both social security number and we have income. And then Jonathan Smith would not match with John Smith, and then also Frank Brown does not have a matching record up top, so as a result, the final data set is certainly not correct.

So at Ikigai Labs we have the aiMatch algorithm that we've been working on that will allow us to perform these sorts of joins without a unique identifier between the different tables so as you can see James Davis mapped correctly to Jay Davis.

Additionally, we have the street here which isn't an exact match either because we have Main St versus Main Street and so we're successfully able to get our table with social security number and income. And then similarly, Sarah Brown matches correctly; John Smith matches to Jonathan Smith even though the street isn't a perfect match either, and then Frank Brown has no corresponding record in the top data set but so we can see Frank Brown does appear but it's just missing the social security number because there's no match from table one.

WEBINAR

FAQ: How to overcome the limitations of a JOIN

In this video, Nate Lanier, Ikigai's Senior Data Scientist, answers one of the more frequent questions that we get from our clients: how to overcome the limitations of a JOIN.

This is some text inside of a div block.
EST
SPEAKER
No Speakers found.
Description

Typically most organizations store their data across a number of different tables and oftentimes it's necessary to combine these data sources into one mastered data set. There's a lot of different approaches that people use to combine data sets and one of the most common is just the standard join. So I wanted to take a few minutes to discuss some of the drawbacks that can be encountered with a join. They require a common key between data sets which must be an exact match and so if such a key doesn't exist or if there are issues with typos or inconsistencies and how the data has been entered, the performance can oftentimes be very poor. Let's just take a quick look at what a data set might look like that you'd want to perform a join with.

Here's just a very small example data set. We have table one, we have table two, and if you just compare the records we have James Davis and Jay Davis which looks like it could be a match; Sarah Brown and Sarah Brown which looks like a match; John Smith and Jonathan Smith which looks like a match; and then Frank Brown which has no corresponding record in the top data set. So the goal here would be to combine these two tables so we have one master data set that has both social security number and income.

If we were just going to join on the last name, we might have some results that look like this:

- James Davis if we're just using last name would match with Jay Davis and the bottom table and we have social security number and income and everything looks pretty good

- we move on to matching with Sarah Brown we start to run into a few issues here so Sarah Brown would match with Sarah Brown based on last name but Sarah Brown would also match with Frank Brown if we're just using last name as the identifier. So as a result we have some errors in our final table where we have this uh incorrect match.

- then John Smith will match with Jonathan Smith correctly and so as you can see we have these issues in our final table where we have incorrect data and one can imagine as you know the size of the data it gets greater for using you know tens of thousands or hundreds of thousands of rows or even millions of rows how this problem could be sort of exacerbated throughout our final data set

So then the next option might be trying to approach it by using the first name and last name. We've run into issues here as well so if we're looking for an exact match on first name and last name James Davis would not match with Jay Davis even though a human might look at this and think that it should be a match.

Sarah Brown actually works out well so we have both social security number and we have income. And then Jonathan Smith would not match with John Smith, and then also Frank Brown does not have a matching record up top, so as a result, the final data set is certainly not correct.

So at Ikigai Labs we have the aiMatch algorithm that we've been working on that will allow us to perform these sorts of joins without a unique identifier between the different tables so as you can see James Davis mapped correctly to Jay Davis.

Additionally, we have the street here which isn't an exact match either because we have Main St versus Main Street and so we're successfully able to get our table with social security number and income. And then similarly, Sarah Brown matches correctly; John Smith matches to Jonathan Smith even though the street isn't a perfect match either, and then Frank Brown has no corresponding record in the top data set but so we can see Frank Brown does appear but it's just missing the social security number because there's no match from table one.

Join Us

Join Us

Watch on Demand

By clicking Register, you agree to the Webinars Terms and Conditions.
Thank you! Your submission has been received!
Click here to view the webinar
Oops! Something went wrong while submitting the form.