Merge by proximity, not by equality

mayasandler
Jun 15, 2022
4 min read

I am doing a lot of JOINs in SQL and Python and usually I need to match the exact numbers/timestamps/booleans etc.

In a project I worked on I needed to match the values on one data frame to similar values on the other data frame. Pandas's regular merge does not allow merging on similar values - only on the exact one, so I looked for the best solution.

I found pandas.merge_asof Pandas method, which merges on multiple columns on their exact values and on an additional column with a delta value that you specify. The documentation shows examples of a single column that needs to me specific and a single columns that can have a tolerance. This method is used to LEFT MERGE the right data frame onto the left data frame. However, I needed to do an inner merge, therefore needed another condition to make it an inner merge.

As I needed a solution for matching multiple columns on their exact value and an additional column that needed the tolerance, I found the documentation a bit lacking in explanations and examples. In addition, I found multiple questions online in this topic when I was looking to solve my issue, so I wanted to share my example.

on = the column which may have exact or similar value.

by = the columns which have the exact values to merge on.

tolerance = the error gap that we allow in the "on" column values.

direction = can be "forward" (adding the tolerance to the left value), "backward" (subtracting it) or "nearest" (closest match, i.e. adding or subtracting the tolerance). If you don't know that the tolerance will always be added or subtracted from the value, choose "backward", but expect less efficient process.

An important note: The data frames need to be ordered in the same order.

* See the documentation for more information

This is my simple example:

Output:

I want to merge these tables so that columns "a" and "b" will match exactly but leave a tolerance of 6 to the "amount". Therefore, columns "a" and "b" will be in the "by" section and the "amount" column will be in the "on". In my case, I needed the option to have the exact match in the "amount" column, therefore I used allow_exact_match=True:

Output:

The result shows that this left merge was done matching exactly "a" and "b" and matching 'loosely' on the "amount".

As I needed an inner merge I needed to add one more condition to this script using notnull():

Output:

What did not merge?

It is always good practice to look for the unmatched records to know what will be missing from the new merged data frame. However, there is not how=outer that we can use to find it. To go around this issue, we need to look for the records on the left data frame and on the right data frame that did not find a match and append them together. Subsetting only the records with NULL values on the left_merged_df1 will give us only the records from the left data frame and will disregard the mismatched records from the right data frame.

Therefore, to get the full non-matched data, we need to do a loose left merge and a loose right merge, and then subset these data frames form NULL values in their corresponding columns:

Output:

Now to subset the records with NULL values in bot the left and the right merges and appending them to a single data frame:

Output:

Now we can go ahead a look into the mismatches data if we need to see if the reason is dirty data or something else.

Issues and Solutions

1. If you are looking to match many:many or 1:many relationship data frames and want all the results that match - the solution is going to be long and not really achievable in this method. The reason for this is that for each row in the left data frame, merge_asof() selects the last row in the right data frame whose ‘on’ key matches the left’s key, and excludes the others.

** Want a 1:many match? The best solution is to flip direction to create a many:1 match. ** Want a many:many match? Well I am not sure this is the best method to use for this...

2. It merges on a column from left and on the right. In a regular merge we expect both of the columns to be included. However, in merge_asof() it excludes the right column from the merged data frame. This can be an issue when you need both of the columns in your result.

A good article that summarize the need-to-know (and there are a lot in this method) is Angwalt's The hidden rules of pandas.merge_asof() https://angwalt12.medium.com/the-hidden-rules-of-pandas-merge-asof-e67293a5318e

To conclude, the pandas.merge_asof method is very useful in matching values from two data frames that are close but not an exact match. This method is very useful in matching amounts that may have been rounded on one data frame but not on the other, as well in finding close or prior/subsequent timestamps of online actions, and still being able to merge the data frames on the exact values of the other columns.

Merge by proximity, not by equality

What did not merge?

Issues and Solutions

Recent Posts

Comments