
When You Can’t Join, Use aiMatch

Learn how aiMatch takes datasets as inputs, learns the relationships between their columns, and uses similarities between rows to “stitch” them together.

Sep 29, 2022
15 minutes to read

The Challenge. The quintessential pre-task of most data-driven analysis is “stitching” multiple data sources together. Traditionally, in database terms, this is achieved through “joins”, which “stitch” datasets together based on shared entries in columns common to both datasets. Classical joins work well when the matches are rule-based and depend on common columns.

In many modern settings, however, this does not work. When two datasets lack a shared column or have mismatched entries, a join is likely to miss the correct relationship. This can happen for many reasons, including (1) a lack of centralized design across first-party and third-party datasets; (2) datasets that do not adhere to a standardized format; or (3) typos or missing values in the data that make “stitching” difficult. Figure 1 provides an example of “messy” data where stitching is difficult.

Figure 1: Consider two tables: the left table has names such as “Jones, Angela”, which appear in the right table as “Jones A”. A traditional join operation cannot “stitch” such datasets. More generally, the left table could split the name across two columns, with “Jones” in one and “Angela” in the other, which makes stitching even more difficult.
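To make the failure concrete, here is a minimal sketch of an exact-match inner join applied to the two name formats quoted in the caption. The rows, the `invoice` and `payment` columns, and the `exact_join` helper are hypothetical and exist only for illustration; they are not taken from the dataset in Figure 1.

```python
# Hypothetical rows modeled on the two name formats quoted in the caption.
left = [{"name": "Jones, Angela", "invoice": "INV-001"}]
right = [{"name": "Jones A", "payment": 120.0}]


def exact_join(left_rows, right_rows, key):
    """Inner join on exact equality of `key`, like a classical database join."""
    # Index the right-hand rows by their key value for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        # Only rows whose key is character-for-character identical are paired.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


print(exact_join(left, right, "name"))  # [] because the two strings are not equal
```

The join returns nothing even though a person would immediately see that both rows refer to the same individual.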

The Opportunity. The task of “stitching” multiple datasets like those in Figure 1 comes up routinely in many settings, including Financial Reconciliation (Accounts Receivable, Accounts Payable), Record Linkage, Inventory Tracking, Order Management, Auditing in Insurance, and more.

Currently, this is addressed through a mix of manual matching and ad hoc solutions. Our Excel users spend up to 80% of their time on such operations. The problem calls for a solution that automates the “stitching” of datasets where joins do not work.

The Solution. At Ikigai, we have developed a machine-learning-powered solution for this precise challenge: aiMatch. At the highest level, aiMatch takes datasets as inputs, attempts to learn the relationships between their columns, and then uses “similarities” between rows to “stitch” them.

aiMatch comes in three forms. In the simplest form, it takes two datasets and produces the best stitching it can, without any further input. For example, if the two datasets can be joined with a traditional database operation, aiMatch will produce that join. In other words, you no longer need to worry about which join to use: left, right, …?
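The idea of stitching rows by similarity rather than equality can be sketched with Python's standard `difflib` module. This is only an illustrative stand-in: the article does not describe how aiMatch computes similarity, and the `normalize`, `similarity`, and `fuzzy_stitch` helpers and the sample names below are assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so differently formatted names are comparable."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def similarity(a, b):
    """Character-level similarity between 0 and 1 (a stand-in for a learned score)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_stitch(left_names, right_names):
    """Pair each left-hand name with its most similar right-hand name."""
    pairs = []
    for left in left_names:
        best = max(right_names, key=lambda right: similarity(left, right))
        pairs.append((left, best, round(similarity(left, best), 2)))
    return pairs


# Hypothetical names in the two formats described for Figure 1.
for left, right, score in fuzzy_stitch(["Jones, Angela", "Smith, Robert"],
                                       ["Smith R", "Jones A"]):
    print(f"{left!r} -> {right!r} (similarity {score})")
```

A greedy best-match like this has no notion of which columns matter or how confident it should be, which is where a learned approach is meant to help.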

For the dataset illustrated in Figure 1, the simplest form of aiMatch achieves an F1 score of 0.83. The F1 score is the harmonic mean of precision and recall; it ranges from 0 to 1, where 1 means every true match was found and every reported match is correct.
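For readers unfamiliar with the metric, the following sketch computes an F1 score for a set of predicted row pairs against the true pairs. The pairs are invented so that the arithmetic is easy to follow; they are not the article's data, and any resemblance to the reported score is purely illustrative.

```python
def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall over sets of matched row pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # also avoids dividing by zero when nothing matches
    precision = true_positives / len(predicted)  # share of reported matches that are right
    recall = true_positives / len(actual)        # share of true matches that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: six true pairs of (left_row_id, right_row_id).
actual = [(i, i) for i in range(6)]
# The matcher gets five right and pairs left row 5 with the wrong right row.
predicted = [(i, i) for i in range(5)] + [(5, 4)]

print(round(f1_score(predicted, actual), 2))  # 0.83
```

Here precision and recall are both 5/6, so their harmonic mean is also 5/6.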

In many settings, historical examples of stitching that indicate a “match” between the datasets may also be available. In the second version of aiMatch, stitching is learned from such historical examples.

For the dataset illustrated in Figure 1, correcting a small number of errors in the output of the simplest form of aiMatch and then feeding the result to the second version of aiMatch improves the F1 score from 0.83 to 1, that is, perfect stitching.
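As a toy illustration of learning from historical examples, the sketch below picks a similarity cutoff that makes the fewest mistakes on labeled pairs. This is a deliberately simple assumption and not the method aiMatch uses, which the article does not specify; the `similarity` and `learn_threshold` helpers and the labeled history are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def learn_threshold(labeled_pairs):
    """Return the similarity cutoff with the fewest errors on labeled history.

    labeled_pairs is a list of (left, right, is_match) tuples.
    """
    scores = [(similarity(left, right), is_match)
              for left, right, is_match in labeled_pairs]
    candidates = sorted({score for score, _ in scores})

    def errors(cutoff):
        # A pair is predicted to match when its score reaches the cutoff.
        return sum((score >= cutoff) != is_match for score, is_match in scores)

    return min(candidates, key=errors)


# Hypothetical history: pairs a person has already confirmed or rejected.
history = [
    ("Jones A", "Jones, Angela", True),
    ("Smith R", "Smith, Robert", True),
    ("Jones A", "Smith, Robert", False),
    ("Smith R", "Jones, Angela", False),
]
threshold = learn_threshold(history)
print(threshold)
```

With more history, the same labeled pairs could train a richer model than a single cutoff; the point is only that confirmed and rejected matches carry a usable signal.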

The third and most advanced version of aiMatch involves eXpert-In-the-loop (Xitl). No machine learning solution is perfect. At Ikigai, we believe that exceptions are the norm in machine-learning-driven solutions. Therefore, the stitching that aiMatch produces (in both the simplest form and the form where stitching is learned from historical examples) may not be fully accurate. Typically, one would expect a few inaccuracies in cases where aiMatch is not confident about whether or not to stitch. These can be rectified through simple human-in-the-loop interaction: with a few clicks, a human corrects the inaccurate stitching. The machine learns from these corrections to improve the stitching in the future.
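One way to picture this loop is to route only the uncertain pairs to a person and turn their decisions into new labeled examples. The sketch below assumes each candidate pair already carries a confidence score; the `triage` and `apply_corrections` helpers, the cutoffs, and the sample data are hypothetical and do not describe the Xitl implementation.

```python
def triage(scored_pairs, accept_at=0.9, reject_at=0.3):
    """Split (left, right, score) pairs into accepted, rejected, and needs-review."""
    accepted, rejected, review = [], [], []
    for left, right, score in scored_pairs:
        if score >= accept_at:
            accepted.append((left, right))
        elif score <= reject_at:
            rejected.append((left, right))
        else:
            # Not confident either way: ask a human.
            review.append((left, right, score))
    return accepted, rejected, review


def apply_corrections(review, decisions):
    """Turn reviewer decisions into labeled examples for future learning.

    decisions maps (left, right) to True (match) or False (not a match).
    """
    return [(left, right, decisions[(left, right)])
            for left, right, _ in review
            if (left, right) in decisions]


# Hypothetical scored candidates.
scored = [
    ("Jones, Angela", "Jones A", 0.95),
    ("Smith, Robert", "Smyth R", 0.60),
    ("Lee, Min", "Jones A", 0.10),
]
accepted, rejected, review = triage(scored)
labeled = apply_corrections(review, {("Smith, Robert", "Smyth R"): True})
print(accepted, rejected, labeled)
```

Only the middle pair reaches the reviewer, and their single decision becomes a labeled example that a later run could learn from.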

Interested? If you want to automate your data stitching workflow, reconciliation, or any similar application, book a demo with us.

About the Author

Anusha Bharat (Backend Software Engineer)

Anusha Bharat is a Backend Software Engineer with a focus on Machine Learning at Ikigai Labs. She recently graduated from Texas A&M University with a Master’s in Computer Engineering. She is working on developing the platform’s ML/AI capabilities.

Originally featured on medium.com
