The Solution to a Data Engineer's Pain: My Tools for Flawless Data Validation
- mayasandler

As a data engineer, I've spent countless hours making sure data flows smoothly from one system to another. This often involves the stressful yet crucial task of comparing data before and after a big move - checking that the data is correct. I've witnessed the struggle, ranging from manual "spot checks" - counting records and eyeballing column names, where human mistakes are common and you constantly worry about stray spaces, duplicate records, and small changes in values - to expensive custom scripts. The question "is this data truly identical?" has been a persistent pain point for clients, and the dread of finding one mistake, fixing it, and retesting was always there.
Many times, clients knew they needed automated tools to avoid manual errors and save time, but they didn't have the technical skills or the budget to build them. We would offer to build custom scripts, but these tools were expensive (our time is money) and it felt like we were starting from scratch every time. It was frustrating for everyone to spend so much money on something that felt like a basic need. This constant frustration and the repeated cycle of data validation challenges sparked a mission within me: to build tools that are easy to use and give clear answers to critical data quality questions.
Today, I'm thrilled to share two automated data comparison tools that I've developed. Each is designed to tackle a different aspect of data validation, so you have the right instrument for the job whether you're working with local files or massive cloud data warehouses. My GitHub repo includes the scripts as well as some example local files and output files from the analysis.
Tool 1: A Local CSV File Comparison Tool
My CSV File Comparison Tool is built for those times when you need to quickly and thoroughly compare two CSV files. It's perfect for quick but robust sanity checks and for validating smaller dataset exports from various systems. I designed it to be a comprehensive data quality and comparison engine that goes far beyond a simple "diff" tool.
Key Features to Give You Confidence:
Automatic File Finder: no typing out file paths - just set a folder and drop the two CSV files into it
Record Count: the first and most important check - are all your rows there?
Works with Different Column Orders: can compare columns even if their order is different between files
Null and Empty Value Checks: sometimes extra spaces get in the way - a padded value is not the same value, and it is not NULL either. This script finds them for you.
Data Type Inconsistencies: identifies data type mismatches between columns with the same name.
Deep Statistical Look: compares averages, medians, and unique values for numbers.
Finds Records Smartly: uses your chosen key columns to find key-based duplicates as well as full-row duplicates
Easy Setup: you just change one line to set your KEY_COLUMNS (see the sketch below)
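The actual script lives in the GitHub repo; purely as an illustration, here is a minimal pandas sketch of the kind of checks described above. The file names and KEY_COLUMNS values are placeholders, not the tool's real configuration.

```python
# Illustrative sketch only - not the repo script. File names and KEY_COLUMNS
# are placeholders for whatever your exports and business keys happen to be.
import pandas as pd

KEY_COLUMNS = ["customer_id", "order_id"]  # the one line you would change

df1 = pd.read_csv("export_before.csv")
df2 = pd.read_csv("export_after.csv")

# Record count - the first and most important check
print(f"Record counts: {len(df1)} vs {len(df2)}")

# Align on shared columns so a different column order doesn't matter
shared = [c for c in df1.columns if c in df2.columns]
df1, df2 = df1[shared], df2[shared]

# Trim stray whitespace so "ACME " is not treated as a different value than "ACME"
for df in (df1, df2):
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

# Quick statistical look at numeric columns: averages and medians
for col in df1.select_dtypes(include="number").columns:
    if col in df2.select_dtypes(include="number").columns:
        print(col, df1[col].mean(), df2[col].mean(),
              df1[col].median(), df2[col].median())
```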
Beyond the Basics: Finding Every Problem
This tool goes deep to find issues that old methods often miss, putting them into categories so you can easily understand them:
Basic Checks: record counts, column names, order, and data types.
Data Quality Checks: finding nulls, empty values, extra spaces, and case differences.
Value Analysis: detailed, clear comparisons of values in each cell, even handling text versus numbers.
Statistical Analysis: goes beyond simple counts, comparing averages, medians, and unique values at a deeper level.
Error Tracking: tracking four different types of problems (a short sketch follows this list):
Full-Row Duplicates ('f'): identical rows within one file.
Key-Based Duplicates ('k'): multiple rows with the same key columns, even if other data is different.
Missing Records ('m'): records in your first file that are not present in the second.
Extra Records ('e'): records in your second file that are not present in the first.
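To make the four categories concrete, here is a rough pandas sketch of how each one can be derived. Again, this is an illustration with placeholder names rather than the tool's actual implementation.

```python
# Rough illustration of the four error categories; names are placeholders.
import pandas as pd

KEY_COLUMNS = ["customer_id", "order_id"]
df1 = pd.read_csv("export_before.csv")
df2 = pd.read_csv("export_after.csv")

# 'f' - full-row duplicates: identical rows within one file
full_dups = df1[df1.duplicated(keep=False)]

# 'k' - key-based duplicates: same key columns, even if other values differ
key_dups = df1[df1.duplicated(subset=KEY_COLUMNS, keep=False)]

# 'm' / 'e' - records missing from the second file or extra in it, by key
keys1 = df1[KEY_COLUMNS].drop_duplicates()
keys2 = df2[KEY_COLUMNS].drop_duplicates()
merged = keys1.merge(keys2, on=KEY_COLUMNS, how="outer", indicator=True)
missing = merged[merged["_merge"] == "left_only"][KEY_COLUMNS]   # in file 1 only
extra = merged[merged["_merge"] == "right_only"][KEY_COLUMNS]    # in file 2 only
```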
Clear Output for Real Results:
You won’t be left guessing. The tool gives you clear, helpful outputs:
Console Output: shows real-time progress and a quick summary.
Detailed Text File: a well-organized comparison_results__file1_vs_file2.txt with sections for every check.
Error Records CSV File: if problems are found, it creates an error_records__file1_vs_file2.csv file. This file is a lifesaver, containing only the problematic records, along with columns that tell you the source_file, error_type, and num_errors.
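Because the error file is just a regular CSV, it is easy to slice further. The snippet below is a quick example (using the file name pattern from above) of counting and filtering issues by type.

```python
# Load the generated error file and break it down by error type.
# The file name follows the pattern described above; adjust it to your run.
import pandas as pd

errors = pd.read_csv("error_records__file1_vs_file2.csv")

# How many problems of each kind ('f', 'k', 'm', 'e') were found?
print(errors["error_type"].value_counts())

# Focus on records that are missing from the second file
missing = errors[errors["error_type"] == "m"]
print(missing[["source_file", "error_type", "num_errors"]].head())
```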
Tool 2: The Snowflake Table Comparison Tool – Power for Big Data
While the CSV tool is fantastic for local files, I quickly realized its limitations when dealing with massive datasets of millions of records that already live in data warehouses. Loading huge tables into local memory for comparison simply isn't efficient, or even possible. That's why I developed the Snowflake Table Comparison Tool. It leverages the power of Snowflake itself, avoiding memory issues and ensuring speed (and it can, of course, be easily adapted to other data systems).
What it does for you
Direct-in-Snowflake Comparison: this tool runs directly within a Snowflake Python worksheet using Snowpark. It harnesses Snowflake's computing power to compare tables, making it incredibly fast and scalable for even the largest datasets.
Similar Comprehensive Checks: like its CSV counterpart, it performs the following (a minimal Snowpark sketch follows this list):
Record count and column structure analysis.
Duplicate detection (full-row and key-based).
Identification of missing/extra records.
Data type consistency checks.
Statistical comparisons for numeric columns.
Value distribution analysis.
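The repo has the full worksheet; the sketch below only illustrates the general Snowpark pattern - everything is pushed down to Snowflake, and a small summary comes back to the Results pane. Table, schema, and view names are placeholders of my own, not the repo's exact code.

```python
# Minimal Snowpark illustration (run as a Snowflake Python worksheet).
# Table and view names are placeholders.
import snowflake.snowpark as snowpark


def main(session: snowpark.Session):
    t1 = "MY_DB.MY_SCHEMA.ORDERS_OLD"
    t2 = "MY_DB.MY_SCHEMA.ORDERS_NEW"

    # Record counts, computed inside Snowflake - nothing is pulled into local memory
    count1 = session.table(t1).count()
    count2 = session.table(t2).count()

    # Missing/extra records, pushed down as plain SQL so Snowflake does the heavy lifting
    missing = session.sql(f"SELECT * FROM {t1} EXCEPT SELECT * FROM {t2}")
    extra = session.sql(f"SELECT * FROM {t2} EXCEPT SELECT * FROM {t1}")

    # Persist the details as views so they can be explored later with standard SQL
    missing.create_or_replace_view("MY_DB.MY_SCHEMA.MISSING_RECORDS_VW")
    extra.create_or_replace_view("MY_DB.MY_SCHEMA.EXTRA_RECORDS_VW")

    # Return a one-row summary that shows up in the worksheet's Results pane
    return session.create_dataframe(
        [[count1, count2, missing.count(), extra.count()]],
        schema=["COUNT_TABLE_1", "COUNT_TABLE_2", "MISSING_RECORDS", "EXTRA_RECORDS"],
    )
```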
Smart Output in Snowflake:
Results Pane Summary: get an immediate overview of key metrics directly in your Snowflake worksheet, showing record counts, column matches, and total issues.
Multiple Detailed Database Views: for deep dives, the script creates several views in your Snowflake schema. These views contain granular details on missing records, extra records, different values between tables, and duplicates. This allows you to explore the issues using standard SQL queries, integrating seamlessly into your data analysis workflow.
This tool is designed for scale and performance, making it an essential asset for anyone working with large datasets in Snowflake.
My Vision: Making Data Integrity Accessible
Both of these tools stem from my years of experience witnessing the challenges and frustrations of data validation. My goal is simple: to empower individuals and teams to confidently ensure the accuracy and integrity of their data, whether it's in a small CSV file or a massive Snowflake table.
I truly hope these tools will become invaluable assets in your own data journeys. They represent a step towards making robust data validation an accessible standard.
Please explore the GitHub repository, try out the tools, and share your feedback. Your insights are crucial as we continue to build better data practices together and make data moves a little less stressful and a lot more reliable.