
The Solution to a Data Engineer's Pain: A Simple Tool for a Complex Problem

  • maya21373
  • Sep 4
  • 4 min read

For years, I've worked in the world of data, helping companies move their information from old systems to new ones. I've seen countless projects succeed, but beneath the surface of every big move, there's one stressful part that has always bothered me: checking whether the data is correct.


I remember one project with a huge customer database. The stakes were high. Millions of customer records needed to move perfectly and fast (like always, ha!). The pressure was real, and the fear of even one tiny mistake, then drilling into the issue, fixing it, and retesting, was always there.


The first way we'd check things was always a mix of manual checks and basic reports. Imagine a team of people staring at spreadsheets, their eyes tired, manually comparing highlighted rows. “Is this record in both files?”, “Does the address match exactly?”, “Why is this number different?”. The questions never stopped, the work was slow, and human mistakes were very common.


Then we had the clients who knew they needed automated tools but didn't have the skills or money to build them. We would offer to build custom scripts for them, made for their specific data. While these tools worked, they were expensive (our time = $$) and felt like we were starting from scratch every time. It was frustrating for everyone to spend so much money on something that felt like a basic need.


I can still remember a difficult day, looking at two huge CSV files from a financial data project. My head hurt from too much coffee and the stress of the deadline. I had run my usual checks, like counting rows and looking at column names, and everything looked fine. But that worried feeling wouldn't go away. What about small differences in spaces? What about duplicate records that we missed? What about the actual values themselves? It was a mix of hoping we had caught everything and being deeply afraid of what we might have missed. I knew there had to be a better way, a more reliable solution.


I've seen clients have to settle for less, only checking the easiest things or doing some "spot checking," and simply hoping the data migration went smoothly. This isn't how we should handle important data. It felt like a roadblock for companies that just wanted to make sure their most valuable asset - their data - was safe.


So, I decided to build something different. Instead of making custom tools for every single client, I wanted to create a simple, powerful Python script that anyone could use to feel sure about their data. A tool born from many late nights, the frustration of manual checks, and the strong belief that comparing data should be easy for everyone. After many hours of work, I'm proud to share this open source CSV File Comparison Tool with you. It's designed to be your trusted partner, an automated data detective that looks deeper to find even the smallest problems. This script isn't just a basic tool. It's a complete data quality and comparison engine. You can check it out on my GitHub - it includes the script as well as some example files and the output files from the analysis, which you can use to identify the issues in the data!


My Solution: The CSV File Comparison Tool


Key Features to Give You Confidence:

  • Automatic File Finder: no need to type out file paths - just set a folder

  • Record Count: the first and most important check - are all your rows there?

  • Works with Different Column Orders: can compare columns even if their order is different between files

  • Null and Empty Value Checks: sometimes extra spaces get in the way - a value with stray whitespace is not the same value, and it is not NULL either. This script finds them for you.

  • Data Type Inconsistencies: identifies data type mismatches between columns with the same name.

  • Deep Statistical Look: compares averages, medians, and unique values for numeric columns.

  • Finds Records Smartly: uses your chosen key columns to find key-based duplicates as well as full-row duplicates

  • Easy Setup: you just change one line to set your KEY_COLUMNS (see the setup sketch right after this list)
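To give a feel for how little setup is involved, here is a minimal sketch of what the configuration and file discovery could look like. It assumes a pandas-based script: KEY_COLUMNS is the setting the tool actually exposes, but the folder variable, example key names, and discovery logic below are illustrative assumptions, not the script's real code.

```python
# Minimal setup sketch -- every name here except KEY_COLUMNS is assumed.
import glob
import os

import pandas as pd

# The one line you change: the columns that uniquely identify a record.
KEY_COLUMNS = ["customer_id", "account_number"]  # hypothetical example keys

# Automatic file finder: point at a folder instead of typing out file paths.
DATA_FOLDER = "data"  # assumed setting name
csv_files = sorted(glob.glob(os.path.join(DATA_FOLDER, "*.csv")))
file1, file2 = csv_files[0], csv_files[1]  # assumes the folder holds the two files to compare

# Reading everything as strings keeps stray spaces and case differences
# visible instead of letting pandas silently coerce them.
df1 = pd.read_csv(file1, dtype=str, keep_default_na=False)
df2 = pd.read_csv(file2, dtype=str, keep_default_na=False)

# Record count: the first and most important check - are all your rows there?
print(f"{file1}: {len(df1)} rows | {file2}: {len(df2)} rows")
```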


Beyond the Basics: Finding Every Problem

This tool goes deep to find issues that old methods often miss, putting them into categories so you can easily understand them:

  1. Basic Checks: record counts, column names, order, and data types.

  2. Data Quality Checks: finding nulls, empty values, extra spaces, and case differences.

  3. Value Analysis: detailed, clear comparisons of values in each cell, even handling text versus numbers.

  4. Statistical Analysis: goes beyond simple counts to compare the data at a deeper level (averages, medians, unique values).

  5. Error Tracking: labels each problem with one of four error types (a short sketch follows this list):

    • Full-Row Duplicates ('f'): identical rows within one file.

    • Key-Based Duplicates ('k'): multiple rows with the same key values, even if other data differs.

    • Missing Records ('m'): records in your first file that are not present in the second.

    • Extra Records ('e'): records in your second file that are not present in the first.
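To make the four error types concrete, here is a rough pandas sketch of how each one could be detected. This is my illustration of the idea under assumed names (classify_errors and the example key are made up for the demo); the script's actual internals may differ.

```python
# Sketch: classifying the four error types with pandas (illustrative only).
import pandas as pd

KEY_COLUMNS = ["customer_id"]  # hypothetical key

def classify_errors(df1: pd.DataFrame, df2: pd.DataFrame) -> dict:
    # 'f': full-row duplicates -- rows that are identical within one file.
    full_dupes = df1[df1.duplicated(keep=False)]

    # 'k': key-based duplicates -- same key values, other columns may differ.
    # (Note: this also flags full duplicates, since they share keys too.)
    key_dupes = df1[df1.duplicated(subset=KEY_COLUMNS, keep=False)]

    # Build sets of key tuples for the membership checks below.
    keys1 = set(map(tuple, df1[KEY_COLUMNS].itertuples(index=False)))
    keys2 = set(map(tuple, df2[KEY_COLUMNS].itertuples(index=False)))

    # 'm': missing -- records in file 1 whose keys never appear in file 2.
    missing = df1[~df1[KEY_COLUMNS].apply(tuple, axis=1).isin(keys2)]

    # 'e': extra -- records in file 2 whose keys never appear in file 1.
    extra = df2[~df2[KEY_COLUMNS].apply(tuple, axis=1).isin(keys1)]

    return {"f": full_dupes, "k": key_dupes, "m": missing, "e": extra}
```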


Clear Output for Real Results:

You won’t be left guessing. The tool gives you clear, helpful files:

  • Console Output: shows real-time progress and a quick summary.

  • Detailed Text File: a well-organized comparison_results__file1_vs_file2.txt with sections on every check.

  • Error Records CSV File: if problems are found, it creates an error_records__file1_vs_file2.csv file. This file is a lifesaver, containing only the problematic records, along with columns that tell you the source_file, error_type, and num_errors (a sketch of how such a file could be assembled follows below).
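Here is a rough sketch of how that error-records file could be stitched together. The column names (source_file, error_type, num_errors) and the output naming pattern come from the tool's output described above; the assembly helper, the file-to-error-type mapping, and the reading of num_errors as a per-type row count are my assumptions for illustration.

```python
# Sketch: assembling error_records__file1_vs_file2.csv (illustrative only).
from pathlib import Path

import pandas as pd

def write_error_records(errors: dict, file1: str, file2: str) -> None:
    # Assumption: duplicates and missing records are reported against file 1,
    # extra records against file 2.
    source = {"f": file1, "k": file1, "m": file1, "e": file2}
    frames = []
    for code, df in errors.items():
        if df.empty:
            continue
        tagged = df.copy()
        tagged["source_file"] = source[code]
        tagged["error_type"] = code
        tagged["num_errors"] = len(df)  # assumed meaning: rows flagged for this type
        frames.append(tagged)
    if frames:  # only write the file when problems were actually found
        out = pd.concat(frames, ignore_index=True)
        name1, name2 = Path(file1).stem, Path(file2).stem
        out.to_csv(f"error_records__{name1}_vs_{name2}.csv", index=False)
```

Keeping only the problematic rows, rather than a full dump of both files, is what makes this output easy to act on: you can filter by error_type and go straight to the records that need attention.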


Why this Tool Matters to You


I've seen clients struggle with the cost and work of building custom tools for every project. As a data engineering consultant, I've felt the worry of clients not knowing whether every single piece of data makes it over safely. This tool is my answer to that struggle. It's built on the idea that good, easy-to-use data checks should be something everyone has access to. I truly hope this script will be useful in your own data work, whether you're a data professional, a business analyst, or anyone who needs to make sure their CSV data is correct.

Please check it out. I would love to hear your feedback - how you use it, and any ideas you might have. Together, we can make data moves a little less stressful and a lot more reliable.

 
 
 
