Fixing Bugs Using the Principle of Separation of Responsibilities in "Legacy" Data Pipelines
In four simple steps.
Situation: The pipeline is in production, it’s running. But since yesterday, some data isn’t passing through, and error logs are showing up, or worse, the errors aren’t handled and the entire pipeline has crashed!
This is a classic scenario for a data engineer, and it’s time to dive into the code and data to figure out what’s going on.
I’m going to show you a generic method, based on several years of experience debugging, repairing, and improving data pipelines.
1️⃣ Before diving into the code, the first step is to identify which data points are causing errors. Keep these data handy. For example, if they are in JSON, save them in a file.
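Here is a minimal sketch of what "keeping the data handy" could look like in Python. The helper names and the file path are my own assumptions for illustration, not part of any particular pipeline.

```python
import json
from pathlib import Path


def save_problem_records(records, path):
    """Write the failing records to a JSON file so they can be replayed later."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_problem_records(path):
    """Read the saved records back, for example as input to a unit test."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

The saved file then becomes the input of the test written in step 3.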
2️⃣ Next, dive into the code and pinpoint the problematic instruction. My example is a data extraction function: for a given identifier, two records are pulled from two different databases and then merged.
Can you immediately see the problem? Well, in this example, the problem is quite obvious, but most of the time it’s not so apparent!
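A function of this kind might look like the sketch below. Everything here is a hypothetical stand-in: query_db_a and query_db_b play the role of the two production database clients, and the field names are assumptions. The stubs raise ConnectionError to mimic what happens when the code runs outside the production network.

```python
def query_db_a(identifier):
    # Hypothetical stand-in for a client of the first production database.
    raise ConnectionError("database A is only reachable from production")


def query_db_b(identifier):
    # Hypothetical stand-in for a client of the second production database.
    raise ConnectionError("database B is only reachable from production")


def extract_record(identifier):
    # Responsibility 1: fetch the two records from external systems.
    object1 = query_db_a(identifier)
    object2 = query_db_b(identifier)
    # Responsibility 2: merge them into a single record.
    return {
        "id": identifier,
        "name": object1["name"],
        "type": object2["type"],
    }
```

Called from a laptop, extract_record fails on the first database call, long before it reaches the merging logic.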
The best way to see the problem concretely is to execute the code locally. But surprise: the function calls external databases, so I can’t simply run it on my PC to reproduce the problem with the troublesome data. The alternative would be to open local access to the production databases and run the function against them, but if that idea doesn’t bother you, I suggest you consider a career change to gardening, with round-tipped shears 👨‍🌾
The problem is that this function mixes multiple responsibilities: fetching the data and then merging it.
So, we’re going to separate these responsibilities into two different functions: one that fetches the data and one that merges it.
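One possible shape for this split is sketched below. The function names are hypothetical, and passing the two query functions in as parameters is my own design choice to keep the example self-contained. Note that the refactoring deliberately leaves the behavior unchanged, so the bug is still there at this stage.

```python
def fetch_records(identifier, query_a, query_b):
    """I/O only: pull the two raw records from the two databases."""
    return query_a(identifier), query_b(identifier)


def merge_records(identifier, object1, object2):
    """Pure logic only: no database access, so it can run anywhere."""
    return {
        "id": identifier,
        "name": object1["name"],
        "type": object2["type"],  # still unfixed at this point
    }


def extract_record(identifier, query_a, query_b):
    """Same behavior as before the refactoring, now composed of two steps."""
    object1, object2 = fetch_records(identifier, query_a, query_b)
    return merge_records(identifier, object1, object2)
```

Only fetch_records still needs the databases. merge_records takes plain dictionaries, which is what makes it testable on a laptop.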
3️⃣ Thanks to this little refactoring, we can write a test for our merging function. The input for this test is the “problematic” data isolated earlier, in step 1.
Thus, I can execute the merging function locally by running the unit test, debug the problem using breakpoints if necessary, and finally correct it.
In my example, the problem is that if the key "type" doesn’t exist on object2, the execution of object2["type"] will throw a KeyError, interrupting the function.
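One possible correction is sketched below, assuming that a missing "type" is acceptable downstream and can be represented as None. That is an assumption on my part: depending on the pipeline, a default value or an explicit, handled error might be the better choice. The field names are hypothetical.

```python
def merge_records(identifier, object1, object2):
    return {
        "id": identifier,
        "name": object1["name"],
        # dict.get returns None when the key is absent instead of raising KeyError.
        "type": object2.get("type"),
    }
```

With this version, a record without a "type" key is merged instead of crashing the function.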
4️⃣ Keep both the fix and the test in the codebase! This way, your intervention leaves the code more robust and its responsibilities decoupled.