You’ve built a great data pipeline, but it struggles to process just a few thousand messages per hour. The pipeline takes days to run. Occasionally it crashes, requiring processes to be restarted, which can be time-consuming and very manual…
It’s time to apply the following three key pillars: resilience, scalability, and maintainability 👇
Make your system resilient with this 6-point checklist
Resilience means being tolerant of failures.
Yes, but which failures?
Well, all of them. Well, not really all. In the case where Earth is swallowed by the Sun, being resilient would mean having servers outside the solar system. But at that point, perhaps the resilience of your systems won’t be your biggest problem.
So, we aim to be tolerant to failures, but not just any failures. There are some that we will deliberately not tolerate.
Several types of failures:
- Hardware failure: a server burns out, a hard drive fails, the power goes out. To address this, we make the hardware redundant and, if possible, spread it across different locations.
- Operating System failure: the OS is the software on which your application runs, and it is not free of bugs. The leap second added on June 30, 2012, for example, crashed many systems. To limit this, we choose stable, mature OSs and update them as soon as possible.
- Software failure: the application crashes. It could be due to an unaccounted user input, a poorly called API, etc.
To safeguard as much as possible against the last kind of failure, apply this checklist:
✅ Are your APIs and interfaces well designed? How tightly are an API’s consumers coupled to it?
✅ Do you have multiple environments for deployment? Sandbox, Staging, Pre-prod vs Prod?
✅ Do you have unit tests? Integration tests? Are they well-designed or fragile?
✅ Do you have a way to easily roll back a deployment or a configuration change? VCS? CI/CD?
✅ Do you have a way to monitor the state of your application and its components, or your data pipelines?
✅ Are your users well-trained on your application? And your devs? And the admins? Is the documentation clear regarding the deployment of the app, or the testing strategy?
Each point is a topic in itself, which I invite you to research further. I’ve used specific terms to help you in this search.
You now have a checklist to quickly assess resilience to software failures. Run a thought experiment on the last system or application you coded: does it pass every check?
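To make the "unaccounted user input" failure and the testing item a bit more concrete, here is a minimal Python sketch. It assumes messages arrive as JSON strings with a numeric `amount` field; the function names, the field name, and the dead-letter list are all hypothetical choices of mine, not something the checklist prescribes. The idea is simply that one malformed message gets set aside with a reason instead of crashing the whole run.

```python
from __future__ import annotations

import json


def parse_message(raw: str) -> dict:
    """Parse one raw message, raising ValueError on anything unexpected."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(message, dict):
        raise ValueError("expected a JSON object")
    amount = message.get("amount")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("'amount' must be a number")
    return message


def process_batch(raw_messages: list[str]) -> tuple[list[dict], list[tuple[str, str]]]:
    """Return (valid messages, dead letters) without aborting on bad input."""
    valid, dead_letters = [], []
    for raw in raw_messages:
        try:
            valid.append(parse_message(raw))
        except ValueError as exc:
            # Keep the offending payload and the reason for later inspection.
            dead_letters.append((raw, str(exc)))
    return valid, dead_letters


# Hypothetical batch: one good message and three kinds of bad input.
batch = ['{"amount": 12.5}', "not json", '{"amount": "twelve"}', "[1, 2]"]
valid, dead_letters = process_batch(batch)
print(len(valid), "valid,", len(dead_letters), "set aside")
```

Because `process_batch` returns the rejected messages along with the reason, it is easy to cover with unit tests and to plug into whatever monitoring you already have.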
Explore how you can scale with these two dimensions
In one sentence, scalability is the ability of your system to go from, say, 100 users to 10,000 with the **least possible effort**.
If you need to rewrite many modules or change technology to handle more users, then your system is not scalable.
In this example, I use the word “user”, but we could more generally talk about “load”.
Load can be the number of requests per second that your system handles, the number of writes on a disk, or even the number of messages exchanged in a chat. Once you’ve identified the different loads on your application, we can start talking about performance.
In general, we measure the performance of a resource either by a response time (e.g., 30 ms to respond) or by a throughput (e.g., 50 requests per second, or 100 kbit/s). For response time, percentiles are generally preferred to an arithmetic mean, because a mean hides the slow requests that some users actually experience. I invite you to search “SLA and percentiles” on Google for more info on this topic.
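Here is a small Python sketch of why percentiles tell you more than a mean. The latency sample is made up for illustration, and the `percentile` helper is my own nearest-rank implementation, one of several common definitions, so treat it as an assumption rather than a standard.

```python
from __future__ import annotations

import math
import statistics


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up sample: 90 fast requests (20 ms) and 10 slow ones (2000 ms).
latencies_ms = [20] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms")
```

With this sample the mean is 218 ms, a value no request actually experienced: the median request takes 20 ms, while the 95th percentile sits at 2000 ms. That gap is what a percentile-based target is meant to expose.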
Coming back to our topic, to have a scalable application, the question to ask is “how can I increase the load that my application can receive without impacting performance?”
Well, we have two approaches, which are often more intertwined than distinct:
- Vertical scalability: I increase the resources of the machines on which my services run. The advantage is that nothing changes at the architectural level; the disadvantage is that you may eventually be limited by the hardware.
- Horizontal scalability: With a shared-nothing architecture, I can create as many services as necessary on new hardware to distribute the load across these new services. This requires a compatible architecture. These services can also automatically scale through a mechanism that detects the upstream load and provisions or decommissions services. We then talk about elastic systems.
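The "elastic" part of horizontal scalability can be sketched in a few lines of Python. This assumes that every instance is identical and handles a known number of requests per second; the function name, its parameters, and the default limits are hypothetical, and a real autoscaler would also smooth the load signal and wait between scaling decisions.

```python
import math


def desired_instances(
    load_rps: float,
    capacity_rps_per_instance: float,
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """How many identical, shared-nothing instances are needed for the load."""
    if capacity_rps_per_instance <= 0:
        raise ValueError("capacity per instance must be positive")
    needed = math.ceil(load_rps / capacity_rps_per_instance)
    # Clamp: keep a floor for availability and a ceiling for cost.
    return max(min_instances, min(max_instances, needed))


# Assume each instance sustains 50 requests per second.
for load in (30, 480, 5000):
    print(load, "req/s ->", desired_instances(load, capacity_rps_per_instance=50))
```

The upper bound is a deliberate design choice: without it, a traffic spike (or a bug upstream) could provision far more hardware than you are willing to pay for.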
There is no magic system. A pragmatic approach, tailored to your needs, is necessary. Too many projects start with ultra-scalable systems and complex infrastructure for, in the end, just 100 users. Start simple, iterate, and then scale if needed.
Make sure your system is maintainable with this 11-point checklist
Maintainability is the art of creating software that is
- easy to operate,
- easy to understand for other engineers,
- and easy to evolve.
Easy to say, but obviously not easy to do.
Here is a checklist to ensure your software is pleasant to operate, simple to understand, and easy to evolve:
👨‍🎓 You must be able to monitor the health of your system and quickly restore failed services,
👨‍🎓 You must be able to understand why a problem (or a degradation in performance) occurred,
👨‍🎓 You must be able to update systems, especially security patches,
👨‍🎓 You must have documentation of your modules’ dependencies, to avoid problematic changes,
👨‍🎓 You should plan for load increases in advance when possible,
👨‍🎓 You need to provide the right tools and practices to new members and ensure that these practices are followed,
👨‍🎓 You must ensure that complex maintenance tasks (e.g., database migration) are well documented,
👨‍🎓 You should define processes to control deployments and make them predictable (e.g., staging area),
👨‍🎓 You need to preserve knowledge about your system, even when some contributors leave the project,
👨‍🎓 You must have a strategy to manage technical debt and accidental complexity,
👨‍🎓 You must be able to respond quickly to changing needs in the software through an efficient team organization.
All these points should prompt questions. If you don’t like some of the answers, keep digging: there’s something to be done! 👷‍♂️
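As an illustration of the first two items of the checklist (monitoring health and understanding why something failed), here is a minimal Python sketch. The component names and the two check functions are hypothetical stand-ins; in a real system each check would ping a database, a queue, or an external API.

```python
from __future__ import annotations

from typing import Callable

# Convention assumed here: a check returns nothing when healthy and raises otherwise.
HealthCheck = Callable[[], None]


def run_health_checks(checks: dict[str, HealthCheck]) -> dict[str, str]:
    """Run every check and return a status per component: 'ok' or the failure reason."""
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing the monitor
            report[name] = f"failed: {exc}"
    return report


def check_database() -> None:
    """Hypothetical check; a real one would run a trivial query."""


def check_queue() -> None:
    """Hypothetical check that simulates a failing dependency."""
    raise ConnectionError("queue unreachable")


report = run_health_checks({"database": check_database, "queue": check_queue})
print(report)
```

Keeping the failure reason next to the component name is what turns "something is down" into a starting point for the investigation.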
In conclusion
Use these checklists for your own projects and applications.
If you ever find yourself in need of assistance or have questions about implementing these principles in your projects, don’t hesitate to reach out for help. Your journey towards creating robust and efficient software is a valuable one, and you’re not alone in it.
Good luck!