Wow - we did it!
After more than one year of discussions and preparation, we founded systemticks. We want to be the first company to specialize in understanding distributed embedded software systems.
Every now and then we are asked why three experienced software architects and consultants quit their well-paying jobs to start a company focusing on system understanding.
The answer is: with a combined history of more than 35 years in software development, we’ve seen a lot … But way too often, we’ve looked into the faces of frustrated developers and project managers. So we asked ourselves: does it need to be like that?
Do software projects have to be that frustrating?
We started sharing opinions and experience with colleagues and customers. And we recognized that it takes a lot to run large software projects successfully. However, if the developers stay frustrated, it remains nearly impossible. We’ve learned that motivation is often eaten up when there is no understanding of the system to be developed, or only an outdated or partial one.
If you don’t have a detailed understanding of the system you are working on, you will have a hard time extending it or tracking down and fixing bugs. Motivation is lost when you look at code that does not explain itself in the context of the system, or if you are constantly faced with half-truths. In the end, everything takes longer than expected. And the quality suffers because time is running out.
However, a buggy system is not an option. So pressure rises and all stakeholders become unhappier every day. Everyone is in "escalation" mode and project success is at serious risk.
A common strategy to fix this is to add staff. Usually, this achieves quite the opposite, because knowledge ramp-up is often neglected. So higher cost and slower progress again become the result of too little system understanding. Also, new people - in their attempt to get to know the project - poll the productive people. Worst case: many of them are blocked.
It’s like a New Year’s resolution at the beginning of every project: "We will write reasonable documentation."
But why does this so often just not work out, particularly when it comes to explaining how the system works? There is a wide range of reasons. For one, software engineers are well educated in programming, but their skills in writing crisp and telling development documentation often fall short.
When facing time pressure, the first thing that gets neglected is documentation. At first this might seem to speed up progress, but in the long run it leads to an even worse understanding of the system and in turn slows down development.
It is not uncommon that the initial documentation stays largely untouched after the project has progressed to daily business mode. Hence, you can never be sure whether the documentation has fallen behind reality and has therefore become useless or even counterproductive. This is why documentation starts being ignored in favor of analyzing source code directly.
Also worth mentioning are the Clean Code evangelists. They are known for statements like: "I don’t need to document, because my code reads like a book." But does that suffice? Extreme Programming, Test-driven Development and Clean Code in combination with Continuous Integration will - if used consistently - lead to readable code on the lowest level. An understanding of the overall system architecture and the interplay of software components, however, will not simply emerge from reading clean code. This is particularly the case in complex systems that use dedicated frameworks for inter-process communication. It is extremely time consuming to analyze and understand a system entirely from the bottom up. Doing so would jeopardize mid- to large-size projects even more.
If you’ve taken care of your tooling, you will be able to analyze your software line by line at run-time. This is key to success if an issue has already been isolated to the level of a package or class, and it is definitely feasible if you already have a good understanding of the system as a whole. Trying to analyze the entire system this way often leads to resignation after a few hours. In addition, large-scale systems, which are usually parallelized, are hard to step through with a debugger at all.
For that reason, log files are introduced, primarily to better isolate errors. In most cases, though, log information is spread across various files. Often it is not clear which file holds which information, or where to find that file. Different files are hard to correlate because their structure and content are not aligned. Log messages that are manually introduced into the code are a source of errors too - similar to documentation - because the information might not be up to date with the code. In large systems, depending on the log level, log files can reach gigabytes in size. Crawling through this pile of information is like looking for a needle in a haystack. Frustration becomes inevitable.
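To make the correlation problem more tangible, here is a minimal sketch of what aligned log structure could look like, using Python's standard logging module. The component names ("media.player", "hmi.display"), the field names and the choice of one JSON object per line are our assumptions for illustration, not a prescribed format.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render every record as one JSON object per line with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": record.created,       # seconds since the epoch
            "level": record.levelname,
            "component": record.name,   # assumption: logger name doubles as component name
            "msg": record.getMessage(),
        }
        return json.dumps(entry)


def make_component_logger(component: str, stream) -> logging.Logger:
    """Return a logger for one component that writes to the shared stream."""
    logger = logging.getLogger(component)
    logger.setLevel(logging.INFO)
    logger.propagate = False            # keep the example output in one place
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.handlers = [handler]
    return logger


# Two hypothetical components write into the same sink with the same structure.
sink = io.StringIO()
make_component_logger("media.player", sink).info("track %s started", "intro.mp3")
make_component_logger("hmi.display", sink).warning("frame dropped")

# Because every line has the same keys, correlating components is a simple parse.
records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Since every entry carries the same keys, entries from different components can be merged, sorted by timestamp and filtered without guessing at per-file conventions.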
Theoretically, most projects have everything they need to understand their systems. It is just that usually no one thinks from the beginning about how to analyze the system. This is an area of architecture work that is often neglected. Nobody would seriously claim to be able to write bug-free software. Why is it, then, that so few care about identifying errors more quickly and easily? What does it take to understand a complex software system?
In our opinion, complex systems can only be understood top-down or, better, from the outside in. For that, we need the interplay of these four measures:
Architecture and API documentation
Clean and well-structured code
Reliable and hassle-free debug tooling (because, be honest, who of you [...] gdb before writing a [...])
Consistent logging with flexible analysis tooling
We try to zoom into the system from the very outside. That’s why we start with the architecture documentation. It must at least describe the primary use case of the software, the major components with their responsibilities and particularly the mapping of components to source code and log information.
If nobody is able to find the listed components in the sources or in the corresponding log information, nobody will be able to understand the system easily. If, on the other hand, we are able to reconstruct and visualize the behavior and interplay of the components from logs and traces, we are able to understand the system and its structure.
If, additionally, the interfaces are well documented, we can identify problematic components quickly by filtering the log information by those components. That is the time to switch on the debugger and identify the root cause of the issue.
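As a rough illustration of filtering by component and reconstructing the interplay, consider the following sketch. The line format "<ts> <sender> -> <receiver>: <message>", the component names and the sample messages are all hypothetical; a real project would parse whatever its own logs and traces provide.

```python
from typing import NamedTuple


class Event(NamedTuple):
    ts: float
    sender: str
    receiver: str
    message: str


def parse_event(line: str) -> Event:
    # Hypothetical line format: "<ts> <sender> -> <receiver>: <message>"
    ts, rest = line.split(" ", 1)
    route, message = rest.split(": ", 1)
    sender, receiver = route.split(" -> ")
    return Event(float(ts), sender, receiver, message)


def involving(events, component):
    """Keep only the events that the given component sent or received."""
    return [e for e in events if component in (e.sender, e.receiver)]


def interplay(events):
    """Return the interactions in chronological order as readable lines."""
    ordered = sorted(events, key=lambda e: e.ts)
    return [f"{e.sender} -> {e.receiver}: {e.message}" for e in ordered]


# Made-up sample data; note that the lines are not in chronological order.
LOG = """\
0.030 tuner -> audio: set_source(fm)
0.010 hmi -> tuner: select_station(101.2)
0.050 hmi -> navigation: show_map()
0.070 audio -> hmi: source_changed(fm)
"""

events = [parse_event(line) for line in LOG.splitlines()]
tuner_view = interplay(involving(events, "tuner"))
```

Here `tuner_view` contains only the two interactions that touch the tuner, in the order they happened. The same ordered list could feed a sequence diagram, which is the kind of visualization we have in mind.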
Name a responsible person
At the very beginning of the project, make a plan for how to analyze the system quickly and effectively
Aim for short but precise and consistent architecture and API documentation (take the reader’s perspective)
Instrument relevant interfaces with relevant information
Apply tooling to visualize the interplay of components from logs and traces
Make sure debugging works from the start
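One possible way to instrument interfaces without hand-writing log messages at every call site is sketched below. The decorator `instrumented`, the logger name "interfaces" and the `select_station` function are assumptions made up for this example, and in-process Python functions stand in for whatever interface technology a project actually uses.

```python
import functools
import logging

log = logging.getLogger("interfaces")  # hypothetical logger name


def instrumented(component: str):
    """Log calls, results and failures of a function that forms a component interface."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log.info("%s.%s called args=%r kwargs=%r",
                     component, func.__name__, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Record the failure with traceback, then let the caller handle it.
                log.exception("%s.%s failed", component, func.__name__)
                raise
            log.info("%s.%s returned %r", component, func.__name__, result)
            return result

        return wrapper

    return decorator


@instrumented("tuner")
def select_station(frequency_mhz: float) -> str:
    # Hypothetical interface function of a hypothetical "tuner" component.
    return f"tuned to {frequency_mhz:.1f} MHz"
```

Because the log messages are derived from the function itself, they cannot drift away from the code the way manually written messages can.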
If you would like to learn more about our experience and methodology, or think your project could use some support in the area of system understanding, we would be happy to hear from you.