As stated in our blog post about understanding systems the isolation and localization of errors in complex systems is a very challenging task. Because features are implemented in a highly distributed manner, cross-cutting multiple layers, using devices and technologies from many different parties and suppliers.
One possible approach to overcome those difficulties is to log the calls of the relevant service interfaces in a structured and consistent way. Together with an appropriate tooling, those traces help gaining greater insight into the operational actitivy of your system.
It enables you to comprehend the interplay of your components.
So you might think: Why do we need to instrument interface calls with logging and use appropriate tooling to analyze those logs afterwards, instead of simply writing enough tests?
Yes, the most natural and most effective action is undoubtedly implementing unit test against your service interfaces. Because by providing extensive and meaningful tests, there is a good chance that you find the functional errors of your services already - for the most part. The usual approach is to validate the provided interfaces, in a way you expect the service to be called and to mock the implementation of the component that will offer the required interfaces.
But eventually to a later point in time the tests and the mocks will be replaced by production code. And although all components have been tested carefully (possibly test-driven) the software runs into trouble when integrating the single components into a larger system.
That means, there are obviously some latent shortcomings in your software which don’t surface until your service is used in the combination with other services.
What kind of shortcomings are we talking about and why are they so difficult to detect?
This is one of the most prominent and at the same time one of the most difficult to find issue types. It can become nasty tracking it down, because single message calls may react as expected, even though the overall feature is not working since the entire sequence flow is not implemented as intended.
The service is behaving as expected from a functional point of view: it returns a valid value by the given input parameters. However your client is not able to fulfill its task, since the response time of the service simply is too long. Maybe your service interface is not even the originator of the issue, but another service interface within the invocation chain.
In some cases a service interface is invoked by the same client with the same parameters in a high-frequent manner. This might happen by accident in case of calling an interface in a loop. But it can also be implemented on purpose, because the client is too lazy to manage state information on its own. Either way this can lead to a non-functional issue like a high cpu load, since a tremendous call hierarchy might be actuated behind the interface call, which is triggered again and again…
Further shortcomings might be induced because some preconditions or postconditions are not fulfilled when invoking an interface, a service is called by an unauthorized client, multiple clients cannot be served at the same time or the values of parameters are simply out of range. And of course there are many more sources of malfunction.
But they all have in common that they are very difficult to be identified with the approach of unit-testing as the only measure. Obviously there is some dynamic impact on your system, which makes it behave different in production environment than in isolated test environment.
This is where the methodology of structured logging comes into play to complement unit-testing for finding malfunctions.
The idea of structured logging is to trace the most valuable information of your system in a consistent, determined and machine-readable format. Since service interfaces are representing the cutting edges of your system, they are predestinated to be instrumented with structured logging data.
In part II we will explain how the structured logging methodology can be applied to service interfaces in a distributed system.