Log Aggregation is no longer a commodity but a critical component in container-based platforms

Log Management doesn’t seem like a very fantastic topic. It is not the kind of topic that makes you say: “Oh! Amazing! This is what I was dreaming about my whole life”. No, I’m aware that this is not too fancy, but that doesn’t make it less critical than the other capabilities your architecture needs to have.
Since the start of time, we’ve been using log files as the single trusted data source when it comes to troubleshooting our applications, finding out what failed in a deployment, or any other computer-related task.
The procedure was easy:
- Launch “something”
- “something” failed
- Check the logs
- Change something
- Repeat
And we’ve been doing it that way for a long, long time. Even with other, more robust error handling and management approaches like an Audit System, we still go back to the logs when we need fine-grained detail about the error: to look for a stack trace, for more detail about the error that was inserted into the Audit System, or for more data than just the error code and description that a REST API provided.
Systems started to grow and architectures became more complicated, but even so, we end up using the same method over and over. You’re aware of log aggregation architectures like the ELK stack, commercial solutions like Splunk, or even SaaS offerings like Loggly, but you think they’re just not for you.
They’re expensive to buy or expensive to set up, and you know your ecosystem very well, so it’s easier to just jump into a machine and tail the log file. You probably also have your own toolbox of scripts to do this as quickly as anyone can open Kibana and search for some instance ID there to see the error for a specific transaction.
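If your toolbox looks anything like this minimal sketch (the log path, script name, and transaction ID format are hypothetical, not from any real system), you know exactly what I mean:

```python
#!/usr/bin/env python3
"""Hypothetical 'toolbox' script: grep a transaction ID out of a local log file.

The path and ID format are illustrative assumptions, not a real setup.
"""
import sys

LOG_FILE = "/var/log/myapp/application.log"  # assumed location

def find_transaction(tx_id: str, path: str = LOG_FILE) -> None:
    # Stream the file line by line so large logs don't blow up memory.
    with open(path, "r", errors="replace") as log:
        for line in log:
            if tx_id in line:
                print(line.rstrip())

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: findtx.py <transaction-id>")
    find_transaction(sys.argv[1])
```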
Ok, I need to tell you something: It’s time to change, and I’m going to explain to you why.
Things are changing, and IT and all the new paradigms rest on some common ground:
- You’re going to have more components, each running isolated with its own log files and data.
- Deployments will be more frequent in your production environment, and that means things are going to go wrong more often (in a controlled way, but more often).
- Technologies are going to coexist, so logs are going to be very different in terms of patterns and layouts, and you need to be ready for that.
So, let’s discuss these three arguments that I hope make you think in a different way about Log Management architectures and approaches.
1.- Your approach just doesn’t scale
Your approach is excellent for traditional systems. How many machines do you manage? 30? 50? 100? And you manage quite fine. Now imagine a container-based platform for a typical enterprise. I think an average number could be around 1000 containers just for business purposes, not counting architecture or basic services. Are you ready to go container by container through 1000 log streams to find the error?
Even if that’s possible, are you going to be the bottleneck for the growth of your company? How many container logs can you keep track of? 2000? As I said at the beginning, that just doesn’t scale.
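To make that concrete, here is what “going container by container” looks like even when scripted. This is a minimal sketch that assumes a Kubernetes platform and the official `kubernetes` Python client (`pip install kubernetes`) with a working kubeconfig; even automated, you are still pulling and scanning a thousand streams one by one:

```python
"""Naive 'check every container' loop on Kubernetes.

Purely illustrative of the one-by-one approach; assumes cluster access
via a local kubeconfig.
"""
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def scan_all_pod_logs(needle: str) -> None:
    config.load_kube_config()           # use local kubeconfig credentials
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(watch=False)
    for pod in pods.items:              # 1000 pods => 1000 sequential reads
        try:
            log = v1.read_namespaced_pod_log(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                tail_lines=500,         # only the recent tail of each stream
            )
        except ApiException:
            continue                    # pod gone or logs unavailable
        if needle in log:
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: match")

if __name__ == "__main__":
    scan_all_pod_logs("ERROR")
```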
2.- Logs are not there forever
And now, having read the first point, you’re probably saying to the screen you’re reading this on: “Come on! I already know that logs don’t stay there; they get rotated, they get lost, and so on.”
Yeah, that’s true, and it’s even more important in a cloud-native approach. With container-based platforms, logs are ephemeral, and if we follow the 12-factor app manifesto, there is no log file at all. All log traces should be written to standard output, and that’s it.
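Following that principle is straightforward. Here is a minimal Python sketch (the logger name and message format are illustrative choices):

```python
"""12-factor-style logging: write everything to stdout, no log file."""
import logging
import sys

# One handler, pointed at stdout; the platform captures and routes the stream.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

log = logging.getLogger("payments")
log.info("transaction accepted tx_id=%s", "abc-123")
log.error("transaction failed tx_id=%s reason=%s", "abc-124", "timeout")
```

From there, the container runtime captures the stream, and a collector can ship it to your aggregation backend, so nothing has to die with the container.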
And when are the logs deleted? When the container fails... and which records do you need the most? The ones from the containers that failed.
So, if you don’t do anything, the log traces that you need the most are the ones that you’re going to lose.
3.- You need to be able to predict when things are going to fail
But logs are not only useful when something has already gone wrong; they also help you detect when something is about to go wrong and predict when things are going to fail. You need to aggregate that data to generate information and insights from it, and to run ML models that detect whether everything is going as expected or something unusual is happening that could lead to an issue, before it happens.
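You don’t need a full ML pipeline to see the idea. A toy sketch over per-minute error counts (the numbers and the 3-sigma threshold are made up) shows what aggregated data enables:

```python
"""Toy anomaly detection over aggregated log data.

The per-minute error counts and the 3-sigma threshold are illustrative;
a real setup would feed aggregated metrics into a proper model.
"""
from statistics import mean, stdev

# Errors per minute, as an aggregation pipeline could produce them.
error_counts = [2, 3, 2, 4, 3, 2, 3, 2, 4, 3, 2, 3, 19]

def is_anomalous(history: list, latest: int, sigmas: float = 3.0) -> bool:
    # Flag the latest window if it sits far outside the historical spread.
    mu, sd = mean(history), stdev(history)
    return latest > mu + sigmas * sd

history, latest = error_counts[:-1], error_counts[-1]
if is_anomalous(history, latest):
    print(f"alert: {latest} errors/min vs baseline ~{mean(history):.1f}")
```

None of this is possible while the data sits scattered across a thousand containers; aggregation is what turns log lines into a signal you can act on early.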
Summary
I hope these arguments have made you think that even for your small-size company, or even for your single system, you need to set up a Log Aggregation solution now, and not wait for a later moment when it will probably be too late.