Linkerd Service Mesh Explained: Solving Microservice Communication Challenges

Linkerd, a CNCF-hosted service mesh, provides many of the features that today's microservices architectures need.

If you are reading this, you are probably already aware of the challenges that come with a microservices architecture, either because you have read about them or because you are facing them right now firsthand.

One of the most common challenges is networking and communication. With the explosion of components that need to talk to each other and the ephemeral nature of cloud-native deployments, many features that used to be nice-to-haves are now a necessity.

Concepts like service registry and discovery, service-to-service authentication, dynamic routing policies, and circuit breaker patterns are no longer things that only the cool companies do; they are basic requirements for mastering microservices as part of a cloud-native platform. This is where the service mesh is gaining popularity, as it solves most of these challenges and provides the features that are needed.

Some time ago, I already covered this topic to introduce Istio as one of the available options:

But that project, created by Google and IBM, is not the only option for providing these capabilities. The Linkerd project, part of the Cloud Native Computing Foundation (CNCF), provides similar features.

How to install Linkerd

To start using Linkerd, the first thing we need to do is install the software. That means two installations: the CLI on our local host and the control plane on the Kubernetes cluster.

To install the CLI on your host, go to the releases page, download the build for your OS, and install it.

I am using a Windows-based system in my sample, so I use Chocolatey to install the client.
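
A minimal sketch of that install, assuming the package on the Chocolatey community feed is still named linkerd2 (check the releases page if it differs for your version):

choco install linkerd2

After doing so, I can check the version of the CLI by typing the following command: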

linkerd version

And you will get an output that will say something similar to this:

PS C:\WINDOWS\system32> linkerd.exe version
Client version: stable-2.8.1
Server version: unavailable
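
Before installing the control plane, it is worth validating that the cluster meets the requirements. The 2.x CLI ships a pre-installation check for exactly that:

linkerd check --pre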

Now we can install the control plane on the Kubernetes cluster, using the following command:

linkerd install | kubectl apply -f -

And you will get an output similar to this one:

PS C:\WINDOWS\system32> linkerd install | kubectl apply -f -
namespace/linkerd created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-identity created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-identity created
serviceaccount/linkerd-identity created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-controller created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-controller created
serviceaccount/linkerd-controller created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-destination created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-destination created
serviceaccount/linkerd-destination created
role.rbac.authorization.k8s.io/linkerd-heartbeat created
rolebinding.rbac.authorization.k8s.io/linkerd-heartbeat created
serviceaccount/linkerd-heartbeat created
role.rbac.authorization.k8s.io/linkerd-web created
rolebinding.rbac.authorization.k8s.io/linkerd-web created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-web-check created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-web-check created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-web-admin created
serviceaccount/linkerd-web created
customresourcedefinition.apiextensions.k8s.io/serviceprofiles.linkerd.io created
customresourcedefinition.apiextensions.k8s.io/trafficsplits.split.smi-spec.io created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-prometheus created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-prometheus created
serviceaccount/linkerd-prometheus created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-proxy-injector created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-proxy-injector created
serviceaccount/linkerd-proxy-injector created
secret/linkerd-proxy-injector-tls created
mutatingwebhookconfiguration.admissionregistration.k8s.io/linkerd-proxy-injector-webhook-config created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-sp-validator created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-sp-validator created
serviceaccount/linkerd-sp-validator created
secret/linkerd-sp-validator-tls created
validatingwebhookconfiguration.admissionregistration.k8s.io/linkerd-sp-validator-webhook-config created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-tap created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-tap-admin created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap-auth-delegator created
serviceaccount/linkerd-tap created
rolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap-auth-reader created
secret/linkerd-tap-tls created
apiservice.apiregistration.k8s.io/v1alpha1.tap.linkerd.io created
podsecuritypolicy.policy/linkerd-linkerd-control-plane created
role.rbac.authorization.k8s.io/linkerd-psp created
rolebinding.rbac.authorization.k8s.io/linkerd-psp created
configmap/linkerd-config created
secret/linkerd-identity-issuer created
service/linkerd-identity created
deployment.apps/linkerd-identity created
service/linkerd-controller-api created
deployment.apps/linkerd-controller created
service/linkerd-dst created
deployment.apps/linkerd-destination created
cronjob.batch/linkerd-heartbeat created
service/linkerd-web created
deployment.apps/linkerd-web created
configmap/linkerd-prometheus-config created
service/linkerd-prometheus created
deployment.apps/linkerd-prometheus created
deployment.apps/linkerd-proxy-injector created
service/linkerd-proxy-injector created
service/linkerd-sp-validator created
deployment.apps/linkerd-sp-validator created
service/linkerd-tap created
deployment.apps/linkerd-tap created
configmap/linkerd-config-addons created
serviceaccount/linkerd-grafana created
configmap/linkerd-grafana-config created
service/linkerd-grafana created
deployment.apps/linkerd-grafana created

Now we can check that the installation has been done properly using the command:

linkerd check

And if everything has been done properly, you will get an output like this one:

PS C:\WINDOWS\system32> linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

Then we can see the dashboard from Linkerd using the following command:

linkerd dashboard
Dashboard initial web page after a clean Linkerd installation

Deployment of the apps

We will use the same apps that we used some time ago with Istio, so if you want a refresher on what they do, take another look at that article.

I have uploaded the code to my GitHub repository, and you can find it here: https://github.com/alexandrev/bwce-linkerd-scenario

To deploy, you need your Docker images pushed to a registry; I will use Amazon ECR as mine.

So I need to build and push those images with the following commands:

docker build -t provider:1.0 .
docker tag provider:1.0 938784100097.dkr.ecr.eu-west-2.amazonaws.com/provider-linkerd:1.0
docker push 938784100097.dkr.ecr.eu-west-2.amazonaws.com/provider-linkerd:1.0
docker build -t consumer:1.0 .
docker tag consumer:1.0 938784100097.dkr.ecr.eu-west-2.amazonaws.com/consumer-linkerd:1.0
docker push 938784100097.dkr.ecr.eu-west-2.amazonaws.com/consumer-linkerd:1.0

And after that, we are going to deploy the images on the Kubernetes cluster:

kubectl apply -f .\provider.yaml
kubectl apply -f .\consumer.yaml
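
Keep in mind that Linkerd only meshes pods that have its sidecar proxy injected. If your manifests do not already carry the injection annotation, a minimal sketch using the CLI would be the following (assuming the deployments are named provider-v1 and consumer-v1, as in this sample):

kubectl get deploy provider-v1 consumer-v1 -o yaml | linkerd inject - | kubectl apply -f -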

And now we can see those apps in the Linkerd Dashboard on the default namespace:

Image showing the provider and consumer app as linked applications

And now, we can reach the consumer endpoint using the following command:

kubectl port-forward pod/consumer-v1-6cd49d6487-jjm4q 6000:6000
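
With the port-forward in place, a quick call from another terminal does the trick (the /resource path here is hypothetical; replace it with your consumer's actual endpoint):

curl http://localhost:6000/resource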

And when we hit the endpoint, we get the expected reply from the provider.

Sample response provided by the provider

And in the dashboard, we can see the stats of the provider:

Linkerd dashboard showing the stats of the flow

Linkerd also provides a Grafana dashboard by default, where you can see more metrics; you can reach it using the Grafana link on the Linkerd dashboard.

Grafana link on the Linkerd Dashboard

When you open it, you will see something like the dashboard shown below:

Grafana dashboard showing the linkerd statistics

Summary

Throughout this process, we have seen how easily we can deploy the Linkerd service mesh in our Kubernetes cluster and how applications integrate and interact with it. In upcoming posts, we will dive into the more advanced features that help with the new challenges that come with a microservices architecture.

Event Streaming, APIs, and Data Integration: The 3 Core Pillars of Cloud Integration

Event streaming, APIs, and data integration are the three musketeers that cover all the aspects of mastering integration in the cloud.

Enterprise Application Integration has been one of the most challenging IT landscape topics since the beginning of time. As soon as the number of systems and applications in big corporations started to grow, it became an issue we had to address. The efficiency of this process also defines which companies succeed and which fail, as cooperation between applications becomes critical to respond at the pace the business demands.

I usually like to use the “road analogy” to define this:

It doesn't matter if you have the fastest cars: if you don't have proper roads, you will not get anywhere.

This situation generated a lot of investment from companies, and many vendors and products were launched to address it. Successive solutions emerged: EAI, ESB, SOA, middleware, distributed integration platforms, cloud-native solutions, and iPaaS.

Each approach provided a solution for the challenges of its time. As the rest of the industry evolved, the solutions changed to adapt to the new reality (containers, microservices, DevOps, API-led, event-driven, and so on).

So, what is the situation today? Today there is a widespread misconception that integration is the same as APIs, and that an API means a synchronous HTTP-based API (REST, gRPC, GraphQL). But it is much more than that.

Photo by Tolga Ulkan on Unsplash

1.- API

API-led design is certainly key to any integration solution, especially the philosophy behind it: each component we create today is built with collaboration in mind, to work with existing and future components and benefit the business in an easy, agile way. This transcends the protocol discussion completely.

APIs cover all kinds of solutions, from existing REST APIs to AsyncAPI specifications that describe event-based APIs.

2.- Event Streaming

Asynchronous communication is essential when you are talking about big enterprises and many different applications. Think of requirements like pub-sub to increase independence among services and apps, or flow control to manage high-demand flows that would otherwise exceed an application's throttling limits, especially when SaaS solutions are involved.

You may think this is a very opinionated view, but it is something most providers in this space have acknowledged through their actions:

  • AWS released SNS/SQS, its first messaging systems, initially as its only messaging solution.
  • In November 2017, AWS released Amazon MQ, another queue-based messaging system, to cover the scenarios that SQS cannot.
  • In May 2019, AWS released Amazon MSK, a managed Kafka service, to provide streaming data distribution and processing capabilities.

That progression happened because, as we move away from smaller applications and migrate from a monolith to a microservice approach, more patterns and more requirements appear, and as integration solutions have shown in the past, covering them is critical.

3.- Data Integration

Usually, when we talk about integration, we talk about Enterprise Application Integration because of this historical bias; even I use the term EAI because that is how we usually refer to these solutions. But in recent years, the focus has shifted from how applications integrate to how data is distributed across the company, because what really matters is the data being exchanged and how we can turn that raw data into insights to know our customers better, optimize our processes, or discover new opportunities.

Until recently, this part was handled apart from the integration solutions. You probably relied on a dedicated ETL (Extract-Transform-Load) tool that moved the data from one database to another, or to a different kind of storage like a data warehouse, so your data scientists could work with it.

But again, the need for agility has changed this, and the same principles integration applies to make the business more agile now also apply to how we exchange data. We try to avoid purely technical data movement and instead ease access to, and proper organization of, the data. Data virtualization and data streaming are the core capabilities that address these challenges, providing an optimized solution for how data is distributed.

Summary

My main goal with this article is to make you aware that integrating your applications involves much more than the REST APIs you expose, perhaps behind an API gateway, and that the needs can be very different. The stronger your integration platform is, the stronger your business will be.

3 Unusual Developer Tools That Seriously Boost Productivity (Beyond VS Code)

A non-VS Code list for software engineers

This is not going to be one of those articles about tools that can help you develop code faster. If you’re interested in that, you can check out my previous articles regarding VS Code extensions, linters, and other tools that make your life as a developer easier.

My job is not only about software development but also about solving issues that my customers have. While their issues can be code-related, they can be an operation error or even a design problem.

I usually tend to define my role as a lone ranger. I go out there without knowing what I will face, and I need to be ready to adapt, solve the problem, and make customers happy. This experience has helped me to develop a toolchain that is important for doing that job.

Let’s dive in!


1. MobaXterm

This is the best tool to manage different connections to different servers (SSH access for a Linux server, RDP for a Windows server, etc.). Here are some of its key features:

  • Graphical SSH port-forwarding for those cases when you need to connect to a server you don’t have direct access to.
  • Easy identity management to save the passwords for the different servers. You can organize them hierarchically for ease of access, especially when you need to access so many servers for different environments and even different customers.
  • SFTP automatic connection when you connect to an SSH server. It lets you download and upload files as easily as dropping files there.
  • Automatic X11 forwarding so you can launch graphical applications from your Linux servers without needing to configure anything or use other X servers like Xming.
MobaXterm in action

2. Beyond Compare

There are so many tools to compare files, and I think I have used all of them — from standalone applications like WinMerge, Meld, Araxis, KDiff, and others to extensions for text editors like VS Code and Notepad++.

However, none of those can compare to the one and only Beyond Compare.

I discovered Beyond Compare when I started working in software engineering in 2010, and it is a tool that comes with me on each project I have. I use it every day. So, what makes this tool different from the rest?

It is simply the best tool for making any comparison because it does not just compare text and folders. It does that perfectly, but it also compares ZIP files (while browsing their content), JAR files, and so on. This is very important when we want to check whether two JAR files deployed in DEV and PROD are the same version, or whether an uploaded ZIP file has the right content.

Beyond Compare 4

3. Vi Editor

This is the most important one — the best text editor of all time — and it is available pretty much on every server.

It is a command-line text editor with a huge number of shortcuts that allows you to be very productive when you are inside a server checking logs and configuration files to see where the problem is.

For a long time, I have had a Vi cheat sheet printed out to make sure I master the most important shortcuts and thus increase my productivity while fighting behind enemy lines (the customer's servers).

VIM — Vi Improved, the ultimate text editor.

Amazon Managed Service for Prometheus Explained: High-Availability Monitoring on AWS

Learn what Amazon Managed Service for Prometheus provides and how you can benefit from it.

Monitoring is one of the hot topics when we talk about cloud-native architectures. Prometheus is a graduated Cloud Native Computing Foundation (CNCF) open-source project and one of the industry-standard solutions when it comes to monitoring your cloud-native deployment, especially when Kubernetes is involved.

Following its own philosophy of providing managed services for the most-used open-source projects, fully integrated with the AWS ecosystem, AWS has released (in preview at the time of writing this article) Amazon Managed Service for Prometheus (AMP).

The first thing is to define what Amazon Managed Service for Prometheus is and what features it provides. This is Amazon's definition of the service:

A fully managed Prometheus-compatible monitoring service that makes it easy to monitor containerized applications securely and at scale.

And I would like to spend some time on some parts of this sentence.

  • Fully managed service: It is hosted and handled by Amazon, and we just interact with it through APIs, as we do with other Amazon services like EKS, RDS, MSK, SQS/SNS, and so on.
  • Prometheus-compatible: Even if this is not a pure Prometheus installation, the API is compatible, so Prometheus clients such as Grafana that query Prometheus for data will work without changing their interfaces.
  • Service at scale: As part of the managed service, Amazon takes care of the solution's scalability. You don't need to define an instance type or how much RAM or CPU you need; AWS handles that.

That sounds perfect, so you may think you can delete your Prometheus server and start using this service instead. Maybe you are even typing something like helm delete prom… WAIT!

At this point, AMP does not replace your local Prometheus server; it integrates with it. Your Prometheus server acts as a scraper for the scalable monitoring solution that AMP provides, as you can see in the picture below:

Reference Architecture for Amazon Prometheus Service

So yes, you still need a Prometheus server, but all the complexity is offloaded to the managed service: storage configuration, high availability, API optimization, and so on are provided to you out of the box.

Ingesting information into Amazon Managed Service for Prometheus

At this moment, there are two ways to ingest data into the Amazon Prometheus Service:

  • From an existing Prometheus server, using the remote_write capability: each series scraped by the local Prometheus is forwarded to the Amazon Prometheus Service (see the configuration sketch after this list).
  • Using AWS Distro for OpenTelemetry, combining the Prometheus Receiver and the AWS Prometheus Remote Write Exporter components.
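
For the first option, the configuration lives in your prometheus.yml. This is only a hedged sketch: the workspace URL and region are placeholders you get when creating the AMP workspace, and native SigV4 signing requires a recent Prometheus release (earlier setups put the AWS signing proxy in front of the endpoint instead):

remote_write:
  - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
    sigv4:
      region: <region>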

Summary

AMP is a way to get an enterprise-grade installation that leverages all of AWS's experience hosting and managing this solution at scale, optimized for performance, so you can focus on the components you need to get metrics ingested into the service.

I am sure this will not be AWS's last move in observability and metrics management; they will continue putting more tools in developers' and architects' hands to define optimized solutions as easily as possible.

📚 Want to dive deeper into Kubernetes? This article is part of our comprehensive Kubernetes Architecture Patterns guide, where you’ll find all fundamental and advanced concepts explained step by step.

Why Use GraphQL? 3 Key Benefits Over REST APIs Explained

3 benefits of using GraphQL in your API that you should take into consideration.

We all know that APIs are the new standard when we develop any piece of software. All the latest paradigms are based on distributed components created with collaboration in mind: they need to work together to provide more value to the whole ecosystem.

On the technical side, "API" has become a synonym for REST/JSON. But REST is not the only option, even in the synchronous request/reply world, and we are starting to see a shift away from this default choice.

GraphQL has emerged as a working alternative since Facebook introduced it in 2015. In its five years of existence, its adoption has grown outside Facebook's walls, but it is still far from mainstream, as the following Google Trends graph shows:

Google Trend graph showing interest in REST vs. GraphQL in the last five years

But I think this is a great moment to look again at the benefits that GraphQL can bring to the APIs in your ecosystem. You could start the new year by introducing a technology that provides you and your enterprise with clear benefits. So, let's take a look at them.

1.- More flexible style to meet different client profile needs.

I want to start this point with a small jump to the past, when REST was introduced. REST was not always the standard for creating our APIs, or web services as we called them back then. SOAP, a W3C standard, was the leader, and REST replaced it by improving on several points.

Above all, being a much lighter protocol than SOAP made the difference, especially when mobile devices started to become part of the ecosystem.

That is the situation today, and GraphQL takes that approach a step further in flexibility: it allows each consumer to decide which part of the data it wants, so different applications can share the same interface while each still gets an optimized response, because each decides what it wants to obtain every time.
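
As a hypothetical sketch, imagine a customer API where the web app needs the full profile but the mobile app only needs two fields. Both clients hit the same GraphQL endpoint; the mobile one simply asks for less:

query MobileCustomerView {
  customer(id: "42") {
    name
    email
  }
}

The server returns exactly those two fields and nothing more, so the payload stays optimized for each client without creating a second interface.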

2.- More loosely coupled approach with the service provider

Another important topic is the dependency between the consumer of the API and the provider. We all know that paradigms like microservices are focused on that concern: we aim for as much independence as possible among our components.

It is true that REST does not create a strong link between components. Still, the interface is fixed, so every time we modify it by adding or changing a field, we can affect consumers even if they do not need that field at all.

Because GraphQL lets each client select the fields it wants to obtain, it makes evolving the API much easier and gives components much more independence: only changes that directly affect the data a client needs have any effect on it, while the rest are completely transparent to it.

3.- More structured and defined specification

One of the aspects that marked the rise of REST as a widely used protocol was the lack of standards to structure and define its behavior. We had several attempts: RAML, plain "samples as specification", Swagger, and finally the OpenAPI specification. But that "unstructured" period means REST APIs can be built in very different ways.

Each developer or service provider can build a REST API with a different approach and philosophy, which generates noise and is difficult to standardize. GraphQL is based on a schema that defines the types managed by the API and the operations you can perform on them, in two main groups: queries and mutations. That way, all GraphQL APIs follow the same philosophy, no matter who develops them, because it is built into the core of the specification itself.
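
A hypothetical schema sketch shows how the specification itself imposes this structure, with the types first and the operations grouped as queries and mutations:

type Customer {
  id: ID!
  name: String!
  email: String
}

type Query {
  customer(id: ID!): Customer
}

type Mutation {
  updateEmail(id: ID!, email: String!): Customer
}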

Summary

After reading this article, you are probably thinking: so does that mean I should remove all my REST APIs and start building everything in GraphQL? My answer to that is… NO!

The goal of this article is to make you aware of the benefits that a different way of defining APIs can provide, so you can add it to your toolbelt. Next time you create an API, think about the topics described here and reach a conclusion: either "I think GraphQL is the better pick for this specific situation" or, the other way around, "I am not going to get any benefit for this specific API, so I would rather use REST."

The idea is that you now know what to apply to your specific case and can choose based on that, because nobody is better placed than you to decide what is best for your use case.

SOA Principles That Still Matter in Cloud-Native Architecture

The development world has changed a lot, but that does not mean everything from the past is no longer valid. Learn which principles you should still be aware of.

The world changes fast, and in IT, it changes even faster. We all know that, and it usually means we need to face new challenges and find new solutions. Examples of this are the trends we have seen in recent years: containers, DevSecOps, microservices, GitOps, service mesh…

But at the same time, we know IT moves in cycles: the challenges we face today are evolutions of challenges that have been addressed in the past. The main goal is to avoid reinventing the wheel and to avoid making the same mistakes as the people before us.

So, I think it is worth reviewing the principles that service-oriented architecture (SOA) gave us over the last decades and seeing which ones are still relevant today.

Principles Definition

I will use the principles from Thomas Erl's SOA Principles of Service Design and the definitions that we can find in the Wikipedia article:

1.- Service Abstraction

Design principle that is applied within the service-orientation design paradigm so that the information published in a service contract is limited to what is required to effectively utilize the service.

The main goal behind this principle is that a service consumer should not be aware of the particular component providing the service. The main advantage is that when we need to change the current service provider, we can do so without impacting the consumers. This is still totally relevant today, for several reasons:

  • Service-to-service communication: Service meshes and similar projects provide service registry and discovery capabilities based on the same principle, so consumers do not need to know which pod provides the functionality.
  • SaaS “protection mode”: Some backend systems are here to stay, even if they now come in more modern setups as SaaS platforms. That flexibility also makes it easier to move away from or swap the SaaS application providing the functionality. But that flexibility is not real if the SaaS application is tightly coupled to the rest of the microservices and cloud-native applications in your landscape.

2.- Service Autonomy

Design principle that is applied within the service-orientation design paradigm, to provide services with improved independence from their execution environments.

We all know the importance of the service isolation that cloud-native development patterns provide, based on containers' ability to keep execution environments independent.

Each service should have its own execution context isolated as much as possible from the execution context of the other services to avoid any interference between them.

So this principle is not only still relevant today; it is encouraged by today's paradigms as the normal way to do things, because of the benefits it has shown.

3.- Service Statelessness

Design principle that is applied within the service-orientation design paradigm, in order to design scalable services by separating them from their state data whenever possible.

Stateless microservices do not maintain their own state across calls. The service receives a request, handles it, and replies to the client. If some state needs to be stored, it should be kept outside the microservice in an external data store such as a relational database, a NoSQL database, or any other external storage.

4.- Service Composability

Design of services that can be reused in multiple solutions that are themselves made up of composed services. The ability to recompose the service is ideally independent of the size and complexity of the service composition.

We all know that reusability is not one of the principles behind microservices; the argument is that reusability works against agility, because when a service is shared among many parties, there is no easy way to evolve it.

But composability is more about leveraging existing services to create new ones, the same approach we follow with the API orchestration and choreography paradigm: building composite services on top of existing ones with the agility needed to meet the business's innovation targets.

Summary

Cloud-native application development paradigms are a smooth evolution of existing principles. We should keep the ones that are still relevant, give them an updated reading, and revise the ones that need it.

In the end, each day in this industry we take one more step on the long journey that is its history, leveraging all the work done in the past and learning from it.

Kubernetes Distributions Explained: What They Are and the Top Platforms to Choose

Learn what Kubernetes distributions are, why they matter to you, and who the top players available in the market are today.

Introduction

One of the biggest announcements from the latest AWS re:Invent 2020 sessions was the release of EKS-D from Amazon. EKS-D is their open-source Kubernetes distribution, now available for everyone to use on any cloud provider or even on-premises.

It’s based on past findings and the entire process Amazon has undergone in managing their Kubernetes managed platform, Amazon EKS.

These announcements have many people asking themselves: “OK, I know Kubernetes, but what’s a Kubernetes distribution? And why should I care?”

So I'll try to answer that with the knowledge I have, using the approach I always use: comparing the Kubernetes model with the Linux one.

Kubernetes is an open-source project, as you know, started by Google and is now being managed by the community and the Cloud Native Computing Foundation (CNCF), and you can find all the code available here:

But let’s be honest: Not many of us are pulling that repo and trying to compile it to provide a cluster. That’s not how we usually work. If you follow the code path — downloading it, building it, and so on — this is usually named vanilla Kubernetes.

Going back to the Linux comparison, it's the same situation as with the Linux kernel: most Linux distributions ship it already compiled, together with a bunch of other tools that all work together out of the box.

So that's what a Kubernetes distribution is: it builds Kubernetes and ships other tools and components that enhance it or add features, often focusing on additional aspects like security or DevOps. Another concept that usually comes up is the purity of a distribution.

We call a distribution pure when it’s building Kubernetes, and that’s it. It leaves everything else to the developers or users to decide what they want to use on top of it.


What Are the Main Components Shipped in a Kubernetes Distribution?

The main components that can differ when we’re talking about a Kubernetes distribution are the following:

Container runtime and registries

We all know there's more than one container runtime, and even if you weren't aware of that, you've probably read all of the articles regarding the removal of Docker support in Kubernetes v1.20, as you can read in this awesome article from Edgar Rodriguez.

At this moment, all runtimes are expected to support the Container Runtime Interface (CRI), and runtimes like CRI-O, containerd, or Kata Containers seem to be the default options now.
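
If you are curious which runtime your own cluster ships, a quick check (assuming you have kubectl access) shows it in the CONTAINER-RUNTIME column:

kubectl get nodes -o wide

Depending on the distribution, you will see values like containerd://1.4.x or cri-o://1.20.x there.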

Networking

Another topic that often differs when we’re talking about Kubernetes distributions is how they manage their network, and this is one of the most critical aspects of the whole platform.

As with the container runtime, a standard specification exists to cover this topic: the Container Network Interface (CNI). Several projects exist in this space, like Flannel, Calico, Canal, and Weave Net. Also, some platforms provide their own component, like the OpenShift SDN operator.

Storage

How storage is handled in Kubernetes is also very important, especially as we embrace this model for deployments that require stateful workloads. Different platforms support different storage options, file systems, and so on.


Who Are the Top Players?

The first thing we need to be aware of is there are a huge number of Kubernetes distributions out there.

We’ll count the ones with a CNCF certification, and you can take a look at all of them here. At the moment of writing this article, we’re talking about 72 certified distributions.

Logos of the various CNCF-certified distributions.
Image via the Cloud Native Computing Foundation

These are the ones that I’d like to highlight today:

Red Hat OpenShift

Red Hat OpenShift logo.

The Red Hat OpenShift platform is probably one of the most used platforms, especially in private-cloud deployments. It can include many of the Red Hat services, like GlusterFS for storage and OpenShift SDN for networking. OKD is the open-source project that backs and feeds the OpenShift platform. Check this article to see how to set up OpenShift locally to test it.

Mirantis

Mirantis logo

The former Docker Enterprise platform, acquired by Mirantis, is another usual choice when we're talking about supported platforms.

VMware Tanzu

VMware Tanzu logo

VMware Tanzu, which also grew out of VMware's acquisition of Pivotal, is another Kubernetes platform option.

Canonical

Canonical logo

Canonical (open source) is a platform from the company that develops and maintains Ubuntu. It's another important choice here and provides a variety of options, covering not only the common centralized model but also edge Kubernetes deployments with projects like MicroK8s.

Rancher

Rancher logo

Rancher (open source) is another one of the big players, focusing on following and extending the CNCF standards and also offering a big push for edge deployment with K3S. It also offers automated upgrades.


Summary

So, as you can see, the number of options available out there is huge. They all differ, so it’s important to take your time when you’re deciding your target platform based on your criteria for your project or your company.

And that’s without covering the managed platforms available out there that are becoming one of the more preferred options for companies so they can get all the flexibility from Kubernetes while not needing to handle the complexity of managing a Kubernetes platform themselves. But that’s a topic for another article — hopefully soon.

This article at least has provided you with more clarity about what a Kubernetes distribution is, the main differences among them, and a quick look at some of the key actors in this spectrum. Enjoy your day, and enjoy your life.

📚 Want to dive deeper into Kubernetes? This article is part of our comprehensive Kubernetes Architecture Patterns guide, where you’ll find all fundamental and advanced concepts explained step by step.

Why Visual Diff Is Essential for Low-Code Development and Team Collaboration

Helping you excel using low code in distributed teams and parallel programming

Most enterprises are exploring low-code/no-code development now that the most important thing is to achieve agility on the technology artifacts from different perspectives (development, deployment, and operation).

This article is part of my comprehensive TIBCO Integration Platform Guide where you can find more patterns and best practices for TIBCO integration platforms.

The benefits of this way of working make this almost a no-brainer decision for most companies. We already covered them in a previous article. Take a look if you have not read it yet.

But we know that all new things come with their own challenges that we need to address and master in order to unleash the full benefits that these new paradigms or technologies are providing. Much like with cloud-native architecture, we need to be able to adapt.

Sometimes it is not the culture that we need to change; sometimes the technology and the tools also need to evolve to address those challenges and help us on that journey. And this is how Visual Diff came to life.

When you develop using a low-code approach, the whole development process is easier: you combine different blocks that implement the logic you need, and everything is simpler than a bunch of lines of code.

Low-code development approach using TIBCO BusinessWorks.

But we also need to manage all these artifacts in a repository, and repositories are built around source code development. That means that when you work with those tools, in the end you are not working with a "low-code approach" but rather a source-code approach. Things like merging branches or browsing the version history to understand changes become complex.

And they are complex because they are performed by the repository itself, which only sees file and source code changes. But one of the great benefits of low-code development is that the developer doesn't need to be aware of the source code generated behind the visual activity. So, how can we solve that?

Low-code technologies need to advance to take the lead here. For example, this is what TIBCO BusinessWorks has done with the release of their Visual Diff capability.

So, you still have your integration with your source code repository. You can do all the processes and activities you usually need to do in this kind of parallel distributed development. Still, you can also see all those activities from a “low-code” perspective.

That means that when I look at the version history, I can see which visual artifacts have been modified; the activities added or deleted are shown in a way that is meaningful for low-code development. That closes the loop: low-code developments get all the advantages of modern source code repositories and their flows (GitFlow, GitHub Flow, One Flow, etc.) as well as the advantages of the low-code perspective.

Let’s say there are two options with which you can see how an application has been changed. One is the traditional approach and the other uses the Visual Diff:

Option A: Visual Diff of your processes
Option B: Same processes but with a Text Comparison approach

So, based on this evidence, which one do you think is easier to understand? Even for true coders like me, we cannot deny the ease and benefits of the low-code approach for large-scale, standard development in the enterprise world.


Summary

No matter how fast we develop with all the accelerators and frameworks we have, a well-defined low-code application will be faster than any of us. It is the same battle we had in the past with graphical interfaces and mouse control versus the keyboard.

We accept that there is a personal preference to choose one or the other, but when we need to decide what is more effective and need to rely on the facts, we cannot be blind to what is in front of us.

I hope you have enjoyed this article. Have a nice day!

Lens Kubernetes IDE Explained: Improve Development and Cluster Management Productivity

Find the greatest way to manage your Kubernetes development cluster

I need to start this article by admitting that I am an advocate of Graphical User Interfaces and everything that provides a way to speed up the way we do things and be more productive.

So when we talk about managing a Kubernetes cluster, mainly for development purposes, you might imagine I am one of those people who tries every available tool to make the journey easier: the ones who started using Portainer to manage their local Docker engine or who are fans of the new dashboard in Docker for Windows/Mac. But that is far from reality.

In terms of Kubernetes management, I got used to typing all the commands to check the pods, the logs, and the status of the cluster, to do port-forwards, and so on. Every task I did was in a terminal, and it felt like the right thing to do. I did not even use the Kubernetes Dashboard to have a web view of my environment. All of that changed last week when a colleague showed me what Lens could do.

Lens is a totally different story. I am not praising it because I am being paid to do so. This is an open source project that you can find on GitHub. But the way that it does the job is just awesome!

Lens showing the status of a Kubernetes cluster — Screenshot by the author.

The first thing I would like to mention about Lens is its multi-context support: all your Kubernetes contexts are available to switch between, much like switching workspaces in Slack. It simply reads your .kube/config file and makes all those contexts available so you can connect to whichever one you like.

Kubernetes context selection in Lens

Once we have connected to one of these clusters, we have different options to see its status; the first is to check the workloads using the Overview option:

Workloads Overview in Lens

Then you can drill down into any pod or other Kubernetes object to check its status and, at the same time, perform the main actions you usually need when dealing with a pod, such as checking the logs, opening a terminal into one of the containers that belong to the pod, or even editing the pod's YAML.

Pod options inside Lens

But Lens goes beyond the usual Kubernetes tasks: it also has Helm integration, so you can check the releases you have deployed, their versions and status, and so on:

Helm integration option in Lens

The experience of managing everything feels perfect, and you are more productive as well. Even those of us who love CLIs and terminals have to admit that for routine tasks, the graphical approach and the mouse are faster than the keyboard — even for defenders of the mechanical keyboard like myself.

So, I encourage you to download Lens and start using it right now. To do so, go to their main web page and download it:

Thanks for reading!

📚 Want to dive deeper into Kubernetes? This article is part of our comprehensive Kubernetes Architecture Patterns guide, where you’ll find all fundamental and advanced concepts explained step by step.

Optimize Prometheus Disk Usage: Reduce TSDB Size and Control Metrics Cardinality

Learn some tricks to analyze and optimize your TSDB usage and save money on your cloud deployment.

In previous posts, we discussed how the Prometheus storage layer works and how efficient it is. But in these times of cloud computing, we know that every technical optimization is also a cost optimization, which is why we need to be diligent about every optimization option available to us.

When we monitor with Prometheus, we usually have many exporters at our disposal, and each of them exposes a lot of very relevant metrics so we can track everything we need. But we should also be aware that there are metrics we don't need at the moment or don't plan to use. So, if we are not going to use them, why waste disk space storing them?
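
Before dropping anything, it helps to measure which metrics dominate the TSDB. A common PromQL trick is to count series per metric name (run it ad hoc, as it is an expensive query on large setups):

topk(10, count by (__name__)({__name__=~".+"}))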

So let's take a look at one of the exporters in our system. In my case, I will use a BusinessWorks Container Edition (BWCE) application that exposes metrics about its utilization. If you check its metrics endpoint, you will see something like this:

# HELP jvm_info JVM version info
# TYPE jvm_info gauge
jvm_info{version="1.8.0_221-b27",vendor="Oracle Corporation",runtime="Java(TM) SE Runtime Environment",} 1.0
# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 1.0318492E8
jvm_memory_bytes_used{area="nonheap",} 1.52094712E8
# HELP jvm_memory_bytes_committed Committed (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_committed gauge
jvm_memory_bytes_committed{area="heap",} 1.35266304E8
jvm_memory_bytes_committed{area="nonheap",} 1.71302912E8
# HELP jvm_memory_bytes_max Max (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_max gauge
jvm_memory_bytes_max{area="heap",} 1.073741824E9
jvm_memory_bytes_max{area="nonheap",} -1.0
# HELP jvm_memory_bytes_init Initial bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_init gauge
jvm_memory_bytes_init{area="heap",} 1.34217728E8
jvm_memory_bytes_init{area="nonheap",} 2555904.0
# HELP jvm_memory_pool_bytes_used Used bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_used gauge
jvm_memory_pool_bytes_used{pool="Code Cache",} 3.3337536E7
jvm_memory_pool_bytes_used{pool="Metaspace",} 1.04914136E8
jvm_memory_pool_bytes_used{pool="Compressed Class Space",} 1.384304E7
jvm_memory_pool_bytes_used{pool="G1 Eden Space",} 3.3554432E7
jvm_memory_pool_bytes_used{pool="G1 Survivor Space",} 1048576.0
jvm_memory_pool_bytes_used{pool="G1 Old Gen",} 6.8581912E7
# HELP jvm_memory_pool_bytes_committed Committed bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_committed gauge
jvm_memory_pool_bytes_committed{pool="Code Cache",} 3.3619968E7
jvm_memory_pool_bytes_committed{pool="Metaspace",} 1.19697408E8
jvm_memory_pool_bytes_committed{pool="Compressed Class Space",} 1.7985536E7
jvm_memory_pool_bytes_committed{pool="G1 Eden Space",} 4.6137344E7
jvm_memory_pool_bytes_committed{pool="G1 Survivor Space",} 1048576.0
jvm_memory_pool_bytes_committed{pool="G1 Old Gen",} 8.8080384E7
# HELP jvm_memory_pool_bytes_max Max bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_max gauge
jvm_memory_pool_bytes_max{pool="Code Cache",} 2.5165824E8
jvm_memory_pool_bytes_max{pool="Metaspace",} -1.0
jvm_memory_pool_bytes_max{pool="Compressed Class Space",} 1.073741824E9
jvm_memory_pool_bytes_max{pool="G1 Eden Space",} -1.0
jvm_memory_pool_bytes_max{pool="G1 Survivor Space",} -1.0
jvm_memory_pool_bytes_max{pool="G1 Old Gen",} 1.073741824E9
# HELP jvm_memory_pool_bytes_init Initial bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_init gauge
jvm_memory_pool_bytes_init{pool="Code Cache",} 2555904.0
jvm_memory_pool_bytes_init{pool="Metaspace",} 0.0
jvm_memory_pool_bytes_init{pool="Compressed Class Space",} 0.0
jvm_memory_pool_bytes_init{pool="G1 Eden Space",} 7340032.0
jvm_memory_pool_bytes_init{pool="G1 Survivor Space",} 0.0
jvm_memory_pool_bytes_init{pool="G1 Old Gen",} 1.26877696E8
# HELP jvm_buffer_pool_used_bytes Used bytes of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_bytes gauge
jvm_buffer_pool_used_bytes{pool="direct",} 148590.0
jvm_buffer_pool_used_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_capacity_bytes Bytes capacity of a given JVM buffer pool.
# TYPE jvm_buffer_pool_capacity_bytes gauge
jvm_buffer_pool_capacity_bytes{pool="direct",} 148590.0
jvm_buffer_pool_capacity_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_used_buffers Used buffers of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_buffers gauge
jvm_buffer_pool_used_buffers{pool="direct",} 19.0
jvm_buffer_pool_used_buffers{pool="mapped",} 0.0
# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 16993.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
jvm_classes_loaded_total 17041.0
# HELP jvm_classes_unloaded_total The total number of classes that have been unloaded since the JVM has started execution
# TYPE jvm_classes_unloaded_total counter
jvm_classes_unloaded_total 48.0
# HELP bwce_activity_stats_list BWCE Activity Statictics list
# TYPE bwce_activity_stats_list gauge
# HELP bwce_activity_counter_list BWCE Activity related Counters list
# TYPE bwce_activity_counter_list gauge
# HELP all_activity_events_count BWCE All Activity Events count by State
# TYPE all_activity_events_count counter
all_activity_events_count{StateName="CANCELLED",} 0.0
all_activity_events_count{StateName="COMPLETED",} 0.0
all_activity_events_count{StateName="STARTED",} 0.0
all_activity_events_count{StateName="FAULTED",} 0.0
# HELP activity_events_count BWCE All Activity Events count by Process, Activity State
# TYPE activity_events_count counter
# HELP activity_total_evaltime_count BWCE Activity EvalTime by Process and Activity
# TYPE activity_total_evaltime_count counter
# HELP activity_total_duration_count BWCE Activity DurationTime by Process and Activity
# TYPE activity_total_duration_count counter
# HELP bwpartner_instance:total_request Total Request for the partner invocation which mapped from the activities
# TYPE bwpartner_instance:total_request counter
# HELP bwpartner_instance:total_duration_ms Total Duration for the partner invocation which mapped from the activities (execution or latency)
# TYPE bwpartner_instance:total_duration_ms counter
# HELP bwce_process_stats BWCE Process Statistics list
# TYPE bwce_process_stats gauge
# HELP bwce_process_counter_list BWCE Process related Counters list
# TYPE bwce_process_counter_list gauge
# HELP all_process_events_count BWCE All Process Events count by State
# TYPE all_process_events_count counter
all_process_events_count{StateName="CANCELLED",} 0.0
all_process_events_count{StateName="COMPLETED",} 0.0
all_process_events_count{StateName="STARTED",} 0.0
all_process_events_count{StateName="FAULTED",} 0.0
# HELP process_events_count BWCE Process Events count by Operation
# TYPE process_events_count counter
# HELP process_duration_seconds_total BWCE Process Events duration by Operation in seconds
# TYPE process_duration_seconds_total counter
# HELP process_duration_milliseconds_total BWCE Process Events duration by Operation in milliseconds
# TYPE process_duration_milliseconds_total counter
# HELP bwdefinitions:partner BWCE Process Events count by Operation
# TYPE bwdefinitions:partner counter
bwdefinitions:partner{ProcessName="t1.module.item.getTransactionData",ActivityName="FTLPublisher",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="TransactionService",PartnerOperation="GetTransactionsOperation",Location="internal",PartnerMiddleware="MW",} 1.0
bwdefinitions:partner{ProcessName=" t1.module.item.auditProcess",ActivityName="KafkaSendMessage",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="AuditService",PartnerOperation="AuditOperation",Location="internal",PartnerMiddleware="MW",} 1.0
bwdefinitions:partner{ProcessName="t1.module.item.getCustomerData",ActivityName="JMSRequestReply",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="CustomerService",PartnerOperation="GetCustomerDetailsOperation",Location="internal",PartnerMiddleware="MW",} 1.0
# HELP bwdefinitions:binding BW Design Time Repository - binding/transport definition
# TYPE bwdefinitions:binding counter
bwdefinitions:binding{ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInterface="GetCustomer360:GetDataOperation",Binding="/customer",Transport="HTTP",} 1.0
# HELP bwdefinitions:service BW Design Time Repository - Service definition
# TYPE bwdefinitions:service counter
bwdefinitions:service{ProcessName="t1.module.sub.item.getCustomerData",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.sub.item.auditProcess",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.sub.orchestratorSubFlow",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.Process",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
# HELP bwdefinitions:gateway BW Design Time Repository - Gateway definition
# TYPE bwdefinitions:gateway counter
bwdefinitions:gateway{ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",Endpoint="bwce-demo-mon-orchestrator-bwce",InteractionType="ISTIO",} 1.0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1956.86
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.604712447107E9
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 763.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1048576.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.046207488E9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2151936E8
# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 540.0
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 4.754
jvm_gc_collection_seconds_count{gc="G1 Old Generation",} 2.0
jvm_gc_collection_seconds_sum{gc="G1 Old Generation",} 0.563
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 98.0
# HELP jvm_threads_daemon Daemon thread count of a JVM
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 43.0
# HELP jvm_threads_peak Peak thread count of a JVM
# TYPE jvm_threads_peak gauge
jvm_threads_peak 98.0
# HELP jvm_threads_started_total Started thread count of a JVM
# TYPE jvm_threads_started_total counter
jvm_threads_started_total 109.0
# HELP jvm_threads_deadlocked Cycles of JVM-threads that are in deadlock waiting to acquire object monitors or ownable synchronizers
# TYPE jvm_threads_deadlocked gauge
jvm_threads_deadlocked 0.0
# HELP jvm_threads_deadlocked_monitor Cycles of JVM-threads that are in deadlock waiting to acquire object monitors
# TYPE jvm_threads_deadlocked_monitor gauge
jvm_threads_deadlocked_monitor 0.0

As you can see, there are a lot of metrics, but to be honest, I am not using most of them in my dashboards or to generate my alerts. I use the metrics regarding application performance for each of the BusinessWorks processes and their activities, as well as the JVM memory usage and thread counts, but things like how the JVM GC behaves for each generation (G1 Young Generation, G1 Old Generation) I am not using at all.
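To give an idea of the kind of alert I build from those process metrics, here is a minimal sketch of a Prometheus alerting rule on faulted BWCE processes; the group name, alert name, window, and severity are my own illustrative choices, not something that ships with the exporter:

groups:
  - name: bwce-alerts
    rules:
      - alert: BWCEProcessFaulted
        # Fires when new FAULTED process events were counted in the last 5 minutes
        expr: increase(all_process_events_count{StateName="FAULTED"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "BWCE process instances are faulting"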

So, if I show the same metrics endpoint highlighting the parts that I am not using, it would look something like this:

# HELP jvm_info JVM version info
# TYPE jvm_info gauge
jvm_info{version="1.8.0_221-b27",vendor="Oracle Corporation",runtime="Java(TM) SE Runtime Environment",} 1.0

# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 1.0318492E8
jvm_memory_bytes_used{area="nonheap",} 1.52094712E8
# HELP jvm_memory_bytes_committed Committed (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_committed gauge
jvm_memory_bytes_committed{area="heap",} 1.35266304E8
jvm_memory_bytes_committed{area="nonheap",} 1.71302912E8
# HELP jvm_memory_bytes_max Max (bytes) of a given JVM memory area.
# TYPE jvm_memory_bytes_max gauge
jvm_memory_bytes_max{area="heap",} 1.073741824E9
jvm_memory_bytes_max{area="nonheap",} -1.0
# HELP jvm_memory_bytes_init Initial bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_init gauge
jvm_memory_bytes_init{area="heap",} 1.34217728E8
jvm_memory_bytes_init{area="nonheap",} 2555904.0

# HELP jvm_memory_pool_bytes_used Used bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_used gauge
jvm_memory_pool_bytes_used{pool="Code Cache",} 3.3337536E7
jvm_memory_pool_bytes_used{pool="Metaspace",} 1.04914136E8
jvm_memory_pool_bytes_used{pool="Compressed Class Space",} 1.384304E7
jvm_memory_pool_bytes_used{pool="G1 Eden Space",} 3.3554432E7
jvm_memory_pool_bytes_used{pool="G1 Survivor Space",} 1048576.0
jvm_memory_pool_bytes_used{pool="G1 Old Gen",} 6.8581912E7
# HELP jvm_memory_pool_bytes_committed Committed bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_committed gauge
jvm_memory_pool_bytes_committed{pool="Code Cache",} 3.3619968E7
jvm_memory_pool_bytes_committed{pool="Metaspace",} 1.19697408E8
jvm_memory_pool_bytes_committed{pool="Compressed Class Space",} 1.7985536E7
jvm_memory_pool_bytes_committed{pool="G1 Eden Space",} 4.6137344E7
jvm_memory_pool_bytes_committed{pool="G1 Survivor Space",} 1048576.0
jvm_memory_pool_bytes_committed{pool="G1 Old Gen",} 8.8080384E7
# HELP jvm_memory_pool_bytes_max Max bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_max gauge
jvm_memory_pool_bytes_max{pool="Code Cache",} 2.5165824E8
jvm_memory_pool_bytes_max{pool="Metaspace",} -1.0
jvm_memory_pool_bytes_max{pool="Compressed Class Space",} 1.073741824E9
jvm_memory_pool_bytes_max{pool="G1 Eden Space",} -1.0
jvm_memory_pool_bytes_max{pool="G1 Survivor Space",} -1.0
jvm_memory_pool_bytes_max{pool="G1 Old Gen",} 1.073741824E9
# HELP jvm_memory_pool_bytes_init Initial bytes of a given JVM memory pool.
# TYPE jvm_memory_pool_bytes_init gauge
jvm_memory_pool_bytes_init{pool="Code Cache",} 2555904.0
jvm_memory_pool_bytes_init{pool="Metaspace",} 0.0
jvm_memory_pool_bytes_init{pool="Compressed Class Space",} 0.0
jvm_memory_pool_bytes_init{pool="G1 Eden Space",} 7340032.0
jvm_memory_pool_bytes_init{pool="G1 Survivor Space",} 0.0
jvm_memory_pool_bytes_init{pool="G1 Old Gen",} 1.26877696E8
# HELP jvm_buffer_pool_used_bytes Used bytes of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_bytes gauge
jvm_buffer_pool_used_bytes{pool="direct",} 148590.0
jvm_buffer_pool_used_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_capacity_bytes Bytes capacity of a given JVM buffer pool.
# TYPE jvm_buffer_pool_capacity_bytes gauge
jvm_buffer_pool_capacity_bytes{pool="direct",} 148590.0
jvm_buffer_pool_capacity_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_used_buffers Used buffers of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_buffers gauge
jvm_buffer_pool_used_buffers{pool="direct",} 19.0
jvm_buffer_pool_used_buffers{pool="mapped",} 0.0
# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 16993.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
jvm_classes_loaded_total 17041.0
# HELP jvm_classes_unloaded_total The total number of classes that have been unloaded since the JVM has started execution
# TYPE jvm_classes_unloaded_total counter
jvm_classes_unloaded_total 48.0

# HELP bwce_activity_stats_list BWCE Activity Statistics list
# TYPE bwce_activity_stats_list gauge
# HELP bwce_activity_counter_list BWCE Activity related Counters list
# TYPE bwce_activity_counter_list gauge
# HELP all_activity_events_count BWCE All Activity Events count by State
# TYPE all_activity_events_count counter
all_activity_events_count{StateName="CANCELLED",} 0.0
all_activity_events_count{StateName="COMPLETED",} 0.0
all_activity_events_count{StateName="STARTED",} 0.0
all_activity_events_count{StateName="FAULTED",} 0.0
# HELP activity_events_count BWCE All Activity Events count by Process, Activity State
# TYPE activity_events_count counter
# HELP activity_total_evaltime_count BWCE Activity EvalTime by Process and Activity
# TYPE activity_total_evaltime_count counter
# HELP activity_total_duration_count BWCE Activity DurationTime by Process and Activity
# TYPE activity_total_duration_count counter
# HELP bwpartner_instance:total_request Total Request for the partner invocation which mapped from the activities
# TYPE bwpartner_instance:total_request counter
# HELP bwpartner_instance:total_duration_ms Total Duration for the partner invocation which mapped from the activities (execution or latency)
# TYPE bwpartner_instance:total_duration_ms counter
# HELP bwce_process_stats BWCE Process Statistics list
# TYPE bwce_process_stats gauge
# HELP bwce_process_counter_list BWCE Process related Counters list
# TYPE bwce_process_counter_list gauge
# HELP all_process_events_count BWCE All Process Events count by State
# TYPE all_process_events_count counter
all_process_events_count{StateName="CANCELLED",} 0.0
all_process_events_count{StateName="COMPLETED",} 0.0
all_process_events_count{StateName="STARTED",} 0.0
all_process_events_count{StateName="FAULTED",} 0.0
# HELP process_events_count BWCE Process Events count by Operation
# TYPE process_events_count counter
# HELP process_duration_seconds_total BWCE Process Events duration by Operation in seconds
# TYPE process_duration_seconds_total counter
# HELP process_duration_milliseconds_total BWCE Process Events duration by Operation in milliseconds
# TYPE process_duration_milliseconds_total counter
# HELP bwdefinitions:partner BWCE Process Events count by Operation
# TYPE bwdefinitions:partner counter
bwdefinitions:partner{ProcessName="t1.module.item.getTransactionData",ActivityName="FTLPublisher",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="TransactionService",PartnerOperation="GetTransactionsOperation",Location="internal",PartnerMiddleware="MW",} 1.0
bwdefinitions:partner{ProcessName=" t1.module.item.auditProcess",ActivityName="KafkaSendMessage",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="AuditService",PartnerOperation="AuditOperation",Location="internal",PartnerMiddleware="MW",} 1.0
bwdefinitions:partner{ProcessName="t1.module.item.getCustomerData",ActivityName="JMSRequestReply",ServiceName="GetCustomer360",OperationName="GetDataOperation",PartnerService="CustomerService",PartnerOperation="GetCustomerDetailsOperation",Location="internal",PartnerMiddleware="MW",} 1.0
# HELP bwdefinitions:binding BW Design Time Repository - binding/transport definition
# TYPE bwdefinitions:binding counter
bwdefinitions:binding{ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInterface="GetCustomer360:GetDataOperation",Binding="/customer",Transport="HTTP",} 1.0
# HELP bwdefinitions:service BW Design Time Repository - Service definition
# TYPE bwdefinitions:service counter
bwdefinitions:service{ProcessName="t1.module.sub.item.getCustomerData",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.sub.item.auditProcess",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.sub.orchestratorSubFlow",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
bwdefinitions:service{ProcessName="t1.module.Process",ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",} 1.0
# HELP bwdefinitions:gateway BW Design Time Repository - Gateway definition
# TYPE bwdefinitions:gateway counter
bwdefinitions:gateway{ServiceName="GetCustomer360",OperationName="GetDataOperation",ServiceInstance="GetCustomer360:GetDataOperation",Endpoint="bwce-demo-mon-orchestrator-bwce",InteractionType="ISTIO",} 1.0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1956.86
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.604712447107E9
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 763.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1048576.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.046207488E9
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2151936E8
# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="G1 Young Generation",} 540.0
jvm_gc_collection_seconds_sum{gc="G1 Young Generation",} 4.754
jvm_gc_collection_seconds_count{gc="G1 Old Generation",} 2.0
jvm_gc_collection_seconds_sum{gc="G1 Old Generation",} 0.563

# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 98.0
# HELP jvm_threads_daemon Daemon thread count of a JVM
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 43.0
# HELP jvm_threads_peak Peak thread count of a JVM
# TYPE jvm_threads_peak gauge
jvm_threads_peak 98.0
# HELP jvm_threads_started_total Started thread count of a JVM
# TYPE jvm_threads_started_total counter
jvm_threads_started_total 109.0
# HELP jvm_threads_deadlocked Cycles of JVM-threads that are in deadlock waiting to acquire object monitors or ownable synchronizers
# TYPE jvm_threads_deadlocked gauge
jvm_threads_deadlocked 0.0
# HELP jvm_threads_deadlocked_monitor Cycles of JVM-threads that are in deadlock waiting to acquire object monitors
# TYPE jvm_threads_deadlocked_monitor gauge
jvm_threads_deadlocked_monitor 0.0

So, the part that I am not using can be around 50% of the metrics endpoint response. Why am I paying for disk space to store it? And this is just for a "critical exporter", one from which I try to use as much information as possible; think about how many exporters you have and how much of the information from each of them you actually use.

OK, so now the purpose and the motivation of this post are clear, but what can we do about it?

Discovering the REST API

Prometheus has an awesome REST API that exposes all the information you could wish for. If you have ever used the Prometheus graphical interface (shown below), you have been using the REST API, because it is what sits behind it.

[Image: Target view of the Prometheus graphical interface]

All the documentation regarding the REST API is available in the official Prometheus documentation:

https://prometheus.io/docs/prometheus/latest/querying/api/

But what does this API provide us in terms of the time-series database (TSDB) that Prometheus uses?

TSDB Admin APIs

We have a specific API to manage the TSDB, but in order to use it, we need to enable the Admin API. That is done by providing the flag --web.enable-admin-api when launching the Prometheus server.
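For a standalone deployment, that just means adding the flag when starting the server, for example:

prometheus --config.file=prometheus.yml --web.enable-admin-api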

If we are using the Prometheus Operator Helm chart to deploy Prometheus, we need to set the following item in our values.yaml:

## EnableAdminAPI enables the Prometheus administrative HTTP API, which includes functionality such as deleting time series.
## This is disabled by default.
## ref: https://prometheus.io/docs/prometheus/latest/querying/api/#tsdb-admin-apis
enableAdminAPI: true
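And then roll it out to the release; this is a sketch assuming a release called prometheus-operator deployed from the stable/prometheus-operator chart (adjust the names to your installation):

helm upgrade prometheus-operator stable/prometheus-operator -f values.yaml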

Enabling this administrative API unlocks several options, but today we are going to focus on a single REST operation: "stats". This is, in fact, the only TSDB-related method that does not require the Admin API to be enabled. This operation, as we can read in the Prometheus documentation, returns the following items:

headStats: This provides the following data about the head block of the TSDB:

  • numSeries: The number of series.
  • chunkCount: The number of chunks.
  • minTime: The current minimum timestamp in milliseconds.
  • maxTime: The current maximum timestamp in milliseconds.

seriesCountByMetricName: This will provide a list of metrics names and their series count.

labelValueCountByLabelName: This will provide a list of the label names and their value count.

memoryInBytesByLabelName: This will provide a list of the label names and the memory used in bytes. Memory usage is calculated by adding the length of all values for a given label name.

seriesCountByLabelValuePair: This will provide a list of label-value pairs and their series count.

To access that API, we need to hit the following endpoint:

GET /api/v1/status/tsdb
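For instance, assuming the Prometheus server is reachable at localhost:9090 (adjust the host and port to your own deployment), a simple way to call it is:

curl -s http://localhost:9090/api/v1/status/tsdb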

So, when I run that against my Prometheus deployment, I get something similar to this:

{
  "status": "success",
  "data": {
    "seriesCountByMetricName": [
      { "name": "apiserver_request_duration_seconds_bucket", "value": 34884 },
      { "name": "apiserver_request_latencies_bucket", "value": 7344 },
      { "name": "etcd_request_duration_seconds_bucket", "value": 6000 },
      { "name": "apiserver_response_sizes_bucket", "value": 3888 },
      { "name": "apiserver_request_latencies_summary", "value": 2754 },
      { "name": "etcd_request_latencies_summary", "value": 1500 },
      { "name": "apiserver_request_count", "value": 1216 },
      { "name": "apiserver_request_total", "value": 1216 },
      { "name": "container_tasks_state", "value": 1140 },
      { "name": "apiserver_request_latencies_count", "value": 918 }
    ],
    "labelValueCountByLabelName": [
      { "name": "__name__", "value": 2374 },
      { "name": "id", "value": 210 },
      { "name": "mountpoint", "value": 208 },
      { "name": "le", "value": 195 },
      { "name": "type", "value": 185 },
      { "name": "name", "value": 181 },
      { "name": "resource", "value": 170 },
      { "name": "secret", "value": 168 },
      { "name": "image", "value": 107 },
      { "name": "container_id", "value": 97 }
    ],
    "memoryInBytesByLabelName": [
      { "name": "__name__", "value": 97729 },
      { "name": "id", "value": 21450 },
      { "name": "mountpoint", "value": 18123 },
      { "name": "name", "value": 13831 },
      { "name": "image", "value": 8005 },
      { "name": "container_id", "value": 7081 },
      { "name": "image_id", "value": 6872 },
      { "name": "secret", "value": 5054 },
      { "name": "type", "value": 4613 },
      { "name": "resource", "value": 3459 }
    ],
    "seriesCountByLabelValuePair": [
      { "name": "namespace=default", "value": 72064 },
      { "name": "service=kubernetes", "value": 70921 },
      { "name": "endpoint=https", "value": 70917 },
      { "name": "job=apiserver", "value": 70917 },
      { "name": "component=apiserver", "value": 57992 },
      { "name": "instance=192.168.185.199:443", "value": 40343 },
      { "name": "__name__=apiserver_request_duration_seconds_bucket", "value": 34884 },
      { "name": "version=v1", "value": 31152 },
      { "name": "instance=192.168.112.31:443", "value": 30574 },
      { "name": "scope=cluster", "value": 29713 }
    ]
  }
}

We can also check the same information using the new (and still experimental) React user interface, at the following endpoint:

/new/tsdb-status
[Image: Graphical visualization of the top 10 series count by metric name in the new Prometheus UI]

So, with that, you get the top 10 series and labels inside your time-series database, and if some of them are not useful, you can simply get rid of them using the usual approaches to drop a series or a label, as sketched below.
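For example, to stop ingesting a whole family of metrics at scrape time, we can drop them with a metric_relabel_configs rule in the scrape configuration. This is a minimal sketch, assuming a hypothetical job called bwce-app and that the JVM memory pool metrics are the ones we want to discard:

scrape_configs:
  - job_name: 'bwce-app'
    metric_relabel_configs:
      # Drop every scraped series whose metric name matches the regex,
      # before it is ever written to the TSDB
      - source_labels: [__name__]
        regex: 'jvm_memory_pool_.*'
        action: drop

And for the series that are already stored, the Admin API that we enabled before exposes a delete_series endpoint (again assuming a Prometheus server on localhost:9090):

curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=jvm_memory_pool_bytes_used'

This is great, but what if all the series shown here are relevant? What can we do about it?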

Mmmm, maybe we can use PromQL to monitor this (a dogfooding approach). So, if we would like to extract the same information using PromQL, we can do it with the following query:

topk(10, count by (__name__)({__name__=~".+"}))
[Image: Top 10 metric series generated and stored in the time-series database]
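As a side note, if you only want to track the overall cardinality over time, a plain count of every series in the head block also works:

count({__name__=~".+"})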

And now we have all the power in our hands. For example, let's look not at the 10 most relevant series but at the 100 most relevant, or apply any other filter that we need. Let's see, for instance, the JVM metrics that we discussed at the beginning. We can do that with the following PromQL query:

topk(100, count by (__name__)({__name__=~"jvm.+"}))
[Image: Top 100 metric series related to JVM metrics]

So we can see that we have at least 150 series belonging to metrics that I am not using at all. But let's do even better and look at the same information grouped by job name:

topk(10, count by (job,__name__)({__name__=~".+"}))
[Image: Top 10 metric series count, with the job that is generating them]
