Apache Pulsar vs Apache Kafka: Architecture, Cloud-Native Design, and When Pulsar Makes Sense

Apache Kafka seems to be the default choice in today's architectures, but we should ask ourselves whether it is the right one for our needs.

Nowadays we're in a new age of Event-Driven Architecture, and this is not the first time we've lived through it. Before microservices and the cloud, EDA was the new normal in enterprise integration. Protocols like JMS or AMQP, based on different kinds of standards, were used in broker-based products like TIBCO EMS, ActiveMQ, or IBM WebSphere MQ, so this approach is nothing new.

With the rise of microservices architectures and the API-led approach, it seemed we had forgotten about the importance of messaging systems, and we had to go through the same challenges we saw in the past before arriving at new messaging solutions. So we're coming back to EDA and the pub/sub mechanism to help us decouple consumers and producers, moving from orchestration to choreography. All these concepts fit better in today's world, with more and more independent components that need to cooperate and integrate.

During this effort we started to look at new technologies to help us implement it again. But in this new reality we moved away from heavy protocols and standards like JMS and started to consider other options. And we need to admit there is a new king in this area, one of those components that seem to appear in today's architectures no matter what: Apache Kafka.

And don't get me wrong: Apache Kafka is fantastic. It has been proven in production for a long time; it is performant, with impressive replay capabilities and a powerful API that eases integration. But Apache Kafka has some challenges in this cloud-native world, because it doesn't play so well with some of its rules.

If you have used Apache Kafka for some time, you are aware of its particular challenges. Its architecture comes from its LinkedIn days in 2011, when Kubernetes, or even Docker and container technologies, were not a thing, which makes running Apache Kafka (a purely stateful service) in containers quite complicated. There are improvements, such as Helm charts and operators, to ease the journey, but it still doesn't feel like the pieces integrate well in that fashion. Another pain point is geo-replication: even with components like MirrorMaker, it is not something that works smoothly or feels integrated.

Other technologies are trying to provide those capabilities, and one of them is another Apache Foundation project, donated by Yahoo!, named Apache Pulsar.

Don’t get me wrong; this is not about finding a new truth, that single messaging solution that is perfect for today’s architectures: it doesn’t exist. In today’s world, with so many different requirements and variables for the different kinds of applications, one size fits all is no longer true. So you should stop thinking about which messaging solution is the best one, and think more about which one serves your architecture best and fulfills both technical and business requirements.

We have covered different ways for general communication, with several specific solutions for synchronous communication (service mesh technologies and protocols like REST, GraphQL, or gRPC) and different ones for asynchronous communication. We need to go deeper into the asynchronous communication to find what works best for you. But first, let’s speak a little bit more about Apache Pulsar.

Apache Pulsar

Apache Pulsar, as mentioned above, was developed internally at Yahoo! and donated to the Apache Foundation. As stated on its official website, there are several key points to mention as we start exploring this option:

  • Pulsar Functions: Easily deploy lightweight compute logic using developer-friendly APIs without needing to run your stream processing engine
  • Proven in production: Apache Pulsar has run in production at Yahoo scale for over three years, with millions of messages per second across millions of topics
  • Horizontally scalable: Seamlessly expand capacity to hundreds of nodes
  • Low latency with durability: Designed for low publish latency (< 5ms) at scale with strong durability guarantees
  • Geo-replication: Designed for configurable replication between data centers across multiple geographic regions
  • Multi-tenancy: Built from the ground up as a multi-tenant system. Supports Isolation, Authentication, Authorization, and Quotas
  • Persistent storage: Persistent message storage based on Apache BookKeeper. Provides IO-level isolation between write and read operations
  • Client libraries: Flexible messaging models with high-level APIs for Java, C++, Python and Go
  • Operability: REST Admin API for provisioning, administration, tools, and monitoring. Deploy on bare metal or Kubernetes.

So, as we can see, by design Apache Pulsar addresses some of the main weaknesses of Apache Kafka, such as geo-replication and the cloud-native approach.

Apache Pulsar supports the pub/sub pattern, but it also provides capabilities that place it alongside traditional queue messaging systems, with its concept of exclusive subscriptions where only one subscriber receives the messages. It also provides interesting concepts and features found in other messaging systems:

  • Dead Letter Topics: For messages that could not be processed by the consumer.
  • Persistent and Non-Persistent Topics: To decide whether or not you want your messages persisted while they are in transit.
  • Namespaces: To have a logical grouping of your topics, so applications can be grouped in namespaces (as we do, for example, in Kubernetes) and isolated from one another.
  • Failover: Similar to exclusive, but when the attached consumer fails, another one takes over processing the messages.
  • Shared: Provides a round-robin approach similar to traditional queue messaging systems: all subscribers are attached to the topic, but only one receives each message, so the load is distributed across all of them.
  • Multi-topic subscriptions: To subscribe to several topics using a regular expression (similar to the subject-based approach of TIBCO Rendezvous in the 90s, which was so powerful and popular).

But if you require features from Apache Kafka, you still have similar concepts such as partitioned topics, key-shared subscriptions, and so on. So you have everything at hand to choose the configuration that works best for you and your specific use cases, and you can also mix and match.
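
To make those subscription types a bit more tangible, here is a minimal sketch using the pulsar-client-go library. The topic and subscription names are just examples, and the broker URL assumes a local standalone instance:

package main

import (
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Exclusive: only one consumer may attach to the subscription.
	// Failover:  several consumers attach, but only one is active at a time.
	// Shared:    messages are distributed round-robin across all attached consumers.
	// KeyShared: messages with the same key always go to the same consumer.
	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:            "persistent://public/default/orders",
		SubscriptionName: "orders-workers",
		Type:             pulsar.Shared, // swap for pulsar.Exclusive, pulsar.Failover or pulsar.KeyShared
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
}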

Apache Pulsar Architecture

Apache Pulsar Architecture is similar to other comparable messaging systems today. As you can see in the picture below from the Apache Pulsar website, those are the main components of the architecture:

Apache Pulsar architecture diagram (from the Apache Pulsar website)
  • Brokers: One or more brokers handle incoming messages from producers and dispatch them to consumers
  • BookKeeper cluster: For persistent storage of messages
  • ZooKeeper cluster: For management and coordination purposes

So you can see this architecture is quite similar to Apache Kafka's, with the addition of a new component: the BookKeeper cluster.

Brokers in Apache Pulsar are stateless components that mainly run two pieces:

  • An HTTP server that exposes a REST API for management and is used by consumers and producers for topic lookup (a quick example against the admin API follows this list).
  • A dispatcher, a TCP server using a custom binary protocol for all data transfers. Messages are usually dispatched out of a managed ledger cache for performance; if the backlog grows beyond the cache, the broker reads the entries back from the BookKeeper cluster.
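
As a quick illustration of that REST API, listing the topics of the default namespace of a local standalone broker (which exposes the admin API on port 8080) is a plain HTTP call. This is only a sketch and assumes the v2 admin API and the public/default namespace:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Ask the broker's REST admin API for the topics in the public/default namespace.
	resp, err := http.Get("http://localhost:8080/admin/v2/persistent/public/default")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // e.g. ["persistent://public/default/counter"]
}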

To support global replication (geo-replication), the brokers manage replicators that tail the entries published in the local region and republish them to the remote regions.

The Apache BookKeeper cluster is used as persistent message storage. Apache BookKeeper is a distributed write-ahead log (WAL) system that manages how and when messages are persisted. It supports horizontal scaling based on the load and can manage multiple logs. Not only are messages persisted, but also the cursors, which track a consumer's position on a specific topic (similar to the offset in Apache Kafka terminology).

Finally, the ZooKeeper cluster plays the same role as in Apache Kafka: it stores the metadata and configuration for the whole system.

Hello World using Apache Pulsar

Let's see how we can create a quick "Hello World" using Apache Pulsar, and to do that we're going to implement it in a cloud-native fashion: a single-node Apache Pulsar cluster, a producer application built with Flogo, and a consumer application written in Go. Something similar to what you can see in the diagram below:

Diagram about the test case we're doing

And we're going to keep it simple, so we will just use plain Docker this time. First of all, let's spin up the Apache Pulsar server with the following command:

docker run -it -p 6650:6650 -p 8080:8080 --mount source=pulsardata,target=/pulsar/data --mount source=pulsarconf,target=/pulsar/conf apachepulsar/pulsar:2.5.1   bin/pulsar standalone

And we will see an output similar to this one:

Output of the Apache Pulsar standalone server starting up

Now, we need to create simple applications, and for that, Flogo and Go will be used.

Let's start with the producer. In this case, we will use the open-source version of Flogo to create a quick application.

First of all, we will just use the Web UI (dockerized) to do that. Run the command:

docker run -it -p 3303:3303 flogo/flogo-docker eula-accept

And we install a new contribution to enable the Pulsar publisher activity. To do that, we click on the "Install new contribution" button and provide the contribution reference shown in the following command:

flogo install github.com/mmussett/flogo-components/activity/pulsar

And now we will create a simple flow as you can see in the picture below:

Sample Flogo flow for the producer application

We will now build the application using the menu, and that’s it!

To run it, just launch the generated application binary as you can see here:

./sample-app_linux_amd64

Now we just need to create the Go consumer. To be able to do that, we first need to install the Go client package:

go get github.com/apache/pulsar-client-go/pulsar

And now we need to create the following code:

package main

import (
	"fmt"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	channel := make(chan pulsar.ConsumerMessage, 100)

	options := pulsar.ConsumerOptions{
		Topic:            "counter",
		SubscriptionName: "my-subscription",
		Type:             pulsar.Shared,
	}
	options.MessageChannel = channel

	consumer, err := client.Subscribe(options)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Receive messages from the channel. The channel returns a struct that contains the message and the consumer
	// from which the message was received. That is not strictly necessary here, since we have a single consumer,
	// but the channel could be shared across multiple consumers as well.
	for cm := range channel {
		msg := cm.Message
		fmt.Printf("Received message msgId: %v -- content: '%s'\n", msg.ID(), string(msg.Payload()))
		consumer.Ack(msg)
	}
}
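
By the way, if you prefer to drive the whole test from Go instead of building the Flogo producer, a minimal producer for the same counter topic could look like the sketch below (it assumes a recent version of pulsar-client-go and the standalone broker on localhost):

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{Topic: "counter"})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	for i := 0; i < 10; i++ {
		// Send is synchronous; SendAsync is also available when throughput matters.
		if _, err := producer.Send(context.Background(), &pulsar.ProducerMessage{
			Payload: []byte(fmt.Sprintf("hello world %d", i)),
		}); err != nil {
			log.Fatal(err)
		}
	}
}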

And after running both programs, you can see the following output. As you can see, we were able to make both applications communicate in an effortless way.

Output of the producer and consumer applications exchanging messages

This article is just a starting point, and we will continue talking about how to use Apache Pulsar in your architectures. If you want to take a look at the code we’ve used in this sample, you can find it here:

Welcome to the AsyncAPI Revolution!

We're living in an age where technologies are switching and standards are changing all the time. You forget to read Medium/Stack Overflow/Reddit for a while and you find there are at least five (5) new industry standards taking the place of the existing ones you know (the ones that were released barely a year ago 🙂).

Do you still remember the old days when SOAP was the unbeatable format? How much time did we spend building SOAP services in our enterprises? REST replaced it as the new standard, but just a few years later we're back in a new battle, just for synchronous communication: gRPC and GraphQL are here to conquer everything again. It is crazy, huh?

But the situation is similar for asynchronous communication. Asynchronous communication has been here for a long time, long before Event-Driven Architecture or Streaming were "cool" terms or things you had to be aware of.

We've been using these patterns for a long time in our companies. Big enterprises have been using this model in their enterprise integrations for ages. Pub/sub-based protocols and technologies like TIBCO Rendezvous have been in use since the late 90s, and then we also incorporated more standards-based approaches like JMS, using different kinds of servers, to have all this event-based communication.

But now, with the cloud-native revolution and the need for distributed computing, more agility, and more scalability, centralized solutions are not valid anymore, and we've seen an explosion in the number of options to communicate based on these patterns.

You could think that this is the same situation we were discussing at the beginning of this article regarding REST's predominance and new cutting-edge technologies trying to replace it, but this is something quite different, because experience has told us that one size doesn't fit all.

You cannot find a single technology or component that can cover all the communication needs of all your use cases. You can name any technology or protocol you want: Kafka, Pulsar, JMS, MQTT, AMQP, Thrift, FTL, and so on.

Think about each of them and you will probably find some use cases where one technology plays better than the others, so it makes no sense to try to find a single technology to cover all the needs. What is needed is a polyglot approach, where different technologies play well together and you use the one that works best for each use case (the right tool for the right job), just as we do for the different technologies we deploy in our clusters.

We're probably not going to use the same technology for a Machine Learning-based microservice as for a streaming application, right? The same principle applies here.

But the problem when we talk about different technologies playing together is standardization. If we think about REST, gRPC, or GraphQL, even though they're different, they share common ground: they rely on the same underlying HTTP protocol as a standard, so it is easy to support all of them in the same architecture.

But this is not true for asynchronous communication technologies, and I'd like to focus on standardization and specification today. That's what the AsyncAPI Initiative is trying to solve. To define what AsyncAPI is, I'd like to use their own words from their official website:

AsyncAPI is an open source initiative that seeks to improve the current state of Event-Driven Architectures (EDA). Our long-term goal is to make working with EDA’s as easy as it is to work with REST APIs. That goes from documentation to code generation, from discovery to event management. Most of the processes you apply to your REST APIs nowadays would be applicable to your event-driven/asynchronous APIs too.

So, their goal is to provide a set of tools to make life better in all those EDA architectures that companies have, or are starting to have, and everything pivots around one thing: the AsyncAPI specification.

Similar to the OpenAPI specification, it allows us to define a common interface for our EDA interfaces, and the most important part is that it is protocol-agnostic: the same specification can be used for your MQTT-based API or your Kafka API. Let's take a look at what an AsyncAPI specification looks like:

AsyncAPI 2.0 definition (from https://www.asyncapi.com/docs/getting-started/coming-from-openapi/)

As you can see, it is very similar to OpenAPI 3.0, and that was done on purpose: to ease the transition between OpenAPI 3.0 and AsyncAPI and to bring both worlds together. It is all about APIs, no matter whether they're synchronous or asynchronous, providing the same benefits and ecosystem from one to the other.

Show me the code!!

But let's stop talking and start coding. To do that, I'd like to use one of the tools that, in my view, has the greatest support for AsyncAPI, and that is Project Flogo.

You probably remember some of the posts I've written about Project Flogo and TIBCO Flogo Enterprise as a great technology for your microservices development (low-code/all-code approach, Golang-based, with a lot of connectors and open-source extensions as well).

But today we're going to use it to create our first AsyncAPI-compliant microservice, and we're going to rely on it because it provides a set of extensions to support the AsyncAPI initiative, as you can see here:

So the first thing we're going to do is create our AsyncAPI definition. To keep it simple, we're going to use the sample definition available in the AsyncAPI documentation with one small change: we're going to switch from the AMQP protocol to the Kafka protocol, because that's what is cool these days, isn't it? 😉

asyncapi: '2.0.0'
info:
  title: Hello world application
  version: '0.1.0'
servers:
  production:
    url: broker.mycompany.com
    protocol: kafka
    description: This is "My Company" broker.
    security:
      - user-password: []
channels:
  hello:
    publish:
      message:
        $ref: '#/components/messages/hello-msg'
  goodbye:
    publish:
      message:
        $ref: '#/components/messages/goodbye-msg'
components:
  messages:
    hello-msg:
      payload:
        type: object
        properties:
          name:
            type: string
          sentAt:
            $ref: '#/components/schemas/sent-at'
    goodbye-msg:
      payload:
        type: object
        properties:
          sentAt:
            $ref: '#/components/schemas/sent-at'
  schemas:
    sent-at:
      type: string
      description: The date and time a message was sent.
      format: datetime
  securitySchemes:
    user-password:
      type: userPassword

As you can see, it is something simple: two operations, "hello" and "goodbye", with an easy payload (a Go mapping of these fields is sketched right after this list):

  • name: Name that we’re going to use for the greeting.
  • sentAt: The date and time a message was sent.
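
In Go terms, the hello-msg payload maps to something as simple as the following struct (purely illustrative; it is not generated by any tooling):

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// HelloMsg mirrors the hello-msg payload from the AsyncAPI document above.
type HelloMsg struct {
	Name   string `json:"name"`
	SentAt string `json:"sentAt"` // date-time carried as a string, as in the spec
}

func main() {
	msg := HelloMsg{Name: "hello world", SentAt: time.Now().Format(time.RFC3339)}
	out, _ := json.Marshal(msg)
	fmt.Println(string(out)) // {"name":"hello world","sentAt":"..."}
}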

So the first thing we're going to do is create a Flogo application that complies with that AsyncAPI specification:

git clone https://github.com/project-flogo/asyncapi.git
cd asyncapi/
go install

Now we have the generator installed, so we only need to execute it, providing our YAML file as the input, with the following command:

asyncapi -input helloworld.yml -type flogodescriptor

And it will create a HelloWorld application for us that we need to tweak a little bit. To get you up and running quickly, I'm sharing the code in my GitHub repository so you can borrow from it (but I really encourage you to take the time to look at the code and see the beauty of Flogo app development 🙂):

https://github.com/project-flogo/asyncapi

Now that we have the app, we have a simple dummy application that receives messages complying with the specification and, in our case, just logs the payload. This can be our starting point to build new Event-Driven Microservices compliant with AsyncAPI.
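
To picture what that behaviour means without any framework involved, a hand-written Kafka consumer that just logs the payload of the hello topic would look roughly like this (a sketch using the Shopify/sarama client as an assumption; the generated Flogo app is not built this way):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Read the single partition of the auto-created "hello" topic, starting from the newest offset.
	pc, err := consumer.ConsumePartition("hello", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("received payload: %s", string(msg.Value))
	}
}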

So, let's try it. To do so, we need a few things. First of all, we need a Kafka server running, and to do that quickly we're going to leverage the following docker-compose.yml file:

version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper:3.4.6
    expose:
    - "2181"
  kafka:
    image: wurstmeister/kafka:2.11-2.0.0
    depends_on:
    - zookeeper
    ports:
    - "9092:9092"
    environment:
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181

And to run it we just need to fire the following command from the same folder where we saved this file as docker-compose.yml:

docker-compose up -d

And after doing that, we just need a sample application, and what better than using Flogo again to create it. This time, let's use the graphical Web UI to create it right away:

Simple Flogo application sending an AsyncAPI-compliant message each minute using Kafka as the protocol

So we just need to configure the Publish Kafka activity, providing the broker (localhost:9092), the topic (hello), and the message:

{
  "name": "hello world",
  "sentAt": "2020-04-24T00:00:00"
}
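
And if you want to send that same test message without Flogo, a minimal Go producer does the trick as well (again just a sketch, assuming the Shopify/sarama client and the broker from the docker-compose file above):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by the synchronous producer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	payload := `{"name": "hello world", "sentAt": "2020-04-24T00:00:00"}`
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "hello",
		Value: sarama.StringEncoder(payload),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("message stored in partition %d at offset %d", partition, offset)
}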

And that’s it! Let’s run it!!!:

First we start the AsyncAPI Flogo Microservice:

AsyncAPI Flogo microservice started!

And then we just launch the tester, which is going to send the same message each minute, as you can see in the picture below:

Sample tester sending sample messages

And each time we send that message, it is going to be received by our AsyncAPI Flogo microservice:

Messages received by the AsyncAPI Flogo microservice

So, I hope this first introduction to the AsyncAPI world has been of interest to you. Don't forget to take a look at more resources on their own website:

Starting with TIBCO(R) Messaging — Apache Kafka Distribution (I) Overview and Installation

If you are familiar with the Enterprise Integration world, you surely know Kafka, one of the most famous Apache Foundation projects of recent years. And if you are in the Integration world, you also know TIBCO Software and some of our flagship products, like TIBCO ActiveMatrix BusinessWorks for integration, TIBCO Cloud Integration as our iPaaS, TIBCO AMX BPM, TIBCO BusinessEvents... and I could continue that list over and over 🙂

This article is part of my comprehensive TIBCO Integration Platform Guide where you can find more patterns and best practices for TIBCO integration platforms.

But you probably don't know about TIBCO(R) Messaging — Apache Kafka Distribution. This is one part of our global messaging solution, named TIBCO Messaging, which is composed of several components:

  • TIBCO Enterprise Message Service (aka TIBCO EMS) is our JMS 2.0-compliant server, one of our messaging standards for over a decade.
  • TIBCO FTL is our cloud-ready messaging solution, using a direct, decentralized, and highly performant pub/sub communication system.
  • TIBCO(R) Messaging — Apache Kafka Distribution is designed for efficient data distribution and stream processing with the ability to bridge Kafka applications to other TIBCO Messaging applications powered by TIBCO FTL(R), TIBCO eFTL(TM) or TIBCO Enterprise Message Service(TM).
  • TIBCO(R) Messaging — Eclipse Mosquitto Distribution includes a lightweight MQTT broker and C library for MQTT client in the Core package and a component for bridging MQTT clients to TIBCO FTL applications in the Bridge package.

I'd like to keep this post fully technical, but I'm going to leave here a piece of info about product editions that you could find interesting, because we have a Community Edition of this whole messaging solution that you can use yourself.

TIBCO Messaging software is available in a community edition and an enterprise edition.

TIBCO Messaging — Community Edition is ideal for getting started with TIBCO Messaging, for implementing application projects (including proof of concept efforts) for testing, and for deploying applications in a production environment. Although the community license limits the number of production processes, you can easily upgrade to the enterprise edition as your use of TIBCO Messaging expands. The community edition is available free of charge, with the following limitations and exclusions:

● Users may run up to 100 application instances or 1000 web/mobile instances in a production environment.

● Users do not have access to TIBCO Support, but you can use TIBCO Community as a resource (http://community.tibco.com).

TIBCO Messaging — Enterprise Edition is ideal for all application development projects, and for deploying and managing applications in an enterprise production environment. It includes all features presented in this documentation set, as well as access to TIBCO Support.

You can read that info here, but please also take the time to read our official announcement, which you can find here.

So, this is going to be the first of a few posts about how to integrate Kafka into the usual TIBCO ecosystem and its different technologies. This series assumes that you already know about Apache Kafka; if you don't, please take a look at the following reference before moving forward:

So, now we are going to start by installing this distribution on our machine. In my case I'm going to use a UNIX-based target, but this software is available for macOS, Windows, or whatever OS you're using.

The installation process is quite simple because the distribution is based on the most common Linux packaging formats: it provides a deb package, an rpm package, or even a tar package, so you can use whatever fits your current distribution. In my case, as I'm using CentOS, I went with the rpm package and everything went smoothly.

And after that I have a pretty much standard Kafka distribution installed in my /opt/tibco folder. To start it, we need to start the ZooKeeper server first and then the Kafka server itself. And that's it: everything is running!

Apache Kafka Distribution up & running

Mmmm... but how can I be sure of it? Kafka doesn't provide a GUI to manage or monitor it, but there are a bunch of tools out there to do that. In my case I'm going to use Kafka Tool, because it doesn't need additional components like Kafka REST and so on. Keep in mind there are other options with "prettier" UIs, but this one does the job just fine.

So, after installing Kafka Tool, we only need to provide the address where ZooKeeper is listening (if you keep everything by default, it is going to be listening on 2181) and the Kafka version we're using (in this case, 1.1), and then you can be sure everything is up & running and working as expected:

Kafka Tool could be used to monitor your Kafka brokers
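
If you'd rather verify it from code instead of a GUI, a small Go program with a Kafka client (here the Shopify/sarama library, purely as an example) can connect to the broker and list the existing topics:

package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	// Connect to the broker started above (the default listener is on port 9092).
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topics, err := client.Topics()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("topics:", topics)
}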

So, now we are going to do a quick test using our flagship integration product, TIBCO AMX BusinessWorks, which has a Kafka plug-in, so it can communicate with this new server we just launched. This is going to be only a Hello World with the following structure:

  • Process A is going to send a Hello! to Process B.
  • Process B is going to receive that message and print it in a log.

The processes are going to be developed just like this:

Kafka Sender Test Process
Kafka Receiver Test Process

And that’s the output we get after executing both processes:

Correct execution using TIBCO Business Studio

And we can see, using Kafka Tool, that the sample topic we used for the communication and its default partition have been created automatically:

Topic has been created on-demand

As you can see, it is easy and straightforward to have it all configured in a default way. After that, we will continue going deeper into this new component in the TIBCO world!