Why We Love Kafka's Open Source Data Pipelines

Apache Kafka is a popular choice for organizations that need to collect, move, and store large amounts of data. In fact, more than a third of the Fortune 500 companies use Kafka, including PayPal, Microsoft, and Uber. 

In addition to top tech companies adopting Kafka, it's gaining popularity among organizations that are usually slow to adopt new technology.

So why is it taking enterprises by storm?

As a data pipeline, Kafka offers unique solutions to companies that need scalable streaming. Its open-source structure makes it ideal as a message queue, since it can continue to grow and be refined by some of the industry's top innovators. It can be used for a variety of purposes: metrics, messaging, and stream processing are just a few.

So it's a message queue, but it's also more than a message queue.

Real-time data is non-negotiable for many large companies, and Apache Kafka can harness these large data streams to provide a solution for a wide variety of businesses.

Here’s what you need to know about Kafka and why it’s a popular choice for many enterprises.

First, what sparked Apache Kafka?

In 2010, LinkedIn was quickly growing its user base and business. However, this started to create serious issues with data. 

For one thing, the engineers and developers needed a better option for the various real-time applications on their site, such as LinkedIn’s newsfeed. At the time, they relied on several existing services, but those proved to be a poor fit and were hard to manage as LinkedIn continued to grow.

Additionally, LinkedIn had issues with its data pipeline. They operated many data systems, which were all connected and tangled together. It was an overly complex system, and they struggled to get the reliable feeds of data they needed to accomplish tasks. At one point, while working with data, they found discrepancies between two sets. They ran a third set of data to see which was correct, only to find that it was completely different as well! 

At each turn, developers were forced to bolt on more complexity, which slowed them down. So they decided to build technology that could transport large amounts of data and feed it into real-time applications.

The result was Kafka, an open-source stream-processing software platform. It provides real-time analytics on top of a streaming data architecture.

Kafka works as a universal pipeline that each application or stream processor can tap into. It is highly scalable, fault-tolerant, and able to grow along with enterprises as they increase their data, applications, and demands.

Kafka gave LinkedIn a much faster messaging system that can handle incredible amounts of data. In fact, LinkedIn streams more than 1.1 trillion messages per day through Kafka.

Plus, Kafka's open-source message queue is a standout among the other options. 

Kafka's Open-Source Message Queue vs. the Others

There is an incredible amount of data in our world today, and it continues to explode at record rates. An estimated 200 billion smart machines will be connected to the internet by 2020, and all of them will continue to produce exponentially more data.

However, that brings us to a new problem: storing and using all of this data. An open-source message queue lets you do just that. There are a lot of options when looking for the right message queue for your business. (And although we are message queue agnostic, we use Kafka internally; that's how much we love it.)

Yes, Kafka was created for LinkedIn, but its open-source structure means that innovators from around the world have been able to keep improving and refining it. Some of the largest technical organizations in the world, such as Microsoft and Netflix, contribute to its development to help it work even better at scale.

What are Kafka’s strengths over its competitors?

When it comes to finding the right message queue, there are many options and strong competitors. From Kinesis and Pub/Sub to Databus and RabbitMQ, each offers a different take on the message queue.

Superior Performance. Kafka is faster, more stable, and more durable than its competitors. Its robust replication provides tunable consistency, and it offers a reliable, fault-tolerant solution.

Kafka combines the capabilities of a pub/sub system with those of a message queue. As a result, it works well as a replacement for a traditional message broker at larger scale. Its queueing semantics also suit microservices and make the flow of data between them easier.

Scalable. More than any of its competitors, Kafka can handle large amounts of data. Its ability to scale makes it an ideal choice for larger companies and data-heavy industries, whose volume and responsiveness requirements leave other options falling short. With Kafka, they can aggregate, transform, and load data into other stores at high volume in real time. (A short sketch of how topics are partitioned and replicated appears at the end of this section.)

Ease of Use. Kafka was created to unify and simplify large volumes of data that can often become complex. As a result, it is fairly straightforward to set up and implement.

Wide Range of Capabilities. One of the reasons many companies turn to Kafka is its ability to adapt to a wide range of systems—especially ideal for users who need data flow across multiple systems and applications. It is compatible with systems such as: 

  • Web and desktop custom applications
  • NoSQL, Oracle, Hadoop, SFDC
  • Monitoring and analytical systems
  • Microservices
  • Any needed sinks or sources

Kafka can process data streams in real time, publish and subscribe to them at scale, and store them for highly available deployments.
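To make the partitioning and replication behind that scalability concrete, here is a minimal sketch using the Java AdminClient that ships with Kafka. The broker address, topic name, partition count, and replication factor are placeholders for illustration, not values from this article.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions let consumers share the load in parallel;
            // a replication factor of three keeps copies on three brokers
            // so the topic survives a broker failure.
            NewTopic pageViews = new NewTopic("page-views", 6, (short) 3);
            admin.createTopics(Collections.singleton(pageViews)).all().get();
        }
    }
}
```

Adding capacity later is then largely a matter of adding partitions and brokers rather than redesigning the pipeline.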

Beyond Message Queues: Use Cases for Kafka

Kafka is popular for a wide variety of use cases. As we mentioned, it's more than a message queue, and its ability to serve a wide range of needs makes it ideal for large companies. Some top use cases include:

Metrics 

Kafka was first developed as a means of tracking website activity, and this original use case involves very high message volumes. Kafka can also load this information into Hadoop or offline data warehousing systems for more thorough processing and reporting.

Each activity, such as a search, an upload, or a page view, is published to one topic per activity type. This keeps the data accurate and allows for both real-time processing and monitoring.
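As a rough illustration of the one-topic-per-activity-type pattern, here is a short sketch with the standard Java producer client. The broker address, topic name, key, and payload are made up for the example.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PageViewProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Page views go to the "page-views" topic; searches and uploads
            // would each get their own topic in the same way.
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/jobs\"}"));
        }
    }
}
```

Keying each record by user means all of a user's page views land in the same partition, which keeps them in order for downstream processing.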

Messaging

The capabilities of Kafka make it an ideal replacement for traditional message brokers. It is highly scalable and can be used for almost any number of applications and consumers. Messaging is often a comparatively low-throughput workload, but it may require low end-to-end latency, which Kafka delivers.

Kafka also offers better replication, built-in partitioning, and fault tolerance than its competitors. Many message brokers start to slow down as volume grows; Kafka's messaging scales out and stays fast no matter how much is needed.
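To show how Kafka can stand in for a traditional broker, here is a sketch of a consumer joining a consumer group; Kafka splits the topic's partitions among the group's members, so adding consumers spreads the load like a work queue. The broker address, group ID, and topic name are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class OrderConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "order-service");           // consumers sharing this ID split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Running the same program under a second group ID gives pub/sub semantics instead: each group receives every message.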

Log Aggregation 

Another popular use case for Kafka is as a log solution. Kafka lets you publish an event for everything that happens in an application, and other parts of the system can then subscribe to those events and take action.

A typical log aggregation system collects log files from servers and puts them in a central place for processing. Because Kafka abstracts away the details of individual files, it offers a cleaner abstraction of log and event data as a stream of messages. That makes it easier to support multiple data sources while still offering excellent performance and stronger durability than typical log-centric systems.
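One way a central log processor might tap into those event streams is by subscribing to every log topic with a regex pattern. The "logs." topic-naming prefix and the other values here are assumptions for this sketch, not anything prescribed by Kafka or this article.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

public class LogAggregatorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "log-aggregator");          // hypothetical group ID
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Each application publishes its events to its own topic, e.g.
            // "logs.checkout" or "logs.search"; the pattern picks up all of them.
            consumer.subscribe(Pattern.compile("logs\\..*"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("[%s] %s%n", record.topic(), record.value());
                }
            }
        }
    }
}
```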

Stream Processing

More recently, Kafka added Kafka Streams. It works as an alternative to other streaming platforms such as Spring Cloud Data Flow and Google Cloud Dataflow.

As companies grow, adopt more applications or microservices, and see their data needs grow along with them, streaming can become complicated. Before Kafka was developed at LinkedIn, engineers described their streaming setup as resembling "spaghetti." Kafka simplifies the process: it can stream data from one source to the next without complex routing.
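Here is a minimal sketch of that idea with the Kafka Streams API: read from one topic, transform, and write to another, with no extra routing layer in between. The application ID and topic names are placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-filter");  // hypothetical app ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from one topic, filter, and write to another; no extra routing layer.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");
        views.filter((user, page) -> page.contains("/jobs"))
             .to("job-page-views");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the topology is just Java code running inside the application, there is no separate streaming cluster to operate.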

Basically, we love Kafka because it brings simplicity, durability, and innovation to the complex data architecture of large organizations. Since it's open source, the top companies in the world add to its capabilities and continue to improve its offerings, providing continual optimization. It sustains long-term growth.

While other options may be more affordable and check the right boxes (and might be a better fit for some!), companies that require large amounts of data, microservices, and messaging capabilities will likely find Kafka is the ideal solution.