Building a real-time data pipeline using Spark Streaming and Kafka

Opcito Technologies
2 min readJul 13, 2018

In one of our previous blogs, Aashish gave us a high-level overview of data ingestion with Hadoop Yarn, Spark, and Kafka. Now it’s time to take a plunge and delve deeper into the process of building a real-time data ingestion pipeline. To do that, there are multiple technologies which you can use to write your own Spark streaming applications like Java, Python, Scala and so on. In this blog, I will explain how you can build one using Python.

This solution typically finds application in scenarios where businesses are struggling to make sense of the data collected over customer habits and preferences to be able to make smarter business decisions. The problem with the traditional approach of processing the data (before analyzing it) in a sequential manner is the bottleneck which takes time typically up to weeks. But with the approach which I am about to explain you can bring down this time significantly.

Before going through the modules of the solution, let’s take a quick look at the tools that I am going to use in this process:

  • Spark: Apache Spark is an open source and flexible in-memory framework which serves as an alternative to map-reduce for handling batch, real-time analytics, and data processing workloads. It provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. I have used Spark, in the solution which I am about to explain, for improving the processing time.
  • Kafka: Apache Kafka is an open source distributed streaming platform which is useful in building real-time data pipelines and stream processing applications. I have used Kafka for internal communication between the different streaming jobs.
  • HBase: Apache HBase is an Open source distributed column-oriented NoSQL database that runs on top of Hadoop Distributed File System (HDFS). It is natively integrated with the Hadoop ecosystem and is designed to provide quick random access to huge amounts of structured data. I have used HBase in the final stage of the solution which I have implemented for one of our clients to store the processed data.

The workflow of the solution has 4 main stages which include Transformation, Cleaning, Validation, and Writing of the data received from the various sources.

Data Transformation:
This is an entry point for the streaming application. This module/application performs the operations related to the normalization of data and helps in converting the different keys and values received from various sources to respective associated forms. In Spark streaming, the transformation of data can be performed by using built-in functions like map, filter, foreachRDD, etc…read more.



Opcito Technologies

Product engineering experts specializing in DevOps, Containers, Cloud, Automation, Blockchain, Test Engineering, & Open Source Tech