Connecting the Dots (Python, Spark, and Kafka) | by Kiruparan Balachandran


Python, Spark, and Kafka are essential frameworks in a data scientist’s daily work, so it is important to enable data scientists to integrate them.

Photo By César Gaviria from Pexels


Data scientists frequently prefer Python (in some cases, R) for developing machine learning models. They have a valid justification: data-driven solutions involve many experiments, experiments demand a lot of interaction with the language used to develop the models, and the libraries and platforms available in Python for developing machine learning models are tremendous. This is a valid argument; however, we run into problems when these models are moved to production.

  • Starting Kafka (for more details, please refer to this article).
  • Creating a PySpark app to consume and process the events and write them back to Kafka.
  • Steps to produce and consume events using Kafka-Python.

The latest version of Apache Spark is available at

tar -xvf spark-2.3.2-bin-hadoop2.7.tgz
ln -s /home/xxx/spark-2.3.2-bin-hadoop2.7/ /home/xxx/spark
# set Spark-related environment variables
export SPARK_HOME=/home/xxx/spark
export PATH=$SPARK_HOME/sbin:$PATH
Spark Master Web GUI (image by author); master URL: spark://abc.def.ghi.jkl:7077
Spark Master Web GUI with workers (image by author)

Here, Kafka is a streaming platform that helps produce events to, and consume events from, the Spark platform.

cd /home/xxx/IQ_STREAM_PROCESSOR/kafka_2.12-2.0.0/
bin/ config/
bin/ config/

Setup Spark

Step 1

export PYSPARK_PYTHON='/home/xxx/anaconda3/bin/python'
cd /home/xxx/anaconda3/bin/
pip install kafka-python
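With kafka-python installed, producing and consuming events from plain Python could look roughly like the sketch below. It is a minimal illustration, not the article's original code: the topic names `input_event` and `output_event`, the broker address `localhost:9092`, and the JSON encoding of events are all assumptions made for the example.

```python
import json


def encode_event(event):
    """Serialize a dict to UTF-8 JSON bytes (Kafka message values are raw bytes)."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_event(raw):
    """Inverse of encode_event: UTF-8 JSON bytes back to a dict."""
    return json.loads(raw.decode("utf-8"))


def produce_events(events, topic="input_event", servers="localhost:9092"):
    """Send each event to a topic. Topic and broker address are assumed values."""
    # Imported lazily so the helpers above can be used without kafka-python.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers,
                             value_serializer=encode_event)
    for event in events:
        producer.send(topic, event)
    producer.flush()   # block until buffered messages are delivered
    producer.close()


def consume_events(topic="output_event", servers="localhost:9092"):
    """Yield decoded events from a topic, starting at the earliest offset."""
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(topic,
                             bootstrap_servers=servers,
                             value_deserializer=decode_event,
                             auto_offset_reset="earliest")
    for message in consumer:
        yield message.value
```

Calling `produce_events([{"id": 1, "text": "hello"}])` would publish one event, and iterating over `consume_events()` would block while waiting for processed events; both need a running broker.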

Create and Submit the Spark Application

Creating SparkContext

from pyspark.streaming import StreamingContext
# batch duration in seconds; here I process one batch every second (sc is an existing SparkContext)
ssc = StreamingContext(sc, 1)
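For completeness, here is one way the SparkContext and StreamingContext might be created together. This is a sketch under assumptions: the function name `build_streaming_context` and the application name `PythonStreamingApp` are hypothetical and do not come from the article.

```python
def build_streaming_context(app_name="PythonStreamingApp", batch_seconds=1):
    """Create a SparkContext and a StreamingContext with the given batch duration.

    The function and application names are illustrative assumptions.
    """
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")

    # Imported lazily so the argument check works even where PySpark is absent.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName=app_name)
    sc.setLogLevel("WARN")                     # reduce console noise
    ssc = StreamingContext(sc, batch_seconds)  # one micro-batch per batch_seconds
    return sc, ssc
```

The batch duration is the interval at which Spark Streaming cuts the incoming stream into micro-batches, so a value of 1 means each batch covers one second of events.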
  • Group id of this consumer.
  • “Per-topic number of Kafka partitions to consume”: specifies the number of partitions this stream reads in parallel.
  • deploy-mode: where to deploy the driver (either on a worker node or locally as an external client).
  • jars: recall our discussion of Spark Streaming’s Kafka libraries; here we need to submit that jar to provide the Kafka dependencies.
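The parameters and submit options listed above might fit together as in the following sketch. It assumes Spark 2.x, where `pyspark.streaming.kafka.KafkaUtils.createStream` is available (that module was removed in Spark 3.0). The ZooKeeper address, group id, topic names, broker address, and the `process_event` transform are hypothetical placeholders, not values from the article.

```python
def process_event(raw_value):
    """Hypothetical per-event transform: collapse whitespace and upper-case."""
    return " ".join(raw_value.split()).upper()


def send_partition(records, topic="output_event", servers="localhost:9092"):
    """Write one RDD partition back to Kafka, using one producer per partition."""
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers)
    for record in records:
        producer.send(topic, record.encode("utf-8"))
    producer.flush()
    producer.close()


def run_app(zk_quorum="localhost:2181", group_id="spark-streaming-consumer",
            topic="input_event", partitions=1):
    """Consume from Kafka, process each event, and write results back."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # Spark 2.x only

    sc = SparkContext(appName="PythonStreamingApp")
    ssc = StreamingContext(sc, 1)
    # group_id is the consumer group; {topic: partitions} is the
    # per-topic number of Kafka partitions to consume.
    stream = KafkaUtils.createStream(ssc, zk_quorum, group_id,
                                     {topic: partitions})
    # Each record is a (key, value) pair; only the value is processed.
    processed = stream.map(lambda kv: process_event(kv[1]))
    processed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))
    ssc.start()
    ssc.awaitTermination()


def build_submit_command(app_path, master, jar_path, deploy_mode="client"):
    """Assemble the spark-submit argument list for this application."""
    return ["spark-submit",
            "--master", master,
            "--deploy-mode", deploy_mode,
            "--jars", jar_path,
            app_path]
```

For example, `build_submit_command("app.py", "spark://abc.def.ghi.jkl:7077", "/path/to/kafka-assembly.jar")` returns the argument list for a client-mode submit; the script name and jar path here are placeholders for whichever application file and Spark Streaming Kafka assembly jar match your Spark build.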

Final Thoughts

The key takeaways from this article are:
1) Python, Spark, and Kafka are essential frameworks in a data scientist’s daily work.
2) This article helps data scientists run their experiments in Python while deploying the final model in a scalable production environment.

