Top Python Libraries for Data Science, Data Visualization & Machine Learning


It has been a while since we last ran a Python libraries roundup, so we have taken the opportunity to start the month of November with just such a fresh list.

Last time we at KDnuggets did this, editor and author Dan Clark split the vast array of Python data science related libraries into several smaller collections, including data science libraries, machine learning libraries, and deep learning libraries. While splitting libraries into categories is inherently arbitrary, this made sense at the time of the previous publication.

This time, however, we have split the collected open source Python data science libraries in two. This first post covers “data science, data visualization & machine learning,” and can be thought of as “traditional” data science tools covering common tasks. The second post, to be published next week, will cover libraries for use in building neural networks, and those for performing natural language processing and computer vision tasks.

Again, this separation and classification is arbitrary, in some cases more than others, but we have done our best to group tools together by intended use case, hoping this is most useful for readers.

The categories included in this post, which we see as covering common data science libraries (those likely to be used by practitioners in the data science space for generalized, non-neural network, non-research work), are:

  • Data – libraries for the management, manipulation, and other processing of data
  • Math – while many libraries perform mathematical tasks, this small collection does so exclusively
  • Machine learning – self explanatory; excludes libraries primarily intended for building neural networks or for automating machine learning processes
  • Automated machine learning – libraries that primarily function to automate processes related to machine learning
  • Data visualization – libraries that primarily serve a function related to visualizing data, as opposed to modeling, preprocessing, etc.
  • Explanation & exploration – libraries primarily for exploring and explaining models or data

Our list is made up of libraries that our team decided collectively, by consensus, were representative of common and well-used Python libraries. Also, to be included a library must have a Github repository. The categories are in no particular order, and neither are the libraries included within each. We contemplated establishing an ordering arbitrarily by stars or some other metric, but decided against it in order to explicitly avoid placing any perceived value or importance on the libraries within. Their listing here, then, is purely random. Library descriptions are taken directly from the Github repositories, in some form or another.

Thanks to Ahmed Anis for contributing to the collection of this data, and to the rest of the KDnuggets staff for their inputs, insights, and suggestions.

Note that the visualization below, by Gregory Piatetsky, represents each library by type, plots it by stars and contributors, and its symbol size is reflective of the relative number of commits the library has on Github.


Figure 1: Top Python Libraries for Data Science, Data Visualization & Machine Learning
Plotted by number of stars and number of contributors; relative size by number of contributors


So, without further ado, here are the 38 top Python libraries for data science, data visualization & machine learning, as best determined by KDnuggets staff.

Data

1. Apache Spark
Stars: 27600, Commits: 28197, Contributors: 1638

Apache Spark – A unified analytics engine for large-scale data processing

2. Pandas
Stars: 26800, Commits: 24300, Contributors: 2126

Pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
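
As a rough illustration of what “labeled” data looks like in practice, here is a minimal sketch that loads a few rows into a DataFrame and selects by column label. The column names and the derived `commits_per_star` column are our own illustrative choices; the numbers are the stats quoted in this list.

```python
import pandas as pd

# Small labeled table; values are the stats quoted in this article.
df = pd.DataFrame(
    {
        "library": ["Apache Spark", "Pandas", "Dask"],
        "stars": [27600, 26800, 7300],
        "commits": [28197, 24300, 6149],
    }
)

# Columns are addressed by label, and new columns derive from existing ones.
df["commits_per_star"] = df["commits"] / df["stars"]

# Sort by a labeled column and pick the top row's library name.
most_starred = df.sort_values("stars", ascending=False).iloc[0]["library"]
print(most_starred)
```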

3. Dask
Stars: 7300, Commits: 6149, Contributors: 393

Parallel computing with task scheduling


Math

4. Scipy
Stars: 7500, Commits: 24247, Contributors: 914

SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.
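
To give a feel for two of the modules mentioned above, the following sketch uses `scipy.integrate` and `scipy.optimize` on toy functions of our own choosing. It is only an illustration, not a tour of the library.

```python
from scipy import integrate, optimize

# Numerically integrate x^2 over [0, 3]; the exact answer is 9.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# Find the minimizer of (x - 2)^2; the true minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

print(area, result.x)
```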

5. Numpy
Stars: 1500, Commits: 24266, Contributors: 1010

The fundamental package for scientific computing with Python.
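
A minimal sketch of the array-oriented style NumPy encourages: operations apply to whole arrays rather than element by element in a Python loop. The variable names are illustrative, and the values are the stats quoted in this list for the first three libraries.

```python
import numpy as np

# Stars and contributors for Apache Spark, Pandas, and Dask (from this list).
stars = np.array([27600, 26800, 7300])
contributors = np.array([1638, 2126, 393])

# Elementwise division over whole arrays, with no explicit loop.
stars_per_contributor = stars / contributors

# Reductions work the same way.
mean_stars = stars.mean()
print(stars_per_contributor, mean_stars)
```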


Machine Learning

6. Scikit-Learn
Stars: 42500, Commits: 26162, Contributors: 1881

Scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.
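
For readers new to the library, here is a short sketch of the usual fit-and-score workflow. The dataset (the bundled iris data), the logistic regression model, and the split settings are our own choices for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset as feature matrix X and label vector y.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation; fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator follows the same pattern: construct, fit, then predict or score.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```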

7. XGBoost
Stars: 19900, Commits: 5015, Contributors: 461

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow

8. LightGBM
Stars: 11600, Commits: 2066, Contributors: 172

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

9. Catboost
Stars: 5400, Commits: 12936, Contributors: 188

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

10. Dlib
Stars: 9500, Commits: 7868, Contributors: 146

Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. Can be used with Python via dlib API

11. Annoy
Stars: 7700, Commits: 778, Contributors: 53

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

12. H2O.ai
Stars: 500, Commits: 27894, Contributors: 137

Open Source Fast Scalable Machine Learning Platform For Smarter Applications: Deep Learning, Gradient Boosting & XGBoost, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

13. StatsModels
Stars: 5600, Commits: 13446, Contributors: 247

Statsmodels: statistical modeling and econometrics in Python

14. mlpack
Stars: 3400, Commits: 24575, Contributors: 190

mlpack is an intuitive, fast, and flexible C++ machine learning library with bindings to other languages

15. Pattern
Stars: 7600, Commits: 1434, Contributors: 20

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

16. Prophet
Stars: 11500, Commits: 595, Contributors: 106

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.


Automated Machine Learning

17. TPOT
Stars: 7500, Commits: 2282, Contributors: 66

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

18. auto-sklearn
Stars: 4100, Commits: 2343, Contributors: 52

auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.

19. Hyperopt-sklearn
Stars: 1100, Commits: 188, Contributors: 18

Hyperopt-sklearn is Hyperopt-based model selection among machine learning algorithms in scikit-learn.

20. SMAC-3
Stars: 529, Commits: 1882, Contributors: 29

Sequential Model-based Algorithm Configuration

21. scikit-optimize
Stars: 1900, Commits: 1540, Contributors: 59

Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.

22. Nevergrad
Stars: 2700, Commits: 663, Contributors: 38

A Python toolbox for performing gradient-free optimization

23. Optuna
Stars: 3500, Commits: 7749, Contributors: 97

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning.


Data Visualization

24. Apache Superset
Stars: 30300, Commits: 5833, Contributors: 492

Apache Superset is a Data Visualization and Data Exploration Platform

25. Matplotlib
Stars: 12300, Commits: 36716, Contributors: 1002

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

26. Plotly
Stars: 7900, Commits: 4604, Contributors: 137

Plotly is an interactive, open-source, and browser-based graphing library for Python

27. Seaborn
Stars: 7700, Commits: 2702, Contributors: 126

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

28. folium
Stars: 4900, Commits: 1443, Contributors: 109

Folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it in a Leaflet map via folium.

29. Bqplot
Stars: 2900, Commits: 3178, Contributors: 45

Bqplot is a 2-D visualization system for Jupyter, based on the constructs of the Grammar of Graphics.

30. VisPy
Stars: 2500, Commits: 6352, Contributors: 117

VisPy is a high-performance interactive 2D/3D data visualization library. VisPy leverages the computational power of modern Graphics Processing Units (GPUs) through the OpenGL library to display very large datasets.

31. PyQtgraph
Stars: 2200, Commits: 2200, Contributors: 142

Fast data visualization and GUI tools for scientific / engineering applications

32. Bokeh
Stars: 1400, Commits: 18726, Contributors: 467

Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets.

33. Altair
Stars: 600, Commits: 3031, Contributors: 106

Altair is a declarative statistical visualization library for Python. With Altair, you can spend more time understanding your data and its meaning.


Explanation & Exploration

34. eli5
Stars: 2200, Commits: 1198, Contributors: 15

A library for debugging/inspecting machine learning classifiers and explaining their predictions

35. LIME
Stars: 800, Commits: 501, Contributors: 41

Lime: Explaining the predictions of any machine learning classifier

36. SHAP
Stars: 10400, Commits: 1376, Contributors: 96

A game theoretic approach to explain the output of any machine learning model.

37. YellowBrick
Stars: 300, Commits: 825, Contributors: 92

Visual analysis and diagnostic tools to facilitate machine learning model selection.

38. pandas-profiling
Stars: 6200, Commits: 704, Contributors: 47

Create HTML profiling reports from pandas DataFrame objects


