Julie Michelman – Pandas, Pipelines, and Custom Transformers
Using pandas and scikit-learn together can be a bit clunky. For complex preprocessing, the scikit-learn Pipeline conveniently chains together transformers. But, it will convert your DataFrame to a numpy array. In this talk, we will walk through pandas DataFrames, scikit-learn preprocessing and Pipelines, and how to use custom transformers to stay in pandas land.
For data science in python, the pandas DataFrame is a common choice to store and manipulate data sets. It has named columns, each of which can contain a different data type, and an index to identify rows and assist in joining. The scikit-learn package is the major machine learning library in python. It has implementations for a wide variety of popular feature engineering, supervised, and unsupervised machine learning algorithms. Perhaps even more importantly to its success, scikit-learn provides a uniform interface for these transformers and estimators, making it easy to swap out one for another.
Many scikit-learn transformers will take and return pandas DataFrames, but some only return numpy arrays. This means losing the column names and row indices. A few important examples include the meta-transformers Pipeline and FeatureUnion. The Pipeline chains together transformers to be applied in order. The FeatureUnion combines the results of transformers that can be applied in parallel. With these, the entire feature engineering process can be stored in one object and easily applied to new data sets.
Luckily, scikit-learn also provides the ability to write your own custom transformers. It is as simple as defining a new class that implements the fit and transform methods. We can use this to create pandas-friendly versions of the Pipeline and FeatureUnion, as well as add transformations that are not already provided.
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.