How to Future-Proof Your Data Science Project


By Nicole Janeway Bills, Data Scientist at Atlas Research



87% of Data Science projects never make it into production. That statistic is shocking. Yet if you’re like most Data Scientists, it probably doesn’t surprise you. Nontechnical stakeholders struggle to define business requirements. Cross-functional teams face an uphill battle to set up robust pipelines for replicable data delivery. Deployment is hard. Machine learning models can take on a life of their own.

Here are five practical steps for future-proofing your model against these challenges of model selection and deployment. If you’ve been ignoring these critical elements in the past, you may find your deployment rate skyrockets once you address them. The success of your data products may depend on putting the tips from this article into practice.


1.0 Don’t Underestimate Interpretability

An interpretable model is one that is inherently explainable. For example, Decision Tree-based methods (Random Forest, AdaBoost, Gradient Tree Boosting) offer up a clear view of their underlying decision logic.



Interpretability may be mandatory in the heavily regulated fields of criminal justice and finance. It also tends to be an underrated element of a strong data science project.

Along with inherent interpretability, a Decision Tree model has the following helpful properties:

  • Easily depicted in a visual format
  • Able to detect any non-linear bivariate relationship
  • Good predictive power across a wide variety of use cases
  • Provides ranked feature importance
  • Low requirements for feature preprocessing
  • Works with categorical features using sklearn.preprocessing.OneHotEncoder
  • Handles outliers well, and resists overfitting when tree depth is limited or trees are combined in an ensemble
  • Can be used for either classification or regression

For these reasons, Decision Trees are a solid initial model to explore many typical business problems.

At the point of making a decision, are stakeholders more likely to trust an uninterpretable black box Neural Network or a Random Forest? Consider that a very detail-oriented (or very bored) business person could clearly trace the logic in every single underlying Decision Tree if they so chose. If the job of a Chief Data Officer is to keep the CEO out of jail, then this level of interpretability is clearly a win.


Uninterpretable models run the risk of perpetuating societal inequalities, such as the systematic “redlining” of Black families by human and AI-based mortgage lending systems — unless concrete steps are taken to mitigate bias against vulnerable groups. Photo by August de Richelieu on Pexels.


Beyond the Decision Tree, the family of interpretable models includes Naive Bayes Classifier, Linear and Logistic Regression, and K-Nearest Neighbors (for classification and regression). These intrinsically interpretable models have the added benefit that they save significant time and resources in training and serving, often at only a modest cost to predictive performance relative to black box Neural Networks.

1.1 How to select the right model

Whether aiming for interpretability or not, use this resource (Decision Trees everywhere!) to guide your model selection:



1.2 Read more about Model Selection

Comprehensive Guide to Model Selection
A systematic approach to picking the right algorithm.


2.0 Prune for Productionization

Of course, sometimes going with a Neural Network may be your best option. Perhaps you’re doing image recognition or natural language processing (NLP). Perhaps you’re working with a very complicated dataset. If you’re using a Neural Net, you should consider how to pare back the model before putting it into production.



In the words of Mark Kurtz, Machine Learning Lead at Neural Magic:

Most weights in a neural network are actually useless.

After training, 60–90% of weights can often be removed with little to no impact on performance. The result is faster inference time, reduced model size, and lower cost to serve users. In fact, the Neural Magic team argues that this sparsification could enable a renaissance in CPU-based architectures and “no hardware” AI.

ICML Paper: Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference
In July 2020, at the International Conference on Machine Learning, we presented a paper on methods for maximizing the…

Pruning involves removing the unused pathways in the Neural Network, keeping the necessary ones. Gradual magnitude pruning (GMP) has emerged as a favorite technique. In general, unstructured pruning — i.e. the removal of specific weights rather than entire neurons — allows for greater control over the sparsification process, resulting in better performance.

2.1 How to prune your model before productionizing

  • Retrain the network at a slightly higher learning rate than the final one used in training
  • At the start of epoch 1, set the sparsity of every layer to be pruned to 5%
  • Once per epoch, remove the weights closest to zero until the designated sparsity is reached
  • Hold sparsity constant for the remainder of retraining while reducing the learning rate
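The steps above can be sketched with NumPy on a single weight matrix. This is an illustration under stated assumptions, not Neural Magic's implementation: the linear sparsity ramp, the 85% target, the epoch counts, and the function names are all hypothetical, and the actual retraining step between prunings is left out.

```python
import numpy as np

def sparsity_schedule(epoch, start_epoch, end_epoch,
                      init_sparsity=0.05, final_sparsity=0.85):
    """Target sparsity for an epoch: linear ramp (an assumption), then hold."""
    if epoch <= start_epoch:
        return init_sparsity
    if epoch >= end_epoch:
        return final_sparsity
    frac = (epoch - start_epoch) / (end_epoch - start_epoch)
    return init_sparsity + frac * (final_sparsity - init_sparsity)

def magnitude_mask(weights, sparsity):
    """Boolean mask that drops the `sparsity` fraction of weights closest to zero."""
    n_prune = int(round(sparsity * weights.size))
    mask = np.ones(weights.size, dtype=bool)
    if n_prune > 0:
        flat = np.abs(weights).ravel()
        # Indices of the n_prune smallest magnitudes (unstructured pruning).
        prune_idx = np.argpartition(flat, n_prune - 1)[:n_prune]
        mask[prune_idx] = False
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(20, 10))  # stand-in for one trained layer
history = []

for epoch in range(1, 11):
    target = sparsity_schedule(epoch, start_epoch=1, end_epoch=8)
    # Already-pruned weights are zero, so they stay among the smallest.
    mask = magnitude_mask(weights, target)
    weights = weights * mask
    # A real loop would now retrain for one epoch, re-applying `mask`
    # after every optimizer step so pruned weights stay at zero.
    history.append(float((weights == 0).mean()))

print(history)
```

Sparsity starts at 5%, rises once per epoch, and is then held constant for the final epochs, which is where the learning rate would be reduced.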

2.2 Read more about the Lottery Ticket Hypothesis

5 Must-Read Data Science Papers (and How to Use Them)
Foundational ideas to keep you on top of the data science game.


3.0 Prevent Data and Model Drift

After deployment, the forces of drift will inevitably buffet your model and cause its performance to degrade over time. Data drift occurs when the model’s underlying input changes, with one or more data features no longer measuring what was originally measured. Model drift occurs when environmental conditions change and the model no longer reliably represents the real world.



3.0a Data Drift

Data drift is typically the result of changes in the data collection process. For example, a sensor at a manufacturing plant could break, recording several hours of zero temperatures before the problem can be corrected. Then the new sensor may record temperatures in Celsius, rather than the previous measurement in Fahrenheit. Without context on these changes, the zero values and the switch to a new standard of measurement will have an adverse effect on the downstream model.

The same can be said for changes to qualitative information. Survey data collection methodology — e.g. switching from mailing questionnaires to polling landlines — will have an impact on the demographics of respondents. Even slight changes to the way a question is worded will adversely impact a model’s capability to draw longitudinal inferences from the dataset.

Data drift could also result from changes to the definitions of the fields in the dataset. For example, the data owner at the manufacturing plant could decide that the term “scrap” should refer not just to unusable material, but also to material that will eventually be reprocessed into recycled products. This change in terminology will also impact model performance.

3.0b Model Drift

Changes in the real world environment may degrade a model’s predictive power.

Given the cataclysm of a year that 2020 has been, models of consumer behavior generally need to be kicked to the curb. Carl Gold is the Chief Data Scientist at Zuora, a services provider for subscription businesses that helps them move beyond analytics with advanced data products. In a recent interview, he shared his perspective on the impact of the pandemic:

I’m telling everyone to update their model. Now, if you do a new churn model, you should really only use data since COVID if possible.

That will only be possible for a consumer company that has a lot of observations. Generally, business-to-business companies have a small data challenge. So there’s so many competing concerns with refitting your model.

The job doesn’t stop once you’ve deployed.

You should continuously monitor your model’s predictions for accuracy because that’ll actually give you the warning sign if it’s been too long since retraining.

3.1 How to make your model robust to drift

  • Set up a Data Sharing Agreement with data source owners to receive advanced warnings of data drift
  • Monitor the distribution of incoming data against original training data — you can do this using the Kolmogorov-Smirnov (K-S) test or simply comparing the z-score
  • Monitor a time series dataset for drift from the previous time period — you may want to deploy the Population Stability Index (PSI) metric to do so
  • Retrain your model on a scheduled basis (e.g. every five months) or through online learning, where the model continuously takes in new training data and new versions are released through a continuous integration / continuous deployment process.
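The two monitoring checks named above can be sketched with NumPy alone. This is a rough illustration, not a production monitor: the function names, the ten quantile bins, and the simulated sensor readings are assumptions. The K-S function returns only the test statistic; SciPy's scipy.stats.ks_2samp computes the same statistic and also returns a p-value.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI over bins defined by quantiles of the reference data."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges; the outer bins are open-ended.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(edges, reference, side="right"),
                             minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current, side="right"),
                             minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / reference.size, eps, None)
    cur_frac = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Simulated data: training-time readings, a stable batch, and a shifted batch.
rng = np.random.default_rng(42)
training = rng.normal(loc=0.0, scale=1.0, size=5000)
stable_batch = rng.normal(loc=0.0, scale=1.0, size=5000)
drifted_batch = rng.normal(loc=1.0, scale=1.0, size=5000)

for name, batch in [("stable", stable_batch), ("drifted", drifted_batch)]:
    print(name,
          "KS:", round(ks_statistic(training, batch), 3),
          "PSI:", round(population_stability_index(training, batch), 3))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but the right alert thresholds depend on the feature and the cost of a false alarm.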

3.2 Read more about model retraining

Online Ensemble Learning with Abstaining Classifiers for Drifting and Noisy Data Streams
Mining data streams is among most vital contemporary topics in machine learning. Such scenario requires adaptive…




4.0 Take Advantage of Positive Feedback Loops

Algorithms are a powerful tool for empowering data-driven action. Through retraining on paired predicted and actual data, the results of the model become increasingly sophisticated over time.

The output of the data product provides high quality signals when integrated back into the data lifecycle. Andrew Ng referred to this concept as the virtuous cycle of AI. Harvard Business Review called it the insights engine.



Robust capture of data-driven decisions and their outcomes could further enrich the data collection process. Hopefully soon, more feedback collection opportunities will be built into dashboards, web interfaces, and other data products. Feedback collection can empower the end user and improve the insights engine.

4.1 How to take advantage of positive cycles

  • Communicate with stakeholders at the beginning of the planning process about the outsized benefits of effective machine learning models
  • Create data collection pipelines from the deployed model
  • Ensure accuracy of metadata
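One way to picture a data collection pipeline from the deployed model is a log that stores each prediction and later attaches the observed outcome. The sketch below uses only the standard library; the class names, the request IDs, and the in-memory dictionary are hypothetical stand-ins for whatever storage a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

@dataclass
class PredictionRecord:
    features: Tuple[float, ...]
    predicted: int
    actual: Optional[int] = None  # filled in once the outcome is known

@dataclass
class FeedbackLog:
    records: Dict[str, PredictionRecord] = field(default_factory=dict)

    def log_prediction(self, request_id: str, features: Sequence[float],
                       predicted: int) -> None:
        """Store what the deployed model saw and what it predicted."""
        self.records[request_id] = PredictionRecord(tuple(features), predicted)

    def log_outcome(self, request_id: str, actual: int) -> None:
        """Attach the real-world outcome to an earlier prediction."""
        self.records[request_id].actual = actual

    def training_pairs(self) -> List[Tuple[Tuple[float, ...], int]]:
        """Paired (features, actual) examples ready for the next retraining."""
        return [(r.features, r.actual) for r in self.records.values()
                if r.actual is not None]

    def accuracy(self) -> Optional[float]:
        """Share of correct predictions among records with a known outcome."""
        labeled = [r for r in self.records.values() if r.actual is not None]
        if not labeled:
            return None
        return sum(r.predicted == r.actual for r in labeled) / len(labeled)

log = FeedbackLog()
log.log_prediction("req-1", [1.0, 2.0], predicted=1)
log.log_prediction("req-2", [0.5, 0.1], predicted=0)
log.log_prediction("req-3", [0.9, 0.4], predicted=1)  # outcome not yet known
log.log_outcome("req-1", actual=1)
log.log_outcome("req-2", actual=1)

pairs = log.training_pairs()
live_accuracy = log.accuracy()
print(pairs, live_accuracy)
```

The same paired records serve two purposes: they feed the next round of retraining, and they give a running accuracy figure that can warn when the model has gone too long without it.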

4.2 Read more about what makes for an effective data product

The Age of the Data Product
We are living through an information revolution. Like any economic revolution, it has had a transformative effect on…


5.0 Prevent Negative Feedback Loops

A word of caution: far from being a self-sustaining system, a data product requires consistent monitoring. While the algorithmic feedback loop can create an insight-enriched dataset, it can also generate a bias-perpetuating cycle. There are many examples where the deployment of machine learning tools, particularly those with limited interpretability and explainability, accidentally deepened societal biases.



For example, a data science contracting firm created an algorithm to predict recidivism that was deployed in New York, Wisconsin, California, Florida, and other jurisdictions. ProPublica found that the algorithm perpetuated existing inequalities in a self-reinforcing feedback loop.

Although the defendant’s race was explicitly left out of the feature set, the algorithm used features highly correlated with race, which led to inadvertently biased judgments. These features should also have been eliminated in order to reduce disparities in the judgment of the machine learning system. Read more about these risks in this article.

As a lighthearted solution to the stagnation of a negative feedback loop, a computer scientist invented a randomness generator to shake up his social life:

Randomized Living
Starting in 2015, I let a computer decide where I lived and what I did for over two years. It sent me all over the world…


5.1 How to avoid a downward spiral


  • Start with a checklist that helps you think through the ethical implications of your model
  • Thoroughly investigate potential sources of bias in your pre-processing, processing, and post-processing phases of model training — and then remediate sources of bias
  • Communicate model performance across protected classes in documentation
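Reporting performance across protected classes can start with a simple per-group breakdown. The sketch below uses only the standard library and assumes binary labels; the function name, the group labels "A" and "B", and the toy predictions are hypothetical, and the metrics shown are just one possible choice.

```python
from collections import defaultdict

def rates_by_group(y_true, y_pred, groups):
    """Accuracy, false positive rate, and false negative rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for actual, predicted, group in zip(y_true, y_pred, groups):
        correct = "t" if actual == predicted else "f"
        sign = "p" if predicted == 1 else "n"
        counts[group][correct + sign] += 1

    report = {}
    for group, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]  # cases whose actual label is 0
        positives = c["tp"] + c["fn"]  # cases whose actual label is 1
        report[group] = {
            "n": total,
            "accuracy": (c["tp"] + c["tn"]) / total,
            "false_positive_rate": c["fp"] / negatives if negatives else None,
            "false_negative_rate": c["fn"] / positives if positives else None,
        }
    return report

# Toy example: identical base rates, but errors fall only on group B.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = rates_by_group(y_true, y_pred, groups)
for group, metrics in sorted(report.items()):
    print(group, metrics)
```

An overall accuracy figure would hide the gap between the two groups here, which is why the breakdown belongs in the model documentation alongside the headline metric.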

5.2 Read more about anti-bias tools

3 Open Source Tools for Ethical AI
Before integrating artificial intelligence into your organization’s workflow, consider these tools to prevent machine…



Failing to plan is planning to fail. So said Benjamin Franklin, immediately before getting struck by lightning while flying a kite out his bedroom window during a thunderstorm.

I like to think that if he were alive today, the Founding Father of the $100 bill would have been building a GPU-powered deep learning box, regularly PR’ing open source projects, and selecting and deploying models like a boss.

By starting off your next data science project with a robust planning process, you can ensure your model has better than 1:9 odds of making it into production. Use these tips for better model selection and deployment:

  1. Don’t Underestimate Interpretability
  2. Prune for Productionization
  3. Prevent Data and Model Drift
  4. Take Advantage of Positive Feedback Loops
  5. Prevent Negative Feedback Loops

Only fly a kite out your bedroom window during a thunderstorm if you want to get misquoted in Towards Data Science articles.

If you enjoyed this article, follow me on Medium, LinkedIn, YouTube, and Twitter for more ideas to improve your data science skills.

How are you future-proofing your machine learning models? Let me know in the comments.


Resources to Build Your Data Science Skills

The Best Data Science Certification You’ve Never Heard Of
A practical guide to the most valuable training in data strategy.

How to Break in to Data Science in 2020
This year, I finished a bootcamp and immediately landed a data science job. If I had to go back and learn everything by…

What I’ve Learned in 6 Months as a Data Scientist
I started in bootcamp, and then landed my dream job. Here are some key takeaways.

10 Underrated Python Skills
Up your Data Science game with these tips for improving your Python coding for better EDA, target analysis, feature…

Comprehensive Guide to the Data Warehouse
Learn about the role of the data warehouse as the master store of analysis-ready datasets.

Bio: Nicole Janeway Bills is a Data Scientist with experience in commercial and federal consulting. She helps organizations leverage their top asset: a simple and robust Data Strategy. Sign up for more of her writing.

Original. Reposted with permission.


