Notable Data Science Platforms of 2020 | by Matthew Yates | Dec, 2020
A list of data science platforms worth considering
This field is moving fast and changes every day. This is why I’m attempting to take a snapshot of the data science platforms (DSPs) that I have found to be notable in 2020.
So let’s get going!
First, let’s look at the major cloud providers. Today, all the major cloud providers have their own data science platforms. These options are really amazing because they integrate so well with all the other services that AWS, Azure, and Google have to offer. Want to deploy a model into an ETL pipeline? They’ve got that covered (given you’re doing it in the cloud).
If you can’t get access to the cloud in your company, you might want to move to the next section.
Doing data science in the cloud used to be a huge pain and took 8,000 steps just to get set up. Other companies (see the middleweight section) started selling their own “data science platforms” on top of the cloud, which made scalable cloud data science easier. Eventually, AWS caught on and came out with SageMaker, which made it an early trailblazer among the major cloud providers. AWS continues to add SageMaker features like AutoML, Ground Truth, and now SageMaker Studio.
Azure and AWS SageMaker feel VERY similar; to some degree, they are “po(tay)to, po(ta)to”. They both offer machine learning studios, AutoML, labeling services, and more. Even the code looks the same. (I’ve heard some say Azure is cheaper.)
Personally, one thing I really like about Azure is its integration with Databricks. It’s very easy to get up and running on Azure with a Databricks service! So you could call up Databricks, or you could just get Azure and spin up a Databricks environment in no time! Pretty awesome!
In this space, Google was surprisingly late to the game. Although Azure and AWS are a bit more mature, Google is a groundbreaking company that’s doing great work in AI and ML. Their AutoML tools can be impressive, and I can only imagine they’ll get even better. I don’t have to tell you that Google is a pretty smart company when it comes to tech.
The following options have some maturity behind them, and they generally focus on model training but have varying support for model deployment. Unlike cloud platforms, which can only be hosted in the cloud, these platforms can be hosted in the cloud and/or on-prem.
From the creators of Apache Spark, Databricks is a company with a ton of steam behind it. Why? Well, its focus on Apache Spark makes it unique. Some companies might say “we’re not using a lot of Spark, so this doesn’t seem like the right tool,” and that’s where they would be wrong. Spark is a very flexible tool. It’s excellent for ETL because of its focus on distributed processing of large data (and as I always say, it’s easier to scale down than it is to scale up). Spark has a great ML library and many ML extensions. If your data is small, no worries: you can skip Spark and just train a model on a single node. Or you can use just the Spark executors, rather than Spark’s own ML library, and run hyperparameter tuning for a Python model over a cluster. Now instead of your hyperparameter tuning taking a day, maybe it runs in under an hour. I think what makes Spark so powerful is how strong it is in not just ML, but in data engineering as well.
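To make the hyperparameter-tuning point a bit more concrete, here is a minimal sketch of fanning a grid search out over parallel workers. It uses only the Python standard library, with a local thread pool standing in for Spark executors. The toy objective `validation_loss`, the helper `parallel_grid_search`, and the grid values are all hypothetical; none of this is a Databricks or Spark API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(params):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    learning_rate, depth = params
    # Toy loss surface with its minimum at learning_rate=0.1, depth=6.
    return (learning_rate - 0.1) ** 2 + (depth - 6) ** 2 / 100


def parallel_grid_search(grid, max_workers=4):
    """Evaluate every parameter combination in parallel and return the best one."""
    candidates = list(product(*grid.values()))
    # Assumption: a local thread pool plays the role of the cluster's workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        losses = list(pool.map(validation_loss, candidates))
    best_loss, best_params = min(zip(losses, candidates))
    return dict(zip(grid.keys(), best_params)), best_loss


grid = {"learning_rate": [0.01, 0.1, 0.3], "depth": [2, 6, 10]}
best_params, best_loss = parallel_grid_search(grid)
print(best_params, best_loss)
```

On a real cluster the shape stays the same: each worker trains one candidate, and the driver collects the scores. How the work gets shipped to the executors depends on the tooling you choose.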
Databricks keeps adding features, like version control and even dashboarding! This platform is a strong solution for not just building models, but also productionalizing those models as batch jobs, and even REST APIs (although it’s probably not as robust a solution for REST API deployments).
Domino has been a strong and mature data science platform for many years now. In many ways, Domino was a trailblazing company when it first came out. Domino is still a strong company with a strong platform.
This platform is a strong solution for building models, and also has support for productionalizing those models as REST APIs (and maybe more).
We all know the Hadoop boom is over, but Cloudera/Hortonworks is still very much alive and kicking. Cloudera can be deployed in the cloud or on-prem. I think what makes Cloudera a strong company that probably won’t go away anytime soon is their on-prem support and their core knowledge of cluster maintenance. Want to build models and then deploy them in Apache Airflow jobs? Getting an Airflow cluster up and running is not easy, but with the Cloudera platform, it’s much easier. Cloudera is probably an underrated solution today, and depending on your company’s needs, it could be a perfect fit.
This platform is a strong solution for not just building models, but also productionalizing those models as batch jobs, and even REST APIs. Since Cloudera is mainly a cluster company that supports a lot of open-source tools, it should be a robust solution for both REST APIs and batch jobs; you just need to learn the open-source tools.
Anaconda is a well-known Python package manager, but did you know that they have their own data science platform?! From reviewing one of their customer case studies with PNC, it seems like a nice option if you already have an on-prem Hadoop cluster that you’d like to leverage into a data science platform. Anaconda has had data science platform capabilities for quite a while, so they’re worth a look.
Here are some platforms that are newer to the scene.
This is an interesting company that has popped up in the ML space. Seems like they lean more on the productionalization side of things vs. the model building, but they seem to support both.
Comet is a company/tool that’s gaining a lot of traction. Seems like they lean more heavily in the model building space vs. the model productionalization space. As far as model building goes, it looks very feature-rich.
Algorithmia is purely a model deployment platform that focuses on deploying models as REST APIs (although you can also use the service for batch jobs with some finagling).
Zepl is an overlooked and underrated data science platform focused on model building and data mining. Zepl has strong ties to Apache Zeppelin and again, focuses more on model building than model deployment. Zepl has a lot of great features and at a great price, which really makes Zepl stand out.
If you want modeling flexibility, this is not the tool for your team. With AutoML platforms, there is no Jupyter or notebook IDE, there is only the AutoML UI/SDK. If you lack a strong data science team or have a collection of citizen data scientists (people who have data science knowledge but lack coding experience or professional experience) then an AutoML platform could be a great fit. AutoML platforms are focused on AutoML in such a way that you just load your data (either through a UI or an SDK), and they take care of the rest.
The truth is that there are a lot of pitfalls in data science, and so the idea of a ‘citizen data scientist’ is a hard one to realize. Also, it’s realistic that many projects could fall outside of the capabilities of your AutoML platform, and then what? That doesn’t mean that AutoML platforms are a bust — far from it.
AutoML platforms can also be treated as a great tool in the toolbelt of a data scientist (much like Random Forest or XGBoost, or any other ML tool for that matter). If you plan to use an AutoML platform in this capacity, then you’ll still need to figure out where your data scientists complete the rest of their work (running other experiments, deployment, etc.).
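The “load your data and the platform takes care of the rest” idea can be sketched as a tiny automated model-selection loop. This is only an illustration of the core concept, not how any particular AutoML product works: real platforms also handle feature engineering, tuning, and much more. The candidate models, the `auto_select` helper, and the data are all made up for the example.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def auto_select(candidates, train, valid):
    """Fit every candidate on train, score it on valid, return the winner."""
    scored = []
    for name, fit in candidates.items():
        model = fit(*train)
        scored.append((mse(model, *valid), name, model))
    best = min(scored, key=lambda item: item[0])
    return best[1], best[2]


# Hypothetical toy data: (inputs, targets) for training and validation.
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
valid = ([5, 6], [10.1, 11.9])
name, model = auto_select({"mean": fit_mean, "line": fit_line}, train, valid)
print(name)
```

The user only supplies data and a list of candidates; the loop decides which model wins. That convenience is also the limitation: if your problem needs a model that isn’t on the list, the loop can’t help you.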
DataRobot is a very strong and mature AutoML platform. DataRobot also has some model deployment features (which are starting to get stronger after their acquisition of MLOps company ParallelM).
H2O is a very different company. They started as an open-source framework that was similar to Apache Spark, but then transitioned to an AutoML company that now competes with DataRobot. A lot of top Kagglers today work for H2O.ai, making them a notable company in 2020.
SigOpt is another company that is very different. SigOpt focuses on model tuning, which makes it pretty unique. They also have features for productionalization, but model tuning is their primary focus.
Here are some other platforms that I didn’t mention. They’re not highlighted for various reasons, but I know they exist and thus shouldn’t be ignored. Also, there are some big companies that use these platforms, so here they are:
- IBM Watson Studio
- Oracle Data Science Platform
- SAS (Viya, Enterprise Miner, etc.)
- D2iQ Kaptain
- Cognitive Scale
There are a lot of data science platforms/tools out there, and the questions you need to ask yourself are:
- What does my team need?
- What hardware is possible to acquire at my company?
The answer to question #1 is long but can be summarized as: a flexible and scalable development environment for model building and analytics, data storage, a flexible ETL tool, a BI tool for publishing dashboards, and a service for hosting REST APIs. More info than this would require another article (which I’m working on).
The answer to question #2 is something only you can answer. If you have access to the cloud, I would check out the heavyweight options. If you don’t have access to the cloud, I would check out the middleweight and lightweight options. In addition, you might need more than just 1 service if you’re not using cloud services. Some platforms are mainly for model building, and some are purely for model deployment, so you might need 2–3 different solutions to cover development and deployment. This all comes back to question #1. You need to ask yourself:
- How am I going to build my models?
- How am I going to deploy a model into an ETL framework?
- How am I going to deploy a model as a REST API for real-time application integrations?
- How am I going to share insights with the rest of the organization? What dashboarding tool should I use? How will I host our dashboards?
- How are we going to test our models? How are we going to monitor our models?
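To show why the batch and real-time questions above are related but separate, here is a minimal sketch in which one prediction function is reused on both paths. Everything in it is hypothetical: the `predict` model is a fixed linear score over two made-up features, and `handle_request` is a framework-agnostic stand-in for whatever REST service your platform provides, not a real server.

```python
import json


def predict(features):
    """Hypothetical model: a fixed linear score over two made-up features."""
    return 0.5 * features["age"] + 2.0 * features["visits"]


def score_batch(rows):
    """Batch path: an ETL step calls this on each chunk of records."""
    return [{**row, "score": predict(row)} for row in rows]


def handle_request(body):
    """Real-time path: take a JSON request body, return (status_code, JSON response)."""
    try:
        features = json.loads(body)
        return 200, json.dumps({"score": predict(features)})
    except (ValueError, KeyError, TypeError) as err:
        # Malformed JSON or missing features become a client error, not a crash.
        return 400, json.dumps({"error": str(err)})


batch = score_batch([{"age": 30, "visits": 2}, {"age": 40, "visits": 0}])
status, response = handle_request('{"age": 30, "visits": 2}')
print(batch, status, response)
```

The model logic is shared, but the packaging differs: the batch path needs a scheduler and data access, while the real-time path needs hosting, input validation, and monitoring. That is why a platform that is strong on one path may still leave you shopping for the other.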
As you can see, it’s unlikely that 1 platform will cover all of this unless you’re using Azure, AWS, or Google Cloud. If you’re going with a middleweight or lightweight solution, you might need to use more than 1 to get everything your team needs.
Thanks for reading and hope you found this insightful!