New Conda environment feature available in Oracle Cloud Infrastructure Data Science

On January 13, 2021, the Oracle Cloud Infrastructure (OCI) Data Science service released a new feature called Conda environments to the notebook session resource. This new feature includes a JupyterLab extension called the Environment Explorer, available through the JupyterLab Launcher tab, and a CLI tool called odsc conda available through the JupyterLab terminal window. These tools give you the ability to manage the lifecycle of Conda environments in notebook sessions.

This is a major change to the notebook session resource that data scientists have been using since the OCI Data Science service was released in February 2020. Data scientists and machine learning (ML) engineers can now pick and choose which environments they want to install in their notebook sessions from a list of pre-built ones, or create, install, and publish their own environments.

In this post, I give an overview of the new Conda environments feature set.

 

What is a Conda environment? 

You can think of a Conda environment as somewhere between a Docker image and a Python virtual environment. Like a virtual environment, Conda lets you run Python processes side by side with different versions of the same library. It’s more powerful than virtualenv because it also manages versions of Python that aren’t installed system-wide, lets you upgrade libraries, and supports the installation of packages for R, Python, Node.js, Java, and so on.
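
For example, here is a minimal sketch of the kind of isolation Conda offers beyond virtualenv (the environment and package names are purely illustrative):

    # Create an isolated environment with a Python version that isn't installed system-wide
    conda create -n py37-demo python=3.7
    conda activate py37-demo

    # Install packages from a Conda channel and from pip inside that environment
    conda install -c conda-forge scikit-learn
    pip install lightgbm

    # Conda also manages non-Python packages, for example an R interpreter
    conda install -c conda-forge r-base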

The process of building Conda environments is simpler and faster than building Docker images. For many ML and AI use cases, Conda environments offer the right level of isolation and flexibility.

 

Benefits of using Conda environments

Conda environments give you the following capabilities: 

  • You can install Python libraries from different Conda channels such as conda-forge, from a PyPI index, or directly from a third-party version control provider such as github.com.

  • Conda environments are also portable through the conda-pack tool. You can archive them in an Object Storage bucket, for example, or ship them across platforms and operating systems (see the sketch after this list).

  • You can access different Conda environments as different notebook kernels in JupyterLab. So, data scientists and machine learning engineers can simultaneously execute different notebooks in different kernels with potentially conflicting sets of dependencies.
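
As a minimal sketch of the portability point above, you can archive an environment with conda-pack and copy it to Object Storage with the OCI CLI (the environment and bucket names are placeholders, and the OCI CLI is assumed to be configured with access to the bucket):

    # Archive an existing Conda environment into a single relocatable tarball
    conda pack -n my-env -o my-env.tar.gz

    # Copy the archive to an Object Storage bucket (bucket name is a placeholder)
    oci os object put --bucket-name my-conda-envs --file my-env.tar.gz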

 

Manage the lifecycle of Conda environments within Data Science notebook sessions 

Within notebook sessions, you can leverage the Environment Explorer extension, available through the JupyterLab launcher tab, to list, install, publish, delete, and clone Conda environments. The Explorer tabs let you filter by Data Science, Installed, or Published Conda environments.

Alternatively, you can call the odsc conda CLI to list, create, install, delete, clone, and publish Conda environments directly from the JupyterLab terminal window.
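
As a rough sketch, the lifecycle commands look like the following from the JupyterLab terminal. The subcommands are the ones named above; any flags beyond those mentioned in this post are assumptions, and each subcommand documents its own options:

    # List the Conda environments visible to this notebook session
    odsc conda list

    # List the environments you have published to your Object Storage bucket
    odsc conda list -o

    # Each subcommand has its own help page, e.g.:
    odsc conda install --help
    odsc conda create --help
    odsc conda publish --help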

 

Install curated Data Science Conda environments

From the odsc conda CLI or the Explorer extension, you can install one or more of the Data Science Conda environments. These environments are built and curated by the OCI Data Science service team. Although all of them include Accelerated Data Science (ADS), only the Classic and General Machine Learning environments include AutoML and MLX.
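
As a hedged example, installing a curated environment typically comes down to a single command that references the environment's slug shown in the Environment Explorer (both the -s flag and the slug value below are illustrative placeholders):

    # Install a curated Data Science Conda environment by its slug (placeholder value)
    odsc conda install -s generalml_p37_cpu_v1

    # Once installed, the environment appears as a selectable kernel in the JupyterLab launcher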

More Data Science Conda environments will be added over time for NLP, computer vision, time series, and geospatial modeling.

Each Data Science Conda environment is versioned. New versions of existing Data Science Conda environments will include upgraded libraries, a newer version of ADS, updated notebook examples, or additional libraries.

 

Create your own environment

Maybe the Data Science Conda environments don’t have exactly what you are looking for. That’s not a problem. You can always create your own Conda environment using the odsc conda create command. List the libraries you want to install in a Conda-compatible environment.yaml file, and we take care of the rest, including installing the dependencies needed to turn your Conda environment into a notebook kernel! Conda supports the installation of libraries from Conda channels and pip.
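
As a sketch, a Conda-compatible environment.yaml is just the standard Conda specification; you then point odsc conda create at it. The environment name and the flag names in the command below are assumptions on my part, so check the command's help output for the exact options:

    # environment.yaml -- standard Conda environment specification
    name: mycustomenv
    channels:
      - conda-forge
    dependencies:
      - python=3.7
      - pandas
      - scikit-learn
      - pip
      - pip:
          - lightgbm

    # From the JupyterLab terminal (flag names are assumptions; see odsc conda create --help)
    odsc conda create -n mycustomenv -e environment.yaml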

 

Publish an environment and share it with colleagues across notebook sessions

One of the greatest benefits of our environment feature is the ability to take a Conda environment that you’ve installed or created in your notebook session and publish it to your Object Storage bucket, using the odsc conda publish command. This capability allows you to share Conda environments with colleagues who have access to the same bucket, or to install an environment you’ve previously published into a different notebook session.

Once an environment is published, it becomes available under the Published Conda environment tab of the Explorer extension. The odsc conda list -o command does the same from a terminal window! You can reinstall any of the environments that you’ve previously published.
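
Here is a hedged sketch of that round trip (the slug and the -s flag are illustrative, and the commands assume your Object Storage bucket details have already been configured for publishing):

    # Publish an installed or locally created environment to your Object Storage bucket
    odsc conda publish -s mycustomenv_v1

    # Later, in any notebook session with access to the same bucket,
    # list the published environments and reinstall the one you need
    odsc conda list -o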

 

Persistence of installed environments

All Conda environments that you create or install in your notebook session are stored on the block volume drive. So, the Conda environments persist through a notebook session deactivation and activation cycle. You never have to reinstall the same libraries after a notebook session activation again!

 

Notebook examples

Notebook examples are now specific to each Data Science Conda environment that you install in your notebook session. In addition to its own set of example notebooks, each Data Science Conda environment includes a tailored Getting Started notebook. Whenever you delete a Conda environment, we also delete the associated notebook examples.

 

Data Science curated Conda environments: Ready to install in your notebook

You can now install one or more of the pre-built Data Science Conda environments that are made available through the Explorer notebook extension and the odsc conda CLI tool. Each Conda environment comes with its own set of notebook examples to help you get started quickly with each environment.

We offer various Conda environments, and this list will grow over time to include environments tailored for particular use cases like NLP, computer vision, and time series.

The following Data Science Conda environments are available in this initial release: 

For CPU virtual machine (VM) shapes:

  • Classic CPU notebook session kernel
    • Same kernel as the one that was available in the notebook session prior to this release. This kernel ensures that code that was developed against the old monolithic kernel can be run successfully in this new release. We strongly recommend that you start using the other Conda environments.
       
  • General machine learning for CPUs 
    • Includes the new versions of ADS, AutoML, and MLX, along with the usual machine learning suspects, including scikit-learn, XGBoost, LightGBM, and others
       
  • PySpark 
    • Provides a local development environment for a PySpark job. Ideal environment to test your Oracle Cloud Infrastructure Data Flow jobs before submitting them with ADS (also included in this environment).
       
  • ONNX 
    • A runtime environment for ONNX models. Useful environment to test ONNX models before deploying them as Oracle Functions.
       
  • Data exploration and manipulation 
    • An environment jam-packed with data processing and data visualization tools including pandas, dask, pandarallel, kafka-python, matplotlib, plotly, bokeh, seaborn, and more
       
  • Oracle Database 
    • Works seamlessly with Oracle Database using the ADS connector, SQLAlchemy, cx_Oracle, ipython-sql, mysql-connector, etc. Includes support for Oracle Database, MySQL, and SQLite. The notebook examples provide a variety of code snippets to connect to various databases and execute queries.

For GPU virtual machine (VM) shapes:

  • Classic GPU notebook session kernel
    • Same kernel as the one that was available in the notebook session for GPUs prior to this release. This kernel ensures that code that was developed against the old monolithic kernel can be run successfully in this new release. We strongly recommend that you start using the other Conda environments.
  • General machine learning for GPUs
    • Includes the new versions of ADS, AutoML, and MLX. This environment also includes TensorFlow 2.3.1 optimized for GPUs.
  • NVIDIA RAPIDS v0.16
    • Includes the version 0.16 release of the NVIDIA RAPIDS framework. This environment will be the subject of an upcoming blog post.

 

Interested in data science and machine learning? 

Attend Oracle Developer Live for technical sessions, hands-on labs, and live Q&A about how you can optimize data for the machine learning lifecycle. Sign up for any of the following dates:

  • January 26: Americas
  • January 28: Europe, Middle East, and Africa
  • February 2: Asia Pacific

I’ll be leading sessions on Accelerated Data Science, the end-to-end machine learning lifecycle, and using GPUs for data science with NVIDIA RAPIDS.

 

Keep in touch! 

–    Visit our website

–    Visit our service documentation

–    (Oracle Internal) Visit our slack channel #oci_datascience_users

–    Visit our YouTube Playlist

–    Visit our LiveLabs Hands-on Lab 

 
