Providing fine-grained, trusted access to enterprise datasets with Okera and Domino – Data Science Blog by Domino


Domino and Okera – Give data scientists access to trusted datasets within reproducible and instantly provisioned computational environments.

In the past few years, we’ve seen the acceleration of two trends: the increasing amounts of data stored and used by organizations, and the subsequent need for data scientists to help make sense of that data for critical business decisions. This explosion in both the amount of data and the number of users who need access to it has created new challenges, chief among them how to provide secure access to this data at scale and how to give data scientists consistent, repeatable, and convenient access to the computational tools they need.

These patterns play out in many industries and use cases. For example, in the pharmaceutical world, a great deal of data is produced for clinical trials and the commercial production of new drugs and treatments, and this has only accelerated since the emergence of COVID-19. This data supports all kinds of use cases within organizations, from helping manufacturing analysts understand how production is progressing, to allowing research scientists to look at the results of a set of treatments across different trials and cross-sections of the population.

Domino Data Lab, the world’s leading data science platform, gives data scientists easy access to reproducible and easily provisioned computational environments. They can work with data without worrying about setting up Apache Spark clusters or getting the right version of libraries. They can also easily share results with other users and create recurring jobs to produce new results over time.

In today’s increasingly privacy-aware environment, more and more types of data are considered sensitive. These datasets must be protected in accordance with industry-specific regulations such as HIPAA, or the slew of emerging consumer data privacy regulations including GDPR, CCPA and other regulations in different jurisdictions. This can be a roadblock for data consumers; although Domino Data Lab makes it easy to access computational resources, getting access to all the data they need is a real challenge.

Traditionally, this problem has been solved by either denying access to this data altogether (a not infrequent outcome), or creating and maintaining multiple copies of many datasets for each possible use case by omitting the data that a particular user isn’t allowed to see (e.g. PII, PHI, etc.). This process of creating duplicate versions of the data not only takes a lot of time (typically months) and increases storage costs (which quickly add up when talking about petabytes of data), but also becomes a management nightmare. Data managers need to keep track of all these copies and the purposes for which they were created, and remember that they need to be kept up to date with new data and, even worse, with possible future redactions and transformations as new types of data are deemed sensitive.

Okera, the leading provider of secure data access, allows you to define fine-grained data access control using attribute-based access policies. By combining the power of Domino Data Lab with Okera, your data scientists only get access to the columns, rows, and cells they are allowed to see, and sensitive data such as PII and PHI that is not relevant to training models is easily removed or redacted. Additionally, Okera connects to a company’s existing technical and business metadata catalogs (such as Collibra), making it easy for data scientists to discover, access and use new, approved sources of information.
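To make the idea of attribute-based policies concrete, here is a minimal sketch in plain Python. It is not Okera’s policy engine or API; the column tags, the `Policy` class and the helper functions are all hypothetical names chosen for illustration. The point is only that a policy is written against attributes (tags such as `pii` or `phi`) rather than against individual columns, so a newly tagged column is hidden without creating another copy of the data.

```python
from dataclasses import dataclass

# Hypothetical tags attached to the columns of a clinical-trial dataset.
COLUMN_TAGS = {
    "participant_name": {"pii"},
    "diagnosis": {"phi"},
    "country": set(),
    "dosage_mg": set(),
    "outcome": set(),
}


@dataclass(frozen=True)
class Policy:
    """Hides every column that carries one of the denied tags (illustrative only)."""
    role: str
    denied_tags: frozenset = frozenset()


def visible_columns(policy, column_tags):
    """Return the columns whose tags do not intersect the policy's denied tags."""
    return [col for col, tags in column_tags.items() if not (tags & policy.denied_tags)]


def apply_policy(rows, policy, column_tags):
    """Project each row onto the columns the policy allows."""
    allowed = visible_columns(policy, column_tags)
    return [{col: row[col] for col in allowed} for row in rows]


rows = [
    {"participant_name": "Alice Example", "diagnosis": "A", "country": "US",
     "dosage_mg": 10, "outcome": "improved"},
    {"participant_name": "Bob Example", "diagnosis": "B", "country": "FR",
     "dosage_mg": 20, "outcome": "unchanged"},
]

modeler = Policy(role="modeler", denied_tags=frozenset({"pii", "phi"}))
redacted = apply_policy(rows, modeler, COLUMN_TAGS)
```

With this sketch, tagging an additional column as `pii` in `COLUMN_TAGS` is enough to hide it from the `modeler` role; the policy itself does not change.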

For the compliance team, the combination of Okera and Domino Data Lab is extremely powerful. It allows compliance not only to manage what information can be accessed, but also to audit and gain insight into how the data is actually being accessed: when, by whom, through what tools, how much data was viewed, etc. This can help identify data breaches and show where data access should be further reduced, for example by removing access to infrequently used data to lower the risk of exposure.
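As a rough illustration of the kind of question an audit trail can answer, the sketch below works over a small in-memory list of audit records. The record fields (`timestamp`, `user`, `tool`, `dataset`, `rows`) and the sample entries are assumptions made for this example and do not reflect the schema of Okera’s audit log.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical audit records; the field names are assumptions for this sketch.
AUDIT_LOG = [
    {"timestamp": datetime(2020, 7, 1), "user": "alice", "tool": "jupyter",
     "dataset": "drug_xyz.trial_july2020", "rows": 1200},
    {"timestamp": datetime(2020, 7, 2), "user": "bob", "tool": "spark",
     "dataset": "drug_xyz.trial_july2020", "rows": 50000},
    {"timestamp": datetime(2020, 7, 3), "user": "alice", "tool": "jupyter",
     "dataset": "drug_xyz.trial_july2020", "rows": 300},
    {"timestamp": datetime(2020, 3, 15), "user": "carol", "tool": "jupyter",
     "dataset": "drug_abc.trial_jan2020", "rows": 10},
]

GRANTED = ["drug_xyz.trial_july2020", "drug_abc.trial_jan2020"]
SINCE = datetime(2020, 6, 1)


def access_counts(log, since):
    """Count accesses per dataset on or after `since`."""
    return Counter(e["dataset"] for e in log if e["timestamp"] >= since)


def rows_viewed_by_user(log, since):
    """Total rows returned to each user on or after `since`."""
    totals = defaultdict(int)
    for e in log:
        if e["timestamp"] >= since:
            totals[e["user"]] += e["rows"]
    return dict(totals)


def rarely_used(log, granted, since, min_accesses=1):
    """Granted datasets accessed fewer than `min_accesses` times since `since`."""
    counts = access_counts(log, since)
    return sorted(d for d in granted if counts[d] < min_accesses)


print(rarely_used(AUDIT_LOG, GRANTED, SINCE))
```

Datasets returned by `rarely_used` are candidates for having access revoked, which is the risk-reduction step described above.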

So what does this look like? Consider an example where a data scientist wants to load a CSV file from Amazon S3 into a pandas dataframe for further analysis, such as building a model for a downstream ML process. In Domino Data Lab, the user would use one of the Environments they have access to, and have some code that might look like this:

import io

import boto3
import pandas as pd

# Fetch the CSV object from S3 and parse its bytes into a DataFrame.
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='clinical-trials', Key='drug-xyz/trial-july2020/data.csv')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))

An important detail embedded in the above snippet is the question of how the data scientist gets permission to access the file. This can be done via IAM permissions by either storing user credentials in secure environment variables inside Domino or using Keycloak capabilities to do credential propagation between Domino and AWS.
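For the environment-variable option, a minimal sketch might look like the following. It assumes the credentials are stored under the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN`); the helper names are hypothetical. boto3 also picks these variables up on its own when no credentials are passed, so the explicit version mainly makes the dependency visible and fails early if a variable is missing.

```python
import os


def s3_client_kwargs(env=os.environ):
    """Build boto3 client keyword arguments from environment variables.

    Raises KeyError if a required variable is missing.
    """
    kwargs = {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    # Temporary credentials additionally carry a session token.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        kwargs["aws_session_token"] = token
    return kwargs


def make_s3_client(env=os.environ):
    """Create an S3 client using credentials taken from `env`."""
    import boto3  # imported lazily so s3_client_kwargs works without boto3 installed

    return boto3.client("s3", **s3_client_kwargs(env))
```

Either way, the credentials grant access to the whole object: IAM decides whether the file can be read, not which parts of it.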

Finally, if the data scientist is not allowed to see certain columns, rows, or cells within the CSV file, there is no way to give them access to the file at all.

When Domino Data Lab is integrated with Okera, the same code simply looks like this:

import os
from okera.integration import domino

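# The Domino integration context carries the current user's identity to Okera.
ctx = domino.context()
with ctx.connect(host=os.environ['OKERA_HOST'], port=int(os.environ['OKERA_PORT'])) as conn:
    # Access policies are applied by Okera before any rows are returned.
    df = conn.scan_as_pandas('drug_xyz.trial_july2020')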


The identity of the current user in Domino Data Lab is automatically and transparently propagated to Okera, with all of the requisite fine-grained access control policies applied. This means that if the executing user is only allowed to see certain rows (e.g. trial results from participants in the US, to adhere to data locality regulations) or to see certain columns without exposing PII (e.g. by not exposing a participant’s name but still being able to meaningfully create aggregations), this will be reflected in the result of the query that gets returned, without ever exposing the data scientist to the underlying sensitive data. Finally, this data access is also audited, and that audit log is made available as a dataset for querying and inspection.
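The effect described above can be imitated in a few lines of plain Python. This is only a sketch of the outcome, not of how Okera enforces policies: `scan_as_user`, `mask`, the salt and the sample rows are hypothetical. Rows outside the allowed countries are dropped, and names are replaced by a stable token so that per-participant aggregations still work on the masked data.

```python
import hashlib
from collections import Counter


def mask(value, salt="demo-salt"):
    """Replace a value with a stable, non-reversible token (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def scan_as_user(rows, allowed_countries):
    """Return only the rows the user may see, with participant names masked."""
    visible = []
    for row in rows:
        if row["country"] not in allowed_countries:
            continue  # row-level filter, e.g. for data locality rules
        redacted = dict(row)
        redacted["participant_name"] = mask(row["participant_name"])
        visible.append(redacted)
    return visible


trial_rows = [
    {"participant_name": "Alice Example", "country": "US", "outcome": "improved"},
    {"participant_name": "Alice Example", "country": "US", "outcome": "improved"},
    {"participant_name": "Bob Example", "country": "US", "outcome": "unchanged"},
    {"participant_name": "Chloe Example", "country": "FR", "outcome": "improved"},
]

view = scan_as_user(trial_rows, allowed_countries={"US"})

# The same participant always maps to the same token, so grouping still works.
visits_per_participant = Counter(row["participant_name"] for row in view)
```

The data scientist can count visits per participant or compare outcomes across participants without ever seeing a real name or a row from outside the US.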

In addition to the benefits of being able to access data securely while maintaining fine-grained access control policies, it’s now much easier for data scientists to find the data that they need to access. Previously, this involved sifting through object storage such as Amazon S3 or Azure ADLS, but with the combination of Okera and Domino Data Lab, data scientists can easily inspect and search Okera’s metadata registry to find data they have access to that has been validated, qualified and documented by subject matter experts, preview it, and get simple instructions on how to access it in their Domino Data Lab environments.

As your organization’s investment in your data and the productivity of your data scientists increases, it’s critical that they have the right tools and access to the right data. With the combination of Okera and Domino Data Lab, the whole is greater than the sum of its parts. If you’re already leveraging Domino Data Lab, adding Okera can help you unlock data for analysis that was previously off-limits due to privacy and security concerns. If you’re already using Okera, adding Domino Data Lab can increase the productivity of your data scientists by giving them easy access to reproducible and easily provisioned computational environments.

For more information about Okera and their partnership with Domino, please visit


