2021 Trends in Data Science: The Entire AI Spectrum
As an enterprise discipline, data science is the antithesis of Artificial Intelligence. The former is an unrestrained field in which creativity, innovation, and efficacy are the only limitations; the latter is bound by innumerable restrictions regarding engineering, governance, regulations, and the proverbial bottom line.
Nevertheless, the tangible business value prized in enterprise applications of AI is almost always spawned from data science. The ModelOps trend spearheading today’s cognitive computing has a vital, distinctive correlation within the realm of data scientists.
Whereas ModelOps is centered on solidifying operational consistency for all forms of AI—from its knowledge base to its statistical base—data science is the tacit force underpinning this motion by expanding the sorts of data involved in these undertakings.
Or, as Stardog CEO Kendall Clark put it, “If companies want to win with data science they really have to take seriously the breadth and diversity of all the types of data, not just the ones that are amenable to statistical techniques.”
By availing themselves of the full spectrum of data at their disposal, organizations can explore the boundaries of data science to master intelligent feature creation, explainability, data preparation, model standardization and selection—almost all of which lead to palpable advantages for enterprise deployments of AI.
Intelligent Feature Generation
What Clark termed “perceptual or computer visible” machine learning data directly invokes AI’s statistical foundation. Building machine learning models is predicated on identifying features that enhance model accuracy for applications of computer vision, for instance, to monitor defects in an assembly line process in the Industrial Internet. According to Gul Ege, SAS Senior Director of Advanced Analytics, Research and Development, “Intelligent feature creation comes from what is important to the domain and how we process this data.” Some of the numerous methods for enriching feature identification involve:
- Peaks and Distances: Ege outlined a wearable EKG device use case in which streaming data comes in cyclical patterns. When discerning features to see if patients are afflicted with specific heart conditions, for instance, “You apply noise reduction, and then look at the cyclic patterns and apply analytics to find the peaks and measure the distance between the peaks,” Ege explained. “The feature is the distance between the peaks.”
- Simplified Queries: Entity event models in graph settings supporting AI’s knowledge base greatly simplify the schema—and abridge the length of queries to traverse them—to represent an endless array of temporal events pertaining to critical entities like customers, patients, or products. According to Franz CEO Jans Aasman, “If you have a complex graph without entity event models, then if you want to extract features for machine learning, you have to write complex queries. With this approach, you write simple queries to get the data out.”
- Features Databases: Utilizing specific databases for feature generation is an emergent data science development. Clark mentioned an autonomous vehicle use case involving computer vision in which “features get bundled into scenes and they’re depicted or represented graphically.” Scenes can consist of other scenes; features are extracted via rules-based and statistical approaches. Scenes represent specific driving scenarios like pedestrians crossing the street. For the vehicle, the “task is to understand what the appropriate response in that situation is,” Clark indicated. “For computer vision this is roughly a selection of features, but they’re arranged spatially and temporally.”
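The peaks-and-distances idea Ege described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS implementation: the moving-average window, the peak threshold, the 100 Hz sample rate, and the synthetic pulse signal are all hypothetical stand-ins for real EKG data and production-grade noise reduction.

```python
import math

def moving_average(signal, window=5):
    """Noise reduction step: centered moving average (window size is an assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

def find_peaks(signal, threshold):
    """Indices of local maxima that rise above the threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if (signal[i] > threshold
                and signal[i] > signal[i - 1]
                and signal[i] >= signal[i + 1]):
            peaks.append(i)
    return peaks

def peak_distances(signal, sample_rate_hz, window=5, threshold=0.5):
    """The feature: time in seconds between consecutive peaks."""
    smoothed = moving_average(signal, window)
    peaks = find_peaks(smoothed, threshold)
    return [(b - a) / sample_rate_hz for a, b in zip(peaks, peaks[1:])]

# Synthetic cyclical signal: one pulse every 100 samples plus a small ripple.
SAMPLE_RATE_HZ = 100
signal = [
    math.exp(-((i % 100 - 50) ** 2) / 18.0) + 0.02 * math.sin(1.7 * i)
    for i in range(500)
]
distances = peak_distances(signal, SAMPLE_RATE_HZ)
```

Here `distances` holds one value per pair of neighboring peaks; in a real pipeline those values, or summary statistics over them, would be the features passed to a model.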
For rapidly changing data (such as e-commerce transactions, recommendations, or Internet of Things applications), accurate feature identification hinges on the noise reduction Ege referenced. Data scientists employ unsupervised learning techniques, similar to clustering, to reduce the number of variables used to train models. Dimensionality reduction approaches like Principal Component Analysis (PCA) “can separate the background from the moving parts in a video, or for any matrix, really,” Ege specified.
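To make the background-versus-motion idea concrete, the sketch below splits a small synthetic "video" matrix into a low-rank part and a residual. It uses a truncated singular value decomposition, the factorization that underlies PCA, instead of a full PCA routine; the frame sizes, the rank of 1, and the moving bright pixel are assumptions chosen for illustration.

```python
import numpy as np

def low_rank_split(frames, rank=1):
    """Split a (n_frames, n_pixels) matrix into a low-rank part and a residual."""
    u, s, vt = np.linalg.svd(frames, full_matrices=False)
    background = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return background, frames - background

# Synthetic video: 20 frames of 30 pixels with a static brightness gradient.
n_frames, n_pixels = 20, 30
static = np.linspace(1.0, 2.0, n_pixels)
frames = np.tile(static, (n_frames, 1))

# The "moving part": one bright pixel that shifts right by one position per frame.
for t in range(n_frames):
    frames[t, t + 5] += 3.0

background, moving = low_rank_split(frames, rank=1)

# For each frame, the largest residual marks where the moving object is.
detected = moving.argmax(axis=1)
```

The low-rank part captures what every frame shares, so the residual is dominated by whatever changes from frame to frame. Real footage usually needs a higher rank or a robust variant of PCA.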
Graph embedding is gaining traction for performing this and other critical data science work for “doing predictions and inferences using the nature of the graph to understand similarities between things like products or people,” noted Cambridge Semantics CTO Sean Martin. Advantages of this application of knowledge graphs include:
- Decreased Data Prep Time: Graph embedding abbreviates the elaborate pipelines that monopolize the time of data scientists preparing (as opposed to analyzing) data. Transferring data into tools like Python for this machine learning work is programming-intensive and time-consuming. But when performed in a graph database, “you can do it much more quickly, more iteratively, than ending up having to keep extracting data from the graph and into pipelines,” Martin maintained.
- Matrix Support: Data must be vectorized for use in machine learning models. Graphs with matrix support enable organizations to “shift the data from a graph representation to matrices,” Martin commented. Subsequently, they can perform functions like PCA “which lets you see correlations between things; how different parts of datasets are correlated,” Martin remarked.
- Granular Feature Engineering: Graphs are also ideal for inputting the results of machine learning analytics—like clustering—for refining features and other aspects of training models. In this respect, “what works better with graphs is taking the output of what you learned, especially for unsupervised learning, and putting it back in the graph,” Aasman acknowledged.
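The round trip described above (graph to matrix, unsupervised analysis, results back into the graph) might look roughly like the following sketch. It is not how any particular graph database does it: the dictionary-based graph, the node names, and the choice of a one-dimensional spectral embedding (the Fiedler vector of the graph Laplacian) are assumptions made to keep the example small.

```python
import numpy as np

# Hypothetical toy graph: two tightly knit groups joined by one bridge edge.
graph = {
    "nodes": {name: {} for name in ["a", "b", "c", "x", "y", "z"]},
    "edges": [("a", "b"), ("a", "c"), ("b", "c"),
              ("x", "y"), ("x", "z"), ("y", "z"),
              ("c", "x")],
}

def adjacency_matrix(graph):
    """Shift the graph into a symmetric 0/1 matrix (rows follow node order)."""
    names = list(graph["nodes"])
    index = {name: i for i, name in enumerate(names)}
    matrix = np.zeros((len(names), len(names)))
    for u, v in graph["edges"]:
        matrix[index[u], index[v]] = 1.0
        matrix[index[v], index[u]] = 1.0
    return names, matrix

def spectral_embedding(matrix):
    """One-dimensional embedding: eigenvector of the second-smallest Laplacian eigenvalue."""
    laplacian = np.diag(matrix.sum(axis=1)) - matrix
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return eigenvectors[:, 1]

names, matrix = adjacency_matrix(graph)
embedding = spectral_embedding(matrix)

# Put the unsupervised result back into the graph as node properties.
for name, coordinate in zip(names, embedding):
    graph["nodes"][name]["embedding"] = float(coordinate)
    graph["nodes"][name]["cluster"] = int(coordinate > 0)
```

Once the cluster labels are stored as node properties, later queries or feature extraction can filter on them like any other attribute. Which group receives label 0 or 1 depends on the sign of the eigenvector and is arbitrary.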
The explainability issue, which is closely tied to interpretability, model bias, and fair AI, still has the potential to undermine the enterprise value of statistical AI deployments. Nonetheless, by coupling AI’s statistical side with its knowledge side, organizations can consistently surmount this obstacle. “The explainability crisis really gets at people’s ability to trust these systems,” Clark observed. “The only real solution to the explainability crisis are blended techniques that supplement statistical models with logic or rules-based formalisms, so whatever the computer is doing to get the answer, the explanation of that answer is in terms that are intelligible to people.” One of the premier tasks for data scientists in the coming year, then, is to augment machine learning with AI’s knowledge foundation typified by rules-based learning.
Doing so will expand the data types and techniques data science must come to encompass to include data described by Clark as “conceptual or categorical; it’s about the concepts or categories that exist between people.” Leveraging these data with logical rules brings explainability to practical applications of machine learning. “Most business data doesn’t really come in that perceptible or computer visible [variety]; it comes as more categorical,” Clark revealed. “Like, what’s a risky loan, or what’s a risky purchase, or is this person an insider threat to an organization from a risk and analysis point of view. Or, what’s the part of our supply chain that’s at the most risk if there’s an earthquake in Chile?” Analyzing these scenarios with statistical AI in conjunction with symbolic reasoning, semantic inferencing, and rules can deliver much-needed explainability for organizations and regulators alike.
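One way to picture a blended technique is a statistical score paired with rules that supply the human-readable explanation. The sketch below applies that idea to the risky-loan question. Everything in it is hypothetical: the logistic weights stand in for a trained model, and the two rules and their thresholds are invented for illustration, not drawn from any vendor’s product.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    explanation: str                  # wording a loan officer could read
    applies: Callable[[dict], bool]   # logical test on the applicant record

# Hypothetical business rules with made-up thresholds.
RULES = [
    Rule("debt exceeds 40% of income", lambda a: a["debt"] / a["income"] > 0.4),
    Rule("credit history shorter than 2 years", lambda a: a["history_years"] < 2),
]

def statistical_score(applicant):
    """Stand-in for a trained model: logistic score with made-up weights."""
    z = (3.0 * applicant["debt"] / applicant["income"]
         - 0.3 * applicant["history_years"]
         - 0.5)
    return 1.0 / (1.0 + math.exp(-z))

def assess(applicant, threshold=0.5):
    """The model decides; the rules explain. Unexplained flags go to a human."""
    score = statistical_score(applicant)
    reasons = [rule.explanation for rule in RULES if rule.applies(applicant)]
    risky = score >= threshold
    return {
        "risky": risky,
        "score": round(score, 3),
        "reasons": reasons if risky else [],
        "needs_review": risky and not reasons,
    }

flagged = assess({"debt": 30000, "income": 50000, "history_years": 1})
cleared = assess({"debt": 5000, "income": 50000, "history_years": 10})
```

The `needs_review` flag reflects one possible design choice: when the model flags a case that no rule can explain, the case is escalated instead of being returned without an explanation.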
Aside from ensembling techniques such as Random Forest and gradient boosting, the results of immensely multi-layered neural networks have proven the most arduous to explain, particularly at the compute and scale of deep learning. Organizations can standardize these models and others to maximize their deployment by taking the following into account:
- Open Neural Network Exchange (ONNX): According to SAS Chief Data Scientist Wayne Thompson, “ONNX is an environmental standard for the exchange of deep learning models.” ONNX’s scope of use is expansive; one could develop a model in a proprietary framework then “someone else can bring it into open source and use my model as a preliminary weight and train it further for their environment,” Thompson noted.
- Autotuning: Data scientists can expedite the potentially cumbersome task of tuning parameters for machine learning models by opting to “build algorithms that have very few tuning parameters and also default to add optimal value,” Ege disclosed. “We put another algorithm in there to see what the optimal tuning parameter is and try not to have a zillion parameters.” This method is effective for smaller form factor models on IoT devices, for example.
- Recurrent Neural Networks (RNNs): RNNs work well for forecasting and text analytics. “That’s because they look at a sequence of data points,” Thompson added. “A conversation is a token of spoken words that have a sequence to it.”
- Convolutional Neural Networks (CNNs): One of the predominant use cases for CNNs is computer vision. “They can see better than humans today,” Thompson said. “So yes, they’re very good for image analysis and there’s a plethora of use cases for that.”
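The autotuning idea in the list above, an algorithm with a single tuning parameter that is chosen by another small algorithm, can be illustrated with simple exponential smoothing for forecasting. This is a hedged sketch, not the approach Ege’s team uses: the candidate grid, the squared-error criterion, and the function names are assumptions.

```python
def one_step_error(series, alpha):
    """Sum of squared one-step-ahead forecast errors for a given smoothing factor."""
    level = series[0]
    total = 0.0
    for value in series[1:]:
        total += (value - level) ** 2
        level = alpha * value + (1 - alpha) * level
    return total

def autotune_alpha(series, candidates=None):
    """Pick the smoothing factor that minimizes past forecast error."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return min(candidates, key=lambda alpha: one_step_error(series, alpha))

def forecast_next(series, alpha=None):
    """Forecast the next value; with no alpha supplied, tune it automatically."""
    if alpha is None:
        alpha = autotune_alpha(series)
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level, alpha

# A steadily rising series: the tuner should favor a fast-reacting (high) alpha.
series = [float(v) for v in range(1, 11)]
prediction, chosen_alpha = forecast_next(series)
```

Callers who know their data can still pass `alpha` explicitly; everyone else gets a tuned default without having to reason about the parameter, which is the appeal on small devices with little room for manual configuration.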
On The Roadmap
Data science will increasingly prioritize integrating the entire spectrum of data and AI methods, including aspects of its statistical and knowledge base, into daily deployments across the enterprise. Utilizing the full range of techniques and information at the disposal of data scientists will substantially improve feature generation, data preparation, and explainability.
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.