Data Science Lingo 101: 10 Terms You Need to Know as a Data Scientist | by Sara A. Metwalli | Dec, 2020


Data science can be overwhelming for newcomers. The term “data science” is broad and is used as an umbrella term to cover many subfields: machine learning, artificial intelligence, natural language processing, and data mining are all commonly considered part of data science.

All of this terminology can be, and is, extremely confusing and sometimes discouraging for a newbie. When you decide to join the field, you need to know what the field actually is, what it includes, and its basic terminology.

But gathering this information is not easy, especially if you don’t yet have the background knowledge to navigate the web and pick out what you need.

When I first joined the field, I felt I had to juggle many things: learning the techniques, keeping up to date with the research and advances in the field, and trying to understand the terminology, or as I called it, “the lingo.”

So, a couple of years in, I thought I would write an article to help others who are joining the field and don’t know where to start. I have gathered here 10 terms that every data scientist needs to know to build or develop any data science project.

One of the most important terms in data science that you will hear quite often is “model.” Model training, improving model efficiency, model behavior, etc. But what is a model?

Mathematically speaking, a model is a specification of some probabilistic relationship between different variables. In layman’s terms, a model is a way of describing how two or more variables behave together.

Since the term “modeling” can be vague, “statistical modeling” is often used to describe modeling done by data scientists more accurately.

Regression is a machine learning term. In fact, regression is one of the most basic and simple supervised machine learning approaches. In regression problems, you have two kinds of values: a target value, also called the criterion variable, and one or more other values known as the predictors.

An example of that is the job market: how easy or difficult getting a job is (the criterion variable) depends on the demand for the position and its supply (the predictors).

There are different types of regression to match different applications; the easiest ones are the linear and logistic regressions.
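
As a rough illustration of the simplest case, linear regression with a single predictor can be fit with ordinary least squares. The sketch below uses plain Python; the function name `fit_line` and the toy numbers (a made-up demand score and a made-up ease-of-hiring score) are assumptions for illustration, not real job-market data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: predictor = demand for a position,
# criterion variable = how easy it is to get hired (both invented).
demand = [1.0, 2.0, 3.0, 4.0, 5.0]
ease = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(demand, ease)

# Use the fitted model to predict the target for an unseen predictor value.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at 6.0 = {predicted:.2f}")
```

Here the fitted model is nothing more than the two numbers `slope` and `intercept`; a regression with several predictors would carry one coefficient per predictor.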

Parameter is one of the terms that can be quite confusing because its meaning shifts slightly with the context you’re using it in. For example, in statistics, a parameter describes a property of a probability distribution, e.g., its shape or scale.

In data science or machine learning, the term parameter is often used, more precisely, for the components the system is learning. In machine learning, there are two types of models: parametric models and nonparametric models.

  1. Parametric models have a fixed number of parameters that is not affected by the amount of training data. Linear regression is considered a parametric model.
  2. Nonparametric models don’t have a fixed number of parameters, so the model’s complexity grows with the amount of training data. The best-known example of a nonparametric model is the k-nearest neighbors (KNN) algorithm.
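
To make the difference concrete, here is a minimal sketch of a KNN-style regressor in plain Python. The class name `KNNRegressor` and the toy data are assumptions for illustration. The point is that "fitting" just stores every training point, so what the model keeps grows with the data, whereas a fitted line always consists of a slope and an intercept no matter how many points it was trained on.

```python
class KNNRegressor:
    """Toy one-dimensional k-nearest-neighbors regressor (illustrative only)."""

    def __init__(self, k=3):
        self.k = k
        self.points = []  # the "model" is simply the stored training data

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Sort stored points by distance to the query and average the k closest targets.
        nearest = sorted(self.points, key=lambda p: abs(p[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)


small = KNNRegressor(k=2).fit([1, 2, 3], [10, 20, 30])
large = KNNRegressor(k=2).fit(list(range(1, 101)), [10 * v for v in range(1, 101)])

print(small.predict(2.4))                      # average of the two closest targets
print(len(small.points), len(large.points))    # stored state grows with the data
```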

When you hear or read the term bias, your brain often associates it with something negative. However, that’s not always the case. In data science, bias is often used to refer to a systematic error in the data.

Bias in the data is usually a result of sampling and estimation. When we choose some data to analyze, we often sample from a bigger data pool. The sample you select could be biased; that is, it might not be a correct representation of the pool.

Since the model we are training only knows the data we give it, it will learn only what it can see. That’s why data scientists need to be fully aware of this fact to create unbiased models.
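
A small sketch of how a sample can misrepresent its pool, using invented numbers: a hypothetical population made of two equal-sized groups, one sample drawn from only one group, and one sample spread evenly across the whole pool.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical pool: 50 people aged 25 followed by 50 people aged 65.
population = [25] * 50 + [65] * 50

# Biased sample: the first 20 records only, which all come from the younger group.
biased_sample = population[:20]

# More representative sample: every 5th record, so both groups are covered equally.
spread_sample = population[::5]

print(mean(population))      # true average of the pool
print(mean(biased_sample))   # far from the true average
print(mean(spread_sample))   # matches the true average for this toy pool
```

Any model trained on `biased_sample` would only ever see the younger group, which is exactly the situation described above.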

In general, we use correlation to refer to the degree to which two or more events occur together. For example, if depression cases increase in cold weather areas, there’s some correlation between cold weather and depression.

Often, things correlate to different degrees. For example, following a recipe and having a delicious dish have a higher correlation than the weather example above. The degree of correlation is measured by the correlation coefficient.

When the correlation coefficient is 1, the two events are perfectly correlated, whereas if it is close to 0, say 0.1, the events are only weakly correlated. The coefficient can also be negative, down to -1. In this case, the events move in opposite directions. For example, the better you eat, the lower your chances of getting ill.

Finally, you must always remember correlation doesn’t mean causation.
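
One common way to compute a correlation coefficient is Pearson's formula. The sketch below implements it in plain Python; the function name `pearson` and the three toy series are assumptions chosen to show a perfectly positive, a perfectly negative, and a weaker relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rises_together = [2, 4, 6, 8, 10]    # moves exactly with xs
moves_opposite = [10, 8, 6, 4, 2]    # moves exactly against xs
noisy = [3, 1, 4, 1, 5]              # only loosely related to xs

print(pearson(xs, rises_together))
print(pearson(xs, moves_opposite))
print(round(pearson(xs, noisy), 3))
```

The coefficient only measures how closely the two series move together; it says nothing about whether one causes the other.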

We already said that a model is a relationship between variables. We also mentioned what parametric and nonparametric models are. Another way to describe models is how well they fit the data they are being applied to.

Overfitting happens when your model captures too much detail about the training data, including its noise. So, you end up with a model that is overly complex and difficult to apply to data it was not trained on.

The opposite of overfitting is underfitting. Underfitting happens when the model doesn’t capture enough information about the data. In this case, you end up with a poorly fitted model.

One of the skills you will need to learn as a data scientist is how to find the middle ground between overfitting and underfitting.
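
The following sketch compares three toy models on invented data whose true relationship is y = 2x. A constant prediction stands in for underfitting, a least-squares line for a reasonable fit, and a model that memorizes the training points for overfitting. The data, the model names, and the choice of these three stand-ins are all assumptions for illustration.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training targets carry alternating +1 / -1 noise; test targets are noise-free.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.4 for x in range(9)]
test_y = [2 * x for x in test_x]

# Underfit: ignore x and always predict the training mean.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

def constant(x):
    return mean_y

# Reasonable fit: least-squares line through the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

# Overfit: memorize the training set and answer with the nearest stored target.
def memorizer(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

results = {
    name: (mse(model, train_x, train_y), mse(model, test_x, test_y))
    for name, model in [("constant", constant), ("line", line), ("memorizer", memorizer)]
}
for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
```

In this toy setup the memorizer reaches zero training error yet does worse than the line on the test points, because it has also memorized the noise, while the constant model is poor on both sets.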

Cross-validation is a way to evaluate the model’s behavior when it is applied to a dataset different from the training data used to build it. This is a big concern for data scientists because your model will often have good results on the training data but perform poorly when applied to real-life data.

There are different ways to apply cross-validation to a model; the three main strategies are:

  1. The holdout method, where the data is divided into two sections, one to build the model and one to test it.
  2. The k-fold validation, an improvement on the first method. Instead of dividing the data into two sections, the data is divided into k sections, each of which takes a turn as the test section, to reach a more reliable estimate of the model’s accuracy.
  3. The leave-one-out cross-validation, the extreme case of k-fold validation, where k equals the number of data points in the dataset being used.
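
The splitting logic behind these strategies can be sketched in a few lines of plain Python. The generator name `k_fold_splits` is an assumption for illustration; it works on indices only and does not shuffle, whereas in practice the data is usually shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 5-fold split of 10 samples: each sample is used for testing exactly once.
folds = list(k_fold_splits(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out is the same procedure with k equal to the number of samples.
leave_one_out = list(k_fold_splits(4, 4))
```

Taking only the first pair from `k_fold_splits(n, 2)` would correspond to a simple holdout split. For each pair, the model is trained on the training indices and scored on the test indices, and the scores are then averaged.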

A hypothesis, in general, is an explanation for some event. Often, hypotheses are made based on previous data and observations. A valid hypothesis is one that can be tested, with the result being either true or false.

In statistics, a hypothesis must be falsifiable. That means we can test any hypothesis to determine whether it’s valid or not. In machine learning, the term hypothesis refers to candidate models that can be used to map the model’s inputs to the correct and valid output.
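
For the machine learning sense of the word, a tiny sketch may help: several candidate functions (hypotheses) are scored against observed input/output pairs, and the one with the lowest error is kept. The three candidates and the data are invented for illustration; real learning algorithms search far larger hypothesis spaces.

```python
def mse(hypothesis, xs, ys):
    """Mean squared error of a candidate mapping on observed pairs."""
    return sum((hypothesis(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical candidate mappings from input to output.
hypotheses = {
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "plus_three": lambda x: x + 3,
}

# Observed inputs and outputs (toy data).
xs = [1, 2, 3, 4]
ys = [1, 4, 9, 16]

# Keep the hypothesis that best explains the observations.
best = min(hypotheses, key=lambda name: mse(hypotheses[name], xs, ys))
print(best)
```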

Outlier is a term used in data science and statistics to refer to an observation that lies an unusual distance from other values in the dataset. The first thing every data scientist should do when given a dataset is decide what counts as a usual distance and what counts as unusual.

An outlier can represent different things in the data: it could be noise introduced during data collection, or it could point to rare events and unique patterns. That’s why outliers shouldn’t be deleted right away; rather, they should be understood and investigated.
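
One widely used convention for deciding what counts as "unusual" is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs outside the middle half of the data. The article does not prescribe a method, so treat this as one possible choice; the readings below are invented.

```python
import statistics

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]

# Quartiles split the sorted data into four parts; IQR is the spread of the middle half.
q1, _median, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a law).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [value for value in readings if value < lower_fence or value > upper_fence]
print(outliers)
```

The code only flags candidates. Whether a flagged reading is a collection error or a genuinely rare event still has to be investigated before anything is removed.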

Data science is one of the most rapidly developing fields. In my opinion, and that of other experts, it is not going anywhere anytime soon. Our dependency on data is growing with each passing day. As long as there’s data to mine, analyze, and use to make the world a better place, the need and demand for capable data scientists will continue to rise.

One of the most challenging aspects of joining a new field is learning its lingo: which terms are often used and what they actually mean. And not just their meaning, but also how to use these techniques to build a solid data science project.

In this article, I went through what I think are 10 data science terms that every data scientist should know: what each one means, when to use it, and how its meaning changes based on the context and the project being built.

The only way to master the lingo is to build different projects while staying consistent and motivated. Eventually, the lingo will become second nature to you.
