Econometrics techniques for data science | by Mahbubul Alam | Dec, 2020


I wrote an article a while ago about econometrics (Econometrics 101 for Data Scientists). The article resonated well with readers, but that was a kind of introductory article for data science people who might not be otherwise familiar with the domain.

Inspired by the response to that article, today I’m attempting to take it to the next level by making it a bit more comprehensive. I’ll mostly focus on the methods, tools, and techniques used in econometrics that data scientists will benefit from.

Econometrics is a sub-domain of economics that applies mathematical and statistical models with economic theories to understand, explain and measure causality in economic systems.

With econometrics, one can make a hypothesis that the length of education has a positive impact on wage rates; then qualify this relationship with economic theory; and finally, formalize that relationship quantitatively (e.g. 1 additional year of schooling increases wage by 5%) using mathematical and statistical techniques (e.g. regression). A couple of other examples are:

The econometrics domain largely deals with macro-economic phenomena such as employment, wage, economic growth, environment, agriculture and inequality, but these principles are equally applicable in solving business and machine learning problems.

There is no clear-cut boundary within which econometrics operates, so it is difficult to list all the methods, tools and techniques that fall within it. Keeping that in mind, and since I’m writing this article for data scientists, I broadly grouped econometrics methods into four categories: descriptive statistics, hypothesis testing, regression and forecasting.

Let’s do a deeper dive into each category.

Descriptive statistics play a key role in exploratory data analysis (EDA) in data science projects. They measure the central tendency, dispersion and distribution of data using statistical techniques.

Central tendency: In econometrics, the measures of central tendency are a set of “middle” values representative of all observations in a dataset. They describe the distribution of data by focusing on a central location around which the other data points cluster. Among the measures of central tendency:

  • Mean is the average of the data points. Arithmetic, geometric, weighted and harmonic mean are some variants of it.
  • Median is the mid-point of the sorted data and an alternative to the mean. It is far less sensitive to outliers, which gives it an advantage over the mean when the data are skewed.
  • Mode is the most frequently occurring value in a distribution.
The concept of central tendency (source: author)
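
As a small illustration of the measures above, here is a sketch using only Python's standard `statistics` module. The wage figures are made up for this example, and the last value is a deliberate outlier to show how the mean and median react differently.

```python
import statistics

# Hypothetical hourly wages; the last value is a deliberate outlier.
wages = [30, 32, 35, 35, 38, 40, 250]

mean_wage = statistics.mean(wages)      # pulled upward by the outlier
median_wage = statistics.median(wages)  # middle value of the sorted data
mode_wage = statistics.mode(wages)      # most frequent value

# Two of the mean variants mentioned above (both need positive data).
geometric = statistics.geometric_mean(wages)
harmonic = statistics.harmonic_mean(wages)

print(f"mean={mean_wage:.2f} median={median_wage} mode={mode_wage}")
print(f"geometric={geometric:.2f} harmonic={harmonic:.2f}")
```

The single outlier drags the arithmetic mean far above the typical wage, while the median and mode stay at 35.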

Dispersion: In contrast to central tendency, the measures of dispersion quantify the variability in a dataset, i.e., how widely the data are spread around the central values.

Commonly used measures of dispersion in econometrics are Range, Interquartile Range (IQR), Standard Deviation, Variance, Mean Absolute Deviation, Coefficient of Variation, Gini Coefficient etc.

Concept of dispersion (source: author)
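
The sketch below computes several of these dispersion measures with the standard library. The data are invented, the `gini` helper is a hypothetical function written for this example (using the common rank-weighted formula), and the variance and standard deviation are the population versions.

```python
import statistics

def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted formula, equivalent to the mean absolute difference form.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
mean = statistics.mean(data)

value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                           # interquartile range
variance = statistics.pvariance(data)   # population variance
std_dev = statistics.pstdev(data)       # population standard deviation
mad = sum(abs(x - mean) for x in data) / len(data)  # mean absolute deviation
coef_var = std_dev / mean               # coefficient of variation

print(value_range, iqr, variance, std_dev, mad, coef_var, round(gini(data), 4))
```

Note that quartile conventions differ between tools, so the IQR from another library may not match exactly.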

Distribution: A statistical distribution is a mathematical function that describes how likely each possible value (or range of values) of a variable is to occur.

The Normal (Gaussian) distribution is the best known, but other commonly used probability distributions include the Binomial, Poisson, Bernoulli, Geometric, Exponential and Chi-squared distributions.

Conceptual diagram of normal distribution (Source: M. W. Toews, Wikipedia)
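
To make the idea of a distribution concrete, the following sketch evaluates one continuous and one discrete distribution with the standard library. `poisson_pmf` is a hypothetical helper defined here, and the rate of 3 events is an arbitrary choice for illustration.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability that a draw falls within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution with the given average rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)

# Probability of exactly 2 events when 3 are expected on average.
p_two_events = poisson_pmf(2, rate=3.0)

print(f"P(-1 < Z < 1) = {within_one_sd:.4f}")
print(f"P(X = 2 | rate = 3) = {p_two_events:.4f}")
```

The first number is the familiar "about 68% within one standard deviation" property of the normal curve.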

Hypothesis testing generally refers to the examination of a claim against accepted facts (called the “null hypothesis”). It uses sample data to assess a claim about the whole population. Suppose one claims that people in Arlington County live longer than people in Fairfax County. Since it is not possible to survey everyone, a researcher would take samples of the population in both counties and test the hypothesis (i.e. the claim) against the null hypothesis (i.e. there is no difference in life expectancy between the counties).

So hypothesis testing is a test of a claim, but how exactly can you measure the claim’s validity? There are several tests for that:

  • A t-test is used when there’s 1 independent variable (e.g. gender) with 2 levels (e.g. boys vs girls) and 1 dependent variable (e.g. test score).
  • ANOVA is used when there’s 1 independent variable with more than 2 levels. In a hypothesis such as “liberals, conservatives and independents differ in opinion on the proposed tax policy” we have 1 independent variable (party affiliation) with more than 2 levels (liberals, conservatives, independents) and 1 dependent variable (opinion on tax policy).
  • Chi-Squared test compares observed against expected outcomes. Let’s say 15 households were surveyed on their pet preferences: cats, dogs or birds. The expected outcome (null hypothesis) is cats-5, dogs-5, and birds-5. However, after the survey, observed outcomes were found to be cats-2, dogs-10, birds-3. The Chi-Squared test checks whether the observed preferences differ significantly from the expected ones.
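
The pet-preference example can be worked through by hand. The sketch below computes the Chi-Squared statistic from the observed and expected counts given above; the p-value relies on the fact that, with exactly 2 degrees of freedom, the Chi-Squared tail probability has the closed form exp(-x/2). For any other number of categories a statistics library would be needed.

```python
import math

observed = {"cats": 2, "dogs": 10, "birds": 3}
expected = {"cats": 5, "dogs": 5, "birds": 5}

# Test statistic: sum over categories of (observed - expected)^2 / expected.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

degrees_of_freedom = len(observed) - 1  # 3 categories -> 2

# Closed form of the chi-squared survival function, valid only for df = 2.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {degrees_of_freedom}, p = {p_value:.4f}")
```

At the conventional 5% level this p-value would lead to rejecting the null hypothesis of equal preferences, although with expected counts of only 5 per category the approximation is at the edge of the usual rule of thumb.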

Two additional and important concepts in hypothesis testing are:

  • A measure called the p-value is used as evidence for or against the claim (hypothesis); a small p-value means the null hypothesis can be rejected (i.e. the claim is statistically supported). In statistical terms, p = 0.001 means that, if the null hypothesis were true, there would be only a 0.1% probability of observing a result at least this extreme, hence the null hypothesis (accepted facts) can be rejected.
  • Another concept in hypothesis testing is the confidence interval (CI), a measure of the uncertainty around an estimate. A CI gives a range of values that is likely to contain the true parameter at a stated confidence level (e.g. 95%).
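
As a rough sketch of how a confidence interval is built, the function below uses the normal approximation for a sample mean. The helper name and the ages are hypothetical, and for a sample this small a t-distribution would normally be preferred; the normal critical value is used here only to stay within the standard library.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_error, mean + z * std_error

# Hypothetical ages at death sampled from one county.
ages = [78, 81, 75, 84, 79, 80, 77, 83, 82, 76]
low, high = mean_confidence_interval(ages)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Raising the confidence level (say, to 99%) widens the interval, which is the trade-off between certainty and precision.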

Regression is a vast topic and I could go on forever writing about it. Below, I summarize the key methods and associated techniques/models used in regression problems in the econometrics domain.

  • Linear models are widely used for continuous dependent variables. Two specific techniques in the linear models family are simple and multiple regression. Simple linear regression has one dependent variable explained by one independent variable (e.g. weight vs height). Multiple linear regression, on the other hand, has more than one explanatory variable (e.g. weight explained by height and age). Regularized variants of linear models include Ridge and LASSO regression.
Conceptual formulation of a multiple regression model (source: author)
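  • Panel data models are specialized regression techniques for panel (longitudinal) data, in which the same units (e.g. individuals, firms, countries) are observed over multiple time periods. Using both the cross-sectional and the time dimension makes it possible to control for unobserved differences between units. Some techniques used within panel data models are Pooled OLS (ordinary least squares), the Fixed Effects model and the Random Effects model.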
  • Count data models are used to model count data (e.g. number of crimes) as a function of covariates (e.g. unemployment, income). Ordinary linear regression is a poor fit because it can predict negative or non-integer values, which do not make sense for counts. Two methods for count data regression are Poisson and Negative Binomial regression.
  • Binary outcome models are used when the dependent variable is binary (e.g. yes/no, approve/not approve). It’s similar to two-class classification problems in machine learning. In econometrics, Logit and Probit models are applied to model binary outcomes.
  • GLM (Generalized Linear Models) are used where linear models fail, either because the outcome is count data or because it is continuous but not normally distributed. A GLM consists of three components: a) a random component, which is a probability distribution from the exponential family; b) a systematic component, which is a linear predictor; and c) a link function that connects the expected value of the outcome to the linear predictor.
  • VAR (Vector Autoregression): As the name suggests, autoregression is a regression of a variable on itself, i.e. on its own past values. In this case, the independent variables are the past values of the same univariate data series that’s being predicted. Vector Autoregression generalizes this univariate concept and allows for the inclusion of additional correlated variables in the model. In this process, each variable is forecasted using its own past (lag) values as well as the lag values of the other variables in the system. For example, if the population of a county is to be forecasted for the year 2050, the conceptual framework of VAR is as follows:
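
The original diagram is not reproduced here. As a rough illustration only, assuming a single lag and one hypothetical companion series (county employment $E_t$, which is my assumption and not something named above) alongside population $P_t$, a two-variable VAR(1) could be written as:

```latex
\begin{aligned}
P_t &= c_1 + a_{11} P_{t-1} + a_{12} E_{t-1} + \varepsilon_{1,t} \\
E_t &= c_2 + a_{21} P_{t-1} + a_{22} E_{t-1} + \varepsilon_{2,t}
\end{aligned}
```

Each equation regresses one series on the lagged values of both series. A forecast for 2050 would then come from iterating the fitted system forward from the latest observations.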

Like regression, forecasting is also a big, well-researched topic. Not surprisingly, there is a rich forecasting toolbox with many different options for data scientists to choose from. Again, instead of going deep into theory, I’m going to focus on the commonly used tools and techniques applied within the econometrics domain. More often than not, these techniques are closely related to each other, and limitations in one technique led to the development of another.

  • Benchmark forecasting: These models are collectively known as “benchmark” or “baseline” forecasting. These techniques are rarely applied on their own in practice, but they help build forecasting intuition upon which to add additional layers of complexity. Some techniques in benchmark forecasting are: Naive, Seasonal, Mean, Seasonal naive, Drift, Linear trend, Random walk and Geometric random walk.
Mean and drift model with respect to training and testing data (source: author)
  • Exponential Smoothing: If decomposed, a time series will disaggregate into 3 components: trend, seasonality, and a remainder that is ideally white noise (i.e., random data points). For forecasting purposes, we can predict the predictable components (i.e., trend and seasonality), but not the unpredictable terms which occur in a random fashion. Exponential smoothing can handle this kind of variability within a series by smoothing out white noise. Some variants of exponential smoothing are: Simple Exponential Smoothing, Holt’s Linear Trend, and Holt-Winters Exponential Smoothing.
  • ARIMA: It represents a suite of models closely related to each other. Autoregressive Integrated Moving Average (ARIMA) is arguably the most popular and widely used statistical technique for forecasting. As the name suggests, ARIMA has 3 components: a) an Autoregressive component to model the relationship between the series and its lagged values; b) a Moving Average component that predicts future values as a function of lagged forecast errors; and c) an Integrated component that differences the series to make it stationary. Some additional variants of this modeling suite are: SARIMA, ARIMAX, SARIMAX etc. (I call them “all caps forecasting models”!)
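
To give a feel for the simplest items in the list above, here is a minimal sketch of three benchmark forecasts, simple exponential smoothing and first differencing (the operation behind the “I” in ARIMA). All function names and the toy series are hypothetical, and initializing the smoothing level with the first observation is one common convention among several.

```python
def naive_forecast(series):
    """Repeat the last observed value."""
    return series[-1]

def mean_forecast(series):
    """Forecast the historical average."""
    return sum(series) / len(series)

def drift_forecast(series, horizon):
    """Extend the line joining the first and last observations."""
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return series[-1] + horizon * slope

def simple_exp_smoothing(series, alpha):
    """One-step-ahead forecast; the level starts at the first observation."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def difference(series):
    """First difference: change between consecutive observations."""
    return [b - a for a, b in zip(series, series[1:])]

history = [10, 12, 14, 16, 18]  # hypothetical observations with a steady trend

print("naive:", naive_forecast(history))
print("mean:", mean_forecast(history))
print("drift (2 steps ahead):", drift_forecast(history, horizon=2))
print("SES (alpha=0.5):", simple_exp_smoothing(history, alpha=0.5))
print("differenced:", difference(history))
```

On this trending series the mean and smoothing forecasts lag behind the data, the drift forecast follows the trend, and differencing turns the trend into a constant series.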
