To Balance or Not to Balance?



Determining the causal effects of an action, which we call treatment, on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, so its mechanism is unknown and needs to be estimated. This can be done through estimation of a quantity known as the propensity score, defined as the probability of receiving treatment within strata of the observed covariates.

There are two types of estimation method for propensity scores. The first tries to predict treatment as accurately as possible. The second tries to balance the distribution of predictors evenly between the treatment and control groups. The two approaches are related, because different predictor values among treated and control units could be used to better predict treatment status. In this post we discuss issues related to these goals and the specification of loss functions for the two objectives, and we compare both methods via simulation.

We focus on an inverse propensity score weighted estimator for the causal effect. For this estimator, predicting the propensity score accurately is much more important than achieving the balancing property. This is because the estimator is not constructed using the balancing property but relies wholly on reweighting using the propensity score. Other estimators, such as those based on matching and subclassification, may benefit from the balancing property, but the discussion of those estimators is postponed to a later post.

The Fundamental Problem of Causal Inference

The primary goal of many data analysis efforts is to estimate the effect of an intervention on the distribution of some outcome. For example, imagine that you are working for a car manufacturer. Your company has recently launched a new pickup truck, along with the corresponding online advertisement campaign. You are in charge of assessing whether the campaign had an impact on sales. To do this, you have a data set at the person level containing, among other variables, an indicator of ad exposure and whether the person bought the truck. A naïve way to solve this problem would be to compare the proportion of buyers between the exposed and unexposed groups, using a simple test for equality of means. Although it may seem sensible at first, this solution can be incorrect if the data suffer from selection bias.

To illustrate the concept of selection bias, consider a group of users who have searched for pickup trucks on Google and a group who have not. Individuals in the former group are more likely to be exposed to an ad for pickup trucks. However, they are also more likely to buy a pickup truck regardless of ad exposure, because we already know that they are interested in pickup trucks given that they have done online research on them. A naïve comparison of the exposed and unexposed groups would produce an overly optimistic measurement of the effect of the ad, since the exposed group has a higher baseline likelihood of purchasing a pickup truck.
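To make the selection bias concrete, here is a minimal simulation sketch in which the ad has no effect at all on purchases. The shares and probabilities (30% searchers, exposure rates of 0.8 and 0.2, purchase rates of 0.10 and 0.01) and the function name `naive_gap` are made up for illustration and do not come from any real campaign.

```python
import random

def naive_gap(n=100_000, seed=0):
    """Exposed-minus-unexposed purchase rate when the ad has no causal effect."""
    rng = random.Random(seed)
    buys = {True: 0, False: 0}    # purchases by exposure status
    counts = {True: 0, False: 0}  # users by exposure status
    for _ in range(n):
        searched = rng.random() < 0.3                         # hypothetical share of searchers
        exposed = rng.random() < (0.8 if searched else 0.2)   # searchers see the ad more often
        bought = rng.random() < (0.10 if searched else 0.01)  # purchase ignores the ad entirely
        counts[exposed] += 1
        buys[exposed] += bought
    return buys[True] / counts[True] - buys[False] / counts[False]

print(f"naive exposed-minus-unexposed gap: {naive_gap():.3f}")
```

Under these assumed numbers the naive gap comes out clearly positive even though exposure never enters the purchase probability, which is exactly the overly optimistic measurement described above.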

There are two ways of getting around the selection bias problem: (i) randomize ad exposure, and (ii) use an analysis for observational data. In a randomized trial, we start with a random sample of the population of target individuals, and then assign them to one of two treatment arms ‘randomly’ (that is, independent of all other observed or unobserved factors). Randomized studies are considered the gold standard for answering causal inference questions, since they guarantee that the two groups differ only in their treatment assignment and have the same distribution of all observed and unobserved pre-treatment variables. Unfortunately, randomization is not possible in many situations. For example, in clinical trials randomization may be unethical; in economics studies randomization is infeasible in practice; and in marketing studies the potential cost of a lost business opportunity can make randomization unattractive.


We now discuss formally the statistical problem of causal inference. We start by describing the problem using standard statistical notation. For a random sample of units, indexed by $i = 1, \ldots, n$, we have

  • $T_i$ as the binary treatment assignment variable,
  • $Y_{i1}$ and $Y_{i0}$ as the potential outcomes, and
  • $X_i$ as the vector of $p$ confounders.

To simplify notation, since the data are assumed i.i.d., we drop the $i$ index. For $t=0,1$, the potential outcome $Y_t$ is defined as the outcome observed in a hypothetical world in which $P(T=t)=1$ (e.g., $Y_1$ is the outcome that would have been observed if everyone had been exposed to the ad). We are interested in estimating the average effect of treatment on the treated, or ATT, defined as $\delta = E(Y_{1}-Y_{0} \mid T=1)$. This quantity is interpreted as the expected change in the outcome caused by treatment, where the expectation is taken over the distribution of the potential outcomes among treated units.

Because potential outcomes are unobserved, we need to link the distribution of $Y_t$ to the distribution of the observed data $(X, T, Y)$. A formula linking these distributions is called an identifiability result. Identifiability allows us to estimate the moments of $Y_t$ as functionals of the distribution of the observed data.

The ATT can be estimated from the observed data if all three of the following identifiability assumptions hold:

  1. $Y_t=Y$ in the event $T=t$, stating that the observed outcome is congruent with the potential outcomes $(Y_0,Y_1)$, and
  2. the assignment mechanism is *strongly ignorable*, that is $Y_0 \perp T \mid X$, which roughly states that all common causes of $T$ and $Y$ are measured and contained in $X$, and
  3. $P(T=0 \mid X=x) > 0$ for all $x$ where $P(X=x)>0$,

stating that there was ‘enough experimentation’ in the observational study. This is often called the positivity assumption.

The first assumption makes clear the fundamental problem of causal inference: for any given unit we observe at most one of the potential outcomes, $Y=T\times Y_1 +(1-T)\times Y_0$. A violation of the second assumption formalizes the concept of selection bias. In our pickup truck example, $T$ indicates exposure to the ad, and $Y$ indicates purchase of a pickup truck. Failure to include an indicator of search for the term “pickup truck” in $X$ would be a violation of strong ignorability. Identifiability assumptions are met by design in a randomized trial for every $X$, including $X=\emptyset$.

These assumptions allow the identifiability of $E(Y_1 \mid T=1)$ and $E(Y_0 \mid T=1)$, both quantities necessary for estimating $\delta$. Identifiability of $E(Y_1 \mid T=1)$ is straightforward: $E(Y_1 \mid T=1) = E(Y \mid T=1)$ because $Y=Y_1$ in the event $T=1$. Identifiability of $\psi = E(Y_0 \mid T=1)$ is somewhat harder. We start by noting that
$$E(Y_0 \mid T=1, X)=E(Y_0 \mid T=0, X),$$
because of strong ignorability. Then, $E(Y_0 \mid T=0, X) = E(Y\mid T=0, X)$, because of congruence of the potential and observed outcomes. The latter conditional expectation is well defined because of the positivity assumption. We can now use the law of iterated expectation to see that

$\psi = E(Y_0 \mid T=1)$
  $= E\{E(Y_0 \mid T=1, X)\mid T=1\}$
  $= E\{E(Y_0 \mid T=0, X)\mid T=1\}$
  $= E\{E(Y \mid T=0, X)\mid T=1\}$
  $= E\{\mu_0(X)\mid T=1\},$

where $\mu_0(x)$ is a function of $x$ denoting the expected value of the outcome under $T=0$ and $X=x$. This formula is the identifiability result that we mentioned earlier, and allows us to estimate $\delta$ solely from the observed data $(X, T, Y)$. Note that $\mu_0$ is the outcome expectation among the control units. This formula uses $\mu_0$ to predict the (unobserved) outcomes of the treated group, had they, contrary to fact, been in the control group. The outer expectation takes the average of those predicted values over all treated units.
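As an illustration of the identification formula $\psi = E\{\mu_0(X)\mid T=1\}$, the sketch below handles the simplest possible case, a single discrete covariate, where $\mu_0(x)$ is just the control-group mean of $Y$ within each stratum. The function name `psi_by_stratification` and the `(x, t, y)` record layout are assumptions made for this example only.

```python
from collections import defaultdict

def psi_by_stratification(data):
    """Estimate psi = E{mu_0(X) | T=1} from (x, t, y) records with discrete x."""
    control_sum = defaultdict(float)  # sum of Y among controls, per stratum
    control_n = defaultdict(int)      # number of controls, per stratum
    treated_n = defaultdict(int)      # number of treated units, per stratum
    for x, t, y in data:
        if t == 0:
            control_sum[x] += y
            control_n[x] += 1
        else:
            treated_n[x] += 1
    total_treated = sum(treated_n.values())
    psi = 0.0
    for x, n_treated in treated_n.items():
        if control_n[x] == 0:
            # Positivity fails: no control unit is available to estimate mu_0(x).
            raise ValueError(f"no control units in stratum {x!r}")
        mu0_x = control_sum[x] / control_n[x]    # control mean in the stratum
        psi += mu0_x * n_treated / total_treated  # average over the treated units
    return psi

records = [("a", 0, 1.0), ("a", 0, 3.0), ("a", 1, 10.0),
           ("b", 0, 5.0), ("b", 1, 7.0), ("b", 1, 8.0), ("b", 1, 9.0)]
print(psi_by_stratification(records))
```

The explicit error when a stratum contains treated units but no controls mirrors the role of the positivity assumption: without it, $\mu_0(x)$ is not defined for some of the treated units.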

Identification Using the Balancing Property

Let us use $e(x)$ to denote the propensity score $P(T=1\mid X=x)$, following the convention in the propensity score literature.

A balancing score is any function $b(x)$ satisfying $X\perp T\mid b(X)$. Using this property, simple algebra shows an equivalent identification result:
$$\psi = E\{E(Y \mid T=0, b(X))\mid T=1\}.$$
This and other features of balancing scores are discussed in a seminal paper by Rosenbaum & Rubin (1983).

Clearly, $b(x)=x$ is a balancing score. However, $b(x)=x$ is not a very useful balancing score since it does not help alleviate the curse of dimensionality. A more useful balancing score is the propensity score $e(x)$. Since $e(x)$ is univariate, $\psi$ may be easily estimated by matching, subclassification, and other non-parametric estimation methods that adjust for an estimate of $e(x)$. Estimators that use this idea are said to be using the balancing property of the propensity score.

An important observation arising from this is that, for estimators of $\delta$ using the balancing property, we do not require consistent estimation of the propensity score but only of a balancing score. On the other hand, the reweighted estimator we describe next does require consistent estimation of the propensity score. This apparently trivial observation is often overlooked and can be a source of confusion for data analysts working on causal inference analyses.

Note that $\delta=E(Y\mid T=1)-\psi$. The expectation $E(Y\mid T=1)$ can be easily estimated with the sample mean of the outcome among the exposed units. In the simulation section we use this empirical mean estimator. Next we discuss estimators of $\psi$, which can be built from the propensity score using the methods described in the following sections.

The Propensity Score Weighted Estimator

Writing the expectations as integrals plus some additional algebra shows that
$$\psi = E\{E(Y \mid T=0, X)\mid T=1\} = \frac{E\{W Y\mid T=0\}}{E(W\mid T=0)},$$
with $W=e(X)/(1-e(X)).$ A natural estimator for $\psi$ is then given by
$$\hat \psi = \frac{\sum_{i:T_i=0} \hat W_i Y_i}{\sum_{i:T_i=0} \hat W_i},$$
where $\hat W_i = \hat e(X_i)/(1-\hat e(X_i))$, and $\hat e(x)$ denotes an estimator of the propensity score.
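The weighted estimator translates almost line by line into code. The following is a minimal sketch that takes the estimated propensity scores $\hat e(X_i)$ as given; the function names `ipw_psi` and `att_estimate` are chosen here for illustration.

```python
def ipw_psi(y, t, e_hat):
    """Weighted control mean: sum(W_i * Y_i) / sum(W_i) over units with T_i = 0."""
    numerator = 0.0
    denominator = 0.0
    for y_i, t_i, e_i in zip(y, t, e_hat):
        if t_i == 0:
            w_i = e_i / (1.0 - e_i)  # odds of treatment, W_i = e / (1 - e)
            numerator += w_i * y_i
            denominator += w_i
    return numerator / denominator

def att_estimate(y, t, e_hat):
    """Estimate delta as the treated sample mean of Y minus the weighted control mean."""
    treated = [y_i for y_i, t_i in zip(y, t) if t_i == 1]
    return sum(treated) / len(treated) - ipw_psi(y, t, e_hat)

# Tiny hand-checkable example: control weights are 1 and 3.
print(att_estimate(y=[1.0, 3.0, 10.0, 20.0], t=[0, 0, 1, 1], e_hat=[0.5, 0.75, 0.5, 0.5]))
```

Control units with a high estimated propensity score look most like treated units, so they receive the largest weights. Estimated scores very close to one make those weights explode, which is one practical reason the quality of $\hat e$ matters so much for this estimator.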

It can be easily checked that if $\hat e$ is consistent, so is $\hat \psi$. Then, because $E(Y\mid T=1)$ can be estimated consistently, the estimator of $\delta=E(Y\mid T=1)-\psi$ will be consistent as well. It is therefore imperative to put all efforts into obtaining consistency of $\hat e$. In this post we review and compare two types of method for estimating the propensity score: methods based on predictive accuracy and methods based on balancing the covariate distribution between treated and control units. Based on our results and the above discussion we argue that predictive accuracy, rather than balance, should be the criterion guiding the choice of method.

It should be noted that inverse probability weighting is not generally optimal (i.e., efficient), and doubly robust estimators such as the augmented IPW and the TMLE provide an opportunity to achieve the non-parametric efficiency bound (Hahn, 1998). Since our interest is to evaluate estimators for the propensity score as they relate to estimation of the causal effect, we focus on the reweighting method described above.

Estimating the Propensity Score

We now introduce more formally the main dichotomy of this post: predictive accuracy vs covariate balance. We do this by describing the methods in terms of loss functions whose expectation is optimized at the true value of the propensity score.

Methods Based on Predictive Power

Let $\mathcal{F}$ be the space of all functions of $x$ bounded between zero and one, and let $f$ be a generic function in that space. We say that a loss function $L$ is valid for the propensity score if the following holds:
$$ \arg\min_{f\in \mathcal{F}} E(L(f; M)) = e, $$
where $M = (X, T)$. That is, a loss function $L$ is valid if the expected loss is minimized at the true propensity score $e$. Examples of valid loss functions are the $L^2$ and negative log-likelihood loss functions given by

  $L(f;m) = (t-f(x))^2,$ and
  $L(f;m) = -t\log(f(x))-(1-t)\log(1-f(x)),$

respectively. Though theoretically valid, the $L^2$ loss function is known to underperform compared to the log-likelihood.
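The validity of both losses can be checked numerically at a single covariate value. The sketch below fixes a true propensity $e(x)$, computes the expected loss of a candidate prediction $f(x)$ when $T \sim \text{Bernoulli}(e(x))$, and searches a grid for the minimizer. The grid search and the helper names are illustrative assumptions, not part of any estimation method discussed here.

```python
import math

def l2_loss(f_x, t):
    return (t - f_x) ** 2

def nll_loss(f_x, t):
    return -t * math.log(f_x) - (1 - t) * math.log(1 - f_x)

def expected_loss(loss, f_x, e_x):
    """Expected loss at a fixed x when T ~ Bernoulli(e_x) and the candidate predicts f_x."""
    return e_x * loss(f_x, 1) + (1 - e_x) * loss(f_x, 0)

def best_on_grid(loss, e_x, steps=100):
    """Candidate value in {1/steps, ..., (steps-1)/steps} with the smallest expected loss."""
    grid = [k / steps for k in range(1, steps)]
    return min(grid, key=lambda f_x: expected_loss(loss, f_x, e_x))

for name, loss in [("L2", l2_loss), ("negative log-likelihood", nll_loss)]:
    print(name, best_on_grid(loss, e_x=0.3))
```

Both searches should land on the true value $e(x)=0.3$, which is what it means for these losses to be valid for the propensity score.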

The choice of space $\mathcal{F}$ (sometimes called the model) and loss function $L$ explicitly defines the estimation problem. For example, logistic regression is based on the negative log-likelihood loss function and the space of logistic functions
$$ \mathcal{F}_\textrm{logistic} = \bigg\{f(x) = \frac{1}{1 + \exp(-x'\beta)}:
\beta \in \mathbb{R}^p \bigg\}.  $$
A model like $\mathcal{F}_\textrm{logistic}$, indexed by a Euclidean parameter, is often called a parametric model. Although desirable, optimization in the full space $\mathcal{F}$ is not possible in practice when $x$ contains continuous variables, or its dimension is large (cf. the curse of dimensionality). On the other hand, restriction to more tractable spaces such as $\mathcal{F}_\textrm{logistic}$ is well known to lead to the issue of model misspecification, which occurs when the space considered does not contain the propensity score. In the presence of model misspecification, the estimator $\hat\psi$ is inconsistent.

The field of statistical machine learning provides a solution to this problem, allowing exploration of larger spaces. For example, a tree regression algorithm uses
$$\mathcal{F}_\textrm{tree} = \bigg\{\sum_j^J c_j I(x \in D_j): c_j \in
\mathbb{R}, D_j\bigg\},$$
where $c_j$ are constants, $I(A)$ is the indicator function returning one if $A$ is true and zero otherwise, and $D_j$ are disjoint partitions of the covariate space. Another example, given by multivariate adaptive regression splines (MARS), uses
$$\mathcal{F}_\textrm{mars} = \bigg\{f(x) = \sum_j^J\beta_jB_j(x): \beta \in
\mathbb{R}^p; B_j\bigg\},$$
where $B_j$ are basis functions of the form of hinge functions $\max(0, x_j - c_j)$, and $c_j$ are tuning parameters. These spaces are larger than $\mathcal{F}_\textrm{logistic}$ above. In the absence of domain-specific scientific knowledge supporting the use of a parametric model, these data-adaptive methods have a better chance of consistently estimating the propensity score. Choosing the tuning parameters for data-adaptive methods such as regression trees and MARS is the subject of numerous research articles and books.

We chose the tree regression and MARS estimators for illustration purposes only; other off-the-shelf prediction methods include random forests, support vector machines, generalized boosted regression models, neural networks, $k$ nearest neighbors, regularized regression, and many more. A good review of statistical learning methods may be found in Friedman et al. (2001).

Model Stacking – Super Learner (SL)

When using predictive power as a criterion, the question arises of how to select among the many prediction methods available in the statistical learning literature. We approach this question with a data-adaptive mindset that involves the following steps (see van der Laan et al., 2007):

  1. Propose a finite collection $\mathcal L=\{\hat e_k:k=1,\ldots,K\}$ of estimation algorithms. An estimation algorithm is a procedure that takes a training data set $\mathcal T=\{M_i: i=1,\ldots, n\}$ and outputs a function $\hat e_k(x)$.
  2. Consider an ensemble learner of the type $\hat e_\alpha(x) = \sum_{k=1}^K \alpha_k \hat e_k(x),\quad\text{for}\quad 0 \leq \alpha_k\leq 1, \quad \sum_{k=1}^K\alpha_k=1.$
  3. Let $\mathcal V_1, \ldots, \mathcal V_J$ denote a partition of the index set $\{1,\ldots, n\}$ into sets of approximately the same size. In addition, for each $j$, let the associated training set be defined as $\mathcal{T}_j = \{1,\ldots, n\}\backslash \mathcal V_j$. For fixed weights $\alpha$, denote by $\hat e_{\alpha, \mathcal T_j}$ the ensemble trained using only data in $\mathcal T_j$. Choose the weights $\alpha$ that minimize the cross-validated risk: $\hat\alpha =\arg\min_{\alpha} \frac{1}{J}\sum_{j=1}^J\frac{1}{|\mathcal V_j|}\sum_{i\in \mathcal V_j} L(\hat e_{\alpha, \mathcal T_j}; M_i)$ subject to $0 \leq \alpha_k\leq 1, \sum_{k=1}^K\alpha_k=1,$ and define the final estimator as $\hat e_{\hat\alpha}(x)$. This algorithm is implemented in the SuperLearner R package (Polley & van der Laan, 2014).
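The three steps above can be sketched in a few lines for the special case of $K=2$ candidate learners, one covariate, and a grid search over the weight in place of a constrained optimizer. This is a toy illustration of the cross-validated weighting idea, not the SuperLearner package's implementation; the two learners (`fit_constant`, `fit_split_at_zero`), the simulated data, and all other names are hypothetical.

```python
import math
import random

def nll(p, t, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clip to keep the logarithms finite
    return -t * math.log(p) - (1 - t) * math.log(1 - p)

def fit_constant(train):
    """Learner 1: predict the overall treatment rate for every x."""
    rate = sum(t for _, t in train) / len(train)
    return lambda x: rate

def fit_split_at_zero(train):
    """Learner 2: separate treatment rates for x < 0 and x >= 0."""
    overall = sum(t for _, t in train) / len(train)
    low = [t for x, t in train if x < 0]
    high = [t for x, t in train if x >= 0]
    p_low = sum(low) / len(low) if low else overall
    p_high = sum(high) / len(high) if high else overall
    return lambda x: p_low if x < 0 else p_high

def super_learner_alpha(data, fit_a, fit_b, n_folds=5, grid_size=21):
    """Return (alpha_a, alpha_b) minimizing the cross-validated risk over a grid."""
    folds = [data[j::n_folds] for j in range(n_folds)]
    held_out = []  # per fold: (t, prediction_a, prediction_b) from fits on the other folds
    for j, fold in enumerate(folds):
        train = [m for i, other in enumerate(folds) if i != j for m in other]
        e_a, e_b = fit_a(train), fit_b(train)
        held_out.append([(t, e_a(x), e_b(x)) for x, t in fold])

    def cv_risk(alpha):
        fold_risks = [
            sum(nll(alpha * pa + (1 - alpha) * pb, t) for t, pa, pb in fold) / len(fold)
            for fold in held_out
        ]
        return sum(fold_risks) / len(fold_risks)

    best = min((k / (grid_size - 1) for k in range(grid_size)), key=cv_risk)
    return best, 1 - best

# Simulated example in which the true propensity is 0.2 for x < 0 and 0.8 otherwise.
rng = random.Random(0)
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    data.append((x, 1 if rng.random() < (0.8 if x >= 0 else 0.2) else 0))
alpha_constant, alpha_split = super_learner_alpha(data, fit_constant, fit_split_at_zero)
print(alpha_constant, alpha_split)
```

Because the second learner's space contains the true propensity score in this made-up example, the cross-validated risk should place most of the weight on it. With more than two learners the grid search would be replaced by a proper optimizer over the simplex.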

It is clear that this Super Learner explores a much larger space than any $\mathcal F_k$ explored by each of the individual learners. As a consequence, it has more chances of containing the propensity score than any given learner, and we expect it to perform better asymptotically. In fact, it has been shown that this cross-validation scheme will select the best estimator as the sample size increases (van der Laan et al., 2007).

Methods Based on the Balancing Property

Using Adam's law (the law of iterated expectations), it can be shown that
$$ E\bigg\{(1-T)\frac{e(X)}{1-e(X)}c(X)\bigg\} =
E\{Tc(X)\}, \text{ for all functions } c(x).$$
Here $c(x)$ is any function of $x$. Estimation methods that use this property are called covariate balancing, since they ensure that proper reweighting leads to the same distribution in the treated and control groups.

As a result of the balancing property, any estimator $\hat e$ satisfying
$$\sum_{i:T_i=0}\frac{\hat e(X_i)}{1-\hat e(X_i)}c(X_i) = \sum_{i:T_i=1}c(X_i),
\text{ for all functions } c(x)$$
can be expected to be a consistent estimator of $e(x)$. Because it has to hold for all functions $c(x)$, this balance condition is analogous to performing a search in the full space $\mathcal F$, and is impossible to achieve in practice. Methods that aim to achieve balance focus on a user-given set of functions $c_1,\ldots, c_J$. The task of choosing the correct functions $c_j$ is akin to the task of specifying the correct functional form in a parametric model. As a result, estimators that focus on covariate balancing are also susceptible to being inconsistent because of model misspecification.
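For a given set of functions $c_j$, the sample balance condition is easy to check. The sketch below returns, for each $c_j$, the weighted control sum minus the treated sum, so a value of zero means exact balance for that function. The name `balance_gaps` and the toy data are assumptions for illustration.

```python
def balance_gaps(x, t, e_hat, funcs):
    """For each c in funcs: sum over controls of e/(1-e) * c(x) minus sum over treated of c(x)."""
    gaps = []
    for c in funcs:
        weighted_controls = sum(
            e_i / (1.0 - e_i) * c(x_i) for x_i, t_i, e_i in zip(x, t, e_hat) if t_i == 0
        )
        treated = sum(c(x_i) for x_i, t_i in zip(x, t) if t_i == 1)
        gaps.append(weighted_controls - treated)
    return gaps

# Toy data: half of the units with x = 0 are treated, three quarters of those with x = 1.
x = [0, 0, 0, 0, 1, 1, 1, 1]
t = [0, 0, 1, 1, 0, 1, 1, 1]
funcs = [lambda v: 1.0, lambda v: float(v)]  # balance the total weight and the mean of x

print(balance_gaps(x, t, [0.5] * 4 + [0.75] * 4, funcs))  # stratum-specific scores
print(balance_gaps(x, t, [0.5] * 8, funcs))               # a score that ignores x
```

In this toy data set the stratum-specific scores balance both functions exactly, while the score that ignores $x$ does not. Passing such a check for a handful of functions says nothing about the functions that were not checked, which is the misspecification risk described above.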

Now let's take a look at two methods that use covariate balancing to estimate the propensity score: entropy balancing and covariate balancing propensity score.

Entropy Balancing (EB)

Entropy balancing (Hainmueller, 2012) is a method that directly estimates the weights $W_i$, rather than the propensity score, by solving the following optimization problem:
$$ \hat W = \arg\min_W \sum_{i:T_i=0}W_i\log W_i $$
subject to
$$\frac{1}{n_0}\sum_{i:T_i=0}W_ic_j(X_i) = \frac{1}{n_1}\sum_{i:T_i=1}c_j(X_i),
\text{ for a set of functions } c_j:j\in\{1,\ldots, J\}.$$
The above constrained optimization problem minimizes the entropy loss $\sum_{i:T_i=0}W_i\log W_i$ to obtain weights that satisfy the balance conditions for the user-specified covariate functions $c_j$. These functions are chosen by the user, and are generally given by lower order polynomials. Because covariate balance can only be achieved for a finite (often small) number of functions $c_j$, this method will be inconsistent unless the correct functions $c_j$ are specified. In fact, Hainmueller (2012) shows that entropy balancing is equivalent to estimating the weights as a log-linear model of the covariate functions $c_j(X)$.

See Hainmueller (2012) and the work of Zhao & Percival (2015) for more details on how this optimization problem is solved, and for further discussion.
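To give a feel for the log-linear form of the weights, here is a one-covariate sketch that balances only the mean, i.e. $c(x)=x$. It assumes weights of the form $W_i \propto \exp(\lambda X_i)$ normalized to sum to one over the controls, and finds $\lambda$ by bisection so that the weighted control mean matches the treated mean. This is a simplification made for illustration, not the algorithm from the papers cited above, and the function names are hypothetical.

```python
import math

def tilted_weights(x_control, lam):
    """Normalized weights proportional to exp(lam * x) over the control units."""
    shift = max(lam * x for x in x_control)  # subtract the max for numerical stability
    raw = [math.exp(lam * x - shift) for x in x_control]
    total = sum(raw)
    return [r / total for r in raw]

def entropy_balance_1d(x_control, treated_mean, tol=1e-12):
    """Control weights whose weighted mean of x equals treated_mean."""
    if not min(x_control) < treated_mean < max(x_control):
        raise ValueError("treated mean must lie strictly inside the control range")

    def weighted_mean(lam):
        return sum(w * x for w, x in zip(tilted_weights(x_control, lam), x_control))

    lo, hi = -1.0, 1.0
    while weighted_mean(lo) > treated_mean:  # widen the bracket to the left
        lo *= 2
    while weighted_mean(hi) < treated_mean:  # widen the bracket to the right
        hi *= 2
    for _ in range(200):  # the weighted mean is increasing in lam, so bisection works
        mid = (lo + hi) / 2
        if weighted_mean(mid) < treated_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilted_weights(x_control, (lo + hi) / 2)

weights = entropy_balance_1d([0.0, 1.0, 2.0, 3.0], treated_mean=2.0)
print([round(w, 3) for w in weights])
```

The resulting weights reproduce the treated mean of $x$ by construction, whatever the true propensity score looks like. That is why balance on the chosen functions alone is not evidence that the implied propensity score is well estimated.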

Covariate Balancing Propensity Score (CBPS)

This method proceeds by specifying a parametric form for the propensity score, and then optimizing the negative log-likelihood loss, constrained to covariate balance on a set of functions $c_j(x)$. For example, consider the logistic model $\mathcal{F}_\textrm{logistic}$ defined above. CBPS proceeds by solving the following optimization problem:
$$ e = \arg\min_{f\in \mathcal{F}_\textrm{logistic}} E(L(f;M)), $$
subject to
$$\sum_{i:T_i=0}\frac{\hat f(X_i)}{1-\hat f(X_i)}c_j(X_i) =
\sum_{i:T_i=1}c_j(X_i), \text{ for a set of functions } c_j:j\in\{1,\ldots, J\},$$
where $L$ is the negative log-likelihood loss $L(f;m)=-t\log(f(x))-(1-t)\log(1-f(x))$. The constraint set of this method can be seen to be equivalent to the constraint of EB; only the loss function being optimized changes. This problem may be solved using empirical likelihood (Qin & Lawless, 1994) or other methods. For a more detailed discussion the reader is referred to the original research article by Imai & Ratkovic (2014).

Simulation Study

We compare the various methods for estimating the propensity score in a simulation study. We use performance metrics such as bias and mean squared error for the estimation of $\delta$, our causal estimand of interest, defined as the average effect of treatment on the treated. The methods used are:

Covariate balance:

  • EB.
  • CBPS (exact and over-parameterized).

Predictive accuracy:

  • Multivariate adaptive regression splines (MARS) (Friedman, 1991).
  • Random forest with default R tuning parameters (Breiman, 2001).
  • Support vector machines with linear kernel (Cortes & Vapnik, 1995).
  • The MLE in a logistic regression model.
  • Bayesian logistic regression assuming a diffuse normal prior with mean zero.
  • $L^1$ regularized logistic regression (Lasso) (Tibshirani, 1996).
  • SL (discrete and full), where the full method is described above and the discrete method does not weight each algorithm but chooses the best one under cross-validation for predictions (van der Laan et al., 2007).

We use the data generating mechanism described below. This simulation scheme was first used in an influential paper by Kang & Schafer (2007), and became a standard for evaluating estimators of causal estimands. Kang & Schafer use this data generating mechanism to illustrate bias arising in estimation of an outcome mean under informative nonresponse. The issues arising in their problem are identical in nature to issues arising in estimation of causal effects, so their setup is well suited to illustrating our points. You can read the original research paper to find out more.

The true set of covariates is generated independently and identically distributed, from the following distribution:
$$(Z_{1}, Z_{2}, Z_{3}, Z_{4}) \sim N_4(0, I_4)$$
where $I_4$ is the four-dimensional identity matrix. The outcome is generated as
$$ Y = 210 + 27.4 Z_{1} + 13.7 Z_{2} + 13.7Z_{3} + 13.7 Z_{4} + \epsilon $$
where $\epsilon \sim N(0, 1)$. The propensity score is defined as
$$ P(T = 1 \mid Z = z) = \text{expit}(-z_{1} + 0.5 z_{2} - 0.25z_{3} - 0.1 z_{4}), $$
where expit is the inverse-logit transformation. Note that $\delta = 0$, while $E(Y \mid T=1)=200$ and $E(Y \mid T=0)=220$, demonstrating the selection bias.
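The data generating mechanism is simple enough to simulate with the standard library. The sketch below draws units from it and compares the raw outcome means of the two groups. The function names, sample size, and seed are arbitrary choices made for this illustration.

```python
import math
import random

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def draw_unit(rng):
    """Draw (z, t, y) for one unit from the data generating mechanism above."""
    z = [rng.gauss(0.0, 1.0) for _ in range(4)]
    y = 210 + 27.4 * z[0] + 13.7 * z[1] + 13.7 * z[2] + 13.7 * z[3] + rng.gauss(0.0, 1.0)
    e = expit(-z[0] + 0.5 * z[1] - 0.25 * z[2] - 0.1 * z[3])
    t = 1 if rng.random() < e else 0  # y never depends on t, so delta = 0
    return z, t, y

def group_means(n=20_000, seed=0):
    """Sample means of Y among control (index 0) and treated (index 1) units."""
    rng = random.Random(seed)
    sums, counts = [0.0, 0.0], [0, 0]
    for _ in range(n):
        _, t, y = draw_unit(rng)
        sums[t] += y
        counts[t] += 1
    return sums[0] / counts[0], sums[1] / counts[1]

control_mean, treated_mean = group_means()
print(f"mean of Y among controls: {control_mean:.1f}, among treated: {treated_mean:.1f}")
```

The two sample means should land near 220 and 200 respectively, so a naive comparison would report a sizable negative effect even though the treatment never enters the outcome equation.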

In the simulation we will examine the performance of the above algorithms under the correctly specified propensity model from the data generation process and under a misspecified model. Misspecification occurs when the following transformations are observed in place of the true covariates:

  $x_{1} = \exp(z_1/2),$
  $x_{2} = z_{2}/(1 + \exp(z_{1})),$
  $x_{3} = (z_{1} z_{3}/25 + 0.6)^3,$
  $x_{4} = (z_{2} + z_{4} + 20)^2.$

These transformations were proposed in the original paper in order to mirror a real-life problem. It is very unlikely that a researcher observing $(X_{1}, X_{2}, X_{3}, X_{4})$ would specify the correct model reflecting these transformations. Nonetheless, $X$ contains all the relevant information and strong ignorability holds. Without the correct functional form, a researcher using parametric methods is guaranteed to fit a misspecified model. This fact is well documented by Kang & Schafer (2007). The case where only the $X$s are observed is the most interesting since it exemplifies most, if not all, real data analysis problems.
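For reference, the four transformations as written in this post can be coded directly. The function name `transform_covariates` is a hypothetical helper introduced here.

```python
import math

def transform_covariates(z):
    """Map the true covariates (z1, z2, z3, z4) to the observed (x1, x2, x3, x4)."""
    z1, z2, z3, z4 = z
    x1 = math.exp(z1 / 2)
    x2 = z2 / (1 + math.exp(z1))
    x3 = (z1 * z3 / 25 + 0.6) ** 3
    x4 = (z2 + z4 + 20) ** 2
    return x1, x2, x3, x4

print(transform_covariates((0.0, 0.0, 0.0, 0.0)))
```

An analyst who only sees the output of this function has all the information needed for strong ignorability, but a logistic model that is linear in $x$ no longer matches the true propensity score.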

To compare methods we conduct a simulation study by generating $10,000$ datasets according to the above data generation process. Then, for each dataset we estimate the propensity score using the true covariates $Z$ and the transformed covariates $X$ according to the various estimators we are analyzing. Then we estimate $\delta$ and aggregate across simulations to examine the estimated absolute bias
$$\frac{1}{10^4} \bigg| \sum_{j = 1}^{10^4} (\hat\delta_j - \delta) \bigg|$$
and root mean square error (RMSE)
$$\sqrt{\frac{1}{10^4} \sum_{j = 1}^{10^4} (\hat \delta_j - \delta)^2}$$
of the different methods. These formulas represent Monte Carlo integrals that aim to approximate the true bias and RMSE of the estimators. The largest estimated error in these integrals was around 0.3, which is small relative to the bias and MSE sizes we see.
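The two Monte Carlo summaries are straightforward to compute from a list of estimates $\hat\delta_j$. The sketch below uses the number of supplied estimates in place of the fixed $10^4$; the function names are chosen for illustration.

```python
import math

def absolute_bias(estimates, delta):
    """Absolute value of the average estimation error across simulated datasets."""
    return abs(sum(d - delta for d in estimates)) / len(estimates)

def rmse(estimates, delta):
    """Root mean square error across simulated datasets."""
    return math.sqrt(sum((d - delta) ** 2 for d in estimates) / len(estimates))

toy_estimates = [1.0, -1.0, 2.0]  # made-up values, not simulation results
print(absolute_bias(toy_estimates, delta=0.0), rmse(toy_estimates, delta=0.0))
```

Since the squared RMSE equals the squared bias plus the variance of the estimator, comparing the two numbers shows how much of the error comes from bias.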

The algorithms in the plot are ordered according to the sum of the respective metrics weighted by the square root of the sample size considered. By doing this, we favor algorithms that perform better in larger datasets, in accordance with the statistical notion of consistency. Algorithms aimed at achieving balance in the covariates are shown in boldface.

The similarity between the RMSE and bias plots teaches us that much of the poor performance is driven by bias rather than variance.

As we stated before, the most interesting case for practitioners is when the transformed covariates are observed. In this case, MARS is the best estimator. The two versions of the Super Learner follow closely. This should not be taken as an argument in favor of using MARS in every practical problem. Rather, it is evidence of the importance of using a principled model selection tool, since no one algorithm should be expected to outperform all others uniformly across datasets and problems (cf. the no free lunch theorem). Likewise, methods based on random forests and SVM perform quite poorly. This may be surprising to some readers who know that these methods are data-adaptive and have seen them perform well in several applications. This is another demonstration of our claim that no method should be blindly trusted and all methods should be tested against the data for the specific problem at hand. We have seen applications in which the Lasso, random forests, or even a simple logistic regression outperforms all competitors in terms of predictive power.

Regarding covariate balance, we see that the two versions of the CBPS perform relatively well. CBPS is a GLM with an extra optimization constraint to satisfy covariate balance. In our simulation, this extra constraint did improve the bias and MSE compared to a standard GLM.

The entropy balanced estimator performs quite poorly. Note that EB is an estimator that achieves balance in the sample covariate means, and as such would always pass the routine validation tests advocated in the balancing score literature. In the case of EB it appears that the predictive accuracy of the propensity score is sacrificed to ensure covariate balance. As our reweighting estimator for $\psi$ is built on the premise of consistent propensity score estimation (rather than the balancing property), it is not surprising to see the poor performance of EB.

Interestingly, covariate balancing methods outperform all other methods, including a correct logistic regression model, when the correct covariates are observed. This is because the response is linear in the correct covariates and thus an estimator that ensures full balance on the true covariates automatically removes all the bias. However, this is merely a theoretical curiosity, as almost every practical situation belongs in the first panel of the plot rather than the second.


For our reweighting estimator we see that objective, flexible data-adaptive estimators for the propensity score generally perform best in the case of a misspecified model. This is because these algorithms are able to better explore the covariate space than their parametric counterparts and better recover, or at least approximate, the correct functional form.

EB and CBPS both require specifying functions of the covariates that need to be balanced. This can be very useful if the researcher has prior domain knowledge and knows which functions of the covariates affect the response and hence need to be balanced. This kind of a priori knowledge imposes a structure on the propensity score and thus is akin to knowing that the propensity score belongs to a parametric family of distributions. Unfortunately, objective knowledge of this kind is absent in most practical situations, particularly in high-dimensional problems.

To conclude, in the absence of subject-matter knowledge supporting the use of parametric functional forms for the propensity score and the balancing conditions, predictive accuracy should be used to select an estimator among a collection of candidates. This collection may include covariate balanced estimators, and should contain flexible data-adaptive methods capable of unveiling complex patterns in the data. In particular, we advocate the use of model stacking methods such as the Super Learner algorithm implemented in the SuperLearner R package.


Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.

Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.” Machine learning 20.3 (1995): 273-297.

Friedman, Jerome H. “Multivariate adaptive regression splines.” The annals of statistics (1991): 1-67.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “The elements of statistical learning.” Vol. 1. Springer, Berlin: Springer series in statistics, (2001)

Hahn, Jinyong. “On the role of the propensity score in efficient semiparametric estimation of average treatment effects.” Econometrica (1998): 315-331.

Hainmueller, Jens. “Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies.” Political Analysis 20.1 (2012): 25-46.

Imai, Kosuke, and Marc Ratkovic. “Covariate balancing propensity score.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.1 (2014): 243-263.

Kang, Joseph DY, and Joseph L. Schafer. “Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data.” Statistical science (2007): 523-539.

Qin, Jin, and Jerry Lawless. “Empirical likelihood and general estimating equations.” The Annals of Statistics (1994): 300-325.

Rosenbaum, Paul R., and Donald B. Rubin. “The central role of the propensity score in observational studies for causal effects.” Biometrika 70.1 (1983): 41-55.

Tibshirani, Robert. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

van der Laan, Mark J., and Polley, Eric C., and Hubbard, Alan E. “Super learner.” Statistical applications in genetics and molecular biology (2007): Vol 6.1

Eric Polley and Mark van der Laan (2014). SuperLearner: Super Learner Prediction. R package version 2.0-15.

Zhao, Qingyuan, and Daniel Percival. “Primal-dual Covariate Balance and Minimal Double Robustness via Entropy Balancing.” arXiv preprint arXiv:1501.03571 (2015).

