## Maximum likelihood with appropriate calibration goes a long way.

by Amr M. Alexandari & Avanti Shrikumar

In this tutorial, we will see how we can use a combination of model calibration and a simple iterative procedure to make our model predictions robust to shifts in the class proportions between when the model is trained and when it is deployed.

Say we build a classifier using data gathered in June to predict the probability that a patient has COVID19 based on the severity of their symptoms. At the time we trained our classifier, the rate of covid positivity in the community was comparatively low. Now it’s November, and the rate of covid positivity has increased considerably. Is it still a good idea to use our classifier from June to predict who has covid?

To understand why the classifier from June might underestimate the prevalence of covid, let’s imagine our classifier is built using a single variable capturing symptom severity that we will call the “disease score”. (This argument generalizes to classifiers built using many variables, as well as to classifiers with more than two output classes, but the intuition is easiest in the single-variable, two-class case.) Here is a toy visualization of what the ground-truth distribution of symptom severity might look like for positive and negative cases, in a situation where 10% of all tested cases in June were positive:

The red line shows the predicted fraction of positives for an ideal classifier that matches the ground truth. This ground-truth probability can be calculated by taking the height of the orange bar (positive cases) at a given disease score and dividing it by the combined height of the blue bar (negative cases) and the orange bar at that disease score.
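The bar-height calculation can be sketched numerically. This is a minimal illustration, not the code behind the original figure: it assumes the disease score is Gaussian within each class, and the means, standard deviation, and function names below are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density, written out by hand to avoid extra dependencies."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical toy setup: disease scores are Gaussian within each class.
NEG_MEAN, POS_MEAN, STD = 0.0, 2.0, 1.0
PRIOR_POS_JUNE = 0.10  # 10% of tested cases are positive, as in the text

def posterior_positive(score, prior_pos):
    """Orange bar height divided by the combined blue + orange height."""
    orange = prior_pos * gaussian_pdf(score, POS_MEAN, STD)          # positives
    blue = (1.0 - prior_pos) * gaussian_pdf(score, NEG_MEAN, STD)    # negatives
    return orange / (blue + orange)

scores = np.linspace(-3.0, 5.0, 9)
june_curve = posterior_positive(scores, PRIOR_POS_JUNE)
for s, p in zip(scores, june_curve):
    print(f"disease score {s:+.1f}: P(positive) = {p:.3f}")
```

At the midpoint between the two class means (a score of 1.0 in this toy setup) both classes are equally likely to produce the score, so the ground-truth probability there equals the class proportion of 10%.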

Now let us visualize the case where the proportion of positives among the tested population has risen to 70% in November (note: this is an extreme shift, and we show it solely for ease of visualization). The symptoms of covid have not changed between June and November, which means the overall shape of the blue distribution and of the orange distribution would stay the same. However, the height of the orange bars relative to the blue bars would increase to reflect the greater proportion of positives, giving:
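As a numerical counterpart to this picture, the sketch below keeps the per-class score distributions fixed and changes only the proportion of positives from 10% to 70%. It is a hedged illustration under the same hypothetical assumption as before (Gaussian disease scores within each class, with made-up means and standard deviation), not the code used for the original figures.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density, written out by hand to avoid extra dependencies."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical class-conditional distributions; identical in June and November.
NEG_MEAN, POS_MEAN, STD = 0.0, 2.0, 1.0

def true_posterior(score, prior_pos):
    """Ground-truth P(positive | score) for a given proportion of positives."""
    orange = prior_pos * gaussian_pdf(score, POS_MEAN, STD)          # positives
    blue = (1.0 - prior_pos) * gaussian_pdf(score, NEG_MEAN, STD)    # negatives
    return orange / (blue + orange)

scores = np.linspace(-1.0, 3.0, 5)
june = true_posterior(scores, 0.10)      # what a classifier trained in June learns
november = true_posterior(scores, 0.70)  # what is actually true in November

for s, j, n in zip(scores, june, november):
    print(f"score {s:+.1f}: June {j:.3f}  November {n:.3f}")
```

For every score, the November probability is higher than the June probability, even though the symptoms associated with each class are unchanged. Only the class proportions differ.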

The solid red line shows the predictions from the classifier trained in June, while the dashed red line shows the true fraction of positives for data gathered in November. As we can see, the ideal classifier for data gathered in June underestimates the probability that a patient has COVID19 when the classifier is deployed in November. This phenomenon is called label shift or prior probability shift. If we knew the labels for the testing data (here, the data from November), we could train a new classifier to output the class probabilities. In practice, however, we don’t know those labels: all we observe is the overall distribution of symptom severity for the covid-positive and covid-negative patients combined, which looks something like this:

So, how can we get an updated classifier?

It turns out there is a simple way to adapt our classifier to account for the shift in class proportions. To see how this method works, let us introduce some terminology. The dataset that we train on is called the “source domain”, while the dataset we deploy the model on is called the “target domain”. We will use the following notation:

• y denotes the class, which can be one of “positive for covid” or “negative for covid”
• p(y) denotes the proportion of patients belonging to class y in the source domain (i.e. in June)
• q(y) denotes the proportion of patients belonging to class y in the target domain (i.e. in November)
• p(y|x) denotes the conditional probability in the source domain (i.e. in June) that the patient belongs to class y given symptoms that look like x
• q(y|x) denotes the conditional probability in the target domain (i.e. in November) that the patient belongs to class y given symptoms that look like x
• p(x|y) denotes the conditional probability in the source domain (i.e. in June) of observing certain symptoms, given that you know the patient belongs to class y
• q(x|y) denotes the conditional probability in the target domain (i.e. in November) of observing certain symptoms, given that you know the patient belongs to class y

Let’s begin by taking stock of which of these quantities we have information about. We can estimate p(y) by simply calculating the class proportions in our data from the source domain. We also have an estimate of p(y|x) from building our classifier on the source domain (remember, the classifier was trained to estimate the probability that a person belongs to a particular class, in this case covid vs. no covid, given their symptoms).

What about p(x|y)? While it is technically possible to build an estimate of p(x|y) using the source-domain data, in practice this can be very hard when x is high-dimensional. Thus, we will not assume that we have access to p(x|y). However, if we assume that the symptoms of covid do not change between June and November, we can assume that p(x|y) = q(x|y). To see why this is useful, let us consider how we could update our classifier if we were given a guess for the value of q(y). From Bayes’ rule, we have: