What on Earth is a Gaussian Process? | by Jacob Moore | Jan, 2021

[ad_1]


TL/DR

A Gaussian Process is a nonparametric model. By contrast, linear regression is a parametric model. Consider y = m*x + b. In this model, m and bare parameters, which say, “on average, scale x by m and shift by b to get y.” The parameters learned greatly condense the nature of the relationship between {x :y}. Nonparametric models, by contrast, don’t offer such a simple summarization of the {x:y} relationship, offering zero parameters (or perhaps infinite parameters, based on your viewpoint!) In a Gaussian Process, each training data point is its own dimension (as opposed to several observations from the same dimension.) However, if we say that every point is its own mean, with zero variance, our function will be excessively wiggly (overfit) and offer no inference for test data points. A kernel function is used, which largely answers the question, “How much should any given pair of data points communicate with one another?” In the Radial Basis Function (RBF) kernel, a given data point only appreciably communicates with its immediate neighbors and negligibly so with distant neighbors. This design allows the Gaussian process to answer the question, “For any test x, how similar should its test y be to its (training) X neighbors?” Two key insights: (A) The predicted y is a linear combination of training Y weighted by the distances of the corresponding X. And (B) These observed training points are the model parameters — which could theoretically be infinite (or at least have no fixed upper bound!) Two relationships are exploited: The {X_train, Y_train} relationship and the {X_train, X_test} relationship. By modeling the joint distribution of {X_train, X_test, Y_train, Y_test} given {X_train, X_test, Y_train} (using conditional Gaussian rules) we’re able to infer a distribution over “functions” which describe our beliefs about Y_test. In essence, every training data point is a parameter. Due to this, the training data cannot be discarded as it proves the (potentially infinite) parameters describing our beliefs about Y_test.

Intro

For the below, I’ll be exploring code from a blog written by Martin Krasser, which implements a Gaussian Process from scratch. I found the code to work as advertised but the explanations were not sufficiently “plain English” for someone unfamiliar with Gaussian Processes. This article will help make a Gaussian Process seem perfectly intuitive; by the end, the TL/DR above will seem crystal clear.

The Kernel

Below is the author’s implementation of the RBF kernel:

def kernel(X1, X2, l=1.0, sigma_f=1.0):
sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) -2 /
* np.dot(X1, X2.T)
return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)

In English, the meat of the function is: x_1^2 + x_2^2 — 2*x_1*x_2 which is (x_1 — x_2)^2. Then this result is scaled and exponentiated in a way that is near identical to the normal distribution PDF.

A Practical Example

import numpy as np
import random
import matplotlib.pyplot as plt
dims = 25
x = np.linspace(1,10,dims)
y = np.sin(x) +2
plt.plot(x,y)
credit: Me

Linear Combination of Training Points

cov = kernel(x.reshape(-1,1),x.reshape(-1,1),l=0.5,sigma_f=0.5)
gp1 = np.dot(cov,y)
plt.scatter(x,y)
for i in range(5):
samples = multivariate_normal(mean=gp1,cov=cov)
plt.plot(x,samples)
Credit: Me

In the above plot, we have the true x and true y scattered in blue, with a handful of line plots above. These line plots are samples from a multivariate normal distribution where the “mean function” (mu parameter in multivariate gaussian) is the dot product of weights (kernel output) with target (training y values) as discussed in the TL/DR, whereas the covariance is the (sigma parameter) is the kernel output, itself. We can see here that the general shape has been learned but these samples both need to be scaled and shifted as they currently do not match the training data points well at all! Nonetheless, we can see plainly that a Gaussian process is simply a linear combination of kernel weights and training points.

So why are we grossly inaccurate? We haven’t trained the model yet! Consider linear regression, if we simply randomly guessed the values of m and b, then plotted, would we be surprised by its inaccuracy? Not at all. The tricky part about Gaussian Processes is that the training data points are the model parameters, but that doesn’t imply that model training isn’t required.

Bayes Rule

posterior  = likelihood * prior    / evidence
P(theta|X) = P(X|theta) * P(theta) / P(X)
= P(X,theta) / P(X)
= joint / marginal
= conditional

In the above, I demonstrate that the familiar Bayes Rule, used in parametric inference, can be rearranged to produce the conditional distribution. Because our model parameters are the training data points, we want to ask the question, “What do we believe about {Y_test} given {X_test, X_train, Y_train}?”

X1, X2 = X_train, X_test
Y1, Y2 = Y_train, Y_test
posterior = joint / marginal
= P(Y2,Y1,X2,X1) / P(Y1,X2,X1)
= P(Y2|Y1,X2,X1)
= conditional

It’s well known that if a joint distribution is gaussian then a distribution of a subset of its variable (the marginal) is also gaussian. Thus we only need to assume that P(Y1,Y2,X1,X2) is gaussian in order to exploit gaussian conditioning rules, which give us the posterior distribution (which allows for training/prediction.)

Gaussian Conditioning

First, let’s consider the conditional mean formula. We take the mean of x_1 and we update it by some factor. This factor (read right to left in formula below) is:

  1. The specific distance of the observed x2 from its mean. This answers the question, “how surprising is the specific x2 in relationship to mu_2?
  2. An element of the precision matrix (inverse of covariance matrix), corresponding to (x2, x2), which answers the question, “How tightly is x2 clustered about mu_2 with respect to all other variables?” Of note, every element in the precision matrix is not simply 1/element, as matrix inversion is a more sophisticated process, considering every relationship simultaneously.
  3. An element of the covariance matrix, corresponding to (x1, x2), which answers the question, “(On average), how much do x1 and x2 move together?

Put all three of these together and we can articulate our expectations about x1 given an observed x2.

mu_1|2 = mu_1 + cov_[1,2] * inv(cov_[2,2]) * (x2 - mu_2)

Second, let’s consider the conditional covariance formula. We take the variance about x1 and subtract from it a factor (read right to left):

  1. The covariance of (x1, x2) (see previous section)
  2. The precision about (x2,x2) (see previous section)
  3. The covariance of (x1,x2) — again

Put elements 1 and 2 together and you effectively have a ratio, which says “What’s the relative strength of the relationship between (x1,x2) when we account for how tightly x2 is clustered about mu_2 (despite its relationship with all other variables?)” This is just a ratio; when we project it back onto the covariance (x1,x2) we scale down this covariance — answering the question, “What portion of the covariance remains after we consider how tightly x2 is clustered about mu_2?

the last step is simply removing this factor from the variance about x1. “How much does x1 vary when we remove the effect of x2 because its position is already known (aka conditioning)?

cov_[1|2] = cov_[1,1] - cov[1,2] * inv(cov_[2,2]) * cov_[1,2]

If you want to see the exhaustive proof on where these formulas were derived, help yourself. Be warned, the proof is tedious demonstration of the intricacies of linear algebra. Hence why I have tried to explain the effect of these formulas intuitively.

Kernel Conditioning

I introduced the idea earlier that the training data observations are the parameters of this model; weights produced by the kernel determine the specific linear combination of y_train values used to infer the y corresponding to any given x. Due to this relationship, the “mean function” is often to set to 0. (In other words, it’s unnecessary to add additional complexity to a model that’s already very flexible.) You’ll notice that in the below formula mu_1 and mu_2 have been replaced by 0.

mu_[2|1]  =  0 + K_[1,2] * inv(K_[1,1]) * (y1 - 0)
= K_[1,2] * inv(K_[1,1]) * y1
cov_[2|1] = K_[2,2] - K_[1,2] * inv(K_[1,1]) * K_[1,2]

Posterior Function in Python

from numpy.linalg import invdef posterior(X_test, X_train, Y_train, l=1.0, sigma_f=1.0):
K_train = kernel(X_train, X_train, l, sigma_f)
K_joint = kernel(X_train, X_test, l, sigma_f)
K_test = kernel(X_test, X_test, l, sigma_f) + 1e-8 *
np.eye(len(X_test))
K_train_inv = inv(K_train)

mu_joint = K_joint.T.dot(K_train_inv).dot(Y_train)
cov_joint = K_test - K_joint.T.dot(K_train_inv).dot(K_joint)

return mu_joint, cov_joint

To demonstrate an important concept, I will only supply the training data, with no test data, then sample from it.

x1 = x.reshape(-1,1)
mu_test, cov_test = posterior(x1,x1,y)
samples = np.random.multivariate_normal(mu_test, cov_test, 3)
plt.scatter(x1,y)
for i in range(len(samples)):
plt.plot(x1,samples[i])
Credit: Me

Notice that 3 different “functions” were sampled. (Function means a multivariate distribution where each dimension represents a sequence of unique observations from the same dimension.) But, the exact same points were sampled every time! How can this be?

Let’s look at just the first row of the covariance function.

cov_test[0,:].round(1)
>>>
array([-0., -0., -0., 0., 0., 0., 0., 0., -0., -0., -0., -0., -0., -0., -0., -0., -0., -0., 0., 0., 0., 0., 0., 0., 0.])

The variance is 0 for each element. This means that the mean will be sampled every timebecause there’s no variability! Likewise, let’s look at the mean function prior to taking the dot product with y1. We want to see what weights are being used in the (conditional kernel, Y_train) dot product:

def conditional_kernel(X_test, X_train, Y_train, l=1.0, 
sigma_f=1.0):
K_train = kernel(X_train, X_train, l, sigma_f)
K_joint = kernel(X_train, X_test, l, sigma_f)
K_test = kernel(X_test, X_test, l, sigma_f)
K_train_inv = inv(K_train)
return K_joint.T.dot(K_train_inv)
t = conditional_kernel(x1,x1,y1)
t[0,:].round(1)
>>>
array([ 1., -0., -0., 0., 0., -0., 0., -0., -0., -0., 0., -0., 0., 0., -0., -0., 0., -0., 0., -0., -0., 0., -0., 0., -0.])

Notice that in the weights observed above, the first element receives a weight of 1 and all other elements receive 0. In terms of linear combinations, the first observation in Y_train is predicted…as 100% of itself with 0 variance. Hence why this multivariate gaussian samples one and only one unique sequence.

So why has this happened? Well, we’ve asked the gaussian process to fit one dimension to each observation in the sequence — and it’s done exactly that! However, this doesn’t help us infer anything about test points between observations.

Below, we’ll take a look at a (nearly) proper Gaussian Process, which uses training observations as parameters for estimation of the test inputs.

dims = 10
x_train = np.linspace(1,10,dims).reshape(-1,1)
y_train = np.sin(x_train) +2
x_test = np.linspace(1,10,100).reshape(-1,1)
mu_test, cov_test = posterior(x_test,x_train,y_train)
samples = np.random.multivariate_normal(mu_test.transpose()[0],
cov_test, 5)
plt.scatter(x_train,y_train)
for i in range(len(samples)):
plt.plot(x_test,samples[i])
Credit: Me

Now we can see where the Gaussian Process is most (and least!) confident. There is more variability in the regions furthest between points as it is guessing what the specific linear combination of y_train values to assign to any given x_test value. In general, it’s very confident in its estimates!

Note, every function sampled, intersects all training data points. This is the effect of conditioning. The variance is 0 and the training data points predict themselves, exactly, every time. But in reality, we sometimes want to loosen this constraint to allow for smoother functions, which are less likely to overfit to potential outliers.

Predictions with Noise

Loosening the constraint that each training data point must be self predicted exactly is a simple change, we simply take the variance about the target variable and add it to all elements along the diagonal in the covariance function. This might seem like a strange step, but remember that on-diagonal elements in the covariance matrix reflect the variance of that element; off-diagonal elements reflect relationships between elements. If we want to loosen this constraint but not negatively affect the way the kernel assesses similarity between training data points, then we simply augment the on-diagonal elements.

def posterior(X_test, X_train, Y_train, l=1.0, sigma_f=1.0, 
sigma_y=1e-1):
K_train = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 *
np.eye(len(X_train))
K_joint = kernel(X_train, X_test, l, sigma_f)
K_test = kernel(X_test, X_test, l, sigma_f) + 1e-8 *
np.eye(len(X_test))
K_train_inv = inv(K_train)
mu_joint = K_joint.T.dot(K_train_inv).dot(Y_train)
cov_joint = K_test - K_joint.T.dot(K_train_inv).dot(K_joint)
return mu_joint, cov_joint

Now, let’s take a look at (improved) model performance:

mu_test, cov_test = posterior(x_test,x_train,y_train)
samples = np.random.multivariate_normal(mu_test.transpose()[0],
cov_test, 5)
plt.scatter(x_train,y_train)
for i in range(len(samples)):
plt.plot(x_test,samples[i])
Credit: Me

Dimensions of Dimensions

Extending this case to higher dimensions is actually pretty straightforward. We’ll still assign a linear combination of Y_train to any X_test input (and X_train input in the noisy model, as discussed previously.) When we observe that every X is defined in 2+ dimensions (say, height and weight), the kernel needs to evaluate similarity between observations in higher dimensions. Fortunately, this is no issue whatsoever for the kernel!

Hyperparameter Tuning

You might be curious why I never discussed hyperparameter tuning, namely sigma_f and l in the kernel function. The author of the blog does use an optimization method to arrive at the best hyperparameter values — and I encourage you to review how he has implemented this method.

I much prefer the Bayesian approach; rather than converging on point-estimates for optimal hyperparameters, I prefer to see the whole distribution over these parameters. Point-estimate methods may get stuck in local optima (as opposed to the global optima), but sampling methods generally explore the whole distribution and can somewhat mitigate this risk (exceptions exist for complex geometries, which puzzle almost any method.) My favorite Bayesian modeling library is PyMC3.

Final Thoughts

I plan to build out a Gaussian Process tutorial with PyMC3 in a future post — or simply expand this post to include a short demonstration of applied GP modeling.

If this post helped you understand Gaussian Processes better, you liked my writing style, or otherwise enjoyed the read — please don’t forget to subscribe 🙂

Read More …

[ad_2]


Write a comment