Advanced Options with Hyperopt for Tuning Hyperparameters in Neural Networks | by Nicholas Lewis | Jan, 2021
If you’re anything like me, you spent the first several months looking at applications of machine learning and wondering how to get better performance out of the model. I would spend hours, if not days, making minor tweaks to the model, hoping for better performance. Surely, I thought, there should be a better way to improve the model than manually checking dozens of combinations of hyperparameters.
That’s when I came across this excellent article on the Python package Hyperopt, which uses a Bayesian optimization model to determine the optimal hyperparameters for a machine learning model. Gone are the days of random guesswork and time-consuming trial and error when trying to fit a model to data! Using Bayesian optimization to tune your model also has the advantage that, while it’s important to understand what the hyperparameters are controlling, you don’t have to be the expert in how a model works in order to use it. Rather, you can try out a completely new type of model that you’ve never used before, and with some simple code, you can get some great results.
Before we dive into some examples, I do want to mention a few other packages that accomplish similar objectives but with different methods. Scikit-learn provides the RandomizedSearchCV and GridSearchCV classes, which are used to search across a space of hyperparameters. If you’re using Keras, Hyperas also provides a nice wrapper for hyperparameter optimization (however, you might run into some limitations if you’re trying to do some of these more advanced applications). The grid search and random search options are much more exhaustive, and thus more time-consuming, but the advantage is you’re not running the risk of getting stuck in a local minimum, which is a possibility when using the Bayesian approach. The disadvantage is that your search will likely spend a lot more time trying out ineffective hyperparameters.
For our data, we’ll generate some First Order Plus Dead Time (FOPDT) model data. FOPDT models are powerful and straightforward models that are often used in industry for preliminary results. They are a way of describing what happens in response to a changing stimulus. For example, we can model how the speed of a car changes based on how much you press the gas pedal.
τₚ · dy(t)/dt = −y(t) + Kₚ · u(t − θₚ)
In the above equation, y(t) is the output variable, u(t) is the input variable, and Kₚ, τₚ, and θₚ are process constants that determine the behavior of the output relative to the input. To put this in concrete terms, think of the gas pedal on your car. You can press the pedal down a given amount (input variable, or u(t), is % the pedal is pushed), and the speed of the car (output variable, or y(t)) will increase accordingly. Kₚ is the gain, and describes how much the speed changes compared to how much you press the pedal; τₚ is the time constant, and indicates how quickly the speed will increase (commonly reported as the acceleration of the car); θₚ is the dead time, and accounts for any delay between pressing the gas pedal and the speed actually starting to change.
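To make the three constants concrete, here is a small sketch of the analytic response of an FOPDT process to a single step in the input: the output stays flat until the dead time has passed, then rises as Kₚ · Δu · (1 − exp(−(t − θₚ)/τₚ)). The function name and the 10-second dead time are assumptions chosen for illustration; the gain and time constant match the values used later in this article.

```python
import math

def fopdt_step_response(t, Kp, taup, thetap, du):
    """Change in output t seconds after a step of size du in the input."""
    if t < thetap:
        # Nothing happens until the dead time has elapsed
        return 0.0
    return Kp * du * (1.0 - math.exp(-(t - thetap) / taup))

# Gain and time constant as used later in the article; dead time is assumed
Kp, taup, thetap = 0.67, 160.0, 10.0
du = 50.0  # pedal goes from 0% to 50%

for t in (5.0, thetap + taup, thetap + 5 * taup):
    print(t, round(fopdt_step_response(t, Kp, taup, thetap, du), 2))
```

One time constant after the dead time, the output has covered about 63% of its final change Kₚ · Δu, and after five time constants it has essentially settled.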
Simulating an FOPDT model in Python is actually quite straightforward. We start with a function that rearranges the FOPDT equation to solve for the derivative dy/dt:
def fopdt(y,t,um,Km,taum):
    # arguments
    #  y    = output
    #  t    = time
    #  um   = input value over the current time step
    #  Km   = model gain
    #  taum = model time constant
    # yp0 and u0 (initial output and input) are read from module-level variables
    # calculate derivative
    dydt = (-(y-yp0) + Km * (um-u0))/taum
    return dydt
Once we have this, we can create another function that will simulate the first order response to whatever inputs we pass. The crucial part of this is using odeint from scipy (it lives in scipy.integrate).
import numpy as np
from scipy.integrate import odeint

def sim_model(Km,taum):
    # array for model values
    ym = np.zeros(ns)
    # initial condition
    ym[0] = yp0
    # loop through time steps
    for i in range(0,ns-1):
        ts = [t[i],t[i+1]]
        y1 = odeint(fopdt,ym[i],ts,args=(u[i],Km,taum))
        # y1 has shape (2, 1); keep the value at the end of the step
        ym[i+1] = y1[-1,0]
    return ym
With the functions in place, we can set up our simulation. We start with specifying how many data points we need, as well as the model parameters (feel free to change these around to see a different model response — maybe you want to simulate a race car by simulating a fast acceleration, or a higher maximum speed by increasing the gain, or Kₚ).
# Parameters and time for FOPDT model
ns = 10000
t = np.linspace(0,ns-1,ns)
u = np.zeros(ns)

# Additional FOPDT parameters
yp0 = 0.0
u0 = u[0]
Km = 0.67
taum = 160.0
We’ll generate some step data now, which I chose to just generate by randomly changing the input value (u, or gas pedal %) between 0 and 100, and keeping it at that level for a period of time between 5 and 15 minutes. This allows us to see a good variety of steady state and transient data.
import random

# Generate step data for u
end = 60 # leave 1st minute of u as 0
while end <= ns:
    start = end
    # keep each new pedal value for anywhere from 5 to 15 minutes
    end += random.randint(300,900)
    u[start:end] = random.randint(0,100)
Now, we can simply call our sim_model function from earlier, and we’ll have the first order response to the input data we just generated.
# Simulate FOPDT model
y = sim_model(Km,taum)
Now, this data will look really pretty, which doesn’t reflect reality. There’s always going to be some amount of noise in our sensors, so to get some more real-looking data, we’ll generate some artificial noise and add it to our simulated data. You can also change the amount of noise in your own code and see how it changes the final outcome.
# Add Gaussian noise
noise = np.random.normal(0,0.2,ns)
y += noise
Finally, we’ll go ahead and put the data all together, and also do a bit of preprocessing in anticipation of our neural network model. At this point, it’s really straightforward data, but our model will perform better when the data is scaled. This is simple enough thanks to the MinMaxScaler from scikit-learn:
# Scale data
data = np.vstack((u,y)).T
s = MinMaxScaler(feature_range=(0,1))
data_s = s.fit_transform(data)
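With feature_range=(0,1), this kind of scaling maps each column to (x − min)/(max − min). The sketch below reproduces that arithmetic with NumPy and shows how to map a single scaled column (such as the speed) back to its original units, which is handy when plotting model output later. The helper names and the toy numbers are made up for illustration, and the code assumes no column is constant (which would cause a division by zero).

```python
import numpy as np

def min_max_scale(data):
    """Scale each column of a 2-D array to [0, 1]; also return column min and range."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    return (data - col_min) / col_range, col_min, col_range

def unscale_column(values, col_min, col_range, col):
    """Map scaled values of one column back to original units."""
    return values * col_range[col] + col_min[col]

# Toy data: column 0 is pedal %, column 1 is speed (made-up numbers)
data = np.array([[0.0, 0.0], [50.0, 20.0], [100.0, 60.0]])
data_s, col_min, col_range = min_max_scale(data)
speed = unscale_column(data_s[:, 1], col_min, col_range, col=1)
print(data_s)
print(speed)
```

The scaler’s own inverse_transform expects an array with all of the original columns, so a single-column helper like this can be convenient when the model only outputs the speed.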
Now we have some FOPDT model data. Let’s take a quick look at it to see what it looks like, and then we’ll discuss what machine learning application we want to do with it.
The gray indicates the data that we’ll set aside for final testing. The orange line (pedal %) is the input, which we called u in the code. The blue line (speed, with the artificially added noise) is the process variable (PV) or output data, which we represented with y. So as you can see, as we press the gas pedal down more, the speed gradually goes up until it reaches a steady state, and as we take the foot off the gas, the speed decreases.
So now, what do we want to do with this data? We want to create a machine learning model that simulates similar behavior, and then use Hyperopt to get the best hyperparameters. If you look at my series on emulating PID controllers with an LSTM neural network, you’ll see that LSTMs worked really well with this type of problem. What we want to do is train an LSTM model that would follow this same type of FOPDT model behavior.
Keras is an excellent platform for constructing neural networks. We could keep this really basic, and do something like the following:
# Keras LSTM model
model = Sequential()
model.add(LSTM(units = 50,
               input_shape = (Xtrain.shape[1],Xtrain.shape[2])))
model.add(Dropout(rate = 0.1))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')

# Stop training once the validation loss stops improving
es = EarlyStopping(monitor='val_loss',mode='min',verbose=1,patience=15)
result = model.fit(Xtrain, ytrain, verbose=0, validation_split=0.1,
                   callbacks=[es],
                   batch_size=100,
                   epochs=200)
In the above code, we start with the LSTM layer and specify the units hyperparameter, as well as the input shape. We add a Dropout layer, which is helpful to avoid overfitting the data, and then a Dense layer, which is needed for the model to output the results. Then we compile the model with the adam optimizer and mean_squared_error loss metric, add a line of code to stop training the model once the loss plateaus, and then fit the model. Simple enough…until we look at how it actually does:
The plot above shows the real speed and gas pedal from our simulation, then also the predicted and forecasted speeds. Predicted speed makes predictions based on feeding the real measurements into the model, while the forecasted speed takes the previous predicted speeds and feeds them into the model — hence the cause for so much drift and the inability to recover from the drift. As you can see, this model is hardly satisfactory. The good news is that we have some options to improve it. We could generate more data or let it train for more epochs (or fewer, depending on if we overfit or underfit). It’s certainly worth checking those. But the other option is to adjust the hyperparameters, either by trial and error, a deeper understanding of the model structure…or the Hyperopt package.
This article isn’t meant as an introduction to Hyperopt; rather, it aims to expand what you can do with it. Looking at the Keras block of code above, there are several hyperparameters we could pick out to optimize, such as units in the LSTM layer, rate in the Dropout layer, and batch_size when we’re fitting. Finding optimal values of these would be covered in an introductory Hyperopt tutorial. However, we may find it useful to add some extra LSTM and Dropout layers, or even look for a better window of datapoints to feed into the LSTM. We may even find it beneficial to change the objective function that we’re trying to minimize.
Now, using Hyperopt is very beneficial to the beginner, but it does help to have some idea of what each hyperparameter is used for and a good range. We start by defining the range of values we want to search over. Again, the syntax of this step is covered in any introductory Hyperopt tutorial, so my purpose is to show a few nuances. Here’s the code:
from hyperopt import hp
from hyperopt.pyll.base import scope

# quniform returns float, some parameters require int; use scope.int to force int
space = {'rate'       : hp.uniform('rate',0.01,0.5),
         'units'      : scope.int(hp.quniform('units',10,100,5)),
         'batch_size' : scope.int(hp.quniform('batch_size',100,250,25)),
         'layers'     : scope.int(hp.quniform('layers',1,6,1)),
         'window'     : scope.int(hp.quniform('window',10,50,5))
         }
Most of the time, you can just use the regular options such as hp.uniform, hp.choice, hp.loguniform, etc. However, the hp.quniform option returns a float, even when the value is a whole number like 1.0 or 5.0. Some Keras hyperparameters require an integer type, so we force Hyperopt to return an integer by wrapping the expression in scope.int.
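To see why the cast matters, here is a pure-Python imitation of a quantized uniform draw, round(uniform(low, high) / q) * q. It is only a stand-in written for illustration, not Hyperopt code: the function name is made up, and the explicit float() mimics the float result described above.

```python
import random

def quniform_like(low, high, q, rng):
    """Imitation of a quantized uniform draw; returns a float such as 45.0."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)
draws = [quniform_like(10, 100, 5, rng) for _ in range(5)]
units = [int(d) for d in draws]  # the kind of cast that scope.int applies

# Code that needs an integer rejects the raw float
try:
    range(draws[0])
except TypeError as err:
    print("float rejected:", err)

print(draws)
print(units)
```

The same issue would show up with the layers parameter, which is later passed to range() when the model is built.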
To construct our model, we put everything inside a function, with the possible parameters as the argument:
def f_nn(params):
    # Generate data with given window
    Xtrain, ytrain, Xtest, ytest = format_data(window=params['window'])

    # Keras LSTM model
    model = Sequential()

    if params['layers'] == 1:
        model.add(LSTM(units=params['units'],
                       input_shape=(Xtrain.shape[1],Xtrain.shape[2])))
        model.add(Dropout(rate=params['rate']))
    else:
        # First layer specifies input_shape and returns sequences
        model.add(LSTM(units=params['units'],
                       return_sequences=True,
                       input_shape=(Xtrain.shape[1],Xtrain.shape[2])))
        model.add(Dropout(rate=params['rate']))

        # Middle layers return sequences
        for i in range(params['layers']-2):
            model.add(LSTM(units=params['units'],
                           return_sequences=True))
            model.add(Dropout(rate=params['rate']))

        # Last LSTM layer returns only its final output, not the full sequence
        model.add(LSTM(units=params['units']))
        model.add(Dropout(rate=params['rate']))

    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')

    # Stop training once the validation loss stops improving
    es = EarlyStopping(monitor='val_loss',mode='min',
                       verbose=1,patience=15)

    result = model.fit(Xtrain, ytrain,
                       verbose=0,
                       validation_split=0.1,
                       callbacks=[es],
                       batch_size=params['batch_size'],
                       epochs=200)

    # Get the lowest validation loss of the training epochs
    validation_loss = np.amin(result.history['val_loss'])
    print('Best validation loss of epoch:', validation_loss)

    return {'loss': validation_loss,
            'status': STATUS_OK,
            'model': model,
            'params': params}
You’ll notice our first line in the function formats the data into Xtrain, ytrain, Xtest, and ytest. This formats our X and y data into the format required by the LSTM, and importantly adjusts the window of input points, based on the range we specified earlier for window. Then we start our Keras model. It has the same elements as before, but you’ll notice that rather than specify a numeric value for a hyperparameter such as rate, we allow it to be in the range we specified by setting it as rate = params['rate']. We also add some logic to allow for multiple LSTM and Dropout layers. Finally, we compile and fit the model just as before, and then we need an objective function to minimize. To start, we just take the validation loss and use that as our objective function, which will suffice the majority of the time (we’ll explore an instance where we might want something else in just a minute). The last step is to return information we might want to use later in the code, such as the loss of our objective function, the Keras model, and the hyperparameter values.
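The format_data function itself isn’t shown in this article, so here is one way it might look. This is a hypothetical sketch, not the original helper: it takes the scaled data as an explicit argument (the version called above only takes window), assumes the columns are [u, y], and assumes an 80/20 chronological split. Each sample holds the previous window rows, and the target is the next value of y.

```python
import numpy as np

def format_data(data_s, window, train_frac=0.8):
    """Hypothetical helper: build (samples, window, features) arrays for an LSTM."""
    X, y = [], []
    for i in range(window, len(data_s)):
        X.append(data_s[i - window:i, :])  # rows i-window .. i-1, both columns
        y.append(data_s[i, 1])             # y at row i is the target
    X, y = np.array(X), np.array(y)

    # Chronological split: no shuffling, so the test set is the tail of the series
    split = int(train_frac * len(X))
    return X[:split], y[:split], X[split:], y[split:]

# Toy check with made-up data: 100 rows, 2 columns
toy = np.column_stack((np.linspace(0, 1, 100), np.linspace(1, 0, 100)))
Xtrain, ytrain, Xtest, ytest = format_data(toy, window=10)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
```

With this layout, the y column of sample i is the same as the previous window targets, which is the alignment a recursive forecast relies on when it substitutes its own predictions for measured values.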
To run the actual optimization, be prepared for some long run times. Training an LSTM always takes a bit of time, and what we’re doing is training it several times with different hyperparameter sets. This next part took about 12 hours to run on my personal computer. You can speed up the process significantly by using Google Colab’s GPU resources.
The actual code you need is straightforward. We set the trials variable so that we can retrieve the data from the optimization, and then use the fmin() function to actually run the optimization. We pass the f_nn function we defined earlier, the space containing the range of hyperparameter values, define the algo as tpe.suggest, and specify max_evals as the number of sets we want to try. With more trials, we’re more likely to get the optimal solution, but the downside is a longer wait. For something like a classifier that trains quickly, it’s easy to get several hundred evaluations within a few seconds, but with the LSTM, the 50 evaluations specified here take several hours.
trials = Trials()
best = fmin(f_nn,
space,
algo=tpe.suggest,
max_evals=50,
trials=trials)
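For a sense of why 50 guided evaluations are attractive, the rough calculation below counts how many combinations an exhaustive grid over the discrete hyperparameters in the search space would need. It ignores the continuous rate parameter entirely and assumes every evaluation takes the same time, using the roughly 12 hours for 50 evaluations quoted earlier; both are simplifying assumptions.

```python
import math

# Discrete hyperparameters from the search space: (low, high, step)
grid = {
    'units': (10, 100, 5),
    'batch_size': (100, 250, 25),
    'layers': (1, 6, 1),
    'window': (10, 50, 5),
}

def n_values(low, high, step):
    """Number of grid points from low to high inclusive."""
    return (high - low) // step + 1

combos = math.prod(n_values(*bounds) for bounds in grid.values())
hours_per_eval = 12 / 50  # assumed average, from ~12 hours for 50 evaluations
print(combos, "combinations, roughly", round(combos * hours_per_eval / 24), "days")
```

Even before adding any values for rate, a full grid is thousands of trainings, which is why sampling a limited number of promising points is the practical option here.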
After you’ve let that run, you can take a look at some of the results. I use a bit of list comprehension to access the data stored in trials. We can see everything that we returned in our original f_nn function, including loss, model, and params. Our best model and set of parameters will be associated with the lowest loss, while the worst model and parameter set will have the highest loss. Let’s go ahead and save those as variables so we can plot the results.
# Collect the loss from every trial once, then index the best and worst
losses = [r['loss'] for r in trials.results]
best_trial = trials.results[np.argmin(losses)]
worst_trial = trials.results[np.argmax(losses)]

best_model, best_params = best_trial['model'], best_trial['params']
worst_model, worst_params = worst_trial['model'], worst_trial['params']
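If you want to experiment with this kind of lookup without waiting for a full optimization run, a mock list of result dictionaries works just as well. The sketch below uses made-up losses and parameters in the same shape that f_nn returns (minus the model), and also ranks every trial, which can hint at which settings the search favored.

```python
# Mock stand-in for trials.results: one dict per evaluation (made-up values)
results = [
    {'loss': 0.031, 'status': 'ok', 'params': {'units': 40, 'layers': 2}},
    {'loss': 0.012, 'status': 'ok', 'params': {'units': 75, 'layers': 3}},
    {'loss': 0.054, 'status': 'ok', 'params': {'units': 10, 'layers': 1}},
]

best_trial = min(results, key=lambda r: r['loss'])
worst_trial = max(results, key=lambda r: r['loss'])

# Rank all trials, best first
ranked = sorted(results, key=lambda r: r['loss'])
for rank, r in enumerate(ranked, start=1):
    print(rank, r['loss'], r['params'])
```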
Now that we’ve run the optimization and saved the model (and for good measure the set of hyperparameters), it’s time to see how the model looks. We’ll look at two different approaches. The first approach involves taking the previous window of actual input data points (pedal %) and using that to predict the next output (speed). We’ll call this the “prediction.” This is quite simply found by taking our test data and applying the model.predict() function. It looks like this:
# Best window
best_window = best_params['window']

# Format data
Xtrain, ytrain, Xtest, ytest = format_data(window=best_window)

Yp = best_model.predict(Xtest)
We also want to look at one other aspect, though. Let’s say we’re trying to forecast where the speed will go without having the moment-to-moment feedback. Rather than taking the actual data values, we use the LSTM prediction to make the next prediction. This is a much harder problem, since if the LSTM prediction is only slightly off, the error can be compounded over time. We’ll call this method the “forecast,” indicating that we’re using the LSTM predictions to update the input values and forecast for a time range. I put this into a function forecast():
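def forecast(Xtest,ytest,model,window):
    Yf = ytest.copy()
    for i in range(len(Yf)):
        if i < window:
            # not enough forecasted points yet; keep the measured values
            pass
        else:
            Xu = Xtest[i,:,0]      # actual pedal inputs for this window
            Xy = Yf[i-window:i]    # previously forecasted speeds
            Xf = np.vstack((Xu,Xy)).T
            Xf = np.reshape(Xf, (1, Xf.shape[0], Xf.shape[1]))
            # predict returns shape (1, 1); take the scalar
            Yf[i] = model.predict(Xf)[0,0]
    return Yf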
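As a side note on why the forecast is the harder problem, the toy example below swaps the LSTM for a deliberately simple stand-in: a discrete first-order system whose “learned” coefficient is off by 0.01. All numbers are made up for illustration. The one-step prediction is corrected by the measured value at every step, while the forecast feeds on its own output, so the same small model error accumulates.

```python
# Toy first-order system: y[k+1] = a*y[k] + b*u[k]  (made-up coefficients)
a_true, b = 0.95, 0.05
a_model = 0.96        # the "learned" model is slightly off
n_steps = 200
u = [1.0] * n_steps   # constant input

# Ground truth
y = [0.0]
for k in range(n_steps - 1):
    y.append(a_true * y[k] + b * u[k])

# Prediction: each step starts from the measured y[k]
y_pred = [0.0] + [a_model * y[k] + b * u[k] for k in range(n_steps - 1)]

# Forecast: each step starts from the model's own previous output
y_fcst = [0.0]
for k in range(n_steps - 1):
    y_fcst.append(a_model * y_fcst[k] + b * u[k])

pred_err = max(abs(p - t) for p, t in zip(y_pred, y))
fcst_err = max(abs(f - t) for f, t in zip(y_fcst, y))
print(f"max prediction error: {pred_err:.3f}, max forecast error: {fcst_err:.3f}")
```

In this toy case the forecast error ends up more than an order of magnitude larger than the prediction error, even though both come from the same model.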
Let’s see what this looks like:
OK, it’s actually not that bad, considering that this is the worst model and hyperparameter set. The best set we came up with after 50 evaluations looks like this:
The “prediction” is pretty close, and the “forecast” is much better. We could probably get even better results from the prediction if we let it try more optimizations. But can we do better with the forecast?
This is where the objective function comes in. You can get quite clever with the objective function that you’re minimizing to account for all types of situations where you want to see better results. For example, what if we also want to account for how much time the model takes to train? We could change our loss score to include an element of time, perhaps by multiplying it by the training time, so that a fast training time is rewarded.
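A minimal sketch of that idea follows. Everything here is an assumption made for illustration: timed_objective and fake_train are made-up names, the training step is replaced by a short sleep, and the string 'ok' is written out directly where the real code would use STATUS_OK.

```python
import time

def timed_objective(train_fn, params):
    """Wrap a training function so that slow configurations get a larger loss.

    train_fn is a hypothetical callable that trains a model with `params`
    and returns its validation loss.
    """
    start = time.perf_counter()
    val_loss = train_fn(params)
    train_seconds = time.perf_counter() - start

    return {'loss': val_loss * train_seconds,  # accuracy and speed combined
            'status': 'ok',                    # the value behind STATUS_OK
            'val_loss': val_loss,
            'train_seconds': train_seconds}

def fake_train(params):
    """Stand-in for model fitting: sleeps briefly and returns a fixed loss."""
    time.sleep(0.01 * params['layers'])
    return 0.5

result = timed_objective(fake_train, {'layers': 2})
print(result)
```

A plain product makes the score depend on the units of time, so a weighted sum or a cap on the time penalty may behave better in practice; which one suits your problem is a judgment call.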
In our case, we want to reward the model that has a good forecast. This is actually quite simple, given that we already have the forecast() function. After setting up our f_nn() function as before, we can add a few more lines to change our objective function. Recall that previously, we simply set our loss to the validation_loss value. Now, we actually run the forecast inside our f_nn function like so:
# Get validation set
val_length = int(0.2*len(ytest))
Xval, yval = Xtrain[-val_length:], ytrain[-val_length:]

# Evaluate forecast (forecast() as defined above returns only the forecasted values)
Yf = forecast(Xval,yval,model,params['window'])
mse = np.mean((yval - Yf)**2)
Note that we have to separate out our validation set; we can’t use our test set, since that would bias the results. Then we simply calculate the mean squared error (mse) between the forecast and actual values. This is then saved as our loss in the return line like so:
return {'loss': mse,
'status': STATUS_OK,
'model': model,
'params': params}
That’s all there is to it. After running this again, we get the following results:
We could also look at the worst results, just to prove that Hyperopt is actually doing something for us:
Using the best model looks fantastic! Like we discussed, forecasting is extremely sensitive to small errors, so given the time range that we’re forecasting over, this looks really impressive. We’d expect to see the drift become a bit more pronounced over a longer time frame, but it’s a major improvement using the updated objective function. We could also run the optimization for more than 50 evaluations, and we’d possibly come up with something better.
These are just a few examples of how you can utilize Hyperopt to get increased performance from your machine learning model. While the exact methods used here might not be used in your particular situation, I hope that some ideas were sparked and that you can see some more potential uses for Hyperopt. I’ve included the code from my different simulations on my Github repo.
How else could you take this further? You could easily add a time element to your objective function if you want to find the most time-efficient and accurate model. The principles here are easily applied to any other machine learning model with hyperparameters, and you might find it much faster to use this on classifiers. Finally, you could speed up the process significantly by using a GPU; if your computer doesn’t have one, Google Colab is a great resource. Let me know your thoughts, and feel free to connect with me on LinkedIn.