# Understanding and applying results of bayesopt

Sebastian on 15 Sep 2019
Commented: Don Mathis on 20 Sep 2019
Hi,
I have some difficulties understanding the Matlab documentation of the bayesopt function.
For example, the bestPoint function offers a couple of "best points" of a Bayesian optimization result. Which one should be used in order to get the best out-of-sample predictive accuracy?
Let's say I let bayesopt find the "best" hyperparameters for a regression tree ensemble (by using fitrensemble directly rather than calling bayesopt myself) and obtain the following result graphs: What, if anything, do the two graphs tell me about the "best point", convergence, predictive accuracy, etc. (in general, and for this example in particular)? Are there any sources that explain these concepts, at least at a higher level, so that I can make better use of bayesopt?

Don Mathis on 16 Sep 2019
Edited: Don Mathis on 16 Sep 2019
It looks like no new minima are being found, and that the model of the objective function is stabilizing, but it's not a good model. The model has minima that are negative. A negative value for log(1+Loss) implies that Loss<0, which is impossible for MSE loss.
I've seen this happen when there is a steep "cliff" in the objective function (over hyperparameter space). The Gaussian Process model of that function smooths out the cliff and thereby undershoots the true function (and zero) at the base of the cliff. In fact, the reason the objective function for regression fit functions is defined as log(1+Loss) instead of Loss is to shrink such cliffs and so reduce the chance of the model undershooting like this.
To diagnose this, you could look at the values of the objective function that are being found, to see if they differ by orders of magnitude.
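A minimal sketch of that check, assuming the objective values are log(1+Loss) as described above. The numbers below are illustrative stand-ins, not the poster's data; with a real model you would read them from `model.HyperparameterOptimizationResults.ObjectiveTrace`.

```matlab
% Hypothetical stand-in for results.ObjectiveTrace (values of log(1+Loss)).
% Illustrative values only; replace with your own trace.
objTrace = log1p([0.002 0.02 0.021 0.2 1.3]);

% Undo the log(1+Loss) transform to recover the raw loss values
loss = expm1(objTrace);

% How many orders of magnitude separate the worst and best loss?
ordersOfMagnitude = log10(max(loss) / min(loss));
fprintf('Loss spans %.1f orders of magnitude\n', ordersOfMagnitude);
```

A spread of several orders of magnitude would be consistent with the "cliff" scenario described above.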
Regarding bestPoint, since the model is not giving a reasonable estimate of the minimum of the objective function, it would probably be better to trust the minimum observed point, and use the 'min-observed' criterion.
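As a rough sketch of how the criteria differ, the snippet below runs bayesopt on a toy one-variable objective (an assumption made purely for illustration, not the ensemble from the question) and compares the 'min-observed' point with the default criterion. For a model returned by fitrensemble, the same call should work on `model.HyperparameterOptimizationResults`.

```matlab
% Toy objective, assumed for illustration only: log(1+loss), minimum at x = 1
x   = optimizableVariable('x',[-5 5]);
fun = @(t) log1p((t.x - 1).^2);

rng default   % for reproducibility
results = bayesopt(fun,x,'MaxObjectiveEvaluations',15, ...
    'Verbose',0,'PlotFcn',[]);

% Best point that was actually evaluated (does not rely on the GP model)
[xObs,valObs] = bestPoint(results,'Criterion','min-observed');

% Default criterion, which relies on the GP model's confidence bounds
[xModel,valModel] = bestPoint(results);

disp(xObs),   disp(valObs)
disp(xModel), disp(valModel)
```

If the two points differ noticeably and the model is known to be poor near the minimum, the observed point is the safer choice.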

#### 1 Comment

Sebastian on 19 Sep 2019
Thanks, Don. Would it make sense using bayesopt directly in such circumstances where the Gaussian Process model fails to find the best point, trying out some Name-Value pairs that are not available when using Bayesian optimization indirectly through, e.g., fitrensemble?
If so, do you have some tips on which combinations of parameters I could try (NumSeedPoints=20, AcquisitionFunctionName='expected-improvement-plus' etc.)? Does it make any sense not setting 'IsObjectiveDeterministic' to true when my objective function is deterministic?

Don Mathis on 19 Sep 2019
Edited: Don Mathis on 19 Sep 2019
You already have access to some of those options (e.g., AcquisitionFunctionName) through fitrensemble, via the 'HyperparameterOptimizationOptions' name-value pair. https://www.mathworks.com/help/stats/fitrensemble.html?searchHighlight=fitrensemble&s_tid=doc_srchtitle#d117e406156
I wouldn't expect increasing NumSeedPoints to help. Changing the acquisition function may help, but it's hard to know without more information. You can also try a grid search or random search via the 'Optimizer' field of the struct you pass to 'HyperparameterOptimizationOptions'. Neither of those methods would use a model at all, so it would at least be interesting to see what you get in that case. I would also look at the command-line output and all the plots to see what it's exploring. You can view all the plots by calling plot(model.HyperparameterOptimizationResults,'all') on your final model.
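A minimal sketch of the random-search variant, assuming the `carsmall` sample data set that ships with the toolbox as a stand-in for the real data, and a deliberately small evaluation budget:

```matlab
% Stand-in data (assumption): carsmall sample data set
load carsmall
X = [Horsepower Weight];
Y = MPG;

% Use random search instead of the GP-based Bayesian optimizer
opts = struct('Optimizer','randomsearch', ...
              'MaxObjectiveEvaluations',10, ...
              'ShowPlots',false,'Verbose',0);

rng default   % for reproducibility
mdlRandom = fitrensemble(X,Y,'OptimizeHyperparameters','auto', ...
    'HyperparameterOptimizationOptions',opts);

% For grid or random search the results come back as a table,
% not a BayesianOptimization object
searchResults = mdlRandom.HyperparameterOptimizationResults;
disp(searchResults)
```

Because no model is fitted in this case, plot(...,'all') does not apply here; it is only meaningful for results from the default 'bayesopt' optimizer.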
In any case, if it is a "cliff" scenario, it's likely that it has found the best point(s) at the bottom of the cliff, even if it expects the loss to be less than zero there.
Also: IsObjectiveDeterministic is already set to false for fitrensemble, and yes it can help even if the function is actually deterministic.
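For anyone who does want to call bayesopt directly with the options mentioned in the question, a hedged sketch could look like this. The toy objective and the specific option values are assumptions for illustration, not tuned recommendations.

```matlab
% Toy deterministic objective, assumed for illustration only
x   = optimizableVariable('x',[-5 5]);
fun = @(t) log1p((t.x - 1).^2);

rng default   % for reproducibility
results = bayesopt(fun,x, ...
    'IsObjectiveDeterministic',false, ...   % let the GP model include a noise term
    'AcquisitionFunctionName','expected-improvement-plus', ...
    'NumSeedPoints',4, ...
    'MaxObjectiveEvaluations',15, ...
    'Verbose',0,'PlotFcn',[]);

% Compare the observed minimum with the model's estimate of the minimum
fprintf('Observed min: %g, estimated min: %g\n', ...
    results.MinObjective, results.MinEstimatedObjective);
```

Treating a deterministic objective as noisy lets the GP avoid interpolating every point exactly, which may yield a smoother fit near steep regions.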

Sebastian on 20 Sep 2019
Here are the remaining plots you suggested, Don. Can you see anything special here? Do computation times help with the analysis of convergence and accuracy? Additional information derived from the HyperparameterOptimizationResults object: the 3000x1 vectors ErrorTrace, FeasibilityTrace and FeasibilityProbabilityTrace are full of -1, true, and 1, respectively.

Don Mathis on 20 Sep 2019
Your objective function plot shows that you have values ranging from about .002 to about 1.3, which can be fine as long as the shape of the surface isn't too steep. But the fact that there seem to be a lot of points at about the .02 level AND a lot of points an order of magnitude higher (.2 to 1.3) suggests that maybe there is a flat surface at .02 and also some very high regions. So the GP has to model the high regions at 1.2, the flat region at around .02, and the actual minimum, which is at around .002. We see that the model overshoots the minimum by about .008 (producing a negative value). An overshoot of .008 is not too bad a modeling job, given that the range it is modeling is about 1.3. Unfortunately, because the model is Gaussian it is unbounded below, and goes below zero, which is unrealistic. Still, I would guess that it's finding good solutions.
To do even better, one thing you might try is to limit the domain that it's searching over. Can you tell from your command-line output what hyperparameter values lead to the highest points (e.g., 1.0 and above)? If you see a pattern there, you could try to limit the range of one or more of the variables. For example:
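```matlab
% X (predictors) and Y (response) are assumed to already be in the workspace.
% Convert class labels in Y to numeric category codes
% (only needed if Y is not already numeric)
Y = double(categorical(Y));
% Get the default hyperparameters for a tree ensemble
h = hyperparameters('fitrensemble',X,Y,'tree')
h.Name
h.Range
% Narrow the range of the 'LearnRate' parameter (third entry in h)
h(3)
h(3).Range = [.01 .1];
h(3)
% Optimize, passing the updated hyperparameters
model = fitrensemble(X,Y,'OptimizeHyperparameters',h)
```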
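To see which hyperparameter values produce the highest objective values without scrolling through the command-line output, one option is to sort the evaluation history. This is a sketch under assumptions: it uses the `carsmall` sample data and a small evaluation budget as stand-ins for the real problem.

```matlab
% Stand-in data (assumption): carsmall sample data set
load carsmall
X = [Horsepower Weight];
Y = MPG;

rng default   % for reproducibility
model = fitrensemble(X,Y,'OptimizeHyperparameters','auto', ...
    'HyperparameterOptimizationOptions', ...
    struct('MaxObjectiveEvaluations',10,'ShowPlots',false,'Verbose',0));

results = model.HyperparameterOptimizationResults;

% One row per evaluated point, plus the objective value, log(1+Loss)
evals = results.XTrace;
evals.Objective = results.ObjectiveTrace;

% Worst points first, to look for a pattern in their hyperparameters
evals = sortrows(evals,'Objective','descend');
disp(head(evals,5))
```

If the worst rows share, say, a very small or very large value of one variable, that variable's range is a natural candidate for narrowing.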