Why is the calculated Rsquare different between the embedded fit function and the EzyFit function (from File Exchange)?
Sorry, I think I have overwritten a new question over an old one... is there a way to restore the old question?
Accepted Answer
John D'Errico
on 22 May 2023
Edited: John D'Errico
on 22 May 2023
Do you understand that R^2 is not valid when computed for a model with no constant term? I recall there are variations of R^2 that are more valid for models with no constant term.
Did you notice that R^2 for one of those computations was negative? That clearly suggests a problem. That your model has NO constant term also suggests why there is a problem.
That your data has one outlier in it, one that would heavily influence the fit, is another problem. I won't get into that at all.
Simple one-number measures like R^2 are a problem. They are a far larger problem if you don't understand what they are telling you. And if you are worried about R^2 at all, on a problem with no constant term, then you don't understand R^2.
Let me spend some time writing and explaining...
x1 = [
1734.46674110029
1721.86718990168
1456.18495599912
1748.16876863704
1262.25401459449
1584.17734249873
1859.79267910209
864.395721875175
1952.12705609501
1503.45890484099
1976.60164096334
3862.90470480267
1914.88763478115
1373.87007296104
1826.95766710343
1515.55469767365
1765.08584136511
1318.10668583756];
y1 = [
4289.03428582923
2246.49016736711
1540.98595650498
2038.68253981628
3541.64494736076
4039.09183624669
3602.87690315152
1747.01379244271
4285.91071657769
2935.38381778387
2432.46183991586
4121.49502991896
2455.73593671295
4682.44200564969
3633.35882024405
1763.8042370884
1803.2292759675
3724.18628227003];
Essentially, R^2 is a very simple measure. It compares the variability left over after fitting your model to the variability in the data itself, IF we had essentially no model at all. We can get the latter from the variance.
var(y1)
Note that the variance SUBTRACTS OFF THE MEAN OF THE DATA. It implicitly assumes the model for this process is a constant model, with Gaussian noise added. So the implicit model in that variance computation is just
y = a0 + noise
And we can recover the best least squares estimate of a0 from the mean.
a0 = mean(y1)
But what did you fit? You tried to fit a model that lacks a constant term.
y = a1*x
We can get that directly from backslash, or we can use fit. Since we will do these computations essentially by hand, I'll use backslash.
format long g
a1 = x1\y1
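If it is not obvious what backslash is doing there, here is a minimal sketch. For the one-parameter model y = a1*x, the least squares slope has the closed form sum(x.*y)/sum(x.^2), and backslash should return the same number. The vectors x and y are small made-up values for illustration, NOT the data from the question.

```matlab
% Toy data (hypothetical, not the values from the question)
x = [1; 2; 3; 4; 5];
y = [2.1; 3.9; 6.2; 7.8; 10.1];

% Least squares slope for y = a1*x, from the normal equation
a1_closed = sum(x.*y) / sum(x.^2);

% Same estimate from backslash, which solves x*a1 = y in the least squares sense
a1_backslash = x \ y;

% The two should agree to within floating point rounding
abs(a1_closed - a1_backslash)
```

Neither route involves any iteration, which is why I trust backslash here.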
I can also use fit though, just to convince you that backslash was correct.
mdl = fittype('a1*x1','indep','x1');
[fittedmdl,G] = fit(x1,y1,mdl)
fittedmdl.a1
So we get the same value. If anything, fit should actually be a little less accurate here, because fit uses an iterative procedure. So any slop lies in the convergence tolerance.
You will notice that fit returns a NEGATIVE R^2. Again, that should be a hint. NO CONSTANT TERM.
Now, what does R^2 tell us? R^2 compares how well the current model does in terms of reducing the variability in the data, compared to no model more complex than assuming the process is simply a constant one, plus noise.
SSbase = sum((y1 - mean(y1)).^2)
SSmdl = sum((y1 - a1*x1).^2)
As you can see, the base sum of squares, where I subtracted off only the mean, is SMALLER than the sum of squares when I subtracted off the estimate from this model. The R^2 computation is now a simple one.
R2bad = 1 - SSmdl/SSbase
So again, a negative R^2 tells us that your model does not fit the data better than if you had just used a constant approximation for the process.
Finally, we might consider if a better measure (for THIS process) is how well the fit reduces the simple sum of squares of your data, had we not subtracted off the mean.
R2nocon = 1 - SSmdl/sum(y1.^2)
This assumes that the default model for the process is
y = noise
with a presumed mean of zero. And that is probably what ezyfit computed, since it apparently knew the model has no constant term.
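To see the two definitions side by side, here is a small sketch on made-up data (again NOT the data from the question). I chose y to be roughly constant and far from zero, so a line through the origin fits poorly. The names R2centered and R2uncentered are my own labels, and I am only assuming the second one matches what EzyFit reports.

```matlab
% Toy data (hypothetical): y is roughly constant and far from zero,
% so a line through the origin cannot follow it
x = [1; 2; 3; 4; 5];
y = [10; 9.5; 10.4; 9.8; 10.3];

a1 = x \ y;                         % least squares slope for y = a1*x
SSmdl = sum((y - a1*x).^2);         % residual sum of squares of the fit

SSbase = sum((y - mean(y)).^2);     % baseline: y = a0 + noise
SSzero = sum(y.^2);                 % baseline: y = noise, with mean zero

R2centered   = 1 - SSmdl/SSbase     % negative for this toy data
R2uncentered = 1 - SSmdl/SSzero     % between 0 and 1 for a least squares fit
```

The same fit gets two very different numbers, only because the baseline it is compared against changed.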
Finally, I'll plot the various models.
plot(x1,y1,'mo')
hold on
plot(fittedmdl,'r')
yline(mean(y1),'b')
The blue horizontal line is a model of the process where there is no signal at all, just random noise. In fact, that actually fits the data better in terms of explaining the sum of squares of errors, compared to the linear fit with no constant term.
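For contrast, here is a sketch of what happens once the model DOES have a constant term. This is my own illustration on made-up data, not something computed in the question. With a constant term in the model, the usual R^2 cannot go negative.

```matlab
% Toy data (hypothetical, not the values from the question)
x = [1; 2; 3; 4; 5];
y = [10; 9.5; 10.4; 9.8; 10.3];

% Fit y = a0 + a1*x by adding a column of ones to the design matrix
A = [ones(size(x)), x];
coef = A \ y;                       % coef(1) is a0, coef(2) is a1

SSbase = sum((y - mean(y)).^2);     % constant-only model
SSlin  = sum((y - A*coef).^2);      % line with a constant term

% SSlin cannot exceed SSbase, because a0 = mean(y), a1 = 0 is one of the
% candidates the least squares fit could have chosen
R2lin = 1 - SSlin/SSbase
```

That nesting of the constant model inside the fitted model is exactly what the no-constant model lacks.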
In the end, mono-numerosis is a BAD thing. NEVER RELY ON A SIMPLE NUMBER TO TELL YOU IF YOUR MODEL IS ANY GOOD, certainly not if you don't understand the number in the first place. I would even go further, to tell you to rely on your eyes and your brain, NOT on any number. If the fit looks right, it is right. At the very least, think about what you are doing. Is the fit adequate for what you need?