How do I do Regression for multiple subjects

Hi
This a stats question for regression analysis
I have data for 19 subjects who did 8 blocks of a task.
I want to see if data increase "significantly" across blocks.
Which option is correct (statistically speaking)
They give different results!!
Option 1: Fit every subject and then t-test on the coefficient of rate of change (slope)
coeff = [];
for s=1:19
fit = fitlm(1:8, data(s,:)) ;
coeff(s,1) = fit.Coefficients.Estimate(2) ;
end
[h,p,ci,stats] = ttest(coeff)
p = 0.1030
Option 2: Average across all subject and then run fitlm (on the group average) and use the stats in the fit LinearModel
M = mean(data) ;
fit = fitlm(1:8, M) ;
fit.Coefficients.pValue(2) = 0.155
thanks
pb

Risposte (2)

dpb
dpb circa un'ora fa
Spostato: dpb circa un'ora fa
Strictly speaking, you're supposed to know the test for hypothesis before designing and running the experiment.
The Q? here is what is the actual null hypothesis to test? If the hypotheis is one that there is a difference in difficulty between tasks, then the better test would be Friedman...if the question is whether there's a trend in difficulty versus test, that's different. Then it's a Q? of whether it's for the population as a whole or for each individual.
Without additional background on the test and objectives, I'd suggest Friedman.
data=readmatrix('data.txt');
stackedplot([data mean(data,2)])
friedman(data);
It is pretty low for significance, but you have no replications. As the bottom plot shows, there does appear to be a slight curvature in the average versus block although it would be difficult to show significance I suspect (I didn't try).
If the Friedman test were to turn out significant there are then post-hoc tests that can be utilized to identify which specifically differ.
I considered using friedman however there are no actual reeititions, consideering that there are 19 subjects and 8 tasks. However, the t-test is inappropriate, since the data appear not to be normally distributed. If none of the task scores can be zero or negative, then the lognormal distibution may be more appropriate.
A1 = readmatrix('data.txt')
A1 = 19×8
6.9011 10.7700 13.8070 16.1510 17.0180 20.1360 22.2640 13.6310 17.1100 16.0630 20.2080 12.0660 14.0380 28.7000 20.7400 17.3030 8.0825 13.1410 12.7320 25.8700 7.0998 6.7292 11.2540 7.6235 5.4173 8.9503 4.2487 1.8015 1.7145 0.3866 2.1199 9.4284 10.9450 16.4700 16.8660 13.6870 22.1510 8.3364 20.4840 26.0050 5.6463 2.2456 2.2951 8.6048 3.3609 1.6777 4.6868 2.2203 4.5081 6.7629 4.8043 7.7382 2.7232 5.6407 11.4580 4.4322 3.7913 3.2618 11.2600 1.4352 2.0173 7.4965 13.1190 5.4684 5.5242 0.4020 5.3666 2.5546 13.5970 7.7196 6.3798 0.3766 4.1787 9.2256 11.7950 6.0266 5.7861 7.8338 2.6496 2.6319 4.0627 3.1675 5.1919 8.8082 5.1689 7.8048 9.7785 13.1400 2.7955 10.9590 15.6410 3.8313 10.4450 19.3890 15.3080 8.5590 3.4822 4.1784 2.4938 3.5147 3.8841 4.4561 2.3881 4.7253 5.0303 5.4010 6.4618 3.4920 8.4972 3.8040 10.9280 5.8896 9.7166 11.0600 8.3587 6.1232 17.4880 14.1780 13.3290 10.9400
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
figure
plot((1:size(A1,2)), A1)
grid
xlabel('Task')
ylabel('Sobject Score')
legend(compose('Subject #%2d', 1:size(A1,1)), Location='northoutside', NumColumns=4)
xlim('padded')
figure
boxchart(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
boxplot(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'lognormal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Lognormal Fit')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'normal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Normal Fit')
tmean = mean(A1) % Task Means
tmean = 1×8
6.9375 9.0530 10.4197 8.6694 9.2670 10.0667 10.6752 9.0549
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
tmed = median(A1) % Task Medians
tmed = 1×8
5.5242 9.0569 9.7983 6.1133 7.0998 7.7245 10.9280 8.5590
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
Looking at the median values (also displayed in the boxchart and boxplot figures), there does not appear to be any specific trend across tasks.
I am not certain what to suggest, however there does not appear to be anything significant across tasks in these results.
.

11 Commenti

Pat
Pat circa 22 ore fa
Spostato: dpb circa 4 ore fa
Hi Star Strider and dpb
The 8 columns are repetitions, it's each subject data in different blocks of a task. So they are repeated measures in fact
i know I ould use one-way ANOVA with repeated mesures, but a significant main effect of Blocks woudn't tell me much, i'd have to run multiple t-test amonts the 8 blocks.
pb
dpb
dpb circa 20 ore fa
Spostato: dpb circa 4 ore fa
You have 19 subjects each of whom did eight different tasks once. That's not replication unless some of the eight tasks are the same task presented again?
I don't think regression for slope is meaningful here; the task numbers aren't a suitable predictor variable.
I'm not certain what your oriignal design or intent is.
There doesn't appear to be any specific trend in the median values for each task block.
A1 = readmatrix('data.txt');
tmed = median(A1)
tmed = 1×8
5.5242 9.0569 9.7983 6.1133 7.0998 7.7245 10.9280 8.5590
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
mdl = fitlm(1:numel(tmed), tmed)
mdl =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _______ ______ _________ (Intercept) 6.7417 1.4191 4.7507 0.0031566 x1 0.30195 0.28103 1.0745 0.32391 Number of observations: 8, Error degrees of freedom: 6 Root Mean Squared Error: 1.82 R-squared: 0.161, Adjusted R-Squared: 0.0216 F-statistic vs. constant model: 1.15, p-value = 0.324
figure
plot(mdl)
grid
.
Pat
Pat circa 18 ore fa
Spostato: dpb circa 4 ore fa
like I said it's repeated measures, 8 blocks during the same task done one after the other; same day in the same experimental session...task lasts about 20 min; it's different for each subj since it's a fatiguing task...people fatigued at different rate...
you're right about the unimpressive increase across blocks. But for other parts of the data in the same dataset (or other data I have) there is a stronger increase....it does not change the question I asked !!!
But that was not the question!!
Which method is more statistically sound:
regression on each subject and then t-test on the slopes (beta weight of the regression) OR average of the group and then regression on that mean.
-pb
The null hypothesis is yet to be clearly stated. If the "block" is considered an ordinal sequence number and the hypothesis is that there is an overall increase in duration with time/repitition, then the question is whether the conclusion is whether the change is for the population as a whole or for any specific individual. Presuming the population, the answer is obvious, use a measure of the durations of the population. Given @Star Strider's observation the observations are not normally distributed, the median may be a better choice or applying a Box transformation first.
If one were to do each individually, one has the isse of deriving a suitable correction for the problem of significance inflation for multiple comparisons.
I though I explained very clearly that I just want to know which of the option is more statistically sound!!
Forget the data I sent, I thought it would make the question clearer, but it seems to trouble you, so forget them...I have plenty of others data to look at
*** Option 1 or 2 ***
it's more of a theorotical question I guess
the null hypothesis is: no Slope change across the blocks
it means nothing to have block 3, for example, being larger/smaller than the others blocks. I need to see if there is a continuous trend across the blocks
-pb
dpb
dpb circa 3 ore fa
Modificato: dpb circa 3 ore fa
As I noted, it depends upon whether you're wanting to draw a conclusion regarding the population of subjects or by subject...the latter option leading to the issue of multiple comparisons across however many subjects there may be to extend from each individual to the population.
Since this case has an equal number of observations (lacking any missing) for each subject, the average of the slopes of all individual regressions will be the same as the slope of all subjects fitted together.
However, there's another possible problem with using regression for this...it may turn out to not be an issue, but by observation of the data sample, at least some of the traces appear to have a shape and a quadratic may have zero slope but still have a significant curvature. Would such a trend that won't be discovered by using a trend line alone be an issue? As noted, probably won't happen, but it's possible.
Hi
good point about the curvature, it is something that I'm intereted in looking at...in fact I am trying a quadratic fit
fit_poly2 = fitlm(tbl, 'vd ~ x + x^2', 'robustOpts', 'on') ;
I want to look at population level, not subject by subject. So The subj variable should be random so it can be extended to the population. Then, option 1 would be prefered right?
The mean of the slopes is similar for eac option but teh p-value is different, that might matters for all the data I have to look at.
When I fit for each subject I don't use thet stats (ie p-value), just the beta-weights so I don;t see where the multiple comparisons issue arises from.
data=readmatrix('data.txt');
M = mean(data) ;
fit = fitlm(1:8, M)
fit =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _______ ______ __________ (Intercept) 8.0641 0.83073 9.7073 6.8594e-05 x1 0.26751 0.16451 1.6261 0.15504 Number of observations: 8, Error degrees of freedom: 6 Root Mean Squared Error: 1.07 R-squared: 0.306, Adjusted R-Squared: 0.19 F-statistic vs. constant model: 2.64, p-value = 0.155
X=repmat([1:8],height(data),1);
fit = fitlm(X(:),data(:))
fit =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ ______ ______ __________ (Intercept) 8.0641 1.1084 7.2753 1.7909e-11 x1 0.26751 0.2195 1.2187 0.22485 Number of observations: 152, Error degrees of freedom: 150 Root Mean Squared Error: 6.2 R-squared: 0.00981, Adjusted R-Squared: 0.0032 F-statistic vs. constant model: 1.49, p-value = 0.225
N=height(data);
coeff=nan(N,2);
for s=1:N
fit = fitlm(1:8, data(s,:));
coeff(s,:)=fit.Coefficients.Estimate;
end
mean(coeff)
ans = 1×2
8.0641 0.2675
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
[~,p] = ttest(coeff(:,2))
p = 0.1032
[nnz(coeff(:,2))==height(coeff) all(coeff(:,2))>0]
ans = 1×2 logical array
1 1
As shown, the estimates are identical; the p-values are different owing to the DOF reduction in collapsing to the 8 observations with the means versus all data.
The t-test on the array of coefficients is not meaningful as there's no error variance of the quality of fit; I'm not at all certain how you must have gotten the number you posted initially.
Overall, I think the better approach is the mean.
the coefficients in your coeff variable are all the same!! that's why you have that p-value so low. your code did not run the regression for each subject
I treat the bw of the regression done in each subject, as it were any type of dependent variable, like reaction time values, scores on a test...i don't think i need the error term, just the slope is enought
dpb
dpb 20 minuti fa
Modificato: dpb 7 minuti fa
DOH! <Head slap!> I whiffed on somehow having not gotten the fit inside the loop, indeed.
It is a biased estimator for the aforementioned reasons, however; you've completely removed the likelihood of any being non-significant by keeping only the coefficient itself. Without the variability in the observations and thereby the uncertainty in the coefficients, the conclusion will always be there's a trend unless it turns out to be identically zero or totally random such that the mean turns out to be zero. It's not a meaningful test statistic for the stated hypothesis.
I still think for the composite group the mean estimator would be the better choice although I'm still not totally convinced it's the best approach--but otomh a better alternative isn't coming to me.

Accedi per commentare.

Richiesto:

Pat
il 27 Mag 2026 alle 16:40

Modificato:

dpb
1 minuto fa

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by