Repeat for-loop from the beginning after recalculating values

Ok, I am having a hard time figuring this out. I have a 6 X 6 matrix of values, in this case, pressure measurements in psi (P31.mat attached). I'm trying to apply rejection criteria for the outliers in this set of pressures. Each column of P31 is from a separate pressure transducer (6 measurements from each). I calculate the mean, standard deviation, and the number of elements (not including NaN) in each column using the first loop. In the second loop I calculate the deviation of each measurement (delta_31) from the mean (of the column).
In the third loop I use if-statements to determine a value for tau according to the number of elements in each column. Then I multiply that tau value by the standard deviation of the column and compare each deviation value (delta_31) against the product. If delta_31(i,j) is greater than the calculated value of tau*stddev, I replace that delta value with a NaN, as well as the corresponding (i,j) value in P31. I then recalculate the stdev, mean, and number of elements. The problem is that I'm trying to repeat the loop to determine whether there is another outlier in that column, and I can't seem to get it. I thought that by setting i = 1 the loop would begin again, but it doesn't. Can anyone help? I apologize if this is confusing.
for k = 1:size(P31,2)
    x_31(k) = mean(P31(:,k),'omitnan');
    SDP_31(k) = std(P31(:,k),'omitnan');
    n_31(k) = numel(P31(:,k)) - sum(isnan(P31(:,k)));
    nu_31(k) = n_31(k) - 1;
end
for j = 3:size(P31,2)
    for i = 1:size(P31,1)
        delta_31(i,j) = abs(P31(i,j) - x_31(j));
    end
    for j = 3:size(P31,2)
        for i = 1:size(P31,1)
            % Conditional if-statements to determine the values for tau
            if n_31(j) == 3
                tau_31 = 1.150;
            elseif n_31(j) == 4
                tau_31 = 1.393;
            elseif n_31(j) == 5
                tau_31 = 1.572;
            elseif n_31(j) == 6
                tau_31 = 1.656;
            elseif n_31(j) == 7
                tau_31 = 1.711;
            end
            % Calculate tau*S to compare with the delta's
            tauS_31(j) = tau_31*SDP_31(j);
            % if-statement to replace outliers with NaN and recalculate
            % everything. This actually works, but only for eliminating one
            % outlier from the column.
            if delta_31(i,j) > tauS_31(j)
                delta_31(i,j) = NaN;
                P31(i,j) = NaN;
                x_31(j) = mean(P31(:,j),'omitnan');
                SDP_31(j) = std(P31(:,j),'omitnan');
                n_31(j) = numel(P31(:,j)) - sum(isnan(P31(:,j)));
                for k = 1:size(P31,1)
                    delta_31(k,j) = abs(P31(k,j) - x_31(j));
                end
                i = 1; % this is supposed to start the i-loop again, but doesn't
            else
                i = i + 1;
            end
        end
    end

3 Comments

I forgot to say this. I am trying to use the Thompson Tau technique here. I know there are files out there that can do this, but they all rearrange the vectors in either descending or ascending order. I cannot do that here. Each row of P31 is from a single experiment, so I need all of the data in that row. Rearranging any of the vectors will throw off the result.
"I thought that by stating i = 1, the loop should begin again..."
No, that is pretty much standard behavior in (most) all modern programming languages: the loop limit is stored before the loop begins, and in MATLAB any change to the loop index variable is overwritten by for on the next iteration from a local copy.
You could use one of those routines simply by saving the original index order to rearrange the order back to original when done.
The Thompson tau is a one-at-a-time test, so on each iteration it's sufficient to test only the largest absolute difference; the ordering is some algorithms' way to look at the two end points each iteration rather than finding max(abs(difference)) each pass.


Accepted Answer

dpb on 6 Sep 2018
Edited: dpb on 6 Sep 2018
Directly addressing the Q?, here's a basic function to do the removal--
function x=thompsontau(x)
% Return array x w/ outliers replaced w/ NaN from input x
% Each column in x is assumed to be a variable, rows are observations
%
% dpbozarth -- 5Sep2018
[r,c]=size(x);            % row, col dimensions
cols=1:c;                 % later lookup for columns if outliers exist
mn=mean(x,'omitnan');     % mean by column
sd=std(x,'omitnan');      % std deviation
n=sum(isfinite(x));       % finite observations
[mx,ix]=max(abs(x-mn));   % value, location (row) of max deviation
isOut=mx>fntau(n).*sd;    % column locations are outliers if T
if any(isOut)
    indxOut=sub2ind([r c],ix(isOut),cols(isOut)); % locations in x of outliers
    x(indxOut)=nan;       % mark as outlier
end
end

function tau = fntau(n)
% returns critical Thompson tau for n observations
tcrit=tinv(1-0.05/2,n-2); % critical t value
tau=tcrit./sqrt(n)./sqrt(n-2+tcrit.*tcrit).*(n-1);
end
This vectorizes your loop approach for the first pass; your mission, should you choose to accept it, is now to encapsulate this so the tests repeat until any(isOut) is false.
This could be done by calling the above in a loop or by calling it recursively inside the function; I'll leave that as an "exercise for the student" (which seems only appropriate for a t-test based algorithm :) )
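For the loop-based route, one minimal sketch (assuming your data matrix is in x and the one-pass thompsontau above is on the path) is to call the function until the output stops changing; isequaln treats NaNs as equal, so the loop exits as soon as a pass removes nothing:

```matlab
% Loop-based wrapper: repeat the one-pass removal until nothing changes.
% isequaln compares NaN to NaN as equal, so "no change" means "no new outliers".
xPrev = nan(size(x));
while ~isequaln(x, xPrev)
    xPrev = x;
    x = thompsontau(x);   % one-pass version defined above
end
```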
NB: This operates on the full input array; if there's reason to not operate on some columns of an array, simply pass the subset. Or, as in this case, the first two columns will never have an outlier show up so it doesn't matter much, but for some datasets obviously that might not be so.
But, don't build such "magic constants" into functions that makes them unsuitable for other use as a general practice.
PS: HINT The recursive solution is by far the neater here and only takes inserting three additional code lines, all of which are consecutive, and two of which are Matlab branching keywords. :) (And, actually, the latter two aren't absolutely mandatory but I think it makes the code more legible to include them).
NB2: fntau uses tinv, which requires the Statistics Toolbox. Lacking that, you will have to fall back on a lookup table or find another way to obtain the critical t-value.
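One possible sketch of the lookup-table route, using only the alpha=0.05 tau constants already quoted in the question (so it is valid only for n = 3..7):

```matlab
function tau = fntau(n)
% Lookup-table stand-in for the tinv-based fntau (no Statistics TB needed).
% Table holds the alpha=0.05 Thompson tau constants for n = 3..7 from the
% question above; extend the table for larger sample sizes as required.
tauTable = [1.150 1.393 1.572 1.656 1.711]; % n = 3,4,5,6,7
tau = tauTable(n-2);                        % works elementwise for a vector n
end
```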

12 Comments

[PW Answer moved to Comment...dpb]
I am not using the first two because there are definitely no outliers in those sets. Also, those pressures are obtained using a pressure gauge and a much larger pressure transducer. To be quite honest, I just don't need to do anything with those values, other than get a mean and standard deviation. Thank you all for your input. I will try everything and get back to you.
The question was asked before I had seen the data, given only the initial description of what the data are, but it was driven more by the generic principle than the specifics: write code without "magic" numbers or logic baked in. One almost always finds other uses for a routine such as this, so might as well make it suitable for what is bound to change. :)
So, to use the function on just the columns you need to test, call it with
P31clean(:,4:6)=thompsontau(P31(:,4:6));
or, since you know a priori there aren't any outliers in there anyway, it doesn't hurt to just pass the whole array; the first two columns will come out unscathed...
P31clean=thompsontau(P31);
Of course, that's after you make the one change to handle the multiple passes! :)
Awesome, thank you for all your help. I think this will work perfectly.
So, did you get the recursive version working? :)
Yes, I put the function into a while loop and now it runs until all outliers have been eliminated. Thank you so much for your help!
That's cheating! :) I'll attach the final version; I also incorporated an optional parameter to allow one to change the alpha parameter for the critical t values...
But the critical piece of the implementation is
...
if any(isOut)
    indxOut=sub2ind([r c],ix(isOut),cols(isOut)); % locations in x of outliers
    x(indxOut)=nan;       % mark as outlier
    x=thompsontau(x);     % recursively check for any more
end
That is the minimal change. While not strictly needed, I thought incorporating the else clause made it a little more human-legible as to intent...
...
if any(isOut)
    indxOut=sub2ind([r c],ix(isOut),cols(isOut)); % locations in x of outliers
    x(indxOut)=nan;       % mark as outlier
    x=thompsontau(x);     % recursively check for any more
else
    return
end
function x=thompsontau(x,varargin)
% Return array x w/ outliers replaced w/ NaN from input x
% Each column in x is assumed to be a variable, rows are observations
% Default for the critical t-values is alpha=0.05 two-sided. Supply the
% total alpha as an optional second parameter if desired to expand/loosen
% tolerance
%
% x = thompsontau(x,0.10)
% Specify a 10% two-sided interval
% dpbozarth -- 5Sep2018
% -- 7Sep2018 Incorporated optional alpha level
defaultalpha=0.05;
fnValidAlpha=@(x) isnumeric(x) && isscalar(x) && iswithin(x,0,1); % iswithin is a user-supplied helper, not a builtin
p=inputParser;
addRequired(p,'x')
addOptional(p,'alpha',defaultalpha,fnValidAlpha)
parse(p,x,varargin{:})
alpha=p.Results.alpha;
[r,c]=size(x);            % row, col dimensions
cols=1:c;                 % later lookup for columns if outliers exist
mn=mean(x,'omitnan');     % mean by column
sd=std(x,'omitnan');      % std deviation
n=sum(isfinite(x));       % finite observations
[mx,ix]=max(abs(x-mn));   % value, location (row) of max deviation
isOut=mx>fntau(n,alpha).*sd; % column locations are outliers if T
if any(isOut)
    indxOut=sub2ind([r c],ix(isOut),cols(isOut)); % locations in x of outliers
    x(indxOut)=nan;       % mark as outlier
    x=thompsontau(x,alpha); % check for more
%else
%    return
end
end

function tau = fntau(n,alpha)
% returns critical Thompson tau for n observations
tcrit=tinv(1-alpha/2,n-2); % critical t value
tau=tcrit./sqrt(n)./sqrt(n-2+tcrit.*tcrit).*(n-1);
end
If you aren't cheating, you aren't trying... Here's the code I have now
while i <= 2
    [r,c] = size(P31);             % row, col dimensions
    cols = 1:c;                    % later lookup for columns if outliers exist
    x_31 = mean(P31,'omitnan');    % mean by column
    SDP_31 = std(P31,'omitnan');   % std deviation by column
    n_31 = sum(isfinite(P31));     % finite observations
    [mx,ix] = max(abs(P31-x_31));  % value, location (row) of max deviation
    tauS_31 = mx > fntau(n_31).*SDP_31; % column locations are outliers
    % The function fntau finds tau values according to the number of
    % elements.
    if any(tauS_31)
        indxOut = sub2ind([r c],ix(tauS_31),cols(tauS_31));
        P31(indxOut) = nan;        % mark as outlier
        i = 1;
    else
        i = i + 1;
    end
That'll only do two iterations max, won't it? The recursive will do all that continue to indicate there is an outlier.
I thought of making a limit on the number to remove another optional input...
No, it goes all the way through. I've checked it multiple times. It does exactly what it's supposed to do. I have also figured out how to run it over multiple data matrices, one at a time. I just have to figure out a way to save the data before it continues on the next matrix.
Presuming you initialize i = 0, it can't run more than three times (i = 0, 1, 2); then it has to terminate. Now, that may be what you want for this particular data, given there are only six observations to begin with, but it isn't the general case.
To use it for multiple data arrays and for various other combinations of input/output is why I cast it into a separate function and why such functionality should be factored out as functions instead of being inline in code.
If you can call it as a function, then you've got the whole panoply of Matlab control structure to bear and the variables in the function are totally independent of those in the main application so you can call it with any data set at hand.
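For example (the P32 and P33 names here are hypothetical stand-ins for your other pressure matrices), cleaning several datasets and keeping each result could look like:

```matlab
datasets = {P31, P32, P33};        % hypothetical list of data matrices
cleaned  = cell(size(datasets));   % one result cell per dataset
for k = 1:numel(datasets)
    cleaned{k} = thompsontau(datasets{k}); % function variables stay local
end
```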
Seriously, it works for all of the data sets. I've gone through each one step by step using the debugging feature to make sure it is doing what it is supposed to. All of them come out exactly like they should. Some run through 7 times, some only 3, etc. Remember that it resets the value of i if it detects an outlier. Also, when I run for multiple data sets, i also resets back to 1 before the end of the j-th loop. It is most definitely working.
Ah! My old eyes had missed that, indeed.
I still think just
if any(isOut)
    ...
    x=thompsontau(x); % recursively check for any more
end
is much simpler.


More Answers (2)

dpb on 6 Sep 2018
Edited: dpb on 6 Sep 2018
The first loop can be eliminated entirely--use the Matlab array syntax:
x_31=mean(P31,'omitnan');
SDP_31=std(P31,'omitnan');
n_31=sum(isfinite(P31));
nu_31=n_31-1;
I edited your code above to clean up the indenting... it looks to me like there's a missing end at some point; I didn't try to figure out where it belongs. The trivial assumption, which may or may not be right, is that it's the last one closing the big for; that needs checking.
Moving right along...
for j = 3:size(P31,2)
    for i = 1:size(P31,1)
        delta_31(i,j) = abs(P31(i,j) - x_31(j));
    end
    for j = 3:size(P31,2)
        for i = 1:size(P31,1)
        ...
Oh. You've got nested loops on j...perhaps the missing end belongs here before the second j loop?
Like
for j = 3:size(P31,2)
    for i = 1:size(P31,1)
        delta_31(i,j) = abs(P31(i,j) - x_31(j));
    end
end
for j = 3:size(P31,2)
    for i = 1:size(P31,1)
    ...
maybe? If that's so then again use vector syntax--
delta_31(:,3:end)=abs(P31(:,3:end)-x_31);
Now you're down to the removal criteria, but what's the deal with only using sensors 3:end instead of all six? I don't see what that's about. If you have outliers, it seems like you'd want to check all sensors.
I'm running out of time/energy here this evening but one last little detail to make code a little easier --
if n_31(j) == 3
    tau_31 = 1.150;
elseif n_31(j) == 4
    tau_31 = 1.393;
elseif n_31(j) == 5
    tau_31 = 1.572;
elseif n_31(j) == 6
    tau_31 = 1.656;
elseif n_31(j) == 7
    tau_31 = 1.711;
end
% put this at very beginning of your script/function...
tau=[1.150, 1.393, 1.572, 1.656, 1.711]; % lookup table values
Then to use, simply write
tau_31(j)=tau(n_31(j)-2);
I'm sure most if not all of the remainder can also be vectorized, probably with one loop on the outside to do the multiple comparisons.
Take a look at the way the preceding loops were reduced and give the rest a shot; I'll try to look in again in the morning.

1 Comment

For tau, go back to definition...
tcrit=@(n) tinv(1-0.05/2,n-2); % t distribution t(alpha/2,n-2)
fntau=@(n) tcrit(n)./sqrt(n)./sqrt(n-2+tcrit(n).*tcrit(n)).*(n-1);
Then when you need tau, just write
tau=fntau(n);
This has been vectorized, so you can pass a vector of n.


I'm not familiar with that particular method. Is it close enough to one of the methods supported by the isoutlier function that you can use that function?
If it is not but is used often in a particular application area to detect outliers in data associated with that area, consider filing an enhancement request through Technical Support (using the Contact Us link in the upper-right corner of this page) describing that application area, giving more information and/or a reference for that algorithm, and asking for it to be added to isoutlier.

3 Comments

dpb on 8 Sep 2018
Edited: dpb on 11 Sep 2018
This goes back into my "wayback" machine, Steven. I had to go dig through the "old black notebook of stuff" I collected over my working career, from before there was any online way to find things and we saved paper copies of papers! :)
Thompson, W.R., 1935, On a criterion for the rejection of observations and the distribution of the ratio of the deviation to the sample standard deviation, Ann. Math. Stat., 6:214-219.
Now I find the reference dissertation, which I found through Dr. Clark (utmost gentleman and mentor, staff Senior Mathematician at B&W NPGD back in the late 60s/early 70s), is available via a Google search!!!
This is all quite dated by now, of course, and I've made no attempt to keep up with modern literature, but the basis for the Thompson test.
And, I forget now just what the conclusions were, altho I do know, and should have pointed out, that blindly applying it repetitively is fraught with difficulty despite the cuteness of being able to write the algorithm. One issue Grubbs pointed out is that the probability of an arbitrary point being an outlier isn't the same as the probability that the largest observation (say) is an outlier with respect to the remaining observations; the former is the Thompson test, not the latter.
ADDENDUM
Ain't science wunnerful!!! :)
Thompson(1935) full text pdf document of the original paper. I do not seem to have a record of the communications after.
The subsequent modification as used above, which is fairly widely spread across various university sites, I'm not positive about the origin of; my quick algebra on (4) p. 36 didn't quite duplicate the above, although I won't claim I didn't blunder.
ADDENDUM 2
I didn't have a copy but I did find a scribbled note where I jotted down the Grubbs paper reference and managed to now find it online, too... Grubbs(1950).
While I think in general some additional tools might be incorporated in ML, it's not clear to me what is currently considered "state of the art" but I doubt Thompson is the last word on the subject (altho I wouldn't say it shouldn't be an available option).
When trying to catch up on something I haven't looked at in quite some time, I generally see what NIST currently says; I see they have the MAD (an interesting historical use of that acronym exists :) ), as well as Grubbs (one outlier), an extended Grubbs for multiple outliers, and another ESD test for an unknown number of outliers with "just" an expected upper limit.
I used a textbook from years ago to reference this method: "Introduction to Engineering Experimentation", by Wheeler and Ganji (3rd Ed.), chapter 6. The other Statistics textbooks I have do not cover it at all.
I never ran across that text; the similar subject-matter text I used was Bevington, Data Reduction and Error Analysis for the Physical Sciences, but it didn't touch on the subject. I see there is a revised edition now, but it seems virtually identical in content, with just some references to there now being such a thing as a personal computer and available software.


Release: R2018a

Asked: 5 Sep 2018

Last commented: dpb on 11 Sep 2018
