Finding peaks in a very noisy signal

Hi, I have a large set of noisy signals, see examples below.
I want to calculate the mean frequency of the pulsations, and to do that I first have to detect the peaks numerically. I have an algorithm, but it relies heavily on empirically chosen parameters for the findpeaks function, and it typically takes me 3-4 tries before I get them right for a given dataset.
Can someone point me in the right direction on how to make my algorithm more adaptive, or how to approach this differently? My code is pasted below. Code in .mlx and test files added as an attachment.
filename = uigetfile("Test*.mat");
if isequal(filename, 0), return, end % uigetfile returns 0 on cancel
load(filename,"dt","signal")
time = dt*(1:length(signal));
halfM = 0.5*mean(signal);
% First find maxima separated by at least five time steps, with a minimum prominence of 1% of the max value in the whole dataset
[pks, locs] = findpeaks(signal, time,"MinPeakDistance",5*dt,"MinPeakProminence",0.01*max(signal));
% Then find minima to make sure closely spaced, sharp peaks are differentiated in the next step
[min_pks, min_locs] = findpeaks(-signal, time,"MinPeakDistance",5*dt,"MinPeakProminence",0.01*max(signal));
% Combine minimum and maximum points
pks = [pks -min_pks]; locs = [locs min_locs];
% Sort the points for further processing
[locs, idx] = sort(locs); pks = pks(idx);
% Custom values for minimum separation and maximum peak width
minSep = 5e-9; maxPW = 1e6;
% Plot the results with peak prominences and widths marked
figure
findpeaks(pks, locs, "MinPeakProminence", halfM, "MinPeakHeight", 2*halfM, "MinPeakDistance", minSep, "MaxPeakWidth", maxPW,"Annotate","extents");
% Plot the original signal with all peaks and valleys marked in small orange dots,
% and major peaks marked in large yellow dots
figure, plot(time, signal), hold on, scatter(locs, pks, Marker=".",LineWidth=1)
[pks,locs] = findpeaks(pks, locs, "MinPeakProminence", halfM, "MinPeakHeight", 2*halfM, "MinPeakDistance", minSep, "MaxPeakWidth", maxPW);
scatter(locs, pks, Marker="o", MarkerFaceColor="flat"), hold off
xlabel 'Time [\mus]', ylabel 'Signal power [a.u.]'

Accepted Answer

Your signals have broadband noise, so a frequency-selective filter is not going to apply here. The Savitzky-Golay filter (or wavelet denoising) is best suited for signals with broadband noise, so I implemented it here. Change the ‘framelen’ value to give the best result. (I chose a 3rd-order polynomial, since that generally works best in my experience; however, you can experiment with that as well.)
Try this —
files = dir('*.mat');
for k = 1:numel(files)
LD = load(files(k).name);
Fs(k) = 1/LD.dt;
s{k} = LD.signal;
L = numel(s{k});
t{k} = linspace(0, L-1, L)/Fs(k);
end
figure
tiledlayout(numel(files),1)
for k = 1:numel(files)
nexttile
plot(t{k}, s{k})
grid
xlabel('Time')
ylabel('Amplitude')
title(string(files(k).name))
end
sgtitle('Original')
figure
tiledlayout(numel(files),1)
for k = 1:numel(files)
nexttile
[FTs1,Fv] = FFT1(s{k},t{k});
plot(Fv, abs(FTs1)*2)
grid
xlabel('Frequency')
ylabel('Magnitude')
xlim([0 5E+8])
title(string(files(k).name))
end
sgtitle('Fourier Transforms')
figure
tiledlayout(numel(files),1)
for k = 1:numel(files)
nexttile
framelen = 1+5E+3;
s_filt = sgolayfilt(double(s{k}), 3, framelen);
plot(t{k}, s_filt)
grid
xlabel('Time')
ylabel('Amplitude')
title(string(files(k).name))
end
sgtitle('Savitzky-Golay Filtered Signal')
function [FTs1,Fv] = FFT1(s,t)
% Arguments:
% s: Signal Vector Or Matrix
% t: Associated Time Vector
t = t(:);
L = numel(t);
if size(s,2) == L
s = s.';
end
Fs = 1/mean(diff(t));
Fn = Fs/2;
NFFT = 2^nextpow2(L);
FTs = fft((s - mean(s)) .* hann(L).*ones(1,size(s,2)), NFFT)/sum(hann(L));
Fv = Fs*(0:(NFFT/2))/NFFT;
% Fv = linspace(0, 1, NFFT/2+1)*Fn;
Iv = 1:numel(Fv);
Fv = Fv(:);
FTs1 = FTs(Iv,:);
end

9 Comments

Looks promising, I'll definitely try it. One question though, do you think I could somehow use the FFT data to automatically pick framelen value based on the signal characteristics? You have nicely visualized the fact that I have very different frequency content between different signals, so fixed filter parameters will not be enough to capture the variation here. I really need to automate it without the need to manually pick a value each time, because I have >100 datasets, and while many of them are similar, I don't have a good way of categorizing that.
Thank you!
‘... do you think I could somehow use the FFT data to automatically pick framelen value based on the signal characteristics?’
Unfortunately, no. The FFT can be very useful in choosing the cutoff frequencies for frequency-selective filters; however, the ‘framelen’ value depends on the noise characteristics. I experimented a while ago with using the period (1./Fv) of the FFT for this, however it did not produce any useful result. It is necessary to just ‘cut and try’ different ‘framelen’ values, as in many signal processing problems. The only useful information that it is possible to get from the signal with respect to the ‘framelen’ value is the length of a regularly-sampled signal. I generally begin with a ‘framelen’ value about 1% of the signal length, and go from there.
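[Editor's note: the "about 1% of the signal length" starting heuristic can be sketched as follows. This is a Python/SciPy analogue for illustration, not the MATLAB from the thread; `auto_framelen` is a hypothetical helper name. The frame length is forced odd and larger than the polynomial order, as both sgolayfilt and scipy's savgol_filter require.]

```python
import numpy as np
from scipy.signal import savgol_filter

def auto_framelen(n_samples, fraction=0.01, polyorder=3):
    """Start with a frame length ~1% of the signal length, forced odd
    and strictly greater than the polynomial order."""
    fl = max(int(round(n_samples * fraction)), polyorder + 2)
    if fl % 2 == 0:
        fl += 1
    return fl

# demo on a noisy sine
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 5000)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.standard_normal(t.size)
fl = auto_framelen(x.size)        # 51 for a 5000-sample signal
smooth = savgol_filter(x, fl, 3)  # same shape as x, noise reduced
```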
Since my signals have a lot of high-frequency content, I had to go way down with 'framelen'. I found values in the 30-50 range best.
You seem like an experienced DSP user, so I'll pick your brain if you don't mind. I tried doing an FT of the raw signals and calculating the weighted mean frequency, akin to a center-of-mass calculation where every point is weighted by its distance (frequency). As expected, I got a mean time (the inverse of the mean frequency) that is shorter for TEST1 and TEST4, and longer for TEST2 and TEST3, which makes sense just by visual inspection, since 1 and 4 need shorter windows to preserve signal characteristics. Does this method make any sense to you?
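[Editor's note: the weighted-mean-frequency idea described here, i.e. the "centre of mass" of the power spectrum, can be sketched in Python for illustration (`weighted_mean_freq` is a hypothetical helper, not code from the thread). MATLAB's meanfreq computes a comparable PSD-weighted mean.]

```python
import numpy as np

def weighted_mean_freq(x, fs):
    """Power-spectrum-weighted mean frequency: sum(f * PSD) / sum(PSD),
    a 'centre of mass' of the one-sided spectrum."""
    X = np.fft.rfft(x - np.mean(x))          # remove DC before transforming
    psd = np.abs(X) ** 2
    f = np.fft.rfftfreq(x.size, d=1 / fs)
    return np.sum(f * psd) / np.sum(psd)

# demo: for a pure 20 MHz tone the weighted mean is the tone frequency
fs = 1e9                                     # 1 GHz sampling
t = np.arange(10000) / fs
x = np.sin(2 * np.pi * 20e6 * t)
fm = weighted_mean_freq(x, fs)               # ~2e7 Hz
```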
Also, I tried a range of window lengths and polynomial orders, and got very similar results at, for example, poly = 3, framelen = 31 and at poly = 9, framelen = 71. In that case, I would obviously choose the first configuration since it's going to be so much faster, right?
Thanks in advance for the discussion, I'm happily surprised that I got any answers, since this is not a simple problem and I wasn't sure if anyone would have the time and expertise to pick it up!
P.S. I wonder if ML could be useful here, but that's way beyond my current expertise.
I don’t mind at all. I like signal processing problems!
As I mentioned, I usually begin with ‘framelen’ values about 1% of the signal length, and then experiment.
I generally do not do much with the mean frequency, and when I do, I tend to use the meanfreq function. A weighted mean, where the weighting vector is the frequency, would obviously tend to give more emphasis to the higher frequencies. It seems to me that this emphasizes high-frequency noise, if any is present. (The inverse of the mean frequency would be the mean period.)
The method definitely makes sense, although it is not my usual approach. I am not certain what you want from the signal vectors you have, since I do not know your objective.
I haven’t timed it, however I suspect the (3,31) option would be faster.
What does ‘ML’ refer to here?
I'm sorry, ML meant Machine Learning. I see this acronym a lot, but I should've made it clear.
Good point about the weighted mean frequency. I write a lot of code on my own because I like to fully understand and control what's going on, but sometimes I miss good built-in functions because of that. I will check what results I get from meanfreq.
My ultimate objective is to estimate the pulsation frequency in the signals. It's really hard because, as you probably noticed, even in a single dataset I have a wide range of peak heights, peak widths, peak prominences; there are multi-peak formations... it's a headache. My current approach is to roughly detect peaks, calculate the distances between consecutive peaks, and estimate the pulsation frequency based on that.
I plot and analyze parameters such as peak height, peak width, and peak prominence to estimate cutoff parameters for MATLAB's findpeaks function, but my biggest problem is that I always have to pick the parameters for each dataset manually. I can write an algorithm that works amazingly well for a certain dataset, but I'm having a hard time making it robust. I feel that, with the huge difference between magnitudes of different peaks and their widths, I have to resort to a localized method, which performs a moving-window analysis of the signal first and pre-determines the parameter values for the signal processing. I wish I could do a Savitzky-Golay filter that changes window size / polynomial order dynamically within one array of data, but my gut feeling is that even if I write a function that does that, it will take forever to execute.
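[Editor's note: the "detect peaks, take inter-peak distances, invert" approach can be made per-dataset adaptive by expressing the prominence threshold as a fraction of each signal's own range instead of an absolute value. A Python/SciPy sketch for illustration (`pulsation_freq` and its `prom_frac` parameter are hypothetical names, not from the thread):]

```python
import numpy as np
from scipy.signal import find_peaks

def pulsation_freq(signal, dt, prom_frac=0.5):
    """Rough pulsation-frequency estimate: detect peaks whose prominence
    exceeds a fraction of the signal's own range (adaptive, no hand-tuned
    absolute threshold), then invert the median inter-peak spacing."""
    prom = prom_frac * (signal.max() - signal.min())
    peaks, _ = find_peaks(signal, prominence=prom)
    if peaks.size < 2:
        return float('nan')
    return 1.0 / (np.median(np.diff(peaks)) * dt)

# demo: a 20 MHz pulse train (50 ns spacing) buried in broadband noise
rng = np.random.default_rng(1)
dt = 1e-9                       # 1 ns sample step
x = 0.1 * rng.standard_normal(2000)
x[25::50] += 2.0                # pulses every 50 samples
f_est = pulsation_freq(x, dt)   # ~2e7 Hz
```

The median (rather than the mean) of the spacings makes the estimate less sensitive to occasional missed or double-counted pulses.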
I still do not understand what your objective is, and the system you are measuring. Two options that may be helpful are wavelet analysis, and the emd (empirical mode decomposition) function (introduced in R2018a). (This is similar to independent components analysis for blind-source separation.) This occurred to me this morning, although it took me a while to remember what the function is, since I don't use it often.
Try this —
files = dir('*.mat');
for k = 1:numel(files)
LD = load(files(k).name);
Fs(k) = 1/LD.dt;
s{k} = LD.signal;
L = numel(s{k});
t{k} = linspace(0, L-1, L)/Fs(k);
end
figure
tiledlayout(numel(files),1)
for k = 1:numel(files)
nexttile
plot(t{k}, s{k})
grid
xlabel('Time')
ylabel('Amplitude')
title(string(files(k).name))
end
sgtitle('Original')
figure
for k1 = 1:numel(files)
L = numel(s{k1});
sd = resample(s{k1}, L, fix(L*1E+1));
% sd = sum(sin((1 : 50 : 200).' * 2*pi*(0:1E+4-1)/1E5 + (1:4).' )).'
tr = linspace(0, numel(s{k1}), numel(sd)).'/Fs(k1);
[imf{k1},resid] = emd(sd);
szimf = size(imf{k1});
disp("imf size = "+szimf(1)+" rows, "+szimf(2)+" columns")
figure
tiledlayout(ceil(size(imf{k1},2)/2),2)
for k2 = 1:size(imf{k1},2)
nexttile
plot(tr, imf{k1}(:,k2))
grid
xlabel('Time')
ylabel('Amplitude')
title("imf(:,"+k2+")")
end
sgtitle("Empirical Mode Decomposition Analysis Of: "+string(files(k1).name))
end
imf size = 10000 rows, 10 columns
imf size = 10000 rows, 9 columns
imf size = 10000 rows, 10 columns
imf size = 10000 rows, 8 columns
This code saves the results in a cell array, and also illustrates how to index into it (in the plot calls). Columns 6 or 7 of the ‘imf’ matrix may be the results you want.
I doubt that it would be possible to improve on this result.
With findpeaks, I generally use the 'MinPeakProminence' name-value pair to select only the most prominent peaks. That neatly avoids problems with choosing peak amplitudes ('MinPeakHeight') while still returning the significant peaks in a region.
With respect to the pulses, if they have irregular intervals, modeling them as a Poisson process (poissfit) could be worth exploring.
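[Editor's note: for a Poisson process the intervals between pulses are exponentially distributed, and the maximum-likelihood rate is simply the reciprocal of the mean interval. A Python sketch of that sanity check, for illustration (`poisson_rate` is a hypothetical helper; MATLAB's poissfit works on event counts rather than intervals):]

```python
import numpy as np

def poisson_rate(event_times):
    """MLE of the rate of a Poisson process from pulse arrival times:
    intervals are exponential, so rate = 1 / mean(intervals)."""
    iv = np.diff(np.sort(np.asarray(event_times)))
    return 1.0 / np.mean(iv)

# demo: simulate arrivals at a 20 MHz mean rate (50 ns mean spacing)
rng = np.random.default_rng(0)
true_rate = 2e7
times = np.cumsum(rng.exponential(1 / true_rate, size=5000))
rate = poisson_rate(times)      # close to 2e7
```

Comparing the empirical interval histogram against the fitted exponential would show whether the "irregular intervals" are actually Poisson-like or have a preferred spacing.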
I am pleased that the sgolayfilt function seems to be working for you. I use it frequently with signals having broadband noise. It essentially works as a FIR comb filter (visible by calculating the transfer function by dividing the Fourier transform of the filtered output by the Fourier transform of the input), however it’s generally easier to implement and is not sensitive to the signal length.
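[Editor's note: the transfer-function calculation described above, dividing the output spectrum by the input spectrum, can be sketched in Python/SciPy for illustration (not the responder's MATLAB; the probe signal and parameters are arbitrary choices):]

```python
import numpy as np
from scipy.signal import savgol_filter

# Empirical transfer function of a Savitzky-Golay filter: divide the
# spectrum of the filtered output by the spectrum of a broadband input.
rng = np.random.default_rng(0)
x = 5.0 + rng.standard_normal(8192)   # broadband probe with a DC offset
y = savgol_filter(x, 31, 3)           # framelen 31, 3rd-order polynomial
H = np.fft.rfft(y) / np.fft.rfft(x)
f = np.fft.rfftfreq(x.size)           # normalised frequency, 0..0.5
# |H| is ~1 near DC (polynomials up to the filter order pass unchanged),
# with the comb-like ripple appearing at higher frequencies.
```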
I’m currently reviewing independent components analysis in order to write a function to perform it. (At least one already exists in the File Exchange, however I want to understand how it works, and coding it myself is the best way to do that.) That would separate the various components of your signal in order to isolate the noise.
I’m not certain what you’re doing, or whether the noise is part of the process or due to the instrumentation. If it’s instrumentation noise, using shielded (coaxial) leads, effective grounding, and a reference lead may be ways to control it.
Vitek Stepien on 19 Sep 2024
Edited: Vitek Stepien on 19 Sep 2024
Star Strider, it was a pleasure working with you. I did some more tweaking with findpeaks, sorting the pulses and trying to pick only the "right ones", but I don't seem to be making good progress. This is for my work, but that particular piece of analysis was more interesting to me personally than important in the professional sense, so I'm putting it on hold and moving on for now. Thanks for sharing your knowledge, I definitely learned something new! If I ever pick it up and make significant progress, I'll update this thread.
P.S. This data is all generated with a numerical model, so there is no instrumentation impact at all. All of the broadband noise comes from the dynamics of the modeled process itself.
Thank you!
As always, my pleasure!
I intend to continue working on my independent components analysis function, since I believe it can be helpful here, and in other applications. When I get it up and running successfully, I will update that analysis here as well.
I wish you well in your research.
If I can be of any further help, please let me know.

Sign in to comment.

More Answers (1)

Vitek Stepien on 18 Sep 2024
Let me explain my work a bit, but first comments on the previous answers:
  • I often have very noisy signals to analyze, and I use a bunch of different filtering techniques, but I wasn't familiar with the S-G filter. It works really well, I'm very happy with the quality-preserving smoothing I get, so thank you for introducing me to it!
  • I was vaguely familiar with EMD but haven't used it in a long time. I don't think it helps me a lot here.
  • I use wavelet transform for other aspects of my work, but I'm not sure if it would here. I believe it produces qualitatively similar results to the EMD, but in a more continuous format.
Now to what I'm doing: I'm analyzing the output of a nonlinear process called stimulated Brillouin scattering, an optical phenomenon that manifests itself in the generation of an optical wave that, in the context of my work, propagates through the system in the direction opposite to the original signal. This generated wave is seeded by noise and amplified in a nonlinear fashion, producing a not-quite-random signal that is full of noise, with more or less sporadic pulses in power. This power vs. time is the data in the four test files. In the figure below, I marked three such pulses with arrows.
The problem is that these pulsations are not quite periodic, and the signal is not really zero at other times - it fluctuates all the time. The distance between the first two peaks is about 50 nanoseconds, and between the next two is about 65 nanoseconds, so they are pretty close. Visually inspecting a whole bunch of datasets I see that there is a regularity to the pulsations, and so I wanted to estimate the regularity. See below a sample histogram for one of the datasets:
As you can see, I can roughly calculate the mean time between consecutive pulsations, and invert that to get an estimate of the mean pulsation frequency. But I'm not happy with how the pulsations are detected, because depending on the algorithm I use, some of them are overlooked, others are counted multiple times because they have a more complex structure, etc. See an example of my old algorithm's output below:
Yellow dots mark the pulsations detected by the algorithm. Now "eyeballing", I would say there are only two large pulses here, marked with the red arrows. Obviously, many more were detected, including the really small one on the right side, which skews the analysis. I'm trying to improve the algorithm so that it doesn't count the high-frequency spikes, or the tiny peaks on the sides.

Products

Release

R2022b
