How do I obtain 8760 mean hourly data points from 10.6 years data?

I have data from 01-Jan-2013 to 23- Aug-2023. I have attached the XWinddata.xlsx file. The data comprise hourly measured wind speed, direction, temperature and pressure. I want to find hourly average values for the period so that I have 8760 hourly data corresponding to hours in a year. I have cleaned the data to remove NANs. I failed to obtain the required values. I have followed two similar questions that I came accross on this site but was getting errors. Please may you assist.
Following the codes given I created my code
XWinddata.DoY = day(XWinddata.Timestamp,'dayofyear'); % I follwed the answer in the above link, it worked
XWinddata.HoD = hour(XWinddata.Timestamp); % This also worked
head(XWinddata) % Also worked and the DoY and HoD columns were added to my timetable
t = varfun(@mean,XWinddata,'GroupingVariables',{'DoY' 'HoD'},'OutputFormat','table'); % I got an error and got stuck
Error obtained while running the code
Error in tabular/varfun>@(s,varargin)dfltErrHandler(grouped,funName,s,varargin{:}) (line 201)
errHandler = @(s,varargin) dfltErrHandler(grouped,funName,s,varargin{:});
Error in tabular/varfun (line 280)
outVals{igrp} = errHandler(s,inArg);
I also obtained errors.
Thank you in anticipation.

 Risposta accettata

tt=readtimetable('XWinddata.xlsx');
head(tt,1)
Timestamp Velocityms DirectionDegrees TemperatureDegrees PressurePa yearhour DoY ____________________ __________ ________________ __________________ __________ ____________ ___ 11-Jan-2013 04:00:00 5.1 180 16.9 0.0093189 {'01/11 04'} 11
tt=removevars(tt,{'yearhour','DoY'}); % clean so start over
tt=addvars(tt,day(tt.Timestamp,'dayofyear'),hour(tt.Timestamp),'NewVariableNames',{'DoY','HoD'},'After',{'PressurePa'});
ttHr=varfun(@mean,tt,'GroupingVariables',{'DoY','HoD'},'OutputFormat','table');
height(ttHr)
ans = 4486
[head(ttHr);tail(ttHr)]
ans = 16×7 table
DoY HoD GroupCount mean_Velocityms mean_DirectionDegrees mean_TemperatureDegrees mean_PressurePa ___ ___ __________ _______________ _____________________ _______________________ _______________ 1 0 20 4.705 135.5 21.66 0.009343 1 2 20 5.255 128.5 21.25 0.0093378 1 4 20 5.68 125.5 20.96 0.0093348 1 6 20 5.705 117.5 20.64 0.0093323 1 8 20 5.64 116.5 20.265 0.0093322 1 10 20 5.67 130 20.1 0.009336 1 12 20 5.595 126.5 20.15 0.0093402 1 14 20 5.5 125 20.255 0.0093432 366 8 4 4.325 127.5 21.4 0.00933 366 10 4 3.725 110 21.45 0.0093339 366 12 4 3.3 107.5 20.875 0.0093391 366 14 4 3.125 147.5 21.05 0.0093391 366 16 4 2.725 92.5 21.525 0.0093403 366 18 4 3.5 115 21.7 0.0093416 366 20 4 3.475 100 22 0.0093374 366 22 4 4.075 80 22.125 0.0093302
But, you are missing most of the odd hours...
[numel(unique(tt.DoY)) numel(unique(tt.HoD)) numel(unique([tt.DoY tt.HoD],'rows'))/2 366*24]
ans = 1×4
366 24 4486 8784
As the above confirms, there are only 4,486 unique DoY, HoD combinations, not the 8784 that exist in a 366-day year. That's because you have data for most days only on two-hour increments.
This also leads back to the problem that the poster in one of the earlier links had in that there are leap years and they will show up with years that do have 366 days in them and the DOY algorithm will always count them in its conversion, even if you were to remove Feb 29 data from the timetable before computing DoY.
You didn't show us the top level user code error, only the internal error trap so we can't tell for absolute certain what went wrong; I'd guess the problem was you didn't remove the 'yearhour' variable (or use 'InputVariables' named argument pair) to not try to average it as well -- being nonnumeric, you can't take the mean of it.
NOTA BENE: However, it appears as you have only a very limited number of observations for the odd hours; whether to try to interpolate to fill those first or not I'll leave to you to decide...
Says there almost two entries for each time stamp, however, and interpolation only works for unique x values so will have to average first and then fill in..
tt=removevars(tt,{'DoY','HoD'}); % clean to start over
tt=rmmissing(retime(tt,'hourly','mean')); % average remove missing hours
tt=retime(tt,'hourly','linear'); % now fill in odd hours
tt=addvars(tt,day(tt.Timestamp,'dayofyear'),hour(tt.Timestamp),'NewVariableNames',{'DoY','HoD'},'After',{'PressurePa'});
[numel(unique(tt.DoY)) numel(unique(tt.HoD))]
ans = 1×2
366 24
isleapyr=@(t)(eomday(year(t),2)==29); % leap year logical function
isleapday=@(t)((month(t)==2)&(day(t)==29)); % leap day logical function
tt(isleapday(tt.Properties.RowTimes),:)=[]; % now remove Feb29 globally
[numel(unique(tt.DoY)) numel(unique(tt.HoD))] % still 1:366 in DoY
ans = 1×2
366 24
ix=isleapyr(tt.Properties.RowTimes)&(month(tt.Properties.RowTimes)>2); % dates needing fixup
tt.DoY(ix)=tt.DoY(ix)-1; % and adjust to 365 days
[numel(unique(tt.DoY)) numel(unique(tt.HoD))] % now we're back to 365 day year w/ DoY to match
ans = 1×2
365 24
ttHr=varfun(@mean,tt,'GroupingVariables',{'DoY','HoD'},'OutputFormat','table');
height(ttHr)
ans = 8760
[head(ttHr);tail(ttHr)]
ans = 16×7 table
DoY HoD GroupCount mean_Velocityms mean_DirectionDegrees mean_TemperatureDegrees mean_PressurePa ___ ___ __________ _______________ _____________________ _______________________ _______________ 1 0 10 4.705 135.5 21.66 0.009343 1 1 10 4.98 132 21.455 0.0093404 1 2 10 5.255 128.5 21.25 0.0093378 1 3 10 5.4675 127 21.105 0.0093363 1 4 10 5.68 125.5 20.96 0.0093348 1 5 10 5.6925 121.5 20.8 0.0093336 1 6 10 5.705 117.5 20.64 0.0093323 1 7 10 5.6725 117 20.452 0.0093323 365 16 10 3.615 114 20.23 0.0093648 365 17 10 3.7075 119 20.355 0.0093652 365 18 10 3.8 124 20.48 0.0093655 365 19 10 3.8225 124.75 20.63 0.0093645 365 20 10 3.845 125.5 20.78 0.0093636 365 21 10 4 126.5 20.935 0.0093613 365 22 10 4.155 127.5 21.09 0.0093589 365 23 10 4.43 131.5 21.375 0.009351
The above does it by just throwing away Feb 29, you could alternatively perhaps collapse Feb 28-29 to Feb 28 or Feb 29/Mar 1 to Mar 1 to somehow keep the values observed in the mix, but the above is the simplest approach, certainly.

8 Commenti

Thank you very much for a swift response. I really appreciate. I am very new to MATLAB. What do you mean by ' the top level user code error' so that I am able to answer that part? It is important for me to answer that part because it can lead to a code that brings the results that I am expecting. I want something like:
Day 1 hour 1 (value)
Day 1 hour 2 (value)
Day 1 hour 3 (value)
:
:
Day 365 hour 22 (value)
Day 365 hour 23 (value)
Day 365 hour 24 (value)
Such that at the end I have 8760 hours with the corresponding values.
dpb
dpb il 2 Ott 2023
Modificato: dpb il 2 Ott 2023
OK, I misread intent apparently.
See above amended answer regarding the number of values you get; you don't have the data for most of the odd hours, that would take interpolating.
As for 365 vs 366 days in a year; it would take adjusting the DoY values to remove 366 one way or another; the 366th day is Dec 31 if you just delete those elements; March 1 is day 60 in non leapyears but 61 in leap years so to keep calendar dates the same would mean moving the DoY down one for leap years.
Alternatively, if one were to eliminate Feb 29 initially, one could manually create the DoY excluding leap years instead of using the builtin day() function to do so and then get the 8760 number.
For the user error, there should have been more red text of the error when it failed before from the actual code line that failed--that it worked as I did it by eliminating the string day/hour field before computing the mean is pretty conclusive that your not having done that was the problem.
If have missing data, you'll either have to interpolate or eliminate it -- you have options on import to either remove rows that you chose or insert missing values by variable/row; then you have options for dealing with those missing values.
Since almost all odd hours were missing in the original, I went ahead and interpolated to fill in -- but it also showed that there were about two entries on average for each timestamp so had to average those down to one value first before could interpolate.
The Q? might be why there are duplicate time data and no odd hours; is that a fignewton from the collection process, maybe, that the odd hours got rounded to the even so there are two values at the same time in the data file but they're supposed to have been the odd-even hours instead? I'll leave that conundrum to you to figure out; I presume you know how the data were collected to be able to sort that far easier than I can.
But, if that is real but there are hourly data but just not complete to fill in the odd hours, then there's always the option to use the 'nanflag' in the mean function that just ignores NaN values; one would then get a number as long as there were any values for each date/time for each variable. To do tht would require using an anonymous function in place of the text or simple function handle to pass the auxiliary argument -- something akin to
ttHr=varfun(@(x)mean(x,'omitnan'),tt,'GroupingVariables',{'DoY','HoD'},'OutputFormat','table');
I am failing to figure out what happend to the original data when I imported it. It has both odd and even hours for the period. Here is the original data.
readcell('XWinddata.xlsx')
ans = 93242×6 cell array
{'Date' } {'Time' } {'Velocity(m/s)'} {'Direction(Degrees)'} {'Temperature(Degrees)'} {'Pressure(Pa)'} {[01-Jan-2013 00:00:00.000]} {[01-Jan-2018 00:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 01:00:00.000]} {[01-Jan-2018 01:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 02:00:00.000]} {[01-Jan-2018 02:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 03:00:00.000]} {[01-Jan-2018 03:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 04:00:00.000]} {[01-Jan-2018 04:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 05:00:00.000]} {[01-Jan-2018 05:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 06:00:00.000]} {[01-Jan-2018 06:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 07:00:00.000]} {[01-Jan-2018 07:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 08:00:00.000]} {[01-Jan-2018 08:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 09:00:00.000]} {[01-Jan-2018 09:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 10:00:00.000]} {[01-Jan-2018 10:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 11:00:00.000]} {[01-Jan-2018 11:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 12:00:00.000]} {[01-Jan-2018 12:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 13:00:00.000]} {[01-Jan-2018 13:00:00.000]} {'-' } {'-' } {'-' } {'-' } {[01-Jan-2013 14:00:00.000]} {[01-Jan-2018 14:00:00.000]} {'-' } {'-' } {'-' } {'-' }
Well, it seems to have a Date and a Time column header with inconsistent date data in the beginning....I did download and open the file in Excel and indeed, the Date column does include actual time in the cell, not just the date and the Time column is a full date-time string just formatted to look like time only.
I'd suggest if the spreadsheet was created from some other form of the initial input it that process be fixed to not mix apples and oranges although Excel and dates can be an issue in convincing it to contain what you desire, specifically, it tends to have "mind of its own".
Or, when reading, just ignore the Time column entirely.
If do revise, it would also make more sense in the numeric columns with mising data to just leave the cells empty instead of inserting the character string--that's mixing data types that don't go together.
But, that's NOT the same file as the original--I also downloaded and opened it -- it is missing the odd hour entries -- would have to see how it was created to figure out why; perhaps an unintended extra colon ":2" got into an addressing vector and saved only alternate rows, impossible to tell how it got to be what it is without the session operations. But, saving the two with the same file name makes it much easier to be able to confuse "who's who in the zoo!".
Thank you very for the assistence. Your suggestion worked after correcting the DateTime.
Moral: Have to be VERY careful with Excel and dates...<for many reasons>. See @Walter Roberson's comment to the Q? there for another nasty trait.
The site is very useful. Thanks once again

Accedi per commentare.

Più risposte (0)

Categorie

Prodotti

Release

R2023a

Richiesto:

il 2 Ott 2023

Commentato:

il 5 Ott 2023

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by