How do you get multiple minimum values in a column based on grouping and then isolate the corresponding rows for the whole table?

5 visualizzazioni (ultimi 30 giorni)
I've got a list of ages in a table, sorted, and grouped by 'names'. Something like this:
perName=findgroups(data.names);
b = splitapply(@(x1){sort(x1)}, data.age, perName);
How do I select the youngest 10 ages (smallest) for every name group (or largest for that matter)? Do I need to build a custom function then reference that function in the splitapply command?
Normally, I can sort the column and just trim out (or isolate) the top several using something like this (for example):
data(1:15,:) = [];
However, I've spent a long time without any luck to trim/isolate based on grouping. It was easy to get the min value using @min funciton, but I'm not sure how to get 'multiple' min or max values for each group
Also, once I've managed to find the top ten, how do I also isolate the corresponding rows in the table for different columns?
Any help is much appreciated!
(edit: I've attached the matlab table for reference. Note this is just made up 'dummy' data for practice. Once I figure it out I have to apply to a much larger dataset.)

Risposta accettata

dpb
dpb il 28 Lug 2020
Modificato: dpb il 29 Lug 2020
[g,ig]=findgroups(data.names);
b=splitapply(@(x) {maxk(x,10)},data.age,g);
b contains the k oldest elements for each group...you'll have to use an m-file function to return the optional second index from maxk to locate the position of the N individuals in the group.
>> [cellstr(ig) b]
ans =
7×2 cell array
{'Ben' } {10×1 double}
{'Billy'} { 4×1 double}
{'Happy'} { 4×1 double}
{'Jenny'} {10×1 double}
{'Joe' } { 8×1 double}
{'Kate' } {10×1 double}
{'Smith'} {10×1 double}
>>
ADDENDUM: The indexing would look something like--
function [mx,ix]=getmaxk(x,n)
% wrapper for maxk cuz can't return second output w/ anonymous function
[mx,ix]=maxk(x,n);
mx={mx};
ix={ix};
end
fnmaxk=@(x) getmaxk(x,10);
[b,ib]=splitapply(fnmaxk,data.age,g);
ADDENDUM SECOND:
OK, it's not so bad after all -- have to do it explicitly right now; seems like a good candidate for an enhancement to splitapply() to have available as an auxiliary variable.
Start with
function sq=getmaxkgrp(x,v)
% returns the group variable sequence associated with the n maxk values of x
n=10;
[mx,ix]=maxk(x,n);
sq=v(ix);
end
which has the 10 value hardcoded; you can use the indirect method illustrated above to be able to change the number for which you're looking.
Then, as suggested, augment the table with
data.seq=[1:height(data)].'; % add the sequence number in table of each row
ix=splitapply(@(x1,x2) {getmaxkgrp(x1,x2)},data.age,data.seq,g); % and return that seq number for the top n
tmaxk=(cellfun(@(ix) data(ix,1:end-1),ix,'uni',0)); % retrieve those records from full table
tmaxk=vertcat(tmaxk{:}); % and combine into a table
Above results in--
>> tmaxk =
56×3 table
names age pet
_____ ___ ____
Ben 31 fish
Ben 31 bird
Ben 31 bird
Ben 31 bird
Ben 31 fish
Ben 21 dog
Ben 21 fish
Ben 21 dog
Ben 21 fish
Ben 17 dog
Billy 30 fish
Billy 30 bird
Billy 30 fish
Billy 30 bird
Happy 31 bird
Happy 31 cat
Happy 31 bird
Happy 31 cat
Jenny 30 cat
Jenny 30 dog
Jenny 30 dog
Jenny 30 fish
Jenny 30 dog
Jenny 30 fish
Jenny 30 dog
Jenny 30 fish
Jenny 30 cat
Jenny 30 dog
Joe 79 dog
Joe 79 dog
Joe 21 dog
Joe 20 cat
Joe 19 cat
Joe 18 bird
Joe 15 cat
Joe 15 cat
Kate 60 fish
Kate 60 fish
Kate 60 fish
Kate 51 fish
Kate 51 fish
Kate 51 fish
Kate 51 fish
Kate 50 bird
Kate 50 dog
Kate 50 bird
Smith 38 fish
Smith 38 bird
Smith 38 fish
Smith 38 bird
Smith 38 bird
Smith 38 fish
Smith 30 bird
Smith 29 bird
Smith 29 cat
Smith 29 cat
>>
I didn't test it extensively but I believe it will have returned the correct values. Using a smaller value for N would make that part much easier; left as "Exercise for Student!" :)
  4 Commenti
dpb
dpb il 29 Lug 2020
I didn't spearmint (for doublemint, either, for that matter! :) ) to see; it's not too much of a hassle to build the extra variable and it has the advantage of being a numeric variable and so usable as index as is. One MIGHT be able to make use of the internal 'rownames' property to avoid having to create another table variable.
But, it may not be passable as a parameter, either, so maybe not...above seems the most direct route available at the moment.
Joshua Murray
Joshua Murray il 30 Lug 2020
Thanks! I've made some minor edits to suit the actual data and its working quiet well to find the min/max values. I was resorting to doing this manually (over 1000times) as I hadn't figure anything out yet after thinking about it for half a week... so this has saved me allot of time- don't have much experience with Matlab & similar handling large data. Cheers.

Accedi per commentare.

Più risposte (0)

Categorie

Scopri di più su Tables in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by