How to partition data in a very specific way

Question

Alejandro De Felipe on 21 Jul 2017

Direct link to this question

https://it.mathworks.com/matlabcentral/answers/349755-how-to-partition-data-in-a-very-specific-way

Answered: Greg Heath on 27 Jul 2017

I am designing a neural network to classify subjects into two classes, and I am having some trouble preparing the data.

I have been looking for a while for a proper way to partition the data that will be fed to the neural network, but I have not found one.

Concretely, I need to divide my data so that 30% of the observations form the test set. The remaining observations are then split 50/50 into a training set and a validation set (I know these are not the traditional ratios, but I have been asked to do it this way). Because the training and validation sets will change from run to run, I will use cross-validation for this second partition to ensure data independence.

Initially I used the crossvalind function, but I noticed that, the way I call it, it does not take the class proportions into account. Later I tried cvpartition, since one of its syntaxes allows stratified k-fold cross-validation, but I don't know how to form groups with a specific ratio.

This is how I currently divide the data into the test and "other" sets (the latter is then divided into training and validation sets):

INDICES = crossvalind('Kfold',size(data,1),10); % Dividing data into 10 groups...
testInd = (INDICES == 1 | INDICES == 2 | INDICES == 3); otherInd = ~testInd; % and grouping 3 of them in test (30%)
testSet = data(testInd,:); otherSet = data(otherInd,:);
testTarg = targets(testInd); otherTarg = targets(otherInd);

And this is how I form the training and validation sets from the remaining data:

CVO = cvpartition(otherTarg,'KFold',K); % stratified K-fold partition of the class labels
for k=1:K % loop variable must match the index used below
   trainIdx = CVO.training(k); valIdx = CVO.test(k); % logical masks for fold k
   trainPos = find(trainIdx); valPos = find(valIdx);
   trainSet = otherSet(trainPos,:); valSet = otherSet(valPos,:);
   trainTarg = otherTarg(trainPos); valTarg = otherTarg(valPos);
end

As far as I can tell, the test and "other" sets do not preserve the class proportions, and the training and validation sets do not have the required size (half of the "other" set each). At this point, I wonder whether there is a function that does what I want, or whether I can do it with the functions I already know but am not using correctly.
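
For reference, both requirements can be sketched with cvpartition alone. This is a minimal sketch rather than a definitive answer: the `data` and `targets` arrays below are synthetic placeholders, and it assumes the Statistics and Machine Learning Toolbox is available. Passing the label vector (instead of only the number of observations) asks cvpartition to stratify by class, and choosing K = 2 makes each fold a 50/50 training/validation split.

```matlab
% Sketch: stratified 30% hold-out test set, then a stratified 50/50
% training/validation split of the remaining 70% via 2-fold cross-validation.
rng(0);                                   % reproducible example
data    = rand(200, 5);                   % hypothetical feature matrix
targets = [zeros(140, 1); ones(60, 1)];   % hypothetical labels, 70/30 class mix

% Passing the labels (not just N) makes cvpartition stratify by class.
cvTest    = cvpartition(targets, 'HoldOut', 0.30);
testSet   = data(test(cvTest), :);
testTarg  = targets(test(cvTest));
otherSet  = data(training(cvTest), :);
otherTarg = targets(training(cvTest));

% With K = 2, each fold trains on one half of otherSet and validates on the other.
K   = 2;
cvo = cvpartition(otherTarg, 'KFold', K);
for k = 1:K
    trainSet  = otherSet(training(cvo, k), :);
    valSet    = otherSet(test(cvo, k), :);
    trainTarg = otherTarg(training(cvo, k));
    valTarg   = otherTarg(test(cvo, k));
    fprintf('fold %d: %d train, %d validation, %.1f%% class 1 in validation\n', ...
        k, numel(trainTarg), numel(valTarg), 100 * mean(valTarg == 1));
end
```

The fold sizes and class percentages printed in the loop give a quick check that both the ratio and the stratification are what was requested.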

Thank you in advance for your attention.

Answer 1

Alejandro De Felipe on 23 Jul 2017

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/349755-how-to-partition-data-in-a-very-specific-way#answer_275234

I have just found a solution to my problem.

To divide my data into testSet and otherSet, I am using code that I found here and modified slightly:

function [ X, y, partition ] = generar_sets( X, y, k )
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% To ensure that the training, testing, and validating dataset have similar
% proportions of classes (e.g., 20 classes). This stratified sampling
% technique provided the analyst with more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
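% Input:
% X - dataset (one observation per row)
% y - class label of each observation
% k - number of folds
%
% Output:
% X - shuffled dataset
% y - class labels, shuffled in the same order as X
% partition - fold index of each observation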
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);
% shuffle the dataset
[~, idx] = sort(rand(1, n));
X = X(idx, :);
y = y(idx);
% find the unique class
group = unique(y);
nGroup = numel(group);
% find the size of the smallest class (capped at 100)
nmin = 100;
for i=1:nGroup
    idx = find(y == group(i));
    ni = length(idx);
    nmin = min(nmin, ni);
end
% create fold indices
foldIndices = zeros(nGroup, nmin);
for i=1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end
% compute the size of each group (row 1: testSet, row 2: otherSet) per class
foldSize = zeros(nGroup, 1);
for i=1:nGroup
    % find the number of elements of the class
    numElement = numel(find(foldIndices(i,:) ~= 0));
    % calculate the number of elements for each group (25% test, 75% other)
    foldSize(1,i) = floor(numElement*0.25);     % foldSize:   |-------| class 1 | class 2|
    %                                                         testSet |         |        |
    %                                                         |-------|---------|--------|
    %                                                         otherSet|         |        |
    %                                                         |--------------------------|
    foldSize(2,i) = floor(numElement*0.75);
end
ptr = ones(nGroup, 1);
for i=1:k % choose which group to form (test or other)
    for j=1:nGroup % choose which class to start with
        if ptr(j)+foldSize(i,j)>size(foldIndices,2)
            idx =  foldIndices(j, (ptr(j): (ptr(j)+foldSize(i,j)-1) ));
        else
            idx =  foldIndices(j, (ptr(j): (ptr(j)+foldSize(i,j)) ));
        end
        if(idx(end) == 0)
            idx = idx(1:end-1);
        end
        partition(idx) = i;
        ptr(j) = ptr(j)+foldSize(i,j);
    end
end
% dump the rest of index to the last fold
idx = find(partition == 0);
partition(idx) = k;
data = [X partition];
for i=1:k
    idx = find(data(:, end) == i);
    fold = y(idx);
    disp(['fold# ', int2str(i), ' has ', int2str( numel(fold) ) ]);
    for j=1:nGroup
        idx = find(fold == group(j));
        percentage = (numel(idx)/numel(fold)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end
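
The per-class idea behind the function above can also be sketched more compactly in base MATLAB. This is only an illustrative sketch, not the original author's code: the labels `y` are synthetic, the variable names are hypothetical, and the test fraction is set to the 30% asked for in the question rather than the 25% used above.

```matlab
% Sketch: stratified test/other split by shuffling the members of each class.
rng(0);                                   % reproducible example
y = [ones(140, 1); 2 * ones(60, 1)];      % hypothetical class labels
testFraction = 0.30;                      % share of each class sent to testSet

partition = 2 * ones(numel(y), 1);        % 1 = testSet, 2 = otherSet (default)
classes = unique(y);
for c = 1:numel(classes)
    members = find(y == classes(c));                 % rows of this class
    members = members(randperm(numel(members)));     % shuffle within the class
    nTest   = round(testFraction * numel(members));  % test quota for this class
    partition(members(1:nTest)) = 1;
end

% Report the class mix of each group.
for g = 1:2
    inGroup = (partition == g);
    fprintf('group %d: %d observations, %.1f%% class 2\n', ...
        g, sum(inGroup), 100 * mean(y(inGroup) == 2));
end
```

Because the quota is computed per class, both groups keep the original class mix up to rounding.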

To divide otherSet into training and validation sets and apply k-fold cross-validation, I will use the cvpartition function.

I am fairly sure this will work as I expect, but if not, I am still interested in your answers.
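
As a rough sketch of that second stage: with cvpartition the training/validation ratio is fixed by the number of folds, so K = 2 gives the 50/50 split, and repartition can be used to draw fresh random splits when more than two estimates are wanted. The labels below are synthetic and the number of repeats is an arbitrary assumption.

```matlab
% Sketch: repeated stratified 2-fold cross-validation on the "other" set.
rng(1);                                      % reproducible example
otherTarg = [zeros(70, 1); ones(30, 1)];     % hypothetical labels of otherSet
nRepeats  = 5;                               % arbitrary number of repetitions

valSizes = zeros(nRepeats, 2);               % validation-set size per repeat and fold
cvo = cvpartition(otherTarg, 'KFold', 2);    % stratified, 50/50 per fold
for r = 1:nRepeats
    for k = 1:cvo.NumTestSets
        trainIdx = training(cvo, k);         % logical mask of training rows
        valIdx   = test(cvo, k);             % logical mask of validation rows
        valSizes(r, k) = sum(valIdx);
        fprintf('repeat %d, fold %d: %d train, %d validation\n', ...
            r, k, sum(trainIdx), sum(valIdx));
        % train and evaluate the network on these rows here
    end
    cvo = repartition(cvo);                  % new random split of the same type
end
```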

Thank you

Answer 2

Greg Heath on 24 Jul 2017

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/349755-how-to-partition-data-in-a-very-specific-way#answer_275326

The NN Toolbox can be used to obtain many sufficiently independent estimations of error by replacing stratification with double randomization.

1. Random data division
2. Random weight initialization

This is much easier to use than n-fold cross-validation.
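
A minimal sketch of what double randomization might look like, assuming the Neural Network Toolbox functions patternnet and dividerand. The data, the hidden layer size, the number of trials, and the 0.35/0.35/0.30 ratios (chosen to mirror the 30% test and 50/50 training/validation split from the question) are all assumptions made for illustration.

```matlab
% Sketch: repeated trials with random data division and random initial weights.
rng(0);                                      % reproducible example
x      = rand(5, 200);                       % hypothetical inputs, one column per subject
labels = [ones(1, 140), 2 * ones(1, 60)];    % hypothetical class indices
t      = full(ind2vec(labels));              % one-hot targets, 2-by-200

nTrials = 10;                                % arbitrary number of trials
testErr = zeros(nTrials, 1);
for trial = 1:nTrials
    net = patternnet(10);                    % fresh network, so weights are re-initialized
    net.divideFcn = 'dividerand';            % random division on every call to train
    net.divideParam.trainRatio = 0.35;
    net.divideParam.valRatio   = 0.35;
    net.divideParam.testRatio  = 0.30;
    net.trainParam.showWindow  = false;      % suppress the training GUI
    [net, tr] = train(net, x, t);
    y = net(x(:, tr.testInd));               % outputs on this trial's test subset
    [~, predicted] = max(y, [], 1);
    testErr(trial) = mean(predicted ~= labels(tr.testInd));
end
fprintf('test error over %d trials: mean %.3f, std %.3f\n', ...
    nTrials, mean(testErr), std(testErr));
```

Note that dividerand does not stratify by class, so the class mix of each subset varies from trial to trial; the spread of testErr reflects that variation as well as the effect of the initial weights.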

Greg

Answer 3

Greg Heath on 27 Jul 2017

Direct link to this answer

https://it.mathworks.com/matlabcentral/answers/349755-how-to-partition-data-in-a-very-specific-way#answer_275730

There are a variety of ways to divide your data. See the help and doc descriptions of

divideblock, divideind and divideint
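
Of these, divideind is the one that takes explicit index lists, so it could be combined with a stratified partition computed beforehand. The following is a sketch under assumptions: synthetic data, an arbitrary hidden layer size, and cvpartition (Statistics and Machine Learning Toolbox) used to produce the stratified indices.

```matlab
% Sketch: feed stratified train/validation/test indices to a network via divideind.
rng(0);                                        % reproducible example
labels = [ones(140, 1); 2 * ones(60, 1)];      % hypothetical class indices
x      = rand(5, 200);                         % hypothetical inputs, one column per subject
t      = full(ind2vec(labels'));               % one-hot targets, 2-by-200

% Stratified 30% test set, then a stratified 50/50 split of the rest.
cvTest   = cvpartition(labels, 'HoldOut', 0.30);
testInd  = find(test(cvTest))';
otherInd = find(training(cvTest))';
cvo      = cvpartition(labels(otherInd), 'KFold', 2);
trainInd = otherInd(training(cvo, 1));
valInd   = otherInd(test(cvo, 1));

net = patternnet(10);                          % hidden layer size is an arbitrary choice
net.divideFcn = 'divideind';                   % use the index lists given below
net.divideParam.trainInd = trainInd;
net.divideParam.valInd   = valInd;
net.divideParam.testInd  = testInd;
net.trainParam.showWindow = false;             % suppress the training GUI
[net, tr] = train(net, x, t);

fprintf('%d train, %d validation, %d test\n', ...
    numel(tr.trainInd), numel(tr.valInd), numel(tr.testInd));
```

Swapping the roles of the two folds (training on fold 2 and validating on fold 1) gives the second cross-validation run.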

Hope this helps.

Thank you for formally accepting my answer

Greg
