How to partition data in a very specific way

Alejandro De Felipe on 21 Jul 2017
Answered: Greg Heath on 27 Jul 2017
I am designing a neural network to classify subjects into two classes and I am having some trouble preparing the data.
I have been looking for a while for the proper way to partition the data that will be fed to the network, but I haven't found it.
Concretely, I need to divide my data so that 30% of the observations form the test set. The remaining observations will be split 50/50 into training and validation sets (I know these are not the traditional ratios, but I've been asked to do it this way). Because the training and validation sets will change, I will use cross-validation for this second partition to ensure data independence.
Initially I used the crossvalind function, but I noticed that it doesn't take the class proportions into account. Later I tried cvpartition, since one of its syntaxes lets me apply stratified k-fold CV, but I don't know how to form groups with a specific ratio.
This is how I currently divide the data into test and "other" sets (the latter will then be divided into training and validation sets):
INDICES = crossvalind('Kfold',size(data,1),10); % Dividing data into 10 groups...
testInd = (INDICES == 1 | INDICES == 2 | INDICES == 3); otherInd = ~testInd; % and grouping 3 of them in test (30%)
testSet = data(testInd,:); otherSet = data(otherInd,:);
testTarg = targets(testInd); otherTarg = targets(otherInd);
And this is how I form the training and validation sets from the remaining data:
CVO = cvpartition(otherTarg,'KFold',K);
for i = 1:K
    % logical masks for fold i (the loop variable is i, not k)
    trainIdx = training(CVO,i); valIdx = test(CVO,i);
    trainPos = find(trainIdx); valPos = find(valIdx);
    trainSet = otherSet(trainPos,:); valSet = otherSet(valPos,:);
    trainTarg = otherTarg(trainPos); valTarg = otherTarg(valPos);
end
As far as I know, the test and other sets don't have proportional classes, and the training and validation sets do not have the required amount of data (half of the other set each). At this point, I wonder whether there is a function that does what I want, or whether I can do it with the functions I already know and am just not using them correctly.
Thank you in advance for your attention.
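One possible way to get both splits with cvpartition alone is sketched below. It assumes the Statistics and Machine Learning Toolbox is available, and the toy data and targets arrays are made up for illustration. A 'HoldOut' partition built from the class labels is stratified, and a 2-fold partition of the remainder gives two halves that swap the training and validation roles.

```matlab
% Assumed toy data: 100 observations, 4 features, two classes (60/40).
rng(0);                                   % reproducible shuffling
data    = rand(100, 4);
targets = [ones(60, 1); 2*ones(40, 1)];

% Stratified 30% hold-out: passing the labels keeps class proportions
% approximately equal in the test set and in the remaining data.
cvTest   = cvpartition(targets, 'HoldOut', 0.30);
testInd  = test(cvTest);
otherInd = training(cvTest);

testSet  = data(testInd, :);   testTarg  = targets(testInd);
otherSet = data(otherInd, :);  otherTarg = targets(otherInd);

% Stratified 2-fold CV on the rest: each fold holds about half of
% otherSet, so training and validation sets are about the same size.
K   = 2;
CVO = cvpartition(otherTarg, 'KFold', K);
for i = 1:K
    trainIdx = training(CVO, i);
    valIdx   = test(CVO, i);
    trainSet = otherSet(trainIdx, :);  trainTarg = otherTarg(trainIdx);
    valSet   = otherSet(valIdx, :);    valTarg   = otherTarg(valIdx);
    fprintf('fold %d: %d training, %d validation\n', ...
            i, sum(trainIdx), sum(valIdx));
end
```

With K = 2 each observation of otherSet is used once for training and once for validation. If more than two train/validation pairs are needed, the 2-fold partition can be redrawn several times.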

Answers (3)

Alejandro De Felipe on 23 Jul 2017
I have just found the solution to my problem.
To divide my data into testSet and otherSet, I will use some code that I found here and have modified slightly:
function [X, y, partition] = generar_sets(X, y, k)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Pree Thiengburanathum
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Description:
% To ensure that the training, testing, and validation datasets have similar
% proportions of classes (e.g., 20 classes). This stratified sampling
% technique gives the analyst more control over the sampling process.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Input:
% X - dataset (one observation per row)
% y - class label of each observation
% k - number of folds (the fold sizes below are only defined for k = 2:
%     fold 1 = testSet, fold 2 = otherSet)
%
% Output:
% X - shuffled dataset
% y - class labels, shuffled in the same order as X
% partition - fold index of each row of X
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
n = size(X, 1);
partition = zeros(n, 1);

% shuffle the dataset
idx = randperm(n);
X = X(idx, :);
y = y(idx);

% find the unique classes
group = unique(y);
nGroup = numel(group);

% find the largest number of samples per class
nmax = 0;
for i = 1:nGroup
    nmax = max(nmax, sum(y == group(i)));
end

% create fold indices: row i holds the row numbers of class i, zero-padded
foldIndices = zeros(nGroup, nmax);
for i = 1:nGroup
    idx = find(y == group(i));
    foldIndices(i, 1:numel(idx)) = idx;
end

% compute the fold size for each fold and class
% foldSize:          | class 1 | class 2 |
%   row 1 (testSet)  |         |         |
%   row 2 (otherSet) |         |         |
foldSize = zeros(2, nGroup);
for i = 1:nGroup
    % number of elements of the class (non-zero entries of its row)
    numElement = nnz(foldIndices(i, :));
    foldSize(1, i) = floor(numElement*0.25); % share of the class in testSet
    foldSize(2, i) = floor(numElement*0.75); % share of the class in otherSet
end

ptr = ones(nGroup, 1);
for i = 1:k          % choose which group to form (test or other)
    for j = 1:nGroup % choose which class to take from
        last = min(ptr(j) + foldSize(i, j) - 1, size(foldIndices, 2));
        idx = foldIndices(j, ptr(j):last);
        idx = idx(idx ~= 0);  % drop zero padding
        partition(idx) = i;
        ptr(j) = ptr(j) + foldSize(i, j);
    end
end

% dump the remaining indices into the last fold
partition(partition == 0) = k;

% report the size and class proportions of each fold
for i = 1:k
    fold = y(partition == i);
    disp(['fold# ', int2str(i), ' has ', int2str(numel(fold))]);
    for j = 1:nGroup
        percentage = (sum(fold == group(j))/numel(fold)) * 100;
        disp(['class# ', int2str(j), ' = ', num2str(percentage), '%']);
    end
    disp(' ');
end
end
To divide otherSet into validation and training sets and apply k-fold CV, I will use the cvpartition function.
I am quite sure this will work as I expect but, if not, I am still interested in your answers.
Thank you

Greg Heath on 24 Jul 2017
The NN Toolbox can be used to obtain many sufficiently independent estimates of the error by replacing stratification with double randomization:
1. Random data division
2. Random weight initialization
This is much easier to use than n-fold cross-validation.
Greg
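A minimal sketch of the double randomization described above, assuming the Deep Learning Toolbox. The toy data, the 10 hidden nodes, the 10 trials and the 35/35/30 ratios (30% test, the rest split 50/50) are illustrative assumptions, not values given in this answer. Each call to train on a freshly created network draws a new random data division and new random initial weights.

```matlab
% Assumed toy data: the toolbox expects one observation per column.
rng(0);
x      = rand(4, 100);                    % 4 features, 100 observations
labels = [ones(1, 60), 2*ones(1, 40)];    % two classes
t      = full(ind2vec(labels));           % one-hot targets, 2 x 100

Ntrials = 10;                             % assumed number of repetitions
testErr = zeros(Ntrials, 1);
for trial = 1:Ntrials
    net = patternnet(10);                 % assumed: 10 hidden nodes
    net.divideFcn = 'dividerand';         % 1. random data division
    net.divideParam.trainRatio = 0.35;
    net.divideParam.valRatio   = 0.35;
    net.divideParam.testRatio  = 0.30;
    net.trainParam.showWindow  = false;   % no training GUI
    [net, tr] = train(net, x, t);         % 2. random weight initialization
    out = net(x(:, tr.testInd));          % scores on this trial's test set
    [~, predicted] = max(out, [], 1);     % predicted class index per column
    testErr(trial) = mean(predicted ~= labels(tr.testInd));
end
fprintf('test error: mean %.3f, std %.3f\n', mean(testErr), std(testErr));
```

The spread of testErr across trials gives a rough idea of how much the error estimate depends on the particular division and initialization.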

Greg Heath on 27 Jul 2017
There are a variety of ways to divide your data. See the help and doc descriptions of
divideblock, divideind and divideint
Hope this helps.
Thank you for formally accepting my answer
Greg
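As an illustration of divideind, the sketch below passes explicit training, validation and test indices to the network. It assumes both the Deep Learning Toolbox and the Statistics and Machine Learning Toolbox. Building the indices with cvpartition, the toy data and the 10 hidden nodes are assumptions made for this example.

```matlab
% Assumed toy data: one observation per column.
rng(0);
x      = rand(4, 100);
labels = [ones(1, 60), 2*ones(1, 40)];
t      = full(ind2vec(labels));           % one-hot targets, 2 x 100

% Stratified 30% test set, then a stratified 50/50 split of the rest.
cvTest   = cvpartition(labels(:), 'HoldOut', 0.30);
testInd  = find(test(cvTest));
otherInd = find(training(cvTest));
cvHalf   = cvpartition(labels(otherInd).', 'KFold', 2);
trainInd = otherInd(training(cvHalf, 1));
valInd   = otherInd(test(cvHalf, 1));

% Hand the fixed indices to the network instead of a random division.
net = patternnet(10);                     % assumed: 10 hidden nodes
net.divideFcn = 'divideind';
net.divideParam.trainInd = trainInd.';
net.divideParam.valInd   = valInd.';
net.divideParam.testInd  = testInd.';
net.trainParam.showWindow = false;        % no training GUI
[net, tr] = train(net, x, t);
```

Swapping trainInd and valInd (or using fold 2 of cvHalf) gives the second train/validation pair of the 2-fold scheme.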
