Is there a data leakage problem when code exported from Regression Learner is retrained on the original data?
I trained a model in the Regression Learner app using 5-fold cross-validation, with 20% of the data held out as a test set, and then exported the trained model. I then retrained with the same data using the exported code. The only changes are that I shuffled the order of the data and split the training and test sets myself, still holding out 20% for testing. Does a model retrained this way suffer from data leakage? My original code is below. Sorry, my English is not good, so I used translation software.
Import data
% res = xlsread('数据集.xlsx');
res = table2array(data1);
% Set the random seed for reproducibility
rng(1000);
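Split into training and test sets
n = size(data1,1);
temp = randperm(n); % shuffle the dataset by generating a random permutation of the row indices
n1 = round(0.8*n);
% fitrensemble (like TreeBagger) expects samples in rows and features in columns.
% The transpose here puts features in rows so that mapminmax can normalize each feature.
P_train = res(temp(1: n1), 1: 10)'; % predictors
T_train = res(temp(1: n1), 11)'; % response
M = size(P_train, 2); % number of training samples
P_test = res(temp(n1+1: end), 1: 10)';
T_test = res(temp(n1+1: end), 11)';
N = size(P_test, 2); % number of test samples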
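One direct check for the kind of leakage asked about is whether any row index lands in both partitions. The following is a minimal sketch, assuming a synthetic matrix `res` in place of the real data. The names `trainIdx`, `testIdx`, `overlap`, and `nDuplicateRows` are illustrative and not part of the original script.

```matlab
% Hypothetical stand-in for the real data: 50 rows, 10 predictors + 1 response.
rng(1000);
res = rand(50, 11);

n    = size(res, 1);
temp = randperm(n);          % same shuffling approach as the script above
n1   = round(0.8 * n);

trainIdx = temp(1:n1);
testIdx  = temp(n1+1:end);

% No row index may appear in both sets, and together they must cover all rows.
overlap = intersect(trainIdx, testIdx);
assert(isempty(overlap), 'Training and test rows overlap.');
assert(numel(trainIdx) + numel(testIdx) == n);

% Identical rows stored under different indices would also leak information.
nDuplicateRows = n - size(unique(res, 'rows'), 1);

fprintf('train: %d rows, test: %d rows, overlap: %d, duplicate rows: %d\n', ...
    numel(trainIdx), numel(testIdx), numel(overlap), nDuplicateRows);
```

This only verifies the split made in this script. It says nothing about settings such as `MinLeafSize` that may have been chosen earlier on a different split of the same data.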
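Normalize the data
% mapminmax performs min-max normalization. The first output is the normalized data; the second output is a settings struct used to apply the same normalization to the test data later.
[p_train, ps_input] = mapminmax(P_train, 0, 1);
p_test = mapminmax('apply', P_test, ps_input); % scale the test set with the training-set normalization parameters to avoid data leakage
[t_train, ps_output] = mapminmax(T_train, 0, 1);
t_test = mapminmax('apply', T_test, ps_output);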
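The normalization step above can be illustrated without `mapminmax`. This is a minimal sketch in plain arithmetic, assuming made-up toy matrices with one feature per row and one sample per column. It is meant to show the idea of fitting the scaling on the training set only, not to reproduce `mapminmax` internals.

```matlab
% Hypothetical toy data: one feature per row, one sample per column.
P_train = [1 2 3 4 5; 10 20 30 40 50];
P_test  = [0 6; 25 60];

% Fit the scaling on the training set only.
xmin = min(P_train, [], 2);
xmax = max(P_train, [], 2);

% Apply the same scaling to both sets (implicit expansion, R2016b or later).
p_train = (P_train - xmin) ./ (xmax - xmin);
p_test  = (P_test  - xmin) ./ (xmax - xmin);

disp(p_test)
```

Test values may fall outside [0, 1] because the test set did not contribute to `xmin` and `xmax`. That is expected, and it is a sign that the test set was kept out of the fit rather than a problem.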
Transpose to fit the model
% The transpose here puts the data back into the layout the tree model expects (samples in rows).
p_train = p_train'; p_test = p_test';
t_train = t_train'; t_test = t_test';
n_features = size(p_train, 2);
% inputTable = trainingData;
predictorNames = {'AN1', 'VR1', 'ARE1', 'EF2', 'E12', 'ST_1', 'St_1', 'CRR_1', 'AT_1', 'At_1'};
% predictors = inputTable(:, predictorNames);
predictors = p_train;
% response = inputTable.UTS;
response = t_train;
% isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false];
% Train the regression model
% The following code specifies all the model options and trains the model.
template = templateTree(...
'MinLeafSize', 8, ...
'NumVariablesToSample', 'all');
regressionEnsemble = fitrensemble(...
predictors, ...
response, ...
'Method', 'Bag', ...
'NumLearningCycles', 30, ...
'Learners', template);
% Create the result struct with the predict function
predictorExtractionFcn = @(t) t;
ensemblePredictFcn = @(x) predict(regressionEnsemble, x);
trainedModel.predictFcn = @(x) ensemblePredictFcn(predictorExtractionFcn(x));
% Add additional fields to the result struct
trainedModel.RequiredVariables = {'AN1', 'VR1', 'ARE1', 'EF2', 'E12', 'ST_1', 'St_1', 'CRR_1', 'AT_1', 'At_1'};
trainedModel.RegressionEnsemble = regressionEnsemble;
trainedModel.About = 'This struct is a trained model exported from Regression Learner R2024b.';
trainedModel.HowToPredict = sprintf('To make predictions on a new table, T, use: \n yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appregression_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
% % Extract predictors and response
% % The following code processes the data into the right shape for training the model.
% %
% inputTable = trainingData;
% predictorNames = {'AN1', 'VR1', 'ARE1', 'EF2', 'E12', 'ST_1', 'St_1', 'CRR_1', 'AT_1', 'At_1'};
% predictors = inputTable(:, predictorNames);
% response = inputTable.UTS;
% isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false];
% Perform cross-validation
partitionedModel = crossval(trainedModel.RegressionEnsemble, 'KFold', 5);
% Compute validation predictions
validationPredictions = kfoldPredict(partitionedModel);
% Compute validation RMSE
validationRMSE = sqrt(kfoldLoss(partitionedModel, 'LossFun', 'mse'));
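Because the ensemble is trained on a response scaled to [0, 1], `validationRMSE` is in normalized units, whereas `error1` and `error2` are computed after the scaling is reversed. For a min-max scaling to [0, 1], an RMSE converts back to original units by multiplying by the response range. The sketch below uses made-up numbers to show that relation. In the script itself the range should be available as `ps_output.xmax - ps_output.xmin`, which is an assumption about the settings struct worth verifying in your release.

```matlab
% Hypothetical values, for illustration only.
T_train = [200 250 300 400];                    % response in original units
yRange  = max(T_train) - min(T_train);          % range used by the min-max scaling

t_train = (T_train - min(T_train)) / yRange;    % response scaled to [0, 1]
t_pred  = [0.1 0.2 0.6 0.9];                    % made-up normalized predictions

% RMSE in normalized units, then rescaled to original units.
rmseNorm = sqrt(mean((t_pred - t_train).^2));
rmseOrig = rmseNorm * yRange;

% Cross-check: undo the scaling first, then compute the RMSE directly.
T_pred     = t_pred * yRange + min(T_train);
rmseDirect = sqrt(mean((T_pred - T_train).^2));

fprintf('normalized RMSE: %.4f, original-unit RMSE: %.4f\n', rmseNorm, rmseOrig);
```

Comparing the cross-validation RMSE with the training and test RMSE is only meaningful once all three are expressed in the same units.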
Prediction on the training and test sets
% Predictions used for the error metrics below
t_sim1 = trainedModel.predictFcn(p_train);
t_sim2 = trainedModel.predictFcn(p_test);
%% Reverse the normalization
T_sim1 = mapminmax('reverse', t_sim1, ps_output);
T_sim2 = mapminmax('reverse', t_sim2, ps_output);
%% Root mean squared error (RMSE)
error1 = sqrt(sum((T_sim1' - T_train).^2) ./ M);
error2 = sqrt(sum((T_sim2' - T_test ).^2) ./ N);
Plots
figure
plot(1: M, T_train, 'r-*', 1: M, T_sim1, 'b-o', 'LineWidth', 1)
legend('Actual', 'Predicted')
xlabel('Sample')
ylabel('Predicted result')
% titleText avoids shadowing the built-in string function
titleText = {'Training set: predicted vs. actual'; ['RMSE=' num2str(error1)]};
title(titleText)
xlim([1, M])
grid
figure
plot(1: N, T_test, 'r-*', 1: N, T_sim2, 'b-o', 'LineWidth', 1)
legend('Actual', 'Predicted')
xlabel('Sample')
ylabel('Predicted result')
titleText = {'Test set: predicted vs. actual'; ['RMSE=' num2str(error2)]};
title(titleText)
xlim([1, N])
grid
% %% Plot the error curve
% figure
% plot(1: trees, oobError(net), 'b-', 'LineWidth', 1)
% legend('Error curve')
% xlabel('Number of decision trees')
% ylabel('Error')
% xlim([1, trees])
% grid
% %% Plot feature importance
% figure
% bar(importance)
% legend('Importance')
% xlabel('Feature')
% ylabel('Importance')
Metrics
% R2
R1 = 1 - norm(T_train - T_sim1')^2 / norm(T_train - mean(T_train))^2;
R2 = 1 - norm(T_test - T_sim2')^2 / norm(T_test - mean(T_test ))^2;
disp(['R2 on the training set: ', num2str(R1)])
R2 on the training set: 0.78599
disp(['R2 on the test set: ', num2str(R2)])
R2 on the test set: 0.77068
% % MAE
% mae1 = sum(abs(T_sim1' - T_train)) ./ M;
% mae2 = sum(abs(T_sim2' - T_test )) ./ N;
%
% disp(['MAE on the training set: ', num2str(mae1)])
% disp(['MAE on the test set: ', num2str(mae2)])
%
% % MBE
% mbe1 = sum(T_sim1' - T_train) ./ M ;
% mbe2 = sum(T_sim2' - T_test ) ./ N ;
%
% disp(['MBE on the training set: ', num2str(mbe1)])
% disp(['MBE on the test set: ', num2str(mbe2)])
%% Scatter plots
sz = 25;
c = 'b';
figure
scatter(T_train, T_sim1, sz, c)
hold on
min_val = min([T_train, T_sim1']) * 0.95;
max_val = max([T_train, T_sim1']) * 1.05;
plot([min_val max_val], [min_val max_val], 'k--', 'LineWidth', 1)
axis([min_val max_val min_val max_val]);
xlabel('Training set actual values');
ylabel('Training set predicted values');
title('Training set: predicted vs. actual values')
figure
scatter(T_test, T_sim2, sz, c)
hold on
min_val = min([T_test, T_sim2']) * 0.95;
max_val = max([T_test, T_sim2']) * 1.05;
plot([min_val max_val], [min_val max_val], 'k--', 'LineWidth', 1)
axis([min_val max_val min_val max_val]);
plot(xlim, ylim, '--k')
xlabel('Test set actual values');
ylabel('Test set predicted values');
title('Test set: predicted vs. actual values')
1 Comment
cdarling
on 19 May 2025
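Since the code cannot be run directly, I am answering based on your description.
Data leakage means that data the model has already been trained on is later used again for prediction or evaluation.
Because the model parameters were already adjusted during training so that the model produces the correct results for that data, feeding the same data back into the model and looking at the results tells you very little.
The main purpose of training a model is for it to handle inputs that differ from what it has seen and to cope with the range of real-world inputs, so it has to be evaluated on different inputs.
Taken to the extreme, if the output only had to be determined for known inputs, a set of if statements on those inputs returning the corresponding values would suffice, and there would be no need to train a more complex model.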
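The point about memorized inputs can be made concrete with a toy example. The sketch below is an illustration under assumed synthetic data, not the asker's model: a "model" that simply looks up the nearest training point scores perfectly on the data it memorized and noticeably worse on held-out data. The names `memorize`, `rmseTrain`, and `rmseTest` are hypothetical.

```matlab
rng(0);                              % fixed seed so the sketch is repeatable

% Hypothetical 1-D data: y = 2x plus noise.
x = (1:20)';
y = 2*x + randn(20, 1);

trainIdx = 1:2:20;                   % odd rows for training
testIdx  = 2:2:20;                   % even rows held out

% A "model" that only memorizes: return the response of the nearest training input.
memorize = @(xq) interp1(x(trainIdx), y(trainIdx), xq, 'nearest', 'extrap');

% Evaluating on the memorized data gives a perfect score that means nothing.
rmseTrain = sqrt(mean((memorize(x(trainIdx)) - y(trainIdx)).^2));

% Evaluating on held-out data shows how the model behaves on unseen inputs.
rmseTest  = sqrt(mean((memorize(x(testIdx)) - y(testIdx)).^2));

fprintf('RMSE on memorized data: %.4f, RMSE on held-out data: %.4f\n', ...
    rmseTrain, rmseTest);
```

The gap between the two numbers is what a leaked evaluation hides, which is why the test rows should stay out of every step that fits or tunes the model.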