Why are the results of forward and predict very different in deep learning?

Question

cui,xingxing on 20 Jun 2020

Direct link to this question:

https://it.mathworks.com/matlabcentral/answers/551863-why-are-the-results-of-forward-and-predict-very-different-in-deep-learning

Edited: cui,xingxing on 27 Apr 2024

When I use a "dlnetwork" deep neural network model to make predictions, the forward and predict functions give very different results. As far as I know, the only difference between them is that predict freezes the batchNormalizationLayer and dropout layers, while forward does not freeze them; forward is the forward-pass function used in the training phase.

In the two screenshots of the outputs, the first 10 results differ by orders of magnitude. Where does the problem come from?

All my data is here.



Answer 1

Daniel Vieira on 5 Aug 2021 (accepted answer)

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/551863-why-are-the-results-of-forward-and-predict-very-different-in-deep-learning#answer_761882

Edited: Daniel Vieira on 5 Aug 2021

I ran into this exact problem, and I think I found a solution. I'll know for sure when my model finishes training...

As others said before, the problem occurs because batchNorms behave differently in forward() and predict(). But there is still a problem here: if you trained your model (forward), it should have converged to a solution that works well in inference (predict), but it doesn't. Something is wrong in the training too.

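What is wrong is that batchNorms don't update all of their parameters the same way as other layers, through the adamupdate/rmspropupdate/sgdmupdate functions. Their running statistics (TrainedMean and TrainedVariance) are not learnables; they are updated through the State property of the dlnetwork object. Consider the code: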

% Training step that discards the network state
[gradients,loss] = dlfeval(@modelGradients,dlnet,dlX,Ylabel);
[dlnet,otherOutputs]=rmspropupdate(dlnet,gradients,otherInputs);

function [gradients,loss] = modelGradients(dlnet,dlX,Ylabel)
    Y=forward(dlnet,dlX);   % second output (state) is not requested
    loss=myLoss(Y,Ylabel);
    gradients=dlgradient(loss,dlnet.Learnables);
end

The code above is wrong if you have batchNorms: it won't update their statistics. The batchNorms are updated through the State returned from forward and assigned to dlnet:

% Training step that keeps the network state
[gradients,state,loss] = dlfeval(@modelGradients,dlnet,dlX,Ylabel);
dlnet.State=state; % THIS!!!
[dlnet,otherOutputs]=rmspropupdate(dlnet,gradients,otherInputs);

function [gradients,state,loss] = modelGradients(dlnet,dlX,Ylabel)
    [Y,state]=forward(dlnet,dlX); % THIS!!!
    loss=myLoss(Y,Ylabel);
    gradients=dlgradient(loss,dlnet.Learnables);
end

Now that dlnet has a State property updated at every forward() call, the batchNorms are updated and your model should converge to a solution that works for predict().
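To see the effect of that one line, here is a minimal sketch that trains the same toy network twice, once with and once without the State assignment, and then measures how far forward() and predict() disagree. Everything about the setup is an assumption made for illustration: the toy layers, the synthetic data, the use of mse as the loss, and the helper name trainAndCompare are not from the code above.

```matlab
rng(0);

% Toy regression network with one batch normalization layer (hypothetical).
layers = [
    featureInputLayer(4)
    fullyConnectedLayer(8)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(1)];

% Synthetic data with a large offset, so that untrained statistics
% (mean 0, variance 1) are clearly wrong for the hidden activations.
X = 20 + 5*randn(4,256,'single');
T = sum(X,1);
dlX = dlarray(X,'CB');

gapWithState = trainAndCompare(dlnetwork(layers),dlX,T,true);
gapWithoutState = trainAndCompare(dlnetwork(layers),dlX,T,false);
fprintf('max |forward - predict| with State update:    %g\n',gapWithState);
fprintf('max |forward - predict| without State update: %g\n',gapWithoutState);

function gap = trainAndCompare(dlnet,dlX,T,keepState)
    % Short custom training loop; keepState toggles the State assignment.
    averageSqGrad = [];
    for iteration = 1:200
        [gradients,state] = dlfeval(@modelGradients,dlnet,dlX,T);
        if keepState
            dlnet.State = state;   % carry the batch norm statistics forward
        end
        [dlnet,averageSqGrad] = rmspropupdate(dlnet,gradients,averageSqGrad);
    end
    % Largest disagreement between training-mode and inference-mode outputs.
    Yforward = forward(dlnet,dlX);
    Ypredict = predict(dlnet,dlX);
    gap = max(abs(extractdata(Yforward - Ypredict)),[],'all');
end

function [gradients,state,loss] = modelGradients(dlnet,dlX,T)
    [Y,state] = forward(dlnet,dlX);
    loss = mse(Y,T);
    gradients = dlgradient(loss,dlnet.Learnables);
end
```

The exact numbers depend on the random initialization, but without the State assignment predict() keeps normalizing with the initial statistics, which is the situation described in the question.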

I would also like to call MathWorks' attention to the fact that this detail appears in the documentation in only ONE example, on GAN networks (in spite of the omnipresence of batchNorm layers in deep learning models), and is never mentioned explicitly.

8 comments

Daniel Vieira on 25 Oct 2022

Edited: Daniel Vieira on 25 Oct 2022

Contrastive loss has a symmetry to it, L(x,y) is the same as L(y,x). Because of that, intuitively I would say it doesn't matter which 'side' you pick to update the State. But don't take my word on that.

For triplet loss, lacking the same symmetry, it doesn't feel right to me using any layer that updates State.

In either case, I would go for the instanceNorm layer rather than batchNorm.

EDIT:

On second thought, there is actually a sensible way to choose the State. To use contrastive or triplet loss, you are surely using a multiple-input network, like a siamese architecture. These architectures are built to compare the 'test input' to the 'standard input', let's put it this way. The loss may be symmetric, but your 'data picking' for input 1 and input 2 is not: one of them is drawn at random, and the other is selected accordingly, to direct the training. It makes sense to me to use the State generated by the random one, since it is the one that will be 'faithful' to the population of your dataset.

Filippo Vascellari on 26 Oct 2022

Yes, sorry, by 'random' I mean this: for contrastive loss, the first image in the pair is chosen randomly, and the second is chosen with 50% probability to be another image of the same person or an image of a different person. For triplet loss, the first image (anchor) is chosen randomly, and I randomly pick another image from the same class (same person) and an image from another randomly chosen class (person).

If you look at the link I put in the previous answer, you will see strange behaviour in the loss graph for the triplet loss: it doesn't follow a curve. I think this is because the subnetwork used as feature extractor (a resnet18 pretrained on my dataset) is already trained to optimal performance and gives a good embedding for each image. The fluctuations you will see are due to some batches containing harder triplets than others, which increases the loss. I don't think the random picking of triplets is wrong, also because hard or semi-hard mining is infeasible on my laptop, even with 16 GB of RAM and a GTX 1660 Ti Mobile graphics card. The strange behaviour is instead due to the batch norm layers not being updated during training, which comes back to the question: "How do I update the state of the batch norm layers if the forward pass is made on more than one dlarray?"

Keep in mind that I froze all the layers in the pretrained resnet18 up to the average pooling layer called pool5; what I leave free to be trained is the fully connected layer(s) at the end.

Daniel Vieira on 26 Oct 2022

ok, good, you are doing the data sampling right then. In this case, the State should be picked from the "true random" input, as the others are "conditional random" and won't give TrainedMean and TrainedStd that are representative of your dataset.
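As a rough sketch of that suggestion for the contrastive case: run forward() on both inputs, but keep only the State from the randomly drawn one. The tiny embedding network, the synthetic pairs, the margin value, and the helper name pairGradients are all hypothetical; the loss is one common form of contrastive loss, not necessarily the one used in your code.

```matlab
% Hypothetical tiny embedding network with one batch normalization layer.
layers = [
    featureInputLayer(6)
    fullyConnectedLayer(8)
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(2)];
net = dlnetwork(layers);

% Synthetic pair batch: X1 stands for the randomly drawn input, X2 for the
% input chosen conditionally on X1 (same class if pairLabels is 1, else 0).
numPairs = 32;
X1 = dlarray(randn(6,numPairs,'single'),'CB');
X2 = dlarray(randn(6,numPairs,'single'),'CB');
pairLabels = single(rand(1,numPairs) > 0.5);
margin = 1;

[gradients,state,loss] = dlfeval(@pairGradients,net,X1,X2,pairLabels,margin);
net.State = state;   % statistics taken from the randomly drawn input only

function [gradients,state,loss] = pairGradients(net,X1,X2,pairLabels,margin)
    [F1,state] = forward(net,X1);   % keep the State from the random input
    F2 = forward(net,X2);           % State from the conditional input is dropped
    % Contrastive loss on the Euclidean distance between the embeddings.
    distances = sqrt(sum((F1 - F2).^2,1) + eps);
    lossSimilar = pairLabels.*distances.^2;
    lossDissimilar = (1 - pairLabels).*max(margin - distances,0).^2;
    loss = 0.5*sum(lossSimilar + lossDissimilar,'all')/numel(pairLabels);
    gradients = dlgradient(loss,net.Learnables);
end
```

Both forward passes still normalize with their own batch statistics; the choice only affects which statistics are accumulated for predict().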

About the loss curve, I completely missed the file you attached yesterday and only saw it now, sorry... It looks noisy, but it might not be wrong. A noisy loss can be caused by a small minibatch, for example, or by a particularly hard problem, like training detectors for small and faint objects in huge images. You can experiment with the minibatch size if you are not already at the maximum your GPU can handle. Also take a look at your dataset; perhaps something in your images is making the model harder to train...?

If you apply some smoothing to that curve you might have a clearer view of the downward trend (the standard MATLAB training plot does have some smoothing). This amount of 'noise' does mean that your model still has a good deal of error margin, though.
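For example, a moving average is one simple way to smooth a recorded loss history. This is only a sketch on made-up data: lossHistory here is synthetic (a decaying trend plus noise), and the window size of 25 iterations is an arbitrary choice, not what the MATLAB training plot uses.

```matlab
rng(1);

% Synthetic stand-in for a recorded per-iteration loss (hypothetical data).
iterations = (1:500)';
lossHistory = exp(-iterations/200) + 0.3*randn(500,1);

% Centered moving average over 25 iterations; the window shrinks at the ends.
windowSize = 25;
smoothedLoss = movmean(lossHistory,windowSize);
```

Plotting lossHistory and smoothedLoss on the same axes usually makes the downward trend much easier to judge.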

Your code also lacks regularization; it might be worth looking into (though I don't think this is the cause of the problems you describe). See: https://www.mathworks.com/help/deeplearning/ug/specify-training-options-in-custom-training-loop.html#mw_50581933-e0ce-4670-9456-af23b2b6f337

"Keep in mind that I froze all the layers in the pretrained resnet18 up to the average pooling layer called pool5; what I leave free to be trained is the fully connected layer(s) at the end."

Here you got me confused: if all that is being trained are the fully connected layers, why do you need the State for the batchNorms? They are frozen as well, aren't they?

Filippo Vascellari on 28 Oct 2022

In my training I use batches of 64 images for contrastive loss and batches of 32 images for triplet loss. These fill the entire RAM and GPU memory, so I can't use larger batch sizes. The images of the dataset are taken from CasiaWebface and are resized to 224x224.

I understand that the noise (fluctuations) in the curve is due to both the margin and the triplets chosen in each batch: if the triplets are hard, the loss increases in that iteration.

In the end I discovered that the non-curve behaviour of the triplet loss with my resnet18 pretrained on the dataset is due to the fact that the feature extraction done by the pretrained resnet18 is already "good": during testing it produces a small number of false positives and false negatives even without training it on the triplet loss. In fact, the curve behaviour does appear if I use a resnet18 that is not pretrained on my dataset (I used the one trained on ImageNet).

I will try L2 regularization to reduce the noise in the triplet loss, because I already tested the contrastive loss and obtained good results.

As for the batch norm update, I need it because in my tests I train different networks: cross-entropy, triplet, and contrastive. The last two come in two versions: one with only the triplet or contrastive loss, and one that combines the classification loss with the triplet/contrastive loss. For the combined version the network must be updated entirely, including the batch norm layers, and this is why I need the state update. Take the combined contrastive version as an example: it needs the contrastive loss itself plus the sum of the classification losses of the pair images (anchor-pos or anchor-neg). In this case, which state should I use to update the batch norm layers: the anchor's or the pos/neg one's?


Answer 2

vaibhav mishra on 30 Jun 2020

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/551863-why-are-the-results-of-forward-and-predict-very-different-in-deep-learning#answer_459103

Hi there,

In my opinion you are using BatchNorm in training and not in testing, so how can you expect to get the same results from both? You need to use BatchNorm in testing as well, with the same parameters as in training.

1 comment

cui,xingxing on 7 Jul 2020

Edited: cui,xingxing on 7 Jul 2020

Thank you for your reply! But doesn't the predict method of dlnetwork freeze the BatchNorm mean and variance during model inference?

1. If BN is frozen, why is the second output, state, returned by predict empty?

2. In testing, if I want to use the BatchNorm parameters from the training phase, how should the inference code be modified?

I sincerely hope to get your reply, thank you!


Answer 3

cui,xingxing on 12 Jul 2020

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/551863-why-are-the-results-of-forward-and-predict-very-different-in-deep-learning#answer_464588

test.pdf

I wrote an analysis blog post on this issue; see the attachment. The question that still bothers me is: how does batchnorm() behave in forward and in predict?
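For reference, the two modes can be written out by hand. The sketch below uses the standard batch normalization formula in plain MATLAB, with no toolbox calls; it is not MathWorks' implementation. The data, epsilon, and the decay of 0.1 for the moving average are assumptions, and Scale and Offset are left at 1 and 0. Training mode normalizes with the statistics of the current mini-batch, while inference mode normalizes with stored statistics (called TrainedMean and TrainedVariance in dlnetwork.State).

```matlab
rng(0);

% 3 channels, 64 observations, deliberately far from zero mean/unit variance.
X = 50 + 10*randn(3,64);
epsilon = 1e-5;

% Training mode (as in forward): statistics of the current mini-batch.
batchMean = mean(X,2);
batchVariance = var(X,1,2);      % population variance over the batch
Ybatch = (X - batchMean)./sqrt(batchVariance + epsilon);

% Inference mode (as in predict) with statistics that were never updated.
trainedMean = zeros(3,1);
trainedVariance = ones(3,1);
Yuntrained = (X - trainedMean)./sqrt(trainedVariance + epsilon);

% Moving-average update of the stored statistics, repeated over many
% iterations (here always with the same batch, to keep the sketch short).
decay = 0.1;                     % assumed decay value
for iteration = 1:200
    trainedMean = (1 - decay)*trainedMean + decay*batchMean;
    trainedVariance = (1 - decay)*trainedVariance + decay*batchVariance;
end
Ytracked = (X - trainedMean)./sqrt(trainedVariance + epsilon);

fprintf('mean |Y|, batch statistics:     %g\n',mean(abs(Ybatch(:))));
fprintf('mean |Y|, untrained statistics: %g\n',mean(abs(Yuntrained(:))));
fprintf('mean |Y|, tracked statistics:   %g\n',mean(abs(Ytracked(:))));
```

With untrained statistics the inference-mode output is on a completely different scale from the training-mode output; once the stored statistics have tracked the batch statistics, the two modes agree closely.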


Answer 4

Luc VIGNAUD on 29 Jun 2021

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/551863-why-are-the-results-of-forward-and-predict-very-different-in-deep-learning#answer_735928

Thank you for raising this question. I observed this issue while playing with GANs, and the difference does indeed come from the batchNorm. I ended up using InstanceNorm instead, but the question remains and should be answered by the MATLAB team...

1 comment

cui,xingxing on 8 Dec 2023

Only batchNormalizationLayer has a State property; instanceNormalizationLayer doesn't. See the following example:

layers = [
    imageInputLayer([28 28 3])
    convolution2dLayer(5,20)
    batchNormalizationLayer
    instanceNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    fullyConnectedLayer(10)
    softmaxLayer
    classificationLayer]

layers =

  9×1 Layer array with layers:

     1   ''   Image Input              28×28×3 images with 'zerocenter' normalization
     2   ''   2-D Convolution          20 5×5 convolutions with stride [1 1] and padding [0 0 0 0]
     3   ''   Batch Normalization      Batch normalization
     4   ''   Instance Normalization   Instance normalization
     5   ''   ReLU                     ReLU
     6   ''   2-D Max Pooling          2×2 max pooling with stride [2 2] and padding [0 0 0 0]
     7   ''   Fully Connected          10 fully connected layer
     8   ''   Softmax                  softmax
     9   ''   Classification Output    crossentropyex

net = dlnetwork(layers(1:end-1))

net =

  dlnetwork with properties:

         Layers: [8×1 nnet.cnn.layer.Layer]
    Connections: [7×2 table]
     Learnables: [8×3 table]
          State: [2×3 table]
     InputNames: {'imageinput'}
    OutputNames: {'softmax'}
    Initialized: 1

  View summary with summary.

net.Learnables

ans = 8×3 table

        Layer         Parameter           Value
    ______________    _________    ___________________
    "conv"            "Weights"    { 5×5×3×20 dlarray}
    "conv"            "Bias"       { 1×1×20   dlarray}
    "batchnorm"       "Offset"     { 1×1×20   dlarray}
    "batchnorm"       "Scale"      { 1×1×20   dlarray}
    "instancenorm"    "Offset"     { 1×1×20   dlarray}
    "instancenorm"    "Scale"      { 1×1×20   dlarray}
    "fc"              "Weights"    {10×2880   dlarray}
    "fc"              "Bias"       {10×1      dlarray}

net.State

ans = 2×3 table

       Layer           Parameter              Value
    ___________    _________________    ________________
    "batchnorm"    "TrainedMean"        {1×1×20 dlarray}
    "batchnorm"    "TrainedVariance"    {1×1×20 dlarray}
