Main Content

This example shows how to use the fast gradient sign method (FGSM) and the basic iterative method (BIM) to generate adversarial examples for a pretrained neural network.

Neural networks can be susceptible to a phenomenon known as *adversarial examples* [1], where very small changes to an input can cause the input to be misclassified. These changes are often imperceptible to humans.

In this example, you create two types of adversarial examples:

Untargeted — Modify an image so that it is misclassified as any incorrect class.

Targeted — Modify an image so that it is misclassified as a specific class.

Load a network that has been trained on the ImageNet [2] data set and convert it to a `dlnetwork`

.

```
net = squeezenet;
lgraph = layerGraph(net);
lgraph = removeLayers(lgraph,lgraph.Layers(end).Name);
dlnet = dlnetwork(lgraph);
```

Extract the class labels.

classes = categories(net.Layers(end).Classes);

Load an image to use to generate an adversarial example. The image is a picture of a golden retriever.

img = imread('sherlock.jpg'); T = "golden retriever";

Resize the image to match the input size of the network.

```
inputSize = dlnet.Layers(1).InputSize;
img = imresize(img,inputSize(1:2));
figure
imshow(img)
title("Ground Truth: " + T)
```

Prepare the image by converting it to a `dlarray`

.

`X = dlarray(single(img),"SSCB");`

Prepare the label by one-hot encoding it.

T = onehotencode(T,1,'ClassNames',classes); T = dlarray(single(T),"CB");

Create an adversarial example using the untargeted FGSM [3]. This method calculates the gradient $${\nabla}_{X}L(X,T)$$ of the loss function $$L$$, with respect to the image $$X$$ you want to find an adversarial example for, and the class label $$T$$. This gradient describes the direction to "push" the image in to increase the chance it is misclassified. You can then add or subtract a small error from each pixel to increase the likelihood the image is misclassified.

The adversarial example is calculated as follows:

${\mathit{X}}_{\mathrm{adv}}=\mathit{X}+\u03f5.\mathrm{sign}\left({\nabla}_{\mathit{X}}\mathit{L}\left(\mathit{X},\mathit{T}\right)\right)$.

Parameter $\u03f5$ controls the size of the push. A larger $\u03f5$ value increases the chance of generating a misclassified image, but makes the change in the image more visible. This method is untargeted, as the aim is to get the image misclassified, regardless of which class.

Calculate the gradient of the image with respect to the golden retriever class.

gradient = dlfeval(@untargetedGradients,dlnet,X,T);

Set `epsilon`

to 1 and generate the adversarial example.

epsilon = 1; XAdv = X + epsilon*sign(gradient);

Predict the class of the original image and the adversarial image.

YPred = predict(dlnet,X); YPred = onehotdecode(squeeze(YPred),classes,1)

`YPred = `*categorical*
golden retriever

YPredAdv = predict(dlnet,XAdv); YPredAdv = onehotdecode(squeeze(YPredAdv),classes,1)

`YPredAdv = `*categorical*
Labrador retriever

Display the original image, the perturbation added to the image, and the adversarial image. If the `epsilon`

value is large enough, the adversarial image has a different class label from the original image.

showAdversarialImage(X,YPred,XAdv,YPredAdv,epsilon);

The network correctly classifies the unaltered image as a golden retriever. However, because of perturbation, the network misclassifies the adversarial image as a labrador retriever. Once added to the image, the perturbation is imperceptible, demonstrating how adversarial examples can exploit robustness issues within a network.

A simple improvement to FGSM is to perform multiple iterations. This approach is known as the basic iterative method (BIM) [4] or projected gradient descent [5]. For the BIM, the size of the perturbation is controlled by parameter $$\alpha $$ representing the step size in each iteration. This is as the BIM usually takes many, smaller, FGSM steps in the direction of the gradient. After each iteration, clip the perturbation to ensure the magnitude does not exceed $\u03f5$. This method can yield adversarial examples with less distortion than FGSM.

When you use untargeted FGSM, the predicted label of the adversarial example can be very similar to the label of the original image. For example, a dog might be misclassified as a different kind of dog. However, you can easily modify these methods to misclassify an image as a specific class. Instead of maximizing the cross-entropy loss, you can minimize the mean squared error between the output of the network and the desired target output.

Generate a targeted adversarial example using the BIM and the great white shark target class.

targetClass = "great white shark"; targetClass = onehotencode(targetClass,1,'ClassNames',classes);

Increase the `epsilon`

value to 5, set the step size `alpha`

to 0.2, and perform 25 iterations. Note that you may have to adjust these settings for other networks.

epsilon = 5; alpha = 0.2; numIterations = 25;

Keep track of the perturbation and clip any values that exceed `epsilon`

.

delta = zeros(size(X),'like',X); for i = 1:numIterations gradient = dlfeval(@targetedGradients,dlnet,X+delta,targetClass); delta = delta - alpha*sign(gradient); delta(delta > epsilon) = epsilon; delta(delta < -epsilon) = -epsilon; end XAdvTarget = X + delta;

Predict the class of the targeted adversarial example.

YPredAdvTarget = predict(dlnet,XAdvTarget); YPredAdvTarget = onehotdecode(squeeze(YPredAdvTarget),classes,1)

`YPredAdvTarget = `*categorical*
great white shark

Display the original image, the perturbation added to the image, and the targeted adversarial image.

showAdversarialImage(X,YPred,XAdvTarget,YPredAdvTarget,epsilon);

Because of imperceptible perturbation, the network classifies the adversarial image as a great white shark.

To make the network more robust against adversarial examples, you can use adversarial training. For an example showing how to train a network robust to adversarial examples, see Train Image Classification Network Robust to Adversarial Examples.

Calculate the gradient used to create an untargeted adversarial example. This gradient is the gradient of the cross-entropy loss.

function gradient = untargetedGradients(dlnet,X,target) Y = predict(dlnet,X); Y = stripdims(squeeze(Y)); loss = crossentropy(Y,target,'DataFormat','CB'); gradient = dlgradient(loss,X); end

Calculate the gradient used to create a targeted adversarial example. This gradient is the gradient of the mean squared error.

function gradient = targetedGradients(dlnet,X,target) Y = predict(dlnet,X); Y = stripdims(squeeze(Y)); loss = mse(Y,target,'DataFormat','CB'); gradient = dlgradient(loss,X); end

Show an image, the corresponding adversarial image, and the difference between the two (perturbation).

function showAdversarialImage(image,label,imageAdv,labelAdv,epsilon) figure subplot(1,3,1) imgTrue = uint8(extractdata(image)); imshow(imgTrue) title("Original Image" + newline + "Class: " + string(label)) subplot(1,3,2) perturbation = uint8(extractdata(imageAdv-image+127.5)); imshow(perturbation) title("Perturbation") subplot(1,3,3) advImg = uint8(extractdata(imageAdv)); imshow(advImg) title("Adversarial Image (Epsilon = " + string(epsilon) + ")" + newline + ... "Class: " + string(labelAdv)) end

[1] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and Harnessing Adversarial Examples.” Preprint, submitted March 20, 2015. https://arxiv.org/abs/1412.6572.

[2] *ImageNet*. http://www.image-net.org.

[3] Szegedy, Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. “Intriguing Properties of Neural Networks.” Preprint, submitted February 19, 2014. https://arxiv.org/abs/1312.6199.

[4] Kurakin, Alexey, Ian Goodfellow, and Samy Bengio. “Adversarial Examples in the Physical World.” Preprint, submitted February 10, 2017. https://arxiv.org/abs/1607.02533.

[5] Madry, Aleksander, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. “Towards Deep Learning Models Resistant to Adversarial Attacks.” Preprint, submitted September 4, 2019. https://arxiv.org/abs/1706.06083.

`dlnetwork`

| `onehotdecode`

| `onehotencode`

| `predict`

| `dlfeval`

| `dlgradient`