How to interpret Anomaly Scores for One Class Support Vector Machines

I am using One Class Support Vector Machines for anomaly detection. Here is the anomaly scores histogram (attached) for the model trained with 274 samples and tested with 31 samples. How do I determine the true/false prediction rates from the anomaly scores histogram?
Thank You

Accepted Answer

Kaustab Pal on 26 Aug 2024
Hi @NCA,
To determine true and false prediction rates, it's crucial to set an appropriate threshold on the anomaly scores. Samples with scores above this threshold are classified as anomalies, while those below are considered normal. By examining the distribution of anomaly scores in the histogram, you can identify natural separations or clusters that suggest a reasonable threshold.
Once the threshold is set, you can calculate the following metrics:
  • True Positives (TP): The number of actual anomalies correctly identified.
  • False Positives (FP): The number of normal samples incorrectly classified as anomalies.
  • True Negatives (TN): The number of normal samples correctly identified.
  • False Negatives (FN): The number of actual anomalies incorrectly classified as normal.
Using these values, you can compute performance metrics such as precision, recall, F1 score, and accuracy to evaluate your model. Here is a short code snippet:
% Sample anomaly scores and ground truth labels
anomalyScores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6, 0.2, 0.9, 0.3, 0.5];
groundTruth = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]; % 1 for anomaly, 0 for normal
% Set an appropriate threshold
threshold = 0.5;
% Initialize counters
TP = 0;
FP = 0;
TN = 0;
FN = 0;
% Evaluate predictions based on the threshold
for i = 1:length(anomalyScores)
    if anomalyScores(i) > threshold
        if groundTruth(i) == 1
            TP = TP + 1; % True Positive
        else
            FP = FP + 1; % False Positive
        end
    else
        if groundTruth(i) == 0
            TN = TN + 1; % True Negative
        else
            FN = FN + 1; % False Negative
        end
    end
end
% Calculate metrics
precision = TP / (TP + FP);
recall = TP / (TP + FN);
f1Score = 2 * (precision * recall) / (precision + recall);
accuracy = (TP + TN) / (TP + FP + TN + FN);
% Display results
fprintf('Precision: %.2f\n', precision);
fprintf('Recall: %.2f\n', recall);
fprintf('F1 Score: %.2f\n', f1Score);
fprintf('Accuracy: %.2f\n', accuracy);
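The loop above can also be written without explicit iteration. The following is a minimal vectorized sketch that reuses the same sample data; the names predictedAnomaly and isAnomaly are introduced here only for illustration.

```matlab
% Same sample data as in the loop-based version
anomalyScores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6, 0.2, 0.9, 0.3, 0.5];
groundTruth = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]; % 1 for anomaly, 0 for normal
threshold = 0.5;

% Logical vectors: predicted anomalies and actual anomalies
predictedAnomaly = anomalyScores > threshold;
isAnomaly = groundTruth == 1;

% Count each outcome with element-wise logical operations
TP = sum(predictedAnomaly & isAnomaly);
FP = sum(predictedAnomaly & ~isAnomaly);
TN = sum(~predictedAnomaly & ~isAnomaly);
FN = sum(~predictedAnomaly & isAnomaly);

fprintf('TP=%d FP=%d TN=%d FN=%d\n', TP, FP, TN, FN);
```

Note that the sample with a score of exactly 0.5 is not flagged because the comparison is strict, so it counts as a false negative here.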
I hope this answers your query!
  2 Comments
NCA on 26 Aug 2024
Thanks Kaustab for the detailed explanation. I would like to know why you set the number 1 as the anomaly and 0 as the normal. Can I do the reverse? I was following OCSVM from MATLAB, where it assigned the negative scores as anomalies and positive scores as normal based on the threshold of 0. Please see the attached Anomaly Histogram with the title "FOR MATLAB ANSWERS".
Secondly, I am assuming you created "groundTruth", so in my case I need to create a file termed "groundTruth" with a value of 0 or 1 against each of my "anomalyScores" for my 31 test samples?
Thanks
Umar on 27 Aug 2024

Hi @NCA,

In anomaly detection, whether a sample is classified as an anomaly or as normal depends on the threshold applied to the anomaly scores. You can indeed reverse the labeling of anomalies and normal samples; the key is to be consistent. If your model, such as OCSVM, designates negative scores as anomalies, flag samples with scores below the threshold as anomalies and encode your ground truth to match.

Regarding the groundTruth variable: you need one label for each anomaly score. For your 31 test samples, create a binary array where each entry records whether the sample is an anomaly (1) or normal (0). You can then compute True Positives, False Positives, and the other metrics. Here's a brief code snippet showing how you might set up your groundTruth:

% Example ground truth for 31 samples (1 = anomaly, 0 = normal)
groundTruth = [0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, ...
    1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0];

With groundTruth defined this way, you can evaluate your anomaly detection model. So, to answer your second question: yes, you need a groundTruth array with a value of 0 or 1 for each of the anomalyScores of your 31 test samples, listed in the same order as the scores. Please let us know if you have any further questions.
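If you keep the convention in which negative scores mark anomalies, a minimal sketch of the same counting might look like the following. The signedScores and groundTruth values are made up for illustration, and the threshold of 0 is the one mentioned in the question.

```matlab
% Hypothetical signed scores: negative means anomaly, positive means normal
signedScores = [0.8, -0.3, 0.5, -1.2, 0.1, -0.05];
groundTruth = [0, 1, 0, 1, 0, 0]; % 1 for anomaly, 0 for normal
threshold = 0;

% With this convention, scores BELOW the threshold are flagged
predictedAnomaly = signedScores < threshold;
isAnomaly = groundTruth == 1;

TP = sum(predictedAnomaly & isAnomaly);   % anomalies flagged as anomalies
FP = sum(predictedAnomaly & ~isAnomaly);  % normal samples flagged as anomalies
TN = sum(~predictedAnomaly & ~isAnomaly); % normal samples left unflagged
FN = sum(~predictedAnomaly & isAnomaly);  % anomalies left unflagged

fprintf('TP=%d FP=%d TN=%d FN=%d\n', TP, FP, TN, FN);
```

Only the direction of the comparison changes; the ground truth coding can stay as 1 for anomaly and 0 for normal.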


More Answers (1)

Umar on 26 Aug 2024

Hi @NCA,

To address your query: “I am using One Class Support Vector Machines for anomaly detection. Here is the anomaly scores histogram (attached) for the model trained with 274 samples and tested with 31 samples. How do I determine the true/false prediction rates from the anomaly scores histogram?”

Please see my response to your comments below.

First, I generated synthetic data. The rng(1) command sets the random number generator seed to 1 so that the results can be reproduced. trainData contains 274 samples drawn from a standard normal distribution (mean = 0, variance = 1) for training. testData contains 31 normal samples and 10 anomalies, the latter shifted by 5 units along both dimensions.

rng(1); % For reproducibility
trainData = randn(274, 2); % 274 samples for training
testData = [randn(31, 2); randn(10, 2) + 5]; % 31 normal samples and 10 anomalies

Next, a label vector is created for the training data, with every entry set to 1 to indicate that all training samples are considered normal.

trainLabels = ones(size(trainData, 1), 1); 

Next, a One-Class SVM is trained with the fitcsvm function using the training data and labels. 'KernelFunction', 'gaussian' selects a Gaussian kernel for the SVM; 'Standardize', true standardizes each predictor to zero mean and unit variance; and 'ClassNames', [1; -1] defines the class labels for the model (1 for normal, -1 for anomaly).

ocsvmModel = fitcsvm(trainData, trainLabels, 'KernelFunction', 'gaussian', ...
    'Standardize', true, 'ClassNames', [1; -1]);

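The trained SVM model is then used to predict labels and scores for the test data. The score variable holds the classification scores. The code that follows uses score(:, 2), which should correspond to the second entry of ClassNames (-1, anomaly), as the anomaly score, so larger values indicate a more anomalous sample. These are signed scores rather than probabilities.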

[predictedLabels, score] = predict(ocsvmModel, testData);

Next, a figure with two subplots is created. The first subplot shows a histogram of the anomaly scores for the test data, and the second shows a histogram of the anomaly scores for the training data.

figure;
% Subplot for test data
subplot(2, 1, 1);
histogram(score(:, 2), 30, 'FaceColor', 'b', 'FaceAlpha', 0.5);
title('Anomaly Scores Histogram - Test Data');
xlabel('Anomaly Score');
ylabel('Frequency');
% Subplot for training data
subplot(2, 1, 2);
[~, trainScores] = predict(ocsvmModel, trainData); % Scores are the second output of predict
trainAnomalyScores = trainScores(:, 2); % Same score column as used for the test data
histogram(trainAnomalyScores, 30, 'FaceColor', 'r', 'FaceAlpha', 0.5);
title('Anomaly Scores Histogram - Training Data');
xlabel('Anomaly Score');
ylabel('Frequency');

Next, the true/false prediction rates are determined. A threshold of 0 is applied: scores greater than the threshold are classified as anomalies. The trueLabels vector holds the actual labels of the test data.

threshold = 0; % Set threshold for anomaly detection
predictions = score(:, 2) > threshold; % True if score indicates anomaly
% True labels: 1 for normal, -1 for anomaly
trueLabels = [ones(31, 1); -ones(10, 1)];

Next, the true positives, false positives, true negatives, and false negatives are counted from the predictions and true labels, with anomalies treated as the positive class.

TP = sum(predictions(trueLabels == -1)); % True Positives
FP = sum(predictions(trueLabels == 1));  % False Positives
TN = sum(~predictions(trueLabels == 1)); % True Negatives
FN = sum(~predictions(trueLabels == -1));% False Negatives
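As a cross-check, the same four counts can be read off a confusion matrix. This sketch assumes the confusionmat function from the Statistics and Machine Learning Toolbox is available (fitcsvm comes from the same toolbox) and uses small made-up label vectors, exampleTrue and examplePred, rather than the test data above.

```matlab
% Hypothetical labels: 1 = normal, -1 = anomaly (same coding as trueLabels)
exampleTrue = [1; 1; 1; -1; -1];
examplePred = [1; -1; 1; -1; -1];

% Rows are true classes, columns are predicted classes, both ordered as [-1 1]
C = confusionmat(exampleTrue, examplePred, 'Order', [-1 1]);

TP = C(1, 1); % anomalies predicted as anomalies
FN = C(1, 2); % anomalies predicted as normal
FP = C(2, 1); % normal samples predicted as anomalies
TN = C(2, 2); % normal samples predicted as normal

fprintf('TP=%d FN=%d FP=%d TN=%d\n', TP, FN, FP, TN);
```

To apply this to the test data, the logical predictions would first need to be converted to the same -1/1 coding as trueLabels.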

The true positive rate (sensitivity) and false positive rate are calculated to evaluate the model's performance.

truePositiveRate = TP / (TP + FN);
falsePositiveRate = FP / (FP + TN);

Finally, the true positive and false positive rates are printed to the console, providing insight into the model's effectiveness in detecting anomalies.

fprintf('True Positive Rate: %.2f\n', truePositiveRate);
fprintf('False Positive Rate: %.2f\n', falsePositiveRate);
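Because both rates depend on the chosen threshold, it can help to evaluate several candidate thresholds before settling on one. The following is a minimal sketch with made-up scores and labels (exampleScores, isAnomaly, and the candidate thresholds are all assumptions for illustration), where higher scores are taken to mean more anomalous.

```matlab
% Hypothetical anomaly scores (higher = more anomalous) and true labels
exampleScores = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.4];
isAnomaly = logical([0, 0, 0, 1, 0, 1]);

% Candidate thresholds to compare
thresholds = [-1, 0, 1];
tpr = zeros(size(thresholds));
fpr = zeros(size(thresholds));

for k = 1:numel(thresholds)
    pred = exampleScores > thresholds(k);               % flagged as anomaly
    tpr(k) = sum(pred & isAnomaly) / sum(isAnomaly);    % true positive rate
    fpr(k) = sum(pred & ~isAnomaly) / sum(~isAnomaly);  % false positive rate
    fprintf('Threshold %5.2f: TPR = %.2f, FPR = %.2f\n', ...
        thresholds(k), tpr(k), fpr(k));
end
```

Raising the threshold lowers the false positive rate at the cost of missing more anomalies, so the choice depends on which error is more costly in your application.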

Please see attached.

Please let me know if this resolves your problem or if you have any further questions.

Release

R2020b
