Linear Binary Classification
Binary classification problems
Where do they arise?
Binary classification problems arise when we seek to separate two sets of data points in $\mathbf{R}^n$, each corresponding to a given class. We seek to separate the two sets using simple ‘‘boundaries’’ in $\mathbf{R}^n$, typically hyperplanes. Once the boundary is found, we can use it to predict the class a new point belongs to, by simply checking on which side of the boundary it falls.
Such problems arise in many situations:
In recommendation systems, say for movies, we have information about a given user's movie preferences. Hence we may separate the whole movie database into two different sets (movies that the user likes, and the others). The goal of classification is to determine whether a given new movie will be liked by the user.
In spam filtering, we may have a set of emails which we know are spam, and another set of emails that are known to be legitimate.
In fault detection, we have a set of past signals, some of which are known to correspond to a fault, and others known to correspond to a situation free of fault. The goal of classification is to predict whether a new measured signal corresponds to a fault.
In document classification we may be interested in determining whether a given news article is about a given topic of interest, or not.
In time-series prediction, we may try to determine whether a future value (such as a stock price) will increase or decrease with respect to its current value.
Features
Each data point in the classification problem is a vector that contains a numerical representation of the features that we use to make a class prediction (a small sketch of assembling such a vector appears after the list below). Features may include:
frequencies (or some other numerical score) of given words in a dictionary (see the bag-of-words representation of text);
Boolean variables that determine the presence or absence of a specific feature (such as whether a specific actor played in the movie that the data point represents);
Numerical values such as blood pressure, temperature, prices, etc.
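As an illustration, here is a minimal sketch of how such heterogeneous features might be assembled into a single numerical vector; the word list, movie attribute, and price below are made up for the example:

```python
import numpy as np

# hypothetical raw features for one data point (a movie, say)
word_counts = {"great": 3, "boring": 0, "plot": 2}   # bag-of-words scores
dictionary = ["great", "boring", "plot"]             # fixed word order
has_famous_actor = True                              # Boolean feature
ticket_price = 12.5                                  # numerical feature

# assemble everything into one feature vector x in R^n
x = np.array([word_counts[w] for w in dictionary]
             + [float(has_famous_actor), ticket_price])
print(x)  # e.g. [ 3.   0.   2.   1.  12.5]
```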
Basics of linear classification
Assume we are given a collection of $m$ data points, $x_i \in \mathbf{R}^n$, $i = 1, \ldots, m$, each of which comes with a label $y_i \in \{-1, +1\}$ that determines which class it belongs to.
The linear binary classification problem involves a ‘‘linear boundary’’, that is, a hyperplane. A hyperplane can be described via the equation

$$w^T x + b = 0,$$

for some $w \in \mathbf{R}^n$ ($w \ne 0$) and $b \in \mathbf{R}$.
Such a hyperplane is said to correctly classify these two sets if all data points with $y_i = +1$ fall on one side (hence $w^T x_i + b \ge 0$) and all the others on the other side (hence $w^T x_i + b \le 0$). Hence, the affine inequalities on $(w, b)$:

$$y_i (w^T x_i + b) \ge 0, \quad i = 1, \ldots, m,$$

guarantee correct classification. The above constitute linear (in fact, affine) inequalities on the decision variable, $(w, b)$. This fact is the basis on which we can build a linear programming solution to a classification problem.
The image shows a linear classifier which separates two sets of points in $\mathbf{R}^2$. In two dimensions, a hyperplane corresponds to a simple line. In this example, the data sets are linearly separable.
Classification rule
Once a classifier $(w, b)$ is found, we can classify a new point $x$ by assigning to it the label

$$\hat{y} = \mathrm{sign}(w^T x + b).$$
The above constitutes the classification rule.
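As a small sketch, the rule can be applied directly; here `w`, `b` are a hypothetical, already-computed classifier and `x_new` a new feature vector:

```python
import numpy as np

# hypothetical classifier (w, b) obtained from one of the training problems below
w = np.array([1.0, -2.0, 0.5])
b = 0.3

# new data point to classify
x_new = np.array([0.2, -0.1, 1.0])

# classification rule: predicted label is the sign of w^T x + b
y_hat = np.sign(w @ x_new + b)
print(int(y_hat))  # +1 or -1 (0 only in the degenerate case w^T x_new + b = 0)
```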
Strict separability
The data is strictly separable (meaning that the separating hyperplane does not contain any data points) if and only if the above inequalities can be made strict for some choice of $(w, b)$. If that is the case, then we can always scale the variables $(w, b)$ and obtain the inequalities:

$$y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, m.$$
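As a minimal sketch, strict separability can be checked numerically by solving a feasibility LP, for instance with the cvxpy modeling package (an assumption here; the toy data below is made up, and in practice `X`, `y` would come from the application):

```python
import cvxpy as cp
import numpy as np

# toy data: rows of X are the points x_i, entries of y are the labels (+1 / -1)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# scaled strict-separability inequalities: y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()

# feasible ('optimal' status) exactly when the data is strictly separable
print(prob.status, w.value, b.value)
```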
Case of non-separable data sets
The previous constraints are feasible if and only if the data is strictly separable. If the data is not strictly separable, we can allow for errors in the inequalities, and minimize the sum of these errors.
We obtain the problem

$$\min_{w, b, v} \ \sum_{i=1}^m v_i ~:~ y_i (w^T x_i + b) \ge 1 - v_i, \quad v_i \ge 0, \quad i = 1, \ldots, m.$$
The geometric interpretation of the above is that we are minimizing the sum of the distances from the wrongly classified points to the separating hyperplane.
In the above problem, we can eliminate the variable $v$ and obtain a formulation involving the minimization of a polyhedral function:

$$\min_{w, b} \ \sum_{i=1}^m \left(1 - y_i (w^T x_i + b)\right)_+ ,$$

where the notation $z_+ := \max(0, z)$ denotes the positive part of a real number $z$.
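A minimal cvxpy sketch of this polyhedral minimization, on made-up, possibly non-separable random data (the positive part is available as `cp.pos`):

```python
import cvxpy as cp
import numpy as np

# toy data: 20 points in R^3 with noisy labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.2 * rng.normal(size=20))

w = cp.Variable(3)
b = cp.Variable()

# sum of classification errors: sum_i (1 - y_i (w^T x_i + b))_+
errors = cp.pos(1 - cp.multiply(y, X @ w + b))
prob = cp.Problem(cp.Minimize(cp.sum(errors)))
prob.solve()

print(prob.value, w.value, b.value)
```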
Feature selection
Motivation
In many cases, a sparse classifier, that is, a vector $w$ with many zeros, is desirable.
Indeed, the classification rule involves the scalar product between the classifier vector $w$ and a feature vector $x$. If $w_j = 0$ for some $j$, then the rule ignores the value of the $j$-th feature when making a prediction about the label. In this sense, the $j$-th feature is ‘‘not important’’ in the classification task. Thus, the classifier allows us not only to make a prediction, but also to single out the important features in the data. This is referred to as feature selection.
Feature selection via $\ell_1$-norm
Assume that the data is separable. We can try to minimize the cardinality of (number of non-zero elements in) the classifier vector $w$, under the constraint that the classifier makes no errors on the data set. This is a hard problem.
Instead, we can use the $\ell_1$-norm heuristic, which consists of replacing the cardinality objective by the $\ell_1$-norm of $w$. In the separable case, we end up with the constrained polyhedral minimization problem

$$\min_{w, b} \ \|w\|_1 ~:~ y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, m.$$
The above can be formulated as an LP.
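For instance, introducing an auxiliary vector $t \in \mathbf{R}^n$ that bounds the absolute values of the entries of $w$, the problem can be written as the LP

$$\min_{w, b, t} \ \sum_{j=1}^n t_j ~:~ y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, m, \qquad -t_j \le w_j \le t_j, \quad j = 1, \ldots, n.$$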
If the data is not separable, we need to trade off our concern for sparsity against the error function we used above. This is done via the polyhedral minimization problem:

$$\min_{w, b} \ \sum_{i=1}^m \left(1 - y_i (w^T x_i + b)\right)_+ + \lambda \|w\|_1,$$

where $\lambda > 0$ is a penalty parameter that allows us to choose the trade-off.
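A minimal cvxpy sketch of this trade-off problem; the data is made up and the value of the penalty parameter is arbitrary (it would normally be tuned, e.g. by cross-validation):

```python
import cvxpy as cp
import numpy as np

# toy data: 20 points in R^5, with two features irrelevant to the labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(X @ np.array([1.0, -1.0, 0.0, 0.0, 0.5]) + 0.2 * rng.normal(size=20))

lam = 0.5                      # penalty parameter (arbitrary choice)
w = cp.Variable(5)
b = cp.Variable()

errors = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))   # classification errors
prob = cp.Problem(cp.Minimize(errors + lam * cp.norm(w, 1)))
prob.solve()

print(w.value)  # entries of w that are (near) zero flag unimportant features
```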
Robustness
Against box uncertainty
In some cases, the data points are not known exactly. A typical uncertainty model is to assume that each data point $x_i$ is only known to belong to a ‘‘box’’ around a given point $\hat{x}_i$:

$$x_i \in \mathcal{X}_i := \left\{ \hat{x}_i + u ~:~ \|u\|_\infty \le \rho \right\}, \quad i = 1, \ldots, m,$$

where $\rho \ge 0$ is a measure of the size of the uncertainty, and the $\hat{x}_i$'s represent the (known) ‘‘nominal’’ values of the data points.
Let us assume that the nominal data points are strictly separable. A robust classifier corresponds to a hyperplane that separates not only the nominal points $\hat{x}_i$, but also the boxes around them. That is, a robust hyperplane satisfies, for every $i = 1, \ldots, m$:

$$y_i (w^T x_i + b) \ge 0 \quad \text{for every } x_i \in \mathcal{X}_i.$$

The above condition is equivalent to

$$y_i (w^T \hat{x}_i + b) \ge \rho \|w\|_1, \quad i = 1, \ldots, m,$$

since the worst-case (smallest) value of $y_i w^T u$ over the box $\|u\|_\infty \le \rho$ is $-\rho \|w\|_1$. The above condition guarantees that the data points remain separated even if we perturb the original ones (staying in the boxes). This approach requires knowledge of the ‘‘uncertainty size’’ $\rho$.
If $\rho$ is not known, we can simply find a classifier which maximizes the allowable perturbation level $\rho$. Using homogeneity (the constraints are unaffected by a positive scaling of $(w, b)$), we can always enforce $\rho \|w\|_1 = 1$. Then, maximizing $\rho$ is equivalent to minimizing the $\ell_1$-norm of $w$, that is:

$$\min_{w, b} \ \|w\|_1 ~:~ y_i (w^T \hat{x}_i + b) \ge 1, \quad i = 1, \ldots, m.$$
This model is the same as the one we saw for feature selection based on the $\ell_1$-norm heuristic.
Against spherical uncertainty
We can consider another data uncertainty model, in which the data points are only known to belong to hyper-spheres:

$$x_i \in \mathcal{X}_i := \left\{ \hat{x}_i + u ~:~ \|u\|_2 \le \rho \right\}, \quad i = 1, \ldots, m,$$

where $\rho \ge 0$ is a measure of the size of the uncertainty.
Assuming again that the nominal data points are strictly separable, and maximizing the allowable perturbation level $\rho$ as before (now the worst-case value of $y_i w^T u$ over the sphere $\|u\|_2 \le \rho$ is $-\rho \|w\|_2$), leads to the problem

$$\min_{w, b} \ \|w\|_2 ~:~ y_i (w^T \hat{x}_i + b) \ge 1, \quad i = 1, \ldots, m.$$
The above can be cast as a QP, for example by replacing the objective $\|w\|_2$ with its square $\|w\|_2^2$, which does not change the optimal $(w, b)$.
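A minimal cvxpy sketch of the resulting problem, using the equivalent squared objective so that it is literally a QP; the strictly separable toy data is made up:

```python
import cvxpy as cp
import numpy as np

# strictly separable toy data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# minimize ||w||_2^2 subject to y_i (w^T x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, b.value)
```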
If the data is not separable, we need to trade off our concern for robustness against the error function we used above. This is done via the QP:

$$\min_{w, b} \ \sum_{i=1}^m \left(1 - y_i (w^T \hat{x}_i + b)\right)_+ + \lambda \|w\|_2^2,$$

where $\lambda > 0$ is a penalty parameter that allows us to choose the trade-off.
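A minimal cvxpy sketch of this trade-off QP; the data is made up and the penalty value is an arbitrary choice:

```python
import cvxpy as cp
import numpy as np

# toy, possibly non-separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.3 * rng.normal(size=20))

lam = 1.0                      # penalty parameter (arbitrary choice)
w = cp.Variable(3)
b = cp.Variable()

# hinge errors plus squared l2-norm penalty on w
errors = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
prob = cp.Problem(cp.Minimize(errors + lam * cp.sum_squares(w)))
prob.solve()

print(w.value, b.value)
```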