**Nearest neighbor classifiers**

| Classifier | Basic idea | Training | Remarks |
| --- | --- | --- | --- |
| 1-nearest neighbor classifier (1-NN classifier) | Select the class of the reference pattern closest to *x* | Just prepare ground-truth-labeled patterns as reference patterns; there is no explicit training step. If the reference patterns must be reduced, some pre-selection can be done in advance | Simple but powerful. Accuracy generally increases with the number of reference patterns. Many variants exist, differing in the metric used to evaluate "closeness." Computationally expensive when the reference set is huge |
| k-nearest neighbor classifier (k-NN classifier) | Select the majority class among the *k* reference patterns closest to *x* | Same as 1-NN; no explicit training step | An improved version of 1-NN that is more robust to outliers. An odd *k*, say 3 or 5, is usually used to avoid ties |
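As a concrete illustration of the nearest-neighbor rule, here is a minimal pure-Python sketch. The function name `knn_classify`, the Euclidean metric, and the toy reference set are illustrative choices, not taken from the table:

```python
from collections import Counter
import math

def knn_classify(x, references, k=3):
    """Classify x by majority vote among the k nearest reference patterns.

    references: list of (feature_vector, class_label) pairs (ground truth).
    """
    # Euclidean distance is the classic choice of "closeness" metric;
    # other metrics give other variants of the classifier.
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(references, key=lambda r: dist(x, r[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy reference set: two well-separated clusters with ground-truth labels.
refs = [((0.0, 0.1), "A"), ((0.2, 0.0), "A"), ((0.1, 0.3), "A"),
        ((5.0, 5.1), "B"), ((5.2, 4.9), "B"), ((4.8, 5.0), "B")]

print(knn_classify((0.1, 0.2), refs, k=1))  # 1-NN
print(knn_classify((4.9, 5.2), refs, k=3))  # 3-NN
```

Note that every query scans the whole reference set, which is exactly the computational cost the table warns about for huge reference sets.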

**Discriminant function methods**

| Classifier | Basic idea | Training | Remarks |
| --- | --- | --- | --- |
| Bayesian classifier | The optimal classifier, which uses the posterior probability distribution as the discriminant function | Estimate statistical properties, such as the likelihood *p*(*x* \| *c*) and the prior probability *P*(*c*), for every class *c* | Theoretically optimal (it minimizes the Bayes risk), but hard to realize in practice because the true statistical properties are difficult to estimate accurately |
| Linear classifier | Use a linear function of *x* for each class (see Figure 13e) and select the class giving the maximum function value | Error-correcting learning and Widrow-Hoff learning are the classic methods; more recently, SVM training is common | The class boundary is a set of hyperplanes in feature space. A special case of Bayesian classification. If each class has only a single reference pattern, the 1-NN classifier reduces to this |
| Piecewise linear classifier | Use multiple linear discriminant functions for each class | Treat each cluster of a class as a subclass and train linear classifiers to discriminate the subclasses | The class boundary is a set of polygonal chains. The 1-NN classifier is a special case of this classifier |
| Quadratic classifier | Use a quadratic function of *x* for each class | Estimate the likelihood *p*(*x* \| *c*) as a Gaussian function; its logarithm is then a quadratic function of *x*. An SVM with a second-order polynomial kernel is another choice | The class boundary is a set of quadratic surfaces. A special case of Bayesian classification. The Mahalanobis distance is a simplified version of it |
| Support vector machine (SVM) | Place the class boundary at the center of the gap between two classes. By using a so-called kernel, various types of class boundary become possible | Solve a quadratic optimization problem whose solution is the optimally centered discrimination boundary | A general method for training various discriminant functions in an optimization framework; it can provide a linear, quadratic, or more flexible class boundary. Natively limited to two-class classification |
| Multilayer perceptron (neural network) | Combines the feature extraction and classification modules into one framework; classification is done by aggregating the outputs of trainable units, called perceptrons | Back-propagation is the popular choice. Note that it trains not only the classifier but also the feature extraction | Huge variation in inner structure. In the simplest case a perceptron is a linear function with trainable coefficients; nonlinear perceptrons are also used |
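The Bayesian and quadratic classifier rows can be illustrated together: estimating *p*(*x* | *c*) as a Gaussian and taking its logarithm yields a quadratic discriminant function. A minimal 1-D pure-Python sketch, with hypothetical names and toy data:

```python
import math
from collections import defaultdict

def fit_gaussians(samples):
    """Estimate a 1-D Gaussian likelihood p(x|c) and prior P(c) per class."""
    by_class = defaultdict(list)
    for x, c in samples:
        by_class[c].append(x)
    n = len(samples)
    params = {}
    for c, xs in by_class.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        params[c] = (mu, var, len(xs) / n)  # mean, variance, prior P(c)
    return params

def classify(x, params):
    """Pick the class maximizing log P(c) + log p(x|c); with a Gaussian
    likelihood this discriminant is a quadratic function of x."""
    def g(c):
        mu, var, prior = params[c]
        return (math.log(prior)
                - 0.5 * math.log(2 * math.pi * var)
                - (x - mu) ** 2 / (2 * var))
    return max(params, key=g)

train = [(0.1, "A"), (0.4, "A"), (-0.2, "A"), (0.0, "A"),
         (9.8, "B"), (10.3, "B"), (10.1, "B"), (9.9, "B")]
p = fit_gaussians(train)
print(classify(1.0, p), classify(9.0, p))
```

This is the plug-in approximation of the Bayesian classifier: it is optimal only insofar as the Gaussian estimates match the true class distributions, which is the practical difficulty the table points out.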
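Error-correcting learning for a linear classifier can be sketched with the classic two-class perceptron rule: whenever a pattern falls on the wrong side of the hyperplane, nudge the weights toward it. Function names and the toy separable data are assumptions for illustration:

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Error-correcting learning: on each misclassified pattern,
    move the weights toward the correct side of the hyperplane."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:            # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # wrong side (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:                 # converged (data is separable)
            break
    return w, b

# Toy linearly separable two-class data.
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((2.0, 3.0), +1),
        ((0.0, 0.0), -1), ((1.0, 0.0), -1), ((0.0, 1.0), -1)]
w, b = train_perceptron(data)
predict = lambda x: +1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
print([predict(x) == y for x, y in data])  # all True once converged
```

On linearly separable data the perceptron convergence theorem guarantees this loop terminates; on non-separable data it runs until the epoch limit.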
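A minimal back-propagation sketch for a one-hidden-layer perceptron network is given below, trained on XOR (a problem no single linear unit can solve). All details here (two hidden units, sigmoid activations, squared error, learning rate 0.5) are illustrative assumptions, not prescriptions:

```python
import random, math

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# XOR: not linearly separable, so a hidden layer is required.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
T = [0, 1, 1, 0]

H = 2  # hidden units
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, y

def loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(X, T))

lr = 0.5
before = loss()
for _ in range(5000):
    for x, t in zip(X, T):
        h, y = forward(x)
        dy = (y - t) * y * (1 - y)                                # output delta
        dh = [dy * w2[j] * h[j] * (1 - h[j]) for j in range(H)]   # hidden deltas
        for j in range(H):                                        # propagate back
            w2[j] -= lr * dy * h[j]
            for i in range(2):
                w1[j][i] -= lr * dh[j] * x[i]
            b1[j] -= lr * dh[j]
        b2 -= lr * dy
print(before, "->", loss())
```

The hidden layer acts as a trainable feature extractor, which is exactly the point made in the table: back-propagation trains feature extraction and classification together.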

**Classifier ensemble methods**

| Classifier | Basic idea | Training | Remarks |
| --- | --- | --- | --- |
| Voting | Select the majority class among the results of multiple classifiers | If the individual classifiers are already trained, no further training is necessary | Any classifier can be used, and various voting schemes are possible |
| Boosting | Select the class that wins a weighted vote by multiple classifiers | Classifiers are trained complementarily: patterns that one classifier finds difficult are treated as important when training the next. Each classifier's voting weight reflects its reliability | Many versions exist; AdaBoost is the most popular. Any two-class classifier can be used as the component classifier |
| Decision tree / random forest | A decision tree reaches its final classification through a hierarchy of simple classifiers; a random forest is a set of decision trees | For a decision tree, ID3 and C4.5 are the classic training methods; their key idea is to evaluate the importance of each feature | A random forest is a doubly ensemble method: it is an ensemble of decision trees, and each decision tree is itself a hierarchical ensemble classifier |
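Majority voting needs almost no code once the individual classifiers are trained. In this sketch the three threshold "classifiers" are stand-ins for any already-trained classifiers of any type:

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine already-trained classifiers by simple majority voting."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three stand-in "trained classifiers"; any classifier types can be mixed.
clf_a = lambda x: "spam" if x > 0.5 else "ham"
clf_b = lambda x: "spam" if x > 0.3 else "ham"
clf_c = lambda x: "spam" if x > 0.8 else "ham"

print(majority_vote([clf_a, clf_b, clf_c], 0.6))  # two of three say "spam"
```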
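Boosting can be sketched with AdaBoost over 1-D decision stumps: patterns the current ensemble misclassifies get their weights increased, and each stump's vote is weighted by its reliability. The toy data and parameterization are illustrative assumptions; no single stump can fit these labels, but the weighted ensemble can:

```python
import math

def best_stump(xs, ys, w):
    """Weighted-error-minimizing decision stump on 1-D data:
    h(x) = polarity if x > theta else -polarity."""
    best = None
    for theta in [x - 0.5 for x in range(int(min(xs)), int(max(xs)) + 2)]:
        for polarity in (+1, -1):
            preds = [polarity if x > theta else -polarity for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, theta, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                    # uniform pattern weights
    ensemble = []                        # (alpha, theta, polarity) triples
    for _ in range(rounds):
        err, theta, pol = best_stump(xs, ys, w)
        err = max(err, 1e-10)            # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)   # voting weight = reliability
        ensemble.append((alpha, theta, pol))
        # Emphasize the patterns this stump got wrong.
        w = [wi * math.exp(-alpha * y * (pol if x > theta else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return +1 if score > 0 else -1

xs = list(range(10))
ys = [+1, +1, +1, -1, -1, -1, -1, -1, +1, +1]   # no single stump fits this
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) == y for x, y in zip(xs, ys)])
```

Note how the stump here is a two-class classifier, matching the remark that any two-class classifier can serve as the boosted component.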
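An ID3-style decision tree can be sketched by recursively splitting on the feature with the largest information gain, which is one concrete way to "evaluate the importance of each feature." The toy categorical data and names are assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """ID3-style: split on the feature with the largest information gain."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not features:                     # no features left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    def gain(f):
        remainder = 0.0
        for v in set(r[f] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[f] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder
    f = max(features, key=gain)
    tree = {"feature": f, "children": {}}
    for v in set(r[f] for r in rows):
        sub_rows = [r for r in rows if r[f] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[f] == v]
        tree["children"][v] = build_tree(sub_rows, sub_labels,
                                         [g for g in features if g != f])
    return tree

def classify(tree, row):
    # Walk the hierarchy of simple per-feature classifiers to a leaf.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["feature"]]]
    return tree

# Tiny toy data: feature 0 = outlook, feature 1 = wind.
rows = [("sunny", "weak"), ("sunny", "strong"),
        ("rain", "weak"), ("rain", "strong")]
labels = ["yes", "yes", "yes", "no"]
tree = build_tree(rows, labels, [0, 1])
print([classify(tree, r) for r in rows])
```

A random forest would train many such trees on randomized subsets of the data and features and combine them by voting, giving the doubly ensemble structure noted in the table.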