Search prefilters for mid-infrared absorbance spectra of clear coat automotive paint smears using stacked and linear classifiers

Abstract

By using stacked partial least squares classifiers and genetic algorithms for feature selection and classification, it is demonstrated that search prefilters can be developed to extract investigative lead information from clear coat paint smears. The results obtained in this study also show that identifying specific wavelengths or wavelet coefficients in IR spectral data is superior to identifying informative wavelength windows when applying pattern recognition techniques to IR spectra from the paint data query (PDQ) database when differentiating paint samples by assembly plant. Search prefilters developed using specific wavelengths or wavelet coefficients outperformed search prefilters that utilized spectral regions. Clear coat paint spectra from the PDQ database may not be well suited for stacking as there are few spectral intervals that can reliably distinguish the different sample groups (i.e., assembly plants) in the data. The information contained in the IR spectra about assembly plant may not be highly compartmentalized in an interval, which also works against stacking. The similarity of the IR spectra within a plant group and the noise present in the IR spectra may also be obscuring information present in spectral intervals. Copyright © 2014 John Wiley & Sons, Ltd.

1 INTRODUCTION

Paint samples are often recovered from collisions where damage to vehicles or injury or death to a pedestrian has occurred. The Royal Canadian Mounted Police (RCMP) has shown that automobiles can be identified by comparing the color, layer sequence and chemical composition of each individual layer of the recovered paint sample to known automotive paint systems [1, 2]. To make these comparisons possible, the RCMP has developed the paint data query (PDQ) database [3, 4], which contains over 16,000 samples (street samples and factory panels) that correspond to over 60,000 individual paint layers, representing the paint systems used in most domestic and foreign vehicles sold in North America. PDQ is a database of the physical attributes, chemical composition and IR spectrum of each layer of the original manufacturer's paint system.

Searches in PDQ are text based. If the original automotive paint layers are present in the recovered paint sample, PDQ can assist in identifying the make, model and line of the motor vehicle in a limited production year range. However, paint samples that do not contain the color coat or at least one of the undercoat layers pose a problem for PDQ because the search relies heavily on the relatively large variations in color and the chemical formulations of these layers. Because modern automotive paints use thinner undercoat and color coat layers protected by a thicker clear coat layer, a clear coat paint smear, all too often, is the only layer of paint left at the crime scene. The text-based system of PDQ does not allow for the effective searching of clear coats because modern clear coats applied to any automotive substrate have only one of two possible formulations (i.e., they are coded as either acrylic melamine styrene or acrylic melamine styrene polyurethane). There are no inorganic fillers or color with which to further discriminate a clear coat paint sample. In these cases, the text-based portion of the PDQ database cannot be used to identify the motor vehicle.

To assess the evidentiary information content of clear coat paint smears, pattern recognition techniques have been developed to search the IR spectral libraries of the PDQ database to differentiate between similar but nonidentical IR paint spectra. At present, the capability to perform direct searching of IR spectra in PDQ does not exist, and spectral search algorithms commercially available often cannot distinguish subtle differences between clear coat paint spectra from one vehicle model or line to the next. To tackle the problem of library searching in the PDQ database, a prototype library search system is being developed to identify the assembly plant, model and line of an automobile from a clear coat paint spectrum. Search prefilters are used to truncate the library spectra to a specific assembly plant or a group of assembly plants. As the size of the library is truncated for a specific match, both the selectivity and accuracy of the search are increased.

The approach taken by our research group in the application of pattern recognition techniques to problems in library searching [5-7] utilizes a genetic algorithm (GA) for pattern recognition and feature selection to identify individual wavelengths from normalized IR spectra or wavelet coefficients from wavelet-transformed spectra for the purpose of developing classifiers employed as search prefilters. An alternative approach to variable selection first proposed by Kalivas [8] and investigated in this study attempts to improve upon classification results by selecting wavelength windows in the spectra rather than searching for distinct features in an IR spectrum. Informative spectral regions in clear coat paint spectra were identified using a recently developed classification technique called stacked partial least squares discriminant analysis (SPLSDA) [9] to create classification models for search prefilters. Stacking, a concept first proposed by Breiman [10], is similar to ensemble modeling, a concept that has appeared in many fields.

By utilizing GAs and stacked classifiers, it is demonstrated that search prefilters can be developed to extract investigative lead information from clear coat paint smears. The results obtained in this study also show that identifying specific wavelengths or wavelet coefficients in IR spectral data is superior to identifying informative wavelength regions when applying pattern recognition techniques to IR spectra from the PDQ database when differentiating paint samples by assembly plant.

2 METHODOLOGY

2.1 Stacked classifiers

In a closely related technique, stacked partial least squares (PLS) regression, small intervals of the data matrix comprising the X-block are each regressed on the Y-block values separately [11]. The simple regression models are then combined, giving a simpler and often better regression model than a global model that utilizes all regions of the spectrum. For classification, a discriminant analysis-based classifier is used on each of the small intervals to classify the samples. One reason for using SPLSDA over other classification techniques is the inherent dimension reduction obtained for each PLS model—most are simple, with few latent variables needed to describe the class-related information in the data.

In a stacked PLS model, the calibration spectra comprising the training set are first partitioned into a set of n disjoint wavelength regions of equal width. All spectra (training set and validation set) are partitioned in the same way. For the calibration set, n interval PLS models are developed between the target property vector (manufacturing plant) and each of the n intervals, and a set of PLS interval regression vectors is obtained. These interval PLS models are then combined using a set of weighting values determined by a cross-validation procedure to form a stacked model where each model has a specific weight defined by the reciprocal of the cross-validated error rate of the PLS model developed on the kth interval normalized to the sum of the reciprocal of the cross-validated error for all of the models. Direct application of the individual PLS regression models to a validation set partitioned as above gives the value for the class membership of the validation set samples using the previously established weights.

There is a resampling step iterated about this entire process to minimize the chance of overly optimistic results, which is crucial for the success of this method. This step involves separating the data into two parts, one for the cross-validation step and one for the prediction step. Each of the intervals in the first half of the data must be cross-validated to determine the root mean square error of cross-validation (RMSECV). The RMSECV (see Equation (1), where $y_i$ is the class membership value of the ith of m cross-validated samples, $x_{i,k}$ is the kth spectrum interval of that sample and $b_k$ is the regression vector for the kth spectrum interval calculated for the PLS model) is used in the formulation of weights for each interval in the final, stacked, model. For the kth interval with $s_k$ as the reciprocal of the RMSECV, the weight $w_k$ is calculated as shown below (see Equation (2), where $s_k$ is the reciprocal of the cross-validated error rate for the kth PLS discriminant analysis model and n is the number of intervals). The summation normalizes all of the weights to a unit sum. If the RMSECV is 0 for an individual interval, an appropriate weight is used instead. The purpose of calculating the weights in this way is to ensure that a high weight is assigned to any interval where samples are assigned values close to their class membership (coded Y) values.

$$\mathrm{RMSECV}_k = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - x_{i,k}\,b_k\right)^2} \tag{1}$$

$$w_k = \frac{s_k}{\sum_{j=1}^{n} s_j}, \qquad s_k = \frac{1}{\mathrm{RMSECV}_k} \tag{2}$$

The calculated weight matrix is then used to effectively scale each interval's regression coefficients, which are then all summed together to obtain a single regression coefficient matrix. Ideally, the regression coefficient matrix creates predicted Y (class membership) values such that the target class' Y values are well separated from all others. The threshold Y value to use is one that leads to the minimum overlap of the target class from all others. The predicted Y (class membership) values are calculated as seen in Equation (3), where X is the second half of the data and β is the regression coefficients for the kth PLS model.

$$\hat{Y} = \sum_{k=1}^{n} w_k\,X_k\,\beta_k \tag{3}$$
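As a rough illustration of Equations (1) to (3), the sketch below computes interval weights from per-interval RMSECV values and combines per-interval predictions into a stacked prediction. It is a minimal pure-Python sketch and not the authors' implementation. The function names, the toy numbers and the `zero_rmsecv_floor` fallback for an RMSECV of exactly 0 are assumptions introduced here. The per-interval PLS fitting is omitted, so the interval predictions are simply taken as given.

```python
import math

def rmsecv(y_true, y_pred):
    """Root mean square error between coded class values and CV predictions."""
    m = len(y_true)
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / m)

def stacking_weights(rmsecv_values, zero_rmsecv_floor=1e-6):
    """Weights proportional to 1/RMSECV, normalized to a unit sum.

    zero_rmsecv_floor is a hypothetical safeguard (not from the paper) that
    replaces an RMSECV of exactly 0 so that the reciprocal stays finite.
    """
    s = [1.0 / max(e, zero_rmsecv_floor) for e in rmsecv_values]
    total = sum(s)
    return [sk / total for sk in s]

def stacked_prediction(interval_predictions, weights):
    """Weighted sum of per-interval predictions for each sample.

    interval_predictions[k][i] is the prediction of interval model k
    for sample i.
    """
    n_samples = len(interval_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, interval_predictions))
        for i in range(n_samples)
    ]

# Toy example: three interval models, four samples coded 1 (target) or 0 (other).
y = [1.0, 1.0, 0.0, 0.0]
cv_preds = [
    [0.9, 0.8, 0.1, 0.2],  # interval 1: close to the coded values
    [0.6, 0.5, 0.4, 0.5],  # interval 2: weakly informative
    [0.7, 0.9, 0.3, 0.1],  # interval 3
]
errors = [rmsecv(y, p) for p in cv_preds]
weights = stacking_weights(errors)
# For brevity the same toy predictions are reused here; in the procedure
# described above, the stacked prediction is made on the second half of the data.
y_hat = stacked_prediction(cv_preds, weights)
```

In this toy case the interval whose predictions lie closest to the coded class values receives the largest weight, which is the behavior the weighting scheme is meant to produce.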

Once the establishment of a classification model using SPLSDA has been completed, the classification of clear coat paint samples can begin. Another set of paint samples, collected using a Fourier transform IR spectrometer from a different manufacturer, will use the previously calculated regression coefficient matrix to find the predicted class membership values. The same set of discriminants used in the final classification in SPLSDA is also used to classify the paint samples in the validation set.
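The idea of choosing a threshold Y value with minimum overlap on the training data and then reusing it for the validation set can be sketched as follows. This is only an illustrative sketch under stated assumptions: the helper names, the midpoint candidate thresholds, the tie-breaking rule and the toy scores are all hypothetical, and the target class is assumed to have the higher predicted Y values.

```python
def overlap_count(target_scores, other_scores, threshold):
    """Number of training samples that fall on the wrong side of the threshold."""
    missed_targets = sum(1 for s in target_scores if s < threshold)
    false_targets = sum(1 for s in other_scores if s >= threshold)
    return missed_targets + false_targets

def choose_threshold(target_scores, other_scores):
    """Pick the candidate threshold with the fewest misassigned training samples.

    Candidates are midpoints between consecutive sorted scores; ties are
    resolved in favor of the lowest candidate (an arbitrary choice).
    """
    scores = sorted(set(target_scores) | set(other_scores))
    candidates = [(a + b) / 2.0 for a, b in zip(scores, scores[1:])]
    return min(candidates,
               key=lambda t: overlap_count(target_scores, other_scores, t))

def classify(scores, threshold):
    """True where the predicted Y value falls on the target-class side."""
    return [s >= threshold for s in scores]

# Toy predicted Y values for the training set (target class versus all others).
train_target = [0.92, 0.85, 0.78]
train_other = [0.10, 0.22, 0.31, 0.05]
threshold = choose_threshold(train_target, train_other)

# The threshold found on the training set is reused unchanged for validation.
validation_scores = [0.88, 0.15, 0.70, 0.40]
labels = classify(validation_scores, threshold)
```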

The discriminants in SPLSDA were optimized using the technique of cross-validation. A subset of the training (calibration) set data that is withheld from the construction of the discriminant is used for testing. The p samples for testing are removed randomly with the calibration performed on the remaining (m − p) samples. Repeated discriminant development and test cycles are averaged over all samples and evaluated as a function of the number of latent variables. Cross-validation may involve leaving out one sample per test cycle (full cross-validation) or leaving out every third or fourth sample in each test cycle (segmented cross-validation).

In this study, Venetian blind cross-validation [12] was employed with a 0.5 holdout fraction and 10 repeats (for most spectra) to optimize the number of latent variables in each (calibration set) spectrum for each PLS model developed. However, some classes had fewer samples and smaller holdout fractions. Venetian blind cross-validation was used as it is computationally inexpensive and is reliable when there are many samples. Venetian blind cross-validation was applied to interval sizes ranging from 2 to 25 subregions of the spectral response to optimize both the number of latent variables for each interval and the interval size at once. (In this study, the number of latent variables was limited to 20 for each interval.) However, some classes had fewer samples and required fewer latent variables. The stacking process guards against any overfit or underfit of the intervals by PLS as this would predict poorly in the cross-validation and receive low weights in the stacked model.
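For readers unfamiliar with the term, the interleaved hold-out pattern of Venetian blind cross-validation can be sketched as below. The function name and arguments are hypothetical. With `n_splits = 2`, each split holds out half of the samples, which is one way to read a 0.5 holdout fraction; how the 10 repeats were generated is not specified in the text and is not reproduced here.

```python
def venetian_blind_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs.

    In split j, every n_splits-th sample starting at index j is held out,
    so the held-out samples are interleaved through the data like the
    slats of a Venetian blind.
    """
    for j in range(n_splits):
        test = [i for i in range(n_samples) if i % n_splits == j]
        train = [i for i in range(n_samples) if i % n_splits != j]
        yield train, test

# Ten samples, two splits: each split holds out every second sample.
splits = list(venetian_blind_splits(n_samples=10, n_splits=2))
```

Because the held-out samples are drawn at regular strides, the result depends on sample ordering, so the samples are usually arranged so that each class is spread across the splits.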

A two-way cross-validation was performed to identify the spectral windows and the number of latent variables in each spectral window that yielded the minimum prediction error. The best spectral windows were selected to give the minimal cross-validation error in stacking. For stacked classification, all spectra were preprocessed using Savitzky–Golay second derivative smoothing with a default window size of 15 wavelengths followed by mean centering inside the cross-validation.

2.2 Genetic algorithm for pattern recognition analysis and feature selection

The pattern recognition GA [13-18] identifies a set of spectral features that optimize the separation of the classes in a plot of the two or three largest principal components (PCs) of the data. Because PCs maximize variance, the bulk of the information encoded by the selected features is about differences between the classes in the data set. Chance classification is not a serious problem as the bulk of the variance or information content of the features selected is about the classification problem of interest. In addition, the GA focuses on those classes and/or samples that are difficult to classify as it trains using a form of boosting to modify the fitness landscape. Boosting minimizes the problem of convergence to a local optimum because the fitness function of the GA is changing as the population evolves toward a solution. Over time, samples that consistently classify correctly are not as heavily weighted in the analysis as samples that are difficult to classify. The pattern recognition GA learns its optimal parameters in a manner similar to a neural network. The algorithm integrates aspects of strong and weak learning to yield a “smart” one-pass procedure for feature selection and classification. Further details about the pattern recognition GA can be found elsewhere [19-21].

3 EXPERIMENTAL

Infrared spectra from the PDQ library for General Motors (GM) automobiles between the years 2000 and 2006 were collected using four different spectrometers: Bio-Rad (Hercules, CA, USA), Thermo-Nicolet (Madison, WI, USA) 40A, Bio-Rad 60A and two Thermo Nicolet 6700 Fourier transform IR spectrometers. Each IR spectrometer was nominally run at a 4-cm−1 resolution. A major challenge in this study is that IR spectra of the clear coats were initially not properly aligned along their x or y axes as these spectra were collected on different IR spectrometers. There are differences in the alignment of the optical systems as the spectrometers are from different vendors and were manufactured in different years. Careful and judicious preprocessing of the data is therefore necessary for the development of the search prefilters. In this study, spectral line shapes between instruments were matched using convolution and deconvolution functions [22] developed with Nicolet's OMNIC software system. An instrumental line function representative of the two Thermo Nicolet instruments and developed by OMNIC was applied to the Bio-Rad spectra to ensure that all measurements made by the Bio-Rad instrument were comparable to spectra collected on the two Thermo Nicolet instruments. This ensured wavelength alignment along the x-axis for all clear coat spectra of GM automobiles between the years 2000 and 2006. For IR spectra of clear coat samples common to both the Bio-Rad and Thermo Nicolet instruments, subtraction of their spectra after performing this alignment procedure yielded zero at each point.

For alignment along the x-axis (wavelength) of the spectra, application of line functions to all spectra ensured that the number of data points per spectrum was the same. To ascertain that all spectra were properly aligned with respect to each other, the spectrum of a known sample measured on both the Bio-Rad and Thermo Nicolet instruments was compared through vector subtraction, which yielded a data vector with each element of the vector equal to zero. Without aligning the IR spectra, discriminating both plant groups and plants would have been impossible, as selected features would not have been in tandem with the informative wavenumbers across all spectra.

For alignment along the y-axis (transmittance) of the spectra, the editor in OMNIC was used to ensure that all IR spectra started from the same transmittance value. The thickness of the sample and the pressure applied by the transmission diamond cell in collecting the spectra were such that an absorbance of unity was obtained for the carbonyl band (~1730 cm−1) in all paint spectra in the PDQ library. Spectral alignment along the y-axis using the carbonyl band was straightforward as there are no sloping baselines or baseline offsets in the diamond cell transmission spectra. As the diamond cell transmission spectra were of high quality, it was not necessary to correct the spectra using derivatives.

Prior to pattern recognition analysis, each IR transmission spectrum was normalized to unit length to adjust for variations in the optical path of the cell. IR spectra used in this study were obtained from 18 GM assembly plants in North America. Only the clear coat paint layer from metallic parts was used to develop search prefilters. Clear coats from plastic substrates (e.g., bumpers) were excluded as automotive paint is applied to these substrates in the plant that manufactures these components, not the plant where the vehicle is assembled. Table 1 designates the 18 GM manufacturing plants used in this study.

Table 1. General Motors assembly plants used to develop stacked classifiers as search prefilters

Plant ID | Plant | Make          | Line
1        | ARL   | CAD, CHE, GMC | SUB, YUK, ESD, CTA
3        | BOW   | CAD, CHE      | CVT, XLR
4        | DOR   | PON           | VTR, SIL, MTA, UPL, TAR
5        | FAI   | CHE, OLD, PON | GRA, MAL, ITR
8        | FOR   | CHE, GMC      | SLV, SIE
10       | HAM   | BUI, CAD, PON | BON, DEV, LUC, LES, SEV, ELD
12       | JAN   | GMC           | CTA, SUB, YUK
14       | LAN   | PON           | STS
16       | LIN   | CHE, GMC      | BZR, JMY, S10
17       | LRD   | PON           | SFR, CAV, COB, PST
18       | MOR   | CHE, GMC, SAA | JMY, ENV, 9S7, BZR, TBZ, SON
21       | ORI   | PON, BUI      | BON, PG6, LES, AUR, PKA
22       | OSH   | GMC, PON      | ALL, REG
23       | PON   | CHE, GMC      | SLV, SIE, SIL
24       | RAM   | BUI, CHE, PON | CAV, SFR, RZV, AZT, HHR
25       | SHR   | CHE, GMC      | S10, COL, SON
26       | SIL   | CHE, GMC, SAA | AVL, SUB, YXL
27       | SPH   | STR           | SSL, ION, SC1, SC2, SL1, VUE

Plants: ARL = Arlington, BOW = Bowling Green, DOR = Doraville, FAI = Fairfax, FOR = Fort Wayne, HAM = Hamtramck, JAN = Janesville, LAN = Lansing, LIN = Linden, LRD = Lordstown, MOR = Moraine, ORI = Orion, OSH = Oshawa, PON = Pontiac, RAM = Ramos Arizpe, SHR = Shreveport, SIL = Silao, SPH = Spring Hill.
Makes: CAD = Cadillac, CHE = Chevrolet, GMC = General Motors Corporation, OLD = Oldsmobile, BUI = Buick, SAA = Saab, STR = Saturn.
Lines: SUB = Suburban, YUK = Yukon, ESD = Escalade, CTA = Tahoe, CVT = Corvette, VTR = Venture, SIL = Silhouette, MTA = Montana, UPL = Uplander, TAR = Terraza, GRA = Grand Prix, MAL = Malibu, ITR = Intrigue, SLV = Silverado, SIE = Sierra, BON = Bonneville, DEV = Deville, LUC = Lucerne, LES = LeSabre, SEV = Seville, ELD = Eldorado, BZR = Blazer, STS = Starfire, JMY = Jimmy, SFR = Sunfire, CAV = Cavalier, COB = Cobalt, PST = Pursuit, ENV = Envoy, 9S7 = Saab Truck, TBZ = Trailblazer, SON = Sonic, PG6 = Firefly, AUR = Aurora, PKA = Park Avenue, ALL = Allure, REG = Regal, RZV = Rendezvous, AZT = AzteK, HHR = Hummer, COL = Colorado, AVL = Avalanche, YXL = YukonXL.

4 RESULTS AND DISCUSSION

A hierarchical classification scheme formulated from a visual inspection of the data was used to develop the search prefilters as the simultaneous classification of all 18 assembly plants was not possible. The spectra were initially divided into two categories on the basis of the carbonyl band at 1730 cm−1. In one category, the carbonyl band in each IR spectrum is a singlet (Plant Groups 1, 3 and 4), whereas in the other category, the carbonyl band is a doublet (Plant Groups 2 and 5). An examination of the expanded fingerprint region (2000–400 cm−1) for these two categories reveals three distinct spectral patterns for the first category and two distinct spectral patterns for the second category with each pattern designated as a specific plant group. An unknown is classified as to its plant group using a single search prefilter, and then a second search prefilter is used to identify the specific assembly plant or assembly plants within the plant group to which membership of the unknown is assigned. Assembly plants comprising each plant group are listed in Table 2, and clear coat paint spectra that comprised the training set and validation set used in the development of the plant group and plant search prefilters for the PDQ database are summarized in Table 3.

Table 2. Manufacturing plants comprising each plant group

Plant group | Plant ID number        | Manufacturing plant
1           | 1, 4, 5, 8, 14, 18, 23 | ARL, DOR, FAI, FOR, LAN, MOR, PON
2           | 3, 10, 21              | BOW, HAM, ORI
3           | 16, 17, 22, 25         | LIN, LRD, OSH, SHR
4           | 12                     | JAN
5           | 24, 26, 27             | RAM, SIL, SPH

ARL = Arlington, DOR = Doraville, FAI = Fairfax, FOR = Fort Wayne, LAN = Lansing, MOR = Moraine, PON = Pontiac, BOW = Bowling Green, HAM = Hamtramck, ORI = Orion, LIN = Linden, LRD = Lordstown, OSH = Oshawa, SHR = Shreveport, JAN = Janesville, RAM = Ramos Arizpe, SIL = Silao, SPH = Spring Hill.

Table 3. Clear coat paint spectra in the training set and validation set for plant group search prefilter

            | Number of spectra
Plant group | Training set samples (Thermo Nicolet) | Validation set samples (Bio-Rad)
1           | 78                                    | 80
2           | 20                                    | 31
3           | 69                                    | 51
4           | 6                                     | 13
5           | 21                                    | 43
Total       | 194                                   | 221
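
The two-stage scheme described above (plant group first, then assembly plant within the group) can be sketched as a simple dispatch. The plant group membership comes from Table 2. Everything else is a hypothetical stand-in for illustration: the function names, the dictionary-based "spectrum" and the toy classifiers are not the prefilters developed in this study.

```python
# Plant group -> plant ID numbers, taken from Table 2.
PLANT_GROUPS = {
    1: [1, 4, 5, 8, 14, 18, 23],
    2: [3, 10, 21],
    3: [16, 17, 22, 25],
    4: [12],
    5: [24, 26, 27],
}

def hierarchical_search(spectrum, group_prefilter, plant_prefilters):
    """Return (plant_group, candidate_plant_ids) for an unknown spectrum.

    group_prefilter: callable mapping a spectrum to a plant group number.
    plant_prefilters: dict mapping a plant group to a callable that returns
        a list of candidate plant IDs. Groups without an entry fall back to
        every plant in the group.
    """
    group = group_prefilter(spectrum)
    second_stage = plant_prefilters.get(group)
    if second_stage is None:
        return group, list(PLANT_GROUPS[group])
    return group, second_stage(spectrum)

# Hypothetical stand-in classifiers, for illustration only.
def toy_group_prefilter(spectrum):
    return spectrum["group"]

def toy_group1_prefilter(spectrum):
    # A binary decision within Plant Group 1: Plant 18 versus the rest.
    if spectrum["is_plant_18"]:
        return [18]
    return [1, 4, 5, 8, 14, 23]

prefilters = {1: toy_group1_prefilter}
result_a = hierarchical_search({"group": 1, "is_plant_18": True},
                               toy_group_prefilter, prefilters)
result_b = hierarchical_search({"group": 5, "is_plant_18": False},
                               toy_group_prefilter, prefilters)
```

Each stage only has to separate a few classes at a time, and the library spectra passed to the final search are narrowed to the candidate plants returned by the second stage.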

The first step in this study was to develop a search prefilter to classify the clear coats by plant group. A five-way classification study was undertaken using the Thermo Nicolet IR spectra. Each IR spectrum in the training set was partitioned into n adjacent spectral intervals of equal length where the number of intervals was varied from 2 to 25. For the training set, n PLS discriminant models were developed between the class membership of the samples and each of the n spectral intervals. The performance of each PLS model was evaluated with the contribution of each PLS model to the overall discriminant weighted according to the cross-validated error rate of the model. Figure 1 summarizes the cross-validated error rates for the 194 Thermo Nicolet spectra from Plant Group 1 as a function of the number of spectral intervals and the number of latent variables for each PLS model. The cross-validated error rate was 0% when the number of latent variables for each interval was greater than 3. Double cross-validation identified specific wavelength windows in each spectrum used for stacking. The number of spectral intervals used for stacking varied from 2 to 4. The pattern recognition GA was also able to correctly classify every Thermo Nicolet IR spectrum in the training set (Figure 2).

Figure 1. Cross-validated error rates for the 194 Thermo Nicolet spectra (circle, training set) and 221 Bio-Rad spectra (square, validation set) as a function of the number of spectral intervals and the number of latent variables (LVs) for each partial least squares model: (a) Plant Group 1, (b) Plant Group 2, (c) Plant Group 3, (d) Plant Group 4 and (e) Plant Group 5. SPLSDA, stacked partial least squares discriminant analysis.

Figure 2. Plot of the two largest principal components of the 11 wavelengths identified by the pattern recognition GA. Each training set sample and each validation set sample is represented as a point in the principal component plot. For the training set, T-1 = Plant Group 1, T-2 = Plant Group 2, T-3 = Plant Group 3, T-4 = Plant Group 4 and T-5 = Plant Group 5. For the validation set, P-1 = Plant Group 1, P-2 = Plant Group 2, P-3 = Plant Group 3, P-4 = Plant Group 4 and P-5 = Plant Group 5.

Figure 1 also summarizes the error rate for the 221 (Bio-Rad) IR spectra that comprised the validation set. An error rate of 0% for each plant group can be achieved when the number of latent variables used to model the wavelength windows comprising each stacked model is 5. The PC plot developed from the 11 wavelengths identified by the pattern recognition GA was also able to correctly classify every sample in the validation set (Figure 2). When linear discriminant analysis was used to develop a classifier for these same 11 wavelengths, 100% correct classification was again achieved for both the training set and validation set (Table 4).

Table 4. Linear discriminant analysis for plant groups using 11 spectral features

            | Training set samples (Thermo Nicolet) | Validation set samples (Bio-Rad)
Plant group | Samples | Misses | Success (%)         | Samples | Misses | Success (%)
1           | 78      | 0      | 100                 | 80      | 0      | 100
2           | 20      | 0      | 100                 | 31      | 0      | 100
3           | 69      | 0      | 100                 | 51      | 0      | 100
4           | 6       | 0      | 100                 | 13      | 0      | 100
5           | 21      | 0      | 100                 | 43      | 0      | 100
Total       | 194     | 0      | 100                 | 221     | 0      | 100

The plant group study yielded a successful outcome because the appropriate feature selection techniques were used in conjunction with classification methods. When univariate feature selection methods (e.g., variance weights and Fisher weights) or multivariate feature selection methods (e.g., modeling power and discriminatory power) were computed, they did not produce classification results as clear-cut as those reported here. This indicates there is some challenge to this classification problem. If variable selection techniques are not used, the classification success rates obtained for the plant group are markedly lower.

The next step in this study was to develop search prefilters to classify the IR spectra by assembly plant for Plant Groups 1, 2 and 3. Plant Group 4 contains only a single assembly plant, and the IR spectra from the three assembly plants comprising Plant Group 5 cannot be differentiated as their spectra are superimposable. In this phase of the study, Plant Groups 1, 2 and 3 were first investigated using the pattern recognition GA to search for significant structure in the data using PC plot and k-nearest neighbor (PCKaNN) to score the features. We selected the pattern recognition GA as our benchmark because the pattern recognition GA is not limited to using mathematics for only modeling but is also capable of functioning as a data microscope to sort, to probe and to look for hidden relationships in the data as well as to uncover the class structure of the data.

Because IR spectra within a plant group are more similar than IR spectra from different plant groups, more powerful preprocessing methods were judged to be necessary to extract information about the assembly plant from the IR spectra of the clear coats. The Symlet6 mother wavelet at the eighth level of decomposition, that is, 8Sym6, was implemented using the discrete wavelet transform and applied to each vector-normalized IR spectrum to denoise and deconvolute the IR data into wavelet coefficients. The criterion used to select this mother wavelet is based solely on the ability of 8Sym6 to extract information about the assembly plant from the spectra. A decrease in the ability of the GA to correctly classify the IR spectra was observed when other mother wavelets were used to denoise and deconvolute the spectra.

Figure 3 shows a plot of the two largest PCs of the 26 wavelet coefficients identified by the pattern recognition GA for the assembly plants comprising the first plant group (Table 5). Each IR spectrum is represented as a point in the PC plot of the data. Plant 18 (Moraine OH) is well separated from the other assembly plants in the PC plot. IR spectra from the other six manufacturing plants (Arlington TX, Doraville GA, Fairfax KS, Fort Wayne IN, Lansing MI and Pontiac MI) are superimposable, which prevented further discrimination by assembly plant for these clear coats. Although the pattern recognition GA was parameterized to search for wavelet coefficients to separate all seven assembly plants, the class structure of the data detected by the GA when performing feature selection indicated that only a single assembly plant (Plant 18, Moraine OH) can be identified among the seven assembly plants that constitute this plant group. This result was verified when individual spectra from the other six plants were examined visually and overlaid for comparison.

Figure 3. Plot of the two largest principal components of the 26 wavelet coefficients identified by the pattern recognition genetic algorithm for manufacturing plants comprising the first plant group. Each training set sample and each validation set sample is represented as a point in the principal component plot. For the training set, T-1 = Arlington TX, Doraville GA, Fairfax KS, Fort Wayne IN, Lansing MI and Pontiac MI and T-18 = Moraine OH. For the validation set, P-1 = Arlington TX, Doraville GA, Fairfax KS, Fort Wayne IN, Lansing MI and Pontiac MI and P-18 = Moraine OH.

Table 5. Training set and validation set for Plant Group 1

                   | Number of samples
Plants             | Training set (Thermo Nicolet) | Validation set (Bio-Rad)
18                 | 17                            | 13
1, 4, 5, 8, 14, 23 | 61                            | 67
Total              | 78                            | 80

A validation set of 80 IR spectra (Table 5) was employed to assess the predictive ability of the 26 wavelet coefficients identified by the pattern recognition GA. We chose to map the 80 spectra directly onto the PC map defined by the 78 IR spectra of the training set and the 26 wavelet coefficients identified by the pattern recognition GA. Figure 3 also shows the validation set samples projected onto the PC map developed from the training set data. Each projected sample is in a region of the map with paint samples that have the same class label: either Plant 18 or Plants 1, 4, 5, 8, 14 and 23.

Figure 4 shows a plot of the two largest PCs of the 20 IR spectra of the training set and the 24 wavelet coefficients identified by the pattern recognition GA for assembly plants comprising Plant Group 2 (Table 6). Each IR spectrum is represented as a point in the plot. All three manufacturing plants (Bowling Green KY, Hamtramck MI and Orion MI) form distinct and well-separated clusters in the PC plot. Only one training set sample (Hamtramck MI) is misclassified.

Figure 4. Plot of the two largest principal components of the 20 training set IR spectra and the 24 wavelet coefficients identified by the pattern recognition genetic algorithm for the manufacturing plants comprising the second plant group. Each training set sample and each validation set sample is represented as a point in the PC plot. For the training set, T-3 = Bowling Green KY, T-10 = Hamtramck MI and T-21 = Orion MI. For the validation set, P-3 = Bowling Green KY, P-10 = Hamtramck MI and P-21 = Orion MI.

Table 6. Training set and validation set for Plant Group 2

       | Number of samples
Plants | Training set (Thermo Nicolet) | Validation set (Bio-Rad)
3      | 6                             | 10
10     | 9                             | 13
21     | 5                             | 8
Total  | 20                            | 31

A validation set of 31 IR spectra (Table 6) was employed to assess the predictive ability of the 24 wavelet coefficients identified by the pattern recognition GA. The 31 IR spectra were projected onto the PC plot defined by the 20 IR spectra of the training set and these 24 wavelet coefficients (Figure 4). All validation set samples (except for a Hamtramck MI clear coat) are in a region of the map with paint samples that have the same class label.

Figure 5 shows a plot of the two largest PCs of the 69 training set samples and the five wavelet coefficients identified by the pattern recognition GA for manufacturing plants comprising the third plant group (Table 7). Clear coats from Oshawa Ontario (Plant 22) are divided into three distinct sample groups: Chevrolets, Buicks and GMC trucks. The Buicks form a separate cluster in the PC plot as do the GMC trucks. Automobiles from Lordstown OH (Plant 17) also cluster in a distinct region of the PC map of the data. There is a fourth cluster consisting of Chevrolets from Oshawa Ontario (Plant 22), automobiles from Linden NJ (Plant 16) and trucks from Shreveport LA (Plant 25).

Figure 5. Plot of the two largest principal components of the 69 training set IR spectra and five wavelet coefficients identified by the pattern recognition genetic algorithm for manufacturing plants comprising the third plant group. Each training set sample and each validation set sample is represented as a point in the PC plot. For the training set, T-1 = Plant 17, T-2 = Plant 22 (GMC trucks), T-3 = Plant 22 (Buick automobiles) and T-4 = Plants 16, 22 (Chevrolet automobiles) and 25 (GMC trucks). For the validation set, P-1 = Plant 17, P-2 = Plant 22 (GMC trucks), P-3 = Plant 22 (Buick automobiles) and P-4 = Plants 16, 22 (Chevrolet automobiles) and 25 (GMC trucks).

Table 7. Training set and validation set for Plant Group 3
Plants                                      Number of samples
                                            Training set (Thermo Nicolet)    Validation set (Bio-Rad)
17                                          19                               16
22 (Trucks)                                 13                               8
22 (Buick cars)                             9                                5
16, 22 (Chevrolet cars), 25 (all trucks)    28                               25
Total                                       69                               54

Figure 5 also shows the 54 validation set samples (Table 7) projected onto the PC plot developed from the 69 IR spectra of the training set and the five wavelet coefficients identified by the pattern recognition GA. Each validation set sample lies in a region of the map with other paint samples that have the same class label. For Plant Group 3, wavelet-transformed IR spectra of clear coats were differentiated by assembly plant and also by model and line for a given assembly plant.

The significance of the GA runs is twofold: (1) search prefilters can be developed to extract information from clear coats independent of the instrument used to generate the data and (2) the manufacturing plant responsible for the clear coat paint layer can be narrowed down to a single plant or a few assembly plants. Wavelets played a pivotal role in uncovering information about the assembly plant from the IR spectra through deconvolution of overlapping spectral responses.
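How a wavelet decomposition separates broad and narrow spectral contributions can be shown with a minimal sketch. The Haar wavelet is used here purely for brevity and is an assumption of the sketch, not necessarily the mother wavelet used in this study; the function names and the toy spectrum are likewise illustrative.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail): scaled pairwise sums carry the
    low-frequency content, scaled pairwise differences the high-frequency.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Multilevel decomposition ordered as [approx_L, detail_L, ..., detail_1]."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)   # coarser details go in front
    return [approx] + details

# Toy "spectrum" (hypothetical values): flat baseline plus one sharp feature.
spectrum = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0]
coeffs = haar_decompose(spectrum, levels=2)
```

In this toy case the flat baseline is carried by the approximation coefficients, whereas the sharp feature produces nonzero detail coefficients only at its own position, so a feature selection method can pick out individual coefficients tied to narrow spectral features.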

Search prefilters were also developed for the IR spectra in each of the three plant groups investigated by the pattern recognition GA using the stacked PLS discriminants. For Plant Group 1, the development of the search prefilter for assembly plant focused on the binary classification problem: Plant 18 (Moraine OH) versus the other six assembly plants (Arlington TX, Doraville GA, Fairfax KS, Fort Wayne IN, Lansing MI and Pontiac MI). Previous efforts to separate all seven assembly plants using the stacked classifiers indicated significant overlap between the classes in the data. Figure 6 summarizes the error rates for both the training and validation sets using the stacked PLS models for the binary classification problem investigated. The cross-validated error rate for the training set was 0% when nine spectral intervals identified by double cross-validation and six latent variables for each of the intervals were used for stacking. These same conditions yielded an error of 18% for the validation set. By comparison, the classification success rates obtained by the pattern recognition GA for Plant 18 are 100% for both the training set and validation set.

Figure 6.

Cross-validated error rate (circles) and prediction error (squares) for the Moraine OH plant as a function of the number of latent variables (LVs) for the nine wavelength windows identified by double cross-validation for the stacked partial least squares classifier. SPLSDA, stacked partial least squares discriminant analysis.

For Plant Group 2, the development of a search prefilter for assembly plant required the solution of a three-way classification problem (Bowling Green KY/Plant 3, Hamtramck MI/Plant 10 and Orion MI/Plant 21). Training set and validation set results for the stacked PLS classifier developed from the IR spectra of Plant Group 2 are summarized in Figure 7. The error rate for each assembly plant in the training set is 0% when four latent variables are used to model the wavelength windows used for stacking. For the validation set, the average error rates are 38% (Bowling Green KY/Plant 3), 2% (Hamtramck MI/Plant 10) and 16% (Orion MI/Plant 21). By comparison, the pattern recognition GA identified 24 features that correctly classified all but one sample in each of the training and validation sets. The lone training set sample and lone validation set sample incorrectly classified by the GA were from the Hamtramck MI assembly plant.
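For comparison with the stacked PLS error rates quoted above, the GA result of one misclassified sample per set can be converted to error rates using the sample counts in Table 6. The short sketch below does only this arithmetic; the helper name `error_rate` is illustrative.

```python
def error_rate(n_wrong, n_total):
    """Percentage of samples assigned to the wrong class."""
    return 100.0 * n_wrong / n_total

# Sample counts from Table 6; the GA misclassified one sample in each set,
# and both misclassified samples came from Hamtramck MI (Plant 10).
overall_train = error_rate(1, 20)    # 5.0
overall_valid = error_rate(1, 31)    # about 3.2
plant10_train = error_rate(1, 9)     # about 11.1
plant10_valid = error_rate(1, 13)    # about 7.7

print(f"GA overall: training {overall_train:.1f}%, validation {overall_valid:.1f}%")
print(f"GA Plant 10: training {plant10_train:.1f}%, validation {plant10_valid:.1f}%")
```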

Figure 7.

Cross-validated error rates (circles) for the training set and prediction error rates (squares) for the validation set as a function of the number of latent variables (LVs) for each of the wavelength windows used in stacking: (a) Bowling Green KY, (b) Hamtramck MI and (c) Orion MI. SPLSDA, stacked partial least squares discriminant analysis.

Figure 8 shows the cross-validated (i.e., training set) and prediction (i.e., validation set) error rates for each assembly plant or plant subgroup comprising Plant Group 3, which were detected by the pattern recognition GA. For the training set, a search prefilter developed for assembly plants by the stacked PLS classifiers achieved classification success rates of 100% for all assembly plants or plant subgroups comprising Plant Group 3. However, error rates for the validation set were 20% (Lordstown OH/Plant 17), 25% (Buicks from Oshawa Ontario/Plant 22), 25% (GMC trucks from Oshawa Ontario/Plant 22) and 30% (automobiles from Linden NJ/Plant 16, Chevrolets from Oshawa Ontario/Plant 22 and trucks from Shreveport LA/Plant 25). By comparison, the pattern recognition GA achieved 100% correct classifications for both the training set and validation set.

Figure 8.

Error rates for the training set (circles) and prediction error rates (squares) for the validation set as a function of the number of latent variables (LVs) for the wavelength windows used in stacking: (a) Lordstown OH/Plant 17, (b) Buicks from Oshawa Ontario/Plant 22, (c) GMC trucks from Oshawa Ontario/Plant 22 and (d) automobiles from Linden NJ/Plant 16, Chevrolets from Oshawa Ontario/Plant 22 and trucks from Shreveport LA/Plant 25. SPLSDA, stacked partial least squares discriminant analysis.

5 CONCLUSION

The results from the assembly plant studies indicate that a data reduction strategy to enhance classification of mid-IR data by identifying specific wavelengths is superior to one based on identifying informative wavelength regions. Both the pattern recognition GA and the stacked classifiers used preprocessing to extract discriminatory information from the IR spectra about assembly plant. In a previous study, Brown [23] demonstrated that a second derivative preprocessing of spectral data prior to the application of stacked methods performed as well as wavelet-based methods. Stacked classifiers performed better at examining broader features such as those present in near-IR spectra, whereas heuristic methods such as the pattern recognition GA are better suited to extracting information from narrower spectral bands.

A two-step procedure for the development of search prefilters is superior to the use of stacked PLS discriminants in the wavelength domain, where the reciprocal of the cross-validated error rate for each PLS classifier is used as the weighting value for the stacked models. In the first step, wavelets are applied to decompose each IR spectrum into wavelet coefficients that represent both the high-frequency and low-frequency components of the signal. In the second step, a GA for pattern recognition analysis is used to identify the wavelet coefficients that contain information about the assembly plant of the paint samples. Search prefilters developed using specific wavelengths or wavelet coefficients outperformed search prefilters that utilized specific wavelength windows. The similarity of the IR spectra within a plant group and the noise present in the IR spectra may be obscuring information present in the wavelength windows. Clear coat paint spectra from the PDQ database may not be well suited for stacking as there are few spectral intervals that can reliably distinguish the different sample groups (i.e., assembly plants) in the data. Furthermore, the information contained in the IR spectra about the assembly plant may not be highly compartmentalized in an interval, which would also work against stacking. The large difference in the error rate between the validation set and training set for the stacked models is probably indicative of overfitting by the PLS models developed for stacking.
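The weighting scheme for the stacked models, in which each interval classifier is weighted by the reciprocal of its cross-validated error rate, can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in this study: the normalization of the weights, the `floor` parameter (needed to keep the reciprocal finite when an interval model reaches 0% cross-validated error), the +1/-1 class coding and all numerical values are hypothetical.

```python
def stacking_weights(cv_errors, floor=1e-3):
    """Normalized weights proportional to 1 / cross-validated error.

    `floor` is an assumption of this sketch: it keeps the reciprocal finite
    when an interval model reaches 0% cross-validated error.
    """
    inverse = [1.0 / max(e, floor) for e in cv_errors]
    total = sum(inverse)
    return [w / total for w in inverse]

def stacked_prediction(interval_predictions, weights):
    """Weighted average of the per-interval discriminant outputs."""
    return sum(w * p for w, p in zip(weights, interval_predictions))

# Hypothetical values for three wavelength windows (not from the study).
cv_errors = [0.10, 0.20, 0.40]            # cross-validated error rates
interval_predictions = [0.9, 0.6, -0.2]   # PLS outputs for one sample (+1/-1 coding)

weights = stacking_weights(cv_errors)
score = stacked_prediction(interval_predictions, weights)
label = 1 if score > 0 else -1
```

Under this scheme, an interval with a very low cross-validated error dominates the stacked prediction, so any optimism in the training set error estimates passes directly into the weights. This is consistent with the overfitting noted above.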

Acknowledgements

This research was supported by award no. 2010-DN-BX-K17 from the National Institute of Justice, Office of Justice Programs, US Department of Justice. The opinions, findings and conclusions or recommendations expressed in this publication/program/exhibition are those of the author(s) and do not necessarily reflect those of the Department of Justice.
