Paste, aggregate, or air? That is the question

The Ambassador Bridge between Detroit, Michigan, and Windsor, Ontario, has served for almost 100 years as North America's busiest international border crossing. In 2025, the Ambassador will be joined by the new Gordie Howe International Bridge. The Gordie Howe is a cable-stayed bridge, with two massive 220 m tall concrete towers on opposite banks of the Detroit River, a single clear span of 853 m, and 42 m of clearance over this busy waterway. To ensure durability in this harsh freeze-thaw environment, air-entrained concrete is specified throughout. And, to ensure the quality of air entrainment, ASTM C457 Procedure C (Contrast Enhanced Method) is employed. While a similar automated microscopic approach has been in use for well over a decade under EN 480-11, Determination of air void characteristics in hardened concrete, this is the first large-scale application of automated air void assessment in North American infrastructure. Under ASTM Procedure C, the air void characteristics are determined through digital image processing, while the paste content may be determined by mix design parameters, manual point count, or 'other means'. Of these three options, point counting is used for Gordie Howe; but in parallel, during each point count, the digital image coordinates and phase identifications for each evaluated stop are recorded. This allows for the training of a neural network for automated determination of paste content, as demonstrated here.

Under Procedure C, the air void system is quantified through digital image processing. To conduct Procedure C, an even layer of opaque black ink is applied to the polished surface, and white powder is pressed into the air voids. Any excess powder remaining on the surface is scraped and wiped away, leaving the air voids white, and the aggregate and hardened paste black. A digital image (or images) is collected, and a threshold is applied to distinguish air from nonair. While the methods for the determination of the air void content and parameters (void frequency, average chord length and specific surface) are prescribed in Procedure C, the method for the determination of the hardened paste content (a value needed to compute the spacing factor) is left open. Specifically, Procedure C instructs the analyst to 'determine the specimen paste content (p) using the mixture design parameters or by other means such as a point count' [3]. The standard's phrasing of 'other means' encompasses a variety of alternative approaches, such as machine learning and computer vision techniques.
However, standards governing current construction quality assurance practices have yet to catch up with these advances [5-8]. As such, ASTM C457 Procedure C was specified by the owners for the project examined here: the ongoing construction of the Gordie Howe International Bridge. In this case, for each concrete core or cast cylinder received, two sets of digital images were collected: one from the as-polished surface, and another after applying the dark ink and white powder contrast enhancement. The air void content and parameters were measured as prescribed in Procedure C by thresholding the contrast enhanced image. The paste content was measured by manual point count on a false-colour image compiled from the aligned as-scanned and post contrast enhancement images. Since the air content is already determined from the thresholded contrast enhanced image, the task of determining the paste content is simplified; a quick binary yes/no decision is made at each point count stop as to whether or not an aggregate phase is encountered. The paste content is then determined by subtracting the sum of the air content and the aggregate content from 100%. During this process, the pixel coordinates and yes/no identity of each point count stop are recorded. This approach allowed for the adoption of a machine learning strategy for paste content determination, as explored here.
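The by-difference paste determination, and its role in the spacing factor it feeds, can be sketched as follows. The formulas follow the familiar ASTM C457 spacing-factor expressions; the function names and input values are illustrative, not taken from the study.

```python
def paste_content(air_pct, aggregate_pct):
    """Paste content (%) by difference: whatever is neither air nor aggregate."""
    return 100.0 - air_pct - aggregate_pct

def spacing_factor(p, A, n, alpha):
    """Spacing factor (mm) in the familiar ASTM C457 form.
    p = paste content (%), A = air content (%),
    n = void frequency (voids per mm of traverse),
    alpha = specific surface (mm^-1, i.e. 4 / mean chord length)."""
    if p / A <= 4.342:
        # low paste-to-air ratio branch
        return p / (400.0 * n)
    # high paste-to-air ratio branch
    return (3.0 / alpha) * (1.4 * (1.0 + p / A) ** (1.0 / 3.0) - 1.0)
```

With, say, 6% air and 66% aggregate counted, the paste content comes out at 28% by difference, and the spacing factor then follows from the measured void frequency and specific surface.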

Experimental
From the Canadian side of the bridge, 96 mm diameter concrete cores were received for analysis, while from the United States side, cast 100 mm diameter × 200 mm length cylinders were received. The concrete samples were cut longitudinally in half with a water-cooled diamond saw, with one half further sectioned into two half-cylinder subsamples, yielding two separate 100 × 75 mm surfaces for subsequent polishing. The surfaces were initially ground flat by application of hand pressure on a water-cooled rotating 120 µm (100 grit) diamond-embedded platen (S.G. Frantz Co., Inc., Samson-Patmore Polishing Machine) to remove unevenness left over from the cutting procedure. Next, the surfaces were automatically lapped (Crane Packing Company, LapMaster 12) using a water and 12 µm (600 grit) SiC suspension for 30 min. After lapping, the sample surfaces were cleaned with a pressurised water spray, oven dried at 35 °C, and the surface reinforced through brush application of a 5:1 ratio mixture of nitrocellulose to acetone. The final polish was achieved by hand pressure on the same water-cooled rotating platen, but with fixed-abrasive 12 µm (600 grit) SiC adhesive-backed sandpaper, followed by cleaning with a pressurised water spray. Upon completion of polishing, any remaining nitrocellulose reinforcement was removed in an acetone bath, after which four small (∼5 × 5 mm) prismatic reflective stickers were placed at the corners of the prepared surfaces.
A flatbed scanner (EPSON Perfection V850 Pro) was used to scan the sample surfaces at a resolution of 125 pixels/mm (3175 dpi) in 24-bit RGB colour. After the initial scan, the sample surfaces were darkened by drawing a series of slightly overlapping lines with a black felt-tipped pen (Sharpie King Size permanent marker), and the air voids were packed with 2 µm wollastonite white powder (NYCO NYAD 1250) as described in ASTM C457 [3]. A stereo zoom microscope (Nikon SMZ-2T) was used to assist with manual darkening of voids found in the aggregate with the black felt-tipped pen. This final contrast enhanced surface was scanned as a grey-scale (8-bit) image, also at 3175 dpi resolution.

Image annotation and label generation
Plugins and scripts in FIJI [9] were used to perform image registration between the original and contrast enhanced images [10], as well as to generate labels on the scanned core sample data [11]. An example of an original scanned image and the corresponding aligned contrast enhanced image is provided in Figure 1.
For the routine quality assurance testing of cores and cylinders received from the bridge, the initial point counts (hereafter referred to as Point Count Method 1) were conducted by a trained user on a false-colour merged image of the combined original scan and black-white contrast enhanced scan, to create two label classes: aggregate and nonaggregate. A total of 500 points were collected from each scanned surface (1000 points total for each concrete core or cylinder), and the coordinates of each point count result were stored using a custom script in FIJI, allowing for segregation of the point count stop focus areas into aggregate and nonaggregate regions. For the purposes of this study, the same point counts were repeated, but this time three label classes, nonaggregate, fine aggregate and coarse aggregate, were identified (hereafter referred to as Point Count Method 2).
After the images were labelled, a FIJI macro-script was created to split the full source image into smaller square subimages based on the coordinates from the point count. From each sample, 500 nonoverlapping subimages were generated around each point count stop and sorted by their respective labels: aggregate (1) and nonaggregate (0) images for Point Count Method 1, and coarse aggregate (2), fine aggregate (1) and nonaggregate (0) for Point Count Method 2. Cropping between point count stops resulted in 850 × 850 pixel (6.8 × 6.8 mm) frames, with ∼2500 labelled images in total used in this study (from five core/cylinder half subsamples). Figures 2 and 3 display the point count results. Figure 4 shows example windows from each of the five core/cylinder subsamples, and Figure 5 summarises the results of the point count labelling process. Although the same operator performed both point counts, using the exact same point count stop coordinates, it is clear from Figure 5 that some error is introduced when making a distinction between aggregate and nonaggregate. This could partly be attributed to biased decision making when the cross-hair is at the border between an aggregate and a nonaggregate phase.
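The windowing step above can be sketched in NumPy. The 850-pixel frame size matches the text; the function name and the edge-of-image handling are illustrative assumptions, since the study's FIJI macro is not reproduced here.

```python
import numpy as np

def extract_window(image, x, y, size=850):
    """Crop a size x size subimage centred on the point count stop at (x, y).
    Returns None if the window would run off the edge of the scanned image."""
    half = size // 2
    top, left = y - half, x - half
    if top < 0 or left < 0 or top + size > image.shape[0] or left + size > image.shape[1]:
        return None
    return image[top:top + size, left:left + size]
```

Each extracted window would then be filed under its point count label (0, 1 or 2) for use as a training example.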

Image enhancement and feature synthesis for machine learning model
Enhancing images for a machine learning model is an important step that directly impacts the model's ability to learn, generalise, and perform accurately on unseen data [12]. Image enhancement techniques, such as normalisation, contrast adjustment, and noise reduction, improve image quality and consistency across the dataset, making it easier for the model to identify and learn the underlying patterns and features essential for classification or detection tasks.
Distinguishing phases within concrete images is often affected by similar colouration between aggregate and binder, or by other factors such as dust, lighting changes and sample preparation artefacts, all of which can be captured during image collection and all of which can contribute to low phase contrast in the image [8]. In addition to making it difficult for convolutional neural network (CNN) models to recognise specific patterns and meaningful features in the data, a homogeneous dataset may not generalise well to new and different data, impacting performance and limiting the potential applications of such models.
To address these issues, particularly for the problem of phase identification in concrete, image preprocessing steps are often applied [8]. In this study, data enhancement was conducted in several stages. Prebuilt functions from Python image processing libraries, namely the Open Source Computer Vision Library (OpenCV) and scikit-image (part of the SciPy ecosystem), were applied, providing robust toolkits for enhancing image data diversity and quality [13,14]. As part of the ASTM C457 Procedure C process, the contrast enhanced surface is thresholded to isolate air voids. This thresholding converts the 8-bit contrast enhanced surface image to a binary image (Figure 6). This manipulation enhances the representation of the void space and can be used to add detail that helps isolate the nonaggregate phases (paste and air) from the fine and coarse aggregate labels.
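A minimal NumPy sketch of this thresholding step follows (OpenCV's `cv2.threshold` performs the same operation). The threshold value of 128 is an illustrative placeholder, not the value used in the study.

```python
import numpy as np

def binarise(gray, threshold=128):
    """Convert an 8-bit grey-scale contrast enhanced scan to a binary image:
    pixels at or above the threshold (white powder-filled voids) -> 255,
    pixels below it (black inked paste and aggregate) -> 0."""
    return np.where(gray >= threshold, 255, 0).astype(np.uint8)

def air_content(binary):
    """Area fraction of air voids (%) from the binary image,
    per the Procedure C threshold approach."""
    return 100.0 * (binary == 255).mean()
```

In practice the threshold would be chosen per Procedure C; the binary image also supplies the air void mask reused in the false-colour composite described below.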
For the original as-scanned 24-bit RGB images, the distribution of pixel values is continuous with little variation, making it difficult to draw boundaries between cement and aggregate phases (Figure 7A). After applying normalisation and equalisation operations, the contrast between aggregate and nonaggregate is improved (Figure 7B). Image normalisation adjusts the pixel intensity values across all images to a uniform scale, reducing noise outliers, while image equalisation improves contrast by spreading out the most frequent intensity values. This is particularly useful in images where the background and foreground have continuous pixel values, adding localised contrast and making the background and foreground more distinguishable and features more discernible. After enhancing the image through equalisation, the green channel of the enhanced (equalised) RGB image was replaced with a black and white (BW) threshold image of the contrast enhanced surface (Figure 7C).
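The equalisation and green-channel substitution can be sketched as below (scikit-image's `exposure.equalize_hist` provides an equivalent equalisation); the array names are illustrative, and the per-channel CDF equalisation stands in for whatever library routine the study used.

```python
import numpy as np

def equalise(channel):
    """Histogram equalisation of one 8-bit channel: spread the most frequent
    intensity values across the full 0-255 range via the cumulative histogram."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    span = cdf.max() - cdf.min()
    if span == 0:  # constant image: nothing to spread
        return channel.copy()
    cdf = (cdf - cdf.min()) / span
    return (cdf[channel] * 255).astype(np.uint8)

def false_colour(rgb, air_binary):
    """Equalise each channel of the as-scanned RGB image, then replace the
    green channel with the binary air void image of the contrast enhanced scan."""
    out = np.stack([equalise(rgb[..., c]) for c in range(3)], axis=-1)
    out[..., 1] = air_binary
    return out
```

The resulting composite carries the equalised colour texture in the red and blue channels and an unambiguous air void mask in the green channel.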
Cropping labelled images to 75% (638 × 638 pixel), 50% (425 × 425 pixel) and 25% (213 × 213 pixel) subsets of their original size serves as an innovative approach to simulate the effects of a denser point count in the context of analysing image data with a CNN model.This method creates variations of the original dataset that mimic having more detailed or zoomed-in views of the subjects within the images.By doing so, the model is exposed to different scales and resolutions of the features, which allows the model to learn from a broader spectrum of feature representations, potentially improving its generalisation capabilities and robustness.
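The fractional crops can be produced by centre-cropping each frame around the point count cross-hair. Ceiling rounding reproduces the 638, 425 and 213 pixel sizes quoted above, though the exact rounding used in the study is not stated.

```python
import math
import numpy as np

def centre_crop(frame, fraction):
    """Centre-crop a square frame to the given fraction of its side length,
    keeping the point count cross-hair at the centre of the result."""
    side = frame.shape[0]
    size = math.ceil(side * fraction)
    start = (side - size) // 2
    return frame[start:start + size, start:start + size]
```

Applying this at fractions of 0.75, 0.50 and 0.25 to each 850-pixel frame yields the three reduced datasets used for the cropped-input models.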

Convolutional neural network (CNN) model architecture
A CNN model was built to perform classification on the labelled images generated from the point count coordinates, as illustrated in Figure 8. To train the model, 2500 enhanced RGB images in the same format as Figure 7C, all at identical magnification, across all five samples were used as input. This dataset was randomly divided into training, validation and test sets in the proportions 75%, 12.5% and 12.5%, respectively, where the annotation labels were taken directly from the user-input point count results (i.e. no further annotation algorithm or pixel-based annotation was added). As a consequence, images are categorised as either aggregate (fine or coarse) or nonaggregate according to the phase under the central cross-hair, despite the full frame potentially containing elements of different classes (phases) within it.
For both point count methods, four independent models were created, using the original enhanced images and the 75%, 50% and 25% cropped enhanced images respectively, with the same training/validation/test proportions across the models. Each subset was stratified based on the label to ensure equal proportions of all labels across the training, validation and test datasets. The training was performed over 400 epochs (64 batches per epoch) on a commercial GPU (NVIDIA GeForce RTX 4060, GDDR6).
The CNN model, designed in a Keras and TensorFlow framework [15], resized the image inputs to a standard size of 100 × 100 pixels across 3 colour channels. The architecture commenced with an input layer, transitioning into a convolutional layer (32 filters, kernel size 3 × 3) and a max pooling layer (2 × 2) to reduce the spatial dimensions to 50 × 50. Each convolutional sequence was followed by batch normalisation and a dropout layer to enhance regularisation. Batch normalisation, applied post convolution and pooling, aids in stabilising the learning process by normalising the inputs of each layer to have zero mean and unit variance, which helps accelerate training and reduce sensitivity to network initialisation [12]. A second convolutional layer, identical in configuration to the first, further reduced the dimensions to 25 × 25, followed by similar normalisation and dropout stages. The network then flattened the output to prepare for dense connections, leading into a 512-unit dense layer, another batch normalisation, and a dropout sequence. This pattern was repeated for a subsequent 256-unit dense layer. The final output was produced by a 3-unit dense layer with softmax activation to allow multiclass classification. Throughout, the model employed the Adam optimiser and a cross-entropy loss to aid efficient learning and generalisation. Unlike traditional stochastic gradient descent, Adam maintains per-parameter learning rates, adjusted based on running averages of recent gradients and squared gradients. This adaptability makes it well suited to handling sparse gradients and different parameter scales, contributing to the model's effective learning.
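A minimal Keras sketch of the architecture described above follows. Hyperparameters not stated in the text, such as the dropout rate, learning rate and exact cross-entropy variant, are illustrative placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=3, dropout=0.25):
    """Sketch of the described CNN: two conv/pool/batch-norm/dropout stages
    (100 -> 50 -> 25 spatially), then 512- and 256-unit dense stages,
    and a softmax output. The dropout rate is an assumed placeholder."""
    model = models.Sequential([
        layers.Input(shape=(100, 100, 3)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),        # 100 -> 50
        layers.BatchNormalization(),
        layers.Dropout(dropout),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),        # 50 -> 25
        layers.BatchNormalization(),
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(dropout),
        layers.Dense(256, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(dropout),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Sparse cross-entropy assumed here since labels are integer class ids.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then call `model.fit` on the stratified training split, with the validation split monitored per epoch.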

Model performance and results
In machine learning classification tasks, imbalanced datasets can make it difficult to present a meaningful universal metric to assess model performance. Instead, a range of evaluation metrics, such as accuracy, precision, recall and the F1 score, offers holistic insight into model performance [12]. These metrics originate from the confusion matrix, which is itself defined by four parameters: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), with each class receiving a detailed breakdown. Both accuracy and precision were considered as the main performance metrics, to evaluate overall model performance and to prioritise optimisations that improved performance on any minority classes. Tables 1 and 2 summarise the model performance results run on the different input images for both the training and test data, for 2-class data from Point Count Method 1 and 3-class data from Point Count Method 2. For the 2-class model, cropping the images resulted in marginal improvements for both accuracy and recall (i.e. prioritising finding the positive instances of a sample in a dataset). But when the model performance is split by class, although high scores are achieved for the general aggregate (1) category, the scores for nonaggregate/cement (0) range from 42% to 56%, which is comparable to random guessing. Another challenge with the 2-class model is that there are roughly twice as many aggregate-labelled images as nonaggregate-labelled images, a class imbalance that further complicates training.
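The metrics named above derive from the confusion matrix as follows (a NumPy sketch; rows are assumed to hold the true labels and columns the predictions):

```python
import numpy as np

def classification_metrics(cm):
    """Overall accuracy plus per-class precision, recall and F1
    from a confusion matrix (rows = true labels, columns = predictions)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly predicted, per class
    fp = cm.sum(axis=0) - tp         # predicted as class c, actually another class
    fn = cm.sum(axis=1) - tp         # actually class c, predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1
```

For an imbalanced 3-class problem such as this one, the per-class precision and recall expose weaknesses (e.g. on the nonaggregate minority class) that a single overall accuracy figure would hide.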
In contrast, in Point Count Method 2 with the 3-class model, there is a marked improvement in performance upon implementing the cropping algorithm. Coarse aggregate (2) consistently shows strong performance across all input sizes, indicating reliable identification of this class, while the nonaggregate label shows significant improvements in accuracy from 50% cropping onwards (from 12% with the full-sized images, up to 90% with 25% cropped images on unseen test data). The precision and recall scores are lower for the nonaggregate and fine aggregate (1) labels, indicating some difficulty in differentiating between these two classes.
To gain deeper insight into the model's performance, a confusion matrix was generated for the best performing model, and a pair-wise comparison was also carried out between the different labels, as seen in Figure 9. The pair-wise analysis in Figure 9D reveals that fine aggregate images are most frequently mislabelled as coarse aggregate, where this mislabelled fraction (24.24%) was greater than the fraction of correctly identified fine aggregate (15.15%). There is a similar, though less pronounced, misclassification between nonaggregate and fines, with both false-positive and false-negative values prominent across both classes (about a quarter of the labels for both, Figure 9B).
Cropping images to highlight specific areas around the cross-hair mimics labelling techniques seen in other research that employs pixel-wise and instance segmentation [5,6], effectively minimising background noise by ensuring the focal phase occupies a larger portion of the image. While the coarse aggregate's substantial size led to consistent and accurate identification across various image dimensions, both the fine aggregate and nonaggregate phases (including voids and paste) showed improved results with reduced image sizes. By correlating the resized images to a corresponding point count density, we can depict the link between a basic CNN model's anticipated accuracy and the volume of images necessary for attaining such precision, as demonstrated in Figure 13. An insight from this analysis is the requirement of roughly 6500 labelled images to reach an accuracy close to 80% on novel concrete samples, which corresponds to labelling about 3.5 cores with 500 points each.

CONCLUSIONS
In this work, we proposed a machine learning method to help streamline the analysis of concrete air void systems (AVS) according to ASTM C457 Procedure C [3]. While the automated measurement of air voids is accomplished using the simple threshold approach outlined in Procedure C, in practice, quantification of the paste content is accomplished indirectly by conducting manual yes/no point counts to determine the aggregate content. The paste content is then inferred from knowledge of the aggregate and air contents. Since the results of these simple yes/no point counts are recorded, with stop identities and pixel coordinates known, an attempt was made to use this point count data to train a model for automated aggregate content determination (and in turn paste content). However, the simple yes/no labelling of whether a stop was aggregate or nonaggregate proved insufficient for proper model training and resulted in low accuracy and precision in identifying the nonaggregate phase. A different 3-class point count approach was then tried, distinguishing between coarse aggregate, fine aggregate and nonaggregate, together with trimming images closer to the cross-hair location to simulate denser point counts; this resulted in improved model performance across all three classes, notably improving the nonaggregate accuracy by 60% on unseen test data. Ultimately, the CNN relies on the entire image frame to arrive at a decision, whereas a human point count operator is concerned only with the identity of the phase directly under the point count cross-hairs. As a result, frames that contain a combination of coarse aggregate, fine aggregate and nonaggregate (paste and air) phases can create challenges for the CNN. Nevertheless, by applying image preprocessing steps, expanding the dataset and enriching the point count density, the CNN demonstrates robust performance in handling diverse images from a complex real-world dataset.

F I G U R E 1
Sample US#1 original scanned surface (A) and after contrast enhancement (B).

F I G U R E 2
Labelled Point Count Method 1 results for sample US#1 where blue = nonaggregate, and pink = aggregate for cross-hairs (A) and with overlays (B).

F I G U R E 3
Labelled Point Count Method 2 results for sample US#1 where blue = nonaggregate, pink = coarse aggregate, and yellow = fine aggregate for cross-hairs (A) and with overlays (B).

F I G U R E 4
Example images extracted from point count analysis across the five core/cylinder half samples with the point count stop cross-hairs over coarse aggregate (top row), fine aggregate (middle row) and nonaggregate (bottom row).

F I G U R E 5
Paired class distributions across samples for Point Count Method 1 (M1: left side columns) and Point Count Method 2 (M2: right side columns).

F I G U R E 6
Portion of scanned image (A) with histogram (B) and threshold result (C).

F I G U R E 7
Original as-scanned image with corresponding 3D RGB histogram (A), same image after enhancement (B) and after substitution of the G-channel with binary air void image to create the false-colour combined image (C).

F I G U R E 8
Architectural overview of the convolutional neural network (CNN). This diagram illustrates the sequential layers of the CNN model, including convolutional layers, max pooling, dropout for regularisation, and dense layers leading to the softmax output for classification. The model is designed to process 100 × 100 pixel images for multiclass image recognition tasks.

Figure 10 illustrates the comparison between actual labels (A) and those predicted by the model (B). Figures 11 and 12 present instances of incorrect class identification. In Figure 11, the cross-hair is positioned at or near the boundary between coarse aggregate and paste, and the presence of a significant proportion of other materials within the frame potentially contributes to errors in the model's predictions. Likewise, in Figure 12, the misclassification of fine aggregates as coarse ones is noted, which could be attributed to the close proximity and high concentration of fine aggregates, leading to inaccuracies in the model's identification.

F I G U R E 9
Confusion matrix results on unseen test data for the best performing model, 3-class classification with 25% trimmed images, all values expressed as percentages (%). The all-classes confusion matrix (A) is further broken down into smaller pair-wise confusion matrices comparing two classes to each other: (B) nonaggregate (0) vs. fine aggregate (1); (C) nonaggregate (0) vs. coarse aggregate (2); (D) fine aggregate (1) vs. coarse aggregate (2). This view allows us to visualise that the model frequently struggles to differentiate between fine aggregates and coarse aggregates.

F I G U R E 1 0
Visualisation of interval test results performed on the US#1 sample withheld during training. Labels are colour-coded overlays where nonaggregate (0) = blue; fine aggregate (1) = yellow; coarse aggregate (2) = pink. (A) displays the ground truth labels, (B) are predictions generated by the 3-class 25% cropped image model.

F I G U R E 1 1
Examples of coarse aggregate images misclassified as fine aggregate in US#1.

F I G U R E 1 2
Examples of fine aggregate images misclassified as coarse aggregate in US#1.

F I G U R E 1 3
Summary of test accuracy as a function of image quantity for different label categories, across all the 3-class models. Nonaggregate and fine aggregate demonstrate increasing accuracy with a greater volume of input images.
TA B L E 1
Model performance on 2-class data, 0 = nonaggregate, 1 = aggregate, from Point Count Method 1. Note: Scores over 80% in the metrics are in bold.
This work was funded by the Natural Sciences and Engineering Research Council of Canada, Grant/Award Number RGPIN-2020-328326, and by an industry collaboration with WSP Golder Canada, who provided the concrete cylinder and core samples from the Gordie Howe International Bridge used in this study.