Learning-free handwritten word spotting method for historical handwritten documents

Word spotting on degraded and noisy historical documents can become a challenging task considering the computational time and memory usage required to scan the entire document image. This paper proposes a new effective technique for multi-language word spotting using two different feature extraction techniques, Histogram of Oriented Gradients (HOG) and Speeded Up Robust Features (SURF). First, regions of interest (ROIs) are extracted using a cross-correlation measure, and the extracted ROIs are re-ranked using feature extraction and matching methods. The algorithm handles two types of scenarios: segmentation-based and segmentation-free. It also facilitates the search for words that occur once as well as multiple times in the image. Evaluations were conducted on the George Washington and HADARA datasets using a standard evaluation method. The proposed methodology shows improved performance over contemporary techniques currently being used in the word spotting research field.


INTRODUCTION
With tremendous improvements in technology in recent times, the number of historical documents, handwritten and printed, that are being scanned has increased. Even with all the recent developmental trends, there has been little advancement in word spotting in these historical documents, especially in handwritten ones. Searching for a particular word using a query image or text string and indexing the words for further use in historical documents is a hot topic of research because of its inherent challenges. Word spotting can be characterized as locating a particular word in a historical document with reference to a query image (Query-by-Image) or string (Query-by-String). Digitization of historical publications can become quite difficult because of the low and degraded quality of the paper, noise in the documents (such as holes), dissemination of the ink all over the page, scribbling of notes in the margins, shadows, poor illumination while capturing the image, and the fact that the writing style, font size and alignment can differ considerably on each page of the document.
There are two distinct techniques followed in word spotting methods: (1a) segmentation-based, which depends on the segmented words of the entire page, obtained either from the available ground truth or using word-segmentation techniques, which is in itself another field of research; and (1b) segmentation-free, which is independent of any word segmentation on the page and involves searching through the entire page. Word spotting methods can also be categorized as (2a) training-based methods, which need labelled data to train a model to be used in word spotting, and (2b) training-free methods, which do not need any labelled data for word spotting.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
Feature extraction and matching algorithms are mainly used in the word spotting field to represent the words in the documents. Features are important pieces of information found in the document image, such as corners, edges, blobs, and pixel distributions; these are local features describing differences from their immediate surroundings in texture, colour, or intensity. Structural features are very useful in retrieving information from content such as the shape and structural components of the document images. An example of local-features-based extraction methods is SURF. The Histogram of Oriented Gradients (HOG), which captures the angle and direction of the gradients in the image, is very helpful in analysing the gradient features of the image. Useful information around the representative key points is detected, and a matching algorithm then uses it to match the key points between two images.
A learning-free word spotting method is proposed in this paper to reduce the computational time and the need for large annotated data. Deep learning methods require a relatively large amount of training data with ground truth, but in the case of historical documents the availability of annotated data is very limited; this inspired us to look into alternative learning-free methods. An effective two-stage word spotting method is discussed, utilizing a correlation measure together with feature extraction and matching techniques to spot the query image in the entire document image. The first stage involves measuring the correlation between the query image and the target image on which the query has to be spotted. A lower threshold is selected for better accuracy, which gives regions of interest (ROIs) smaller than the size of the entire page. In the second stage, feature extraction is done on the ROIs alone instead of the entire page, and a ranking and re-ranking is then performed on the ROIs based on the sets of features extracted. This paper contributes to word spotting in two different scenarios: (1) the whole image is segmented into individual word images, and (2) the whole image is considered without any word segmentation.
SURF feature extraction has been proved to work well with rotated images, as it is scale- and rotation-invariant. Most document images obtained from historical documents are skewed, so SURF suits historical documents better for this reason than other feature extraction methods like SIFT [15]. A further motive is to combine gradient-based features like SURF and HOG to see whether there is any improvement in performance over using either one of them alone as the feature extraction method. Pattern-based features like Gabor [26] and LBP [25] compare the document image and the query image by using a primitive in both images and comparing their representations, and it has been noted that these pattern-based features perform well when combined with gradient-based features. Hence the SURF and HOG feature sets were considered in the proposed method.
The rest of the paper is organized as follows. Section 2 reviews related works on word spotting and the different techniques used. Section 3 provides a detailed discussion of the proposed methodology with figures and flowcharts. The outline, structure, and description of the datasets used are given in Section 4, along with the experimental setup; a detailed narration of the results from the proposed method is also given there and compared against other efficient techniques. Section 5 explains the contribution under the discussions. Finally, the conclusion is presented in Section 6, with an outline of future work for the proposed method.

RELATED WORKS
Previous works on word spotting in historical documents can be distinguished as training-based or training-free and as word-segmentation-based or segmentation-free methods. Supervised machine-learning-based word spotting can be successful only when a training dataset is available, and a query involving out-of-vocabulary words can affect the performance. Word segmentation is itself another field of research interest, which makes segmentation-based word spotting far more difficult. In the rest of this section, works related to word spotting in handwritten documents are discussed.

Training-free methods
Mhiri et al. [1] proposed a training-free and segmentation-free technique by representing the document images in terms of features, learning these features hierarchically and then applying spherical k-means on the learned hierarchy. A two-stage matching system is then employed, in which (1) a sliding window method is used for pre-selection, as it is a faster approach, and (2) descriptions of the spatial organization of the regional features are then used to re-rank and present the closely matched words in the image. Srihari et al. [2] proposed a methodology exploiting binary shape features. Binary shape features are extracted for both the query image and the image of the handwritten document, and an efficient matching algorithm is then used to match the descriptors found around the extracted features. Finally, correlation distances between the matches are computed to find the top matching words. Experiments were carried out on an Arabic multi-writer dataset.
Zagoris et al. [3] proposed a word spotting method using local features that are document-oriented. The method does not require any training data, as it considers extraction of key points that are document-oriented and computes instrumental information present around the extracted features. Then, a matching technique using proximity is done locally based on the spatial context. The experimental results from this work are quite elaborate using the Bentham Dataset, Washington Dataset, and Barcelona Historical Handwritten Marriages Dataset (BH2M).
Yao et al. [4] devised a word spotting method in which a two-directional Dynamic Time Warping is applied on the HOG descriptors obtained from local features techniques such as SIFT [15], HOG [24], and LBP [25]. Here, HOG descriptors detected along the same row are normalized as a vector and the descriptors detected along the same column are normalized as another vector. Further, a two-directional DTW method is applied on the normalized vectors to determine the distance between the image used for querying and the normalized vectors found from the HOG descriptors. The results are shown on the GW dataset on a segmentation-free context.
Dey et al. [5] present a contemporary platform for word spotting that does not require any training data for learning. Local Binary Patterns are detected in an enormous number of historical pages and a spatial sampling is obtained from them for easy representation.
Rath and Manmatha [6,7], whose method is considered the benchmark in the field of word spotting, compared the features extracted from one-dimensional images using the Dynamic Time Warping method; pruning is done prior to applying DTW. The method was applied to the George Washington dataset with two different experimental setups. Kane et al. [8] reported a way of word spotting by pruning all the words on the page to discard those that are unlikely matches for the query image, followed by pre-processing. A score is then generated for the pruned words based on various comparisons: a high score should indicate a good match, and a low score a poor match. The George Washington dataset was used for experiments with three different setups. A pipeline consisting of pre-processing, feature extraction and matching was used by the authors in [19]; a similar pipeline is used in the proposed method as well.
Pantke et al. [9] gave a detailed description of the HADARA dataset, including its structure, line and word segmentation, ground truth creation, and transcription of the pages, and integrated different template-based word spotting methods into their proposed method. A detailed evaluation procedure for segmentation-free word spotting methods on the HADARA dataset has been provided with equations and explanations [10].
Faisal and AlMaadeed [11] recommended a segmentation-free solution to word spotting with query-by-example using normalized cross-correlation-based template matching. A sliding window of the size of the query image is applied over every pixel in the search image, and the normalized cross-correlation measure, with values from +1 to −1, is calculated for each pixel. The value of the threshold is adjusted to match any variations between the images. The HADARA dataset was used for conducting the experiments.

Training-based methods

Ghosh and Valveny [12] developed a segmentation-free word spotting method. In their paper, the Fisher vector is computed in combination with pyramidal histogram of characters (PHOC) labels. With this combination as input, a basic model is trained, which uses an SVM as a classifier.
Sudholt and Fink [13] presented a highly efficient Convolutional Neural Network architecture, which works well with either a query image or a text string as input. The architecture is based on the pyramidal histogram of characters labels of the training data. Some of the works matching the query and document images using deep learning can be found in [20,21], and Ref. [22] explains the architecture and working of word spotting using the different CNN networks available in the field.
The works mentioned above are either dependent on a large training dataset, or segmentation-based, which is itself still under research, or rely on feature extraction with very large computational time and memory usage. The method proposed in the current paper overcomes the above-mentioned problems to implement a training-free and segmentation-free word spotting method with minimal computational time and memory usage.

Overall workflow
The recommended approach considers word-segmented images as well as whole document images without word segmentation. Before entering the main process of word spotting, a high-level pre-processing step is necessary because of the poor quality of the images captured from handwritten historical documents. Figure 1 represents the comprehensive workflow diagram of the proposed method. The first step after pre-processing is the calculation of the correlation measure between the query image and the document image. In the segmentation-free scenario, ROIs are extracted in this first step taking the entire page as the input, whereas in the segmentation-based case, the correlation measure between the query image and each word image segmented from the document is used to discard unlikely word images. The second step is the extraction of features from the ROIs obtained in the first step. Both local and gradient features are extracted, and the combination of both is used to increase the accuracy in spotting a word closely related to the query image. The features are extracted sequentially: first SURF features are extracted and then HOG features. A ranking is produced using the extracted SURF features, and the ranks obtained are then re-ranked using the extracted HOG features. The local feature used in the methodology is SURF, and the gradient feature used is HOG; both sets of extracted features are scale- and rotation-independent, and the re-ranking of the ROIs is based on them.
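The rank-then-re-rank flow just described can be sketched generically. In this sketch, `surf_score` and `hog_distance` are hypothetical stand-ins supplied by the caller for the SURF-based matching score and the HOG correlation distance; the function itself only captures the two-stage ordering, not the feature computations.

```python
def two_stage_spot(rois, surf_score, hog_distance, top_k=10):
    """Two-stage ranking sketch: rank ROIs by a SURF-based score
    (higher is better), keep the top candidates, then re-rank those
    by HOG distance (lower is better)."""
    ranked = sorted(rois, key=surf_score, reverse=True)[:top_k]
    return sorted(ranked, key=hog_distance)
```

Any scoring functions with the right orientation (score to maximize, distance to minimize) can be plugged in, which is what allows the SURF and HOG stages to be developed and tuned independently.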

Pre-processing
Pre-processing the document image, though a separate area of research, is a critical step in word spotting. Figure 2 shows the steps used for pre-processing. Basic pre-processing, such as converting the RGB image into greyscale and thresholding it, is done for better clarity. The images are normalized to avoid any illumination and contrast differences, and the degraded, low-quality image is sharpened by passing it through a Gaussian filter with a fixed filter size of 9 × 9. Owing to degradation and bleeding of ink, there can be openings, and the writing on the scanned image can be very light and hardly visible. Morphological operations such as erosion and dilation are therefore carried out on the sharpened image: performing an erosion followed by a dilation closes the gaps, if any, and increases the visibility of the words in the document image.
The document image can have multiple sources of noise, unwanted fragments of letters, and unnecessary punctuation marks, which can increase the number of useless features extracted from the image. To avoid this, noise reduction needs to be performed. In this work, contours are drawn over the sharpened, morphologically modified image, and contour areas are calculated for all the contours detected. Contours whose areas are less than a threshold value of 400 are considered noise and are removed from the image. This threshold of 400 was chosen after numerous trial-and-error experiments on the document images used.
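Assuming the contour-area filter behaves like a connected-component area filter on the binarized image, the noise-removal step can be sketched in pure Python as follows. The function name, the 4-connectivity choice, and the list-of-lists image format are illustrative, not the paper's exact OpenCV implementation.

```python
from collections import deque

def remove_small_components(binary, min_area):
    """Remove connected foreground components smaller than min_area.

    `binary` is a list of rows of 0/1 ints; returns a cleaned copy.
    Approximates contour-area filtering with 4-connected labelling.
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] and not seen[sy][sx]:
                # flood-fill one component, collecting its pixels
                comp, q = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                if len(comp) < min_area:   # the paper uses min_area = 400
                    for y, x in comp:
                        out[y][x] = 0
    return out
```

On full pages the same idea is usually run via `cv2.findContours` and `cv2.contourArea`, which is what the text above describes.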

Extracting regions of interest
Feature extraction on the entire document image can become difficult, as it is hard to work with such a large amount of data. Deciding on the location of the matching word can become ambiguous and will also increase the computation time. To reduce these difficulties, ROIs are selected on the document image based on the cross-correlation measure obtained between the query image and the historical document (Figure 3). Template matching is used to calculate the cross-correlation measure in this case. The similarity between the query image and the document image is high if the calculated cross-correlation measure is close to 1. The template matching algorithm with TM_CCORR_NORMED as an input parameter is used to extract the ROIs from the document image; TM_CCORR_NORMED is used because the desired function was cross-correlation, which measures the distance between the query and the document image. In the template matching algorithm, a sliding window approach is employed with a window size equal to the size of the query image, and the normalized cross-correlation is calculated for each pixel of the document image. The results obtained from the normalized cross-correlation measure range from a maximum of +1 to a minimum of −1. A lower threshold value is decided based on the dataset used to increase the accuracy of the system. Though more false positives are detected during this step, a re-ranking step is introduced later to remove the false positives and increase the overall efficiency of the system. In most cases, the query image is part of the document image on which the word spotting is being done; hence, the normalized cross-correlation measure is selected, which gives competent accuracy. Normalized cross-correlation (NCC) can be defined as the dot product of two normalized vectors. For example, F(a,b) is the normalized vector from the query image f(a,b), and T(a,b) is the normalized vector returned as a result of the sliding window t(a,b) on the historical document image.
NCC(f, t) = (1/n) Σ_(a,b) [(f(a,b) − f̄)(t(a,b) − t̄)] / (σ_f σ_t)

where σ_f and σ_t are the standard deviations of the query image and of the document image window under the pixel-based sliding window, respectively, f̄ and t̄ are the corresponding means, and n is the total number of pixels in the window.
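As a minimal sketch, the mean-subtracted, standard-deviation-normalized measure defined above can be computed directly on two equal-size windows flattened to lists. One caveat: this zero-mean form corresponds to OpenCV's TM_CCOEFF_NORMED; TM_CCORR_NORMED, named in the text, omits the mean subtraction.

```python
import math

def ncc(f, t):
    """Normalized cross-correlation of two equal-size grey windows.

    f and t are flat lists of pixel values; returns a score in
    [-1, +1], where values near +1 indicate a likely match.
    """
    n = len(f)
    mf, mt = sum(f) / n, sum(t) / n
    sf = math.sqrt(sum((v - mf) ** 2 for v in f) / n)
    st = math.sqrt(sum((v - mt) ** 2 for v in t) / n)
    if sf == 0 or st == 0:          # flat window: correlation undefined
        return 0.0
    return sum((a - mf) * (b - mt) for a, b in zip(f, t)) / (n * sf * st)
```

Sliding this over every pixel position of the page with a window the size of the query, and thresholding the resulting score map, yields the ROIs described above.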
The working knowledge of the SURF and HOG feature extraction and matching methods is explained in detail below.
SURF: Speeded Up Robust Features [14] is an improved version of SIFT [15] that produces local feature detections and descriptors that can be used for object detection. It has been claimed that SURF feature extraction is faster than SIFT [15]. SURF feature extraction is good for images that are blurred and skewed, whereas it is not good for images with changes in illumination and viewpoint [27]. SURF uses the wavelet response in both the horizontal and vertical directions and applies Gaussian weights to it. Since the wavelet response can be calculated even for skewed images, de-skewing is not required, which speeds up the process.
Histogram of oriented gradients (HOG): HOG [24] is also a feature extraction technique used for object detection/image recognition in images. The magnitude and the orientation of the gradients in the image are calculated. The image on which HOG has to be applied is pre-processed and resized to a fixed size. From this image, the horizontal and vertical gradients g_x and g_y are calculated. Then the magnitude and angle of the gradients are calculated using the formulas

g = sqrt(g_x^2 + g_y^2),    θ = arctan(g_y / g_x)
It has to be noted that the magnitude of the gradients is very high where there is a large change in intensity. With this method, only gradients that contain important information are kept, and other noise is ignored. An 8 × 8 sliding window is passed over the gradient image and a histogram is calculated for each window; this window size has to be large enough to capture the important features of the image. A 9-bin histogram is an array of size 9 corresponding to angles between 0 and 180 degrees. The gradients from the image are clustered into one of the bins. Finally, a HOG feature vector is calculated from the histograms.
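The binning step for a single cell can be sketched as follows. This is a simplified illustration assuming central-difference gradients and unsigned 20-degree bins; it omits the bilinear vote splitting and block normalization used in full HOG, and the function name is ours.

```python
import math

def hog_cell_histogram(cell):
    """9-bin orientation histogram for one cell of a grey image.

    `cell` is a 2-D list of intensities (e.g. 8x8). Each interior
    pixel votes its gradient magnitude into one of nine unsigned
    orientation bins covering 0-180 degrees.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * 9
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)                      # gradient magnitude
            ang = math.degrees(math.atan2(gy, gx)) % 180  # unsigned angle
            hist[min(int(ang // 20), 8)] += mag           # 20-degree bins
    return hist
```

Concatenating the (normalized) histograms of all cells gives the final HOG feature vector described above.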
Instead of processing the entire image, ROIs that have a higher possibility of matching the query image are detected. This not only makes the calculation of the features easy but also reduces the computational time. Feature extraction is done on the ROIs to further assess the probability of a location being the best match to the query image. In this study, SURF features are used for extracting the possible features from the ROIs (Figure 4); a detailed report of the algorithm was outlined previously [14]. Then, a matching algorithm is applied to match the key point descriptors of the query image with those of the ROIs calculated from the document image. In this context, a Brute-Force matching algorithm has been observed to provide higher accuracy. In a Brute-Force algorithm, the distance between two feature vectors (from the query and the ROIs of the document image) is calculated using the L2 norm by default. The Brute-Force algorithm works well at reducing the number of outliers, returning only descriptor pairs that are very close to each other. There can be many useless key points matched between the images, so Lowe's ratio method [15] is applied to keep only the good matching key points, based on the ratio of the distance of the first matched key point to that of the following one being less than a threshold value. In this experiment, a descriptor pair is considered a match if this distance ratio is less than or equal to 0.4. SURF feature extractors also work well for blurred images, and sincere efforts have been taken in the pre-processing step to make the blurred image darker. The SURF extractor helps in finding the changes in intensities irrespective of the skewness in the image. The SURF feature extraction method defined in the OpenCV library is used with its default parameter settings. Feature vectors are obtained from the SURF features extracted from both the query image and the ROIs of the document image. These feature vectors are compared to determine whether they are closely associated by calculating the distance between them using the Brute-Force algorithm. After calculating the distances between all the features from the query and the ROIs, only the useful matches are returned.
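Lowe's ratio test on brute-force L2 matches can be sketched as below. Descriptors are plain lists of floats here; the real pipeline would use OpenCV's SURF descriptors with a `BFMatcher`, and the function name is ours.

```python
def ratio_test_matches(query_desc, roi_desc, ratio=0.4):
    """Brute-force L2 matching with Lowe's ratio test.

    For each query descriptor the two nearest ROI descriptors are
    found; the match is kept only if nearest <= ratio * second-nearest
    (the paper uses ratio = 0.4). Returns (query_idx, roi_idx) pairs.
    """
    def l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    matches = []
    for qi, q in enumerate(query_desc):
        dists = sorted((l2(q, r), ri) for ri, r in enumerate(roi_desc))
        if len(dists) >= 2 and dists[1][0] > 0 and dists[0][0] <= ratio * dists[1][0]:
            matches.append((qi, dists[0][1]))
    return matches
```

A low ratio such as 0.4 keeps only matches that are clearly better than the runner-up, which is what suppresses the "useless key points" mentioned above.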

Re-ranking using HOG
Histogram of oriented gradients (HOG) is a feature descriptor technique that is commonly used to detect the magnitude and orientation of all the gradients in the image. In this context, HOG feature description techniques are used to compute descriptors on the query image and on the segmented words in the segmentation-based scenario or the ROIs in the segmentation-free scenario. Two feature vectors are obtained from the HOG feature descriptor, and the correlation distance between these two vectors is calculated, which may range from +1 (max) to 0 (min). Images for which the correlation distance is minimum contain a word similar to that of the query image. Here, the extracted ROIs are resized to the shape of the query image so that the features extracted from them can be compared. HOG can be calculated either on the entire image or on particular areas of interest in the image; it can also be used to detect useful descriptor information around particular extracted key points. Figure 5 visualizes the HOG descriptor for the query image with various cell sizes, based on which the best cell size is decided. After some experiments, a cell size of 16 × 16 was found to give the best results. The HOG gradients for both the query and the ROI are calculated, and the distance between them is computed. The query image is decided to match the ROI if the correlation distance between them is greater than 45. In the segmentation-based scenario, the whole document is segmented into words using the ground truth provided along with the datasets, and instead of the document image as a whole, each individual word image is checked to see whether it matches the query image.
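One common definition of the correlation distance between two HOG vectors is 1 minus their Pearson correlation. The sketch below uses that definition; it may differ in scaling from the exact measure used here (it ranges over [0, 2] rather than the [0, 1] range reported above), so treat it as an assumption.

```python
import math

def correlation_distance(u, v):
    """1 - Pearson correlation between two feature vectors.

    Small distances indicate similar gradient structure; identical
    (up to affine scaling) vectors give a distance of 0.
    """
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [x - mu for x in u]
    dv = [x - mv for x in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return 1.0 - (num / den if den else 0.0)
```

Re-ranking then amounts to sorting the SURF-selected ROIs by this distance in ascending order.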

EXPERIMENTAL SETUP
The above-proposed methodology was experimentally implemented on two different publicly available datasets: (1) the George Washington dataset and (2) the HADARA dataset.

George Washington dataset
The George Washington dataset has 20 pages written by George Washington from the Library of Congress, which can be seen on their website. It is a classic example of a multi-writer scenario. It includes approximately 5000 words that can be segmented from the 20 pages using the ground truth provided in the dataset. The query images are selected based on the number of occurrences and the total length of the word itself: for better efficiency, words that occur more than 20 times and have word lengths of six letters or greater are considered.

Hadara dataset
The HADARA dataset consists of 80 pages of historical documents along with ground truths for the segmentation of words. The dataset was written by an author from Egypt/Palestine in the Arabic language in approximately 833 AH (Islamic calendar).
The HADARA dataset has both black and red ink, with a few side notes written, and the pages are highly degraded. The dataset has 48-bit TIFF images with 16 bits per colour channel (true 12 bits per colour channel). Figure 7 contains an example query and document image from the HADARA dataset. As in most word spotting techniques [1,3,4,7,9,13,17-20], the performance of the proposed methodology is measured by the mean average precision and recall. Given a query image, all the matches found by the proposed method constitute N_retrieved; all the images from N_retrieved that actually match the query image are counted as true positives (N_relevant). Precision (P) is the fraction of the retrieved words that are relevant, that is, N_relevant out of N_retrieved. Precision, recall and mAP are calculated based on these counts.
Recall (R) measures the number of retrieved relevant word images against the total number of relevant words present in the dataset (N_total):

P = N_relevant / N_retrieved,    R = N_relevant / N_total

To describe the overall performance of the method, the mean average precision (mAP) is calculated as the mean of the precisions obtained for the different query images against which the platform was tested:

mAP = (1/N) Σ_(i=1..N) P_i

where N is the total number of query images used and P_i is the precision calculated for each query image.
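The evaluation measures follow directly from these definitions and can be sketched as below; the counts N_relevant, N_retrieved and N_total are assumed to be supplied by the evaluation harness.

```python
def precision_recall(n_relevant_retrieved, n_retrieved, n_relevant_total):
    """Precision and recall for one query, per the definitions above."""
    p = n_relevant_retrieved / n_retrieved if n_retrieved else 0.0
    r = n_relevant_retrieved / n_relevant_total if n_relevant_total else 0.0
    return p, r

def mean_average_precision(per_query_precisions):
    """mAP as the mean of the per-query precisions P_i."""
    return sum(per_query_precisions) / len(per_query_precisions)
```

For example, a query returning 10 words of which 8 are correct, with 16 relevant words in the dataset, scores precision 0.8 and recall 0.5.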
The query images considered are the segmented word images from the document image with more than 20 occurrences and word lengths of six letters or greater. Precision and recall are calculated for all the query images considered, and the mAP is calculated over all the resulting precisions. Only a limited amount of research is available for the Arabic language, which made it difficult to compare all the experimental results obtained from the proposed method, as there are few benchmarks to compare against. The HADARA dataset is the most widely used Arabic dataset for word spotting, and even the HADARA word spotting method is segmentation-free only. The results are therefore compared against the HADARA method and the Ulysses [23] method; Ulysses is a commercially available word spotting software.
For evaluation, the obtained results are compared against the ground truth that came along with the datasets. A prediction is marked correct only if the result from the proposed method matches the ground truth 100%. Table 1 shows the results obtained on the HADARA dataset and compares them against the baseline method proposed by the HADARA team. As there are no segmentation-based methods employed on the HADARA dataset, only the results from the proposed methodology in the word-segmentation-free scenario are given. The proposed platform for word spotting has outperformed the other efficient techniques applied on the HADARA dataset. Tables 2 and 3 cover both the segmentation-based and segmentation-free scenarios, illustrating the results of our method against other prominent methods on the George Washington dataset for the segmentation-based (Table 2) and segmentation-free (Table 3) scenarios. Each table has three cases: the first uses SURF features alone in the proposed method; the second uses only HOG without SURF; and in the third, SURF is first used to rank the ROIs and a re-ranking is then done using HOG to remove more false positives. It can be seen that the proposed two-step word spotting method, involving a cross-correlation measure and feature extraction with matching techniques, has outperformed all the above-mentioned recent techniques.
To measure the computational time required to match the query image with the document images, the starting time of the experiment and the ending time (when all the document images have been scanned) are recorded; the difference between them gives the total time required to match the query image with all the document images. Table 4 shows the computation time taken to match each query with all the document images for each feature extraction method on the GW dataset. The time here represents the time taken to extract the features and do the matching. For the combination, it is the time taken to extract the SURF and HOG features separately, rank first using the SURF features and then re-rank using the HOG features.

DISCUSSION
It has to be noted that the proposed method works effectively on historical handwritten documents. This can be attributed to the extensive pre-processing done on the noisy historical documents to enhance the writing; most of the noise is also removed during the pre-processing step. Another important aspect is that the feature extraction methods used in this paper are scale-invariant, which allows them to work efficiently on handwritten documents where the letters, words and sentences are skewed. Since the feature extraction is scale- and rotation-independent, the matching between the query image and the document images is exact. In addition, two sets of feature extraction methods are used, which helps in exact matching: even if one feature extraction method fails, the other comes to the rescue. SURF features work better when the image is skewed, whereas HOG features do not de-skew the image and instead capture the angle of rotation; this functionality makes the HOG features well suited for historical handwritten documents. In the segmentation-based scenario on the George Washington dataset, the proposed method shows a significant improvement over the HOG-only method. This is because in segmentation-based methods the size of the image is small compared with segmentation-free methods, and HOG performs better when the image size is small.
It has to be noted that the proposed solution has outperformed [20], where a similar learning-free method is used. This can be because only one feature extraction method is used in [20], whereas the proposed method has two stages and different feature extraction techniques. Though the learning-based method in [21] used a CNN architecture for training a model, its results are lower than those of the proposed method. The dataset used in [21] is the George Washington dataset, which has approximately 5000 segmented word images overall with 943 classes. Most of the classes contain only one image, which is very low for training a CNN model and requires additional steps like data augmentation to increase the training data size. There is no certainty that data augmentation methods will increase the performance, although data augmentation has increased the performance in some of the methods presented in [22]. On the other hand, learning-based methods like those in [13,22] have better performance than the proposed method. Experiments were conducted to reproduce the PHOCNet and TPP-PHOCNet architectures on the GW dataset, following the parameters specified in [13,22], on an ASUS laptop with one GPU. The mAP values calculated from these experiments are 96.21 and 97.16 for PHOCNet and TPP-PHOCNet, respectively. As can be seen, these mAP measures are higher than that of the proposed method.
The main aspect to be considered here is the time taken to train the model. The computational time for predicting a single image is almost the same in both learning-based and learning-free methods, and the difference in the performance metrics between them is not large. Hence a learning-free method like the proposed solution is suitable for word spotting in historical documents.
It is observed that the training time required by PHOCNet is nearly 8 h and by TPP-PHOCNet nearly 11 h. The time required to load the trained weights and test on a single image is 32 s for both PHOCNet and TPP-PHOCNet, whereas the computational time required by the proposed method on a single image is 20 s. The computation time for finding the matches by the proposed method is slightly longer than that of machine learning and deep learning models like those in [13,22]; the pre-trained weights of the trained models were used to make predictions on samples from the GW dataset to determine this computation time. An important point, however, is that no training is needed in the proposed method, so the training time and the storage space are lower in the proposed method than in the machine learning methods [13,22]. As for the storage space, [13,22] require at least 560 MB to store the trained weights, compared with only 13 KB required by the proposed method.