In most previous studies, the effectiveness of stemming algorithms has been compared by measuring retrieval performance on various experimental test collections. The present work instead assesses performance by counting the identifiable errors made when stemming words from various text samples. This entails manually grouping the words in each sample; software has been developed to facilitate this. After grouping, the words are stemmed, and indices are computed that represent the rates of understemming and overstemming. Results are presented for three stemmers (Lovins, Porter, and Paice/Husk), in each case using three distinct text samples. Although the results are not entirely clear-cut, the Lovins stemmer appears to be inferior to the other two in terms of general accuracy. The way in which the indices vary with the size of the text sample is also investigated. © 1996 John Wiley & Sons, Inc.
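
The error-counting idea can be illustrated with a minimal sketch. The abstract does not define the indices, so the pairwise counts below are an assumption made for illustration: understemming is taken as the fraction of same-group word pairs that receive different stems, and overstemming as the fraction of different-group pairs that receive the same stem. The names `toy_stem` and `error_rates`, the suffix list, and the sample groups are all hypothetical; `toy_stem` is a stand-in and is not the Lovins, Porter, or Paice/Husk algorithm.

```python
from itertools import combinations


def toy_stem(word):
    """Hypothetical single-pass suffix stripper (not a published stemmer)."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        # Strip the first matching suffix, keeping at least three letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def error_rates(groups, stem):
    """Return (understemming_rate, overstemming_rate) by pair counting.

    `groups` is a list of manually assembled word groups; words in one
    group are assumed to be forms that a perfect stemmer would conflate.
    """
    # Map each word to the index of its manually assigned group.
    group_of = {w: g for g, words in enumerate(groups) for w in words}
    same_total = same_missed = diff_total = diff_merged = 0
    for a, b in combinations(sorted(group_of), 2):
        merged = stem(a) == stem(b)
        if group_of[a] == group_of[b]:
            same_total += 1
            same_missed += int(not merged)  # should merge, did not
        else:
            diff_total += 1
            diff_merged += int(merged)  # should stay apart, merged
    under = same_missed / same_total if same_total else 0.0
    over = diff_merged / diff_total if diff_total else 0.0
    return under, over


# Hypothetical sample, grouped by hand.
groups = [
    ["connect", "connected", "connecting", "connection"],
    ["wander", "wandering", "wanders"],
    ["wand", "wands"],
]
under, over = error_rates(groups, toy_stem)
print(f"understemming rate: {under:.3f}")  # 0.500
print(f"overstemming rate: {over:.3f}")  # 0.154
```

In this toy sample, `connection` is left unmerged with its group (understemming), while `wander` and `wanders` are reduced to `wand` and so collide with an unrelated group (overstemming). Passing a different `stem` function to `error_rates` allows stemmers to be compared on the same grouped sample.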