Recognising a problem.

Anyone who has spent any time reading through some of the millions of pages in the Ultrapedia Library will have come across recognition errors. These errors occur for a variety of reasons, some of which I will try to explain.

Ligatures and Blackletter text

Ligatures occur where two or more letters are joined together, and were a labour-saving device predominantly used by typesetters in the late eighteenth and early nineteenth centuries.

The OCR programs we use at Ultrapedia are optimized for recognising ‘modern’ fonts. If you printed out this page and then fed it into our OCR system it would be recognised with 100 percent accuracy. This is because the system has been ‘trained’ so that it ‘knows’ the difference between an ‘O’ and an ‘0’. Generally speaking, the older the book, the older the font in which it was printed, and the less closely that font resembles the modern fonts the system was trained on.

Wikipedia has more information on typesetting technology over the ages, but for this explanation I shall use books from the early nineteenth century as an example. Some of the more common ligatures of this era are ‘fi’, ‘ct’, ‘ae’, and ‘ff’.

When the OCR system encounters one of these ligatures it simply cannot make sense of it. Instead, the system makes a ‘best guess’. We detect and correct some, but not all, of these errors during spell-checking. The only OCR program that supports recognition of Blackletter and Fraktur fonts is ABBYY FineReader XIX. Unfortunately, the cost of using this program is prohibitive, as the vendor charges for each page. The Metadata Engine Project has more information on the problems associated with the recognition of Fraktur and Blackletter fonts.
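
As a rough illustration of how a spell-checking pass can catch a ligature misreading, here is a minimal Python sketch. It is not the Ultrapedia system's own code: the confusion table, the word list, and the function name correct_token are all hypothetical and chosen only to show the idea.

```python
# Hypothetical confusion table: what the OCR emitted -> the ligature it may have been.
# These pairs are illustrative assumptions, not measured data.
LIGATURE_CONFUSIONS = {
    "d": "ct",  # e.g. a 'ct' ligature read as 'd'
    "h": "fi",  # e.g. an 'fi' ligature read as 'h'
}

# Stand-in for a real spell-checking dictionary.
WORD_LIST = {"fact", "first", "the", "of"}


def correct_token(token, words=WORD_LIST, confusions=LIGATURE_CONFUSIONS):
    """Return a corrected token if a single substitution yields a known word."""
    if token.lower() in words:
        return token  # already a known word, nothing to do
    for i, ch in enumerate(token):
        replacement = confusions.get(ch)
        if replacement is None:
            continue
        candidate = token[:i] + replacement + token[i + 1:]
        if candidate.lower() in words:
            return candidate
    return token  # no confident fix; leave it for a human reader


print(correct_token("fad"))   # 'd' is replaced by 'ct'
print(correct_token("hrst"))  # 'h' is replaced by 'fi'
```

Note that a check of this kind only fires when the misread token is not itself a dictionary word. If the word list contained ‘fad’, the misreading would pass through unchanged, which is one reason such errors are not always caught.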

Dual Language Books

Although our OCR system has the ability to recognise a book printed in two different languages, we have not yet done this because of time constraints. Consequently, certain types of books (English to French dictionaries, for example) are bound to contain many recognition errors, because we have only instructed the OCR program to recognise English text. These errors occur because English text does not normally contain accents, so when the OCR comes across a ‘grave’ or ‘cedilla’ in a word it can become confused.
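
One way to judge whether a book might merit a second recognition language is to measure how many of its words carry accents. The Python sketch below is only an illustration under an assumption: it needs some text in which the accented characters have survived, for example a sample page keyed by hand or a trial pass with a second language enabled. The function names are hypothetical.

```python
import unicodedata


def has_diacritic(word):
    """True if any character in the word carries a combining mark (grave, cedilla, ...)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)


def diacritic_ratio(text):
    """Fraction of whitespace-separated words that contain a diacritic."""
    words = text.split()
    if not words:
        return 0.0
    return sum(has_diacritic(w) for w in words) / len(words)


sample = "father père boy garçon"
print(diacritic_ratio(sample))  # half of these four words are accented
```

A page of plain English would score at or near zero, so a noticeably higher ratio suggests the page contains a second language.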

Multiple Quote Marks and Marginalia

Certain books in the Ultrapedia Library use unusual quotation marks, or “quote” marks. Instead of using a single quote mark to indicate the beginning of a quotation and another to indicate the end, these books use a quote mark at the beginning of each line, and another at the end. Misrecognition of these quote marks gives the text a ‘ragged’ appearance, and it sometimes looks as if someone has drawn vertical lines around the paragraph.
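
Where the repeated quote marks have at least been recognised as quote marks, they can be tidied afterwards. The following Python sketch is a hypothetical clean-up step, not part of the Ultrapedia workflow: it keeps the opening mark on the first line of a quotation and drops the repeats at the start of the following lines. It deliberately leaves marks at the ends of lines alone, since a closing mark there may be genuine.

```python
import re

# Quote characters that may be repeated at the start of every line of a quotation.
QUOTE_CHARS = "\"'“”‘’"
LEADING_QUOTES = re.compile(r"^\s*[" + re.escape(QUOTE_CHARS) + r"]+\s*")


def strip_running_quotes(lines):
    """Keep the opening quote on the first line; drop repeats on continuation lines."""
    cleaned = []
    for i, line in enumerate(lines):
        if i > 0:
            line = LEADING_QUOTES.sub("", line)
        cleaned.append(line)
    return cleaned


quotation = [
    '"It was the best',
    '"of times, it was',
    '"the worst of times."',
]
for line in strip_running_quotes(quotation):
    print(line)
```

If the marks were misrecognised as some other character, such as a vertical bar, the pattern would need to include that character as well.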

Marginalia have proven to be a particular problem in certain book catalogues, and also in Biblical and ecclesiastical works.

