If you have spent any time at all in front of a computer, you will almost certainly have spoken to it at one time or another. I know I have. And I’m not just referring to rhetorical questions like ‘what did you go and do that for’, or ‘why did you have to go and crash right now’.
Talking to a computer is always a one-way conversation these days, but it hasn’t always been that way! Hidden away in some dark recesses of the Ultrapedia Library are stories about computers that could not just talk, but could also walk, run, and even ride horses. I even found a book that told of two computers who ran away together, got married, had a baby, and lived happily ever after.
This is, of course, another example of how the meanings of words change over time. In the nineteenth century a computer was a person – a person who made computations – a human computer. These human computers were employed to ‘figure out’ problems like ‘what angle of elevation must I set my cannon to in order to send a cannonball over yonder hill’, or ‘in what year will Halley’s Comet reappear in the night sky’.
However fanciful these examples are, they illustrate very well one of the inherent subtleties of the Ultrapedia Library, namely, that if a word wasn’t in use by 1923, then it won’t appear in books of that era either. For example, the word ‘telephone’ will appear in the library because it was invented in the nineteenth century. The word ‘television’ will not occur in the library because it was not invented until 1925 – two years past the cutoff date for something to be placed in the public domain.
Another of my favourite examples is ‘spice girl’. If you were to search the Ultrapedia Library hoping to find something about ‘posh’ or ‘baby’ spice, you would be sadly disappointed. According to Ultrapedia, a spice girl was generally a native of the ‘spice islands’ who collected nutmegs for a living.
In other words, if something was only invented or discovered after 1923, then it simply cannot exist in the Ultrapedia Library. I’m sure you get the idea, so if you’re looking for information about ‘playstations’, ‘cellphones’, airliners, motorcades, or any other recent innovations – then I suggest you look elsewhere.
The search engine at the heart of Ultrapedia can perform both simple and complex searches. The ‘simplest’ way of searching the library is to enter a word or phrase into the search-box. We have deliberately ‘throttled back’ the search engine to deliver a maximum of five thousand results to reduce information overload, while at the same time increasing overall system performance.
If you get more than a few hundred search results we recommend narrowing your search by entering more text into the search box; for example ‘queen’ will return many more results than ‘queen victoria’.
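As a rough illustration of why adding words narrows a search, here is a minimal sketch in Python. It is not the Ultrapedia search engine: the `search` function, the tiny `pages` collection, and the rule that a page must contain every word of the query are all assumptions made for the example. Only the cap of five thousand results comes from the description above.

```python
MAX_RESULTS = 5000  # the cap described in the text

def search(pages, query, max_results=MAX_RESULTS):
    """Return ids of pages that contain every word of the query.

    Assumption: a multi-word query matches only pages containing all of
    its words. The real engine may treat queries differently.
    """
    terms = query.lower().split()
    hits = []
    for page_id, text in pages.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            hits.append(page_id)
            if len(hits) >= max_results:
                break  # stop once the cap is reached
    return hits

# Hypothetical three-page library, for illustration only.
pages = {
    "p1": "The queen opened parliament",
    "p2": "Queen Victoria reigned for many years",
    "p3": "Victoria is a city in British Columbia",
}

print(search(pages, "queen"))           # ['p1', 'p2']
print(search(pages, "queen victoria"))  # ['p2']
```

Each extra word can only remove pages from the result list, never add them, which is why ‘queen victoria’ returns fewer results than ‘queen’.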
Sometimes you may also need to widen your search to get meaningful results. Let’s say, for example, that you entered ‘mendeleev’ into Wikipedia. As expected, you would find the entry for ‘Dmitri Mendeleev’. Entering ‘mendeleev’ into Ultrapedia, however, currently returns just one result. The reason is that the popular spelling of this famous chemist’s name has changed over the centuries. This is where Ultrapedia’s ‘Fuzzy Search’ capability comes into its own.
We included fuzzy searching partly to help counteract recognition errors in the library. Fuzzy searching instructs the search engine to try alternative spellings of the word. If you enter ‘mendeleev’ into Ultrapedia again, but this time check the ‘Fuzzy search’ option with a Fuzziness of ‘5’, you will get over fifty results. Closer examination of the results will show that ‘Mendeleev’ can also be spelled ‘Mendeleef’, ‘Mendeleeff’, and numerous other variations.
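One common way to implement this kind of fuzzy matching is edit distance: count the single-letter insertions, deletions, and substitutions needed to turn one word into another, and accept words within a chosen limit. The Python sketch below shows the idea. It is an assumption that Ultrapedia works this way, and the `max_edits` parameter is hypothetical and is not claimed to correspond to the ‘Fuzziness’ setting.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the letter
            ))
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_edits=2):
    """Return vocabulary words within max_edits edits of term (case-insensitive)."""
    term = term.lower()
    return [w for w in vocabulary if edit_distance(term, w.lower()) <= max_edits]

# Hypothetical index vocabulary, for illustration only.
vocabulary = ["Mendeleev", "Mendeleef", "Mendeleeff", "Mendelssohn"]
print(fuzzy_match("mendeleev", vocabulary))
# ['Mendeleev', 'Mendeleef', 'Mendeleeff']
```

A larger limit finds more spelling variants but also lets in unrelated words: ‘Mendelssohn’ is five edits away from ‘Mendeleev’, so it would match at a limit of five.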
The single worst class of books for recognition errors is those that contain a lot of tables. Although some smaller tables are recognised correctly, most tables in the Ultrapedia Library are not. Unless the original scanned document is in pristine condition and printed clearly, with plenty of space between the table elements, the accuracy of the recognised table will be low. Almanacs, books on statistics, and anything with the word ‘table’ in the title will consequently be poorly recognised.
Other classes of books that recognise poorly are those with lots of mathematical formulae, algebra books being the worst. Books with tables of logarithms, tangents, etc. are likewise not to be trusted, as well as being entirely superfluous. Maps are another example of poor recognition candidates, as are books that contain musical scores.
The OCR engine can get so confused that it sometimes reorients the page from portrait to landscape before doing the recognition, with disastrous results. Fortunately, you will normally never see examples of these bad pages unless you deliberately search for them.
One of the more perplexing problems we are trying to solve is a scheme to rationalise all the fonts used in the books. Our success in recreating the actual ‘look and feel’ of each book is limited by our inability to make a more exact match between the printed fonts and the on-screen fonts.
Drop Caps and Large Initials
Certain books in the Ultrapedia Library are printed with a large ‘initial’ at the start of some chapters and paragraphs. Although our OCR system can sometimes detect and correct these anomalies, it can give the text a ragged appearance, as the text will be reflowed, justified or hyphenated incorrectly.
The usual way of printing large initials is with the large initial aligned with the rest of the text on the line. If a large initial is aligned with the ‘top’ of the other characters in the line it is called a ‘Drop Cap’, which is much more problematic for us, as we cannot auto-correct it. Consequently, you may occasionally come across pages with enormous initials.
Anyone who has spent any time reading through some of the millions of pages in the Ultrapedia Library will have come across recognition errors. These errors occur for a variety of reasons, some of which I will try to explain.
Ligatures occur where two or more letters are joined together, and were a labour-saving device predominantly utilized by typesetters in the late eighteenth and early nineteenth centuries.
The OCR programs we use at Ultrapedia are optimized for recognising ‘modern’ fonts. If you printed out this page and then fed the page into our OCR system it would be recognised with 100 percent accuracy. This is because the system has been ‘trained’ so that it ‘knows’ the difference between an ‘O’ and an ‘0’. Generally speaking – the older the book, the older the font that the book was printed in.
Wikipedia has more information on typesetting technology over the ages, but for this explanation I shall use books from the early nineteenth century as an example. Some of the more common ligatures of this era are ‘fi’, ‘ct’, ‘ae’, and ‘ff’.
When the OCR system encounters one of these ligatures it simply cannot make sense of it. Instead, the system makes a ‘best guess’. We detect and correct some of these errors during spell-checking, but not all of them. The only OCR program that supports recognition of Blackletter and Fraktur fonts is ABBYY FineReader XIX. Unfortunately, the cost of using this program is prohibitive, as they charge for each page. The Metadata Engine Project has more information on the problems associated with the recognition of Fraktur and Blackletter fonts.
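The spell-checking step mentioned above, where a bad ‘best guess’ is replaced by a real word, might look something like the following Python sketch. It uses the standard library’s `difflib.get_close_matches`. The `correct_word` helper, the three-word dictionary, the similarity cutoff of 0.8, and the misreading ‘perfeft’ for ‘perfect’ are all hypothetical, and this is not a description of the spell-checker Ultrapedia actually uses.

```python
import difflib

def correct_word(word, dictionary, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close.

    cutoff is a similarity ratio between 0 and 1; 0.8 is an arbitrary
    choice for this example.
    """
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical dictionary and OCR output, for illustration only.
dictionary = ["perfect", "office", "effect"]
print(correct_word("perfeft", dictionary))  # perfect
print(correct_word("xyzzy", dictionary))    # xyzzy (no close match, left alone)
```

A correction like this only works when the guess is close to a dictionary word, which is one reason some ligature errors survive spell-checking.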
Dual Language Books
Although our OCR system has the ability to recognise a book printed in two different languages, we have not done this yet because of time constraints. Consequently, certain types of books – English to French dictionaries for example – are bound to contain many recognition errors because we have only instructed the OCR program to recognize English text. Recognition errors occur because the English language does not contain accents, and when the OCR comes across a ‘grave’ or ‘cedilla’ in a word it can become confused.
Certain books in the Ultrapedia Library use unusual quotation marks, or ‘quote’ marks. Instead of using a single quote mark to indicate the beginning of a quotation and another to indicate the end, these unusual books use a quote mark at the beginning of each line, and another at the end. Misrecognition of these quote marks gives the text a ‘ragged’ appearance, and sometimes looks as if someone has drawn vertical lines around the paragraph.
Marginalia have proven to be a particular problem in certain book catalogues, and also in biblical and ecclesiastical works.
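To show what cleaning up these per-line quote marks could involve, here is a minimal Python sketch. It assumes the marks have been recognised correctly as quote characters and that the paragraph is a single quotation. It keeps the opening mark on the first line and the closing mark on the last line, and removes the rest. The `strip_running_quotes` function is hypothetical and is not part of the Ultrapedia system.

```python
import re

def strip_running_quotes(paragraph):
    """Remove quote marks repeated at line edges inside one quotation.

    Assumption: the whole paragraph is a single quotation, so only the
    first line's opening mark and the last line's closing mark are real.
    """
    lines = paragraph.split("\n")
    last = len(lines) - 1
    cleaned = []
    for i, line in enumerate(lines):
        if i > 0:
            # Drop a quote mark at the start of every line after the first.
            line = re.sub(r'^\s*["“”]\s*', "", line)
        if i < last:
            # Drop a quote mark at the end of every line before the last.
            line = re.sub(r'\s*["“”]\s*$', "", line)
        cleaned.append(line)
    return "\n".join(cleaned)

sample = '"It was the best of times,"\n"it was the worst of times,"\n"it was the age of wisdom."'
print(strip_running_quotes(sample))
```

The obvious limitation is that a genuine quotation ending in the middle of the paragraph would lose its closing mark, so a real implementation would need more context than this.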
Here you will find details of the latest books that we have recognised. So far we have recognised over 35 thousand public domain and out-of-copyright books. We have made around ten thousand books from the library available for full text search and retrieval at our website.
Entering a search term into the Ultrapedia search engine will return a list of books. Clicking on the title of a book will display the first page in the book on which the search term appears. The books are stored on our servers as Adobe PDF files. If you have Adobe Acrobat Reader or Professional, the selected page will open there.
To navigate to the next book containing your search term, click the ‘Next Doc’ button. During initial testing, each page in a book is stored as an individual PDF document, so the ‘First Hit’, ‘Next Hit’, and ‘Prev Hit’ buttons are non-functional.
All the books in the Ultrapedia Library are recognized and stored so as to retain the same page formatting and fonts as the original book. The library has also been spell-checked and corrected, with an overall success rate of approximately 95% across the entire collection.