Beyond V3s – What’s Next?

We are already part way through creating V4s of books in the Ultrapedia Library. Perhaps we would feel like we were making a bit more progress than we actually are if we had started at the letter ‘Z’ instead of ‘A’ like we did. The January 2008 edition of the Ultrapedia Library has nearly 1400 titles listed under the ‘A’ category.

Before I say more about the specifics of V4s I would first like to thank Bruji for introducing me to the concept of the Silent Update. I do this because I expect the introduction of the forthcoming Ultrapedia V4s to be silent in true Bruji tradition. There is something almost preternatural about using Bookpedia. It feels as if Conor and Nora pick one or two feature requests every week from the forums, and then implement these new features to see if anyone notices.

It is my vain hope that updating the Ultrapedia Library to V4 will be just as silent.

Creating V4s is a Manual Task and simply involves the extraction of each book’s Index and Table of Contents. You might well ask why it took us so long to realize that indexing an index was not such a hot idea… If I ever find out the answer I’ll be sure to let you know…

In the meantime however, the new and improved V4s will be introduced first on Ultrapedia Search.

Not Only… but Also…

Hand in hand with creating V4s, we have also created V5s of the first 1000 or so books in the Ultrapedia Library. We started making V5s when we started to see a growing number of ‘Mutilated Pages’ or ‘Gross Errors’ creeping into the V3 collections. I made a quick reference to this problem in my blog entry Turning the Tables.

We actually create the ‘V5s’ from ‘V1s’. To create a V5 we delete everything except ‘Plate’ type images from each book, so a typical V5 page will consist of two discrete parts – the plate image itself, and the textual ‘Legend’ or description of the plate – the bit of text under the picture.

We have decided to use the fantastic JALBUM to display the V5’s in a ‘Gallery Format’, but these galleries have not been opened to the public yet. You can keep track of our progress in creating V5’s by joining the ‘V5’ project in the forums.

Is that a V8 in your Pocket?

I think now would be a good time to recap on the V-numbers we have used so far – so here goes.

V0 – The book is not suitable for OCR
V1 – The book is a good candidate for OCR
V2 – Only used in-house
V3 – The book has been OCR’d and published on the website
V4 – Presently used in-house only – see above
V5 – Presently used in-house only – see above
V6 – There is no V6 yet
V7 – There is no V7 yet
V8 – Presently used in-house only for ‘Proof-Reading’

We have only ever produced a few dozen V8s, none of which are currently available online. We created the V8s initially in response to the looks of bewilderment on the faces of friends, family and colleagues whom we subjected to many a long-winded demonstration of Ultrapedia and its precursors.

The V8 is a classic case of a picture being worth a thousand words. A V8 is made by ‘collating’ the V1 and V3 versions of a book into a single PDF that displays the V1 page alongside the V3 page. In fact it’s such an effective demonstration of the benefits of OCR that I now feel compelled to create a few more V8s to help prove my point.
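For the curious, the collation step can be sketched in a few lines of Python. This is only an illustration of the pairing idea: the function name and the use of plain lists of page labels are my own assumptions, and the real V8s are built from PDF pages rather than strings.

```python
def collate(v1_pages, v3_pages):
    """Pair each unrecognised (V1) page with its recognised (V3) twin.

    Both arguments are lists of pages in reading order. The result is a
    list of (v1_page, v3_page) spreads, one per page of the book.
    """
    if len(v1_pages) != len(v3_pages):
        # A mismatch usually means a page was dropped along the way,
        # so refuse to build a misaligned side-by-side view.
        raise ValueError("V1 and V3 versions have different page counts")
    return list(zip(v1_pages, v3_pages))


spreads = collate(["v1-p1", "v1-p2"], ["v3-p1", "v3-p2"])
```

Each pair in `spreads` corresponds to one side-by-side spread in the finished V8.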

Displaying the unrecognized (V1) and recognized (V3) pages side by side on a suitably large screen like the Apple 30 inch cinema displays we use in-house actually generated more than a few ‘WOW’s and at least once a jaw motion was changed from a yawn to a drop.

Self-congratulatory comments aside, even now I think I would have been better off just including a couple of V8 screen-shots. I will track down a few V8s and post them later on.

Copyright, Fair Use, Creative Commons etc.

 
Also Posted in the Ultrapedia Forums
Without Prejudice…
We are not trying to make any enemies here at Secret Studio, and we do not want to impinge on someone else’s copyright – dead or alive. Copyright around the world is changing, and we can’t keep up with all the rules.

So, if you believe that such and such a book is not in the public domain please let us know either by email  or by posting in the Copyright Issues forum. 

Small Print
If by some (albeit very slim) chance you are the author of one of the books in the Ultrapedia Library I would first like to say thank you, and secondly I would like to apologise if making your book available online has caused you any financial distress. Thirdly I invite you to post any useful health and lifestyle tips you might like to share because the chances are that you are at least 95 years old.

If it seems like I am playing a bit fast and loose with this whole copyright thing, well I assure you that I don’t. And to be honest – neither do you! Just to get this far into Ultrapedia you would have had to page through the Secret Studio terms of use and license agreement – riveting stuff you must admit.

What – you mean you didn’t read it? Good! Because it was just some boiler-plate text that we licensed OK?  I think of terms and conditions (T&C) as akin to a safety-net for a tightrope walker.  It is a sad reflection on this day and age that something we are essentially giving away for free needs an asterisk * 

* see Small Print

How to get the best from the Ultrapedia website

Thanks for helping beta-test the new Ultrapedia website. We sincerely hope that you enjoy exploring the new Ultrapedia Website as much as we have enjoyed making it.

If you are new to Ultrapedia here is a brief history of the project. Ultrapedia is one of several projects currently in work at Secret Studio who are, among other things, experts in digitizing all kinds of books and documents. Secret Studio operates one of the world’s largest OCR (Optical Character Recognition) Farms. Our servers are fed a steady stream of books in the form of large PDF image files which are processed by the servers in the OCR Farm into Recognised versions of the book or publication.

The OCR technique we employ is very thorough, and uniquely, retains the look and feel of the original written book.  This allows the search engine to deliver almost a ‘carbon copy’ of the original page from the book, and not just a text extract from the original.
Optical Character Recognition not only shrinks the size of the books by up to 90%, but it also allows us to create a fully searchable version of the book. As new books are processed they are then indexed, split into single pages, and then published to the Ultrapedia Library Search. 

We launched Ultrapedia Library Search in April 2007. At that time we had indexed 2,760,581 individual pages. Our OCR Farm has continued churning away 24 hours a day, 7 days a week, and the Ultrapedia Library Search now contains 5,977,204 pages. Over the last eight months we have added approximately 400 thousand pages a month to the library.
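As a quick sanity check, the monthly rate follows directly from the two page counts quoted above. The snippet below is just that arithmetic written out in Python; the function name is mine and purely illustrative.

```python
def average_monthly_growth(pages_at_launch, pages_now, months):
    """Average number of pages added per month between two counts."""
    return (pages_now - pages_at_launch) / months


# Page counts quoted in the text, spread over the eight months since launch.
rate = average_monthly_growth(2_760_581, 5_977_204, 8)
print(f"{rate:,.0f} pages per month")
```

That works out at a little over 402 thousand pages a month, in line with the ‘approximately 400 thousand’ quoted.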

We have created this beta site as a jumping off point to introduce a browseable version of the Ultrapedia Library alongside the traditional Ultrapedia Search Engine. To help keep track of the fusion (or is it confusion) of these two vast information resources we have also incorporated the Ultrapedia Blog, and the Ultrapedia Support Forum.

To sum up: the Ultrapedia Search Engine has been designed to deliver single pages from the Ultrapedia Library, and the Ultrapedia Library Browser has been designed so you can browse through the library and download books – if you are a registered user.

Just like the rest of the site, the User Login and Registration system is also under construction and may not work all the time. If your login or registration fails please try again later.
 
Remember! If you do get a login failure you can always access the Ultrapedia Blog where we will post live status reports of our servers.

Important! We are not currently accepting New User Registrations from ‘ScreenName’ type email accounts such as MSN Hotmail, Yahoo Mail etc. … sorry.

How we do it.

 
Our OCR Farm runs 24 hours a day, and we like to keep it busy. Before we add a book to the OCR Queue we ‘Top and Tail’ the book removing blank and extraneous pages to leave the Title Page of the book as the new first page. We then remove any other blank or unrecognisable pages from the book as well. We use Adobe Acrobat Professional for this stage of the editing process.

The OCR Farm can be configured to recognise books in many languages, but for now most of the books in the search engine are English. We therefore have a backlog of about 15 thousand non-English books that we have recognized but not yet indexed. We are experimenting with several other search engines to deliver these non-English books. If you are interested in following our progress please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Lately, most of the books we have processed have come via our partner Google Book Search. As we continue to expand the Ultrapedia Library we will also be adding books from two other Secret Studio projects – The Pointmore Library, and The Philatelink Library which are both Scanned and Recognised in-house.
 
Once a book has been ‘topped and tailed’ a ‘V1’ suffix is appended to the original filename. Our OCR Engine is designed to detect these ‘V1’ files, and they are then added to the OCR Queue where they wait until a Recognition Server is available to perform Character Recognition on the book.
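We do the ‘Top and Tail’ by hand in Acrobat, but the rule itself is simple enough to sketch in Python. Everything in this example is an assumption made for illustration: pages are modelled as (kind, content) pairs, and the kind labels (‘blank’, ‘title’, ‘unrecognisable’ and so on) are invented for the sketch.

```python
UNWANTED = {"blank", "unrecognisable"}


def top_and_tail(pages):
    """Drop everything before the title page, then any unwanted pages.

    `pages` is a list of (kind, content) tuples in reading order.
    If no title page is found, the book is trimmed from its first page.
    """
    kinds = [kind for kind, _ in pages]
    start = kinds.index("title") if "title" in kinds else 0
    return [page for page in pages[start:] if page[0] not in UNWANTED]


book = [
    ("blank", ""),
    ("cover", "library stamp"),
    ("title", "A History of Stamps"),
    ("blank", ""),
    ("text", "Chapter I"),
]
trimmed = top_and_tail(book)  # the title page is now the first page
```

Once a book has been trimmed in this way it is ready to be given its ‘V1’ suffix.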
 
As the OCR Farm outputs recognised books the suffix on the books filename is changed from ‘V1’ to ‘V2’. When a book reaches the ‘V2’ stage the newly recognised book enters a ‘Workflow’ where several other enhancements are performed.
 
The first stage in the Workflow is a continuous ‘Batch’ Operation that monitors for new ‘V2’ stage files and then embeds the page thumbnails and sets the ‘Open View’ options of the PDF to aid verification. The next part of the Workflow adds ‘Headers’ and ‘Footers’ to the file. Each book is then ‘Page Checked’ to ensure there are no unrecognized pages or ‘Gross Errors’. We also ensure the book contains valid ‘Metadata’ for ‘Book Title’ and ‘Author’. OCR can be a tricky process – see my previous article Recognising a Problem.
 
Once a book has been recognised and verified its filename suffix is changed from ‘V2’ to ‘V3’ and the book is then split into individual pages for indexing into the Ultrapedia Library Search.
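The suffix changes described above amount to a very small state machine, which can be sketched as follows. The exact filename layout is an assumption on my part: the sketch supposes the suffix sits just before the extension, separated by an underscore (for example ‘almanac_V1.pdf’), and the function names are invented for the example.

```python
import re
from pathlib import Path

# Assumed layout: "<name>_V<digit>.<extension>", e.g. "almanac_V1.pdf".
STAGE_PATTERN = re.compile(r"^(?P<stem>.+)_V(?P<stage>\d)$")

# V1 (queued for OCR) -> V2 (recognised) -> V3 (verified and published).
NEXT_STAGE = {1: 2, 2: 3}


def stage_of(filename):
    """Return the V-number in a filename, or None if it has no suffix."""
    match = STAGE_PATTERN.match(Path(filename).stem)
    return int(match.group("stage")) if match else None


def promote(filename):
    """Return the filename moved on to the next workflow stage."""
    path = Path(filename)
    match = STAGE_PATTERN.match(path.stem)
    if match is None:
        raise ValueError(f"{filename!r} has no V suffix")
    stage = int(match.group("stage"))
    if stage not in NEXT_STAGE:
        raise ValueError(f"V{stage} has no following stage in this sketch")
    new_name = f"{match.group('stem')}_V{NEXT_STAGE[stage]}{path.suffix}"
    return str(path.with_name(new_name))
```

Calling `promote("almanac_V1.pdf")` gives `"almanac_V2.pdf"`, and promoting that result once more gives the ‘V3’ name. The sketch deliberately stops at V3, since the later V-numbers are made by separate processes.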
 
That then is pretty much it for the Search Engine – Raw PDF images of books go in one end… and single recognised pages come out the other end, get indexed and published to the search engine. This is an ongoing project which currently yields about 400 thousand new pages monthly.
 
We haven’t quite finished with Single Page V3s yet however. To keep track of which V3s have been indexed we then store a complete copy of the recognised book in a huge database. We use this database to compare newer revisions of the books as we sometimes discover better original copies.
 
In common with all other computer systems Ultrapedia is not immune to GIGO – a very succinct acronym for an OCR Farm – Garbage In = Garbage Out. We keep our eyes open for newer, better, higher resolution scans all the time, and as more and more libraries join the Google Book Search program new and better scanning techniques are often employed which can result in our discovery of what we call a ‘Replacement Candidate’ for a book currently in the live search engine.
 
We keep track of all the books we have recognized in another, smaller database, and when we come across another copy of a book we have already recognised we add the new book to a ‘Reprocess Queue’ for recognition and exhaustive cross-comparison of the older (live) version and the newer ‘Replacement Candidates’. If you are interested in following the ‘life cycle’ of a replacement candidate V3 please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Various checks are done on the two files:
 
Word Count and Comparison
Spell Check and Comparison
Image Check and Comparison
 
If the ‘Replacement Candidate’ proves to have fewer errors than the ‘live’ version the new version is indexed into the search engine, and the old version is removed. For an up to date list of ‘Replacement Candidates’ please visit the PROJECT TOOLS section in the Ultrapedia forums.
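To give a flavour of the comparison, here is a minimal Python sketch of the word count and spell check steps. It is not our production code: the function names, the tiny dictionary and the rule that the spell check alone decides the winner are all assumptions made for illustration, and the image check is left out entirely.

```python
import re


def word_count(text):
    """Number of whitespace-separated words in the recognised text."""
    return len(text.split())


def misspelled_count(text, dictionary):
    """Count words that do not appear in the given set of known words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word not in dictionary)


def prefer_candidate(live_text, candidate_text, dictionary):
    """True if the replacement candidate has fewer spelling errors."""
    live_errors = misspelled_count(live_text, dictionary)
    candidate_errors = misspelled_count(candidate_text, dictionary)
    return candidate_errors < live_errors


known_words = {"the", "queen", "of", "england"}
live = "tbe qucen of england"       # two recognition errors
candidate = "the queen of england"  # a cleaner scan of the same page
replace_live = prefer_candidate(live, candidate, known_words)
```

Here both versions have the same word count, but the candidate has no spelling errors against two in the live version, so `replace_live` comes out as `True`.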

OK Computer.

If you have spent very much time at all in front of a computer you will almost certainly have spoken to it at one time or another. I know I do. And I’m not just referring to rhetorical questions like ‘what did you go and do that for’, or ‘why did you have to go and crash right now’.

Talking to a computer is always a one-way conversation these days, but it hasn’t always been that way! Hidden away in some dark recesses of the Ultrapedia Library are stories about computers that could not just talk, but could also walk, run, and even ride horses. I even found a book that told of two computers who ran away together, got married, had a baby, and lived happily ever after.

This is, of course, another example of how the meaning of words changes over time. In the nineteenth century a computer was a person – a person who made computations – a human computer. These human computers were employed to ‘figure out’ problems like ‘what angle of elevation must I set my cannon to in order to send a cannonball over yonder hill’, or ‘in what year will Halley’s Comet reappear in the night sky’.

However fanciful these examples are, they illustrate very well one of the inherent subtleties of the Ultrapedia Library, namely, that if a word wasn’t in use up to 1923, then it won’t exist in books of that era either. For example, the word ‘telephone’ will appear in the library because it was invented in the nineteenth century. The word ‘television’ will not occur in the library because it was not invented until 1925 – two years past the cutoff date for something to be placed in the public domain.

Another of my favourite examples is ‘spice girl’. If you were to search the Ultrapedia Library hoping to find something about ‘posh’ or ‘baby’ spice you will be sadly disappointed. According to Ultrapedia, a spice girl was generally a native of the ‘spice islands’ who collected nutmegs for a living.

In other words, if something was only invented or discovered after 1923, then it simply cannot exist in the Ultrapedia Library. I’m sure you get the idea, so if you’re looking for information about ‘playstations’, ‘cellphones’, airliners, motorcades, or any other recent innovations – then I suggest you look elsewhere.

Fuzzy Thinking.

The search engine at the heart of Ultrapedia can perform both simple and complex searches. The ‘simplest’ way of searching the library is to enter a word or phrase into the search-box. We have deliberately ‘throttled back’ the search engine to deliver a maximum of five thousand results to reduce information overload, while at the same time increasing overall system performance.

If you get more than a few hundred search results we recommend narrowing your search by entering more text into the search box; for example ‘queen’ will return many more results than ‘queen victoria’.

Sometimes, you may also need to extend your search to get meaningful results. Let’s say for example that you entered ‘mendeleev’ into Wikipedia. As expected, you would find the entry for ‘Dmitri Mendeleev’. However, entering ‘mendeleev’ into Ultrapedia currently returns just one result. The reason for this is that the popular way of spelling the name of this famous chemist has changed over the centuries. This is where Ultrapedia’s ‘Fuzzy Search’ capability comes into its own.

We included fuzzy searching partly to help counteract recognition errors in the library. Fuzzy searching instructs the search engine to try alternate spellings of the word. If you enter ‘mendeleev’ into Ultrapedia again, but this time check the ‘Fuzzy search’ option, with a Fuzziness of ‘5’ you will get over fifty results. Closer examination of the results will show that ‘Mendeleev’ can also be spelled ‘Mendeleef’, ‘Mendeleeff’, and numerous other variations.
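One common way of measuring how far apart two spellings are is ‘edit distance’: the number of single-letter insertions, deletions and substitutions needed to turn one word into the other. The Python sketch below uses that measure to show the general idea. It is an assumption that this resembles what happens inside our search engine, and the `max_distance` parameter here should not be read as the same thing as the ‘Fuzziness’ setting.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a letter from a
                current[j - 1] + 1,      # insert a letter into a
                previous[j - 1] + cost,  # substitute (or keep) a letter
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, words, max_distance):
    """Return the words within max_distance edits of the query."""
    query = query.lower()
    return [w for w in words if edit_distance(query, w.lower()) <= max_distance]


spellings = ["Mendeleef", "Mendeleeff", "Mendelssohn"]
hits = fuzzy_matches("mendeleev", spellings, max_distance=2)
```

With a distance of two, ‘Mendeleef’ (one substitution) and ‘Mendeleeff’ (one substitution and one insertion) are both found, while ‘Mendelssohn’ is too far away to match.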

Turning the Tables

The single worst class of books for recognition errors are those that contain a lot of tables. Although some smaller tables are recognised correctly, most tables in the Ultrapedia Library are not. Unless the original scanned document is in pristine condition and printed clearly, with plenty of space between the table elements the accuracy of the recognised table will be low. Almanacs, books on statistics, and anything with the word ‘table’ in the title will consequently be poorly recognised.

Other classes of books that recognise poorly are those with lots of mathematical formulae, algebra books being the worst. Books with tables of logarithms, or tangents etc. are likewise not to be trusted, as well as being entirely superfluous. Maps are another example of poor recognition candidates, likewise with books that contain musical scores.

The OCR engine can get so confused that it sometimes reorients the page from portrait to landscape before doing the recognition with disastrous results. Fortunately, you will normally never see examples of these bad pages unless you deliberately search for them.

The Fonts of all Knowledge.

One of the more perplexing problems we are trying to solve is a scheme to rationalise all the fonts used in the books. Our success in recreating the actual ‘look and feel’ of each book is limited by our inability to make a more exact match between the printed fonts and the on-screen fonts.

Drop Caps and Large Initials

Certain books in the Ultrapedia Library are printed with a large ‘initial’ at the start of some chapters and paragraphs. Although our OCR system can sometimes detect and correct these anomalies it can give the text a ragged appearance, as the text will be reflowed, justified or hyphenated incorrectly.

The usual way of printing large initials is with the large initial aligned with the rest of the text on the line. If a large initial is aligned with the ‘top’ of the other characters in the line it is called a ‘Drop Cap’ which is much more problematical for us, as we cannot auto-correct it. Consequently, you may occasionally come across pages with enormous initials.

Recognising a problem.

Anyone who has spent any time reading through some of the millions of pages in the Ultrapedia Library will have come across recognition errors. These errors occur for a variety of reasons, some of which I will try to explain.

Ligatures and Blackletter text &.

Ligatures occur where two or more letters are joined together, and were a labour-saving device predominantly utilized by typesetters in the late eighteenth and early nineteenth centuries.

The OCR programs we use at Ultrapedia are optimized for recognising ‘modern’ fonts. If you printed out this page and then fed the page into our OCR system it would be recognised with 100 percent accuracy. This is because the system has been ‘trained’ so that it ‘knows’ the difference between an ‘O’ and an ‘0’. Generally speaking – the older the book, the older the font that the book was printed in.

WikiPedia has more information on typesetting technology over the ages, but for this explanation I shall use books from the early nineteenth century as an example. Some of the more common ligatures of this era are ‘fi’, ‘ct’, ‘ae’, and ‘ff’.

When the OCR system encounters one of these ligatures it simply cannot make sense of it. Instead, the system makes a ‘best guess’. We detect and correct some of these errors during spell-checking, but not always. The only OCR program that supports recognition of Blackletter and Fraktur fonts is ABBYY FineReader XIX. Unfortunately, the cost of using this program is prohibitive as they charge for each page. The Metadata Engine Project has more information on the problems associated with the recognition of Fraktur and Blackletter fonts.
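Where an OCR program does manage to output a ligature as a single character, expanding it back into ordinary letters is straightforward. The Python sketch below is a general technique rather than a description of our workflow: it relies on Unicode ‘compatibility’ normalization for ligatures such as ‘fi’ and ‘ff’, plus a small hand-made table (my own assumption) for ‘ae’ and ‘oe’, which normalization leaves alone. The ‘ct’ ligature has no standard Unicode character, so it cannot be handled this way.

```python
import unicodedata

# Ligatures that NFKC normalization does not expand. This table is an
# assumption for illustration and is far from exhaustive.
EXTRA_LIGATURES = str.maketrans({
    "\u00e6": "ae",  # ae ligature, lower case
    "\u00c6": "AE",  # ae ligature, upper case
    "\u0153": "oe",  # oe ligature, lower case
    "\u0152": "OE",  # oe ligature, upper case
})


def expand_ligatures(text):
    """Replace ligature characters with their separate letters."""
    # NFKC expands the fi and ff ligatures (U+FB01, U+FB00) among others.
    expanded = unicodedata.normalize("NFKC", text)
    return expanded.translate(EXTRA_LIGATURES)
```

For example, `expand_ligatures("\ufb01sh")` returns `"fish"`. Note that NFKC also folds other compatibility characters, such as the long ‘s’ found in older books, so it changes more than just ligatures.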

Dual Language Books

Although our OCR system has the ability to recognise a book printed in two different languages we have not done this yet because of time constraints. Consequently, certain types of books – English to French dictionaries for example, are bound to contain many recognition errors because we have only instructed the OCR program to recognize English text. Recognition errors occur because the English language does not contain accents, and when the OCR comes across a ‘grave’ or ‘cedilla’ in a word it can become confused.

Multiple Quote Marks, and Marginalia

Certain books in the Ultrapedia Library use unusual Quotation marks, or “Quote” marks. Instead of using a single quote mark to indicate the beginning of a quotation, and another quote mark to indicate the end, these unusual books use quote marks at the beginning of each line, and another at the end. Misrecognition of these quote marks gives the text a ‘ragged’ appearance, and sometimes looks as if someone has drawn vertical lines around the paragraph.

Marginalia has proven to be a particular problem in certain Book Catalogues, and also in Biblical, and ecclesiastical works.

Welcome to the Ultrapedia Blog

Here you will find details of the latest books that we have recognised. So far we have recognised over 35 thousand public domain and out of copyright books. We have made around ten thousand books from the library available for full text search and retrieval at our website here.

Entering a search term into the Ultrapedia search engine will return a list of books. Clicking on the title of the book will display the first page in the book that the search term appears in. The books are stored on our servers as Adobe PDF files. If you have Adobe Acrobat Reader or Professional the selected page will open there.
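The idea of jumping straight to the first page containing a search term can be illustrated with a small page-level index in Python. This is a toy sketch only: the function names and the dictionary-of-pages layout are assumptions for the example, and the real search engine is a good deal more sophisticated.

```python
import re
from collections import defaultdict


def build_index(books):
    """Map each word to the pages it appears on, book by book.

    `books` maps a title to its list of page texts, in page order.
    The result maps word -> title -> list of 1-based page numbers.
    """
    index = defaultdict(lambda: defaultdict(list))
    for title, pages in books.items():
        for number, text in enumerate(pages, start=1):
            for word in set(re.findall(r"[a-z]+", text.lower())):
                index[word][title].append(number)
    return index


def first_hits(index, term):
    """For each book containing the term, return its first matching page."""
    matches = index.get(term.lower(), {})
    return {title: pages[0] for title, pages in matches.items()}


library = {
    "Spice Islands": ["A voyage east", "The nutmeg harvest", "More nutmeg"],
    "Queen Victoria": ["Early life", "The coronation"],
}
hits = first_hits(build_index(library), "nutmeg")
```

Here `hits` contains only ‘Spice Islands’, pointing at page 2, which is the page that would be opened when the title is clicked.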

To navigate to the next book containing your search term click the ‘Next Doc’ button. Each page in the book is stored as an individual PDF document during initial testing, so the ‘First Hit’, ‘Next Hit’, and ‘Prev Hit’ are non-functional.

All the books in the Ultrapedia Library are recognized and stored so as to retain the same page formatting and fonts as the original book. The entire Ultrapedia Library has also been spell-checked and corrected with an overall success rate of approximately 95% over the entire collection.

35 Thousand recognised books… and counting