Beyond V3s – What’s Next?

We are already part way through creating V4s of books in the Ultrapedia Library. Perhaps we would feel like we were making a bit more progress than we actually are if we had started at the letter ‘Z’ instead of ‘A’ like we did. The January 2008 edition of the Ultrapedia Library has nearly 1400 titles listed under the ‘A’ category.

Before I say more about the specifics of V4s I would first like to thank Bruji for introducing me to the concept of the Silent Update. I do this because I expect the introduction of the forthcoming Ultrapedia V4s to be silent in true Bruji tradition. There is something almost preternatural about using Bookpedia. It feels as if Conor and Nora pick one or two feature requests every week from the forums, and then implement these new features to see if anyone notices.

It is my vain hope that updating the Ultrapedia Library to V4 will be just as silent.

Creating V4s is a Manual Task and simply involves the extraction of each book’s Index and Table of Contents. You might well ask why it took us so long to realize that indexing an index was not such a hot idea… If I ever find out the answer I’ll be sure to let you know…

In the meantime, however, the new and improved V4s will be introduced first on Ultrapedia Search.

Not Only… but Also…

Hand in hand with creating V4s, we have also created V5s of the first 1000 or so books in the Ultrapedia Library. We started making V5s when we started to see a growing number of ‘Mutilated Pages’ or ‘Gross Errors’ creeping into the V3 collections. I made a quick reference to this problem in my blog entry Turning the Tables.

We actually create the ‘V5s’ from ‘V1s’. To create a V5 we delete everything except ‘Plate’ type images from each book, so a typical V5 page will consist of two discrete parts – the plate image itself, and the textual ‘Legend’ or description of the plate – the bit of text under the picture.

We have decided to use the fantastic JALBUM to display the V5s in a ‘Gallery Format’, but these galleries have not been opened to the public yet. You can keep track of our progress in creating V5s by joining the ‘V5’ project in the forums.

Is that a V8 in your Pocket?

I think now would be a good time to recap on the V-numbers we have used so far – so here goes.

V0 – The book is not suitable for OCR
V1 – The book is a good candidate for OCR
V2 – Only used in-house
V3 – The book has been OCR’d and published on the website
V4 – Presently used in-house only – see above
V5 – Presently used in-house only – see above
V6 – There is no V6 yet
V7 – There is no V7 yet
V8 – Presently used in-house only for ‘Proof-Reading’

We have only ever produced a few dozen V8s, none of which are currently available online. We created the V8s initially in response to the looks of bewilderment on the faces of friends, family and colleagues whom we subjected to many a long-winded demonstration of Ultrapedia and its precursors.

The V8 is a classic case of a picture being worth a thousand words. A V8 is made by ‘collating’ the V1 and V3 versions of a book into a single PDF that displays the V1 page alongside the V3 page. In fact it’s such an effective demonstration of the benefits of OCR that I now feel compelled to create a few more V8s to help prove my point.

Displaying the unrecognized (V1) and recognized (V3) pages side by side on a suitably large screen, like the Apple 30-inch Cinema Displays we use in-house, actually generated more than a few ‘WOW’s, and at least once a jaw motion was changed from a yawn to a drop.

Self-congratulatory comments aside, even now I think I would have been better off just including a couple of V8 screen-shots. I will track down a few V8s and post them later on.

Copyright, Fair Use, Creative Commons etc.

 
Also Posted in the Ultrapedia Forums
Without Prejudice…
We are not trying to make any enemies here at Secret Studio, and we do not want to impinge on someone else’s copyright – dead or alive. Copyright around the world is changing, and we can’t keep up with all the rules.

So, if you believe that such and such a book is not in the public domain, please let us know either by email or by posting in the Copyright Issues forum.

Small Print
If by some (albeit very slim) chance you are the author of one of the books in the Ultrapedia Library, I would first like to say thank you, and secondly I would like to apologise if making your book available online has caused you any financial distress. Thirdly, I invite you to post any useful health and lifestyle tips you might like to share, because the chances are that you are at least 95 years old.

If it seems like I am playing a bit fast and loose with this whole copyright thing, well, I assure you that I am not. And to be honest – neither are you! Just to get this far into Ultrapedia you would have had to page through the Secret Studio terms of use and license agreement – riveting stuff, you must admit.

What – you mean you didn’t read it? Good! Because it was just some boiler-plate text that we licensed, OK? I think of terms and conditions (T&C) as akin to a safety-net for a tightrope walker. It is a sad reflection on this day and age that something we are essentially giving away for free needs an asterisk *

* see Small Print

How to get the best from the Ultrapedia website

Thanks for helping beta-test the new Ultrapedia website. We sincerely hope that you enjoy exploring the new Ultrapedia Website as much as we have enjoyed making it.

If you are new to Ultrapedia, here is a brief history of the project. Ultrapedia is one of several projects currently in progress at Secret Studio, who are, among other things, experts in digitizing all kinds of books and documents. Secret Studio operates one of the world’s largest OCR (Optical Character Recognition) Farms. Our servers are fed a steady stream of books in the form of large PDF image files, which the servers in the OCR Farm process into Recognised versions of the book or publication.

The OCR technique we employ is very thorough and, uniquely, retains the look and feel of the original written book. This allows the search engine to deliver almost a ‘carbon copy’ of the original page from the book, and not just a text extract from the original.
Optical Character Recognition not only shrinks the size of the books by up to 90%, but it also allows us to create a fully searchable version of the book. As new books are processed they are indexed, split into single pages, and then published to the Ultrapedia Library Search.

We launched Ultrapedia Library Search in April 2007. At that time we had indexed 2,760,581 individual pages. Our OCR Farm has continued churning away 24/7, and the Ultrapedia Library Search now contains 5,977,204 pages. Over the last eight months we have added approximately 400 thousand pages a month to the library.
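
As a quick sanity check on that monthly figure, here is a minimal Python sketch of the arithmetic. The two page counts and the eight-month span are taken from the paragraph above; the variable names are purely illustrative.

```python
# Page counts quoted above; the eight-month span is also from the text.
pages_at_launch = 2_760_581   # indexed pages at launch (April 2007)
pages_now = 5_977_204         # indexed pages at the time of writing
months_elapsed = 8

pages_added = pages_now - pages_at_launch
average_per_month = pages_added / months_elapsed

print(f"Pages added since launch: {pages_added:,}")
print(f"Average per month: {average_per_month:,.0f}")
```

The difference works out to a little over 3.2 million pages, which is where the "approximately 400 thousand pages a month" comes from.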

We have created this beta site as a jumping off point to introduce a browseable version of the Ultrapedia Library alongside the traditional Ultrapedia Search Engine. To help keep track of the fusion (or is it confusion?) of these two vast information resources we have also incorporated the Ultrapedia Blog and the Ultrapedia Support Forum.

To sum up: the Ultrapedia Search Engine has been designed to deliver single pages from the Ultrapedia Library, and the Ultrapedia Library Browser has been designed so you can browse through the library and download books – if you are a registered user.

Just like the rest of the site, the User Login and Registration systems are also under construction and may not work all the time. If your login or registration fails, please try again later.
 
Remember! If you do get a login failure you can always access the Ultrapedia Blog, where we will post live status reports of our servers.

Important! We are not currently accepting New User Registrations from ‘ScreenName’ type email accounts such as MSN Hotmail, Yahoo Mail etc. … sorry.

How we do it.

 
Our OCR Farm runs 24 hours a day, and we like to keep it busy. Before we add a book to the OCR Queue we ‘Top and Tail’ the book, removing blank and extraneous pages to leave the Title Page of the book as the new first page. We then remove any other blank or unrecognisable pages from the book as well. We use Adobe Acrobat Professional for this stage of the editing process.

The OCR Farm can be configured to recognise books in many languages, but for now most of the books in the search engine are English. We therefore have a backlog of about 15 thousand non-English books that we have recognized but not yet indexed. We are experimenting with several other search engines to deliver these non-English books. If you are interested in following our progress please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Lately, most of the books we have processed have come via our partner Google Book Search. As we continue to expand the Ultrapedia Library we will also be adding books from two other Secret Studio projects – The Pointmore Library and The Philatelink Library, which are both Scanned and Recognised in-house.
 
Once a book has been ‘topped and tailed’ a ‘V1’ suffix is appended to the original filename. Our OCR Engine is designed to detect these ‘V1’ files, and they are then added to the OCR Queue where they wait until a Recognition Server is available to perform Character Recognition on the book.
 
As the OCR Farm outputs recognised books, the suffix on the book’s filename is changed from ‘V1’ to ‘V2’. When a book reaches the ‘V2’ stage the newly recognised book enters a ‘Workflow’ where several other enhancements are performed.
 
The first stage in the Workflow is a continuous ‘Batch’ Operation that monitors for new ‘V2’ stage files and then embeds the page thumbnails and sets the ‘Open View’ options of the PDF to aid verification. The next part of the Workflow adds ‘Headers’ and ‘Footers’ to the file. Each book is then ‘Page Checked’ to ensure there are no unrecognized pages or ‘Gross Errors’. We also ensure the book contains valid ‘Metadata’ for ‘Book Title’ and ‘Author’. OCR can be a tricky process – see my previous article Recognising a Problem.
 
Once a book has been recognised and verified its filename suffix is changed from ‘V2’ to ‘V3’ and the book is then split into individual pages for indexing into the Ultrapedia Library Search.
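
The suffix changes described above (V1 to V2 after recognition, V2 to V3 after verification) can be sketched as a tiny renaming helper. This is only an illustration, not our actual tooling: the `Title_V1.pdf` naming pattern, the underscore separator, and the function name `promote` are all assumptions made for the example.

```python
from pathlib import PurePath

# Stage transitions described in the text: V1 -> V2 (recognised),
# V2 -> V3 (verified and ready for splitting and indexing).
NEXT_STAGE = {"V1": "V2", "V2": "V3"}

def promote(filename: str) -> str:
    """Return the filename with its V-suffix advanced by one stage.

    Assumes a hypothetical naming pattern '<title>_<Vn>.<ext>'.
    """
    path = PurePath(filename)
    base, sep, suffix = path.stem.rpartition("_")
    if not sep or suffix not in NEXT_STAGE:
        raise ValueError(f"No next stage for {filename!r}")
    return str(path.with_name(f"{base}_{NEXT_STAGE[suffix]}{path.suffix}"))

recognised = promote("atlas_V1.pdf")   # after the OCR Farm has finished
verified = promote(recognised)         # after the Workflow checks pass
print(recognised, verified)
```

A file already at ‘V3’ raises an error in this sketch, since the text describes no automatic stage beyond it.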
 
That then is pretty much it for the Search Engine – Raw PDF images of books go in one end… and single recognised pages come out the other end, get indexed and published to the search engine. This is an ongoing project which currently yields about 400 thousand new pages monthly.
 
We haven’t quite finished with Single Page V3s yet, however. To keep track of which V3s have been indexed we then store a complete copy of the recognised book in a huge database. We use this database to compare newer revisions of the books as we sometimes discover better original copies.
 
In common with all other computer systems, Ultrapedia is not immune to GIGO – Garbage In = Garbage Out – a very succinct acronym for an OCR Farm. We keep our eyes open for newer, better, higher-resolution scans all the time. As more and more libraries join the Google Book Search program, new and better scanning techniques are often employed, which can lead us to discover what we call a ‘Replacement Candidate’ for a book currently in the live search engine.
 
We keep track of all the books we have recognized in another, smaller database, and when we come across another copy of a book we have already recognised we add the new book to a ‘Reprocess Queue’ for recognition and exhaustive cross-comparison of the older (live) version and the newer ‘Replacement Candidates’. If you are interested in following the ‘life cycle’ of a replacement candidate V3 please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Various checks are done on the two files:
 
Word Count and Comparison
Spell Check and Comparison
Image Check and Comparison
 
If the ‘Replacement Candidate’ proves to have fewer errors than the ‘live’ version, the new version is indexed into the search engine, and the old version is removed. For an up to date list of ‘Replacement Candidates’ please visit the PROJECT TOOLS section in the Ultrapedia forums.
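
To give a flavour of the word count and spell check comparisons listed above, here is a minimal Python sketch. It is an assumption-laden illustration rather than our production checks: the tiny word list, the sample strings, and the rule of counting any word missing from the list as an error are all invented for the example, and the image check is left out entirely.

```python
import re

def words(text: str) -> list[str]:
    """Split recognised text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())

def spelling_errors(text: str, dictionary: set[str]) -> int:
    """Count words that do not appear in the dictionary."""
    return sum(1 for word in words(text) if word not in dictionary)

def prefer_candidate(live: str, candidate: str, dictionary: set[str]) -> bool:
    """True if the candidate has strictly fewer spelling errors than the live text."""
    return spelling_errors(candidate, dictionary) < spelling_errors(live, dictionary)

# Hypothetical sample data: the same line recognised from two different scans.
dictionary = {"the", "quick", "brown", "fox"}
live_text = "tbe quick brovvn fox"        # typical OCR slips from a poor scan
candidate_text = "the quick brown fox"    # cleaner scan of the same line

print(len(words(live_text)), len(words(candidate_text)))   # word counts
print(prefer_candidate(live_text, candidate_text, dictionary))
```

Matching word counts suggest the two scans cover the same text, and the lower error count is what would tip the decision towards the newer copy in this simplified version.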