Vietnam Center & Archive News and Updates

Friday, March 18, 2011

Searchable PDFs

Over the past nine months, the staff of the Vietnam Archive have undertaken a project to convert all of the digitized document PDFs in the Virtual Vietnam Archive into a searchable format. This project is now complete, with over 275,000 files re-processed.

When the Virtual Archive was started in 2001, the majority of Internet users were on dial-up connections, so to keep file sizes as small as possible, the decision was made not to save digitized documents as searchable PDFs. OCR (Optical Character Recognition) was performed on the documents, and the generated text was added to the database for searching, but the PDF was not saved with that text embedded in it. Now that more users have access to high-speed connections and better compression technology is available for PDFs, we have gone back and embedded the OCR text into these documents, allowing users to search for words within the PDF.

To search for a word or phrase in a PDF, open the file and press Ctrl+F on your keyboard. Please note that the quality of the document affects the quality of the OCR text. If the digital image of the document is very scratchy or grainy, the OCR may not have picked out as many words as it would from a higher-quality document. The CDEC collection is a good example of this: the microfilm that was digitized was of poor quality, and we were therefore unable to run successful OCR on that collection. Additionally, handwritten documents will not be searchable.

Some users may need to update to a newer version of their PDF reader. Versions of Adobe Reader older than version 6 will not be able to open the modified PDFs or PDFs newly added to the Virtual Archive. Adobe Reader can be downloaded free from the Adobe website: http://get.adobe.com/reader/?promoid=BUIGO
