ePUB Illustrated

Repurposing Legacy Documents: To Scan, or Not to Scan!

Guest Authored by:

Dave McCloskey
dmccloskey@gpo.gov

 

About The Author


David McCloskey has served as a CD-ROM Specialist in the Database Retrieval section within GPO since 1993.

Currently, he serves as Chief and has been a database developer/data publishing expert for over 9 years, advising customer agencies on their database endeavors.

 


 

Legacy data is all around us. It’s that old form you keep copying and that your agency, for some unknown reason, is using all the time now. It’s that three-ring binder training manual that never got updated enough to warrant the expense of a complete electronic overhaul. Or it’s that ton of microfiche that always made you wonder how in the world the documents got that small. So your boss candidly, and with a straight face, advises that you are now in charge of repurposing these legacy documents. “Do what? Repurpose? There was nothing wrong with the documents’ purpose in the first place!” Ah, what you are now tasked with doing is basic document housekeeping. You must convert these ancient papyrus and script documents into current-day, meaningful electronic information packets. “Great!” you say, “Let’s scan these babies into electronic whatever and we’re done… right?” Well, the old adage “a picture is worth a thousand words” is not true in this case.

Capturing legacy data in an electronic format, typically by scanning hard copy, is often not enough. More and more in today’s high-tech world, your legacy data needs to fulfill an electronic function. This function is far more complicated than a mere “picture” of your legacy data on a computer screen. The data may need to be searched, filled out, or displayed on the Web. Maybe the data is sent via e-mail to a printer to print on paper, or any combination thereof. One must “repurpose” the data from hard copy (or other formats) into this new, meaningful electronic format with thought and foresight about the document’s new electronic function(s). “Whoa! Now where do I start?”

Let’s begin with the difference between “scanning” and “scanning with Optical Character Recognition (OCR)” to render electronic text. Scanning is simply “taking a picture” of the document and saving the picture electronically as screen pixels, each in either an on or off condition. OCR software, in its basic form, literally reads the pixels and recognizes the patterns that we humans visually recognize as characters. The OCR software then arranges the “best guessed” characters into words, which it may then check against an electronic dictionary. These “best guessed” words are saved as actual characters in a file instead of on/off pixels. Some OCR software is capable of recognizing line rules, borders, graphics, fonts, font sizes, and the text’s position on the page, and then saving the file in a popular publishing software format for proofing and editing.
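To make the pixels-to-characters idea concrete, here is a toy sketch in Python. It is not how any commercial OCR package works; the 3x5 glyph templates, the function names, and the tiny dictionary are all made up for illustration. It only shows the basic steps described above: compare on/off pixels against known patterns, take the best guess, and check the resulting word against a dictionary.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a hypothetical 3x5 grid of on ('#') and off ('.') pixels.
TEMPLATES = {
    "B": ("##.", "#.#", "##.", "#.#", "##."),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "8": ("###", "#.#", "###", "#.#", "###"),
}


def pixel_distance(a, b):
    """Count the pixels that differ between two equally sized glyph bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def best_guess(glyph):
    """Return the template character whose pixels differ least from the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


def recognize(glyphs, dictionary):
    """Build a word from best-guessed characters and check it against a dictionary."""
    word = "".join(best_guess(g) for g in glyphs)
    return word, word in dictionary
```

Notice how few pixels separate “B”, “O”, and “8” in these templates. A speck of dirt or a faded stroke on the original is all it takes for the best guess to be the wrong one.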

“Ok, so what’s the big deal? Just scan my document and OCR it, right?” Not necessarily. Scanning is only one method of getting legacy data into its electronic counterpart, and although it is the most popular, it is not always the best or the cheapest. One must ask the question: How much is this legacy data “worth”? If your agency mandate requires only that the information be made available to the public, simply scanning the document may be for you. However, for most of us, this legacy material must be repurposed. Repurposed material is going to be printed in a publication, pressed onto a CD-ROM, or put up on the Web to be searched, and it will be edited and maintained in your office as a “multi-purpose repurposed legacy electronic document!” (Wow, high-tech lingo, you have to love it.) That raises a final question: repurposed into what form?

In some cases, the legacy data is delicate, yellowed, marked-up, hand-composited hard copy, not to mention that it’s generally one of a kind. All of these factors will add to the overall scanning costs, especially if the pages must be hand fed into the scanner. Your data may not be a candidate for scanning at all and may have to be keyed. Keying is often thought of as archaic and, therefore, expensive. That’s not always true! Keying data still has its niche in the market. Sometimes keying is done in the final “format” required in your contract and, therefore, no electronic conversions are necessary. Keying may be done twice, simultaneously, with the two results compared against each other for proofing purposes, again lowering costs. Or your “one of a kind, only one on the planet” type of document may not be able to withstand the harsh lights of the scanner, or may be too brittle to withstand handling. In these instances, scanning may not be the best method for your legacy data.
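The “keyed twice and compared” idea is easy to picture in code. The following is a minimal Python sketch, assuming each keying pass arrives as plain text with matching line breaks; the function name and the sample text are hypothetical. Any line where the two operators disagree is kicked out for a human to check against the original.

```python
from itertools import zip_longest


def compare_keyings(first_pass, second_pass):
    """List every line where two independently keyed copies disagree.

    Returns (line_number, first_text, second_text) tuples. A line missing
    from one copy is treated as an empty string.
    """
    pairs = zip_longest(first_pass.splitlines(), second_pass.splitlines(), fillvalue="")
    return [
        (number, a, b)
        for number, (a, b) in enumerate(pairs, start=1)
        if a != b
    ]
```

Lines on which both operators agree are assumed to be correct, so the proofreader only looks at the disagreements. That is where the savings come from.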

For argument’s sake, let’s just say that your data is clean copy, not fragile or one of a kind, and a good candidate for scanning. Once the data has been scanned, it is usually OCR’d in a second step. OCR is often costly, and the accuracy of the recognition function depends largely upon the condition of the original document. There are also many grades of OCR software, ranging from the relatively inexpensive to the far end of the dollar spectrum. Typically with OCR software, the more bells, whistles, and capability, the more expensive. The latest software packages boast a very high level of accuracy. I personally have been less impressed with real-world results. The bottom line is, you get what you pay for… I should say, eventually you will pay for what the contractor gets. OCR software does most of the work, but because it isn’t perfect, there is no better proofing mechanism than the human eye, expensive as it is. On occasion, I have witnessed un-proofed OCR input such as 808 being processed as BOB, and sometimes worse. Proofing, being a human resource, is a costly necessity, but un-proofed data can be very embarrassing, not to mention inaccurate, and accuracy will be important to you later on for functions such as reliable searching.
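One way to keep the human proofing bill down is to point the proofreader at the likeliest trouble spots first. Here is a rough Python sketch of that idea. The list of look-alike characters and the function name are my own assumptions for illustration, not a feature of any particular OCR package.

```python
import re

# Assumed examples of digit/letter pairs that OCR commonly confuses.
CONFUSABLE = {"0": "O", "1": "l", "5": "S", "8": "B"}
LOOKALIKES = set(CONFUSABLE) | set(CONFUSABLE.values())


def flag_for_proofing(text, known_words):
    """Return tokens that contain look-alike characters and are not known words."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_lookalike = any(ch in LOOKALIKES for ch in token)
        if has_lookalike and token.lower() not in known_words:
            flagged.append(token)
    return flagged
```

Given the text “Send form BOB to room 12” and a word list containing “send”, “form”, “to”, and “room”, this flags “BOB” and “12” for a second look. It is a triage aid only; it cannot tell you whether “BOB” was really “808”, which is why the human eye is still needed.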

Now that the sticker shock of scanning, OCR, and proofing has bitten into your budget, you have to consider the format in which you want to receive your data. This may or may not be a separate line item in your contract; what matters is that you know what the output will be when you get your data back. This step will require some thought. You must research how the data is being used currently, how it will be used immediately, and how it may be used in the future. Here is a quick outline to follow, with some common formats for each item. It is important to discuss with each group below what its particular file format requirements will be.

1. How will the data be used? By whom and at what levels, such as research, entertainment, informational, data collecting, etc.? This is known as a data analysis and can incorporate many other aspects of your data and layout.

2. What is the most important electronic function of the data? Is it a form to recover data, is it for searching or research, or is it for printing and publishing, etc.? This is a good place to see in what format the data should reside and for whom. The necessary accuracy level of the OCR software and proofing can be determined here as well.

a. Data editing and maintenance group: who will maintain and edit the data and distribute it in the other formats (SGML, popular publishing software, and spreadsheets)? This is normally where the source data for distribution is maintained and deposited.

b. Printing and typesetting group (typeset languages and popular publishing software formats)?

c. Web publishing group (ASCII, HTML, XML, or PDF formats)?

d. CD-ROM group (most formats, usually comparable to Web formats)?

e. A combination of the above groups, or possibly a group yet undetermined.

 

3. Is the data going to be searchable by any or all of the above groups (ASCII, HTML, and PDF)? Some search software requires a specific data input format.

4. Is there a group needed for data recovery such as forms and comments (PDF, HTML, or ASCII)?

5. Will the data be altered often? If so, will it be maintained in a format that is easily distributed and converted for all the users down the line?

6. Should variances be allowed for processes that may be done in the future? An example would be database connectivity (ODBC), the ability to connect to other databases, or compatibility with video, multimedia, DVD, etc.
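The outline above can be boiled down to a simple lookup table. This is only a sketch in Python: the group names and the function are hypothetical, the formats are the ones listed in the outline, and the CD-ROM entry assumes the same formats as the Web group, since the two are described as usually comparable.

```python
# Formats each group is likely to ask for, taken from the outline above.
# The CD-ROM entry is an assumption: "usually comparable to Web formats".
FORMATS_BY_GROUP = {
    "editing_and_maintenance": {"SGML", "publishing software", "spreadsheet"},
    "printing_and_typesetting": {"typeset language", "publishing software"},
    "web": {"ASCII", "HTML", "XML", "PDF"},
    "cd_rom": {"ASCII", "HTML", "XML", "PDF"},
}


def shared_formats(groups):
    """Return the formats that every listed group can accept, sorted by name."""
    sets = [FORMATS_BY_GROUP[group] for group in groups]
    return sorted(set.intersection(*sets)) if sets else []
```

Asking for the formats shared by the editing and printing groups returns only the publishing software format, which hints at why a single well-chosen maintenance format can feed more than one group down the line. Your own table will differ; the point is to write it down before the contract is signed.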

 

As you can see, this could get very involved and, in fact, one could make a career of formulating these kinds of workflow diagrams and logistics for each. So let’s take a more down-to-earth example.

We have already determined “my legacy data” to be a good candidate for scanning because:
1. It’s good, clean copy on one side of the sheet, with no bleed-through from the other side. It has a lot of text, tables, and a few graphics (pictures). It’s one color, black ink on white paper, and of a standard size (8-1/2" x 11" or similar).

2. It’s a small job and could be scanned with off-the-shelf hardware and software (meaning that it’s not a unique situation such as, say, microfiche scanning).

3. It’s not fragile and there are many copies so some can be destroyed if an error occurs.

4. It contains some forms. (Scanning is the fastest/cheapest way to get a form to an electronic format.)

My legacy data must be searchable and editable as our commissioner is determined that we will now update this material on a monthly basis for a monthly printed publication, CD-ROM, and for the Web. Scanning with OCR and proofing is now necessary [no matter the cost]. I’m not too concerned about future implementations of the data because I’m sure I’ll be promoted for doing a great job on this project.

The bulk of the work will be editing and maintaining the data. This we will do here in our agency shop to keep costs down. Therefore, the most important electronic function is Editing and Maintenance (E&M, the same group of people in my imaginary agency). E&M will receive the OCR’d files from our contractor in a popular publishing software format because that is what E&M already knows well. From this, E&M will supply all the other areas down the line. The same popular publishing software format will be distributed to the Web folks (who like to work their own magic on the final HTML files for the Web). E&M is capable of producing PDF files for the CD-ROM folks and for the printing contractor as well. The forms in the document will definitely go out as PDF files, as this works very well for both the CD and Web people. Not too bad, huh!

 

While we’re on the subject, what about PDF? Why didn’t we just scan everything into PDF and use Acrobat Capture (Adobe’s OCR software)? PDF files are a very inexpensive way to distribute data. But cost was not the most important factor here; keeping my job at my pretend agency was. The printed quality of the files would only be as good as the original scan, and since the original was of average quality, the quality of the printed publication could only be average. PDF files also don’t work as maintenance files, because you can’t edit them as rigorously as one must in a publishing environment. The Acrobat Capture (OCR) software isn’t the greatest, but it does offer two methods of viewing the OCR data. One, it allows the user to read the scanned image while the OCR text is “hidden” for searching purposes. Two, it allows Acrobat to OCR the page and display what it perceives the words, fonts, and text placements to be. The catch with the latter method is that the Acrobat OCR “Text” Editor is abysmal, so correcting the fonts and spelling errors of the OCR PDF file would be far too time-consuming.

So what is PDF good for? Plenty. If there is no need for a master database for monthly corrections and the sole purpose is to disseminate your legacy document to the public, well then, have we got a deal for you! We would scan the document directly into PDF via a network scanner connected to a specialist’s workstation. Since this is only a “picture” PDF file (it contains no character text), the specialist would OCR it with Acrobat Capture. We use the scanned image with hidden text method described above, which lets the user read the actual document and also search the document’s “hidden” text. Since we did not proof the data, to save time, we get the occasional inaccurate search, but the end user will never read a misspelling. With Acrobat you get wonderful built-in features such as bookmarks, thumbnails, and linking. The form field function is one of the best free fillable forms software packages out there. The Web folks may also use these same PDF files with little or no modification to the format or publication structure.
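Here is a small Python sketch of why the scanned image with hidden text method behaves the way it does. This is not Acrobat’s internal format; the class, field names, and file names are made up for illustration. The reader is always shown the page image, while searching runs only against the un-proofed OCR text sitting behind it.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    number: int
    image_file: str   # the picture of the page, which is what the reader sees
    hidden_text: str  # un-proofed OCR output, used only for searching


def search(pages, term):
    """Return the numbers of the pages whose hidden OCR text contains the term."""
    term = term.lower()
    return [page.number for page in pages if term in page.hidden_text.lower()]


pages = [
    ScannedPage(1, "page1.tif", "Instructions for Form BOB"),  # the original says 808
    ScannedPage(2, "page2.tif", "Mail the completed form to your agency"),
]
```

A search for “form” finds both pages, but a search for “808” finds nothing, because the OCR read it as “BOB”. The reader looking at the image of page 1 still sees a perfectly printed “808”. That is the trade-off in a nutshell: no visible misspellings, but the occasional missed search.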

You must determine your route using what I have outlined above. Calculate your costs in terms of money, time, and document value. Ask the experts and use their advice to make better, more informed decisions. As you progress through the various stages, you will stop and truly begin to wonder, “How much is this legacy data really worth?”

Keep in mind that in GPO, all of the duties described above are performed by a single section that has been in existence for more than 10 years, the Database Retrieval/Distribution Development Section. This little-known section, commonly called the CD-ROM Section, does much more than make CD-ROMs. We have long been experts in database management, document search software and techniques, software design, data manipulation, publishing, and scanning. Most of our customers have been repurposing their data for years without knowing that they were repurposing. We realized early on that the data they were spending tons of money on might also be used for purposes other than the CD-ROM project at hand. This is how we built our reputation as leaders and experts: by offering free advice on the whole picture, not just the narrower CD-ROM function. We have worked on all types of data while getting feedback directly from the customer, unlike a contract, which usually requires a contracting officer as an intermediary. This direct communication allows immediate resolution of differences, saving time and money while limiting mistakes. We have built our reputation with the realization that our customers come to us because they want to. We have always given free advice, as we are keenly aware that a more informed customer will make a better decision. If our services interest you, you may contact me at 202-512-0675. The Database Retrieval/Distribution Development Section stands ready to serve you!


This article pertains to:


This article pertains to Print output.

This article pertains to publishing for the Web.