Guest Authored by: Dave McCloskey
Legacy data is all around us. It’s that old form you keep copying and that your agency, for some unknown reason, is using all the time now. It’s that three-ring binder training manual that never got updated enough to warrant the expense of a complete electronic overhaul. Or it’s that ton of microfiche that always made you wonder: how in the world did the documents get that small?

So your boss candidly, and with a straight face, advises that you are now in charge of repurposing these legacy documents. “Do what? Repurpose? There was nothing wrong with the documents’ purpose in the first place!” Ah, what you are now tasked with doing is basic document housekeeping. You must convert these ancient papyrus and script documents into current-day, meaningful electronic information packets.

“Great!” you say. “Let’s scan these babies into electronic whatever and we’re done… right?” Well, the old adage “a picture is worth a thousand words” is not true in this case. Capturing legacy data in an electronic format, typically by scanning hard copy, is often not enough. More and more in today’s high-tech world, your legacy data needs to fulfill an electronic function. This function is far more complicated than a mere “picture” of your legacy data on a computer screen. The data may need to be searched, filled out, or displayed on the Web. Maybe the data is sent via e-mail to a printer to print on paper, or any combination thereof. One must “repurpose” the data from hard copy (or other formats) into this new, meaningful electronic format with thought and foresight about the document’s new electronic function(s).

“Whoa! Now where do I start?” Let’s begin with the difference between “scanning” and “scanning with Optical Character Recognition (OCR)” to render electronic text. Scanning is simply “taking a picture” of the document and saving the picture electronically as pixels, each in either an on or off condition.
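To make the idea of a “picture” concrete, here is a minimal sketch in Python of what a scan holds. The tiny 5x5 bitmaps and the names `GLYPH_H`, `GLYPH_I`, `scan_line`, and `render` are invented for illustration; a real scan is just a much larger grid of the same on/off values.

```python
# A scanned page is only a grid of pixels: 1 = on (ink), 0 = off (blank).
# These 5x5 bitmaps are invented for illustration.
GLYPH_H = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
GLYPH_I = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]


def scan_line(glyphs):
    """Place glyph bitmaps side by side, as a scanner would record a line."""
    rows = []
    for r in range(len(glyphs[0])):
        row = []
        for glyph in glyphs:
            row.extend(glyph[r] + [0])  # one blank column after each glyph
        rows.append(row)
    return rows


def render(bitmap):
    """Draw the bitmap with '#' for on pixels and '.' for off pixels."""
    return "\n".join("".join("#" if px else "." for px in row) for row in bitmap)


scanned = scan_line([GLYPH_H, GLYPH_I])
print(render(scanned))

# A human reads "HI" in the picture, but the file holds no characters at all,
# so a text search has nothing to find.
contains_text = "HI" in render(scanned)
print("Searchable for 'HI'?", contains_text)
```

A person looking at the rendered grid sees the word, yet the data contains only pixel values, which is why a scan by itself cannot be searched or edited as text.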
OCR software, in its basic form, literally reads the pixels and recognizes the patterns that we humans visually recognize as characters. The OCR software then arranges its “best guessed” characters into words, which it may then check against an electronic dictionary. These “best guessed” words are saved as actual characters in a file instead of on/off pixels. Some OCR software is capable of recognizing line rules, borders, graphics, fonts, font sizes, and the text’s position on the page, and then saving the file in a popular publishing software format for proofing and editing.

“Ok, so what’s the big deal? Just scan my document and OCR it, right?” Not necessarily. Scanning is one method of getting legacy data into its electronic counterpart, and although it is the most popular, it is not always the best or cheapest. One must ask the question: How much is this legacy data “worth”? If your agency mandate requires only that the information be made available to the public, simply scanning the document may be for you. However, for most of us, this legacy material must be repurposed. Repurposed material is going to be printed in a publication, pressed onto a CD-ROM, or put up on the Web to be searched, and it will be edited and maintained in your office as a “multi-purpose repurposed legacy electronic document!” (Wow, high-tech lingo, you have to love it.) That raises a final question: repurposed into what form?

In some cases, the legacy data is delicate, yellowed, marked-up, hand-composited hard copy, not to mention generally one of a kind. Undoubtedly, all of these factors will add to the overall scanning costs, especially if the pages are hand-fed into the scanner. Your data may not be a candidate for scanning at all and may have to be keyed. Keying is often thought of as archaic and, therefore, expensive. That’s not always true! Keying data still has its niche in the market.
Sometimes keying is done in the final “format” required in your contract and, therefore, no electronic conversions are necessary. Keying may be done twice, simultaneously, with the two results compared against each other for proofing purposes, again lowering costs. Or your “one of a kind, only one on the planet” type of document may not be able to withstand the harsh lights of the scanner, or may be too brittle to withstand handling. In these instances, scanning may not be the best method for your legacy data.

For argument’s sake, let’s just say that your data is clean copy, not fragile or one of a kind, and a good candidate for scanning. Once the data has been scanned, it is usually OCR’d in a second step. OCR is often costly, and the accuracy of the recognition depends largely upon the condition of the scanned original. There are also many grades of OCR software, ranging from the relatively inexpensive to the far end of the dollar spectrum. Typically, the more bells, whistles, and capability, the more expensive the software. The latest software packages boast a very high level of accuracy. I personally have been less impressed with real-world results. The bottom line is, you get what you pay for… I should say, eventually you will pay for what the contractor gets.

For the most part, OCR software does most of the work, but because it isn’t perfect, there is no better (though expensive) proofing mechanism than the human eye. On occasion, I have witnessed un-proofed OCR data such as 808 being processed as BOB, and sometimes worse. Proofing, being a human resource, is a costly necessity, but the results of not proofing could be very embarrassing, not to mention inaccurate, and accuracy will be important to you later on for functions such as reliable searching.

Now that the sticker shock of scanning, OCR, and proofing has bitten into your budget, you have to consider how, and in what format, you want to receive your data.
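The idea of capturing the same page twice and comparing the results can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular vendor’s process: the function name `flag_disagreements` and the sample sentences are hypothetical, and the comparison uses the standard library’s `difflib.SequenceMatcher` on whole words.

```python
import difflib


def flag_disagreements(pass_a, pass_b):
    """Return (word_position, text_a, text_b) wherever two transcriptions differ."""
    words_a, words_b = pass_a.split(), pass_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a human proofreader can decide which is right.
            flagged.append((i1, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return flagged


# Hypothetical sample: two independent passes over the same line of text.
first_pass = "Send form 808 to Room 12"
second_pass = "Send form BOB to Room 12"

for position, reading_a, reading_b in flag_disagreements(first_pass, second_pass):
    print(f"word {position}: {reading_a!r} vs {reading_b!r} -> needs a human look")
```

The comparison does not know which reading is correct; it only narrows the proofreader’s attention to the places where the two passes disagree, which is where the cost savings come from.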
This may or may not be a separate line item in your contract; what matters is that you know what the output will be when you get your data back. This step will require some thought. You must research how the data is being used currently, how it will be used immediately, and how it may be used in the future. Here is a quick outline to follow and some common formats with each. It is important to discuss with each area below what its particular file format requirements will be.
As you can see, this could get very involved; in fact, one could make a career of formulating these kinds of workflow diagrams and logistics for each. So let’s take a more down-to-earth for instance.
While we’re on the subject, what about PDF? Why didn’t we just scan everything into PDF and use Acrobat Capture (Adobe’s version of OCR software)? PDF files are a very inexpensive way to distribute data. But cost was not the most important factor here (keeping my job at my pretend agency was), and the printed quality of the files would only be as good as the original scan. Since the original was of average quality, the printed publication could only be average as well. PDF files also don’t work as maintenance files, because you can’t rigorously edit them as one must be able to do in a publishing environment.

The Acrobat Capture (OCR) software isn’t the greatest, but it does offer two methods of presenting the OCR data. One, it lets the user read the scanned image while the OCR text is “hidden” for searching purposes. Two, it lets Acrobat OCR the page and display what it perceives the words, fonts, and text placement to be. The catch with the latter method is that the Acrobat OCR “Text” Editor is abysmal at best, so correcting the fonts and spelling errors in the OCR’d PDF file would be far too time-consuming.

So what is PDF good for? Plenty. If there is no need for a master database for monthly corrections and the sole purpose is to disseminate your legacy document to the public, well then, have we got a deal for you! We would scan the document directly into PDF via a network scanner and send it to a specialist’s workstation. Since this is only a “picture” PDF file (it contains no character text), the specialist would OCR it with Acrobat Capture. We use the scanned-image, hidden-text method described above so that the user can read the actual document and also search the document’s “hidden” text. Since we skip proofing to save time, we get the occasional inaccurate search, but the end user will not read any misspellings. With Acrobat you also get wonderful built-in features such as bookmarks, thumbnails, and linking.
The form field function is one of the best free fillable-forms software packages out there. The Web folks may also use these same PDF files with little or no modification to the format or publication structure.

You must determine your route by what I have outlined above. Calculate your costs in terms of money, time dedication, and document value. Ask the experts and use their advice to make better, more informed decisions. As you progress through the various stages, you will stop and truly begin to wonder, “How much is this legacy data really ‘worth’?”

Keep in mind that
in GPO, all of the duties described above are performed by a single
section that has been in existence for more than 10 years: the Database
Retrieval/Distribution Development Section. This little-known section,
commonly called the CD-ROM Section, does much more than make CD-ROMs.
We have long been experts in database management, document search
software and techniques, software design, data manipulation, publishing,
and scanning. Most of our customers have been repurposing their data
for years without knowing that they were repurposing. We realized early
on that the data they were spending tons of money on might also be used
for purposes other than the CD-ROM project at hand. This is how we built
our reputation as leaders and experts: by offering free advice on the
whole picture, not just the narrower CD-ROM function. We have worked
on all types of data while getting feedback directly from the customer,
unlike under a contract, which usually requires a contracting officer to
act as an intermediary. This direct communication allows immediate
resolution of differences of ideas, saving time and money while limiting
mistakes. We have built our reputation with the realization that our
customers come to us because they want to. We have always given free
advice, as we are keenly aware that a more informed customer will
make a more informed, better decision. If our services interest you,
you may contact me at 202-512-0675. The Database Retrieval/Distribution
Development Section stands ready to serve you!