
TIFFs and Scanning

When you run a page of text through a scanner, the scanner takes a picture of the text, much as a copier does. The difference is that a copier puts the picture on another piece of paper, while a scanner creates a digital graphics file on the computer (typically a TIFF file, or a JPEG file for color). At this point, you do not have text; you have a picture of text.
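To make the "picture, not text" point concrete, here is a minimal sketch in Python that identifies a scanned file by its first few bytes. The function name sniff_image_format is made up for this illustration. It relies only on the standard file signatures: a TIFF begins with "II" or "MM" followed by the number 42, and a JPEG begins with the bytes FF D8.

```python
import struct


def sniff_image_format(header: bytes) -> str:
    """Guess the image format from the first few bytes of a file.

    Hypothetical helper for illustration; it checks signatures only and
    does not validate the rest of the file.
    """
    # TIFF: byte-order mark "II" (little-endian) or "MM" (big-endian),
    # followed by the number 42 as a 16-bit integer in that byte order.
    if len(header) >= 4:
        if header[:2] == b"II" and struct.unpack("<H", header[2:4])[0] == 42:
            return "TIFF"
        if header[:2] == b"MM" and struct.unpack(">H", header[2:4])[0] == 42:
            return "TIFF"
    # JPEG: the file starts with the Start Of Image marker FF D8.
    if header[:2] == b"\xff\xd8":
        return "JPEG"
    return "unknown"


print(sniff_image_format(b"II*\x00\x08\x00\x00\x00"))  # a TIFF header
print(sniff_image_format(b"\xff\xd8\xff\xe0"))          # a JPEG header
print(sniff_image_format(b"The cat sat."))              # plain text, not an image
```

To check a real file, you would read its first four bytes (for example with open(path, "rb").read(4)) and pass them to the function.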

It matters not what program you use to scan: OmniPage, PaperPort, FineReader, Kurzweil (which uses the FineReader engine), Photoshop, Kodak Imaging. All of them work the same way. They start with a graphic because that is what the hardware creates. Once you have the graphic, you can work with it in various ways depending on the program in which you open it.

The confusion arises because some programs never show you the underlying graphic file. PaperPort, for example, can be run from inside MS Word. You might think that you are scanning "directly into Word," but what you are actually doing in that case is the following:

  1. creating a TIFF/JPEG (because that's all the hardware does)
  2. using the OCR capability of the software (PaperPort) to extract text from the graphic file
  3. using Word to edit the text that the OCR program produced

You never actually see the TIFF, but it was there at the start of the process.

An OCR program literally looks at the black shapes on the page (remember, it's a picture at first) and says, "Hmmmm, this looks like the way a 'c' usually looks, this looks like an 'a,' this looks like a 't,' and there's a space after that, so it must be 'cat.'" It starts out just matching shapes.
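The shape-matching idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any of the products named here actually work: the 5x5 letter templates, the pixel-agreement score, and the names TEMPLATES, similarity, recognize, and read_word are all invented for the illustration. Real OCR engines are far more sophisticated.

```python
# Hypothetical 5x5 templates: "#" is a black pixel, "." is a white one.
TEMPLATES = {
    "c": (".###.",
          "#....",
          "#....",
          "#....",
          ".###."),
    "a": (".##..",
          "...#.",
          ".###.",
          "#..#.",
          ".####"),
    "t": ("..#..",
          "#####",
          "..#..",
          "..#..",
          "..##."),
}


def similarity(glyph, template):
    """Count the pixels on which the two bitmaps agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the letter whose template agrees with the glyph on most pixels."""
    return max(TEMPLATES, key=lambda letter: similarity(glyph, TEMPLATES[letter]))


def read_word(glyphs):
    """Recognize each glyph in turn and join the letters into a word."""
    return "".join(recognize(glyph) for glyph in glyphs)


# Three "scanned" shapes, each with one wrong pixel (a speck or a dropout).
scanned = [
    (".###.", "#....", "#...#", "#....", ".###."),  # "c" with a stray speck
    (".#...", "...#.", ".###.", "#..#.", ".####"),  # "a" with a missing pixel
    ("..#..", "#####", "..#..", "..#..", "..#.."),  # "t" with a missing pixel
]
print(read_word(scanned))  # prints: cat
```

Each scanned shape still matches its own template on 24 of 25 pixels, so the guess comes out right. With a dirtier scan, the wrong template can win, and that is where OCR errors come from.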

As another example, the Kurzweil program runs OCR in the background, and that OCR-ed (is that a word?) text is what the Kurzweil program actually reads, but what you see in the foreground is the original TIFF image. That is why, in Kurzweil 3000, what you see and what is being read can differ: any OCR errors are in the hidden text, not in the image on the screen.

Other issues to consider:

  1. different OCR programs do have different capabilities; some handle certain situations (tables or foreign languages, for example) better than others
  2. how well the OCR program can extract text is directly related to how good the original scan (i.e., the TIFF file) was
  3. no OCR program claims to be 100% accurate; you will almost always have some errors (but if someone retypes the entire thing for you, you're going to have some errors as well!)
  4. scanning is an art; in general, the more experienced the scanner operator, the better the TIFF that the OCR program has to work with
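One common way to put a number on "how accurate was the OCR" is to count the single-character edits (insertions, deletions, substitutions) needed to turn the OCR output into a known-correct version of the text. The sketch below assumes you have such a reference text to compare against. The function names and the sample sentences are invented for the illustration, and no accuracy figures for any real product are implied.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if they match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Fraction of the reference text's characters that came through correctly."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1.0 - errors / len(reference))


reference = "The cat sat on the mat."
ocr_text = "The cat sat on tbe rnat."  # "h" read as "b", "m" read as "rn"

print(edit_distance(reference, ocr_text))
print(round(character_accuracy(reference, ocr_text), 3))
```

The sample errors are typical shape confusions: an "h" read as a "b," and an "m" read as "rn." Three wrong characters out of 23 already drops this line to about 87% accuracy, which shows why a good original scan matters so much.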

The distributed scanning network (and by extension the AMX database) provides the TIFF file so that you can skip the scanning step of the process. Remember, I'm not talking about AB422 and getting text from publishers at this point. I'm talking about beginning with a hard-copy book. Starting at that point, your ONLY alternative to a scanned TIFF image is to rekey the book (i.e., type it in).

When you start with a TIFF, as opposed to scanning the text yourself, you open the TIFF file directly in your OCR program. All the OCR programs can do that. After all, a TIFF is what they start with; you are simply spared the time and effort of scanning.

With Kurzweil, starting with a TIFF means opening the TIFF file in either K1000 or the K3000 scan and read version (the K3000 with the OCR program FineReader built in). Note that the Kurzweil 3000 read-only station cannot open a TIFF--it has no built-in OCR program. A TIFF file is by definition a graphic, and it must have OCR performed on it before any screen reader can read it. Similarly, with WYNN, you would open the TIFF in the WYNN Wizard version, not the read-only version.

If you're scanning, in the beginning there was TIFF (or JPEG)--always, even if you instantly change it to something else. What you do with it after that is up to you and your students' needs.