Digital documents

Thomas E. Yancey tey7004 at GEOPSUN.TAMU.EDU
Tue Feb 4 15:24:07 CST 1997

The following is a statement about the current status of digital documents and their
storage and retrieval that will be of interest to many on this list. It is from
Pat LaFollette and appeared on the Mollusca listserv. With his permission, I
post it here, because it affects all of us and is quite informative.

Tom Yancey

> There are three separate issues involved in the archiving of digital
> documents. The first is captured in this quote from Steve Long:
> "About 2 years ago I threw away over 100 eight inch floppy disks because
> I could no longer find machines to read them.  Fortunately I was able to
> convert most of the important data to 5.25 inch floppy disks.  Now, my
> computer has 5.25 and 3.5 inch disks but most users are replacing the
> 5.25 inch with CD-ROM drives and don't have the space in their PC to
> include all three.  In another couple of years, I will have to convert
> to 3.5 inch or CD-ROM for my information.  After that, who knows but you
> can almost guarantee that the storage medium will change again."
> The physical media change as storage technology advances.  The
> marketplace requires that there be one or two "universal" formats at any
> particular time so that electronic products can be distributed.  At
> present these are the 3.5 inch floppy disk and CD-ROM.  The new DVD
> format is physically the same shape as CD-ROM and DVD readers will be
> "backwards compatible," -- able to read CD-ROMs, at least for a while.
> But after DVD has reached its planned maximum capacity of about 15Gb,
> who can guess what will come next?
> The second issue is made clear by William Schleihauf.
> "...The various media being talked about- tape in particular - has a
> lifespan of only a few years (and that's not counting the operators
> playing frisbee with the tapes on the night shift!).  Companies _now_
> are discovering that some tapes created only a few years ago are coming
> up with i/o errors, and thus the data is lost.  The newer cassettes are
> better, but again, the half-life is measured in years, not decades.
> CDROM- guess what- _maybe_ a century, on average, or so the forecasting
> is."
> The currently available digital media are not of "archival" quality,
> unlike (acid free) paper.  Perhaps it's just as well that the physical
> format of the media keeps changing. It forces people to copy their data
> to new disks every now and then, while the old ones are still readable!
> I've read lengthy discussions of the archival properties of CD-ROM
> disks. Estimates for some CDR (CD-Recordable) media exceed 200 years,
> twice as long as commercially pressed CDs.  But the glass masters from
> which commercial CDs are pressed might last millennia.  On the other
> hand, how long are CD readers likely to be around?  The first generation
> of DVD readers will be backwards compatible, but I doubt CD technology
> in its present form will last as long as the phonograph. (My LPs, and
> even some 45s and 78s, are still in playable shape, but my turntable
> died years ago).
> But all that kind of misses the point.  One of the tasks performed by
> traditional librarians and archivists is to protect paper from the
> ravages of time, the elements, insects, fungus, fire, flood, and
> undergraduates. Paper can last for hundreds of years, but only if it is taken
> care of. Paper can also turn to dust in days or weeks.
> The analogous task for electronic librarians will be to protect their
> bits, independent of storage method or medium, by periodically
> transferring them all to whatever appears to be the most secure and
> accessible storage technology of the time, and by distributing copies of
> them to as many other widely dispersed locations as possible. But there
> does not yet seem to be an established tradition of digital
> librarianship to shoulder this responsibility and pass it on from one
> generation to the next.  It's very difficult to establish traditions
> and a commitment to the long term when the technology is in such a state
> of flux.
> A major advantage of the digital medium is that so long as the physical
> medium is readable, copies made from it will be identical. There is no
> degradation from generation to generation as there is with analog
> reproduction processes such as microfilm and photocopying.  Nothing is
> fugitive.  And unlike books, which are produced in finite (often rather
> small) numbers that decrease over time, there never need be such a thing
> as a "rare" digital document, so long as it is periodically copied and
> made available.  Digital media may not hold up as well as paper to
> adversity and neglect, but their content can be much more widely
> distributed.  Local disasters (wars, fires, floods, hackers, budget
> cuts) would have a much less lasting impact on a properly managed
> digital archive than on a conventional library.  As soon as the event is
> past and the equipment is replaced, identical copies of the data can be
> restored from other unaffected archives.
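The claim that digital copies do not degrade from generation to generation can be checked mechanically. The following is a minimal sketch, not anything from the original post: it copies a file through three generations and compares SHA-256 digests. The file names, the sample contents, and the helper names `sha256_of`, `copy_and_verify`, and `demo` are all hypothetical, chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file, then report whether the two digests match."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


def demo():
    """Copy a sample file through three generations and compare first to last."""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical sample data standing in for an archived document.
        original = Path(workdir) / "generation0.dat"
        original.write_bytes(b"Astraea undosa\n" * 1000)
        previous = original
        for generation in range(1, 4):
            current = Path(workdir) / f"generation{generation}.dat"
            if not copy_and_verify(previous, current):
                return False
            previous = current
        # The third-generation copy should match the original.
        return sha256_of(original) == sha256_of(previous)


if __name__ == "__main__":
    print("copies identical:", demo())
```

A matching digest is strong evidence, rather than a formal proof, that the copy is bit-for-bit identical. An archive that stores the digests alongside the files could rerun the same check each time the data are moved to a new medium.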
> The third and most intractable issue affecting computer documents is, as
> William Schleihauf pointed out, the coding scheme in which the data are
> recorded.
> "_Everything_ stored in a computer is stored with a specific coding
> scheme.  You need to have the "magic decoder ring" to get it all back.
> If you create a file/document with Word Perfect v 6 today, there's no
> guarantee that you'll be able to read it 10 years from now..."
> The real problem here is the use of non-standard "proprietary" document
> and database formats by DBMS, word processors, page layout programs, and
> typesetting systems.  They tend (often deliberately) to be mutually
> incompatible as well as changing over time.
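The "magic decoder ring" point can be seen even at the level of single characters, well below the level of word processor formats. This is a small sketch under stated assumptions: the four sample bytes and the choice of the two coding schemes (Latin-1 and the old IBM PC code page 437) are illustrative and do not come from the original post.

```python
# Four bytes as they might sit on a disk; the sample itself is hypothetical.
RAW = b"caf\xe9"


def decodings(raw, schemes=("latin-1", "cp437")):
    """Decode the same bytes under each named coding scheme."""
    return {scheme: raw.decode(scheme) for scheme in schemes}


if __name__ == "__main__":
    for scheme, text in decodings(RAW).items():
        # The bytes never change; only the interpretation does.
        print(f"{scheme:8} -> {text!r}")
```

The stored bytes are the same in both cases, but the last character reads differently depending on which scheme the reader assumes. Proprietary document formats pose the same problem on a larger scale, with the added difficulty that the scheme may be unpublished.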
> A solution to this problem was agreed upon eleven years ago by the
> International Organization for Standardization, but is only gradually gaining wide
> acceptance, primarily in the publishing industry, government, large
> corporations, and the European Common Market.  The solution is SGML
> (Standard Generalized Markup Language), ISO 8879 (1986) which defines a
> single international standard for coding documents that is hardware,
> operating system, and software independent. Electronic documents in SGML
> format avoid the problems of proprietary formats and obsolescence, but
> can be converted to a proprietary format if this is necessary to perform
> a particular task.  Most commercial typesetting and CD-ROM display
> software now accept SGML documents directly, without conversion.
> Actually, WWW browsers are quasi-SGML viewers in that HTML (HyperText
> Markup Language) files are SGML documents.
> The new buzz word in publishing circles is "repurposing" documents.
> That is, taking a computer file (the manuscript for a reference book,
> for example) and using it to produce a CD-ROM or online database.  If
> the text is in SGML format, it can be used for all three purposes
> without modification.  What allows this to be done is that in SGML, it
> is the content, rather than the appearance, that is marked.  In many
> other text markup systems, one would say <start italic>Astraea
> undosa<end italic> to put the name in italic.  In SGML you would say
> <start genus>Astraea<end genus> <start species>undosa<end species>.  The
> rule "print genus in italic" (or red or 14pt gothic) is defined
> separately from the document, and can be changed without changing the
> document itself.
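The separation of content markup from presentation rules described above can be sketched in a few lines. This is an illustration only, with several assumptions: it uses Python's XML parser on a small well-formed fragment rather than a real SGML system, and the element names (`name`, `genus`, `species`), the two rule tables, and the `render` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-tagged source: the tags say what the text is,
# not how it should look.
SOURCE = "<name><genus>Astraea</genus> <species>undosa</species></name>"

# Presentation rules are kept outside the document and can be swapped freely.
PRINT_RULES = {"genus": ("<i>", "</i>"), "species": ("<i>", "</i>")}
PLAIN_RULES = {"genus": ("_", "_"), "species": ("_", "_")}


def render(element, rules):
    """Render an element and its children, wrapping each one per the rules."""
    before, after = rules.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, rules))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)


def render_source(source, rules):
    """Parse the tagged source and render it with the given rule table."""
    return render(ET.fromstring(source), rules)


if __name__ == "__main__":
    print(render_source(SOURCE, PRINT_RULES))
    print(render_source(SOURCE, PLAIN_RULES))
```

The same source string yields italic markup under one rule table and underscore emphasis under the other. Changing "print genus in italic" to some other convention means editing a rule table, not the document.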
> There are a variety of SGML editors that allow documents to be created
> and maintained directly.  Unfortunately, it's still pretty much a "big
> boy" technology, the software expensive and clunky, and the conversion
> of existing electronic documents labor intensive.  But this situation
> should improve in time, as more companies enter the arena.
> The bottom line to all this is that digital documents are, and will
> continue to become, an ever more useful supplement to the published
> literature, and an inexpensive method of distributing large volumes of
> data, but are not likely to take the place of paper any time soon.
> Given that digital storage methods will continue to evolve for the
> foreseeable future, I would want to witness digital librarians staying
> ahead of the technological wave, maintaining the security and utility of
> their holdings, for a generation or three before I will have as much
> confidence in them and their holdings as I do in paper, conventional
> libraries, and old fashioned librarians.
> On a related subject, does anyone out there have a copy of Sherborn's
> Index Animalium that I could borrow for a few weeks?  (Or buy?)  My plan
> is to scan and OCR it, convert the text to SGML, integrate the parts,
> supplements, and bibliographies, add hyperlinks between the index and
> the bibliography, and put the result on CD-ROM. (I'll give you a copy in
> return.)
> Pat
> Patrick I. LaFollette
> Electronic Publishing
> Auto-Graphics, Inc., 3201 Temple Ave., Pomona, California 91769-3200
> pil at
> phone: (909)595-7004  ext. 387 or (800)776-6939 ext. 387
> fax: (909)595-5190
> res: (909)622-4943

Thomas E. Yancey                                            _______
Department of Geology and Geophysics                       |   |   |
Texas A&M University                                        _  |  _
College Station, TX 77843-3115                             |-| | |||
Voice: 409 845 0643    Fax: 409 845 6162
email: tyancey at

More information about the Taxacom mailing list