[Taxacom] Wanted: a page on marking up

P. Bryan Heidorn pheidorn at uiuc.edu
Wed May 16 09:32:53 CDT 2007


Xiaoya Tang successfully defended her dissertation on a related topic  
last week and will be polishing up her dissertation document over the  
next couple of months. I am copying her here.

Xiaoya takes a different approach from Hong. Hong can identify the  
topics of sentences and clauses and mark them inline, that is inside  
the original text itself. MARTT can identify many of the parts of  
plants and their properties.  Xiaoya aims at information extraction  
of a smaller set of features, creating a separate document, independent  
of the original, that contains the character and state information. It pulls out  
the shapes of leaves, the lengths and widths, margins and a few other  
features. Xiaoya's method has less breadth than Hong's method but  
more depth. Both use supervised learning methods but very different  
algorithms. As a consequence, I would suspect that Xiaoya's method  
might take weeks or months to adapt to grasses while Hong could move  
much more quickly but at a shallower level.

Which method you use really depends on your goals. If you want to  
make multiple-entry keys, Xiaoya's method is best. Her dissertation  
proved that searching a flora using the extracted features + full text  
is better than full-text search alone. If you want to mark up the  
original text, then Hong's method is best.

-- Bryan
-- 
--------------------------------------------------------------------
   P. Bryan Heidorn
   Graduate School of Library and Information Science
   University of Illinois at Urbana-Champaign
   pheidorn at uiuc.edu
   (V)217/ 244-7792     (F)217/ 244-3302
   http://www.uiuc.edu/goto/heidorn
   Online Calendar: http://www.uiuc.edu/goto/heidorncalendar


On May 16, 2007, at 9:13 AM, Hong Cui wrote:

> Dear Mary,
>
> Realizing the variations in formatting and information content  
> among different collections of taxonomic descriptions, we try to  
> design an approach that can mark up the semantics of the  
> information from different taxonomic collections, yet does not require  
> much knowledge-engineering effort from taxonomists. The idea  
> is to let the computer learn some knowledge about the taxon  
> domain from some marked examples and then use the learned knowledge  
> to mark new descriptions. In the first step we used our system  
> MARTT to mark FNA, FoC, and FNCT to the clause level. The marked  
> collections can be searched at  
> http://hong.fims.uwo.ca/gsdl/cgi-bin/library.exe.  
> This should give you an idea about the markup  
> granularity MARTT currently can produce with very minimal human  
> intervention. We are currently working on character-level markup.
>
> Currently the system has learned a fair amount about a good  
> number of families, but we have not processed any grass collection  
> yet. One good collection like yours would be a great source for  
> MARTT to learn about grasses; it should then be able to mark up other  
> grass collections with ease.
>
> Depending on your needs, another system we recently developed  
> should be able to convert your descriptions into an XML format with  
> clause-level markup in a "quick-and-dirty" way in hours/days.
>
> I'd be glad to continue this discussion with you.
>
> Hong
>
> ----- Original Message -----
> From: "Weitzman, Anna" <WEITZMAN at si.edu>
> Date: Tuesday, May 15, 2007 9:32 am
> Subject: RE: [Taxacom] Wanted: a page on marking up
> To: Donat Agosti <agosti at amnh.org>, Mary Barkworth  
> <Mary at biology.usu.edu>, Taxacom at mailman.nhm.ku.edu
> Cc: Terry Catapano <thc4ster at gmail.com>, Christiana Klingenberg  
> <christiana at ameisen-net.de>, Guido Sautter <sautter at ira.uka.de>,  
> Hong Cui <hcui7 at uwo.ca>, "P. Bryan Heidorn" <pheidorn at uiuc.edu>
>
> > Dear Mary,
> >
> > The way in which we mark up taxonomic work is being addressed by
> > several groups at the moment, and is subject to a TDWG group's
> > work.  Currently we do not have an agreed standard, but
> > some generalities and possibilities are beginning to
> > appear.
> >
> > Two key questions are: "what do we want to put into XML?" and
> > "how detailed do we need the atomisation of the content of the
> > paper?"
> >
> > We can take a broad-brush approach, and identify the different
> > components of a paper.  This is much the way that most
> > schemas have developed, and is the overall approach of
> > taxonX.  This clearly is a baseline, upon which we might
> > build further as needed.
> >
> > Within that, we might want to focus on characters, in order to
> > be able to extract them and use them in conjunction with the
> > species / taxa / concepts to build, for example, keys,
> > descriptions coming from several sources, or field guides. Some
> > examples of automating this have been done at the University of
> > Illinois with Bryan Heidorn, especially by Hong Cui, one of his
> > former students.
> >
> > We might also wish to focus on the nomenclatural, taxonomic,
> > citation and specimen side of the data.  This is the
> > approach we have taken with taXMLit
> > (http://www.sil.si.edu/digitalcollections/bca/status.cfm).  The
> > implementation of this will include interoperability with TDWG
> > standards for names and specimen data, and allow simultaneous
> > access to literature elements and to specimen data, catalogues and
> > other distributed resources.  Currently we have marked up a set of
> > taxonomic publications, including a volume of the Biologia
> > Centrali-Americana, and are developing the implementation, a
> > preliminary version of which we plan to show at the TDWG meeting in
> > Bratislava.
> >
> > We are also working on instructions for using taXMLit to mark up
> > taxonomic documents.  That said, the long-term solution is
> > to develop tools (which we and others are working on) to parse
> > these documents into such a format using computer capabilities,
> > including simple logic but also artificial intelligence
> > ('machine learning').  The latter has been used in other
> > areas, including molecular bioinformatics and there are
> > developments that we should be able to use to our advantage in
> > taxonomy.
> > Anna & Chris
> >
> > Anna L. Weitzman, PhD
> > Informatics, Botany and Biodiversity Research
> > National Museum of Natural History
> > Smithsonian Institution
> >
> > 202.633.0846
> > weitzman at si.edu
> >
> > Christopher H.C. Lyal, PhD
> > Beetle Diversity and Evolution Programme,
> > Department of Entomology,
> > The Natural History Museum,
> > Cromwell Road,
> > London SW7 5BD
> > UK
> > tel: +44 (0) 207 942 5113
> > fax: +44 (0) 207 942 5661
> > e-mail c.lyal at nhm.ac.uk
> >
> > ________________________________
> >
> > From: taxacom-bounces at mailman.nhm.ku.edu on behalf of Donat Agosti
> > Sent: Tue 15-May-07 8:30 AM
> > To: 'Mary Barkworth'; Taxacom at mailman.nhm.ku.edu
> > Cc: 'Terry Catapano'; 'Christiana Klingenberg'; 'Guido Sautter'
> > Subject: Re: [Taxacom] Wanted: a page on marking up
> >
> >
> >
> > Dear Mary
> >
> > We developed a mark-up schema for taxonomic work (taxonx,
> > http://taxonx.org), started to mark up literature (e.g. the ant
> > literature of Madagascar,
> > http://antbase.org/databases/madagascar.htm), and plan to build a
> > dedicated treatment server so the descriptions can easily be
> > accessed. For the mark-up process we developed (and are still
> > refining) a semiautomatic program (goldenGate,
> > http://idaho.ipd.uka.de/GoldenGATE/). You can download it there
> > and also find a manual.
> >
> > Taxonx is a lightweight schema
> > (http://wiki.cs.umb.edu/twiki/bin/view/Ants/WebHome) which can be
> > integrated into publishers' schemas, and which builds upon
> > existing schemas and standards.
> >
> > If you have any questions, please contact us at any time
> >
> > Good luck and welcome to the party
> >
> > Donat
> >
> > -----Original Message-----
> > From: taxacom-bounces at mailman.nhm.ku.edu
> > [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Mary Barkworth
> > Sent: Tuesday, May 15, 2007 1:51 PM
> > To: Taxacom at mailman.nhm.ku.edu
> > Subject: [Taxacom] Wanted: a page on marking up
> >
> > This arises indirectly from the EoL discussion but it is not that.
> > There have been many posts about marking things up so they can be
> > found. Is there an abc level page on how to do this? Written for
> > someone so old she can remember electric typewriters as being new? I
> > have servers with a fair amount of taxonomic content on the Web (see
> > http://herbarium.usu.edu/webmanual/ and http://utc.usu.edu/keys/ ).
> > With Marina Olanova's help I have a translation of Tsvelev's 2006
> > global treatment of Glyceria almost ready to post.  It contains a
> > discussion, listing of species with publication information and some
> > comments, and a key, and a list of excluded species. It would be nice
> > if this could be found a year from now because it was marked up
> > correctly.
> >
> > I would be delighted to mark these up so it would be easier for
> > people to find them. I am working on putting key words in the
> > headers, but reading about the semantic Web and following these
> > discussions, I have the impression there are better ways to do this.
> > So, those of you who know - please - is there a Web page that
> > explains how in words of one syllable?  Or do I simply define my own
> > mark up language at the top of the document?
> >
> > My suspicion is that there are others like myself who would make
> > their work more accessible if they knew how - hence my public
> > decision to admit ignorance. I have already seen some of the grass
> > descriptions appear, with minor changes (rounding of limits to the
> > nearest 0.5 mm) on other pages with a somewhat different format. The
> > print publication that they come from was cited, slightly
> > inaccurately but it was a reasonable attempt.
> >
> > Mary
> >
> > _______________________________________________
> > Taxacom mailing list
> > Taxacom at mailman.nhm.ku.edu
> > http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
> >



