[Taxacom] Consider using the draft "species" microformat...
dremsen at gbif.org
Mon Nov 5 03:45:21 CST 2007
It seems to me that the use of microformats in the context of this
discussion is not incompatible with taxon name recognition
algorithms like TaxonGrab, FAT, or TaxonFinder (formerly FindIT).
They refer to very different concepts. One is the choice of
an applied markup scheme for tagging taxonomic content and the other
is a text processing tool for identifying untagged taxon names. If
taxon names were explicitly tagged as taxon names you wouldn't need a
natural language processing tool to identify them, would you? To
address some of the questions raised below, there are disambiguation
mechanisms one can employ to reconcile false positives such as your
Baracus example using word sense disambiguation methods to infer the
context of the usage. These tend to be "expensive" in terms of
processing time, however, and it would certainly make life
easier if the original author went to the trouble to use the semantic
equivalent of a highlighter to tag their taxon concepts (not just the
names). But for most existing (particularly print) content this isn't
For species combinations the degree of homography drops dramatically,
particularly for actual (not derived) species names. I'm not clear
on how tagging a canonical homonym using microformats disambiguates
the usage but I assume it has been addressed.
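
To make the distinction between the two approaches concrete, here is a
minimal Python sketch, not a description of how TaxonFinder or any
microformat parser actually works. The genus list, the sample page and the
function names are all hypothetical, and the class name "binomial" is
assumed from the draft "species" microformat. The untagged approach does a
naive dictionary lookup and so flags the A-Team's Baracus; the tagged
approach only reports what the author explicitly marked up.

```python
import re
from html.parser import HTMLParser

# Hypothetical genus list standing in for a real name index.
KNOWN_GENERA = {"Baracus", "Turdus"}


def find_untagged(text):
    """Naive lookup: flag every capitalized word that is in the genus list."""
    words = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [w for w in words if w in KNOWN_GENERA]


class TaggedNameParser(HTMLParser):
    """Collect the text of elements whose class list includes 'binomial'.

    The class name is an assumption based on the draft microformat.
    Void elements such as <br> inside a tagged span are not handled.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside a tagged element
        self.names = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "binomial" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.names.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def find_tagged(html):
    parser = TaggedNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Illustrative page: one false positive, one explicitly tagged name.
page = (
    "<p>B. A. Baracus, a character on the television series The A-Team.</p>"
    '<p>A note on the skipper <span class="binomial">Baracus vittatus</span>.</p>'
)

print(find_untagged(page))  # both occurrences of "Baracus" are flagged
print(find_tagged(page))    # only the marked-up name is returned
```

The lookup cannot tell the two uses of "Baracus" apart without further
(and costlier) context analysis, whereas the tagged version pushes that
decision back to whoever wrote the page.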
I would think the more relevant issue would be a discussion of the
costs and benefits of using microformats vs a TDWG schema, or whether
these are even mutually exclusive in the first place, with use
cases of where one might be more effective, efficient or practical
than the other. Ultimately I don't believe people will utilize any
of these if the only incentive is that they get to contribute to the
development and solidification of an emerging standard. Nor will
they do it if the effort only serves the interests of someone else
even if that effort is minimal. They will do it when it's worth
their while. Incentivizing these activities is what will make them
On Nov 4, 2007, at 1:19 AM, Andy Mabbett wrote:
> In message <001801c81e67$a3ada4b0$eb08ee10$@ca>, "Shorthouse, David"
> <dps1 at ualberta.ca> writes
>>> The microformat aims to make taxonomic names within the content of
>>> published web pages discoverable to parsing tools.
>> Rather than convincing a developer or web page author to mark-up HTML
>> that differentiates taxonomic names from the sea of other text in the
>> hope that there might someday be a parser,
> You appear to be under a misapprehension. There is already a parser.
> Furthermore, by the end of the year, that parsing ability should be
> built into the second most popular browser on the Internet, allowing
> people to more simply write their own parser, in the same way that the
> tens of thousands of extensions to Firefox currently available have
>> I'd much rather see an organization like uBio first index all names
>> and expose these as LSIDs. With names parsed and exposed as LSIDs,
>> the door is open to cross-domain querying and
> I don't doubt that that's a useful thing to happen; but I don't see it
> as a binary choice. Nor will it resolve many of the issues addressed by
> the species microformat.
>> uBio's FindIT already does parse published web pages and OCR'd PDFs
>> for taxonomic names and does so quite well without the need for
>> mark-up (e.g. http://www.biodiversitylibrary.org/).
> And then what does it do with them? Does it address the use-cases
> outlined for the species microformat? Does it cater for vernacular
> names? What happens when that service hits a false positive, such as
> "Baracus" in:
> B. A. Baracus, a character on the television series The A-Team
> <http://en.wikipedia.org/wiki/Ba> ?
> The 'species' microformat directly addresses and can prevent such
> Unless it has changed since we last discussed it, FindIT does not act
> upon the page currently viewed in the browser (which is what happens
> with the microformat), but requires the page's URL to be manually
> submitted to it. Nor does it allow the user to select a service or
> services to which the parsed name is submitted.
>> Microformats have uses elsewhere for previously unstructured content
>> and I think it probably would be quite useful for common names on web
>> pages. But, we already have some structure with taxonomic names, an
>> ever-growing index of all names, and a parsing algorithm in production
>> (parsing PDFs is more important at this stage), so let's use these to
>> our advantage.
> Again, why is this a binary choice?
>> Microformats would be so much more attractive if we could point users
>> to existing parsers.
> I already did that, just today.
>> It would also help to give providers a clear reason for
>> expending effort marking up their content. At this stage of
>> development, a content provider doesn't get anything in return for
>> marking up with a standard in flux.
> They get to contribute to the development and solidification of an
> emerging standard. I realise that not everyone may wish, or be able, to
> devote resources to this, but the invitation is made; and I've made it
> without suggesting - much less on false premise - that people turn
> from any other initiative.
> Your post seems to be a rehash of many issues I thought we'd resolved
> nearly a year ago:
> Andy Mabbett
> * Are you using Microformats, yet: <http://microformats.org/> ?
> Taxacom mailing list
> Taxacom at mailman.nhm.ku.edu