[Taxacom] Language tags for scientific names

Gregor Hagedorn g.m.hagedorn at gmail.com
Thu Jun 26 07:33:19 CDT 2008

In the Key to Nature EU project, and when using the TDWG-BIS SDD
standard, we need to mark certain names as being scientific
names of organisms. Scientific names not only occur as labels of an
object, but also in comments and other free-form text. Most projects
devise special markup for these, but we believe that many more
purposes can be served (including when using scientific names in
DublinCore based metadata schema) if a language tag were available. In
practice "la" for Latin is frequently misused for this purpose, but a
more precise and correct tag seems desirable. The problem with "la"
is that real Latin names do exist (monomials), but they are governed by
Latin grammar as opposed to the rules for scientific names. This could
lead to confusion, at least when dealing with historic literature.
Furthermore, "la" would make it impossible to mark up a scientific
name within a Latin description of a new species (as required by

By definition, scientific names are language neutral, i.e. they have
the same form in Chinese, German or Russian, and always use the
Latin script. ISO 639-2
(http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt) contains
four special codes:
    * mis, for "miscellaneous languages";
    * mul, for "multiple languages";
    * und, for "undetermined language";
    * zxx, for "no linguistic content".

The IETF language tag specification
http://tools.ietf.org/rfc/bcp/bcp47.txt furthermore supports
experimental tags starting with "x-".

In general, it seems desirable to have a generic form as well as forms
specific to the codes of nomenclature (BC, ICVCN, ICBN, ICNCP, ICZN).
Neither the specific nor the generic forms may already be present in the
list of country codes (ISO 3166-1 alpha-2 or alpha-3), which may be used
to denote the culture of a language (as in en-GB, en-US, en-NZ etc.).
This condition is currently met for BC, the only abbreviation of a
nomenclatural code with 2 or 3 letters.
The name for the generic form is perhaps the most difficult to agree on;
my proposal would be to use TAX (for taxonomic community), but I welcome
other proposals.
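
As a rough illustration of the collision constraint above, the following
Python sketch picks out the proposed codes that are short enough to
clash with an ISO 3166-1 country code. The names needs_collision_check,
collides and ISO_3166_EXCERPT are hypothetical, and the country-code set
is a tiny excerpt for demonstration only; a real check would have to load
the complete ISO 3166-1 list.

```python
# Nomenclatural code abbreviations and the generic form proposed in this post.
NOMENCLATURE_CODES = ["BC", "ICVCN", "ICBN", "ICNCP", "ICZN"]
GENERIC_CODE = "TAX"

# Illustrative excerpt only, NOT the full ISO 3166-1 alpha-2/alpha-3 list.
ISO_3166_EXCERPT = {"GB", "US", "NZ", "DE", "GBR", "USA", "NZL", "DEU"}


def needs_collision_check(code):
    """ISO 3166-1 alpha codes have 2 or 3 letters, so only codes of
    that length can possibly collide with a country code."""
    return code.isalpha() and len(code) in (2, 3)


def collides(code, assigned=ISO_3166_EXCERPT):
    """Return True if the code is in the given set of country codes."""
    return code.upper() in assigned


# Only the short codes need to be looked up at all.
to_check = [c for c in NOMENCLATURE_CODES + [GENERIC_CODE]
            if needs_collision_check(c)]

for code in to_check:
    print(code, "collides with excerpt:", collides(code))
```

Only BC and TAX are short enough to need a lookup; the longer
abbreviations cannot coincide with a 2- or 3-letter country code.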

Available options for denoting scientific names using standard
xml:lang or html:lang attributes are thus:

* Register a new basic language code like sc / stn (scientific taxon name)
** Such a proposal was made in 2003 by Andy Mabbett
   (see http://www.alvestrand.no/pipermail/ietf-languages/2003-February/000576.html)
   and rejected for reasons that do not convince me.
** Example name codes: sc-TAX, sc-ICVCN, sc-BC, sc-ICZN

* Use zxx- for "non-language dependent"
** Example name codes: zxx-TAX, zxx-ICVCN, zxx-BC, zxx-ICZN

* Use x- for "experimental or extension range" (IETF only)
** Example name codes: x-TAX, x-ICVCN, x-BC, x-ICZN
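
To make the options more concrete, here is a minimal Python sketch of
how a processor might pick scientific names out of free-form text marked
up with xml:lang. It assumes the zxx- variant; the element names
(description, span), the sample sentence and the function
scientific_names are made up for illustration, and none of the tags
shown is registered anywhere.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical markup: element names and tag values are illustrative only.
SAMPLE = (
    '<description xml:lang="en">The wolf, '
    '<span xml:lang="zxx-ICZN">Canis lupus</span>, and the daisy, '
    '<span xml:lang="zxx-ICBN">Bellis perennis</span>.</description>'
)


def scientific_names(xml_text, primary="zxx"):
    """Return (tag, text) pairs for elements whose own xml:lang value
    starts with the given primary subtag.

    Inheritance of xml:lang from parent elements is deliberately
    ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        lang = elem.get(XML_LANG, "")
        if lang.split("-")[0].lower() == primary:
            found.append((lang, elem.text or ""))
    return found


for tag, name in scientific_names(SAMPLE):
    print(tag, name)
```

A processor that does not know the ICZN or ICBN part can still match on
the leading zxx alone; the same function would work for the other two
options by passing primary="sc" or primary="x".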

My preference would be to use the zxx- range, because it probably gives
processors that do not know the specific codes the best hint on how to
handle this information (i.e. that it would be appropriate in any linguistic

I look forward to a good discussion, including a pointer to where
someone else has already solved the problem and I have not found it.


Gregor Hagedorn (G.M.Hagedorn at gmail.com)
Institute for Epidemiology and Pathogen Diagnostics
Julius Kühn-Institute, Federal Research Center for Cultivated Plants
Königin-Luise-Str. 19 Tel: +49-30-8304-2220
14195 Berlin, Germany Fax: +49-30-8304-2203

The research sector of the Federal Ministry of Food, Agriculture
and Consumer Protection (BMELV) was restructured on 1 Jan. 2008.
The Federal Biological Research Centre for Agriculture and Forestry (BBA)
was merged with other institutions into the "Julius Kühn-Institute,
Federal Research Center for Cultivated Plants".
