[Taxacom] BioNames

Erik Rijkers er at xs4all.nl
Sat Jun 8 13:54:11 CDT 2013

On Mon, June 3, 2013 15:24, Roderic Page wrote:
> Hi Erik,
> Sorry for the delay in replying. I need to do a bit of work on this, but for now there are a couple of ways to get the data.
> 1. There is a crude Darwin Core Archive dump at http://bionames.org/data/darwincore/bionames.zip (warning, ~144 Mb). This
> has names, references, and a map between names and references.

Thank you, that is really nice to play with.

But it also raises some questions:

 - The zipfile contains three .tsv files:

$ unzip -l /tmp/bionames.zip
Archive:  /tmp/bionames.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
154968043  05-30-2013 21:15   taxa.tsv
717387476  05-30-2013 21:16   media.tsv
129913824  05-30-2013 21:10   references.tsv
     3255  05-29-2013 19:16   meta.xml
---------                     -------
1002272598                     4 files

Two of the .tsv files (taxa.tsv and media.tsv) end with a line that looks like an error message:

$ unzip -p /tmp/bionames.zip taxa.tsv | tail -1
failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_taxa.php:50]: SELECT * FROM names WHERE id=  LIMIT 1

$ unzip -p /tmp/bionames.zip media.tsv | tail -1
failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_media.php:83]: SELECT * FROM names WHERE id=  LIMIT 1

Now, it's easy enough to grep out those error lines, but of course one wonders whether those files are truncated and
would have been larger had the error not occurred.
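
A minimal sketch of that filtering step, assuming the error lines always start with "failed [" as in the two examples
above. The function names are made up for illustration, and other kinds of error lines would not be caught:

```shell
# Drop lines that look like the PHP error message seen at the end of
# taxa.tsv and media.tsv (assumption: they always begin with "failed [").
strip_failed_lines() {
  grep -v '^failed \['
}

# Count how many such lines a stream contains, to see whether the error
# occurs only once at the end or more often.
count_failed_lines() {
  # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
  grep -c '^failed \[' || true
}
```

Both read stdin, so they can sit behind "unzip -p /tmp/bionames.zip taxa.tsv |"; redirect the output of
strip_failed_lines to whatever file the import should read.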

The row counts of the imported tables are now:
        taxa   1,520,483
  references     393,552
       media   1,520,487
which correspond to the number of lines in the .tsv files.  Are these correct?
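
Line counts alone cannot show truncation. One rough extra check, assuming every row should have the same number of
tab-separated fields as the first line of the file (not verified for these files; check_field_counts is a made-up
name), is to list the rows that deviate:

```shell
# Print every line whose tab-separated field count differs from that of
# the first line. A row cut off mid-write would typically show up here.
check_field_counts() {
  awk -F'\t' '
    NR == 1 { n = NF; next }
    NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n }
  '
}
```

Like the unzip commands above, it reads stdin: "unzip -p /tmp/bionames.zip media.tsv | check_field_counts". The
trailing "failed [...]" line will be reported as well unless it is filtered out first.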

 - Is it worthwhile to check http://bionames.org/data/darwincore/bionames.zip regularly?  Will it get updated, and if
so how often (weekly, monthly, yearly)?

Thank you!

Erik Rijkers

I'll attach a (pretty basic) bash script for loading the .tsv files into postgres.

