[Taxacom] BioNames (slurp)

Erik Rijkers er at xs4all.nl
Sat Jun 8 14:03:42 CDT 2013


On Mon, June 3, 2013 15:24, Roderic Page wrote:
> Hi Erik,
>
> Sorry for the delay in replying. I need to do a bit of work on this, but for now there are a couple of ways to get the data.
>
> 1. There is a crude Darwin Core Archive dump at http://bionames.org/data/darwincore/bionames.zip (warning, ~144 Mb). This
> has names, references, and a map between names and references.
>

Thank you, that is really nice to play with.

The data was easy to load locally, and no doubt I'll end up using the JSON API too. Much appreciated.


It also raises some questions:

 - The zipfile contains three .tsv files (plus meta.xml):

$ unzip -l /tmp/bionames.zip
Archive:  /tmp/bionames.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
154968043  05-30-2013 21:15   taxa.tsv
717387476  05-30-2013 21:16   media.tsv
129913824  05-30-2013 21:10   references.tsv
     3255  05-29-2013 19:16   meta.xml
---------                     -------
1002272598                     4 files

Two of the .tsv files (taxa.tsv and media.tsv) end with a line that looks like an error message:

$ unzip -p /tmp/bionames.zip taxa.tsv | tail -1
failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_taxa.php:50]: SELECT * FROM names WHERE id=  LIMIT 1

$ unzip -p /tmp/bionames.zip media.tsv | tail -1
failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_media.php:83]: SELECT * FROM names WHERE id=  LIMIT 1

Now, it's easy enough to grep out those lines, but of course one wonders whether those files are truncated and would
have been larger had the error not occurred.
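
Stripping them out before loading is a one-liner anyway (this simply drops any line starting with "failed", which is also
what the loading script in the PS below does):

$ unzip -p /tmp/bionames.zip taxa.tsv  | grep -v '^failed' > taxa.tsv
$ unzip -p /tmp/bionames.zip media.tsv | grep -v '^failed' > media.tsv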

The imported tables' row counts are now:
  taxa        1,520,483
  references    393,552
  media       1,520,487
which corresponds to the number of lines in the .tsv files.  Are these correct?
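
For what it's worth, one way to get all three counts in a single query, using the table names the script in the PS below
creates (taxa, refs and media in schema public):

$ psql -qtAX <<'SQL'
select 'taxa'  as tbl, count(*) from public.taxa
union all select 'refs' , count(*) from public.refs
union all select 'media', count(*) from public.media;
SQL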


 - Is it worthwhile to check http://bionames.org/data/darwincore/bionames.zip regularly?  (Will it get updated, and if so
how often: weekly, monthly, yearly?)
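
If it does get refreshed, a conditional fetch avoids re-downloading the full ~144 Mb when nothing has changed (this assumes
the server sends a usable Last-Modified header, which I haven't checked for bionames.org):

$ curl -z /tmp/bionames.zip -o /tmp/bionames.zip \
       http://bionames.org/data/darwincore/bionames.zip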


Thank you!

Erik Rijkers


PS
and TWIMC, here is a (pretty basic) bash/perl script for loading the .tsv files into postgres.


slurp_bionames.sh:

--------------8<----------------------------
#!/bin/bash

zipfile=/tmp/bionames.zip

schema=public

time for x in  "4  taxa.tsv      taxa"  \
              "21 references.tsv refs"  \
              "12 media.tsv      media" ;
do
  set -- $x; ncols=$1; slurp_file=$2; t=$schema.$3;
  echo " [$ncols];  [$slurp_file];  [$t]"

  # create table from header line:
  unzip -p $zipfile $slurp_file | head -n 1 | perl -ne '
    chomp; my @arr = map { $_ .= " text" } split(/\t/);
    print "drop table if exists '$t';\ncreate table '$t'(\n  ",
          lc(join("\n, ", @arr)), "\n);\n";' | psql -qX;

  # then read file into that table:
  unzip -p $zipfile $slurp_file | perl -MEncode -ne '
    next if (/^failed/);
    chomp;
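    # COPY ... CSV treats a double quote as the quoting character; replace any
    # embedded quotes with a placeholder so they cannot confuse the load: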
    s{\"}{DOUBLEQUOTE}g;
    my @arr = split(/\t/, $_ );
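    # split() drops trailing empty fields; pad the row back out to the full column count: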
    while (scalar(@arr) < '$ncols' ) { push(@arr, ""); }
    print encode("UTF8", (join("\t", @arr)."\n"), Encode::FB_CROAK);
  ' | psql -qXc "copy $t from stdin csv header delimiter E'\t';"

  # quick check: the file's line count and the table's row count should be (almost) the same
  echo -n "file lines: "; unzip -p $zipfile $slurp_file  | grep -c '^'
  echo -n "table rows: "; echo "select count(*) from $t" | psql -qtAX
done
--------------8<----------------------------
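
To run it, something like this should do (assuming bash, and a psql that connects to the right database without extra
options):

$ bash slurp_bionames.sh

If I read the script correctly, "file lines" should exceed "table rows" by 2 for taxa and media (the header line plus the
trailing 'failed' line are not loaded) and by 1 for references (header only).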





