[Taxacom] BioNames (slurp)

Roderic Page r.page at bio.gla.ac.uk
Sat Jun 8 14:45:57 CDT 2013


Hi Erik,

Oops, those errors are because the last line in the file of ids I use to generate the dump was empty, hence the error at the end of the dump. The files should be complete.
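
One way to guard against this kind of trailing empty line is to drop blank lines from the id list before it reaches the query step. The following is only a minimal sketch under assumptions: the helper name nonblank_ids and the throwaway file are hypothetical, and it is not the code that generates the dump.

```shell
#!/bin/sh
# Hypothetical helper (not part of the BioNames code): print only the
# non-blank lines of an id list, so that a trailing empty line never
# ends up in a query such as "SELECT * FROM names WHERE id=  LIMIT 1".
nonblank_ids() {
    # grep -v drops lines that are empty or contain only whitespace
    grep -v '^[[:space:]]*$' "$1"
}

# Demonstration with a throwaway file whose last line is empty
tmp=$(mktemp)
printf '101\n102\n\n' > "$tmp"
nonblank_ids "$tmp"
rm -f "$tmp"
```

The output of nonblank_ids could then be fed to whatever loop issues one query per id.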

The file isn't being updated at the moment. At some point I'll do something about that, but for now I'm focussing on adding content and improving the interface.

Regards

Rod

On 8 Jun 2013, at 20:03, Erik Rijkers wrote:

> On Mon, June 3, 2013 15:24, Roderic Page wrote:
>> Hi Erik,
>> 
>> Sorry for the delay in replying. I need to do a bit of work on this, but for now there are a couple of ways to get the data.
>> 
>> 1. There is a crude Darwin Core Archive dump at http://bionames.org/data/darwincore/bionames.zip (warning, ~144 Mb). This
>> has names, references, and a map between names and references.
>> 
> 
> Thank you, that is really nice to play with.
> 
> The data was easy to load locally, and no doubt, I'll end up using the JSON api too. Much appreciated.
> 
> 
> It also raises some questions:
> 
> - The zipfile contains three .tsv files:
> 
> $ unzip -l /tmp/bionames.zip
> Archive:  /tmp/bionames.zip
>  Length      Date    Time    Name
> ---------  ---------- -----   ----
> 154968043  05-30-2013 21:15   taxa.tsv
> 717387476  05-30-2013 21:16   media.tsv
> 129913824  05-30-2013 21:10   references.tsv
>     3255  05-29-2013 19:16   meta.xml
> ---------                     -------
> 1002272598                     4 files
> 
> Two of the .tsv files (taxa.tsv and media.tsv) end with a line that looks like an error message:
> 
> $ unzip -p /tmp/bionames.zip taxa.tsv | tail -1
> failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_taxa.php:50]: SELECT * FROM names WHERE id=  LIMIT 1
> 
> $ unzip -p /tmp/bionames.zip media.tsv | tail -1
> failed [/Users/rpage/Desktop/bionames-data/darwincore/doi/simple_media.php:83]: SELECT * FROM names WHERE id=  LIMIT 1
> 
> Now, it's easy enough to grep out those lines, but of course one wonders if those files are truncated and would be
> larger, had the error not occurred?
> 
> The row counts of the imported tables are now:
>          taxa  1,520,483
>    references    393,552
>         media  1,520,487
> which correspond to the number of lines in the .tsv files.  Are these correct?
> 
> 
> - Is it worthwhile to check http://bionames.org/data/darwincore/bionames.zip regularly?  Will it get updated
> (weekly, monthly, yearly)?
> 
> 
> Thank you!
> 
> Erik Rijkers
> 
> 
> PS
> and TWIMC, here is a (pretty basic) bash/perl script for loading the .tsv files into postgres.
> 
> 
> slurp_bionames.sh:
> 
> --------------8<----------------------------
> #!/bin/bash
> 
> zipfile=/tmp/bionames.zip
> 
> schema=public
> 
> time for x in  "4  taxa.tsv      taxa"  \
>              "21 references.tsv refs"  \
>              "12 media.tsv      media" ;
> do
>  set -- $x; ncols=$1; slurp_file=$2; t=$schema.$3;
>  echo " [$ncols];  [$slurp_file];  [$t]"
> 
>  # create table from header line:
>  unzip -p $zipfile $slurp_file | head -n 1 | perl -ne '
>    chomp; my @arr = map { $_ .= " text" } split(/\t/);
>    print "drop table if exists '$t';\ncreate table '$t'(\n  ",
>          lc(join("\n, ", @arr)), "\n);\n";' | psql -qX;
> 
>  # then read file into that table:
>  unzip -p $zipfile $slurp_file | perl -MEncode -ne '
>    next if (/^failed/);
>    chomp;
>    s{\"}{DOUBLEQUOTE}g;
>    my @arr = split(/\t/, $_ );
>    while (scalar(@arr) < '$ncols' ) { push(@arr, ""); }
>    print encode("UTF8", (join("\t", @arr)."\n"), Encode::FB_CROAK);
>  ' | psql -qXc "copy $t from stdin csv header delimiter E'\t';"
> 
>  # quick check: line count and row count should be (almost) the same
>  echo -n "file lines: "; unzip -p $zipfile $slurp_file  | grep -c '^'
>  echo -n "table rows: "; echo "select count(*) from $t" | psql -qtAX
> done
> --------------8<----------------------------
> 
> 
> 
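
The line-count check discussed above can also be done without loading anything into a database. What follows is a rough sketch only, assuming each .tsv starts with one header line and may end with a stray line beginning with "failed"; the function name count_data_rows is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a .tsv stream on stdin, skip the header
# line, ignore any line starting with "failed", and print the number
# of remaining data rows.
count_data_rows() {
    # tail -n +2 drops the header; grep -vc counts non-matching lines
    tail -n +2 | grep -vc '^failed'
}

# Demonstration on a tiny inline sample: header, two rows, one error line
printf 'id\tname\n1\tAus\n2\tBus\nfailed [x.php:50]: SELECT\n' | count_data_rows

# Against the real archive, the call would look like this:
#   unzip -p /tmp/bionames.zip taxa.tsv | count_data_rows
```

The number printed can then be compared with the result of "select count(*)" on the corresponding table.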

---------------------------------------------------------
Roderic Page
Professor of Taxonomy
Institute of Biodiversity, Animal Health and Comparative Medicine
College of Medical, Veterinary and Life Sciences
Graham Kerr Building
University of Glasgow
Glasgow G12 8QQ, UK

Email: r.page at bio.gla.ac.uk
Tel: +44 141 330 4778
Fax: +44 141 330 2792
Skype: rdmpage
Facebook: http://www.facebook.com/rdmpage
Twitter: http://twitter.com/rdmpage
Blog: http://iphylo.blogspot.com
Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
Wikipedia: http://en.wikipedia.org/wiki/Roderic_D._M._Page
Citations: http://scholar.google.co.uk/citations?hl=en&user=4Z5WABAAAAAJ
ORCID id: http://orcid.org/0000-0002-7101-9767



