[Taxacom] Specimen database that works with sequence data
schindeld at si.edu
Fri Nov 14 10:31:20 CST 2014
GenBank records that comply with the BARCODE data standard have the reserved keyword BARCODE. Each of these records includes an identifier for the voucher specimen (it's in the specimen_voucher field in the FEATURES table). The data standard requires that the voucherID be in the form of the Darwin Core triplet (InstitutionCode:CollectionCode:CatalogNumber) but only about half of the ~475,000 BARCODE records are in the proper format. Many records just have field numbers or the identifier provided by BOLD, the workbench at the University of Guelph from which records were submitted to GenBank. For reasons I can't explain, BOLD never implemented a system for storing and submitting the properly formatted voucherID. The Consortium for the Barcode of Life (CBOL) is responsible for the data standard and we will soon be broadcasting a message to the community with instructions for bringing the non-compliant voucherIDs into compliance.
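As a rough illustration of what "proper format" means here, the following minimal Python sketch checks whether a voucher string has the three colon-separated parts of a Darwin Core triplet. The function names and the sample identifiers are made up for the example; this is not CBOL's or GenBank's validation code, and a real check would probably also want to confirm the institution and collection codes against a registry.

```python
def parse_triplet(voucher_id):
    """Split a voucher string into (institution, collection, catalog).

    Returns None when the string does not have exactly three
    non-empty, colon-separated parts. Hypothetical helper, for
    illustration only.
    """
    parts = [part.strip() for part in voucher_id.split(":")]
    if len(parts) != 3 or not all(parts):
        return None
    return tuple(parts)


def is_compliant(voucher_id):
    """True if the voucher string looks like a Darwin Core triplet."""
    return parse_triplet(voucher_id) is not None


if __name__ == "__main__":
    # Sample values are invented, not taken from real records.
    for voucher in ["USNM:Birds:123456", "JD-2011-045", "USNM::123456"]:
        print(voucher, "->", is_compliant(voucher))
```

A bare field number such as the second sample fails the check because it has no institution or collection part, which is the situation described above for roughly half of the BARCODE records.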
Fortunately, several hundred thousand records have properly formatted voucherIDs (or very nearly so), and these will guide you to specimen data in museum databases. For example, I contributed to a barcoding project on birds in the US National Museum, and all 2816 resulting BARCODE records have Darwin Core triplets that are hyperlinked to a record resolver that finds the specimen record in the Museum's EMu database. You can find all the records by searching on my last name in GenBank (http://www.ncbi.nlm.nih.gov/nuccore/?term=schindel), and more information is provided in ZooKeys 152: 87-91 (08 Dec 2011), doi: 10.3897/zookeys.152.2473.
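For anyone who would rather script the same kind of search, here is a small Python sketch that only builds a query URL for NCBI's E-utilities esearch service; it does not contact the server. Combining the `[Author]` and `[Keyword]` field tags to narrow the search to BARCODE records is my assumption about a convenient query, not something the BARCODE data standard prescribes, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def barcode_search_url(author, retmax=20):
    """Build an esearch URL for nucleotide records by author and keyword.

    The search term below is an assumed example query; adjust it to
    whatever fields you actually want to filter on.
    """
    term = f"{author}[Author] AND BARCODE[Keyword]"
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(barcode_search_url("schindel"))
```

The response from that URL would be a list of record identifiers, which then have to be fetched in a second step; the sketch stops at building the query.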
CBOL, USNM, Smithsonian
From: Taxacom [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Urmas Kõljalg
Sent: Friday, November 14, 2014 10:57 AM
To: taxacom at mailman.nhm.ku.edu
Subject: Re: [Taxacom] Specimen database that works with sequence data
PlutoF cloud (http://plutof.ut.ee) provides such services: you can develop combined datasets of different taxon occurrences (including specimen and DNA data). Sequence data can be downloaded as a FASTA file through the Clipboard system provided by the online workbench. Probably the best-known such dataset hosted by PlutoF is the UNITE fungal rDNA ITS database (http://unite.ut.ee), which is used by many NGS pipelines such as QIIME, mothur, SCATA, UTAX, etc.
On 13-11-2014 14:35, Eric Chapman wrote:
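For those who do want to try a script, the query described in the original question (all specimens from one country with sequences for one gene, written out as FASTA) can be sketched in a few lines of Python. The record layout below, with `id`, `country`, `gene` and `sequence` keys, is a hypothetical one chosen for the example; it is not the PlutoF export format or any other database's schema, and the sample records are invented.

```python
import sys


def write_fasta(records, handle, country, gene, width=70):
    """Write records matching country and gene to handle in FASTA format.

    records: iterable of dicts with 'id', 'country', 'gene', 'sequence'
             keys (an assumed layout, for illustration only).
    Returns the number of sequences written.
    """
    count = 0
    for rec in records:
        if rec["country"] != country:
            continue
        if rec["gene"].upper() != gene.upper():
            continue
        handle.write(f">{rec['id']}\n")
        seq = rec["sequence"]
        # Wrap long sequences at a fixed line width.
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Invented sample data.
    sample = [
        {"id": "spec001", "country": "US", "gene": "COI", "sequence": "ACGTACGT"},
        {"id": "spec002", "country": "CA", "gene": "COI", "sequence": "TTGACCAA"},
        {"id": "spec003", "country": "US", "gene": "16S", "sequence": "GGCATTAC"},
    ]
    write_fasta(sample, sys.stdout, country="US", gene="COI")
```

The sequences come out unaligned, which matches what was asked for; alignment would be a separate step in whatever tool you already use.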
> I was wondering if anyone could tell me if there is a database
> available that houses both collection information and DNA sequences of
> multiple genes such that I could query that database in this way:
> For all specimens that are from the US with COI sequences, give me a
> FASTA (or other DNA format) file containing all of the sequences.
> I don't care if the sequences are aligned - I can do that part. I have
> been working with a data file and selecting a subset of sequences by
> hand in MacClade or Mesquite, which has become very time consuming as
> the data set has grown to well over 1000 sequences. I am not skilled
> at writing scripts, so extracting them that way is not practical for
> me. I have never used Sequencher - does it have this capability?
> I would appreciate any input any of you can give me.
> Eric Chapman
> Eric G. Chapman, PhD
> Research Analyst, Collections Manager
> Department of Entomology
> University of Kentucky
> S225 Agricultural Science Center N
> Lexington KY 40546-0091 USA
> (859) 257-3169 (lab)
> (330) 221-7812 (mobile)
> Taxacom Mailing List
> Taxacom at mailman.nhm.ku.edu
> http://mailman.nhm.ku.edu/cgi-bin/mailman/listinfo/taxacom
> The Taxacom Archive back to 1992 may be searched at:
> http://taxacom.markmail.org
> Celebrating 27 years of Taxacom in 2014.