[Taxacom] Specimen database that works with sequence data

Daniel Barker db60 at st-andrews.ac.uk
Fri Nov 14 11:06:18 CST 2014

Dear Eric,

The problem seems difficult, for a few further reasons:

- GenBank sequences may be incorrect (incorrect bases in the GenBank record); or

- GenBank sequences may be incorrectly annotated (wrong species, wrong gene, etc., in the GenBank meta-data); or

- In general, GenBank sequences may be inconsistently annotated (hemoglobin vs haemoglobin, sulfur vs sulphur, 'protein/gene name goes here' vs 'hypothetical', etc.)

An approach that only uses GenBank meta-data, without some consideration of the sequences themselves, will be open to some errors from these sources.

Sequence-based approaches that ignore GenBank's annotation, for example BLAST or protein domain searches (e.g. InterProScan or CDD), potentially remove the problem of 'wrong gene' and 'inconsistently annotated'. I can imagine they could help with a minority of cases of 'incorrect bases', too.

However, they introduce other issues that must be dealt with, which may be summed up as: how should the results of BLAST or protein domain searches be interpreted in biological terms? Which are the relevant cut-offs? Where several models of the protein domain of interest exist, which should be used, or should you make your own?

Thought 1. Would iPhy help? It seems to me that it might. I know of it, but have not used it:


Thought 2. How about a BLAST search at NCBI, but with an Entrez query to limit results? This will cause BLAST to only report matches to that subset of sequences in the database that also match your keyword query. (N.B. Search for protein-coding sequences at the protein level, e.g. BLASTP, PSI-BLAST, BLASTX, TBLASTN or TBLASTX as appropriate; not BLASTN, which for protein-coding sequences is less sensitive and less accurate.)
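
To make Thought 2 concrete, here is a minimal sketch in Python (standard library only) of how such a request might be assembled. It only builds and prints the form fields; it does not contact NCBI. The field names (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY) are the ones I understand the NCBI BLAST URL API to use, so please check them against the current NCBI documentation before relying on this. The function names, the placeholder peptide and the example Entrez query are hypothetical, chosen for illustration only.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_request(protein_seq, entrez_query,
                        program="blastp", database="nr"):
    """Return form fields for a BLAST submission limited by an Entrez query.

    Only database sequences that also match `entrez_query` should be
    reported as hits.
    """
    return {
        "CMD": "Put",                  # submit a new search
        "PROGRAM": program,            # protein-level search, not blastn
        "DATABASE": database,
        "QUERY": protein_seq,
        "ENTREZ_QUERY": entrez_query,  # keyword filter on the database
    }


def encode_request(fields):
    """URL-encode the form fields as bytes, ready to send as a POST body."""
    return urlencode(fields).encode("ascii")


# Hypothetical example: cover both spellings to cope with inconsistent
# annotation, and restrict hits to one taxon.
example_query = "(hemoglobin OR haemoglobin) AND Coleoptera[Organism]"
example_fields = build_blast_request("MVLSPADKTNVKAAW", example_query)
example_body = encode_request(example_fields)

print(BLAST_URL)
print(example_body.decode("ascii"))
```

Listing both spellings in the Entrez query is one cheap way to reduce the effect of inconsistent annotation, although it cannot help with records whose annotation is simply wrong.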

Best regards,


Daniel Barker

From: Taxacom [taxacom-bounces at mailman.nhm.ku.edu] on behalf of Eric Chapman [ericgchapman at gmail.com]
Sent: 13 November 2014 19:35
To: taxacom at mailman.nhm.ku.edu
Subject: [Taxacom] Specimen database that works with sequence data


I was wondering if anyone could tell me if there is a database available
that houses both collection information and DNA sequences of multiple
genes such that I could query that database in this way:

For all specimens that are from the US with COI sequences, give me a FASTA
(or other DNA format) file containing all of the sequences.

I don't care if the sequences are aligned - I can do that part. I have been
working with a data file and selecting a subset of sequences by hand in
MacClade or Mesquite, which has become very time consuming as the data set
has grown to well over 1000 sequences. I am not skilled at writing scripts,
so extracting them that way is not practical for me. I have never used
Sequencher - does it have this capability?

I would appreciate any input any of you can give me.

Eric Chapman

Eric G. Chapman, PhD
Research Analyst, Collections Manager
Department of Entomology
University of Kentucky
S225 Agricultural Science Center N
Lexington KY 40546-0091 USA
(859) 257-3169 (lab)
(330) 221-7812 (mobile)