[Taxacom] challenge of large molecular data sets

John Grehan jgrehan at sciencebuff.org
Wed Nov 12 15:00:26 CST 2008

As I dig my way through the Prasad data I am beginning to appreciate why
there has been little attention to scrutinizing original molecular data.
The first step was opening the files, which were in FASTA format -
something I had never encountered before (presumably this is run of the
mill for molecular folk). I was able to get a colleague to convert them
to NEXUS, but I could not directly export into a format I could print or
otherwise work on. When I used PAUP to show the data it somehow produced
the wrong content (NEXUS was ok). My colleague then suggested several
web applications, and one (Geneious) actually worked and color coded the
bases. From this I could print the data, but it contains over 21,000
sites (and the non-coding data is about five times larger). It would take me
forever to go through by hand to locate any putative apomorphies. No
doubt there are sorting programs that could identify those bases that
are unique to the ingroup (in this case humans and great apes) and list
them - assuming my computer capacity will handle that.
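The kind of sorting described above can be sketched in a few lines of
Python. This is only a minimal illustration under stated assumptions, not
a tool used for the Prasad data: it assumes the FASTA file is already
aligned (all sequences the same length), the function names read_fasta
and ingroup_unique_sites are made up for this example, and the taxon
names and sequences are toy values invented to show the output. A site is
reported when every ingroup taxon has the same base and no outgroup taxon
has it, so the result is a list of candidates only; whether a candidate
is an apomorphy still depends on the choice of outgroup. An alignment of
21,000 sites is small for this kind of scan on an ordinary desktop
computer.

```python
from io import StringIO


def read_fasta(handle):
    """Parse aligned FASTA text into a dict of {taxon name: sequence}."""
    parts, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the taxon name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def ingroup_unique_sites(seqs, ingroup):
    """List sites where all ingroup taxa share one base absent from the outgroup.

    Returns a list of (1-based position, {taxon: base}) tuples, so every
    reported site shows the base for each taxon.
    """
    ingroup = list(ingroup)
    missing = [taxon for taxon in ingroup if taxon not in seqs]
    if missing:
        raise ValueError(f"ingroup taxa not in alignment: {missing}")
    outgroup = [taxon for taxon in seqs if taxon not in ingroup]
    if not outgroup:
        raise ValueError("at least one outgroup taxon is required")
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; is the file aligned?")

    hits = []
    for i in range(lengths.pop()):
        in_bases = {seqs[taxon][i] for taxon in ingroup}
        out_bases = {seqs[taxon][i] for taxon in outgroup}
        # Require a single unambiguous base across the ingroup (no gaps or N)
        # that does not occur in any outgroup taxon at this site.
        if len(in_bases) == 1 and in_bases <= set("ACGT") and not in_bases & out_bases:
            hits.append((i + 1, {taxon: seqs[taxon][i] for taxon in seqs}))
    return hits


# Toy alignment, invented for illustration only.
TOY_FASTA = """>human
ACGTA
>chimp
ACGTA
>gorilla
ACGTT
>macaque
ATGTC
"""

seqs = read_fasta(StringIO(TOY_FASTA))
hits = ingroup_unique_sites(seqs, ["human", "chimp", "gorilla"])
for position, bases in hits:
    listing = " ".join(f"{taxon}={base}" for taxon, base in bases.items())
    print(f"site {position}: {listing}")
```

For a real file, open(path) can be passed to read_fasta in place of the
StringIO object, and the ingroup list replaced with the names used in the
FASTA headers.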


Perhaps future molecular systematists should be required to list their
apomorphies (by sequence position, listing all the bases for each taxon)
when they are dealing with huge data sets, so that any obfuscation caused
by the large data sets is avoided and the evidence is transparent - as it
is in good morphological studies (the same requirement of documentation
should apply to morphological studies as well, where the documentation
is also often poor).


John Grehan


Dr. John R. Grehan

Director of Science

Buffalo Museum of Science

1020 Humboldt Parkway

Buffalo, NY 14211-1193

email: jgrehan at sciencebuff.org

Phone: (716) 896-5200 ext 372




Ghost moth research


Human evolution and the great apes



