Invariant sites and reliability
Kurt Milton Pickett
kpickett at AMNH.ORG
Wed Jul 20 14:52:18 CDT 2005
This behavior occurs because, under a homogeneous ML model, which
averages branch lengths over the tree, every invariant character
added shortens edge lengths on the tree. This means that the
probability of erroneously interpreting a homology as homoplasy
declines, and so support for clades increases. You can add invariant
characters to any matrix and get the same results. In a Bayesian
analysis, the relative likelihoods of trees are used to accept a new
one or keep an old one, and the disparity in likelihoods increases as
invariant characters are added, so the number of trees retained in
the credible set declines. This has only to do with the way ML
treats branch lengths, and nothing to do with character types. This
kind of support measure differs from resampling support like the
bootstrap, in which apparent support falls as invariant characters
are added (because the probability of resampling the informative
character falls as the total number of characters increases). In a
homogeneous ML framework, the way the Bayesian support is working is
the way it "should" given the branch-length smoothing of that
framework. The unexpected behavior of the bootstrap under this
framework is partly why Goloboff and Farris came up with the Poisson
bootstrap, which is available in TNT. Much of this is discussed in
Goloboff, P., J. S. Farris, M. Källersjö, B. Oxelman, M. J. Ramirez
and C. A. Szumik. 2003. Improvements to resampling measures of group
support. Cladistics 19: 324-332.
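
To put rough numbers on the bootstrap point above, here is a minimal Python sketch. It assumes a matrix with a single informative character among n characters, an ordinary bootstrap that draws n characters with replacement, and a Poisson scheme in which each character's weight is drawn independently from a Poisson distribution with mean 1. That last assumption is only the general idea of Poisson resampling, not a description of how TNT implements it, and the function names are made up for illustration.

```python
import math


def p_included_standard(n_chars: int) -> float:
    """Chance that one given character is drawn at least once when
    n_chars characters are resampled n_chars times with replacement."""
    return 1.0 - (1.0 - 1.0 / n_chars) ** n_chars


def p_included_poisson(mean_weight: float = 1.0) -> float:
    """Chance that a character receives weight >= 1 when its weight is
    drawn independently from a Poisson distribution (assumed mean 1)."""
    return 1.0 - math.exp(-mean_weight)


if __name__ == "__main__":
    # Total matrix size: one informative character plus invariant ones.
    for n in (1, 2, 10, 100, 1000):
        print(
            f"{n:5d} characters: "
            f"standard = {p_included_standard(n):.4f}, "
            f"Poisson = {p_included_poisson():.4f}"
        )
```

Under these assumptions the standard bootstrap value starts at 1 for a one-character matrix and declines toward 1 - 1/e as invariant characters are added, while the Poisson value does not depend on the number of characters at all.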
> Interesting business, that of likelihood (e.g. max likelihood and
> and uninformative characters.
> I've used MrBayes to analyze a contrived data set like this:
> 1 CCCCCCCCCCCCCCCCCCC etc.
> 2 CACCCCCCCCCCCCCCCCC
> 3 CACCCCCCCCCCCCCCCCC
> 4 CCCCCCCCCCCCCCCCCCC
> (You can also use 4 sequences of totally random data, except for
> one site.)
> If you set the model to recognize all parsimony uninformative sites as
> invariable, then you get a low (33%) Bayesian posterior probability
> for the tree ((2, 3)4, 1).
> If you set the model to treat all uninformative sites as variable,
> you get BPP of 100% for the same tree.
> Doubtless (I think) this last situation is because getting two "A"s
> at one site for taxon 2 and 3 is very improbable when all the other
> sites have no mutations to "A"s at all.
> This shows a very clear difference in what characters and character
> really mean between analysis in morphology and molecular studies,
> it's hard to put into words.
> Also, choice of model to recognize invariant sites or not makes a big
> difference, doesn't it? How might this affect measures of uncertainty?
> Richard H. Zander
> Bryology Group, Missouri Botanical Garden
> PO Box 299, St. Louis, MO 63166-0299 USA
> richard.zander at mobot.org
> Voice: 314-577-5180; Fax: 314-577-9595
> Bryophyte Volumes of Flora of North America:
> Res Botanica:
> Shipping address for UPS, etc.:
> Missouri Botanical Garden
> 4344 Shaw Blvd.
> St. Louis, MO 63110 USA
> -----Original Message-----
> From: A.P. Jason de Koning [mailto:apjdk at ALBANY.EDU]
> Sent: Tuesday, July 19, 2005 9:31 PM
> To: TAXACOM at LISTSERV.NHM.KU.EDU
> Subject: Re: [TAXACOM] Molecular taxonomy: on way out?
>> Why bother expurging the data matrix from non-informative
>> characters when the algorithm doesn't take them into account?
> I agree that throwing out data here is unnecessary, and in general
> can be biasing.
> An unmentioned issue to consider with respect to probabilistic
> analyses (of molecular or phenotypic characters): the inclusion of
> characters with phylogenetically non-informative character state
> distributions can be beneficial because they may contribute
> information to the estimation of model parameters (such as relative
> rates of character state transformation), which can have an
> "indirect" effect on the selection of an optimal topology (though
> perhaps rarely, and not strongly). Why? Because the effect on
> relative rate estimates by 'non-informative' characters can alter the
> relative probabilities of trees for other 'informative' characters!
> For a very large dataset, inclusion of such characters could matter
> for any given phylogeny problem...certainly it matters in molecular
> evolutionary analyses where phylogeny is not the desired endpoint.
> - Jason
Kurt Milton Pickett
Theodore Roosevelt Fellow
Division of Invertebrate Zoology
American Museum of Natural History
79th Street at Central Park West
New York, NY 10024-5192
kpickett at amnh.org