Circularity & testing.

James Lyons-Weiler weiler at ERS.UNR.EDU
Sun Oct 20 18:27:09 CDT 1996

On Sun, 20 Oct 1996, Alan Harvey wrote:

> Which hypotheses, exactly, about cladistic data might we want to interpret
> _statistically_? Consider, we've got a data set; that is, a list of taxa, a
> list of characters for which two or more homologous states have been
> proposed, and for each taxon, data points indicating the state proposed to
> be present for each character. Note we've had to make hypotheses about two
> kinds of homology, that of characters among taxa (e.g., the feathers of
> pheasants and scales of skinks are homologous characters), and that of
> states among characters (e.g., the feathers of pheasants and the feathers
> of falcons are homologous states). It seems most of the controversy
> involves the latter, to which I will restrict discussion here.
> Let's say hypothesis A is that all of these state homology hypotheses are
> true. If hypothesis A itself is true, then the parsimony algorithm used on
> the data set should yield a tree topology for which there is _no_
> homoplasy. If there is in fact no homoplasy, then of course hypothesis A is
> supported but not proven, and a relevant statistical test should indicate
> the probability that this result (a homoplasy-free tree) could obtain if
> hypothesis A is false. If there is _any_ homoplasy, hypothesis A as worded
> is falsified; that is, at least some of the homology hypotheses are false.
> The tree of course suggests which ones are false, namely, those that are
> homoplastic on the tree, but I have been unable to come up with a relevant
> statistical test, I think because the main hypothesis has already been
> falsified (a type II error test seems nonsensical, e.g., what is the
> probability that hypothesis A is true given observed homoplasy).
Thank goodness for some additional input!

If hypothesis A is that all these state homology hypotheses are true, then
the expectation is that evolution works in mysterious ways.  (Of course, I
don't mean to imply that Alan thinks that evolution works in any
particular way).  Nevertheless, if we can agree that sometimes evolution
destroys information about the phylogenetic history of organisms, and
sometimes it does not, then hypothesis A needn't be posited.  This
requires a shift in the background knowledge of some people; for others,
it's what they've been thinking all along.
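
To make "no homoplasy" concrete for a single character on a fixed tree,
here is a minimal sketch. It assumes a rooted binary tree written as nested
2-tuples and uses Fitch's counting procedure as a stand-in, since neither
message names an algorithm beyond "the parsimony algorithm". The taxa, tree,
and state codings are hypothetical, and the sketch only scores a given tree;
it does not search for one.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted
    binary tree given as nested 2-tuples of taxon names (assumed format)."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):        # leaf: the observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                       # children agree on some state
            return common
        steps += 1                       # disjoint sets imply one change
        return left | right

    visit(tree)
    return steps


def extra_steps(tree, states):
    """Homoplasy for one character: observed steps minus the minimum
    possible, which is (number of distinct states - 1)."""
    k = len(set(states.values()))
    return fitch_steps(tree, states) - (k - 1)


# Hypothetical four-taxon tree and two hypothetical binary characters.
tree = ((("pheasant", "falcon"), "skink"), "frog")
congruent = {"pheasant": 1, "falcon": 1, "skink": 0, "frog": 0}
conflicting = {"pheasant": 1, "falcon": 0, "skink": 1, "frog": 0}

print(extra_steps(tree, congruent))    # 0: fits the tree without homoplasy
print(extra_steps(tree, conflicting))  # 1: one extra step on this tree
```

Under hypothesis A as Alan words it, every character in the matrix would
score 0 extra steps on the most parsimonious tree.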

> The problem obviously lies in that the hypothesis being tested is in fact a
> statement about many hypotheses. Many, many hypotheses. The hypothesis "All
> occurrences of state b of character B are homologous" is itself a collection
> of (correctably non-independent!) pairwise hypotheses ("b in taxon 1 is
> homologous with b in taxon 2"; "b in taxon 1 is homologous with b in taxon
> 3..."). Much more informative than hypothesis A, it seems, would be to
> evaluate the support the data gives to each of these hypotheses (e.g.,
> "what is the probability that a supported pairwise homology is actually
> false", and "what is the probability that an unsupported pairwise homology
> is actually true"). In this case, also, I don't think circularity is a
> relevant concern, since the hypothesis involves only two data points, and
> the test involves the entire data set.
As I recall Hennig's auxiliary principle, homology is first assumed, and
then overthrown in light of evidence to the contrary.  If the assumed
homologies for characters a-g are wrong, but we don't know it, they are
relied upon nevertheless to test homology statements for other characters,
which are, in turn, called upon to test homology statements for yet other
characters.  If a measure of the probability of homology could be made in
reference to some null hypothesis, instead of in reference to what the
other characters in the matrix are doing, then induction would not be
required.

> However, my impression is that we're a long way off from being able to
> statistically address these pairwise homologies. The hypotheses that may
> (or may not) be addressable now seem to be of the "bulk" variety, e.g.,
> hypothesis A above, or perhaps an even weaker version: "At least some of
> these state homology hypotheses are true." At best, maybe there is a test
> that really can tell you that a significant proportion of the individual
> homology hypotheses are likely to be correct, but without telling you which
> ones those are. For morphological data sets, at least, this still seems
> like a pointless piece of information,  because we do so much filtering of
> the available data prior to analysis (e.g., I don't try to establish
> homologies for individual setae) that true randomness seems a remote
> possibility.

This manifests itself in a rather explicit assumption, made by some
statistical analyses, of random sampling of taxa and characters.  This
assumption is not made by all statistical analyses, especially if the
inferences made via hypothesis testing refer to a specific matrix of
character states, and the test does not at the same time include an
inference of phylogeny made from that matrix.
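
One familiar place where the random-sampling assumption shows up is
bootstrap resampling of characters, which treats the columns of the matrix
as independent draws from a larger pool. The sketch below is only an
illustration of that assumption, not a method proposed in this thread; the
matrix layout (a dict from taxon name to a list of states) and the function
name are hypothetical.

```python
import random


def bootstrap_characters(matrix, rng):
    """Return a pseudoreplicate matrix built by drawing character columns
    with replacement. Treating columns as exchangeable draws is exactly
    the random-sampling assumption at issue."""
    n_chars = len(next(iter(matrix.values())))
    picks = [rng.randrange(n_chars) for _ in range(n_chars)]
    # Apply the same column picks to every taxon so columns stay intact.
    return {taxon: [row[i] for i in picks] for taxon, row in matrix.items()}


# Hypothetical 3-taxon, 3-character binary matrix.
matrix = {"A": [0, 1, 1], "B": [1, 0, 1], "C": [1, 1, 0]}
replicate = bootstrap_characters(matrix, random.Random(1996))
print(replicate)
```

If characters were filtered before analysis, as Alan describes for
morphological data, the resampled columns are not draws from any
well-defined population, which is why the assumption matters.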

> Well, that's my two cents.

> Tangential P.S.: James Lyons-Weiler wrote that "Homoplasy indices show an
> across-taxon, among study, very general trend: when more taxa are added,
> the congruence goes down..." The problem lies in the indices, not the data;
> i.e., it's an artifact (I gave a talk on this at the Toronto Hennig
> meetings).
My point exactly.  The problem is not the data, but the use of a
tree-based measure of information.  The artifact is undesirable.

Cheers, Alan!

