Richard Jensen rjensen at saintmarys.edu
Tue Aug 23 10:00:27 CDT 2011

```Regardless of the algorithm used, an operation that creates subsets of a
collection of objects may be called cluster analysis.  It may be based
on a matrix of similarities (or distances) among the objects, or it may
be based on the basic data matrix itself.  The objects clustered may be
OTUs or the characters comprising the data matrix.

In the old days, parsimony algorithms were recognized as a type of
cluster analysis. However, because their results were meant to be
interpreted as approximations to phylogenetic trees, that usage was
dropped: it made it seem as if these were simply another tool for what
was called numerical taxonomy (i.e., phenetics).  This distinction was important
because phenetic analyses were not intended to be interpreted as
phylogenetic trees, although in many instances they produced results
consistent with phylogenetic trees and, in fact, were often used as
first approximations of phylogenetic trees.

There's much more to the story than this, but I see no reason not to
after all, the OTUs in the data set are partitioned into subsets

Cheers,

Dick J

On 8/23/2011 5:21 AM, Sergio Vargas wrote:
> pffff,
>
> I guess this will be another never ending debate;-)
>
> From the email thread on parsimony, orangs et al. etc. I think we were
> using clustering in the sense of "hierarchical clustering", as when one
> calculates a pair-wise distance matrix and uses this matrix to draw a
> tree using an agglomerative or divisive algorithm. I think Maximum
> parsimony (MP) and Maximum Likelihood (ML) phylogenetic reconstructions
> are not clustering algorithms in this sense.
>
> Using a broader definition from Wikipedia: clustering = "the assignment
> of a set of observations into subsets=clusters". I'm still not sure one
> could say MP or ML are actually doing this. One could argue, perhaps,
> that you are assigning taxa to clades and that MP and ML are some form
> of unsupervised learning... not sure either.
>
> Enlightenment from other taxacomers, pointers to relevant literature,
> etc. much appreciated.
>
> sergio
>
> On 8/23/11 4:24 AM, Herbert Jacobson wrote:
>> I don't think clustering is "...grouping by a data matrix." Quite the
>> opposite, it is grouping by the "coefficient matrix" which is the result
>> of some sort of data matrix manipulation.
>>
>> Herb
>>
>>> Date: Sat, 20 Aug 2011 12:36:30 -0500
>>> From: Richard.Zander at mobot.org
>>> To: morris.bob at gmail.com; sevragorgia at gmail.com
>>> CC: taxacom at mailman.nhm.ku.edu
>>> Subject: Re: [Taxacom] cladistics (was: clique analysis in textbooks)
>>>
>>> I think taxacomers who lack decisive training in phenetic analysis,
>>> which is most of us, figure clustering is grouping by a data matrix
>>> that compares one taxon and one variable and then some similarity
>>> algorithm. Thus, Sergio is correct that an instant similarity or
>>> distance tree is different from a parsimony tree, in terms of what we
>>> have been told: i.e. that phenetics and parsimony are different.
>>>
>>> On the other hand, I took a tutorial course (3 days) in clustering
>>> techniques (didn't learn much, of course) at a meeting of the
>>> Classification Society from the then president of the Society and
>>> Pierre Legendre. I asked, ahem, if parsimony was a clustering
>>> technique. The two glanced at each other furtively, then opined that
>>> indeed parsimony is a clustering technique. Thus, authority says it is.
>>>
>>> Yes, parsimony does calculate a bunch of distance trees and selects
>>> recursively (I think) the shortest tree because it is NP-complete
>>> (NP-hard), i.e., no known algorithm guarantees an exact solution in
>>> polynomial time. So...does the fact that we have to do heuristic
>>> sampling to get any sort of tree make parsimony not clustering?
>>> I think this is what this
>>> Surely the product is a distance tree based on shortest
>>> transformation set?
>>>
>>>
>>> * * * * * * * * * * * *
>>> Richard H. Zander
>>> Missouri Botanical Garden, PO Box 299, St. Louis, MO 63166-0299 USA
>>> Web sites: http://www.mobot.org/plantscience/resbot/ and
>>> Modern Evolutionary Systematics Web site:
>> http://www.mobot.org/plantscience/resbot/21EvSy.htm
>>> -----Original Message-----
>>> From: taxacom-bounces at mailman.nhm.ku.edu
>> [mailto:taxacom-bounces at mailman.nhm.ku.edu] On Behalf Of Bob Morris
>>> Sent: Friday, August 19, 2011 10:53 PM
>>> To: Sergio Vargas
>>> Cc: taxacom at mailman.nhm.ku.edu
>>> Subject: Re: [Taxacom] cladistics (was: clique analysis in textbooks)
>>>
>>> On Fri, Aug 19, 2011 at 2:32 PM, Sergio Vargas
>>> <sevragorgia at gmail.com> wrote:
>>> "...because clustering can be done (computationally) efficiently
>>> whereas searching for an optimal tree using phylogenetic methods
>>> cannot."
>>>
>>> It's fair enough that some or even all biologists might have a usage
>>> of "clustering" that meets all of your explanation, and perhaps even
>>> that this should be agreed to by all of the readership of taxacom. I
>>> wouldn't know. But in statistical pattern recognition and data mining,
>>> not everything called clustering can be done computationally
>>> efficiently. Many techniques those disciplines call clustering are
>>> intractable in the sense that they are NP-hard. Informally, this means
>>> that (with presently understood computational complexity theory)
>>> no known exact algorithm scales better than exponentially with the
>>> size of the data in the worst case, just as for optimal tree
>>> induction problems. So I can only understand your text as meaning
>>> "...because clustering as meant by all practicing phylogeneticists can
>>> be done (computationally) efficiently...", and that is why you are
>>> prepared to subsequently say that the rest of your explanation "[...]
>>> is so basic I cannot believe I am explaining it".
>>>
>>> I do wonder a little whether in fact all practicing phylogeneticist
>>> readers of taxacom understand by "clustering" only tractable
>>> algorithms.
>>>
>>> Bob Morris
>>>
>>> Robert A. Morris
>>> Emeritus Professor  of Computer Science
>>> UMASS-Boston
>>> 100 Morrissey Blvd
>>> Boston, MA 02125-3390
>>> IT Staff
>>> Filtered Push Project
>>> Harvard University Herbaria
>>>
>>>
>>>
>>> email: morris.bob at gmail.com
>>> web: http://efg.cs.umb.edu/
>>> web: http://etaxonomy.org/mw/FilteredPush
>>> http://www.cs.umb.edu/~ram
>>>
>>>
>>>
>>> On Fri, Aug 19, 2011 at 2:32 PM, Sergio Vargas
>>> <sevragorgia at gmail.com> wrote:
>>>> Hi,
>>>>
>>>>> Clustering is clustering is clustering. Group some things together
>>>>> and you are clustering - however it is done.
>>>>
>>>> No, you are not. Grouping is not clustering; there are many ways to
>>>> group things together that do not involve clustering. Maximum
>>>> parsimony, maximum likelihood and Bayesian analysis are not
>>>> clustering. It is simply incorrect to call these methods clustering.
>>>> When you run any of the above analyses you are not clustering,
>>>> despite the result being something similar to a cluster. If you
>>>> could reduce phylogenetic inference to clustering, everything would
>>>> be so easy (computationally speaking), because clustering can be done
>>>> (computationally) efficiently whereas searching for an optimal tree
>>>> using phylogenetic methods cannot. Taxa are only "clustered"
>>>> (randomly or sequentially) together to build the first tree;
>>>> afterwards entire topologies are evaluated, and taxa are not
>>>> clustered. This is so basic I cannot believe I am explaining it.
>>>>
>>>> sergio
>>>>
>>>> --
>>>> Sergio Vargas R., M.Sc.
>>>> Dept. of Earth & Environmental Sciences
>>>> Palaeontology & Geobiology
>>>> Ludwig-Maximilians-Universität München
>>>> Richard-Wagner-Str. 10
>>>> 80333 München
>>>> Germany
>>>> tel. +49 89 2180 17929
>>>> s.vargas at lrz.uni-muenchen.de
>>>> sevra at marinemolecularevolution.org
>>>>
>>>> check my webpage:
>>>> http://www.marinemolecularevolution.org
>>>>
>>>> check my research ID:
>>>> http://www.researcherid.com/rid/A-5678-2011
>>>>
>>>>
>>>> _______________________________________________
>>>>
>>>> Taxacom Mailing List
>>>> Taxacom at mailman.nhm.ku.edu
>>>> http://mailman.nhm.ku.edu/mailman/listinfo/taxacom
>>>>
>>>> The Taxacom archive going back to 1992 may be searched with either
>> of these methods:
>>>> (1) by visiting http://taxacom.markmail.org
>>>>
>>>> (2) a Google search specified as:
>>   site:mailman.nhm.ku.edu/pipermail/taxacom  your search terms here
>>>
>>>
>>> --
>>> Robert A. Morris
>>>
>>> Emeritus Professor  of Computer Science
>>> UMASS-Boston
>>> 100 Morrissey Blvd
>>> Boston, MA 02125-3390
>>> IT Staff
>>> Filtered Push Project
>>> Department of Organismal and Evolutionary Biology
>>> Harvard University
>>>
>>>
>>> email: morris.bob at gmail.com
>>> web: http://efg.cs.umb.edu/
>>> web: http://etaxonomy.org/mw/FilteredPush
>>> http://www.cs.umb.edu/~ram
>>> phone (+1) 857 222 7992 (mobile)

--
Richard J. Jensen, Professor
Department of Biology
Saint Mary's College
Notre Dame, IN 46556
Tel: 574-284-4674

```
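To make the contrast in this thread concrete, here is a minimal sketch, not anything the posters themselves wrote. It assumes four hypothetical OTUs (`A` to `D`) with made-up pairwise distances. The first function builds a tree agglomeratively from a distance matrix, which is the narrow "hierarchical clustering" sense Sergio describes; single linkage is used here only as one simple choice of merging rule. The second counts the unrooted bifurcating topologies, (2n-5)!!, that an exhaustive optimal-tree search would have to consider. The function names `single_linkage` and `unrooted_tree_count` are illustrative, not from any library.

```python
from itertools import combinations


def single_linkage(names, dist):
    """Agglomerative clustering on a pairwise distance matrix.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) -> distance between OTUs a and b.
    Returns the tree as nested tuples. Naive sketch, not an optimized routine.
    """
    # Map each cluster (a label or a nested tuple) to the OTU labels it contains.
    clusters = {name: (name,) for name in names}
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                dist[frozenset((x, y))]
                for x in clusters[pair[0]]
                for y in clusters[pair[1]]
            ),
        )
        # Merge them into one new cluster and repeat.
        members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = members
    return next(iter(clusters))


def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees on n >= 3 labelled taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


# Hypothetical distances, invented purely for illustration.
otus = ["A", "B", "C", "D"]
d = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("A", "D")): 5.0,
    frozenset(("B", "C")): 4.0,
    frozenset(("B", "D")): 5.0,
    frozenset(("C", "D")): 2.0,
}

tree = single_linkage(otus, d)
print(tree)                     # (('A', 'B'), ('C', 'D'))
print(unrooted_tree_count(10))  # 2027025
```

The agglomerative routine finishes after n-1 merges, while the number of candidate topologies already exceeds two million for ten taxa. That is the practical reason optimal-tree searches under parsimony or likelihood rely on heuristics. Whether this difference makes parsimony "not clustering" is the question the thread itself leaves open.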