The Impact of Outgroup Choice and Missing Data on Major Seed Plant Phylogenetics Using Genome-Wide EST Data

Jose Eduardo de la Torre-Bárcena 1,2, Sergios-Orestis Kolokotronis 3, Ernest K. Lee 3, Dennis Wm. Stevenson 2, Eric D. Brenner 2, Manpreet S. Katari 1, Gloria M. Coruzzi 1, Rob DeSalle 3*

1 Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
2 Cullman Molecular Systematics Laboratory and Genomics Laboratory, The New York Botanical Garden, Bronx, New York, United States of America
3 Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, New York, United States of America

Abstract

Background: Genome-level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases, a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important plant phylogenetic questions concerning the hierarchical relationships of the several major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales), which continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets. Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group.

Methodology: We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships.
This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved an unbiased (with respect to gene choice), well-resolved and highly supported phylogenetic hypothesis that was robust to various outgroup combinations.

Conclusions: We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) on tree topology, stability and support metrics. Our results indicate that while missing characters and order of addition of genes to an analysis do not influence branch support, inadequate taxon sampling and limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic scale data sets. As expected, support and resolution increase significantly as more informative characters are added, until reaching a threshold, beyond which support metrics stabilize, and the effect of adding conflicting characters is minimized.

Citation: de la Torre-Bárcena JE, Kolokotronis S-O, Lee EK, Stevenson DW, Brenner ED, et al. (2009) The Impact of Outgroup Choice and Missing Data on Major Seed Plant Phylogenetics Using Genome-Wide EST Data. PLoS ONE 4(6): e5764. doi:10.1371/journal.pone.0005764

Editor: William J. Murphy, Texas A&M University, United States of America

Received September 24, 2008; Accepted April 16, 2009; Published June 2, 2009

Copyright: © 2009 de la Torre-Bárcena et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the NSF Plant Genome Program, DBI-0421604, and by the Lewis and Dorothy Cullman Program in Molecular Systematics at the American Museum of Natural History and the New York Botanical Garden. The Sackler Institute for Comparative Genomics at the American Museum of Natural History and the Korein Family Foundation also provided support for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail:

Introduction

Genome-level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole genome DNA sequences of hundreds of organisms and large-scale EST databases, as well as the incorporation of other genome-enhanced technologies [1–4], a large number of candidate genes for inclusion into phylogenetic analysis have become available. In this work, we exploit the burgeoning EST database and the steadily growing number of whole plant genomes to address one of the more important phylogenetic questions concerning the hierarchical relationships of the major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales). The elucidation of spermatophyte phylogeny continues to be a work in progress, despite numerous studies using single, few or several genes and morphology datasets (morphological: [5–9]; and molecular: [10–16]), as recently and extensively reviewed [17].

Although most recent studies support the notion that gymnosperms and angiosperms are monophyletic and sister groups, they differ on the topological arrangements within each major group (Figure 1). Many current studies support the placement of Gnetales and conifers as closely related groups, either as sister clades (Panel B), or with Gnetales as a nested group within the conifers (Panel D).
In both of these hypotheses, cycads are the basal clade, followed by Ginkgo. A fourth hypothesis, which first emerged through the analysis of the plastid genes rbcL and rpoC1 [18,19] and multiple plastome genes [20], and again with phytochrome genes [13,21] and some genes involved in development [16,22,23], and which has generally remained marginal and controversial, places the Gnetales as basal gymnosperms, with conifers and Ginkgo plus cycads as later-branching sister groups.

PLoS ONE | June 2009 | Volume 4 | Issue 6 | e5764

In a previous publication [11], we incorporated Expressed Sequence Tags (ESTs) together with complete protein sequences plus a morphology matrix into a phylogenetic analysis of the seed plants. The concatenation and simultaneous analysis of 43 data partitions yielded a well resolved, single most parsimonious tree with reasonable bootstrap support. In that study we demonstrated the pertinence of using ESTs as a source of phylogenetic characters, provided there is adequate orthology determination. We also stressed the importance of assessing character support in more robust and consistent ways before declaring a phylogenetic question confidently resolved. Given the diverse origins, roles and evolutionary histories of all genes within a particular genome, issues of character support and conflict are relevant when considering the overall history of a taxonomic group, and it appears sensible to consider as many sources of evidence as possible (and available). In this context, the question of where to stop adding characters to a phylogenomic analysis [24] remains open and a high priority for the careful and efficient planning of sequencing projects across all phyla.

Although our earlier approach [11] proved to be very effective in estimating character support and conflict, as well as supporting the case for the use of ESTs in phylogenetic analysis, it was clear that more character information was needed to provide stronger support in the resolution of spermatophyte phylogeny.
An increase in total characters, but especially an increase in phylogenetically informative characters, would augment both apparent and hidden support in all gymnosperm clades, and provide stronger support for inferences on the hierarchical relationships among the taxa involved. The burgeoning EST and sequencing projects being conducted across genomes make such character information available at an accelerated and sustained pace. One of the main criticisms of phylogenetic projects employing whole- or partial-genome sequences is that, with the scarcity of comprehensive genomic or subgenomic data for a large number of taxa, the analyses would retrieve phylogenies for very few taxa that, even if well-resolved and strongly supported, would represent incorrect evolutionary reconstructions (e.g. [25]). Moreover, Gatesy et al. [26] showed that choice of ingroup taxa at the root of the tree and, more importantly, outgroup choice in deep phylogenomic studies is critical. In the current report, we have expanded taxonomic representation to 17 species, compared to the original six-ingroup, single-outgroup taxa study of de la Torre et al. [11], and expanded the number of gene partitions to 1,200.

Figure 1. Conflicting phylogenetic hypotheses on the evolution of seed plants. Morphological evidence (synapomorphic characteristics shared between angiosperms and Gnetales) has shaped the anthophyte theory, where these two taxa form sister groups (Panel A). In contrast, most molecular studies postulate gymnosperms as a monophyletic group sister to all angiosperms, and place the Gnetales as a sister group to the conifers (Panels B and D). Adding to the controversy, a recent study involving phytochrome genes (Panel C) has placed the Gnetales as basal gymnosperms, with Ginkgo and cycads as sister taxa branching after the Coniferales. A: refs. [5,6,66]; B and D: refs. [10,12,20,58,67]; C: ref. [13]. doi:10.1371/journal.pone.0005764.g001
Materials and Methods

Orthology prediction

In order to generate a comprehensive molecular matrix to address the phylogenetic questions of flowering versus non-flowering seed plants, we searched the TIGR Plant Transcript Assemblies database for well-sampled representatives of all major seed plant groups. Our database search for available EST/unigenes (from a total of 226,210 EST assemblies and singletons) from well-sampled representative members of major seed and seed-free plant groups retrieved a total of 158,358 genes from complete genomes (Arabidopsis, rice, and poplar), and between 16,000 and 22,000 total unigenes (depending on the dataset) from ESTs for all other species included in various versions of the analysis. In all, the following species were surveyed: Arabidopsis thaliana, Oryza sativa (common rice), Amborella trichopoda, Vitis vinifera (common grape vine), Populus trichocarpa (California poplar) (angiosperms); Cycas rumphii (Malayan fern palm), Zamia fischeri, Ginkgo biloba, Gnetum gnemon (melinjo, bago, peesae), Welwitschia mirabilis, Cryptomeria japonica (Japanese cedar), Pinus taeda (loblolly pine) (gymnosperms) as ingroup taxa; Selaginella moellendorffii (lycophyte), Adiantum capillus-veneris (Filicalean fern), Marchantia polymorpha (liverwort), Physcomitrella patens (moss) and Chlamydomonas reinhardtii (unicellular green alga) as outgroups. All available assembled EST databases, independent of their source (tissue, developmental stage, or type of experiment), were surveyed. Using these unigenes, the OrthologID software pipeline [27] was employed to predict orthologous groups, resulting in fully aligned matrices composed of 926–1,600 gene or ortholog partitions. The variance in the number of orthologs depended on the filtering schemes discussed below. These ortholog groups consisted mostly of translated EST sequence data.
Ortholog filtering

OrthologID identifies all genes that are orthologous amongst the taxon set under examination [27]. Due to the incomplete nature of the EST database, the resulting orthologous groups will often include only a few taxa. In addition, the available orthologs can be distributed in specific and narrowly defined taxonomic groups. We reasoned that the inclusion of partitions with three or fewer orthologs would add little to the robustness of the present analysis, so we developed a filtering function in our informatics analysis pipeline that removed any ortholog sets that had fewer than four taxa with genes in the ortholog group. In addition, we restricted this filtering to include only those ortholog groups with at least three ingroup taxa (specifically at least two gymnosperms and one angiosperm) and one outgroup taxon per partition. We arrived at a comprehensive dataset formed by 12 ingroup species and 4 outgroup species. We found that using all available outgroups resulted in the retrieval of the largest number of bona fide orthologous partitions (1,239) with the filtering scheme specifying the minimal presence of three ingroup taxa (two gymnosperms and one angiosperm) and one outgroup per partition.

The resulting ortholog groups comprise genes that are randomly distributed throughout the genome, as demonstrated by mapping the loci on the chromosome map of Arabidopsis thaliana (Figure S1). This somewhat compensates for the general bias of EST and transcriptome data, which most often show enrichment for genes implicated in metabolism, energy and general housekeeping, and an underrepresentation of functional categories such as gene regulation. Still, our dataset comprises an array of orthologous genes belonging to diverse functional categories (Figure S2), including transcriptional regulators and signaling genes.
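The filtering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline code: the genus-name labels, the `passes_filter` and `filter_partitions` helpers, and the representation of each partition as a set of taxon names are all hypothetical, and the outgroup set assumes the four non-seed plants retained in the final dataset.

```python
from typing import Dict, Set

# Taxon groups taken from the species list in the text; genus-name keys are an
# assumption made for this sketch.
ANGIOSPERMS = {"Arabidopsis", "Oryza", "Amborella", "Vitis", "Populus"}
GYMNOSPERMS = {"Cycas", "Zamia", "Ginkgo", "Gnetum", "Welwitschia",
               "Cryptomeria", "Pinus"}
# Assumed: the four outgroups of the final 16-species dataset.
OUTGROUPS = {"Selaginella", "Adiantum", "Marchantia", "Physcomitrella"}


def passes_filter(taxa: Set[str]) -> bool:
    """Return True if a partition meets the minimal taxon requirements."""
    n_gymno = len(taxa & GYMNOSPERMS)
    n_angio = len(taxa & ANGIOSPERMS)
    n_out = len(taxa & OUTGROUPS)
    n_ingroup = n_gymno + n_angio
    return (
        n_ingroup + n_out >= 4   # at least four taxa in the ortholog group
        and n_ingroup >= 3       # at least three ingroup taxa
        and n_gymno >= 2         # ... of which at least two gymnosperms
        and n_angio >= 1         # ... and at least one angiosperm
        and n_out >= 1           # at least one outgroup taxon
    )


def filter_partitions(partitions: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep only the partitions whose taxon sets pass the filter."""
    return {name: taxa for name, taxa in partitions.items()
            if passes_filter(taxa)}


if __name__ == "__main__":
    example = {
        "geneA": {"Cycas", "Ginkgo", "Arabidopsis", "Selaginella"},
        "geneB": {"Cycas", "Arabidopsis", "Oryza", "Selaginella"},
        "geneC": {"Cycas", "Ginkgo", "Pinus", "Arabidopsis"},
    }
    print(sorted(filter_partitions(example)))
```

In this toy input only `geneA` is retained: `geneB` has a single gymnosperm and `geneC` has no outgroup.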
The fact that statistical tests (z-scores, Sungear [28]; data not shown) show a lack of overrepresentation of these categories further suggests that our ortholog sample is more balanced (i.e. less biased) than any previously reported for similar studies of EST data.

Construction of a comprehensive seed plant phylogenomic matrix

Once the ortholog groups were established as detailed above, we used the Perl script ASAP (Automated Simultaneous Analysis Phylogenies; [29]) to organize and construct a matrix. This program automatically constructs a matrix with partitions labeled by gene name, GO category, and other informatics categories. The concatenated partitioned matrix can be found in Document S1.

Phylogenomic analyses

The phylogenetic matrix was analyzed using maximum parsimony (MP) and maximum likelihood (ML) optimality criteria. Parsimony analysis was performed in PAUP* 4b10 [30] using equal weights. Node support was evaluated using the nonparametric bootstrap and jackknife methods in PAUP. Pairwise phylogenetic congruence across all partitions was tested using the ILD test (incongruence length difference; [31,32]) in PAUP. While this measure has been criticized recently [33–36], we chose to use this test conservatively in the context of this study. Branch support measures, such as the Bremer index [37], partitioned branch support [38], and hidden branch support [39], were calculated in ASAP in conjunction with PAUP. Maximum likelihood inference was carried out in RAxML 7.0.4 [40] at the AMNH Computational Sciences facility on an 8-way server with 2.2 GHz AMD Opteron 846 processors and 128 GB RAM using the fine-grained parallel Pthreads (POSIX Threads Library; [41]) implementation, and on the CIPRES cluster using the MPI (Message Passing Interface; [42,43]) implementation. The substitution model best fitting the data was selected in ProtTest [44] by contrasting each model inference's log-likelihood score.
The JTT model [45] yielded the highest likelihood score and therefore was used in ML inference, taking into account empirical amino acid frequencies calculated directly from the data in hand (Document S2). Among-site rate heterogeneity was accounted for using the CAT approximation model [46] with 25 site rate categories. Node support was quantified with 1625 rapid bootstrap pseudo-replicates as implemented in the parallel versions of RAxML [47].

To explore the effect of outgroup choice on tree topology, we performed a series of searches with different combinations of ingroup and outgroup taxa. These manipulations are summarized in Figure S3. We also explored the effect of missing taxa on the overall phylogenetic hypothesis by measuring the amount of branch support (BS) and partitioned hidden branch support (PHBS) for trees generated by serial nested additions of ingroup taxa (3–11). This analysis involved serially adding partitions with up to 3 taxa, then up to 4 taxa, and so on, so that the matrix kept expanding as partitions with more taxa were added.

Results

The impact of outgroup choice on seed plant phylogenetics

In order to address the issue of random rooting [26,48], we chose to break up the long root to the seed plants by including additional outgroup taxa (Physcomitrella, Marchantia, Selaginella, and Adiantum). Species chosen to implement this approach fulfilled two criteria: known phylogenetic relevance and good representation in the database. The results are shown in Figure 2. The relative placement of gymnosperm groups changes as outgroup taxa are excluded or rooting is forced on certain seed plant taxa. If no outgroups are specified, trees behave differently depending on whether (and which) seedless taxa are included. When the unicellular green alga Chlamydomonas and/or the moss Physcomitrella are included, cycads and Ginkgo nest within the conifers, and Gnetales appear basal.
When only the heterosporous lycophyte Selaginella (or any of the seed plants) is used to root the tree, Gnetales and conifers group together, and form a sister group to cycads and Ginkgo. Forcing the latter to be the outgroup does not change the relative positions of the former. Gap-coding the matrix results in similar arrangements, except for Cryptomeria, which falls outside the gymnosperms, probably due to insufficient amounts of informative characters.

Figure S3 suggests that the effect of long branch attraction or random rooting can be neutralized by multiple outgroup analysis. In fact, our resulting tree topology remains stable and robust regardless of which outgroup or outgroup combinations we use (including no outgroup, when rooting with any of the seed plants), suggesting we might have reached a large enough number of informative characters to render a highly robust topology, immune to outgroup choice. In all subsequent analyses we removed Chlamydomonas from the analysis because it appears to have extreme random root effects [26,48] and because we have replaced it with four other more appropriate outgroups.

Figure 2. Phylogenetic relationships of seed plants using 1200 genes inferred with parsimony and likelihood methods. The topologies were identical across optimality criteria. The tree shown here was estimated by maximum likelihood using the JTT substitution matrix and empirical amino acid frequencies with the CAT model for among-site rate heterogeneity and final optimization with the GAMMA model. Log-likelihood = −3989109.546056 and α (alpha) = 0.720925. The bar denotes 0.05 substitutions/site. All nodes received the highest level of support regardless of the optimality criterion. The table inset shows partition support values (PBS and PHBS). The rightmost column in this table shows the proportion of hidden total support.
neg, negative hidden support; nd, non-definable. doi:10.1371/journal.pone.0005764.g002

A robust phylogenomic hypothesis focused on the relationships of major seed plant groups

Phylogenetic analysis of the most inclusive matrix we constructed (72,900 informative characters from 16 species) resulted in a single most parsimonious tree with very high measures of branch support. Figure 2A shows the MP tree of 12 seed plant ingroup taxa rooted with all four outgroup taxa (non-seed plants). Bootstrap and jackknife support values are all at or near 100%. Bremer decay values vary, but all are in the double digits or above. Higher-level inferences of relationships are consistent with most previous molecular analyses, showing gymnosperms as a monophyletic group sister to the angiosperms. As expected, angiosperm species conform to the well-accepted view that Amborella is basal to all flowering plants, followed by the separation between monocots (Oryza) and the eudicots Arabidopsis, Vitis, and Populus [25,49]. Not surprisingly, as two of these species are fully sequenced, all measures of support for angiosperm groupings are very high (Bremer indices in the triple digits).

The grouping of gymnosperms in the expanded analysis shown in Figure 2A is different from the one observed in our previous study [11], which placed cycads as the earliest diverging branch followed by Ginkgo, and then the Gnetales and conifers as sister taxa deeper in the gymnosperm clade (i.e., a pectinate gymnosperm clade). We point out that in the present study, the tree generated differs from the previous one not only in the overall number of taxa, where the ingroup is doubled and the outgroup is quadrupled, but also in the overall placement of gymnosperm taxa.
The MP tree (Figure 2A) shows Gnetum and Welwitschia (which form a solid monophyletic group) branching early and forming a sister clade to all other gymnosperms.

Notably, the topology of the phylogenomic tree shown in Figure 2A does not agree with two prior hypotheses. The first proposes that all conifers are sister to Gnetales, and the second proposes that the Gnetales are nested within the conifers, in particular placed as sister to conifers I (e.g. [10]; see Figure 1B and 1D). In addition, our initial hypothesis [11] that cycads, followed by Ginkgo, could be the earliest diverging extant gymnosperms is not supported in this larger analysis. Instead, the present analysis seems to provide robust support for the hypothesis that Gnetales are the earliest diverging gymnosperm lineage (Figure 1C), previously postulated using phytochrome genes as data sources [13] and in other analyses using the chloroplast gene rpoC1 [19], AGL6 [16,22], and Floricaula/LEAFY [23], even though they are the most recent group in the seed plant fossil record. Figure 2 shows the maximum likelihood (ML) tree, which agrees entirely with the MP tree topology. This tree has robust (100%) likelihood bootstrap values at all nodes with the exception of the node supporting the clade (Selaginella, (Marchantia, Physcomitrella)) at 54%. The final log-likelihood score and branch lengths were optimized with the GAMMA model of rate heterogeneity in RAxML and yielded a score of −3989109.546056 and an α (alpha) shape parameter of the Γ (Gamma) distribution of 0.720925.

Missing taxa have a significant effect on tree topology and support – relevance to EST phylogenomics

Previous studies using both simulated (e.g. [50]) and real (using ESTs: e.g. [51,52]) datasets have tested whether large amounts of missing taxa have a significant effect on the topology and support of a phylogenetic analysis.
This type of analysis is particularly relevant to EST studies, as the probability of obtaining a full complement of taxa for a particular ortholog is reduced as the number of taxa in the analysis increases (see [53] for an example in animals). This approach is generally accomplished by comparing support metrics and topology changes on datasets with and without given combinations of missing taxa. All existing results (with little change in these factors for compared datasets) have hitherto suggested that large numbers of missing taxa per se do not alter either the signal or support values. However, when "missing taxa" also means too few available characters for a correct call regarding taxon placement, the negative effect is indeed dramatic. Our analysis on the 43-partition matrix [11] revealed that subtracting partitions with high taxon representation did collapse many branches or significantly lower overall support, although the exclusion of these taxon-dense partitions also meant the removal of crucially informative character information. We explored the effect these missing taxa had on the overall phylogenetic hypothesis by comparing the amount of branch support and hidden branch support for each node using partitions where information was available for 7, 6, 5, and 4 taxa.

As shown in Figure 3, tree support values increase dramatically as more partitions with fuller taxon complements are added. This result could argue for the exclusion of partitions with low numbers of taxa. When analyzing individual partitions, it is clear that trees from those with lower numbers of taxa have fewer informative characters, fewer resolved clades and, ultimately, lower support values across the board. However, we also suggest keeping those partitions with even minimal character information, as these partitions may often prove valuable in the resolution of a single clade or clades within the tree.

We also explored the effect of missing taxa on the overall phylogenetic hypothesis.
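The link drawn above between taxon representation and the number of informative characters can be illustrated with a small Python sketch. It is a toy illustration rather than the ASAP/PAUP workflow used in the study: the function names, the dictionary-of-alignments input, and the set of symbols treated as missing data are assumptions. It relies on the standard definition of a parsimony-informative site, namely a column in which at least two character states each occur in at least two taxa.

```python
from collections import Counter, defaultdict
from typing import Dict

# Assumed symbols for missing data, gaps and unknown residues.
MISSING = {"?", "-", "X"}


def is_parsimony_informative(column: str) -> bool:
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(c for c in column if c not in MISSING)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment: Dict[str, str]) -> int:
    """Count parsimony-informative columns in one aligned partition."""
    seqs = list(alignment.values())
    return sum(is_parsimony_informative("".join(col)) for col in zip(*seqs))


def informative_by_taxon_count(
    partitions: Dict[str, Dict[str, str]]
) -> Dict[int, int]:
    """Sum informative sites over partitions, binned by number of taxa."""
    totals: Dict[int, int] = defaultdict(int)
    for alignment in partitions.values():
        totals[len(alignment)] += informative_sites(alignment)
    return dict(totals)


if __name__ == "__main__":
    toy = {
        "g1": {"A": "MKLV", "B": "MKLV", "C": "MRIV", "D": "MRIV"},
        "g2": {"A": "AC", "B": "AC", "C": "AD"},
    }
    print(informative_by_taxon_count(toy))
```

For the toy input, the four-taxon partition contributes two informative columns and the three-taxon partition contributes none, since a three-taxon column cannot hold two states that each occur twice.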
Figure 4 depicts how the data in our study relate to the compromise of increasing the number of characters and taxa. Given our choice of taxa, and the current sequence availability for each species (indicated on the X axis of Figure 4), a peak of informative characters and related bootstrap values (a "phylogenetic sweet spot" of sorts) is reached between 5 and 6 taxa. That is, genes that are found in five or six of the taxa in this study, when combined, have more parsimony-informative characters and higher overall bootstrap values. This result arises because there are fewer and fewer genes with fuller taxonomic representation in the EST database.

This result does not necessarily mean the incorporation of additional taxa is of no value. Potentially important character information is still obtained when adding more taxa. While this illustrates the effect of missing taxa for genes in the EST database, an analysis will benefit from the compounded information obtained from including all partitions containing 3 to 9 ingroup taxa, below and above which phylogenetic information will be null. In theory, the upper limit will shift to the right as more genomes are sequenced, until reaching an absolute limit, given by evolutionary – not technological – constraints, i.e. a real lack of overlap for several genes among species. As seen before, even when adding incomplete partitions (i.e. with varying amounts of taxon representation within the partition), support increases radically as more parsimony-informative sequence data are added. This result indeed argues for the inclusion of all information available, as long as a minimum of 3 ingroup and one outgroup species is maintained in each partition.

Analysis of individual partitions

As shown previously for seed plants [11] and yeast species [24], analysis of trees generated with individual data partitions reveals large disagreement with the simultaneous analysis tree hypothesis. Yet, as shown in earlier studies (e.g.
[11,54,55]), most, if not all, of such apparent incongruence is statistically significant using the ILD test. We employed this test in order to explore the interaction