Main
The sharp divide in terms of cellular complexity between eukaryotes and prokaryotes has been deemed “the greatest single evolutionary discontinuity to be found in the present-day world” 10, and the possibility of extensive gene transfer through prokaryotic (endo)symbiosis has long been considered a stepping stone in this process 11. The current consensus on eukaryogenesis revolves around scenarios that always involve an endosymbiotic relationship with extensive gene transfer between an alphaproteobacterial endosymbiont and a host with an Asgard archaeal ancestry 2, 3. Phylogenomics has shed light on the phylogenetic placement and potential traits of these two established partners 12, 13, 14, 15, 16. However, some proposed eukaryogenesis models involve at least one other partner 5 or even serial interactions with various non-alphaproteobacterial symbionts acting as gene donors 4, 6. The large dominance of bacterial over archaeal contributions in reconstructed last eukaryotic common ancestor (LECA) proteomes 4, 17 and the observation that only a small fraction of bacterial-derived proteins can be confidently traced back to Alphaproteobacteria may indicate further bacterial contributions 4, 18. However, alternative explanations of such observations include the difficulty of reconstructing ancient phylogenetic relationships and the presence of horizontal gene transfer (HGT) carried by the Asgard archaeal or alphaproteobacterial partners.
Here we addressed the question of whether protein families within the LECA could be traced back to ancestors other than Alphaproteobacteria or Asgard archaea with a level of support similar to those traced to these broadly accepted partners. To alleviate potential artefacts arising from phylogenetic reconstruction, including the effects of unsampled lineages, contamination and recent HGT, we used state-of-the-art methodologies and compiled curated datasets including representative proteomes with the highest possible qualities, from which we purged low-quality sequences, recent paralogues and sparsely distributed proteins. Our results identify at least two major signals of bacterial ancestry different from Alphaproteobacteria and a consistent set of gene acquisitions inferred to be mediated by Nucleocytoviricota viruses.
Reconstruction of the proteome
To leverage the recent explosion of genome data across the tree of life, particularly among eukaryotes, we reinferred the LECA proteome using an automated approach similar to those used in previous studies 4, 19, 20, 21. To minimize known methodological issues in homology and phylogenetic inference, we subsampled existing data to obtain a balanced representation across the eukaryotic tree of life (eTOL), while ensuring a tractable size and the highest possible quality (Methods, Supplementary Methods and Supplementary Tables 1 and 2). We also curated the selected proteomes to remove low-quality and low-complexity proteins. Given our focus on deep evolutionary nodes, we kept a single representative of clusters of recent eukaryotic in-paralogues. We replicated this procedure to generate three alternative 100-proteome datasets (eTOLDBA, eTOLDBB and eTOLDBC) that overlapped with respect to about 46% of their proteins (Supplementary Fig. 1) for assessment of the data-dependency of our results. We clustered proteins in these datasets into orthologous groups (OGs) and defined putative descendants of the LECA (LECA-OGs) as OGs that contained at least five different species, at least three of nine eukaryotic supergroups and the two main eukaryotic stems after removal of potential contaminants (Methods and Supplementary Figs. 1 and 2). LECA-OGs were highly consistent across datasets (more than 96%). To further refine these families, we used protein alignment profile similarity searches against a broad database (broadDB) comprising order-level prokaryotic pangenomes reconstructed from more than 65,000 genomes available at GTDB 22 and sequence representatives of more than 1.3 million clusters of viral sequences 23 (Methods). This approach ensured maximal coverage of extant diversity while minimizing the impact of database biases and recent HGT. We next reconstructed maximum likelihood phylogenies from the LECA-OGs expanded with the closest broadDB hits (Methods). These phylogenies were used to assess the monophyly of the eukaryotic proteins, which were split into different, monophyletic LECA-OGs (mLECA-OGs) if necessary. We repeated the same procedure with this new set of mLECA-OGs by building new alignment profiles and repeating the broadDB search and phylogenetic reconstructions (Methods). This resulted in a final set of mLECA-OGs (with 79% consistency across datasets) and their phylogenies in the context of their closest non-eukaryotic homologues. Analysis of the earliest splits in the mLECA-OG phylogenies indicated that only 3% of the OGs could possibly result from HGT between eukaryotic supergroups (Methods).…
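As an illustration of the taxonomic criteria used above to define LECA-OGs (at least five species, at least three of nine eukaryotic supergroups, and both main eukaryotic stems), the following minimal Python sketch applies those three thresholds to a single orthologous group. It is not the pipeline used in the study: the `Taxon` record, the function name `is_leca_og`, and the supergroup and stem labels (`sg_A`, `stem_1` and so on) are hypothetical placeholders, and contaminant removal is assumed to have already happened.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    """Taxonomic annotation for one protein in an orthologous group.

    All labels are hypothetical placeholders, not the names used in the study.
    """
    species: str
    supergroup: str  # one of the nine eukaryotic supergroups
    stem: str        # one of the two main eukaryotic stems


def is_leca_og(members, min_species=5, min_supergroups=3, min_stems=2):
    """Return True if an OG meets the three taxonomic thresholds.

    Assumes `members` has already been purged of potential contaminants.
    """
    species = {m.species for m in members}
    supergroups = {m.supergroup for m in members}
    stems = {m.stem for m in members}
    return (
        len(species) >= min_species
        and len(supergroups) >= min_supergroups
        and len(stems) >= min_stems
    )


# Toy OG spanning five species, three supergroups and both stems.
wide_og = [
    Taxon("sp1", "sg_A", "stem_1"),
    Taxon("sp2", "sg_A", "stem_1"),
    Taxon("sp3", "sg_B", "stem_1"),
    Taxon("sp4", "sg_C", "stem_2"),
    Taxon("sp5", "sg_C", "stem_2"),
]

# Toy OG with six species and three supergroups, but only one stem.
one_stem_og = [
    Taxon("sp1", "sg_A", "stem_1"),
    Taxon("sp2", "sg_A", "stem_1"),
    Taxon("sp3", "sg_B", "stem_1"),
    Taxon("sp6", "sg_B", "stem_1"),
    Taxon("sp7", "sg_D", "stem_1"),
    Taxon("sp8", "sg_D", "stem_1"),
]

print(is_leca_og(wide_og))      # passes all three thresholds
print(is_leca_og(one_stem_og))  # fails the two-stem requirement
```

Requiring both stems in addition to a supergroup count is what prevents a family confined to one side of the eukaryotic root from being counted as a LECA descendant, even when it is otherwise widely distributed.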