Technical and scientific news related to the information society and science
Wednesday, June 5, 2024
Unveiled: the largest catalog of new antibiotic molecules, with almost a million unknown compounds
Using computation applied to biology, Spaniard César de la Fuente and Portuguese researcher Luis Pedro Coelho reveal a potential arsenal against the resistance of microorganisms to existing drugs
When Frenchman Ernest Duchesne found penicillin in 1897 and Alexander Fleming rediscovered it in 1928, human health took a giant leap forward. For the first time, the chances of dying from an infection fell drastically. However, the use and abuse of antibiotics over the last 100 years has taught microbial pathogens to develop defenses against the best pharmacological weapon. Every year, according to The Lancet, almost five million people die from microorganisms resistant to current antibiotics, and finding new effective molecules is essential. In this unavoidable fight, the laboratories of Spaniard César de la Fuente at the University of Pennsylvania and Portuguese researcher Luis Pedro Coelho at Queensland University of Technology have discovered, as they report in Cell, the world's largest source of antimicrobials (863,498 peptides) from which new treatments can be developed.
The researchers turned to artificial intelligence and machine learning to search everywhere (the human body, such as saliva or skin; animals, such as pig intestines or corals; plants; soil; water; or extinct beings) for combinations of amino acids with antibiotic potential. This is what is known as microbial dark matter: microorganisms that have left genetic material in any medium but have not yet been cultured in the laboratory.
Of the almost one million molecules found, nine out of 10 are new and had to be named, such as lachnospirin and enterococcin, the most effective. “They had never been described,” De la Fuente stresses. Of that huge number, the team has managed to test a hundred at the preclinical level (Petri dishes and mice) on 11 disease-causing bacterial strains, including antibiotic-resistant strains of E. coli and Staphylococcus aureus. “Our initial evaluation revealed that 63 of these candidates completely eradicated the growth of at least one of the pathogens tested and, often, of multiple strains. In some cases, these molecules were effective against bacteria at very low doses,” explains the researcher from A Coruña, who recently received an award in his home region.
In a preclinical model tested in infected mice, treatment with the new peptides produced results similar to those of polymyxin B, a commercially available antibiotic used as a control that is used to treat meningitis, pneumonia, sepsis, and urinary tract infections.
The fact that both researchers are biotechnologists has made it possible to cut processes that used to take up to a decade down to months. Their teams analyzed databases of 87,920 microbial genomes and 63,410 metagenomes (mixtures of these). They were looking for combinations of amino acids unknown to the pathogens that have developed resistance to current antibiotics and that are responsible for what the World Health Organization considers one of the 10 main threats to humanity.
The team has published all the findings, grouped under the name AMPSphere (antimicrobial peptide sphere), on an open-source platform so that any entity interested in developing new antibiotics can build on them. The idea is to overcome the pharmaceutical industry's tendency to focus on treatments for chronic diseases, which are used for longer and are more profitable.
“I have devoted my whole career to antibiotics, because it is one of the areas that receives the least investment and kills the most people in the world. Quite simply, my dream is to try to help humanity, to save lives. And for me that is the most important thing, more than making money,” says De la Fuente, who is promoting the creation of a company spun out of his laboratory at the University of Pennsylvania to accelerate the development of new antibiotics.
“There is an urgent need for new methods for antibiotic discovery. Using artificial intelligence to understand and harness the power of the global microbiome leads us to innovative research that improves public health,” adds Coelho, whose collaboration has been, in De la Fuente's opinion, extraordinary.
“We are proud of this research because we believe it is the largest antibiotic discovery project ever reported in terms of the amount of biological information we have explored and the number of new molecules we have found. It is a very complete representation of all the incredible microbial diversity that exists,” the Galician researcher stresses.
De la Fuente explains how the finding stems from a novel way of approaching the global and urgent problem of antibiotic resistance: “I think of biology as a source of information in the form of DNA, nucleotides, proteins, or amino acids. With computers we can go in as if with a magnifying glass and explore all that diversity, which is hidden from the human eye and encoded in such a complex and vast way.”
With much more modest models, other researchers are working in the same direction and under the same premise: the fight against a global threat. A study in The Microbe has analyzed the bacterial and archaeal communities (archaea are prokaryotic organisms that look like bacteria) in the Roman Baths of the British city of Bath. “This is really exciting research. Antimicrobial resistance is recognized as one of the most significant threats to global health, and the search for new antimicrobial natural products is accelerating. Our study has revealed, for the first time, that some of the microorganisms present in the Roman Baths are a potential source of new antimicrobial discoveries. The Roman Baths have long been regarded as medicinal and now, thanks to advances in modern science, we are discovering that the Romans and others were right,” says Lee Hutt, lead author of the work and a researcher at the University of Plymouth.
Another line of research aims not only to discover new antibiotics but also to ensure that they do not cause unwanted effects. Treatment with the well-known antibiotics amoxicillin and clindamycin causes changes in the overall structure of bacterial populations in the gut, reducing the abundance of several beneficial microbial groups, according to a team of researchers from the University of Illinois Urbana-Champaign writing in Nature. The researchers have tested a new antibiotic in mice. “Lolamicin,” as the new compound is called, “does not cause any drastic change in taxonomic composition over the course of the three-day treatment or the following 28-day recovery,” the researchers maintain.
Machine learning predicts nearly 1 million new antibiotics in the global microbiome
• Out of 100 tested peptides, 79 were active in vitro; 63 of these targeted pathogens
• Some peptides may originate from longer sequences through genomic fragmentation
• The AMPSphere is an open-access resource to accelerate antibiotic discovery
Summary
Novel
antibiotics are urgently needed to combat the antibiotic-resistance
crisis. We present a machine-learning-based approach to predict
antimicrobial peptides (AMPs) within the global microbiome and leverage a
vast dataset of 63,410 metagenomes and 87,920 prokaryotic genomes from
environmental and host-associated habitats to create the AMPSphere, a
comprehensive catalog comprising 863,498 non-redundant peptides, few of
which match existing databases. AMPSphere provides insights into the
evolutionary origins of peptides, including by duplication or gene
truncation of longer sequences, and we observed that AMP production
varies by habitat. To validate our predictions, we synthesized and
tested 100 AMPs against clinically relevant drug-resistant pathogens and
human gut commensals both in vitro and in vivo. A
total of 79 peptides were active, with 63 targeting pathogens. These
active AMPs exhibited antibacterial activity by disrupting bacterial
membranes. In conclusion, our approach identified nearly one million
prokaryotic AMP sequences, an open-access resource for antibiotic
discovery.
Therefore, there is an urgent need for novel methods for antibiotic
discovery. Computational approaches have recently been developed to
accelerate our ability to identify novel antibiotics, including
antimicrobial peptides (AMPs). Recently, proteome mining approaches have even been developed to
identify antimicrobial agents in extinct organisms in an attempt to
further expand our repertoire of known antimicrobials.
Bacteria
live in an intricate balance of antagonism and mutualism in natural
habitats. AMPs play an important role in modulating such microbial
interactions and can displace competitor strains, facilitating
cooperation.
(e.g., pexiganan, LL-37, and PAC-113). Although most AMPs display
broad-spectrum activity, some are only active against closely related
members of the same species or genus.
Such AMPs are more targeted agents than conventional broad-spectrum antibiotics. Furthermore, contrary to conventional antibiotics, the evolution of
resistance to many AMPs occurs at low rates and is not related to
cross-resistance to other classes of widely used antibiotics.
The
application of metagenomic analyses to the study of AMPs has been
limited due to technical constraints, primarily stemming from the
challenge of distinguishing genuine protein-coding sequences from false
positives.
Therefore, the significance of small open reading frames (smORFs) has been historically overlooked in (meta)genomic analyses. In recent years, significant progress has been made in metagenomic analyses of human-associated smORFs. These advancements have incorporated machine learning (ML) techniques
to identify smORFs encoding proteins belonging to specific functional
categories. Notably, a recent study used predicted smORFs to uncover approximately
2,000 AMPs from metagenomic samples of human gut microbiomes.
Nevertheless, it is important to note that the human gut represents
only a fraction of the overall microbial diversity, suggesting that
there remains an immense potential for the discovery of AMPs from
prokaryotes in the diverse range of habitats across the globe.
In
this study, we employed ML to predict and catalog AMPs from the global
microbiome as currently represented in public databases. By
computationally exploring 63,410 publicly available metagenomes and
87,920 high-quality microbial genomes,
we uncovered a vast array of AMP diversity. This resulted in the
creation of the AMPSphere, a collection of 863,498 non-redundant peptide
sequences, encompassing candidate AMPs (c_AMPs) derived from
(meta)genomic data. Remarkably, the majority of these c_AMP sequences
had not been previously described. Our analysis revealed that these
c_AMPs were specific to particular habitats and were predominantly not
core genes in the pangenome.
Moreover, we synthesized 100 c_AMPs from AMPSphere and found that 79 were active, with 63 exhibiting antimicrobial activity in vitro against clinically significant ESKAPEE pathogens, which are recognized as public health concerns,
and demonstrated their ability to target bacterial membranes and their
propensity to adopt α-helical and β-structures. Notably, the leading
candidates displayed promising anti-infective activity in a preclinical
animal model. Together, our work demonstrates the ability of ML
approaches to identify functional AMPs from the global microbiome.
Results
AMPSphere comprises almost 1 million c_AMPs from several habitats
AMPSphere incorporates c_AMPs predicted with ML using Macrel,
a pipeline that uses random forests to predict AMPs from large peptide
datasets with an emphasis on precision over recall. It was applied to
63,410 globally distributed publicly available metagenomes (Figure 1A; Table S1) and 87,920 high-quality bacterial and archaeal genomes.
Singleton sequences were removed, except when they had a significant match (defined as amino acid
identity ≥75% and E-value ≤10⁻⁵) to a sequence in the AMP-dedicated
database Data Repository of Antimicrobial Peptides (DRAMP) version 3.0.
This resulted in 5,518,294 genes, 0.1% of the total predicted smORFs,
coding for 863,498 non-redundant c_AMPs (on average 37 ± 8 residues
long; Figures 1A and S1). Similar to validated sequences with antimicrobial activity,
c_AMPs from AMPSphere present a positive charge (4.7 ± 2.6), high
isoelectric point (10.9 ± 1.2), amphiphilicity (hydrophobic moment,
0.6 ± 0.1), and a potential to bind to membranes or other proteins
(Boman index, 1.14 ± 1.1). As expected, in general, the distribution of
physicochemical properties of peptides from AMPSphere and DRAMP
are more similar to each other than to the negative training set
(assumed to not be AMPs). Nonetheless, c_AMPs from AMPSphere are on
average longer (37 ± 8 residues) than those in DRAMP
version 3.0 (28 ± 22 residues), and we observed differences in the
distribution of other features (e.g., charge, aliphaticity,
amphipathicity, and isoelectric point; Figure S1).
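The net charge quoted above can be approximated from sequence alone. The following is a minimal sketch, not Macrel's implementation: it assumes a textbook set of pKa values (published scales differ slightly), free N- and C-termini, and a hypothetical example sequence that is not taken from AMPSphere.

```python
# Approximate pKa values for ionizable groups (an assumed textbook set;
# other scales exist and will give slightly different charges).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}


def net_charge(peptide: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide at the given pH."""
    counts = {aa: peptide.count(aa) for aa in "KRHDECY"}
    counts["N_TERM"] = 1  # one free amino terminus
    counts["C_TERM"] = 1  # one free carboxyl terminus
    # Henderson-Hasselbalch: fraction of each group that is charged.
    positive = sum(
        counts[group] / (1.0 + 10 ** (ph - pka))
        for group, pka in PKA_POSITIVE.items()
    )
    negative = sum(
        counts[group] / (1.0 + 10 ** (pka - ph))
        for group, pka in PKA_NEGATIVE.items()
    )
    return positive - negative


if __name__ == "__main__":
    # Hypothetical cationic peptide, used only for illustration.
    print(round(net_charge("KKLLKWLKKAG"), 2))
```

A lysine- and arginine-rich sequence yields a clearly positive value, which matches the cationic profile the text describes for c_AMPs.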
We
subsequently estimated the quality of the smORF predictions and
detected 20% (172,840) of the c_AMP sequences in independent publicly
available metaproteomes or metatranscriptomes (Figures 2 and S2A; see STAR Methods section “Quality control of c_AMPs”) belonging to several habitats included in the AMPSphere, such as the human gut, plants, and others (Table S6). We then subjected all c_AMPs to a bundle of in silico quality tests (see STAR Methods section “Quality control of c_AMPs”).
A subset of c_AMPs (9.2% or 80,213 c_AMPs) passed all of them, and this
subset is hereafter designated as high-quality. Testing with other AMP
prediction systems (including AMPScanner v2),
we observed that 98.4% (849,703 peptides) of AMPSphere c_AMPs were also
predicted as AMPs by at least one other AMP prediction system.
Approximately 15% (132,440 out of 863,498 peptides) of AMPSphere c_AMPs
were co-predicted by all methods used.
Only
0.7% of the identified c_AMPs (6,339 peptides) are homologous
(operationally defined as amino acid identity ≥75% and E-value ≤10⁻⁵) to
experimentally validated AMP sequences in DRAMP version 3.0,
suggesting that c_AMPs represent a region of peptide sequence space
that is not present in these other databases. In total, we could find
only 73,774 (8.5%) c_AMPs with homologs in any of the databases we
considered. High-quality c_AMPs were detected in public databases at a
higher frequency than general c_AMPs (2.5-fold, p_Hypergeom. = 4.2 × 10⁻²⁵⁰; Figure 1B),
with 23,012 out of the 80,213 high-quality c_AMPs having a match in
another database. However, it is notable that 76.4% (4,843 peptides out
of 6,339) of those c_AMPs that have a homolog in DRAMP
version 3.0 (and, therefore, are highly likely to be functional) are
not high-quality c_AMPs. Thus, while our quality tests do enrich for
validated sequences, a failure to pass the tests is not a sufficient
reason to conclude that the sequence is not active.
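The operational homology rule used here (amino acid identity ≥75% and E-value ≤10⁻⁵) is easy to express as a filter over search hits. The sketch below is illustrative only: the `Hit` record and the example identifiers are hypothetical, and the source does not state which search tool produced the hits.

```python
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One query-versus-database match (hypothetical record layout)."""
    query: str
    target: str
    identity: float  # percent amino acid identity, 0-100
    evalue: float


def is_homolog(hit: Hit, min_identity: float = 75.0,
               max_evalue: float = 1e-5) -> bool:
    """Apply the thresholds stated in the text to a single hit."""
    return hit.identity >= min_identity and hit.evalue <= max_evalue


def queries_with_homologs(hits: Iterable[Hit]) -> Set[str]:
    """Return the query peptides with at least one qualifying hit."""
    return {hit.query for hit in hits if is_homolog(hit)}


if __name__ == "__main__":
    example_hits = [
        Hit("pepA", "db1", 92.0, 1e-12),  # passes both thresholds
        Hit("pepB", "db2", 74.9, 1e-20),  # identity too low
        Hit("pepC", "db3", 88.0, 1e-3),   # E-value too high
    ]
    print(queries_with_homologs(example_hits))
```

Both conditions must hold at once, so a highly similar but statistically weak match is not counted as a homolog.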
To put c_AMPs in an evolutionary context, we hierarchically clustered peptides using a reduced amino acid alphabet of 8 letters.
The three sequence clustering levels adopted identity cutoffs of 100%, 85%, and 75% (Figure S3).
At the 75% identity level, we obtained 521,760 protein clusters, of
which 405,547 were singletons, corresponding to 47% of all c_AMPs from
AMPSphere. A total of 78,481 (19.3%) of these singletons were detected
in metatranscriptomes or metaproteomes from various sources, indicating
that they were not artifacts. The large number of singletons suggests
that most c_AMPs originated from processes other than diversification
within families, which is the opposite of the hypothesized origin of
full-length proteins, in which singleton families are rare.
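To illustrate what clustering in a reduced alphabet means, the sketch below collapses the 20 amino acids into 8 groups before comparing sequences. The grouping shown is an assumption (the text does not say which 8-letter alphabet was used), and identity is computed only for equal-length, already aligned sequences, which is a simplification of a real clustering tool.

```python
# Assumed 8-group reduction of the 20 standard amino acids; the source does
# not specify the grouping, so treat this mapping as illustrative.
GROUPS = ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(peptide: str) -> str:
    """Rewrite a peptide using one representative letter per group."""
    return "".join(REDUCE[aa] for aa in peptide)


def identity(a: str, b: str) -> float:
    """Percent identity of two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    # Hypothetical peptides that differ only by conservative substitutions.
    p1, p2 = "KRLVAS", "RKIMGT"
    print(identity(p1, p2))
    print(identity(reduce_alphabet(p1), reduce_alphabet(p2)))
```

Sequences that look unrelated letter by letter can be identical after reduction, which is why a reduced alphabet groups peptides with conservative substitutions into the same cluster.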
Among them, we considered 6,499 as high-quality families because they
contained evidence of translation or transcription or because ≥75% of
their sequences pass all in silico quality tests, regardless of whether experimental evidence is available (see STAR Methods section “AMP families”). These high-quality families span 15.4% of the AMPSphere (133,309 peptides).
All the c_AMPs predicted here can be accessed at https://ampsphere.big-data-biology.org/.
Users can retrieve the peptide sequences, ORFs, and predicted
biochemical properties of each c_AMP (e.g., molecular weight,
isoelectric point, and net charge at pH 7.0). We also provide the
distribution across geographical regions, habitats, and microbial
species for each c_AMP.
c_AMPs are rare and habitat-specific
The
AMPSphere spans 72 different habitats, which were classified into eight
high-level habitat groups, e.g., soil/plant (36.6% of c_AMPs in
AMPSphere), aquatic (24.8%), and human gut (13%; Figure 1A; Table S2). Most of the habitats, except for the human gut, appear to be far from saturated in terms of discovered c_AMPs (Figure 1C).
In fact, most AMPs are rare (median number of detections is 99, or
0.17% of the dataset; when restricted to high-quality c_AMPs, the median
number of detections is 81, or 0.14% of the dataset), with 83.97% being
observed in <1% of samples (Figure S2).
Only 10.8% (93,280) of c_AMPs were detected in more than one high-level
habitat group (henceforth termed “multi-habitat c_AMPs”); this fraction
is 7.25-fold smaller than would be expected by a random assignment of
habitats to samples (p_Permutation < 10⁻³⁰⁰; see STAR Methods section “Multi-habitat and rare c_AMPs”).
Even within high-level habitat groups, c_AMPs overlap between habitats
much less frequently than expected by chance (2.4–192-fold less, p_Permutation < 5.4 × 10⁻⁵⁰; see STAR Methods section “Testing c_AMPs overlap across habitats”; Figure 1D).
Mutations in larger genes generate c_AMPs as independent genomic entities
Many AMPs are generated post-translationally by the fragmentation of larger proteins.
For example, EPs are computationally detected fragments from protein
sequences within the human proteome and other proteomes that have been
shown to be highly active.
EPs present diverse secondary structures and act on the membrane of
bacterial cells similarly to known natural AMPs but have different
physicochemical features compared to known AMPs.
AMPSphere only considered peptides encoded by dedicated genes.
Nonetheless, we hypothesized that some of these have originated from
larger proteins by fragmentation at the genomic level. To explore this,
we aligned the AMPSphere c_AMPs to the full-length proteins in GMGCv1
and observed that about 7% (61,020) of them are homologous to a canonical-length protein (Figure 1B),
with 27% of these hits sharing the start codon with the longer protein.
This suggests early termination of full-length proteins as one
mechanism for generating novel c_AMPs (Figures 3A and 3B).
To investigate the function of the full-length proteins homologous to AMPs, we mapped the matching proteins from GMGCv1 to orthologous groups (OGs).
We identified 3,792 (out of 43,789) OGs significantly enriched (p_Hypergeom. < 0.05,
after multiple hypothesis corrections with the Holm-Sidak method) among
the hits from AMPSphere. Although OGs of unknown function comprise
53.8% of all identified OGs, when considered individually, these OGs are
on average smaller than OGs in other categories. Thus, despite each OG
having a relatively small number of c_AMP hits, when compared to the
background distribution of the OGs in GMGCv1,
OGs of unknown function were the most enriched among the c_AMP hits, with an average enrichment of 10,857-fold (p_Mann ≤ 3.9 × 10⁻⁴; Figure 3C; Table S3).
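The enrichment p-values above are corrected with the Holm-Sidak method. As a reference for how that step-down correction works, here is a minimal pure-Python sketch; it is a generic implementation written for illustration, not the authors' code, and the example p-values are made up.

```python
from typing import List, Sequence


def holm_sidak(pvalues: Sequence[float]) -> List[float]:
    """Return Holm-Sidak adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak adjustment over the hypotheses still under test at this
        # step (m - rank of them), starting from the smallest p-value.
        step_value = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, step_value)
        adjusted[idx] = min(running_max, 1.0)
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for three tests.
    print(holm_sidak([0.01, 0.04, 0.03]))
```

A hypothesis is then called significant when its adjusted p-value falls below the chosen level, 0.05 in the text.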
c_AMP genes may arise after gene duplication events
We
next raised the question of whether c_AMPs would be predominantly
present in specific genomic contexts. To investigate the functions of
the neighboring genes of the c_AMPs, we mapped them against 169,484
genomes included in a recent study.
A total of 38.9% (21,465 out of 55,191) of c_AMPs with more than two
homologs in different genomes in the database showed phylogenetically
conserved genomic context with genes of known function (see STAR Methods
section “Genomic context conservation analysis”).
This holds true for curated versions of the catalog: 35.32% of
high-quality c_AMPs and 32.06% of high-quality c_AMPs with experimental
evidence show conserved genomic neighbors. These conservation values are
similar to that of 3,899,674 gene families with more than two homologs
calculated de novo on the gene catalog (34.4%), indicating that the genomic location of c_AMPs is not random.
Despite
being involved in similar processes, c_AMPs were generally depleted
from conserved genomic contexts involving known systems of antibiotic
synthesis and resistance, even when compared to small protein families (Figure 4).
Instead, we found that c_AMPs are encoded in conserved genomic contexts
with ribosomal genes (23.6%) at a higher frequency than other gene
families (4.75%; Figure 4A; Table S4).
Most of the c_AMPs (2,201 out of 2,642) in a conserved context with ribosomal subunits are homologous to ribosomal proteins (Figure 4D), congruent with the observation that in some species, ribosomal proteins have antimicrobial properties.
Seventy-seven c_AMPs homologous to ribosomal proteins were also
homologous to a ribosomal gene in their immediate vicinity (up to 1 gene
up/downstream). This phenomenon is not exclusive to ribosomal proteins:
1,951 c_AMPs can be annotated to the same KEGG Orthologous Group (KO)
as some of their immediate neighbors and may have originated from gene
duplication events. This shared annotation was interpreted in this
context as evidence for a common evolutionary origin and not as a
functional prediction for the c_AMPs. These duplications may have arisen
by recombination of flanking homologous sequences, which can happen
during cell division. Interestingly, 1,635 (83.8%) of these c_AMPs are located upstream of
the neighbor with the same KO annotation. Different permeases and
transposases are the most common KOs assigned to c_AMPs and their
neighbors (400 and 125 c_AMPs, respectively; see Table S5).
Most c_AMPs are members of the accessory pangenome
We observed that only a small portion (5.9%, p_Permutation = 4.8 × 10⁻³, N_Species = 416) of c_AMP families present in ProGenomes2 are contained in ≥95% of genomes from the same species (Figure 5), here referred to as “core.” This is consistent with previous work, in which AMP production was observed to be strain-specific. In contrast, a high proportion (circa 68.8%) of full-length protein families are core in ProGenomes2
species. There is a 1.9-fold greater chance (p_Fisher = 2.2 × 10⁻⁹²)
that a pair of genomes from the same species share at least one c_AMP
when they belong to the same strain (99.5% ≤ ANI < 99.99%).
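The “core” definition above (present in ≥95% of genomes of a species) can be written as a one-line filter. The sketch below is an illustration under assumptions: the input layout (family name mapped to the set of genome identifiers carrying it) and the family names are hypothetical.

```python
from typing import Dict, Set


def core_families(presence: Dict[str, Set[str]], n_genomes: int,
                  threshold: float = 0.95) -> Set[str]:
    """Return families found in at least `threshold` of a species' genomes."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return {
        family
        for family, genomes in presence.items()
        if len(genomes) / n_genomes >= threshold
    }


if __name__ == "__main__":
    genomes = [f"g{i}" for i in range(40)]
    presence = {
        "family_core": set(genomes[:39]),       # 39 of 40 genomes (97.5%)
        "family_accessory": set(genomes[:12]),  # 12 of 40 genomes (30%)
    }
    print(core_families(presence, n_genomes=len(genomes)))
```

Under this rule, a c_AMP carried by only a minority of a species' genomes is accessory rather than core.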
One example of this strain-specific behavior is AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes. M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene.
Of the 76 M. pneumoniae
genomes present in our study, 29 were classified as type-1, 29 were
classified as type-2, and the remaining 18 were undetermined in this
classification system
(see STAR Methods section “Determination of accessory AMPs”).
Twenty-six of the 29 type-2 genomes contain AMP10.018_194, as did 2
undetermined type genomes, but none of the type-1 genomes contain this
AMP.
More transmissible species have lower c_AMP density
We investigated the taxonomic composition of AMPSphere by annotating contigs with the Genome Taxonomy Database (GTDB) taxonomy
(see STAR Methods section “c_AMP density in microbial species”),
which resulted in 570,187 c_AMPs being annotated to a genus or species.
The genera contributing the most c_AMPs to AMPSphere were Prevotella (18,593 c_AMPs), Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs; see Figure 5).
This distribution reflects the fact that these genera are among those
that contribute the most assembled sequences in our dataset (all
occupying percentiles above 99.75% among the assembled genera).
Therefore, we calculated the c_AMP density (ρAMP)
by determining the number of c_AMP genes per megabase pair of
assembled sequence. To avoid bias due to the unequal sampling of
habitats, we included all the sequences predicted by Macrel
in each sample, including singleton sequences that were subsequently removed and are not part of AMPSphere.
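The density defined above is a simple normalization: c_AMP genes divided by assembled sequence length in megabase pairs. A minimal sketch follows, with a hypothetical function name and made-up input values.

```python
def amp_density(n_camp_genes: int, assembled_bp: int) -> float:
    """Return c_AMP genes per megabase pair (Mbp) of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_camp_genes / (assembled_bp / 1_000_000)


if __name__ == "__main__":
    # Hypothetical taxon: 50 c_AMP genes in 25 Mbp of assembled sequence.
    print(amp_density(50, 25_000_000))
```

Normalizing by assembled length is what allows genera with very different sequencing coverage to be compared.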
To further explore the importance of AMP production in ecological processes, we investigated the role of AMPs in the mother-to-child transmissibility of bacterial species reported in a recently published paper, comparing the ρAMP of each bacterial species to the published measures of microbial transmission. Human gut bacteria showed increased transmissibility at lower AMP densities (R_Spearman = −0.42, p_Holm-Sidak = 3.4 × 10⁻², N_Species = 43). Similarly, in human oral microbiome bacterial species, transmissibility from mother to offspring is consistently inversely correlated with their ρAMP for the first year (R_Spearman = −0.55, p_Holm-Sidak = 1.4 × 10⁻³, N_Species = 41). This suggests that human gut bacteria and oral microbiome bacterial species show increased transmissibility at lower ρAMP. Moreover, it highlights the potential influence of ρAMP
on the transmissibility of gut and oral microbiota, suggesting a link
between AMPs and the transmission success rates of microbial species.
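The correlations reported above are Spearman rank correlations. For readers unfamiliar with the statistic, the sketch below computes it from scratch as the Pearson correlation of average ranks; the data are invented for illustration and do not reproduce the reported values.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical densities and transmissibility scores for five species.
    density = [0.2, 0.5, 0.9, 1.4, 2.0]
    transmissibility = [0.8, 0.7, 0.75, 0.4, 0.1]
    print(round(spearman(density, transmissibility), 2))
```

A negative coefficient, as in the text, means that species ranked higher in density tend to rank lower in transmissibility.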
Physicochemical features and secondary structure of AMPs
To
investigate the properties and structure of the synthesized peptides,
we first compared their amino acid composition to AMPs from available
databases of experimentally verified sequences (DRAMP).
Notably, AMPSphere sequences displayed a slightly higher abundance of
aliphatic amino acid residues, specifically alanine and valine. However,
these AMPSphere sequences consistently differed (Figure 6A) from EPs.
The resemblances in amino acid composition between the identified
c_AMPs and known AMPs suggested similar physicochemical characteristics
and secondary structures, both of which are recognized for their
influence on antimicrobial activity.
The c_AMPs exhibited comparable hydrophobicity, net charge, and amphiphilicity to AMPs sourced from databases (Figure S1). Furthermore, they displayed a slight propensity for disordered conformations (Figure 6B) and had a lower net positive charge compared to other EPs (Figure 6A).
To
evaluate the structural and antimicrobial properties of c_AMPs from
AMPSphere, we first filtered the AMPSphere for peptides that were
predicted as suitable for in vitro assays due to their
solubility in aqueous solution and ease of chemical synthesis. We chose a
set of high-quality AMPs with 50 peptide sequences based on their
prevalence and taxonomic diversity (see STAR Methods section “Peptide selection for synthesis and testing”).
Additionally, to provide an unbiased evaluation of the peptides we
report here, we first excluded any peptides with a homolog in one of the
published databases and then randomly selected 50 additional peptides
from the AMPSphere, including 25 peptides with AMP probabilities of at
least 0.6 (as reported by Macrel) and 25 peptides with lower probabilities (0.5–0.6).
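The unbiased selection described above amounts to a stratified random draw: drop peptides with database homologs, split the rest by predicted probability, and sample 25 from each bin. The sketch below is a hypothetical illustration; the tuple layout, function name, and seed are assumptions, and it does not reproduce the authors' actual selection.

```python
import random
from typing import Iterable, List, Tuple

Candidate = Tuple[str, float, bool]  # (peptide id, AMP probability, has homolog)


def select_for_testing(candidates: Iterable[Candidate], per_bin: int = 25,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick `per_bin` novel peptides from each probability bin."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    novel = [(pid, prob) for pid, prob, has_homolog in candidates
             if not has_homolog]
    high = [pid for pid, prob in novel if prob >= 0.6]
    low = [pid for pid, prob in novel if 0.5 <= prob < 0.6]
    # random.sample raises ValueError if a bin has fewer than per_bin items.
    return rng.sample(high, per_bin), rng.sample(low, per_bin)


if __name__ == "__main__":
    pool = (
        [(f"high{i}", 0.8, False) for i in range(30)]
        + [(f"low{i}", 0.55, False) for i in range(30)]
        + [("known", 0.9, True)]  # excluded: has a database homolog
    )
    picked_high, picked_low = select_for_testing(pool)
    print(len(picked_high), len(picked_low))
```

Sampling the lower-probability bin as well gives a view of how predictions near the decision threshold perform experimentally.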
Subsequently, we conducted experimental assessments of the secondary structure of the active c_AMPs using circular dichroism (Figures 6B and S4).
Similar to AMPs documented in databases, peptides derived from
AMPSphere exhibited different propensities for adopting α-helical
structures; also, some of them were unstructured or adopted
β-antiparallel conformations in all media analyzed. Notably, they also
displayed an unusually high content of β-antiparallel structures in both
water and methanol/water mixtures (Figure 6B)
despite their amino acid composition similarities to AMPs and EPs. We
attribute these findings to the slightly elevated occurrence of alanine
and valine residues, which are known to favor β-like structures with a
preference for β-antiparallel conformation.
Validation of c_AMPs as potent antimicrobials through in vitro assays
Next, we tested the 100 synthesized peptides against 11 clinically relevant pathogenic strains encompassing Acinetobacter baumannii, Escherichia coli (including one colistin-resistant strain), Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus (including one methicillin-resistant strain), vancomycin-resistant Enterococcus faecalis, and vancomycin-resistant Enterococcus faecium.
Our initial screening revealed that 63 AMPs (out of 100 synthesized)
completely eradicated the growth of at least one of the pathogens tested
(Figure 6C). Remarkably, in some cases, the AMPs were active at concentrations as low as 1 μmol L⁻¹,
close to the peptide antibiotic polymyxin B and the antibiotic
levofloxacin that were used as positive controls in all experiments (Figure S4A). The Gram-negative bacteria A. baumannii and E. coli, as well as the Gram-positive vancomycin-resistant strains E. faecalis and E. faecium,
displayed higher susceptibility to the AMPs, with 39, 24, 21, and 26
peptide hits, respectively. However, none of the tested AMPs affected
methicillin-resistant S. aureus (MRSA) (Figure 6C).
We also synthesized and tested the scrambled versions of five of the
most active peptides from the high-quality group for antimicrobial
activity (i.e., actinomycin-1, enterococcin-1, lachnospirin-1,
proteobacticin-1, and synechocucin-1). All scrambled versions were
inactive except for lachnospirin-1_scrambled, which presented modest
activity against A. baumannii at 32 μmol L⁻¹ (16 times higher concentration compared to its parent peptide lachnospirin-1; Figure S5A).
These results underscore the importance of the specific sequence of
these peptides to exert their antimicrobial activity. To further explore
the influence of sequence on structure, we assessed the secondary
structure tendency of the scrambled peptides using circular dichroism.
We noticed a decrease in helical fraction for sequences with higher
helical content (enterococcin-1, lachnospirin-1, and synechocucin-1),
while the predominately random coiled sequences actinomycin-1 and
proteobactin-1, as well as their scrambled counterparts, showed similar
secondary structural sequences in all media analyzed (Figures S5B–S5E).
These results suggest a lack of correlation between secondary structure
and antimicrobial activity of the AMPs derived from AMPSphere.
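A scrambled control keeps the amino acid composition of a peptide but randomizes the residue order, so any loss of activity can be attributed to the sequence itself. The sketch below shows one way to generate such a control; the example sequence is hypothetical, and the source does not describe how the scrambled versions were actually produced.

```python
import random


def scramble(peptide: str, seed: int = 0) -> str:
    """Return a random permutation of the peptide's residues.

    Composition (and therefore length and net charge) is preserved.
    A shuffle can, by chance, return the original order, so callers
    may want to retry with another seed in that case.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)


if __name__ == "__main__":
    parent = "KKLLKWLKKAG"  # hypothetical peptide, not from AMPSphere
    control = scramble(parent)
    print(parent, control)
```

Because composition is unchanged, bulk properties such as charge stay the same while sequence-dependent features such as amphipathic patterning are disrupted.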
The growth of human gut commensals is impaired by c_AMPs
We screened the AMPs against eight of the most relevant members of the human gut microbiota associated with human health.
Our study found that 58 of the synthesized AMPs (58%) demonstrated
inhibitory effects on at least one commensal strain at low
concentrations (8–16 μmol L⁻¹). Although this concentration range was higher than that observed for the most active peptides against pathogens (1–4 μmol L⁻¹), it still falls within the highly active range of AMPs based on previous studies
(Figure 6C). Interestingly, all the analyzed gut microbiome strains were susceptible to at least four c_AMPs, with strains of A. muciniphila, B. uniformis, P. vulgatus, C. aerofaciens, C. scindens, and P. distasonis
exhibiting the highest susceptibility. In total, 79 AMPs (out of 100
synthesized peptides) demonstrated antimicrobial activity against
pathogens and/or commensals. We also screened scrambled sequences of
five of the highly active peptides from the high-quality group against
gut commensals. Similarly to the results obtained against pathogenic
strains (Figure S5), only lachnospirin-1_scrambled was modestly active against C. scindens at 64 μmol L−1 (Figure S5A).
Permeabilization and depolarization of the bacterial membrane by c_AMPs from AMPSphere
To
gain insights into the mechanism of action responsible for the
antimicrobial activity observed in the peptides derived from AMPSphere (Figure 6C),
we conducted experiments to assess their ability to permeabilize and
depolarize the outer and cytoplasmic membranes of bacteria at their
minimum inhibitory concentrations (MICs). Specifically, we investigated
the effects of all 39 peptides that showed activity against A. baumannii (Figures 6D and 6E) and 6 peptides with antimicrobial activity on P. aeruginosa (Figures S6A
and S6B). For comparison and as a control, we used polymyxin B, a
peptide antibiotic known for its membrane permeabilization and
depolarization properties.
To
investigate the potential permeabilization of the outer membranes of
Gram-negative bacteria by the selected AMPs, we conducted
1-(N-phenylamino)naphthalene (NPN) uptake assays. NPN is a lipophilic
fluorophore that exhibits increased fluorescence in the presence of
lipids found within bacterial outer membranes. The uptake of NPN
indicates membrane permeabilization and damage. Among the 39 peptides
evaluated for activity against A. baumannii, 10 peptides caused
significant permeabilization of the outer membrane, resulting in
fluorescence levels at least 50% higher than that of polymyxin B (Figure 6D) after 45 min of exposure. In the case of P. aeruginosa cells, four out of the six tested peptides showed higher permeabilization than polymyxin B (Figure S6A).
To
evaluate the potential membrane depolarization effect of the selected
AMPs from AMPSphere, we utilized the fluorescent dye
3,3′-dipropylthiadicarbocyanine iodide (DiSC3-[5]). Among the peptides tested against A. baumannii,
bogicin-1 (AMP10.364_543), ampspherin-2 (AMP10.615_023), and
marinobacticin-1 (AMP10.321_460) exhibited greater cytoplasmic membrane
depolarization than polymyxin B, and all peptides tested against P. aeruginosa exhibited greater cytoplasmic membrane depolarization than polymyxin B (Figure S6B).
Interestingly, all the tested AMPSphere peptides displayed a
characteristic crescent-shaped depolarization pattern compared to
polymyxin B, with lower levels of depolarization observed during the
first 20 min of exposure followed by an increase in depolarization over
time (Figures 6E and S6B).
Taken together, these results indicate that the kinetics of cytoplasmic
membrane depolarization are slower compared to the kinetics of outer
membrane permeabilization, which occurs rapidly upon interaction with
the bacterial cells.
Our findings
indicate that the tested AMPs from AMPSphere primarily exert their
effects by permeabilizing the outer membrane rather than depolarizing
the cytoplasmic membrane, revealing a similar mechanism of action to
that observed for classical AMPs and EPs from the human proteome.
AMPs exhibit anti-infective efficacy in a mouse model
Next, we tested the anti-infective efficacy of AMPSphere-derived peptides in a skin abscess murine infection model (Figure 7A). Mice were subjected to infection with A. baumannii,
a dangerous Gram-negative pathogen known for causing severe infections
in various body sites including the bloodstream, lungs, urinary tract,
and wounds.
Ten lead AMPs from different sources displayed potent in vitro activity against A. baumannii: synechocucin-1 (AMP10.000_211, 8 μmol L−1) from Synechococcus sp. (coral-associated, marine microbiome); proteobacticin-1 (AMP10.048_551, 16 μmol L−1) from Pseudomonadota (plant and soil microbiome); actynomycin-1 (AMP10.199_072, 64 μmol L−1) from Actinomyces (human mouth and saliva microbiome); lachnospirin-1 (AMP10.015_742, 2 μmol L−1) from Lachnospira sp. (human gut microbiome); enterococcin-1 (AMP10.051_911, 1 μmol L−1) from Enterococcus faecalis (human gut microbiome); alphaprotecin-1 (AMP10.316_798, 1 μmol L−1) from Alphaproteobacteria (aquatic microbiome); oscillospirin-1 (AMP10.771_988, 8 μmol L−1) from Oscillospiraceae (pig gut microbiome); ampspherin-4 (AMP10.466_287, 8 μmol L−1) from an unknown source; methylocellin-1 (AMP10.446_571, 2 μmol L−1) from Methylocella sp. (soil microbiome); and reyranin-1 (AMP10.337_875, 16 μmol L−1) from Reyranella (plant and soil microbiome). The skin abscess infection was established by applying 20 μL of A. baumannii cells at 10⁶ colony-forming units (CFUs) mL−1 onto the wounded area of the dorsal epidermis (Figure 7A). A single dose of each peptide at its respective MIC value obtained in vitro (Figures 6C and S4A)
was administered to the infected area. Two days post-infection,
synechocucin-1, actynomycin-1, and oscillospirin-1 presented
bacteriostatic activity, inhibiting the proliferation of A. baumannii
cells, whereas lachnospirin-1, enterococcin-1, ampspherin-4, and
reyranin-1 presented bactericidal activity close to that of the
antibiotic polymyxin B (at 5 μmol L−1), reducing the CFU counts by 3–4 orders of magnitude (Figure 7B).
Four days post-infection, synechocucin-1, lachnospirin-1,
enterococcin-1, and ampspherin-4 presented a bacteriostatic effect close
to that of the antibiotic polymyxin B, reducing the CFU counts by 2–3
orders of magnitude compared to the untreated control (Figure S6C).
These results highlight the anti-infective potential of the tested
peptides from AMPSphere as they were administered at a single time
immediately after the establishment of the abscess. Mouse weight was
monitored as a proxy for toxicity, and no significant changes were
observed (Figures 7C and S6D), suggesting that the peptides tested were not toxic.
Discussion
Here,
we used ML to identify nearly a million candidate AMPs in the global
microbiome. Building on previous studies that focused specifically on
the human gut microbiome,
we cataloged AMPs from the global microbiome across 63,410 publicly
available metagenomes as well as 87,920 high-quality microbial genomes
from the ProGenomes2 database.
This led to the creation of AMPSphere (https://ampsphere.big-data-biology.org/),
an open-access and publicly available resource encompassing 863,498
non-redundant peptides and 6,499 high-quality AMP families from 72
different habitats, including marine and soil environments and the human
gut. Most of the c_AMPs (91.5%) were previously unknown and lacked
detectable homologs in other databases, and about one in five had
evidence of translation and/or transcription, as they could be detected
in independent publicly available sets of metatranscriptomes or
metaproteomes.
We designed a set of
tests to capture higher-quality predictions, but many peptides failed
these tests despite evidence that they were active, including our own in vitro
data and the existence of validated homologs in external databases.
Low-prevalence peptides will be less likely to pass the tests (RNAcode
requires multiple variants), which is independent of their activity and influenced by sampling biases.
Focusing on candidate AMPs that are directly encoded in the genome enabled in vitro and in vivo
testing using chemical synthesis without post-translational
modifications, but there are other processes that generate active
peptides, such as encrypted peptides (EPs),
which we used as a comparison point. Notably, the amino acid
composition and physicochemical characteristics of the validated AMPs
from AMPSphere differed from those recently identified in EPs.
Two evolutionary mechanisms by which AMPs may be generated were
explored. First, mutations in genes encoding longer proteins could
generate gene fragments via truncation. Among the enriched ortholog
groups of proteins from GMGCv1
for small proteins from the human gut microbiome. The second mechanism
is that a small protein gene could undergo a duplication followed by
mutation, which we observed in the case of ribosomal proteins. Ribosomal
proteins can harbor antimicrobial activity,
Nonetheless,
the majority of identified AMPs did not have a detectable homolog in
other databases. The lack of observed homology may be due to limitations
in our ability to robustly detect these homology relationships in small
sequences, but there is also the possibility that small proteins, such
as AMPs, may be more likely to be generated de novo compared to longer proteins and may have repeatedly evolved in various taxa.
This may also be an explanation for the large fraction of c_AMPs in the AMPSphere that do not cluster with any other sequences.
We
observed that c_AMPs from AMPSphere were habitat-specific and mostly
accessory members of microbial pangenomes. Furthermore, four out of the
five genera with the most c_AMPs present in AMPSphere share a
host-associated lifestyle, and three of these (Prevotella, Faecalibacterium, and CAG-110) are common in animal hosts. Valles-Colomer et al.,
who recently analyzed a large collection of human-associated
metagenomes, provide a species-specific index of transmissibility for
the several transmission scenarios they study (e.g., mother to infant).
Hypothesizing that AMP production may be related to transmission, we
correlated the species-specific ρAMP
calculated in AMPSphere with transmission scores. In both the human gut and oral microbiomes, species with higher ρAMP
are less transmissible, possibly because AMPs confer protection against
strain replacement. Taken together, these results validate the
applicability of AMPSphere in the study of microbial ecology, as they
suggest a role for AMPs in determining the transmissibility and
colonization ability of microbes, which warrants further investigation
and validation in future work.
Finally, we experimentally validated predictions made by our ML model
and found that 79 (out of 100) synthesized AMPs displayed antimicrobial
activity against either pathogens or commensals. Notably,
four peptides (cagicin-1, cagicin-4, and enterococcin-1 against A. baumannii and cagicin-1 and lachnospirin-1 against vancomycin-resistant E. faecium) presented MIC values as low as 1 μmol L−1, comparable to the MICs of some of the most potent peptides previously described in the literature.
We
show that the tested AMPs from AMPSphere tended to target clinically
relevant Gram-negative pathogens and showed activity against
vancomycin-resistant E. faecium. Although conventional AMPs do not target bacteria from the human gut microbiome,
tested AMPs from AMPSphere showed efficacy against commensal bacteria,
suggesting potential ecological implications of peptides as protective
agents for their producing organisms and their ability to reconfigure
microbiome communities.
When assessing their activity in vivo,
three peptides exhibited anti-infective efficacy in a murine infection
model, with lachnospirin-1 and enterococcin-1 being the most potent,
resulting in a reduction of bacterial load by up to three orders of
magnitude. The active peptides included those derived from both
human-associated and environmental microbiota, validating our approach
of investigating the global microbiome. Overall, our findings unveil a
wide array of AMP sequences without matches in other databases,
highlighting the potential of machine learning in the discovery of
much-needed antimicrobials.
Limitations of the study
We
focused on a particular category of AMPs, namely peptides encoded by
their own genes and composed of up to 100 amino acids, which does not
cover all active peptides. We explored the global microbiome as
represented in public databases, and certain habitats and areas of the
globe have been significantly more explored than others. This uneven
coverage also impacts our quality estimates, as they depend on data
availability. We will, however, continue to update the resource as newer
genomes and metagenomes are made available. We report results based on
finding homologs to our peptides, but matching small sequences to large
databases has a higher rate of errors (particularly missed matches) than
is the case for longer sequences. Our results on the transmissibility
of microbial strains and AMP density were intended to demonstrate the
value of AMPSphere as a resource, but a full validation of this link
will be the focus of future work. Finally, we tested peptides in vitro and in vivo
against a panel of bacteria. Given that we observed species- and even
strain-specific responses, it is possible that peptides for which we did
not observe any activity would have been active against strains not
tested here.
Further
information and requests for resources and reagents should be directed
to and will be fulfilled by the lead contact Luis Pedro Coelho (luispedro@big-data-biology.org).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• Metagenomes and Genomes data are publicly available at the European Nucleotide Archive (ENA) as of the date of publication. Their accession numbers are listed in Table S1. AMPSphere is available as a public online resource (https://ampsphere.big-data-biology.org/), and its files have been deposited in Zenodo and are publicly available as of the date of publication. DOIs are listed in the key resources table.
• All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Experimental model and study participant details
Bacterial strains and growth conditions
The pathogenic strains Acinetobacter baumannii ATCC 19606, Escherichia coli ATCC 11775, Escherichia coli AIC221 [Escherichia coli MG1655 phnE_2FRT (control strain for AIC 222)], Escherichia coli AIC222 [Escherichia coli MG1655 pmrA53 phnE_2FRT (polymyxin-resistant; colistin-resistant strain)], Klebsiella pneumoniae ATCC 13883, Pseudomonas aeruginosa PAO1, Pseudomonas aeruginosa PA14, Staphylococcus aureus ATCC 12600, Staphylococcus aureus ATCC BAA-1556 (methicillin-resistant strain), Enterococcus faecalis ATCC 700802 (vancomycin-resistant strain), and Enterococcus faecium
ATCC 700221 (vancomycin-resistant strain) were grown and plated on
Luria-Bertani (LB) agar plates and incubated overnight at 37°C from
frozen stocks. After incubation, one isolated colony was transferred to
6 mL of medium (LB), and cultures were incubated overnight (16 h) at
37°C. The following day, inocula were prepared by diluting the overnight
cultures 1:100 in 6 mL of the respective media and incubating them at
37°C until bacteria reached logarithmic phase (OD600 = 0.3–0.5).
The gut commensal strains Akkermansia muciniphila ATCC BAA-635, Bacteroides fragilis ATCC 25285, Bacteroides thetaiotaomicron ATCC 29148, Bacteroides uniformis ATCC 8492, Bacteroides vulgatus ATCC 8482 (Phocaeicola vulgatus), Collinsella aerofaciens ATCC 25986, Clostridium scindens ATCC 35704, and Parabacteroides distasonis ATCC 8503 were grown in brain heart infusion (BHI) agar plates enriched with 0.1% (v/v) vitamin K3 (1 mg mL−1), 1% (v/v) hemin (1 mg mL−1, diluted with 10 mL of 1 N sodium hydroxide), and 10% (v/v) L-cysteine (0.05 mg mL−1),
from frozen stocks and incubated overnight at 37°C. Resazurin was used
as an oxygen indicator. After the incubation period, a single isolated
colony was transferred to 3 mL of BHI broth and incubated overnight at
37°C. The next day, inocula were prepared by diluting the bacterial
overnight cultures 1:100 in 3 mL of BHI broth and incubated at 37°C
until cells reached the logarithmic phase (OD600 = 0.3–0.5).
Skin abscess infection mouse model
To assess the anti-infective efficacy of the peptides against A. baumannii ATCC 19606 in a skin abscess infection mouse model, the bacteria were cultured in tryptic soy broth (TSB) medium until an OD600 of 0.5 was reached. Next, the cells were washed twice with sterile PBS (pH 7.4) and suspended to a final concentration of 5 × 10⁶ colony-forming units (CFU) mL−1.
Six-week-old female CD-1 mice, after being anesthetized with
isoflurane, were subjected to a superficial linear skin abrasion on
their backs in an area that they could not touch with their mouth or
limbs. An aliquot of 20 μL containing the bacterial load was then
administered over the abraded area. A single dose of the peptides
diluted in water at their MIC value was administered to the infected
area 2 h after the infection. The animals were euthanized two and
four days post-infection, and the infected area was extracted,
homogenized for 20 min using a bead beater (25 Hz), and 10-fold serially
diluted for CFU quantification on MacConkey agar plates for easy
differentiation of A. baumannii colonies. The experimental groups consisted of 3 CD-1 mice per group (n =
3), all female, and each mouse was infected with an inoculum from a
different colony to ensure variability. The animals were single caged to
avoid cross-contamination. All the mice were used three days after
arrival from the commercial provider. The skin abscess infection mouse
model was approved by the University Laboratory Animal Resources (ULAR)
from the University of Pennsylvania (Protocol 806763).
Method details
Selection of microbial (meta)genomes
Selection of metagenomes and genomes to compose the AMPSphere was similar to that adopted by Coelho et al.
Public metagenomes available on 1 January 2020 produced with Illumina
instruments (except for MiSeq, to ensure the consistency and reliability
of the meta-analysis findings), with at least 2 million reads and, on
average, 75 bp long, were downloaded from the European Nucleotide
Archive (ENA). These samples met two criteria: (1) they were tagged with
taxonomy ID 408169 (for metagenome) or were a descendant of it in the
taxonomic tree; and/or (2) they came from experiments with the library
source listed as “METAGENOMIC”. Samples were grouped by project and all
projects with at least 20 samples were included for analysis.
Additionally, metagenomes deposited by the Integrated Microbial Genomes
System (IMG) missing from ENA were also included. Metadata was manually
curated from each sample’s describing literature and Biosamples
database.
For habitat classification, groups were created based on the similarity
of habitat conditions, such as air, anthropogenic, aquatic,
host-associated, ph:alkaline, sediment, terrestrial, and others. The
sample origins and information related to host species were obtained
using the NCBI taxonomic identification number. High-quality microbial
genomes were selected from ProGenomes2 database.
trimming positions with quality lower than 25 and discarding reads
shorter than 60 bp post-trimming. Metagenomes obtained from a
host-associated microbiome passed through a filtering of reads mapping
to the host genome when available. Reads totaling more than 14.7
trillion base pairs of sequenced DNA were assembled with MEGAHIT 1.2.9
to predict smORFs (33–303 bp) from contigs. The 4,599,187,424 redundant
smORFs, most of which (99.25%) originated in metagenomes, were then
de-duplicated to optimize the computational resource usage, yielding
2,724,621,233 non-redundant smORFs. Macrel
was run on the de-duplicated smORFs to predict c_AMPs. Singleton
sequences (those appearing in a single sample or genome) were
eliminated, except when they had a significant match (amino acid
identity ≥75% and E-value ≤10⁻⁵) to a sequence from the Data Repository of Antimicrobial Peptides (DRAMP).
In total, AMPSphere encompassed 863,498 non-redundant predicted c_AMPs
encoded by 5,518,294 redundant genes. AMP densities were estimated as
the number of AMPs per assembled base pairs in a sample or a species.
AMP genes from genomes had the taxonomy of the original genome assigned to them, whereas AMP
genes from metagenomes were assigned the taxonomy predicted for the
contig where they were found. Insights about potential structural
conformations were obtained using the function
secondary_structure_fraction from the ProtParam module implemented in
the SeqUtils in Biopython.
This function calculates the fraction of amino acids that tend to assume
conformations of helix [VIYFWL], turn [NPGS], and sheet [EMAL].
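As a rough illustration of the calculation just described, the fractions can be computed by counting residues in each set. This is a minimal pure-Python sketch that uses the residue sets exactly as quoted above; it is not Biopython's own code, whose sets may differ between versions, and the example peptide is hypothetical.

```python
# Residue sets as quoted in the text (assumption: they match the Biopython
# version used by the authors; newer releases may define them differently).
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(sequence):
    """Return (helix, turn, sheet) fractions for a peptide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet


# Hypothetical peptide, used only to show the call.
print(secondary_structure_fraction("KLWKKLLGSA"))
```

Note that leucine appears in both the helix and sheet sets, so the three fractions need not sum to one.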
Clustering of AMP families
Clustering
peptides by sequence identity is only possible at high identities as
short low-/medium-identity matches are possible by chance. Therefore,
aiming to recover matches where basic features are preserved even if
individual amino acids are not identical, we reduced the amino acid alphabet to eight groups:
[LVIMC], [AG], [ST], [FYW], [EDNQ], [KR], [P], [H]. c_AMPs were
hierarchically clustered after alphabet reduction using three sequential
identity cutoffs (100%, 85%, and 75%) with CD-Hit.
Representative sequences of peptide clusters were selected according to
their length (taking the longest) with ties being broken by their
alphabetical order.
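The alphabet reduction and the choice of cluster representatives can be sketched as follows. This is an illustrative sketch only: the groups are the eight listed above, but using the first letter of each group as its symbol and mapping unknown residues to "X" are assumptions, and the clustering itself (done with CD-Hit in the study) is not reproduced here.

```python
# The eight residue groups listed in the text.
GROUPS = ["LVIMC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]

# Assumption: each group is represented by its first letter.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(peptide):
    """Rewrite a peptide in the reduced 8-letter alphabet."""
    # Assumption: residues outside the 20 standard amino acids become "X".
    return "".join(REDUCE.get(aa, "X") for aa in peptide.upper())


def pick_representative(cluster):
    """Pick the longest sequence; ties are broken alphabetically."""
    return min(cluster, key=lambda seq: (-len(seq), seq))


print(reduce_alphabet("KRIVMDE"))
print(pick_representative(["KKLL", "KKLLG", "KKLLA"]))
```

After reduction, two peptides that differ only by conservative substitutions (for example K for R) become identical strings, which is what allows identity-based clustering at lower effective similarity.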
To validate this
clustering procedure, we used a sample of 3,000 sequences randomly
sampled from AMPSphere, excluding cluster representatives. These
sequences were aligned against the representative sequence of their
cluster using the Smith-Waterman algorithm
with the BLOSUM 62 cost matrix, and gap open and extension penalties of
−10 and −0.5, respectively. The alignment score was then converted to
an E-value according to the model by Karlin and Altschul.
Alignments were considered significant if their E-value was less than 10⁻⁵.
We found that more than 95.3% of alignments produced in the first two
levels (100% and ≥85% of identity) were significant, along with 77.1% of
those from the third level (≥75% of identity) – see Figure S3.
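For readers unfamiliar with the conversion, the Karlin-Altschul model gives the expected number of chance alignments with score at least S as E = K·m·n·exp(−λS). The sketch below applies that formula; the values of λ and K and the search-space lengths are placeholders chosen for illustration, since the parameters fitted for the scoring scheme used here (BLOSUM62 with gap penalties of −10 and −0.5) are not given in the text.

```python
import math


def karlin_altschul_evalue(score, query_len, target_len, lam, k):
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_len * target_len * math.exp(-lam * score)


def is_significant(evalue, threshold=1e-5):
    """Apply the significance cutoff stated in the text (E-value < 10^-5)."""
    return evalue < threshold


# Placeholder parameters (assumption): lambda and K must be estimated for the
# actual scoring matrix and gap penalties; these numbers are illustrative.
LAMBDA, K = 0.267, 0.041

e = karlin_altschul_evalue(score=120, query_len=40, target_len=40, lam=LAMBDA, k=K)
print(e, is_significant(e))
```

Because the E-value falls exponentially with the score, short peptides need a high fraction of matching positions before an alignment counts as significant.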
Quality control of c_AMPs
The c_AMPs in AMPSphere were submitted to another six AMP prediction systems (AMPScanner v2,
The
genes of c_AMPs were subjected to five different quality tests to
reduce the likelihood that the observed peptides were artifacts or
fragments of larger proteins. Initially, the peptides were searched
against AntiFam v.7.0,
which was designed to identify commonly recurring spuriously predicted
ORFs, with the option “--cut_ga”. Fewer than 0.05% of c_AMPs had any
significant hits.
For each smORF, we
searched for an in-frame stop codon upstream of its start codon. When
no stop codon is found, we cannot rule out the possibility that the
smORF is part of a larger gene which we cannot observe due to fragmented
assembly. Most (68.4%) of the c_AMPs are encoded by at least one gene
that is not terminally placed. However, the fact that a c_AMP is
terminal does not imply that the given c_AMP is an artifact since the
AMP genes are short enough to be recovered even in short contigs. For
example, 72.9% (4,622/6,339) of homologs to DRAMP
program predicts protein-coding regions based on evolutionary
signatures typical for protein genes. This analysis depends on a set of
homologous and non-identical genes. Therefore, AMP clusters containing
at least three gene variants were aligned. Given that an extensive
portion of the AMPSphere candidates (53%; 459,910 out of 863,498) is not
part of such a cluster, they could not be tested. Of the tested c_AMPs,
53% (215,421 out of 403,588) were considered genes with evolutionary
traits of protein-coding sequences.
We
then checked for evidence of transcription and/or translation using 221
publicly available metatranscriptomes, comprising human gut (142), peat
(48), plant (13), and symbionts (17); and 109 publicly available
metaproteomes from PRIDE
we selected genes with at least one read mapped across a minimum of two
samples to increase our confidence. This approach is similar to that
adopted when predicting AMPs.
k-mers of all AMPSphere peptides (with length equal to at least half
the length of the sequence) were compared to peptide sequences in
metaproteomics data. A perfect match between a k-mer and a metaproteomic
peptide was considered additional evidence that this c_AMP is likely to
be translated, as described by Ma et al.
Briefly, the number of c_AMP peptides mapped against the set of
metaproteomic samples was counted, and those c_AMP peptides with at
least one match covering more than 50% of the peptide were marked as
detected. c_AMPs with experimental evidence in metatranscriptomes and/or
metaproteomes accounted for circa 20% of the AMPSphere.
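The metaproteome matching rule can be illustrated with a short sketch. It assumes that a c_AMP counts as detected when an observed metaproteomic peptide matches it exactly over at least half of its length; the text gives both "at least half" and "more than 50%", so the exact boundary is an assumption, and the sequences below are hypothetical.

```python
def is_detected(camp, observed_peptides):
    """Return True if any observed peptide is an exact substring of the c_AMP
    covering at least half of its length."""
    min_k = (len(camp) + 1) // 2  # smallest k-mer length: ceil(len / 2)
    return any(len(pep) >= min_k and pep in camp for pep in observed_peptides)


def count_detected(camps, observed_peptides):
    """Count c_AMPs with at least one qualifying match."""
    return sum(is_detected(camp, observed_peptides) for camp in camps)


# Hypothetical sequences for illustration only.
camps = ["KKLLGSAW", "MRRIVKLA"]
observed = {"LLGSAW", "RIV"}
print(count_detected(camps, observed))
```

Checking whether an observed peptide is a long enough substring is equivalent to enumerating all k-mers of the c_AMP with k at least half its length and looking for a perfect match.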
The
mapping of c_AMPs was performed without considering genomic context,
which may have led to an overestimation of candidates being identified
as potentially transcribed. For example, if they are homologous to
longer proteins the presence of the longer gene may lead to a false
positive detection of the shorter c_AMP. We investigated this using
Fisher’s Exact Test to compare the percentage of AMP homologs to the
GMGCv1
database with experimental evidence of translation (3.4%; 2,073 out of 61,020 peptides; odds ratio = 4.3; Fisher’s exact test P < 10⁻³⁰⁰) and/or transcription (22.8%; 13,901 out of 61,020 peptides; odds ratio = 1.2; Fisher’s exact test P = 6.7 × 10⁻¹⁰⁸).
The results suggest that our approach tends to slightly overestimate
the potential transcription and translation of candidates with
canonical-length homologs.
Given
that only a small number of transcriptomic or proteomic datasets were
available and given the aforementioned limitations in interpreting the
mappings, we considered AMPs passing all quality-control tests to be
high-quality, regardless of evidence of translation or transcription. We
further separated those with experimental evidence of
translation/transcription (17,115 c_AMPs, circa 2% of AMPSphere) and those without it (63,098 c_AMPs, circa
7%). For c_AMP families, we considered as high-quality those in which ≥75% of
their c_AMPs pass all quality control tests or those with at least one
c_AMP possessing experimental evidence of translation/transcription.
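The family-level rule can be written down compactly. The sketch below assumes each family member is summarized by two booleans (whether it passed all quality-control tests, and whether it has transcription or translation evidence); this data layout is hypothetical and chosen only for illustration.

```python
def family_is_high_quality(members):
    """members: list of (passes_all_tests, has_expression_evidence) pairs."""
    if not members:
        return False
    n_pass = sum(1 for passes, _ in members if passes)
    # >= 75% of members pass all tests (integer arithmetic avoids rounding).
    enough_pass = 4 * n_pass >= 3 * len(members)
    any_evidence = any(evidence for _, evidence in members)
    return enough_pass or any_evidence


# Hypothetical family: three of four members pass every test.
print(family_is_high_quality([(True, False)] * 3 + [(False, False)]))
```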
Sample-based c_AMPs accumulation curves
To
determine the saturation of c_AMP discovery, for each habitat or group
of habitats, we computed sample-based accumulation curves by randomly
sampling metagenomes in steps of 10 metagenomes. This procedure was
repeated 32 times, and the average was taken.
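The accumulation procedure can be sketched as follows, assuming each metagenome is represented by the set of c_AMP identifiers found in it (a hypothetical data layout). The step size of 10 and the 32 repetitions follow the text; the random seed is an arbitrary choice for reproducibility.

```python
import random


def accumulation_curve(samples, step=10, repeats=32, seed=0):
    """Average number of distinct c_AMPs seen after 10, 20, ... samples.

    samples: list of sets of c_AMP identifiers, one set per metagenome.
    Returns a list of (n_samples, mean_distinct_camps) tuples.
    """
    rng = random.Random(seed)
    sizes = list(range(step, len(samples) + 1, step))
    totals = [0] * len(sizes)
    for _ in range(repeats):
        order = samples[:]
        rng.shuffle(order)  # random order of metagenomes for this repeat
        seen = set()
        idx = 0
        for i, camps in enumerate(order, start=1):
            seen |= camps
            if idx < len(sizes) and i == sizes[idx]:
                totals[idx] += len(seen)
                idx += 1
    return [(n, total / repeats) for n, total in zip(sizes, totals)]


# Toy input: 30 metagenomes, each contributing one unique c_AMP.
toy = [{i} for i in range(30)]
print(accumulation_curve(toy))
```

A curve that keeps rising at the largest sample sizes, as in this toy input, indicates that discovery has not saturated for that habitat.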
Multi-habitat and rare c_AMPs
We
first counted c_AMPs present in ≥2 habitats (“multi-habitat AMPs”). To
test the significance of this value, we opted for an approach similar
to that described in Coelho et al.:
habitat labels for each sample were shuffled 100 times and the number
of resulting multi-habitat c_AMPs was counted. Shuffling labels resulted
in 676,489.7 ± 4,281.8 multi-habitat c_AMPs by chance for high-level
habitat groups, and in 685,477.17 ± 4,369.6 multi-habitat c_AMPs by
chance when looking at the habitats individually inside the high-level
groups. The Shapiro-Wilk test was used to check that the resulting data
distribution is normal (p = 0.49 for specific habitats; p =
0.1 for high-level habitats). In the original (non-shuffled) data,
high-level habitat groups presented 93,280 multi-habitat c_AMPs (136.21
standard deviations below the shuffled value), while specific habitats
presented 173,955 multi-habitat c_AMPs (117.1 standard deviations below
the shuffled value).
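The label-shuffling test can be illustrated with a small sketch. It assumes two hypothetical dictionaries, one mapping samples to habitats and one mapping samples to the c_AMPs they contain; the helper names and toy data are not from the study. The final lines recompute the reported distance of 136.21 standard deviations from the numbers given above.

```python
import random


def count_multi_habitat(sample_habitats, sample_camps):
    """Count c_AMPs observed in at least two distinct habitats."""
    habitats_per_camp = {}
    for sample, camps in sample_camps.items():
        for camp in camps:
            habitats_per_camp.setdefault(camp, set()).add(sample_habitats[sample])
    return sum(len(h) >= 2 for h in habitats_per_camp.values())


def shuffled_counts(sample_habitats, sample_camps, n_shuffles=100, seed=0):
    """Null distribution obtained by shuffling habitat labels across samples."""
    rng = random.Random(seed)
    samples = list(sample_habitats)
    labels = [sample_habitats[s] for s in samples]
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        null.append(count_multi_habitat(dict(zip(samples, labels)), sample_camps))
    return null


def z_score(observed, null_mean, null_sd):
    """Number of standard deviations between observed and shuffled values."""
    return (observed - null_mean) / null_sd


# Toy data (hypothetical).
habitats = {"s1": "soil", "s2": "gut", "s3": "gut"}
camps = {"s1": {"a", "b"}, "s2": {"a"}, "s3": {"b", "c"}}
print(count_multi_habitat(habitats, camps))

# Values reported in the text for high-level habitat groups.
print(round(z_score(93_280, 676_489.7, 4_281.8), 2))
```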
To determine the rarity of c_AMPs, we adapted the protocol previously established by Coelho et al.
We considered only uniquely mapped reads. From the mapping, we computed
the c_AMPs detected per sample and the number of detections per c_AMP,
considering “rare” c_AMPs as those detected less often than the average of the
entire AMPSphere (682 detections or 1% of all samples, as previously
described for species).
This approach was adopted to overcome the high computational costs of a
competitive mapping procedure. We expect that our approach
overestimates how prevalent c_AMPs are, and because of that, it is a
robust way to estimate the rarity of c_AMPs.
As
the high-quality designation requires at least 3 gene variants for the
RNAcode test to be performed, the rarest genes will not be high-quality.
However, for robustness, we quantified this effect by computing the
mean and median number of detections in only the high-quality c_AMPs and
only non-terminal c_AMPs (a test which does not require a minimum
number of genes). The mean number of detections is 682 for the full
collection, 789 for high-quality c_AMPs, and 679 for non-terminal ones.
Testing c_AMPs overlap across habitats
As
was done when testing the significance of the number of multi-habitat
c_AMPs observed, the number of overlapping c_AMPs was computed for each
pair of habitats. We shuffled the sample labels 1,000 times, counting
the number of randomly overlapping c_AMPs for each pair of habitats.
Then, we estimated the probability of observing the overlap using
Chebyshev’s inequality, which does not rely on any assumption regarding
the distribution of the data, because we observed, using the Shapiro-Wilk
test, that the shuffled counts do not follow a normal distribution.
Chebyshev’s inequality is $p \le 1/Z^2$, where Z stands for the Z score computed from the average and standard deviations estimated by the shuffling procedure. The p-values were adjusted using Holm-Sidak implemented in multipletests from the statsmodels package.
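A minimal sketch of the two steps, the Chebyshev bound and the Holm-Sidak step-down adjustment, is given below. It is a pure-Python re-implementation written for illustration, not the statsmodels call used in the study, and the Z scores in the example are hypothetical.

```python
def chebyshev_p(z):
    """Distribution-free upper bound on the tail probability: p <= 1 / Z^2."""
    if z == 0:
        return 1.0
    return min(1.0, 1.0 / (z * z))


def holm_sidak(pvalues):
    """Holm-Sidak step-down adjustment, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Sidak correction for the (m - rank) hypotheses still in play.
        raw = 1.0 - (1.0 - pvalues[i]) ** (m - rank)
        running_max = max(running_max, raw)  # enforce monotonicity
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Hypothetical Z scores for three habitat pairs.
bounds = [chebyshev_p(z) for z in (-10.0, 5.0, 1.5)]
print(bounds)
print(holm_sidak(bounds))
```

The Chebyshev bound is conservative: a Z score of 10 only guarantees p ≤ 0.01, whereas a normal approximation would give a far smaller value.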
The c_AMP density was defined as $\rho_{AMP} = n_{cAMPs}/L$, where $n_{cAMPs}$
is the number of c_AMP redundant genes and L is the assembled base
pairs. We assume, as an approximation, that in a large segment
assembled, the start positions of AMP genes are independent and
uniformly random. Then, we calculated the standard sample proportion
error with the formula: $\mathrm{STD}_{err} = \sqrt{\rho_{AMP}(1-\rho_{AMP})/L}$. The standard sample proportion error was used to calculate the margin of error at a 95% confidence interval ($Z = 1.96$, $\alpha = 0.05$).
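The density and its margin of error can be computed in a few lines. The sketch assumes the density is the number of redundant c_AMP genes divided by the assembled base pairs, consistent with the definitions in this section; the gene count and assembly size in the example are hypothetical.

```python
import math


def amp_density(n_camp_genes, assembled_bp):
    """c_AMP genes per assembled base pair."""
    return n_camp_genes / assembled_bp


def margin_of_error(rho, assembled_bp, z=1.96):
    """Z times the standard error of a sample proportion (95% CI by default)."""
    return z * math.sqrt(rho * (1.0 - rho) / assembled_bp)


# Hypothetical taxon: 50 c_AMP genes in 100 Mbp of assembled contigs.
rho = amp_density(50, 10**8)
moe = margin_of_error(rho, 10**8)
print(rho, moe, moe / rho)
```

For densities this low the relative margin is close to 1.96 divided by the square root of the gene count, so taxa with only a few dozen c_AMP genes have wide intervals.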
To
gain insights about the contributions of different phyla, species, and
genera to the AMPSphere, we calculated the c_AMP density for these
taxonomy levels using the c_AMPs included within AMPSphere, summing all
assembled base pairs for contigs assigned to each taxonomy level in the
samples used in AMPSphere. The ρAMP values
of genera, phyla, and species with a margin of error greater than 10%
of the calculated value were eliminated, along with outliers according to
Tukey’s fences (k=1.5). We estimated species’ presence and abundance in each sample using mOTUs2.
to demonstrate the effect of AMPs on the transmission of bacterial
species from mother to children. Only those species overlapping
AMPSphere and the datasets from Valles-Colomer et al.
were used for this analysis, and their AMP densities were calculated as
described in the previous section (c_AMP density in microbial species),
using all the predicted c_AMPs from metagenomes and genomes we
obtained, also including those not in AMPSphere, to avoid sampling bias.
The AMP density and the coefficient of transmissibility were correlated
using Spearman’s method implemented in the scipy package
: following children’s microbiome after 1, 3, and up to 18 years, as well as, cohabitation and intra-datasets. The p-values of correlations were corrected using Holm-Sidak implemented in the multipletests function from the statsmodels package.
To
uncover the prevalence of c_AMPs through the microbial pangenomes,
core, shell, and accessory c_AMP clusters were determined using the
subset of c_AMPs obtained from ProGenomes2.
To increase confidence in our measures, only species containing at
least 10 genomes were used in this analysis. c_AMPs and AMP families
present in fewer than 50% of the genomes from a microbial species were
classified as accessory. c_AMPs and families present in 50%–95% of the
genomes in the cluster were classified as shell, and those present in more than 95% of the genomes were classified as core.
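The prevalence classes can be expressed as a small function. The accessory and shell thresholds follow the text; treating anything above 95% as core is an assumption inferred from those two thresholds, and the function name is hypothetical.

```python
def classify_prevalence(n_genomes_with_camp, n_genomes_in_species):
    """Classify a c_AMP or family as accessory, shell, or core in a species."""
    if n_genomes_in_species < 10:
        raise ValueError("species with fewer than 10 genomes are excluded")
    n, total = n_genomes_with_camp, n_genomes_in_species
    # Integer comparisons avoid floating-point issues at the thresholds.
    if 2 * n < total:          # present in fewer than 50% of genomes
        return "accessory"
    if 100 * n <= 95 * total:  # present in 50%-95% of genomes
        return "shell"
    return "core"              # assumption: more than 95% of genomes


print(classify_prevalence(3, 40), classify_prevalence(30, 40), classify_prevalence(40, 40))
```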
To
determine the propensity of AMPs being shared between genomes belonging
to the same strain, we first defined strains within species. For this,
we used FastANI v.1.33.
Genome groups with ANI ≥99.99% were considered clonal complexes and
only a single representative of each clonal complex was kept for further
analyses. Species that had fewer than 10 genomes after this step were
not considered further in this analysis. Next, we inferred strains
(99.5% ≤ ANI <99.99%) as in Rodriguez et al.
We then counted the pairs of genomes from the same species sharing
AMPs, stratified by whether the pair originates from the same strain or
not, and tested the results with Fisher’s Exact Test implemented in the
scipy package.
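The strain definition and the test on genome pairs can be illustrated as follows. This is a self-contained sketch: the two-sided Fisher's exact test is written out with `math.comb` rather than calling scipy, and the function names and the example 2x2 table are hypothetical.

```python
from math import comb

def ani_relationship(ani):
    """Label a genome pair from its ANI (%) using the thresholds in the text."""
    if ani >= 99.99:
        return "clonal"          # collapsed to one representative beforehand
    if ani >= 99.5:
        return "same_strain"
    return "different_strain"

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

# Rows: same strain / different strain; columns: pair shares AMPs / does not.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```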
To
determine the proportions of accessory, shell and core full-length
proteins in the microbial pangenomes, we also extracted the predicted
full-length proteins from the ENA database for each genome and
hierarchically clustered them after alphabet reduction in a similar
fashion to that described in the topic “AMP families”. Full-length
protein clusters with ≥8 sequences for each species were kept. The
prevalence of full-length protein families within a species was computed
as above and the number of core families was compared to the number of
c_AMP core families using the probability, calculated as number of
species with proportion of core full-length protein families less or
equal to that observed for c_AMPs divided by the total of assessed
species.
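The probability described above is a simple proportion over species, which the following sketch makes explicit. The function name and the example fractions are hypothetical.

```python
def core_comparison_probability(species_fractions):
    """species_fractions: list of (core_protein_fraction, core_camp_fraction).

    Returns the share of species whose proportion of core full-length
    protein families is less than or equal to that observed for c_AMPs.
    """
    hits = sum(1 for protein, camp in species_fractions if protein <= camp)
    return hits / len(species_fractions)

# Four hypothetical species: (full-length core fraction, c_AMP core fraction).
example = [(0.60, 0.10), (0.05, 0.10), (0.40, 0.20), (0.30, 0.30)]
probability = core_comparison_probability(example)
```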
To determine the genotype of Mycoplasma pneumoniae genomes in ProGenomes2,
with the restricted time-reversible substitution model and a
bootstrapping procedure with 1,000 pseudo-replicates to determine node
support. The tree was used to segregate and classify genomes taking the
strain type of reference genomes from Diaz et al.
To
detect homologs to previously published proteins, we aligned AMPSphere
candidates against several databases: (i) the small protein sets in
SmProt 2,
version 3.0. To strictly avoid any artifacts of assembly for the
analysis, only c_AMPs which passed the terminal placement test (i.e.,
for which there was strong evidence that the ORF is indeed complete)
were searched against the GMGCv1.
for each ortholog group was computed along with the counts for ortholog
groups in the top hits to AMPSphere. The enrichment was given as the
proportion of hits present in a given ortholog group divided by the
proportion of that ortholog group among the redundant sequences in
GMGCv1,
and results were considered significant if p < 0.05 after correction with the Holm-Sidak method implemented in multipletests from the statsmodels package.
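The enrichment ratio described here is a ratio of two proportions. A minimal sketch, with a hypothetical function name and made-up counts, could look like this; it covers only the fold enrichment, not the significance test.

```python
def fold_enrichment(hits_in_group, total_hits, group_in_background, total_background):
    """Proportion of hits in an ortholog group divided by that group's
    proportion in the background catalog."""
    observed = hits_in_group / total_hits
    expected = group_in_background / total_background
    return observed / expected

# Hypothetical counts: 30 of 1,000 hits fall in a group that makes up
# 1,000 of 100,000 background sequences.
enrichment = fold_enrichment(30, 1_000, 1_000, 100_000)
```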
hits associated with them, using a minimum of 10, 20, or even 100
proteins, the results remained similar to those obtained with all data,
showing that the extent of the ortholog groups in AMPSphere did not
affect the enrichment analysis.
To check for genomic entities generated after gene truncation, we screened
for c_AMP homologs using the default settings for Blastn,
keeping only significant hits with a maximum E-value of 10⁻⁵. As a case study, we selected AMP10.271_016, predicted to be produced by Prevotella jejuni,
which shares the start codon with the gene coding for a
NAD(P)-dependent dehydrogenase (WP_089365220.1). To verify the gene
disposition and putative mutations leading to the AMP creation, we used
Biopython
to codon-align the fragments from metagenomic contigs assembled from
samples SAMN09837386, SAMN09837387, and SAMN09837388, and genomic
fragments of different strains of Prevotella jejuni CD3:33 (CP023864.1:504836–504949), F0106 (CP072366.1:781389–781502), F0697 (CP072364.1:1466323–1466436), and from Prevotella melaninogenica
strains FDAARGOS_760 (CP054010.1:157726–157839), FDAARGOS_306
(CP022041.2:943522–943635), FDAARGOS_1566 (CP085943.1:1102942–1103055),
and ATCC 25845 (CP002123.1:409656–409769) and compared the segments
coding for the AMP and the original full-length protein.
Genomic context conservation analysis
To
gain insights into the gene synteny involving AMP genes, we mapped the
863,498 AMP sequences against a collection of 169,632 reference genomes,
metagenome-assembled genomes (MAGs) and single amplified genomes (SAGs)
curated elsewhere.
Hits with identity >50% (amino acid) and query and target coverage
>90% were considered significant. The target coverage threshold
avoids hits to larger homologs whose function may be unrelated. This
yielded 107,308 AMPs with homologs in at least one genome. We built gene
families from the hits of each AMP detected in the prokaryotic genomes
and calculated a conservation score based on the functional annotation
of the neighboring genes in a window of three genes up and downstream.
The vertical conservation score at each position within the window of
each c_AMP was calculated as the number of genes with a given functional
annotation (ortholog group, Kyoto Encyclopedia of Genes and Genomes
(KEGG) pathway, KEGG orthology, KEGG module)
divided by the number of genes in the family. AMPs with more than two
hits and a vertical conservation score >0.9 with any functional term
were considered to have conserved genomic contexts. Figure 4 shows genomic context conservation of different KEGG pathways.
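A rough sketch of the vertical conservation score follows. Each family member is represented as a dictionary from relative gene position (−3 to +3, excluding the AMP itself) to the set of functional terms annotated on the neighbor at that position. This data layout, the function names, and the example terms are assumptions made for illustration.

```python
from collections import Counter

def vertical_conservation(neighborhoods, window=3):
    """Score each (position, term) as the number of family members carrying
    that term at that position, divided by the family size."""
    family_size = len(neighborhoods)
    scores = {}
    positions = [p for p in range(-window, window + 1) if p != 0]
    for pos in positions:
        counts = Counter(
            term for hood in neighborhoods for term in hood.get(pos, set())
        )
        for term, count in counts.items():
            scores[(pos, term)] = count / family_size
    return scores

def has_conserved_context(neighborhoods, threshold=0.9, min_hits=3):
    """More than two hits and a score above 0.9 for any functional term."""
    if len(neighborhoods) < min_hits:
        return False
    return any(s > threshold for s in vertical_conservation(neighborhoods).values())

# Three hypothetical hits that all have the same KO one gene downstream.
family = [{1: {"K02518"}}, {1: {"K02518"}, -1: {"K00001"}}, {1: {"K02518"}}]
conserved = has_conserved_context(family)
```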
For
testing whether the fraction of AMPs with conserved genomic neighbors
is similar to that of other gene families within the 169,632 genomes
curated by del Río et al.,
(using a minimum amino acid identity of 30%, coverage of the shorter sequence of at least 50%, and maximum E-value of 10⁻³). The c_AMPs were also annotated using EggNOG-mapper v2.
Their KO annotations were compared to that of the immediate neighbors
(+/− 1 positions) to identify neighborhoods with the same function. It
was possible to annotate 56.1% (60,173 out of 107,308) of c_AMPs with
hits to the genomes tested using the EggNOG5 database.
Of these, 18.1% were assigned to translation-related functions (class
J), 14.4% belong to proteins of unknown function (S), 9% were assigned
to replication, recombination, and repair (L).
and Vue Javascript. The database was built with sqlite, and SQLalchemy
was used to map the database to Python objects. Internal and external
APIs were built using FastAPI, with Gunicorn serving them. On the front
end, Vue 3 was used as the backbone and Quasar for the layout. Plotly
was used to generate interactive visualization plots, and Axios to
render content seamlessly. LogoJS (https://logojs.wenglab.org/app/) was used to generate sequence logos for AMP families, while the helical wheel app (https://github.com/clemlab/helicalwheel) was used to generate AMP helical wheels.
Peptide selection for synthesis and testing
We
selected two groups of peptides: (i) 50 peptides that were selected as
being particularly likely to be active and that were otherwise
interesting (as described below), (ii) 50 peptides selected randomly
after applying technical exclusions.
For the first group, only high-quality (see the topic “quality control of c_AMPs”) c_AMPs were considered for synthesis. They were further filtered according to six criteria for solubility
version 3.0 had a slightly lower rate, 44.3% passed half the tests. We
then assessed the peptides for ease of synthesis; however,
only 21.2% of those from AMPSphere passed at least 2 of the 3 criteria
established for chemical synthesis.
Peptides that passed at least six of the above-mentioned criteria were
then filtered by predicting AMP activity with six methods in addition to
Macrel.
Peptides predicted to be AMPs by all methods were filtered by length,
discarding sequences longer than 40 amino acid residues, for which
conventional solid-phase peptide synthesis using Fmoc strategy has lower
yields and many recoupling reactions.
Only one peptide was kept from each family or cluster, namely the one
with the highest number of observed smORFs. After this process, we
obtained 364 candidate AMPs, belonging to 166 families and 198 clusters
with <8 c_AMPs. Of these, 30 candidates were homologous to sequences
from the databases used in annotation (e.g., SmProt 2).
To compose the list of 50 high-likelihood candidates: (i) we selected
34 of the most prevalent peptides; (ii) we randomly selected 14 c_AMPs
(30% of our set) with homologs to the GMGCv1
We also included scrambled sequences made using five of the most active
peptide sequences to verify the potency of randomly generated
sequences.
To build the group of
randomly selected peptides, we first selected c_AMPs that are not
homologous to any other databases tested and that passed the
abovementioned synthesis criteria (total of 768,061 out of 863,498
peptides). We further divided this group into subgroups: (i) those with
Macrel-assigned probability >0.6 (271,555 c_AMPs) and (ii) those in
the range 0.5–0.6 (496,506 c_AMPs; note that all c_AMPs in AMPSphere
have a Macrel-assigned probability ≥0.5). We randomly sampled 25
peptides from each group.
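The random selection can be sketched as stratified sampling over the two Macrel probability ranges. The function name, the input layout (a mapping from peptide ID to Macrel probability), and the fixed seed are assumptions made so the example is reproducible; the source does not state how the random draw was seeded.

```python
import random

def sample_two_strata(candidates, per_group=25, seed=42):
    """Draw per_group peptides with probability > 0.6 and per_group
    peptides with probability in the range 0.5-0.6."""
    rng = random.Random(seed)  # seed is an assumption, for reproducibility
    # Sorting makes the draw independent of dictionary ordering.
    high = sorted(pid for pid, prob in candidates.items() if prob > 0.6)
    low = sorted(pid for pid, prob in candidates.items() if 0.5 <= prob <= 0.6)
    return rng.sample(high, per_group), rng.sample(low, per_group)

# Hypothetical candidate pool with probabilities between 0.50 and 0.99.
pool = {f"pep{i}": 0.5 + (i % 50) / 100 for i in range(500)}
high_group, low_group = sample_two_strata(pool)
```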
MIC values were defined as the concentration of peptide that
killed 100% of cells after 24 h of incubation at 37°C. First, peptides
diluted in water were added to untreated flat-bottom polystyrene
96-well microtiter plates in 2-fold dilutions ranging from 64 to 1 μmol L⁻¹, and then exposed to an inoculum of 2 × 10⁶
cells in LB or BHI broth, for pathogens and gut commensals,
respectively. After the incubation time, the absorbance of each well
representing each of the conditions was analyzed using a
spectrophotometer at 600 nm. The assays were conducted in three
biological replicates to ensure statistical reliability.
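To illustrate the dilution scheme and the MIC readout, the sketch below builds the 2-fold series from 64 to 1 μmol L⁻¹ and reads the MIC from OD600 values. The function names, the example readings, and the OD600 cut-off used to call "no growth" are assumptions; the source does not give a numeric threshold.

```python
def twofold_dilutions(highest=64.0, lowest=1.0):
    """Concentrations from highest down to lowest in 2-fold steps."""
    series, conc = [], highest
    while conc >= lowest:
        series.append(conc)
        conc /= 2
    return series

def read_mic(od600_by_conc, no_growth_od=0.1):
    """Lowest concentration at which no replicate grew, walking down from
    the highest concentration and stopping at the first well with growth.
    Returns None if there is growth even at the highest concentration."""
    mic = None
    for conc in sorted(od600_by_conc, reverse=True):
        if any(od >= no_growth_od for od in od600_by_conc[conc]):
            break
        mic = conc
    return mic

concentrations = twofold_dilutions()
# Hypothetical triplicate readings: clear wells down to 8, growth below.
readings = {c: [0.05, 0.04, 0.05] if c >= 8 else [0.60, 0.55, 0.62]
            for c in concentrations}
mic_value = read_mic(readings)
```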
Circular dichroism assays
Circular
dichroism experiments were conducted using a J1500 circular dichroism
spectropolarimeter (Jasco) at the Biological Chemistry Resource Center
(BCRC) of the University of Pennsylvania. The experiments were carried
out at a temperature of 25°C. Circular dichroism spectra were obtained
by averaging three accumulations using a quartz cuvette with an optical
path length of 1.0 mm. The spectra were recorded in the wavelength range
from 260 to 190 nm at a scanning rate of 50 nm min⁻¹ with a bandwidth of 0.5 nm. The peptides were tested at a concentration of 50 μmol L⁻¹.
Measurements were performed in water, a mixture of water and
trifluoroethanol (TFE) in a ratio of 3:2, and a mixture of water and
methanol in a ratio of 1:1. Baseline measurements were recorded prior to
each measurement. To minimize background effects, a Fourier transform
filter was applied. The helical fraction values were calculated using
the single spectra analysis tool available on the BeStSel server.
Outer membrane permeabilization assays
Membrane permeability was analyzed using the 1-(N-phenylamino)naphthalene (NPN)
uptake assay. NPN fluoresces weakly in an extracellular
environment but strongly when in contact with lipids
from the bacterial outer membrane. Thus, NPN will show increased
fluorescence when the integrity of the outer membrane is compromised. A. baumannii ATCC 19606 and P. aeruginosa PAO1 were cultured until reaching an OD600 of 0.4, followed by centrifugation (10,000 rpm at 4°C for 3 min), washing, and resuspension in buffer (5 mmol L⁻¹ HEPES, 5 mmol L⁻¹ glucose, pH 7.4). Subsequently, 4 μL of NPN solution (working concentration of 0.5 mmol L⁻¹) was added to 100 μL of bacterial solution in a white flat bottom 96-well plate. The fluorescence was monitored at λex = 350 nm and λem =
420 nm. The peptide solutions in water (100 μL solution at their MIC
values) were introduced into each well, and fluorescence was monitored
as a function of time until no further increase in fluorescence was
observed (30 min). The relative fluorescence was calculated using a
non-linear fit. The positive control (antibiotic polymyxin B) was used
as baseline. The following equation was applied to reflect % of
difference between the baseline (polymyxin B) and the sample:
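Assuming the standard definition of a percentage difference relative to a baseline, with polymyxin B as that baseline, the relationship can be written as below. The symbols F_sample and F_polymyxin B (fluorescence of the peptide-treated sample and of the polymyxin B control) are introduced here for illustration.

```latex
\[
\text{difference (\%)} =
\frac{F_{\text{sample}} - F_{\text{polymyxin B}}}{F_{\text{polymyxin B}}} \times 100
\]
```

Under this form, a positive value means the sample fluoresces more strongly than the polymyxin B control, and a negative value means it fluoresces less.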
The
ability of the peptides to depolarize the cytoplasmic membrane was
assessed by measuring the fluorescence of the membrane
potential-sensitive dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)].
This potentiometric fluorophore fluoresces upon release from the
interior of the cytoplasmic membrane in response to an imbalance of its
transmembrane potential. A. baumannii ATCC 19606 and P. aeruginosa PAO1 cells were grown with agitation at 37°C until they reached mid-log phase (OD600 = 0.5). The cells were then centrifuged, washed twice with washing buffer (20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, pH 7.2), and resuspended to an OD600 of 0.05 in 20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, 0.1 mol L⁻¹
KCl, pH 7.2. An aliquot of 100 μL of bacterial cells was added to a
black flat bottom 96-well plate and incubated with 20 nmol L⁻¹ of DiSC3-(5)
for 15 min until the fluorescence stabilized, indicating the
incorporation of the dye into the cytoplasmic membrane. The membrane
depolarization was monitored by observing the change in the fluorescence
emission intensity of the dye (λex = 622 nm, λem =
670 nm), after the addition of the peptides (100 μL solution at their
MIC values). The relative fluorescence was calculated using a non-linear
fit. The positive control (antibiotic polymyxin B) was used as
baseline. We estimated the % of difference between the baseline
(polymyxin B) and the sample using the same mathematical approach as in
the “Outer membrane permeabilization assays”.
Quantification and statistical analysis
Graphs
for the experimental results were created and statistical tests
conducted in GraphPad Prism v.9.5.1 (GraphPad Software, San Diego,
California USA).
Additional resources
AMPSphere is freely available for download from Zenodo.
We
thank Marija Dmitrijeva (University of Zurich) for her helpful comments
on a previous version of the manuscript. We thank Kaylyn Tousignant
(Queensland University of Technology) for her help editing the
manuscript. We thank Georgina H. Joyce (Queensland University of
Technology) for her help designing the graphical abstract. We thank
members of the Coelho group and the de la Fuente Lab for insightful
discussions. C.F.-N. holds a Presidential Professorship at the
University of Pennsylvania and acknowledges funding from the
Procter & Gamble Company, United Therapeutics, a BBRF Young
Investigator Grant, the Nemirovsky Prize, the Penn Health-Tech
Accelerator Award, Defense Threat Reduction Agency grants HDTRA11810041
and HDTRA1-23-1-0001, and the Dean’s Innovation Fund from the Perelman
School of Medicine at the University of Pennsylvania. We thank Dr. Mark
Goulian for kindly donating the strains Escherichia coli AIC221 (Escherichia coli MG1655 phnE_2:FRT [control strain for AIC 222]) and Escherichia coli AIC222 (Escherichia coli
MG1655 pmrA53 phnE_2:FRT [polymyxin-resistant]). This work was partly
funded by the EMBL and the following grants: National Natural Science
Foundation of China grants T2225015 and 61932008 (L.P.C. and X.-M.Z.);
Shanghai Science and Technology Commission Program grant 23JS1410100
(L.P.C. and X.-M.Z.); National Key R&D Program of China grants
2023YFF1204800 and 2020YFA0712403 (L.P.C. and X.-M.Z.); Shanghai
Municipal Science and Technology Major Project grant 2018SHZDZX01
(L.P.C. and X.-M.Z.); Lingang Laboratory and National Key Laboratory of
Human Factors Engineering Joint Grant LG-TKN-202203-01 (X.-M.Z.); The
Science and Technology Commission of Shanghai Municipality grant
22JC1410900 (L.P.C.); Australian Research Council grant FT230100724
(L.P.C.); the Langer Prize from the AIChE Foundation (C.F.-N.); National
Institutes of Health grant R35GM138201 (C.F.-N.); Defense Threat
Reduction Agency grant HDTRA1-21-1-0014 (C.F.-N.); PID2021-127210NB-I00,
MCIN/AEI/10.13039/501100011033/FEDER, UE (J.H.-C.); 'la Caixa'
Foundation ID 100010434, fellowship code LCF/BQ/DI18/11660009
(A.R.d.R.); and the European Union’s Horizon 2020 research and
innovation program under the Marie Skłodowska-Curie grant agreement
713673 (A.R.d.R.).
Author contributions
Conceptualization,
C.D.S.-J., L.P.C., M.D.T.T., and C.F.-N.; Data curation, C.D.S.-J.,
Y.D., T.S.B.S., M.K., A.F., L.P.C., M.D.T.T., and C.F.-N.; Formal
analysis, C.D.S.-J., L.P.C., and M.D.T.T.; Funding acquisition, L.P.C.,
X.-M.Z., and C.F.-N.; Investigation, C.D.S.-J., L.P.C., M.D.T.T., and
C.F.-N.; Methodology, C.D.S.-J., Y.D., J.H.-C., A.R.d.R., L.P.C.,
M.D.T.T., and C.F.-N.; Project administration, L.P.C., M.K., X.-M.Z.,
P.B., and C.F.-N.; Resources, L.P.C., X.-M.Z., and C.F.-N.; Supervision,
L.P.C. and C.F.-N.; Visualization, C.D.S.-J., J.H.-C., J.S., A.V.,
A.H., C.Z., L.P.C., and M.D.T.T.; Writing – original draft, C.D.S.-J.,
M.D.T.T., C.F.-N., and L.P.C.; Writing – review & editing,
C.D.S.-J., Y.D., J.H.-C., A.R.d.R., T.S.B.S., A.F., P.B., X.-M.Z.,
L.P.C., M.D.T.T., and C.F.-N.
Declaration of interests
C.F.-N.
provides consulting services to Invaio Sciences and is a member of the
Scientific Advisory Boards of Nowture S.L. and Phare Bio. The de la
Fuente Lab has received research funding or in-kind donations from
United Therapeutics, Strata Manufacturing PJSC, and Procter &
Gamble, none of which were used in support of this work. An invention
disclosure associated with this work has been submitted.
Table S1. Metadata and description of (meta)genomes used in AMPSphere, related to Figure 1
The
sample is identified by its access code in the European Nucleotide
Archive (ENA), and the habitat shows the type of habitat this sample was
retrieved from. Other data about the sequencing, such as the number of
raw inserts and the number of assembled base pairs (bp), are also
available along with the information on N50. The number of predicted
complete large ORFs (>100 amino acids) and smORFs (10–100 amino
acids) is shown (ORFs+smORFs) along with the number of smORFs alone and
the predicted non-redundant c_AMPs.
Table S2. c_AMP distribution in the habitat groups, related to Figure 1
The
habitats grouped under each class are shown along with the number of
genes encoding the non-redundant c_AMPs, the number of c_AMP clusters in
total, and the number of clusters containing ≥8 c_AMPs (c_AMP
families).
was compared using the number of c_AMPs affiliating to homologs of a
given OG and the total number of OGs found in the homologs of c_AMPs
(156,711) in comparison to the GMGCv1.
As a background measure, we used the counts of a given OG in the
redundant set of genes belonging to GMGCv1 and the total number of OGs
found in the redundant GMGCv1 catalog (9,180,087,363). Enrichment in the
c_AMPs set was given as the fold-change calculated for each given OG in
relation to that expected in the GMGCv1. p values were adjusted using Holm-Sidak, and only significant hits (p < 0.05) were shown.
Table S4. c_AMP genome context in comparison to families with proteins of different sizes, related to Figure 4
Proportion
of protein families and AMPs with two or more members showing conserved
genomic context involving different KEGG pathways, shown with their
accession code and description. This table provides a comparison across
the protein families of all sizes, small proteins (<50 amino acids),
the set of AMPs passing in all quality tests, the set of AMPs passing
all quality tests except for the experimental evidence, and all AMPs.
and those harboring KOs were included in this analysis. For each AMP
showing the same KO annotation as their neighboring genes, we provide
the KO annotation, the relative position of the homolog neighbor, and a
description of the function of the KO.
Table S6.
Metatranscriptomes and metaproteomes used in the verification for
experimental signals of transcription and/or translation of c_AMP genes
from AMPSphere, related to STAR Methods section “Quality control of
c_AMPs”
Metatranscriptomes
from EMBL-ENA were used for transcription of c_AMPs. Datasets from the
Proteomics Identification Database (PRIDE)