El virus ancestral de la COVID-19
apareció a más de 2.000 km de Wuhan solo unos años antes de la pandemia:
nuevo estudio apunta al comercio de animales salvajes como vía más probable de transmisión
The recency and geographical origins of the bat viruses ancestral to SARS-CoV and SARS-CoV-2
Un nuevo estudio revela que el virus causante de la COVID-19
(SARS-CoV-2) siguió un patrón casi idéntico al del SARS, viajando miles
de kilómetros a través del comercio de animales salvajes.
La historia del coronavirus que desencadenó la pandemia global podría haber comenzado muy lejos de Wuhan,
en las cuevas de murciélagos de China occidental o del norte de Laos.
Pero lo realmente impactante no es el lugar de origen, sino la ruta
inesperadamente rápida y distante que siguió el virus antes de llegar a
los primeros contagios humanos. Un nuevo estudio
publicado en la revista Cell —y
resumido en un comunicado oficial de la Universidad de California en
San Diego— apunta a una vía de transmisión que reaviva el foco sobre un
viejo sospechoso:
el comercio de fauna salvaje.
Este trabajo, realizado por un consorcio internacional de
investigadores, no solo reescribe parte de la historia reciente de la
pandemia, sino que también refuerza la teoría de un origen natural del SARS-CoV-2, muy similar al del brote de SARS en 2002.
Y lo hace desmontando, con evidencia evolutiva y geográfica, uno de los
argumentos más utilizados por los defensores de la teoría de la
filtración de laboratorio: la gran distancia entre Wuhan y los hábitats
naturales de los murciélagos portadores del virus.
Los científicos analizaron 250
genomas de coronavirus estrechamente relacionados con SARS-CoV-1 y
SARS-CoV-2. Estos virus pertenecen a una subfamilia llamada sarbecovirus, que se encuentra principalmente en murciélagos del género Rhinolophus
(murciélagos de herradura). A lo largo de milenios, estos virus han
circulado silenciosamente entre murciélagos en Asia, mutando y
recombinando fragmentos de su material genético.
Aquí es donde entra en juego un elemento crucial: la recombinación viral.
En lugar de heredar una única línea evolutiva, los sarbecovirus pueden
intercambiar fragmentos genéticos entre sí cuando dos virus diferentes
infectan al mismo murciélago. Esta capacidad convierte su genoma en una
especie de mosaico difícil de rastrear. Por eso, los investigadores
decidieron centrarse exclusivamente en las regiones del genoma que no
han sufrido recombinación para construir árboles genealógicos más
fiables.
Los resultados mostraron que los ancestros más cercanos al SARS-CoV-2 circularon en murciélagos del norte de Laos o de la provincia china de Yunnan entre cinco y siete años antes del inicio de la pandemia en Wuhan,
situada a unos 2.700 kilómetros. Un patrón similar se observó con el
SARS-CoV-1, cuyo ancestro más próximo apareció en el suroeste de China
poco antes de que surgiera el brote de 2002 en Cantón, también a más de
mil kilómetros de distancia. De hecho, el análisis reveló que los
ancestros de sarbecovirus más recientes tanto del SARS-CoV-1 como del
SARS-CoV-2 abandonaron sus puntos de origen menos de 10 años antes de
que se informara por primera vez que infectaban a humanos, a más de mil kilómetros de distancia.
La conclusión es sorprendente: la dispersión natural de los murciélagos no puede explicar estos desplazamientos tan rápidos y extensos.
Según el modelo desarrollado en el estudio, los sarbecovirus se mueven
aproximadamente al mismo ritmo que sus huéspedes: unos pocos kilómetros
por noche. Por tanto, el virus debió viajar por otra vía.
La ruta más probable: animales salvajes transportados por humanos
El estudio, liderado por
instituciones como la Universidad de California en San Diego y el
Instituto Rega de Bélgica, sugiere que el salto geográfico del virus se
produjo a través del comercio de animales salvajes. Esta hipótesis ya se
había barajado para el SARS-CoV-1, cuando se encontraron coronavirus en
civetas y perros mapache vendidos en mercados del sur de China. Ahora,
el nuevo análisis genético y geográfico refuerza la idea de que el comercio de fauna silvestre funcionó como puente entre murciélagos y humanos también en el caso de la COVID-19.
En concreto, los animales
infectados habrían sido trasladados, posiblemente vivos, desde las zonas
rurales del sur de Asia hasta mercados urbanos, como el de Huanan en
Wuhan. Allí, en condiciones de hacinamiento y contacto estrecho entre
especies y personas, el virus encontró el escenario perfecto para dar el
salto definitivo.
Este patrón —virus que
circulan en murciélagos durante años, salto a animales intermedios,
transporte humano a zonas densamente pobladas y finalmente transmisión a
humanos— es casi un calco entre el SARS original y el SARS-CoV-2. Y
eso, para los autores del estudio, es una coincidencia demasiado precisa como para ignorarla.
¿Y qué pasa con la teoría del laboratorio?
Desde los inicios de la
pandemia, la distancia geográfica entre Wuhan y las zonas con
murciélagos portadores del virus ha sido uno de los caballos de batalla
de quienes defienden un posible origen artificial del virus. Sin
embargo, el nuevo estudio desmonta este argumento mostrando que esa
distancia también existía en el caso del SARS de 2002. Es más, el patrón de propagación y evolución viral es extraordinariamente parecido en ambos casos.
Eso no significa, como
advierten algunos expertos, que se pueda descartar completamente la
hipótesis del escape de laboratorio. La falta de un huésped intermedio
identificado para el SARS-CoV-2 deja todavía una brecha importante en la
narrativa. Sin embargo, los hallazgos genéticos, geográficos y
temporales presentados ahora inclinan la balanza hacia un origen natural
facilitado por la actividad humana, especialmente el comercio de animales salvajes.
Hay que recordar que un estudio genético publicado el año pasado
señalaba al mercado de Huanan, en Wuhan, como el posible epicentro del
brote de la COVID-19, respaldando la hipótesis de un salto del virus
desde animales a humanos. Pero no es el único.
Varios estudios publicados
entre 2021 y 2023 han reforzado de forma consistente la hipótesis de un
origen zoonótico del SARS-CoV-2. Un estudio de 2021
examinó de forma crítica las pruebas disponibles y concluyó que la
transmisión del virus desde animales a humanos era la explicación más
plausible. En 2022, una investigación independiente
localizó el inicio del brote en el mercado mayorista de mariscos de
Huanan, en Wuhan, descartando que el Instituto de Virología de la misma
ciudad fuera el foco inicial.
Ese mismo año, otro análisis
identificó al menos dos eventos de transmisión zoonótica independientes
ocurridos en el mercado, con apenas una semana de diferencia, lo que
sugiere una propagación desde múltiples animales infectados. En 2023, una nueva revisión
cuestionó la falta de pruebas que respalden la teoría de una fuga de
laboratorio y denunció la desinformación que se ha difundido en torno a
la llamada “ganancia de función”, una técnica utilizada en virología que
ha sido malinterpretada en el debate público.
Ese mismo año,
un análisis crítico
concluyó que no existen indicios de que el SARS-CoV-2 haya sido creado o
manipulado artificialmente, y recordó que este virus es el noveno
coronavirus conocido que ha pasado de animales a humanos, un patrón ya
observado con otros patógenos.
Un estudio independiente de 2023
llegó a las mismas conclusiones, apuntando nuevamente hacia un origen
natural facilitado por el contacto estrecho entre humanos y fauna
silvestre. Y, por último, el pasado año Shi Zhengli, la viróloga del
Instituto de Virología de Wuhan frecuentemente mencionada en relación
con la hipótesis del escape de laboratorio,
publicó
las secuencias de coronavirus almacenadas en su laboratorio, dejando
claro que ninguna de ellas guarda una relación cercana con el
SARS-CoV-2.
Lecciones para futuras pandemias
Más allá del debate sobre los orígenes del SARS-CoV-2, el estudio ofrece una lección inquietante y urgente: las pandemias zoonóticas seguirán apareciendo si no se regula de forma estricta el comercio y transporte de animales salvajes.
Las condiciones que hicieron posible la COVID-19 —contacto cercano
entre especies, mercados abarrotados, transporte masivo de fauna desde
zonas rurales a núcleos urbanos— siguen existiendo en muchas partes del
mundo.
La vigilancia activa de los
virus en murciélagos, así como en animales vendidos en mercados, se
presenta como una herramienta fundamental para prevenir nuevas
catástrofes sanitarias. La historia se ha repetido una vez, y los
investigadores temen que vuelva a hacerlo si no se toman medidas
contundentes.
Este estudio, publicado el 7 de mayo de 2025 en Cell, no cierra todas las preguntas, pero ofrece
la evidencia más sólida hasta la fecha de que el comercio de vida
silvestre fue el motor silencioso detrás de la llegada del SARS-CoV-2 a
los humanos. En palabras de los investigadores, comprender la
historia evolutiva de estos virus no es solo una cuestión académica: es
una necesidad urgente para anticipar y contener la próxima pandemia.
https://www.cell.com/cell/fulltext/S0092-8674(25)00353-8
The recency and geographical origins of the bat viruses ancestral to SARS-CoV and SARS-CoV-2
Jonathan E. Pekar1,2,3,4,21,23 jpekar@ed.ac.uk ∙ Spyros Lytras5,6,21 spyros@g.ecc.u-tokyo.ac.jp ∙ Mahan Ghafari7 ∙ … ∙ Michael Worobey20,22 worobey@arizona.edu ∙ Joel O. Wertheim4,22 jwertheim@health.ucsd.edu ∙ Philippe Lemey
Fragments of the human SARS-CoVs share very recent common ancestors with bat viruses
•
SARS-CoV-1-like and SARS-CoV-2-like viruses have circulated in Asia for millennia
•
Recent ancestors of human SARS-CoVs likely circulated in Western China and Northern Laos
•
These ancestors traveled unexpectedly fast to reach sites of human emergence
Summary
The
emergence of SARS-CoV in 2002 and SARS-CoV-2 in 2019 led to increased
sampling of sarbecoviruses circulating in horseshoe bats. Employing
phylogenetic inference while accounting for recombination of bat
sarbecoviruses, we find that the closest-inferred bat virus ancestors of
SARS-CoV and SARS-CoV-2 existed less than a decade prior to their
emergence in humans. Phylogeographic analyses show bat sarbecoviruses
traveled at rates approximating their horseshoe bat hosts and circulated
in Asia for millennia. We find that the direct ancestors of SARS-CoV
and SARS-CoV-2 are unlikely to have reached their respective sites of
emergence via dispersal in the bat reservoir alone, supporting
interactions with intermediate hosts through wildlife trade playing a
role in zoonotic spillover. These results can guide future sampling
efforts and demonstrate that viral genomic regions extremely closely
related to SARS-CoV and SARS-CoV-2 were circulating in horseshoe bats,
confirming their importance as the reservoir species for SARS viruses.
Horseshoe bats (
Rhinolophus spp.) are the main hosts of the
Sarbecovirus subgenus
1 (genus
Betacoronavirus, family
Coronaviridae)
and, likely via virus transmission through transient intermediate
species, were the source of SARS-CoV (referred to hereafter as
SARS-CoV-1, now extinct) and SARS-CoV-2.
2,3,4
Since the first emergence of SARS-CoV-1 in Guangzhou, Guangdong
Province, in 2002 and SARS-CoV-2 (jointly referred to as the SARS-CoVs)
in Wuhan, Hubei Province, in 2019, the sampling and analyses of
sarbecoviruses in horseshoe bats has contributed to our understanding of
sarbecovirus diversity and geographical spread. Although sarbecoviruses
exhibit substantial geographic structuring linked to the ranges of
their host horseshoe bat species,
2,3,4,5,6 where and when their ancestors circulated, both in recent and ancient history,
7 is poorly understood.
Genome-wide
sequence identity is typically used to compare bat sarbecoviruses with
the SARS-CoVs, but because coronaviruses frequently recombine,
5,6 whole-genome identity of these bat viruses and the SARS-CoVs does not adequately reflect their complex evolutionary histories (
Figure S1A).
Previous attempts to date the most recent common ancestor (MRCA) of
SARS-CoV-2 and a closely related bat virus genome using long genomic
fragments resulted in estimates of 30–40 years before 2019.
4,6,8
However, when analyzing all non-recombinant regions (NRRs), we will
inevitably find some NRRs with older and some with younger MRCAs than
either this estimate from a long fragment or an estimate based on the
entire genome, which would effectively be a weighted average of NRR time
of the MRCAs (tMRCAs). To find a bat virus genome that has not
undergone recombination relative to either of the SARS-CoVs, it would
likely have been necessary to identify the sarbecovirus involved in each
cross-species transmission event leading to the SARS-CoVs (which we
refer to as the “direct ancestor”;
Figure S1B)
or a very closely related virus from a bat in the same ecological
community infected around the time of the transmission event. What we
have, instead, is a relatively small sample of bat sarbecovirus genome
sequences, gathered across different times and places. Hence, it is
necessary to identify all the NRRs of sarbecoviruses—genomic regions
within which there is little to no detectable signal of recombination
among the analyzed genomes and representing, to the extent possible, a
single evolutionary history. These genomic regions provide the clearest
insights into the evolutionary history of SARS-CoV-1 and SARS-CoV-2 and
the origins of their respective outbreaks.
Here,
we separately analyze the recombination patterns and respective
evolutionary histories of SARS-CoV-1 and its closely related viruses
(referred to as SARS-CoV-1-like viruses) and SARS-CoV-2 and its closely
related viruses (referred to as SARS-CoV-2-like viruses). We estimate
the separate sets of NRRs for the SARS-CoV-1-like and SARS-CoV-2-like
viruses. For each NRR, we utilize a molecular clock rate specific for
that portion of the genome and determine the MRCA of the SARS-CoV and
the most closely related bat virus, or set of viruses, among currently
sampled sarbecoviruses for the NRR. We refer to this MRCA as the
“closest-inferred ancestor” for that NRR because it is the parent node
of the SARS-CoV in the phylogeny (
Figure S1B)
and is limited by the current sample set of bat sarbecoviruses.
Additionally, the dating of deeper nodes in the phylogeny is complicated
by substitution saturation: the repeated occurrence of nucleotide
substitutions at the same site resulting in biased estimation of older
divergence in viral phylogenies.
9,10
We therefore account for substitution saturation in our NRR-specific
inferences when examining the deeper evolutionary history of
sarbecoviruses.
We show that the
available sequence data provide evidence of the closest-inferred bat
virus ancestors of both SARS-CoV-1 and SARS-CoV-2 circulating in bats in
the years immediately preceding the emergence of each SARS-CoV. By
performing phylogeographic analyses of these sarbecoviruses, we
demonstrate that the sarbecoviruses typically spread at a rate that
approximates their bat hosts. Based on this rate and current sampling,
we find that both SARS-CoV-1 and SARS-CoV-2 likely emerged far from
where their direct bat virus ancestors likely circulated and were
therefore unlikely to reach their provinces of emergence via bat host
movement alone.
Results
Inference of NRRs
The
evolutionary history of sarbecoviruses can only be accurately inferred
when focusing on NRRs rather than entire genomes or specific genes
because these viruses, as well as their ancestors, have undergone
extensive recombination.
4,6,8
To identify these recombination patterns, we aligned the 250
full-genome sarbecovirus sequences that were available at the time of
this analysis, including SARS-CoV-1 and SARS-CoV-2. We used the genetic
algorithm for recombination detection (GARD)
11 to independently identify recombination breakpoints on genomes closely related to SARS-CoV-1 and SARS-CoV-2 (
Figure S2).
We then performed separate analyses to reduce the potentially
overlapping recombination signal between the sets and increase our power
for confidently detecting breakpoint positions unique to each set. The
two sequence sets were separated based on previously published
recombination-aware phylogenetic analyses of most genomes used here
4,12,13 (see
STAR Methods).
The two recombination breakpoint analyses identified 31 putative NRRs
for the SARS-CoV-1-like viruses, with a median length of 870 nucleotides
(nt) and a range of 309–3,414 nt, and 44 putative NRRs for the
SARS-CoV-2-like viruses, with a median length of 595 nt and a range of
168–2,706 nt (
Figure 1A).
To examine whether the inferred positions of recombination breakpoints
of the SARS-CoV-1-like and SARS-CoV-2-like NRR analyses were closer to
each other than expected, we simulated randomly distributed
recombination breakpoint sets along the alignment for the
SARS-CoV-1-like and SARS-CoV-2-like viruses (30 and 43 breakpoints,
respectively) and compared the distribution of distances between the two
breakpoint sets. The empirical breakpoint sets had a mean genomic
distance of 203 nt (90% confidence interval [CI]: 138–299 nt) separating
the pairs of closest SARS-CoV-1-like and SARS-CoV-2-like virus
breakpoints, whereas the simulated distances had a mean of 396 nt (90%
CI: 259–606 nt;
Figure S3).
Additionally, there were seven breakpoint pairs between the two virus
sets that were in the smallest 10% of the observed distribution of
distances between breakpoints (10–51 nt apart;
Figure 1A),
indicating recombination hotspots shared between the two sarbecovirus
groups (e.g., near nsp3 and between the Spike N-terminal domain [NTD]
and receptor-binding domain [RBD]).
Recent bat virus ancestors of human SARS-CoVs
Focusing
on each NRR identified in the recombination analysis, we examined when
the closest-inferred bat viruses ancestral to SARS-CoV-1 and SARS-CoV-2
circulated. These ancestors are represented by the MRCA of each SARS-CoV
and the most closely related bat viruses, which is akin to the parent
node of each SARS-CoV in its respective phylogeny (
Figure S1B).
To examine whether the inferred
positions of recombination breakpoints of the SARS-CoV-1-like and
SARS-CoV-2-like NRR analyses were closer to each other than expected, we
simulated randomly distributed recombination breakpoint sets along the
alignment for the SARS-CoV-1-like and SARS-CoV-2-like viruses (30 and 43
breakpoints, respectively) and compared the distribution of distances
between the two breakpoint sets. The empirical breakpoint sets had a
mean genomic distance of 203 nt (90% confidence interval [CI]: 138–299
nt) separating the pairs of closest SARS-CoV-1-like and SARS-CoV-2-like
virus breakpoints, whereas the simulated distances had a mean of 396 nt
(90% CI: 259–606 nt;
Figure S3).
Additionally, there were seven breakpoint pairs between the two virus
sets that were in the smallest 10% of the observed distribution of
distances between breakpoints (10–51 nt apart;
Figure 1A),
indicating recombination hotspots shared between the two sarbecovirus
groups (e.g., near nsp3 and between the Spike N-terminal domain [NTD]
and receptor-binding domain [RBD]).
Recent bat virus ancestors of human SARS-CoVs
Focusing
on each NRR identified in the recombination analysis, we examined when
the closest-inferred bat viruses ancestral to SARS-CoV-1 and SARS-CoV-2
circulated. These ancestors are represented by the MRCA of each SARS-CoV
and the most closely related bat viruses, which is akin to the parent
node of each SARS-CoV in its respective phylogeny (
Figure S1B).
There
is insufficient temporal signal when calibrating a molecular clock
using tip dating with sarbecoviruses sampled from bats and pangolins (
Manis javanica),
6
likely as a consequence of limited sampling across space and time.
Therefore, we used SARS-CoV-1 genomes to identify a suitable rate prior
for the molecular clock in our timescaled phylogenetic analyses (
Figure S2; see
STAR Methods),
and these inferred substitution rates served as NRR-specific rate
priors for subsequent Bayesian phylogenetic inference of a SARS-CoV-1
genome from 2002 and 139 SARS-CoV-1-like viruses (
Figure S4A;
Table S1). The median time of the closest-inferred bat virus ancestor of SARS-CoV-1 within each NRR varied by several decades (
Figure 1B),
and the median of these medians is 1996. However, the NRR with the most
recent closest-inferred ancestor of SARS-CoV-1 had an inferred date of
2001 (95% highest posterior density [HPD]: 1998–2002; NRR 14 [Orf1b];
1,065 nt), only 1 year before the emergence of SARS-CoV-1 in 2002. The
viruses most closely related to SARS-CoV-1 within this NRR were YNLF_31C
(GenBank: KP886808.1) and YNLF_34C (GenBank: KP886809.1), sampled in
Yunnan, China, in 2013. Our findings were consistent, albeit slightly
earlier, when we included additional human and masked palm civet (
Paguma larvata) SARS-CoV-1 genomes from both the 2002–2003 and 2003–2004 outbreaks (
Figure S4C;
Table S1).
We similarly used the evolutionary rate of SARS-CoV-2 across the NRRs of the SARS-CoV-2-like viruses (
Figure S4B;
Table S1),
inferred from SARS-CoV-2 genomes sampled up to late 2020, as priors for
the phylogenetic analysis of the NRRs of the SARS-CoV-2-like viruses,
comprising SARS-CoV-2 and 37 SARS-CoV-2-like genomes. The time of the
closest-inferred bat virus ancestor of SARS-CoV-2 ranged from several
years to several decades preceding late 2019—when SARS-CoV-2 was
introduced into humans—across the 44 NRRs (
Figure 1C).
Although the median time of the closest-inferred ancestor of SARS-CoV-2
was in 2000, the two NRRs with the most recent time of the
closest-inferred ancestor both had an inferred date of 2014 (95% HPDs:
2002–2019 and 2007–2018; NRR 37 [Orf3a and E] and 43 [N]; 347 and 331
nt, respectively), only 5 years prior to its introduction into humans.
14
The genome most closely related to SARS-CoV-2 within NRR 37 is RacCS203
(GenBank: MW251308.1), sampled in Thailand in 2020, whereas the genome
most closely related to SARS-CoV-2 within NRR 43 is RpYN06 (GenBank:
MZ081381.1), sampled in Yunnan in 2021. We note that although these two
NRRs are each less than 350 nt, NRR 42 is 456 nt and has a similar time
of the closest-inferred ancestor of 2013 (95% HPD: 1999–2019). Within
NRR 42, the genomes most closely related to SARS-CoV-2 are BANAL-20-52
(GenBank: MZ937000.1) and BANAL-20-236 (GenBank: MZ937003.1), both
sampled in Laos in 2020. When using a faster rate prior based on
SARS-CoV-2 genomes sampled in early 2020 (
Figure S4B;
Table S1), we obtain slightly more recent estimates of the closest-inferred ancestor (
Figure S4D;
Table S1).
Our inferred ages for the closest-inferred ancestors of each of the
SARS-CoVs across the NRRs indicate that several of the published
sarbecovirus genomes include non-recombinant fragments that are
descendant from viruses that circulated likely only several years before
the emergence of SARS-CoV-1 and SARS-CoV-2.
The
times of closest-inferred ancestor of SARS-CoV-1 across the NRRs are
nearer to the date of human emergence than those for SARS-CoV-2 (
Figures 1B
and 1C): 21 NRRs of the SARS-CoV-1-like viruses have a median time of
the closest-inferred ancestor of SARS-CoV-1 within 10 years of the
emergence of SARS-CoV-1, whereas only 7 NRRs of the SARS-CoV-2-like
viruses have a closest-inferred ancestor of SARS-CoV-2 within 10 years
of the emergence of SARS-CoV-2. This pattern could be explained in part
by the sampling of a substantially larger number of SARS-CoV-1- than
SARS-CoV-2-like genomes in bats. However, using a faster rate prior
based on SARS-CoV-2 genomes sampled in early 2020 results in 17 NRRs
having a closest-inferred ancestor of SARS-CoV-2 within 10 years of its
emergence (
Figure S4D). Both SARS-CoV-1-like and SARS-CoV-2-like viruses consistently have older closest-inferred ancestors across the Spike gene (
Figures 1B
and 1C). The oldest time of the closest-inferred bat virus ancestor
within a single NRR of SARS-CoV-1 and SARS-CoV-2 was, respectively, 1963
(95% HPD: 1934–1985; NRR 19 [Orf1b]) and 1944 (95% HPD: 1798–2009; NRR
28 [Spike]). Notably, only 1 NRR among the SARS-CoV-1-like viruses (NRR
19) and 6 NRRs among the SARS-CoV-2-like viruses (NRRs 10, 27, 28, 31,
32, and 38) have a median time of the closest-inferred ancestor more
than 50 years earlier than the most recently sampled SARS-CoV-1-like or
SARS-CoV-2-like virus, respectively.
Effect of increased sampling on the recombinant common ancestors
To
test the influence of sampling, we quantified the extent to which the
publication of more viruses closely related to SARS-CoV-1 and SARS-CoV-2
has contributed toward our understanding of the ancestor of each of
these human viruses. The respective recombinant common ancestor
(“recCA”) of SARS-CoV-1 and SARS-CoV-2 can be understood as the
aggregate of the closest-inferred ancestors of each of the SARS-CoVs
across each of its NRRs,
14
with each closest-inferred ancestor existing at different points in
time and with potentially different viruses most closely related to the
SARS-CoV. The recCA therefore accounts for the closest relative(s)
across all non-recombinant segments. We inferred the genome of the recCA
of each SARS-CoV over time, progressively adding published genomes to
the reconstruction based on their publication date. Then, we calculated
the similarity of each recCA to its respective SARS-CoV as a function of
the publication date to examine how additional closely related genomes
can affect ancestral reconstruction and reflect our increasing
understanding of sarbecoviruses over time (
Figure 2A).
We
observe that, as more genomes were published, the SARS-CoV-1 recCA and
SARS-CoV-2 recCA became more similar to SARS-CoV-1 and SARS-CoV-2,
respectively (
Figures 2A
and 2B). The full bat virus genomes most similar to SARS-CoV-1 and
SARS-CoV-2 are, respectively, YN2020E (96.2% genetic identity to
SARS-CoV-1; GenBank: OK017855.1) and BANAL-20-52 (96.8% genetic identity
to SARS-CoV-2), both published in 2021. By the end of 2022, the recCA
of SARS-CoV-2 shares 98.8% genetic identity with SARS-CoV-2 and the
recCA of SARS-CoV-1 shares 98.7% genetic identity with SARS-CoV-1, both
expectedly exceeding the genetic identity of the bat virus genomes most
genetically similar to the human SARS-CoVs. Notably, the recCA
repeatedly increases in similarity to its respective SARS-CoV even when
newly published genomes share less genetic identity with the recCA than
the most similar genome at the time (
Figures 2A
and 2B). This trend can be observed in detail at the level of
individual NRRs: sequentially published genomes increase the genetic
identity of some of the NRRs of the recCA to the matching regions of its
SARS-CoV (
Figures 2C
and 2D). The shared genetic identity of the recCA to its respective
SARS-CoV is broadly consistent when incorporating additional SARS-CoV-1
genomes with the SARS-CoV-1-like viruses (
Figures S5A
and S5C) and when using a faster rate prior in the Bayesian
time-calibrated phylogenetic inference of the SARS-CoV-2-like viruses (
Figures S5B and S5D).
The
patterns observed here are the result of newly added genomes containing
fragments that descend from a more closely related ancestor despite not
being the most genetically similar overall, and further highlight the
importance of inferring NRRs in evolutionary analyses. Additionally, the
similar genetic identity of the recCAs to their respective SARS-CoVs,
both genome-wide and often at the level of individual NRRs, suggest that
the greater number of sampled SARS-CoV-1-like viruses than
SARS-CoV-2-like viruses is not solely responsible for the times of
closest-inferred ancestor of SARS-CoV-1 being consistently nearer to the
date of human emergence than those for SARS-CoV-2.
Phylogeography of bat sarbecoviruses
Sarbecovirus NRR phylogenies have an appreciable degree of geographic structuring linked to their hosts’ ranges.
6
To understand the spatiotemporal spread of the SARS-CoV-1-like and
SARS-CoV-2-like viruses in bats, we simultaneously fitted a spatially
explicit phylogeographic model to each NRR. As the sampling locations of
SARS-CoV-1, SARS-CoV-2, and the pangolin sarbecoviruses likely do not
represent where their direct bat virus ancestors circulated,
15,16,17,18 we excluded their locations from the phylogeographic analyses to avoid the impact of dispersal of non-bat hosts (
Figure S2).
Due
to substitution saturation, the deeper evolutionary history of rapidly
evolving viruses, including sarbecoviruses, are underestimated using
standard molecular clock approaches.
7,10,19,20,21
To account for the effects of substitution saturation on the
SARS-CoV-1-like and SARS-CoV-2-like phylogenies when trying to
understand the long-term dispersal patterns of sarbecoviruses, we
applied the prisoner of war (PoW) model
21—a
molecular clock model that estimates divergence times while accounting
for time-dependent evolutionary rates—to the posterior phylogenies with
phylogeographic estimates scaled in units of genetic distance (
Figure S2; see
STAR Methods).
We find that substitution saturation primarily affected the inferred
time of divergence events more than 50 years before the most recently
sampled virus in our dataset, with younger internal nodes having
relatively unchanged ages and indicating that our inferences of the time
of the ancestors of human SARS-CoVs and their closest bat
sarbecoviruses are unbiased (
Figures 3A and 3B). The effects of substitution saturation can, however, be particularly more pronounced for older divergence estimates (
Figures 3A
and 3B). When examining the entire phylogenies, the tMRCAs were often
hundreds of years old when using a standard clock model, whereas the
PoW-transformed phylogenies had tMRCAs that were typically thousands to
tens of thousands of years older (
Figures 3C and 3D).
To
investigate the dispersal history of the two sets of viruses, we mapped
the PoW-transformed phylogeny of each NRR across Asia (
Figures 4G and 4H;
Data S2).
The phylogeographic reconstructions of both sets of viruses suggest a
few lineages having dispersed over long distances, but the number of
these dispersals is likely an underestimate due to undersampling of
sarbecoviruses from various regions in Asia. The inferred dispersal
history of viral lineages looks consistent among NRRs (
Figures 4G and 4H;
Data S2):
we observe a similar spatial distribution of the sets of viruses and
their connections on the different NRR maps. Moreover, the virus
lineages are spatially structured, with phylogenetically similar viruses
being generally sampled in geographically similar regions with almost
no back and forth travel of viral lineages across relatively large
regions of the study area, which we confirmed by computing the
isolation-by-distance (IBD) signal
22 for each NRR (
Table S2).
Our analyses indeed reveal a moderate to strong IBD signal when
computing the correlation between the patristic distance—the sum of the
branch lengths connecting two nodes on a tree—and great-circle
geographic distance computed for each pair of bat virus samples: a
Spearman correlation coefficient of 0.45 for SARS-CoV-1-like viruses
(95% HPD: 0.36–0.60) and of 0.67 for SARS-CoV-2-like viruses (95% HPD:
0.29–0.84) (see
Table S2 for a comparison with other viruses).
Location of the closest-inferred ancestors of the human SARS-CoVs
Determining
where the recent ancestors of SARS-CoV-1 and SARS-CoV-2 circulated
could inform future sampling efforts and elucidate how these virus
lineages eventually traveled to their provinces of emergence. As the PoW
model enforces contemporaneously sampled tips, it can produce
unreliable results for the most recent inferred ages. However, the PoW
transformation shows that the timing of divergence events in the last
∼50 years inferred from standard molecular clock models are unbiased (
Figures 3A
and 3B). Therefore, we performed a separate tip-dated phylogeographic
analysis to infer where the closest-inferred ancestors of each SARS-CoV
circulated in the past 50 years (
Figure S2).
We find that the closest-inferred ancestors of SARS-CoV-1 likely circulated in western China (Yunnan, Sichuan, or Guizhou) (
Figure 4I).
The closest-inferred ancestors of SARS-CoV-2 likely circulated in
Yunnan, China, or Northern Laos, overlapping with contiguous karst and
cave landscapes extending through these regions
23,24 (
Figure 4J).
Our results indicate that both SARS-CoV-1 and SARS-CoV-2 emerged in
humans over a thousand kilometers from where their closest-inferred bat
virus ancestors likely circulated, with SARS-CoV-1 emerging in Guangzhou
and SARS-CoV-2 in Wuhan. The inferred geographic location of the
closest-inferred ancestors of each SARS-CoV overlapped with regions with
moderate horseshoe bat species richness (
Figure 4B;
Data S2), including in regions known to harbor
R. affinis,
R. pusillus, and
R. ferrumequinum (
Data S2), and the closest-inferred ancestor of SARS-CoV-2 additionally overlapped with known locations of
R. malayanus circulation (
Data S2). Although sarbecoviruses have been found in bats from Guangdong and Hubei,
13,25,26
the viruses most genetically similar to SARS-CoV-1 and SARS-CoV-2, both
at the genome level and among individual NRRs, have been predominantly
sampled in and near Southwest China.
27,28
Therefore, our findings are not solely a result of localized sampling
efforts but reflect a natural pattern in the geographic distribution of
bat sarbecoviruses, though more sampling in areas proximal to the Laos
and Vietnam borders may identify additional CoVs descendant from
relatively recent ancestors circulating in bats.
Sarbecoviruses spread at rates that approximate bat host dispersal
To
investigate how fast the SARS-CoV-1-like and SARS-CoV-2-like viruses
were recently spreading across Asia, we inferred the weighted diffusion
coefficient,
29 a measure of the diffusivity of viral transmission, for the 50 years before the most recently sampled genome in each dataset (
Figure S2). We found that the SARS-CoV-1-like viruses spread with a weighted diffusion coefficient of 1,666 km
2/year (95% HPD: 811–3,399;
Figure 4E), and the SARS-CoV-2-like viruses circulated with a weighted diffusion coefficient of 740 km
2/year (95% HPD: 116–2,587;
Figure 4F).
The SARS-CoV-1-like viruses were likely inferred to spread, on average,
slightly more quickly than the SARS-CoV-2-like viruses because of the
recent dispersal events to South Korea. These results were similar to
the diffusion coefficients calculated for the 75 years preceding the
most recently sampled genomes (
Table S2). Horseshoe bats, the primary hosts of sarbecoviruses, tend to forage within around 2–3 km of their roost
30,31,32,33 and have a reported diffusion coefficient of approximately 2,000 km
2/year (Henley et al.
34).
Therefore, both the SARS-CoV-1-like and SARS-CoV-2-like viruses (in the
context of host infections) traveled at a roughly similar rate as their
hosts, suggesting that these viruses are able to efficiently explore
new niches, likely due to the density of horseshoe bat roosts within the
karst landscapes of Southern China.
Rapid movement of lineages leading to human SARS-CoVs
To
test whether the bat virus ancestors of each human SARS-CoV could have
spread to where the SARS-CoVs emerged in humans via dispersal of
SARS-related coronaviruses across bat populations, we determined the
minimum distance from the location of the closest-inferred bat virus
ancestor to the province of emergence (
Figures 4I and 4J), and, given the time between emergence and this ancestor (
Figures 1B
and 1C), we calculated the dispersal velocity this distance implies. We
compared the implied dispersal velocity for the tip branch leading to
SARS-CoV-2 to the dispersal velocity of bat virus tip branches to obtain
a posterior rank distribution (
Figure S6). The tip branches leading to SARS-CoV-1 and SARS-CoV-2 had the highest mean posterior dispersal velocity rank (
Figure 5A),
but there were other tip branches leading to SARS-CoV-1-like viruses
with consistently high posterior ranks as well. Strikingly, the
posterior rank distributions for the tip branches of SARS-CoV-1 and
SARS-CoV-2 exhibit the most extreme skewness away from a rank of 0.5 and
toward 1 (
Figure 5B; see
STAR Methods), with their posterior rank distributions rarely including values below 0.5 (
Figure S6).
This pattern further indicates the rapid nature of the dispersal
velocities associated with the tip branches leading to SARS-CoV-1 and
SARS-CoV-2.
We
next explored the contributions of individual NRRs to the relatively
fast dispersal velocities (and, therefore, high posterior ranks)
required for the closest-inferred ancestors of SARS-CoV-1 and SARS-CoV-2
to reach their respective provinces of emergence. The tip branches
leading to SARS-CoV-1 and SARS-CoV-2 have the greatest mean posterior
rank in, respectively, 12 and 28 of the NRRs, indicating that it is not
only NRRs with recent closest-inferred ancestors of the SARS-CoVs (
Figure 1)
that result in relatively high dispersal velocities for the tip
branches leading to the SARS-CoVs. Considering the large distance
separating the closest-inferred bat virus ancestors from where
SARS-CoV-1 and SARS-CoV-2 each emerged (
Figures 4I and 4J), and the high dispersal velocities necessary to traverse this distance (
Figure 5),
it is unlikely that the lineages descending from the closest-inferred
bat virus ancestors of SARS-CoV-1 and SARS-CoV-2 reached Guangdong and
Hubei, respectively, solely via dispersal through their bat reservoirs.
Phylogeography robust to hypothetical unsampled diversity
We
evaluated whether our findings could be affected by the hypothetical
discovery of viruses closely related to the human SARS-CoVs nearer to
their provinces of emergence than has been observed in the empirical
data. We conducted tip-dated phylogeographic analyses that incorporated
hypothetical unsampled viruses, which represent unsequenced and
unobserved viral diversity. These unsampled taxa were positioned between
the provinces of emergence of SARS-CoV-1 and SARS-CoV-2 and the
locations where the taxa most genetically similar to them were sampled,
while also enforcing monophyly between the unsampled taxa and the
respective SARS-CoV (
Figure S7; see
STAR Methods), thereby ensuring that these viruses were more phylogenetically similar to SARS-CoV-1 and SARS-CoV-2 across the entire genome.
The
inclusion of these unsampled taxa resulted in a shift in the inferred
location of the closest-inferred ancestor toward the province of
emergence of the respective SARS-CoV, although, notably, the 50% and 75%
HPDs were consistent (
Figure S8).
There were slight reductions in the dispersal velocities necessary for
the closest-inferred ancestors of SARS-CoV-1 and SARS-CoV-2 to reach
their provinces of emergence (
Figure S9),
primarily due to the shift in the inferred location of the
closest-inferred ancestor. However, the tip branches leading to
SARS-CoV-1 and SARS-CoV-2 still had the highest mean posterior dispersal
velocity rank on average and across many NRRs, as well as the most
positive skewness (
Figure 6).
Other measures to quantify differences in dispersal velocity also
generally support the relatively high ranks of the tip branches leading
to SARS-CoV-1 and SARS-CoV-2 but are more sensitive to potentially
unsampled diversity, in particular for SARS-CoV-1 (
Figure S10).
Discussion
By
reconstructing the evolutionary history of sarbecoviruses while
accounting for recombination, we show that the ancestors of SARS-CoV-1
and SARS-CoV-2 likely circulated in horseshoe bat populations hundreds
to thousands of kilometers away from the sites of the emergence of these
viruses in humans and as recently as one to six years prior to this
emergence. Our findings indicate that there would not have been
sufficient time for the direct bat virus ancestor to reach the locations
of emergence of the human SARS-CoVs via normal dispersal through bat
populations alone.
Sarbecoviruses
undergo extensive recombination and tend to share breakpoints at
specific genome regions—so-called recombination hotspots.
4,35
Comparisons of non-recombinant subgenomic segments when less data was
available previously suggested there may have been decades of separation
between the closest-inferred bat virus ancestor and each of the
SARS-CoVs.
6,36
However, dating the closest-inferred ancestor across non-recombinant
subgenomic regions spanning the entire genome after calibrating a
molecular clock on a per region level reveals much more recent ancestry
between the SARS-CoVs and their bat-circulating relatives. The
non-recombinant fragments, therefore, must be comprehensively analyzed
to properly understand the emergence of the human SARS-CoVs.
5,6
We additionally show here that adding new genomes to the
recombination-aware ancestral reconstruction brought the sequence
identity of the recombinant common ancestor of each SARS-CoV to its
respective human virus to greater than 98% genetic identity.
The
phylogeographic approach implemented here provides important insights
into the sarbecoviruses’ extensive historical movement across China and
Southeast Asia. Our analysis confirms that both the SARS-CoV-1-like and
SARS-CoV-2-like viruses have each been circulating in these regions for
at least tens, if not hundreds, of thousands of years. Conversely, the
bat species endemic to Japan known to harbor sarbecoviruses,
R. cornutus, and its phylogenetically closest mainland species infected by SARS-CoV-2-like viruses,
R. pusillus, are estimated to have shared a common ancestor 13.3 million years ago.
37
This much more ancient separation between the host species indicates
that the viruses did not strictly co-diverge with these bat species. We
find that, in more recent history, the viral lineages diffuse at a
similar rate to horseshoe bats. These bats typically have small home
ranges (2–3 km
2 movement per night on average), are rarely migratory,
32,38
and have a limited dispersal capacity with scant evidence of longer
migrations, with such evidence limited to species in temperate rather
than tropical regions.
39,40
After accounting for recombination, the closest-inferred bat virus
ancestors of SARS-CoV-1 and SARS-CoV-2 are inferred to have circulated
in regions spanning from Northern Laos to Western China. Furthermore,
our examination of the effects of hypothetical unsampled taxa on the
phylogeographic reconstructions suggest that our inference of the
location of the closest-inferred bat virus ancestors and their relative
dispersal velocities are robust rather than a byproduct of subsampling.
SARS-CoV-1 is thought to have been moved from bats in Yunnan province to Guangdong province via intermediate mammalian hosts.
15 The movement of palm civets (
Paguma larvata) and raccoon dogs (
Nyctereutes procyonoides) as part of the wild and farmed animal trade has been implicated in the emergence of SARS-CoV-1,
41
suggesting that human-mediated transport of these infected animals can
allow for viral lineages to move faster than their bat hosts. Given (1)
the presumed emergence of SARS-CoV-1 through the animal trade, (2) the
recency and location of the closest-inferred ancestor relative to
SARS-CoV-2, (3) the consistently relatively high dispersal velocity
associated with SARS-CoV-2, and (4) the clear evidence that the
epicenter of the SARS-CoV-2 pandemic was at one of only four markets in
Wuhan that sold live wildlife from plausible intermediate host mammal
species,
42,43,44
either the closest-inferred ancestor or the direct ancestor of
SARS-CoV-2 likely moved from an area in or around Yunnan province, to
Hubei province, via the wild and farmed animal trade.
Our
phylogeographic model assumes that bat and human sarbecoviruses evolve
at similar rates. If bat viruses have slower rates of evolution than
those inferred here, the closest-inferred bat virus ancestors likely
circulated earlier. However, if this were true, there would be a
concomitant decrease in their inferred rates of travel and, therefore,
still not enough time for them to reach the provinces of emergence of
their respective SARS-CoVs.
The
roughly similar rates of diffusion among viral lineages and horseshoe
bats suggest these viruses are able to rapidly and efficiently move
among distinct horseshoe bat populations. If virus circulation were
restricted to the location of each bat population, with minimal
transmission of strains to other bats, we would expect a considerably
lower diffusion coefficient for the viruses, but there is substantial
overlap between the virus lineage and bat diffusion coefficients.
Therefore, the sarbecoviruses are likely being transmitted among dense
bat populations of various horseshoe bat species. This finding is
further supported by the high levels of recombination in bat
sarbecoviruses, requiring frequent co-infection and co-circulation of
distinct virus strains in a given bat population. Viral movement is
likely to take place through horseshoe bats regularly moving among
roosts, especially as roost switching tends to happen during breeding
season, or if bats travel or migrate to hibernacula.
33,45 Furthermore, in hibernacula, horseshoe bats will roost in very dense aggregations with different species,
46
providing the potential for spillover between bat species while bodily
functions are downregulated and body temperatures can reach very low
levels.
46,47
Heterothermy during hibernation and variable temperatures among species
may also impact vulnerability to pathogens, as metabolic processes are
downregulated during this time.
48,49Despite
limited evidence for long-distance travel of horseshoe bats, there are
several relatively long-distance viral lineage dispersals between nodes
in the phylogenies. These geographically distant nodes therefore likely
represent a lack of sarbecovirus sampling from areas between the distant
locations rather than long-range bat movements, though it could also
indicate dispersal from more migratory conspecific bat species, such as
Chaerephon, which can share roosts, or
Hipposideros,
which can disperse longer distances and are phylogenetically close to
Rhinolophidae. For example, internal branches of the SARS-CoV-2-like
tree represent movement from the east of China (Zhejiang) to the west
(Yunnan), to Thailand and Cambodia (
Figure 4H).
There are likely many more, still-undiscovered SARS-CoV-2-like viruses
that would bridge the geographic gap between these regions (e.g., in
northern Vietnam, Guizhou, and Guangxi) as the diversity of bats remains
high across these regions, such that increased sampling in
underexplored and undersampled regions would be helpful (
Figure 4B).
Although horseshoe bat ranges extend across most of China, species richness decreases at increasing latitudes,
27,50
and virus sampling in bats has largely been limited to higher
biodiversity areas, including Yunnan province in China and Laos.
Additionally, other large areas with high diversity, such as northern
Vietnam, remain unsampled and host a similar community of bat species.
The SARS-CoV-1-like bat viruses are still better represented (
n = 134) and therefore have more phylogeographic resolution than the SARS-CoV-2-like bat viruses (
n
= 31). Although there are viruses closely related to the SARS-CoVs
across most of the NRRs, the viruses are relatively more divergent
across the NRRs that overlap with the Spike gene (
Figure 1).
This relatively greater divergence suggests that Spike genes that can
bind non-bat entry receptors circulate in lower frequencies in wild
horseshoe bat populations and, therefore, are less likely to be sampled.
Since
the emergence of SARS-CoV-1 and SARS-CoV-2 in humans, the descendants
of the bat viruses that gave rise to them have experienced years of
evolution and almost certainly recombined with yet-unsampled lineages of
the sarbecovirus tree. Hence, we should not expect to find the direct
ancestor of either SARS-CoV-1 or SARS-CoV-2 circulating in wild bats in
future sampling. Although the search for and analyses of sarbecoviruses
related to SARS-CoV-1 and SARS-CoV-2 have often focused on partial
genomes, typically the RdRp protein,
13,51,52,53,54,55
future sequencing efforts should routinely aim for whole-genome
sequences to detect and characterize all genomic fragments that descend
from closely related ancestors. The non-recombinant segments of these
mosaic genomes are pieces of a complex puzzle that are crucial for
understanding the past and future emergence of SARS-CoVs into the human
population.
Limitations of the study
Because
there is insufficient temporal signal when calibrating a molecular
clock using tip dating with sarbecoviruses sampled from bats and
pangolins, we needed to calibrate the clock using SARS-CoV-1 and
SARS-CoV-2 genomes from viruses sampled from humans and civets. The NRRs
reported here are based on distinct inferred evolutionary histories
across the sarbecovirus genomes, but, due to the need for a detectable
signal, they likely represent a lower estimate of how much recombination
has likely occurred among these viruses. Phylogeographic
reconstructions across deep evolutionary histories can only capture
general patterns and necessarily make abstraction of potentially complex
population and habitat dynamics. Due the absence of bat viruses sampled
closer to where SARS-CoV-1 and SARS-CoV-2 emerged, the inference of the
location of the closest-inferred bat virus ancestors could be biased
toward where published, closely related viruses were sampled. Similarly,
a lack of sampling and publication of sarbecoviruses in nearby and
potentially difficult-to-access regions limits our understanding of the
broader ecology of these viruses. Even if sampling efforts would target
more regions, host shedding of virus will vary over time because it is
likely to increase during periods of increased metabolic stress, such as
pregnancy or lactation. Therefore, detectability will vary across the
year and there may be temporal sampling biases. However, the lack of
sampled bat sarbecoviruses very closely related to SARS-CoV-1 and
SARS-CoV-2 in and around their provinces of emergence, despite increased
sampling of bat viruses there,
13,25,26
suggests that our findings are not solely the result of ascertainment
bias. Further sampling of closely related sarbecoviruses, methodological
improvements in disentangling complex recombination patterns, and
improved molecular clock calibration can allow for the inference of more
detailed spatiotemporal trends and recombination patterns, as well as
more complete evolutionary histories of each NRR.
Resource availability
Lead contact
Further
information and requests for resources and additional data should be
directed to and will be fulfilled by the lead contact, Jonathan E. Pekar
(
jpekar@ed.ac.uk).
Materials availability
This
study did not generate new unique reagents or specific biological
material. The materials used and generated in this study are available
from the
lead contact.
As a computational project, the input, results, output, and code of
this study are publicly available and described below in the “
data and code availability” section.
Data and code availability
Acknowledgments
S.D.
acknowledges support from the Fonds National de la Recherche
Scientifique (F.R.S.-FNRS, Belgium; grant no. F.4515.22) and from the
Research Foundation - Flanders (
Fonds voor Wetenschappelijk Onderzoek - Vlaanderen,
FWO, Belgium; grant no. G098321N). S.D. and P.L. acknowledge support
from the European Union Horizon 2020 project MOOD (grant agreement no.
874850). P.L., A.F.M., A.R., and M.A.S. acknowledge support from the
European Union’s Horizon 2020 research and innovation programme (grant
agreement no. 725422-ReservoirDOCS), from the Wellcome Trust through
project 206298/Z/17/Z, and from the National Institutes of Health grants
R01 AI153044, R01 AI162611, and U19 AI135995. P.L. also acknowledges
support from the Research Foundation - Flanders (
Fonds voor Wetenschappelijk Onderzoek – Vlaanderen,
G0D5117N and G051322N). T.I.V. acknowledges support from the Branco
Weiss Fellowship. D.L.R. and J.H. acknowledge funding from the UK
Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, and
MR/V01157X/1). S.L. was funded by an MRC studentship. M.G. is supported
by the Biotechnology and Biological Science Research Council (BBSRC),
grant no. BB/M011224/1. A.K. acknowledges funding from European Research
Council grant 101001623-PALVIREVOL. J.E.P. acknowledges support from
NIH (T15LM011271) and the UC San Diego Merkin Fellowship. This work was
funded in part with federal funds from the National Institute of Allergy
and Infectious Diseases, National Institutes of Health (NIH),
Department of Health and Human Services (contract no. 75N93021C00015 to
M.W.). J.O.W. acknowledges support from NIH (R01AI135992). We gratefully
acknowledge the authors from the originating laboratories and the
submitting laboratories, who generated and shared through GISAID, the
viral genomic sequences and metadata on which this research is based (
Data S1).
We thank Kristian Andersen, Alexander Crits-Christoph, Fabian
Leendertz, Niema Moshiri, and Richard Neher for fruitful conversations
and insight. We gratefully acknowledge support from Advanced Micro
Devices, Inc., with the donation of parallel computing resources used
for this research.
Author contributions
J.E.P.,
S.L., D.L.R., M.W., J.O.W., and P.L. conceived and planned the
research. J.E.P., S.L., S.D., M.G., A.F.M., E.P., and P.L. analyzed and
processed the data. Y.W., X.J., A.F.M., T.I.V., M.A.S., J.H., S.D.,
J.O.W., and P.L. advised on statistical methodology. J.E.P., S.L., S.D.,
M.G., and P.L. designed and implemented genomic data processing
pipelines. A.C.H. modeled bat distributions. J.E.P., S.L., S.D., and
P.L. wrote the first draft of the manuscript. All authors contributed to
editing and interpreting the results. D.L.R., S.D., M.W., J.O.W., and
P.L. jointly supervised the work.
Declaration of interests
S.L.
has received consulting fees from EcoHealth Alliance. M.A.S. receives
contracts and grants from the US Food and Drug Administration, the US
Department of Veterans Affairs, and Janssen Research and Development
unrelated to this research. J.O.W. has received funding from the CDC
(ongoing) through contracts or agreements to his institution unrelated
to this research. J.E.P., M.A.S., A.R., M.W., and J.O.W. have received
consulting fees and/or provided compensated expert testimony on
SARS-CoV-2 and the COVID-19 pandemic.
STAR★Methods
Experimental model and study participant details
The sources of bioinformatic data analyzed in this study are available in the
key resources table.
Method details
Dataset compilation
For the animal sarbecovirus dataset we compiled a set of 250 whole-genome sequences (>28,500 nt) of the subgenus
Sarbecovirus
sampled primarily in horseshoe bat hosts, including reference genomes
of the human viruses SARS-CoV-1 and SARS-CoV-2 and six
pangolin-infecting viruses (
Data S1).
This set contains all publicly available whole-genome sarbecoviruses
available on 4th of October 2023 retrieved from the NCBI Genbank, GISAID
and National Genomics Data Center of China databases. A single genome
was used for SARS-CoV-1 (HSZ-Cc; GenBank: AY394995.1), SARS-CoV-2
(Wuhan-Hu-1; GenBank: MN908947.3) and the Guangdong pangolin
sarbecovirus isolate (MP789; GenBank: MT121216.1) which has multiple
published genome accessions associated with the same sample.
16,18,85,86
Collection date, sampling location and host metadata associated with
the sequences were retrieved from the corresponding sequence databases
and manually checked with the published papers describing the sequences.
For sequences where exact city/cave sampling location information was
provided a single longitude and latitude data point corresponding to
that location was recorded, while provinces were recorded as sampling
locations otherwise. The metadata for these 250 genomes are available in
Data S1.
For the SARS-CoV-2 datasets, we used 863 genomes from the early pandemic
44 and 1,000 genomes from late 2020.
87
For the SARS-CoV-1 dataset, we downloaded the 98 publicly available
whole genomes of SARS-CoV-1 from humans and civets from GenBank, with
dates and locations matched by metadata or literature review:
https://github.com/andersen-lab/SARS-CoV-1_Evolution/blob/main/metadata_extended.csv. A list of all accessions used in this study are available in
Data S1. All GISAID accessions are also available through GISAID when using the identifier
EPI_SET_241121yn.
Recombination analysis
The 250 whole-genome sequences were aligned using MAFFT (v7.453, localpair option)
75 and manually curated using BioEdit.
76
To explore the recombination patterns of SARS-CoV-1-like and
SARS-CoV-2-like sarbecoviruses separately we selected a number of
representatives from each set of genomes on which to perform
recombination analysis using the Genetic Algorithm for Recombination
Detection (GARD).
11
We analysed the two sarbecovirus clades separately to reduce
overlapping recombination signal between the sets that may skew the
position of identified breakpoints unique to each clade. Even though
inter-clade recombinants exist in the dataset, the GARD algorithm does
not require potential recombination parents to be in the dataset, but
assesses recombination simply by comparing tree topology incongruencies.
Thus, signals of SARS-CoV-1-like/SARS-CoV-2-like inter-clade
recombinants can be detected in genomes of each set without the parental
lineages of recombinant genomes having to be present in each dataset.
The genome selection was made by first performing a preliminary
phylogenetic analysis of the relatedness between these genomes using the
recombination breakpoints presented in Lytras et al.
4
Based on the preliminary analysis, clusters of genomes grouping with
SARS-CoV-1 and SARS-CoV-2 for at least one of their genomic segments and
showing the same overall recombination pattern were manually identified
and one representative genome was selected from each cluster for the
corresponding downstream GARD analyses. For the SARS-CoV-2-like set, 24
genomes including the SARS-CoV-2 reference Wuhan-Hu-1 (GenBank:
MN908947.3) were used in the GARD analysis. For the SARS-CoV-1-like set,
38 genomes including the SARS-CoV-1 HSZ-Cc genome (GenBank: AY394995.1)
were used in the analysis (
Data S1).
The representative sequences for each set were retrieved from the full
alignment. The ends of genomes are usually more difficult to sequence,
leading to shorter or extended 3’ and 5’ ends in the sequences. Due to
the incomplete coverage of these short genomic regions they are unlikely
to contain any informative sites for the recombination analysis and for
this reason were trimmed from each resulting subset alignment. GARD was
performed for each subset alignment using the HyPhy software suite
v.2.5.29
77
under a general time reversible (GTR) substitution model and a 3-class
general discrete distribution (GDD) to account for site-to-site rate
variation. The SARS-CoV-2-like GARD analysis finished with a total of 92
consecutive breakpoint models, while the SARS-CoV-1-like analysis
finished with 56. Based on these models, we accepted positions as
confident recombination breakpoints if they were present in at least 1/2
of the consecutive models for the SARS-CoV-2-like set and in at least
1/3 of the models for the SARS-CoV-1-like set (proportional to the total
number of models). Breakpoints closer than 100 alignment positions to
one another were merged. The resulting breakpoint positions were mapped
back to the full whole-genome alignment which was subsequently split
into the corresponding non-recombinant regions (NRR) for each set of
breakpoints. Since some NRRs have relatively short lengths, incorrect
alignment of sites may have a larger effect on the phylogenetic
inference. To avoid alignment errors from biasing the phylogenetic
inferences, all NRR alignments were processed using HMMCleaner,
78 which removes sequence segments that fit poorly in the rest of the alignment, before performing any downstream analyses.
To
separate all genomes into SARS-CoV-1-like and SARS-CoV-2-like viruses
we first inferred maximum likelihood phylogenies for every NRR alignment
using IQ-TREE (v.1.4.4)
79
under a GTR+I+F+G4 substitution model, assessing node support by
performing 10,000 replicates of the ultrafast bootstrap approximation
for each inference.
88
All trees were rooted by the outgroup clade of sarbecovirus genomes
sampled in Europe and Africa (GenBank: NC_014470.1, MZ190137.1,
KY352407.1, MT726043.1, MT726044.1, MT726045.1, MZ190138.1, MW719567.1)
and nodes with ultrafast bootstrap support below 70 were collapsed using
the ‘ete3’ python package.
80
Based on the resulting topologies, the viruses clustering
monophyletically with SARS-CoV-2 but not with SARS-CoV-1 for at least
one of the SARS-CoV-2-like GARD NRR trees were placed in the
SARS-CoV-2-like viruses and vice versa for the SARS-CoV-1-like virus
set. This led to two non-independent sets of 38 SARS-CoV-2-like and 140
SARS-CoV-1-like genomes (each including SARS-CoV-2 and SARS-CoV-1
respectively) used in the downstream dating and phylogeographic
analyses; there were 19 genomes shared between these two datasets. The
set to which each genome is assigned is listed in
Data S1.
Comparison of breakpoint positions
Based
on the GARD recombination analysis, 30 and 43 breakpoint positions were
detected for the two virus subgroups. Only one SARS-CoV-1 genome was
used in both GARD sets (
Data S1),
so overall breakpoint inferences are expected to represent independent
recombination patterns of known SARS-CoV-1-like and SARS-CoV-2-like
viruses respectively. The lengths of the resulting NRRs on the alignment
with all 250 sarbecovirus genomes had a median of 870 nt and ranged
from 309 to 3,414 nt for SARS-CoV-1-like viruses, and a median length of
595 nt and ranged from 168 to 2,706 nt for SARS-CoV-2-like viruses. It
should be noted that the last NRR is the longest due to extended ends at
the 3' end of the genomes.
We
calculated the distances between the closest position pairs between the
two breakpoint sets, resulting in 30 pair distances (each
SARS-CoV-1-like position matched with its closest SARS-CoV-2-like
position). To compare whether the pairs were more or less closely
located than one would expect by chance, we performed 100 simulations of
position sets randomly sampled across the same alignment length without
replacement (one set with 30 positions and one with 43 for each
simulation to match the empirical data on the number of breakpoints). We
then quantified the distances between closest pairs in the simulated
sets as done with the observed breakpoints. Both empirical and simulated
distances formed distributions with log-normal shapes, hence 90%
confidence intervals and means were estimated using Cox’s method for
inferring log-normal confidence intervals.
89
Similarly, the significantly closest observed breakpoint pairs were
defined as those with distances below the one-tailed lower 90%
confidence interval of the observed log-normal distance distribution.
Molecular clock calibration
There
is insufficient data to inform a molecular clock when using tip dating
with sarbecoviruses sampled from bats and pangolins.
6
To identify a suitable rate prior for each NRR identified from the
recombination analysis, we conducted a phylogenetic analysis of
SARS-CoVs sampled from humans.
Molecular
clock calibration across the 44 inferred NRRs of the SARS-CoV-2-like
viruses was conducted using a Bayesian approach with BEAST v1.10.5
81 and using the BEAGLE library v3 to increase computational performance.
82
We inferred the substitution rate of SARS-CoV-2 across each of these
NRRs by using a subset of SARS-CoV-2 genomes (n=1,000) from our
previously published Bayesian phylogenetic analysis of 3,959 genomes
sampled up to late 2020 in Europe
87 (
Data S1),
a non-parametric skygrid prior with 19 grid points and a cutoff of 1
year, and a Hasegawa-Kishino-Yano (HKY85) substitution model
90
with four gamma categories. The topologies and substitution models were
shared across all the NRRs, but assuming an independent and estimable
strict clock rate under an uninformative continuous-time Markov chain
(CTMC) rate reference prior
91
for each NRR. We ran one Markov chain Monte Carlo (MCMC) chain of 500
million generations, subsampling every 50,000 iterations to a continuous
parameter log file. The first 10% of the chains were discarded as
burn-in. Convergence and mixing was assessed in Tracer v1.7.1
83
and all relevant effective sample size (ESS) values for clock rates
were >100. We also perform an additional clock calibration without
any NRR partitioning to infer a single clock rate prior for the
phylogeographic analysis (see below). The rest of the parameterizations
are identical to the partitioned clock calibration inference.
To
calibrate the molecular clock for the SARS-CoV-1-like viruses analysis,
we first performed Bayesian phylogenetic inference on 98 human and
civet SARS-CoV-1 genomes (
Data S1),
using an HKY85 substitution model with a gamma site heterogeneity model
and 4 gamma categories (G4), a strict molecular clock, and an
exponential growth coalescent prior. We ran a single chain of 100
million generations, subsampling every 10 thousand iterations to a
continuous parameter log and tree files. We next inferred the
substitution rate of SARS-CoV-1 across the 31 inferred NRRs of the
SARS-CoV-1-like viruses by using an empirical tree distribution
(n=1,000) from these results, sharing the topologies and substitution
models across all the NRRs, with independent and estimable clock rates
under CTMC rate priors. We ran one chain of one million generations,
subsampling every one hundred iterations to a continuous parameter log
file. For each step of the inference, the first 15% of the chains were
discarded as burn-in. Convergence and mixing was assessed in Tracer and
all relevant effective sample size (ESS) values were >200.
Bayesian divergence time estimation
We
used BEAST to perform phylogenetic inference and ancestral state
reconstruction across the 44 NRRs of the SARS-CoV-2-like viruses. For
the primary analysis (
Figure S2),
we used all of the 44 SARS-CoV-2-like genomes (including SARS-CoV-2),
an uncorrelated relaxed clock using calibrated region-specific
normally-distributed substitution rate priors for each NRR based on
SARS-CoV-2 sampled from humans, a GTR+G4 substitution model, a
non-parametric skygrid prior
92
with 49 grid points and a cutoff of 1,000 (which translates to 1022
CE), and Hamiltonian Monte Carlo gradient-based sampling for node ages
and branch rates.
93,94
Specifically, we set the mean and standard deviation of the rate prior
for each NRR to equal the posterior mean and standard deviation
estimated during molecular clock calibration with SARS-CoV-2 genomes. We
tracked the stem age of SARS-CoV-2 (
i.e., where SARS-CoV-2
connects to the rest of the phylogeny, also referred to as the time of
the closest-inferred ancestor, with the parent node of SARS-CoV-2
referred to as the closest-inferred ancestor). For each NRR, we first
ran one chain of 20 million generations, subsampling every two thousand
iterations to continuous parameter log and tree files. The first 10% of
the chain was discarded as burn-in for each analysis. Convergence and
mixing was assessed in Tracer. For analyses of NRRs where any relevant
ESS values were <200, we ran a second chain with the same
specifications as the first, discarded the first 5–25% of the chain as
burn-in, and combined the two chains. All relevant ESS values for the
one- and two-chain analyses were >200.
We
performed the same analysis for the SARS-CoV-1-like viruses for its 31
NRRs, using all of the 140 SARS-CoV-1-like genomes (including
SARS-CoV-1). The BEAST parameterization was identical to the analysis
for the SARS-CoV-2-like viruses, except we (1) tracked the stem age of
SARS-CoV-1 (i.e., the time of the closest-inferred ancestor of
SARS-CoV-1), (2) used the relevant calibrated molecular clock priors for
the 31 NRRs of the SARS-CoV-1-like viruses, and (3) ran one chain of
100 million generations, sub-sampling every 10 thousand iterations to
continuous parameter log and tree files.
Bayesian divergence time estimation sensitivity analyses
We
performed a series of sensitivity analyses for the Bayesian divergence
time estimation: (i) using SARS-CoV-2 genomes from early 2020, rather
than late 2020, when calibrating the molecular clock prior for the
subsequent analyses of SARS-CoV-2-like viruses, and (ii) including two
additional human SARS-CoV-1 genomes and two civet SARS-CoV-1 genomes in
the SARS-CoV-1-like viruses analyses, representing both the 2002–2003
and 2003–2004 SARS-CoV-1 outbreaks. For (i), we performed molecular
clock calibration using the same approach as with the SARS-CoV-2 genomes
sampled from late 2020, except here using 863 genomes from the early
pandemic (
Data S1),
sharing the topologies and substitution models across all the NRRs, but
assuming an independent and estimable strict clock rate under an
uninformative CTMC rate prior
91
for each NRR. We ran one MCMC chain of one million generations,
subsampling every one thousand iterations to a continuous parameter log
file. The first 10–15% of the chains were discarded as burn-in.
Convergence and mixing was assessed in Tracer and all relevant effective
sample size (ESS) values were >200. There was no temporal overlap
between the datasets of SARS-CoV-2 genomes from early 2020 and late
2020; the latest genome in early 2020 dataset was sampled on 14 February
2020 and the earliest genome in the late 2020 dataset was sampled on 22
February 2020.
For (ii), SZ3
(GenBank: AY304486.1) was selected as representative of the earliest
sampled detection in the intermediate animal reservoir (market-based
civets). ZS-B (GenBank: AY394996.1) was selected to represent the early
stages of the human epidemic from late January 2003 onwards. As both of
these are from the 2002–2003 SARS-CoV-1 outbreak, we also included the
civet virus HC/SZ/266/03 (GenBank: AY545916.1) and the human virus
GZ0401 (GenBank: AY568539.1) from the 2003–2004 SARS-CoV-1 outbreak.
Convergence
and mixing were assessed in Tracer. For analyses of NRRs where any
relevant ESS values were <200, we ran up to two more chains with the
same specifications as the initial chain, discarded the first 10% of the
chain as burn-in, and combined all chains. All relevant ESS values for
the combined chains were >200.
Constructing the recombinant common ancestor
For
each NRR of the SARS-CoV-2-like viruses, we reconstructed the most
likely nucleotide sequence belonging to the parent node of SARS-CoV-2
based on the tree distribution from the primary Bayesian divergence time
estimation (
Figure S2). The most likely nucleotide sequences for the 44 NRRs are then concatenated to form the recombinant common ancestor (recCA)
14 of SARS-CoV-2.
To
analyze how the recCA changed as more genomes were released, we
constructed the recCA of SARS-CoV-2 as the genomes of closely related
sarbecoviruses were chronologically published. We performed ancestral
state reconstruction using the posterior trees with all 38 taxa. We next
identified the inferred sequence of the most recent common ancestor of
SARS-CoV-2 and its most closely related taxa based on the published
genomes as new genomes were sequenced. We then examined the genetic
similarity of the recCA, as well as the most closely related published
genome, to SARS-CoV-2 as a function of the publication date. The same
analysis was performed for SARS-CoV-1 and the SARS-CoV-1-like viruses
and each phylogenetic sensitivity analysis (see above).
Bayesian phylogeographic inference
We
inferred the evolutionary and geographic history of the 44 NRRs of the
SARS-CoV-2-like viruses with BEAST. In order to optimally inform a
flexible phylogeographic reconstruction approach, we jointly fit a
bivariate continuous diffusion model using a shared estimable precision
matrix across all independent random NRRs phylogenies, rescaling this
matrix across branches in each tree to model a relaxed random walk
(RRW).
95
All branch-specific RRW rate scalars are drawn from the same log-normal
distribution with a mean of 1 and an estimable standard deviation. In
the continuous phylogeographic diffusion model, we exclude the sampling
locations for the human and pangolin viruses and integrate over the
uncertainty for locations that lack precision. As a tree prior, we
specify a joint non-parametric skygrid prior with 49 grid points and a
cutoff of 1,000 (which translates to 1022 CE) or a cutoff of 0.5
substitutions per site for trees scaled in units of genetic distance
(see below). We further specify a joint HKY85 nucleotide substitution
model across NRRs. These relatively simple sequence evolutionary
parameterizations were chosen to meet the requirements for the Prisoner
of War (PoW) model
21 to rescale the evolutionary histories (see below). We perform two versions of this analysis (
Figure S2):
(i) using a hierarchical strict molecular clock model that allows for
the clock rate of each NRR to vary (but must average to the overall
rate), with the single inferred clock rate prior from above for the
overall clock rate, in order to estimate the trees in units of time and
(ii) using independent strict molecular clock models with clock rates
fixed to 1 and contemporaneous tips in order to estimate trees in
substitution units for PoW transformation. For (i), we also included
gamma-distributed among-site rate heterogeneity with a hierarchical
prior over the NRR-specific shape parameters. We ran two chains of 500
million generations, subsampling every 10 thousand iterations to
continuous parameter log and tree files; the first 10% of the chain was
discarded as burn-in. Convergence and mixing was assessed in Tracer, and
all relevant ESS values were >200.
We
performed the same analysis for the SARS-CoV-1-like virus set for its
31 NRRs. The BEAST parameterization was identical to the analysis for
the SARS-CoV-2-like viruses, but here we use the rate priors with
standard deviations divided by 5 to assist in convergence. We ran
multiple chains for >200 million generations, subsampling every 500
thousand iterations to continuous parameter log and tree files, and
discarded the necessary burn-in prior to combining the independent
samples. Convergence and mixing was assessed in Tracer, and all relevant
ESS values were >100.
To
account for the time-dependent decline in the evolutionary rate of
sarbecoviruses, we applied the PoW model, a mechanistic evolutionary
method for estimating virus substitution rates over long evolutionary
timescales. For this, we used the post-burn-in substitution trees
inferred from the previous step along with the posterior rate
distributions of the SARS-CoV-1-like viruses and the SARS-CoV-2-like
viruses to rescale substitution trees into time trees using the PoW
transformation (see ref Ghafari et al.
21 for more details on the method).
We used TreeAnnotator v1.10
81
to obtain and annotate a maximum clade credibility (MCC) tree for each
continuous phylogeographic reconstruction. Finally, we used the R
package ‘seraphim’
84
to extract the spatiotemporal information embedded within 100 trees
sampled from each post-burn-in posterior distribution and to visualize
the different phylogeographic reconstructions. For each NRR, we
estimated the weighted diffusion coefficients
29 with the ‘spreadStatistics’ function of the R package ‘seraphim’
84.
Bayesian phylogeographic inference sensitivity analysis
To
examine the consequences of potentially biased sampling on the
phylogeographic inference, we included in the Bayesian inference two
hypothetical unsampled taxa that were associated with a specific
location and a sampling date of 2020 but no observed sequence data. For
the analysis of the SARS-CoV-2-like viruses, one taxon was given a
sampling location associated with a limestone karst in Guizhou expected
to have high bat diversity, and the other taxon was given a sampling
location associated with caves in Hunan also expected to have high bat
diversity. Both of these locations are between Hubei, where SARS-CoV-2
emerged, and Yunnan, China and Northern Laos, where the bat viruses most
closely related to SARS-CoV-2 were sampled. For the analysis of the
SARS-CoV-1-like viruses, both taxa were assigned sampling locations in
Guangxi, between Guangdong, where SARS-CoV-1 emerged, and Yunnan, where
the bat viruses most closely related to SARS-CoV-1 were sampled. The
unsampled taxa were constrained to branch off from the branch connecting
the human SARS-CoV and closest-inferred ancestor.
Ranking the dispersal velocity of the lineage descending from the closest-inferred ancestor
Using
100 trees sampled from each post-burn-in posterior tree distribution
from each NRR from the tip-dated phylogeography analyses, we calculated
the dispersal velocity for each tip branch, excluding the branches
leading to the human or pangolin viruses. We next computed the minimum
distance between the closest-inferred ancestor and Guangdong province or
Hubei province for SARS-CoV-1 or SARS-CoV-2, respectively. The
necessary dispersal velocity for the closest-inferred ancestor to reach
either Guangdong or Hubei province was calculated using this minimum
distance and the branch leading to, respectively, SARS-CoV-1 or
SARS-CoV-2 (
i.e., the time between closest-inferred ancestor and
the SARS-CoV). Tip branches associated with taxa that have parent nodes
dated to more than 50 years before the most recently sampled virus were
excluded from downstream analyses. If the tip branch was excluded from
at least 50% of the samples for an NRR, the tip branch was removed from
the NRR entirely. If the tip branch associated with a human SARS-CoV was
excluded from at least 50% of the samples for an NRR, the NRR was
removed entirely from downstream analyses. The 50-year threshold was
chosen based on the observed substitution saturation inflection point (
Figure 3A), thereby also focusing primarily on the recent movement of sarbecoviruses.
We
then computed the rank of the dispersal velocity for each remaining tip
branch in each tree sample. These posterior dispersal velocity ranks
were pooled together for each tip across all samples for an overall
posterior dispersal velocity rank distribution, or across all samples
for each NRR for posterior dispersal velocity rank distributions per
NRR.
To calculate how much the
posterior dispersal velocity rank distribution deviates from an expected
average rank, we use an adjusted formula for skewness with a given
“mean” of 0.5:
Modeling of bat distributions
Our bat distribution database comprised of the data used in recently published studies,
27 recent Global Diversity Information Facility (GBIF) data,
56,57 and the Darkcide database.
58
The Darkcide database included recent data collected by bioacoustic
surveys in Chiang Mai and Kanchanaburi in Thailand by A.C. Hughes using
an echometer Pro, and multiple published datasets.
27,56,57,59,60,61,62,63,64,65,96,97
The data was integrated as we attempted to keep the database complete,
even if not all data is used within our particular study. All
Rhinolophid bat species, as well as
Chaerephon plicatus and
Aselliscus stoliczkanus
were extracted from the dataset, and then the dataset was clipped to
Southeast Asia (India to the West, Mongolia to the North, Japan to the
East and Malaysia to the South) to buffer the region of interest with
distribution data. Once all duplicate records had been removed, the
clipped data for Rhinolophids included 10,254 distribution points for
all species across the region, in addition to 235 records for
A. stoliczkanus and 108 for
C. plicatus.
For environmental data, we based variable selection on those used in Zhou et al.
27 and known from our previous work to be important to bats in the region.
66,67 All analysis was conducted in ArcMap 10.8 (ESRI) at a resolution of 0.008
o, which is approximately 1km
2, and all layers were clipped and subsampled to the same resolution using a mask. Climate variables were downloaded from Chelsa
68 (available at
https://chelsa-climate.org/downloads/)
based on a selection from our previous analyses (the bioclimatic
variables: Annual Mean Temperature, Mean Diurnal temperature Range,
Isothermality, Temperature Seasonality, Maximum and Minimum Temperature,
Annual Precipitation, Precipitation seasonality, Maximum and Minimum
precipitation).
66,67 From Envirem
69 (
https://envirem.github.io/)
we used the Thornthwaite Aridity index, climate moisture index,
continentality, embergers pluviothermic quotient, growing degree days
above 0, and both annual mean potential evapotranspiration and
seasonality of potential evapotranspiration. In addition we used actual
evapotranspiration,
70 a vegetation canopy height layer,
71 forest canopy height,
72 tree density,
73 distance to bedrock
74 (the bdticm and bdricm aggregated grids available at
https://github.com/ISRICWorldSoil/SoilGrids250m/), and bulk carbon stock.
74
These were all trimmed to the same region that bat data was clipped to
and then species distributions were modelled using Maxent
98 using the GUI following the same approach as Hughes et al.
66
For each species, three replicates were run and the average was used.
These probabilistic distribution maps were then reclassified using the
10th percentile cumulative logistic threshold to produce a binary
presence-absence map with ArcMap 10.8, and all models had an AUC
exceeding 0.85. The models were then clipped to the study region, and
any Rhinolophids with no modelled suitable habitat in the clipped study
area were removed from analysis. As the projections show suitable
habitat but do not account for species biogeography we then
cross-checked the 40 remaining Rhinolophid maps against the IUCN
database (
https://www.iucnredlist.org/),
and species not listed were checked against relevant publications on
the species. Based on the IUCN mapped ranges species endemic to Japan,
or species not found in Japan or at similar latitudes (areas of the
South Korean Peninsula are climatically suitable for a number of
species, but lack continuous habitat to reach these regions if largely
distributed in Southeast Asia) were then clipped to the
biogeographically suitable range. The binary maps were then stacked
using the Mosaic to New Raster function in ArcMap 10.8 to find the
maximum number of species that may be co-distributed.
https://www.cell.com/cell/fulltext/S0092-8674(25)00353-8