traductor

sábado, 10 de mayo de 2025

COVID-19:El comercio de animales salvajes como vía más probable de transmisión

 

 

El virus ancestral de la COVID-19 apareció a más de 2.000 km de Wuhan solo unos años antes de la pandemia: nuevo estudio apunta al comercio de animales salvajes como vía más probable de transmisión 

The recency and geographical origins of the bat viruses ancestral to SARS-CoV and SARS-CoV-2

Un nuevo estudio revela que el virus causante de la COVID-19 (SARS-CoV-2) siguió un patrón casi idéntico al del SARS, viajando miles de kilómetros a través del comercio de animales salvajes. 
 
 
La historia del coronavirus que desencadenó la pandemia global podría haber comenzado muy lejos de Wuhan, en las cuevas de murciélagos de China occidental o del norte de Laos. Pero lo realmente impactante no es el lugar de origen, sino la ruta inesperadamente rápida y distante que siguió el virus antes de llegar a los primeros contagios humanos. Un nuevo estudio publicado en la revista Cell —y resumido en un comunicado oficial de la Universidad de California en San Diego— apunta a una vía de transmisión que reaviva el foco sobre un viejo sospechoso: el comercio de fauna salvaje.
 
Este trabajo, realizado por un consorcio internacional de investigadores, no solo reescribe parte de la historia reciente de la pandemia, sino que también refuerza la teoría de un origen natural del SARS-CoV-2, muy similar al del brote de SARS en 2002. Y lo hace desmontando, con evidencia evolutiva y geográfica, uno de los argumentos más utilizados por los defensores de la teoría de la filtración de laboratorio: la gran distancia entre Wuhan y los hábitats naturales de los murciélagos portadores del virus. 

Los científicos analizaron 250 genomas de coronavirus estrechamente relacionados con SARS-CoV-1 y SARS-CoV-2. Estos virus pertenecen a una subfamilia llamada sarbecovirus, que se encuentra principalmente en murciélagos del género Rhinolophus (murciélagos de herradura). A lo largo de milenios, estos virus han circulado silenciosamente entre murciélagos en Asia, mutando y recombinando fragmentos de su material genético.

Aquí es donde entra en juego un elemento crucial: la recombinación viral. En lugar de heredar una única línea evolutiva, los sarbecovirus pueden intercambiar fragmentos genéticos entre sí cuando dos virus diferentes infectan al mismo murciélago. Esta capacidad convierte su genoma en una especie de mosaico difícil de rastrear. Por eso, los investigadores decidieron centrarse exclusivamente en las regiones del genoma que no han sufrido recombinación para construir árboles genealógicos más fiables.

Los resultados mostraron que los ancestros más cercanos al SARS-CoV-2 circularon en murciélagos del norte de Laos o de la provincia china de Yunnan entre cinco y siete años antes del inicio de la pandemia en Wuhan, situada a unos 2.700 kilómetros. Un patrón similar se observó con el SARS-CoV-1, cuyo ancestro más próximo apareció en el suroeste de China poco antes de que surgiera el brote de 2002 en Cantón, también a más de mil kilómetros de distancia. De hecho, el análisis reveló que los ancestros de sarbecovirus más recientes tanto del SARS-CoV-1 como del SARS-CoV-2 abandonaron sus puntos de origen menos de 10 años antes de que se informara por primera vez que infectaban a humanos, a más de mil kilómetros de distancia.

La conclusión es sorprendente: la dispersión natural de los murciélagos no puede explicar estos desplazamientos tan rápidos y extensos. Según el modelo desarrollado en el estudio, los sarbecovirus se mueven aproximadamente al mismo ritmo que sus huéspedes: unos pocos kilómetros por noche. Por tanto, el virus debió viajar por otra vía.


La ruta más probable: animales salvajes transportados por humanos

El estudio, liderado por instituciones como la Universidad de California en San Diego y el Instituto Rega de Bélgica, sugiere que el salto geográfico del virus se produjo a través del comercio de animales salvajes. Esta hipótesis ya se había barajado para el SARS-CoV-1, cuando se encontraron coronavirus en civetas y perros mapache vendidos en mercados del sur de China. Ahora, el nuevo análisis genético y geográfico refuerza la idea de que el comercio de fauna silvestre funcionó como puente entre murciélagos y humanos también en el caso de la COVID-19.

En concreto, los animales infectados habrían sido trasladados, posiblemente vivos, desde las zonas rurales del sur de Asia hasta mercados urbanos, como el de Huanan en Wuhan. Allí, en condiciones de hacinamiento y contacto estrecho entre especies y personas, el virus encontró el escenario perfecto para dar el salto definitivo.

Este patrón —virus que circulan en murciélagos durante años, salto a animales intermedios, transporte humano a zonas densamente pobladas y finalmente transmisión a humanos— es casi un calco entre el SARS original y el SARS-CoV-2. Y eso, para los autores del estudio, es una coincidencia demasiado precisa como para ignorarla.

¿Y qué pasa con la teoría del laboratorio?

Desde los inicios de la pandemia, la distancia geográfica entre Wuhan y las zonas con murciélagos portadores del virus ha sido uno de los caballos de batalla de quienes defienden un posible origen artificial del virus. Sin embargo, el nuevo estudio desmonta este argumento mostrando que esa distancia también existía en el caso del SARS de 2002. Es más, el patrón de propagación y evolución viral es extraordinariamente parecido en ambos casos.

Eso no significa, como advierten algunos expertos, que se pueda descartar completamente la hipótesis del escape de laboratorio. La falta de un huésped intermedio identificado para el SARS-CoV-2 deja todavía una brecha importante en la narrativa. Sin embargo, los hallazgos genéticos, geográficos y temporales presentados ahora inclinan la balanza hacia un origen natural facilitado por la actividad humana, especialmente el comercio de animales salvajes.

Hay que recordar que un estudio genético publicado el año pasado señalaba al mercado de Huanan, en Wuhan, como el posible epicentro del brote de la COVID-19, respaldando la hipótesis de un salto del virus desde animales a humanos. Pero no es el único.

Varios estudios publicados entre 2021 y 2023 han reforzado de forma consistente la hipótesis de un origen zoonótico del SARS-CoV-2. Un estudio de 2021 examinó de forma crítica las pruebas disponibles y concluyó que la transmisión del virus desde animales a humanos era la explicación más plausible. En 2022, una investigación independiente localizó el inicio del brote en el mercado mayorista de mariscos de Huanan, en Wuhan, descartando que el Instituto de Virología de la misma ciudad fuera el foco inicial.

Ese mismo año, otro análisis identificó al menos dos eventos de transmisión zoonótica independientes ocurridos en el mercado, con apenas una semana de diferencia, lo que sugiere una propagación desde múltiples animales infectados. En 2023, una nueva revisión cuestionó la falta de pruebas que respalden la teoría de una fuga de laboratorio y denunció la desinformación que se ha difundido en torno a la llamada “ganancia de función”, una técnica utilizada en virología que ha sido malinterpretada en el debate público.

Ese mismo año, un análisis crítico concluyó que no existen indicios de que el SARS-CoV-2 haya sido creado o manipulado artificialmente, y recordó que este virus es el noveno coronavirus conocido que ha pasado de animales a humanos, un patrón ya observado con otros patógenos. Un estudio independiente de 2023 llegó a las mismas conclusiones, apuntando nuevamente hacia un origen natural facilitado por el contacto estrecho entre humanos y fauna silvestre. Y, por último, el pasado año Shi Zhengli, la viróloga del Instituto de Virología de Wuhan frecuentemente mencionada en relación con la hipótesis del escape de laboratorio, publicó las secuencias de coronavirus almacenadas en su laboratorio, dejando claro que ninguna  de ellas guarda una relación cercana con el SARS-CoV-2.

Lecciones para futuras pandemias

Más allá del debate sobre los orígenes del SARS-CoV-2, el estudio ofrece una lección inquietante y urgente: las pandemias zoonóticas seguirán apareciendo si no se regula de forma estricta el comercio y transporte de animales salvajes. Las condiciones que hicieron posible la COVID-19 —contacto cercano entre especies, mercados abarrotados, transporte masivo de fauna desde zonas rurales a núcleos urbanos— siguen existiendo en muchas partes del mundo.

La vigilancia activa de los virus en murciélagos, así como en animales vendidos en mercados, se presenta como una herramienta fundamental para prevenir nuevas catástrofes sanitarias. La historia se ha repetido una vez, y los investigadores temen que vuelva a hacerlo si no se toman medidas contundentes.

Este estudio, publicado el 7 de mayo de 2025 en Cell, no cierra todas las preguntas, pero ofrece la evidencia más sólida hasta la fecha de que el comercio de vida silvestre fue el motor silencioso detrás de la llegada del SARS-CoV-2 a los humanos. En palabras de los investigadores, comprender la historia evolutiva de estos virus no es solo una cuestión académica: es una necesidad urgente para anticipar y contener la próxima pandemia.

https://www.cell.com/cell/fulltext/S0092-8674(25)00353-8 

 

The recency and geographical origins of the bat viruses ancestral to SARS-CoV and SARS-CoV-2

Jonathan E. Pekar1,2,3,4,21,23 jpekar@ed.ac.ukSpyros Lytras5,6,21 spyros@g.ecc.u-tokyo.ac.jpMahan Ghafari7 ∙ … Michael Worobey20,22 worobey@arizona.eduJoel O. Wertheim4,22 jwertheim@health.ucsd.eduPhilippe Lemey

 

Fragments of the human SARS-CoVs share very recent common ancestors with bat viruses
SARS-CoV-1-like and SARS-CoV-2-like viruses have circulated in Asia for millennia
Recent ancestors of human SARS-CoVs likely circulated in Western China and Northern Laos
These ancestors traveled unexpectedly fast to reach sites of human emergence

Summary

The emergence of SARS-CoV in 2002 and SARS-CoV-2 in 2019 led to increased sampling of sarbecoviruses circulating in horseshoe bats. Employing phylogenetic inference while accounting for recombination of bat sarbecoviruses, we find that the closest-inferred bat virus ancestors of SARS-CoV and SARS-CoV-2 existed less than a decade prior to their emergence in humans. Phylogeographic analyses show bat sarbecoviruses traveled at rates approximating their horseshoe bat hosts and circulated in Asia for millennia. We find that the direct ancestors of SARS-CoV and SARS-CoV-2 are unlikely to have reached their respective sites of emergence via dispersal in the bat reservoir alone, supporting interactions with intermediate hosts through wildlife trade playing a role in zoonotic spillover. These results can guide future sampling efforts and demonstrate that viral genomic regions extremely closely related to SARS-CoV and SARS-CoV-2 were circulating in horseshoe bats, confirming their importance as the reservoir species for SARS viruses.
Horseshoe bats (Rhinolophus spp.) are the main hosts of the Sarbecovirus subgenus1 (genus Betacoronavirus, family Coronaviridae) and, likely via virus transmission through transient intermediate species, were the source of SARS-CoV (referred to hereafter as SARS-CoV-1, now extinct) and SARS-CoV-2.2,3,4 Since the first emergence of SARS-CoV-1 in Guangzhou, Guangdong Province, in 2002 and SARS-CoV-2 (jointly referred to as the SARS-CoVs) in Wuhan, Hubei Province, in 2019, the sampling and analyses of sarbecoviruses in horseshoe bats has contributed to our understanding of sarbecovirus diversity and geographical spread. Although sarbecoviruses exhibit substantial geographic structuring linked to the ranges of their host horseshoe bat species,2,3,4,5,6 where and when their ancestors circulated, both in recent and ancient history,7 is poorly understood.
Genome-wide sequence identity is typically used to compare bat sarbecoviruses with the SARS-CoVs, but because coronaviruses frequently recombine,5,6 whole-genome identity of these bat viruses and the SARS-CoVs does not adequately reflect their complex evolutionary histories (Figure S1A). Previous attempts to date the most recent common ancestor (MRCA) of SARS-CoV-2 and a closely related bat virus genome using long genomic fragments resulted in estimates of 30–40 years before 2019.4,6,8 However, when analyzing all non-recombinant regions (NRRs), we will inevitably find some NRRs with older and some with younger MRCAs than either this estimate from a long fragment or an estimate based on the entire genome, which would effectively be a weighted average of NRR time of the MRCAs (tMRCAs). To find a bat virus genome that has not undergone recombination relative to either of the SARS-CoVs, it would likely have been necessary to identify the sarbecovirus involved in each cross-species transmission event leading to the SARS-CoVs (which we refer to as the “direct ancestor”; Figure S1B) or a very closely related virus from a bat in the same ecological community infected around the time of the transmission event. What we have, instead, is a relatively small sample of bat sarbecovirus genome sequences, gathered across different times and places. Hence, it is necessary to identify all the NRRs of sarbecoviruses—genomic regions within which there is little to no detectable signal of recombination among the analyzed genomes and representing, to the extent possible, a single evolutionary history. These genomic regions provide the clearest insights into the evolutionary history of SARS-CoV-1 and SARS-CoV-2 and the origins of their respective outbreaks.
Here, we separately analyze the recombination patterns and respective evolutionary histories of SARS-CoV-1 and its closely related viruses (referred to as SARS-CoV-1-like viruses) and SARS-CoV-2 and its closely related viruses (referred to as SARS-CoV-2-like viruses). We estimate the separate sets of NRRs for the SARS-CoV-1-like and SARS-CoV-2-like viruses. For each NRR, we utilize a molecular clock rate specific for that portion of the genome and determine the MRCA of the SARS-CoV and the most closely related bat virus, or set of viruses, among currently sampled sarbecoviruses for the NRR. We refer to this MRCA as the “closest-inferred ancestor” for that NRR because it is the parent node of the SARS-CoV in the phylogeny (Figure S1B) and is limited by the current sample set of bat sarbecoviruses. Additionally, the dating of deeper nodes in the phylogeny is complicated by substitution saturation: the repeated occurrence of nucleotide substitutions at the same site resulting in biased estimation of older divergence in viral phylogenies.9,10 We therefore account for substitution saturation in our NRR-specific inferences when examining the deeper evolutionary history of sarbecoviruses.
We show that the available sequence data provide evidence of the closest-inferred bat virus ancestors of both SARS-CoV-1 and SARS-CoV-2 circulating in bats in the years immediately preceding the emergence of each SARS-CoV. By performing phylogeographic analyses of these sarbecoviruses, we demonstrate that the sarbecoviruses typically spread at a rate that approximates their bat hosts. Based on this rate and current sampling, we find that both SARS-CoV-1 and SARS-CoV-2 likely emerged far from where their direct bat virus ancestors likely circulated and were therefore unlikely to reach their provinces of emergence via bat host movement alone.

Results

Inference of NRRs

The evolutionary history of sarbecoviruses can only be accurately inferred when focusing on NRRs rather than entire genomes or specific genes because these viruses, as well as their ancestors, have undergone extensive recombination.4,6,8 To identify these recombination patterns, we aligned the 250 full-genome sarbecovirus sequences that were available at the time of this analysis, including SARS-CoV-1 and SARS-CoV-2. We used the genetic algorithm for recombination detection (GARD)11 to independently identify recombination breakpoints on genomes closely related to SARS-CoV-1 and SARS-CoV-2 (Figure S2). We then performed separate analyses to reduce the potentially overlapping recombination signal between the sets and increase our power for confidently detecting breakpoint positions unique to each set. The two sequence sets were separated based on previously published recombination-aware phylogenetic analyses of most genomes used here4,12,13 (see STAR Methods). The two recombination breakpoint analyses identified 31 putative NRRs for the SARS-CoV-1-like viruses, with a median length of 870 nucleotides (nt) and a range of 309–3,414 nt, and 44 putative NRRs for the SARS-CoV-2-like viruses, with a median length of 595 nt and a range of 168–2,706 nt (Figure 1A).
 
To examine whether the inferred positions of recombination breakpoints of the SARS-CoV-1-like and SARS-CoV-2-like NRR analyses were closer to each other than expected, we simulated randomly distributed recombination breakpoint sets along the alignment for the SARS-CoV-1-like and SARS-CoV-2-like viruses (30 and 43 breakpoints, respectively) and compared the distribution of distances between the two breakpoint sets. The empirical breakpoint sets had a mean genomic distance of 203 nt (90% confidence interval [CI]: 138–299 nt) separating the pairs of closest SARS-CoV-1-like and SARS-CoV-2-like virus breakpoints, whereas the simulated distances had a mean of 396 nt (90% CI: 259–606 nt; Figure S3). Additionally, there were seven breakpoint pairs between the two virus sets that were in the smallest 10% of the observed distribution of distances between breakpoints (10–51 nt apart; Figure 1A), indicating recombination hotspots shared between the two sarbecovirus groups (e.g., near nsp3 and between the Spike N-terminal domain [NTD] and receptor-binding domain [RBD]).

Recent bat virus ancestors of human SARS-CoVs

Focusing on each NRR identified in the recombination analysis, we examined when the closest-inferred bat viruses ancestral to SARS-CoV-1 and SARS-CoV-2 circulated. These ancestors are represented by the MRCA of each SARS-CoV and the most closely related bat viruses, which is akin to the parent node of each SARS-CoV in its respective phylogeny (Figure S1B).
To examine whether the inferred positions of recombination breakpoints of the SARS-CoV-1-like and SARS-CoV-2-like NRR analyses were closer to each other than expected, we simulated randomly distributed recombination breakpoint sets along the alignment for the SARS-CoV-1-like and SARS-CoV-2-like viruses (30 and 43 breakpoints, respectively) and compared the distribution of distances between the two breakpoint sets. The empirical breakpoint sets had a mean genomic distance of 203 nt (90% confidence interval [CI]: 138–299 nt) separating the pairs of closest SARS-CoV-1-like and SARS-CoV-2-like virus breakpoints, whereas the simulated distances had a mean of 396 nt (90% CI: 259–606 nt; Figure S3). Additionally, there were seven breakpoint pairs between the two virus sets that were in the smallest 10% of the observed distribution of distances between breakpoints (10–51 nt apart; Figure 1A), indicating recombination hotspots shared between the two sarbecovirus groups (e.g., near nsp3 and between the Spike N-terminal domain [NTD] and receptor-binding domain [RBD]).
Figure S3 Breakpoint pair distance distributions of the SARS-CoV-1-like and SARS-CoV-2-like viruses, related to Figure 1

Recent bat virus ancestors of human SARS-CoVs

Focusing on each NRR identified in the recombination analysis, we examined when the closest-inferred bat viruses ancestral to SARS-CoV-1 and SARS-CoV-2 circulated. These ancestors are represented by the MRCA of each SARS-CoV and the most closely related bat viruses, which is akin to the parent node of each SARS-CoV in its respective phylogeny (Figure S1B).
There is insufficient temporal signal when calibrating a molecular clock using tip dating with sarbecoviruses sampled from bats and pangolins (Manis javanica),6 likely as a consequence of limited sampling across space and time. Therefore, we used SARS-CoV-1 genomes to identify a suitable rate prior for the molecular clock in our timescaled phylogenetic analyses (Figure S2; see STAR Methods), and these inferred substitution rates served as NRR-specific rate priors for subsequent Bayesian phylogenetic inference of a SARS-CoV-1 genome from 2002 and 139 SARS-CoV-1-like viruses (Figure S4A; Table S1). The median time of the closest-inferred bat virus ancestor of SARS-CoV-1 within each NRR varied by several decades (Figure 1B), and the median of these medians is 1996. However, the NRR with the most recent closest-inferred ancestor of SARS-CoV-1 had an inferred date of 2001 (95% highest posterior density [HPD]: 1998–2002; NRR 14 [Orf1b]; 1,065 nt), only 1 year before the emergence of SARS-CoV-1 in 2002. The viruses most closely related to SARS-CoV-1 within this NRR were YNLF_31C (GenBank: KP886808.1) and YNLF_34C (GenBank: KP886809.1), sampled in Yunnan, China, in 2013. Our findings were consistent, albeit slightly earlier, when we included additional human and masked palm civet (Paguma larvata) SARS-CoV-1 genomes from both the 2002–2003 and 2003–2004 outbreaks (Figure S4C; Table S1).
We similarly used the evolutionary rate of SARS-CoV-2 across the NRRs of the SARS-CoV-2-like viruses (Figure S4B; Table S1), inferred from SARS-CoV-2 genomes sampled up to late 2020, as priors for the phylogenetic analysis of the NRRs of the SARS-CoV-2-like viruses, comprising SARS-CoV-2 and 37 SARS-CoV-2-like genomes. The time of the closest-inferred bat virus ancestor of SARS-CoV-2 ranged from several years to several decades preceding late 2019—when SARS-CoV-2 was introduced into humans—across the 44 NRRs (Figure 1C). Although the median time of the closest-inferred ancestor of SARS-CoV-2 was in 2000, the two NRRs with the most recent time of the closest-inferred ancestor both had an inferred date of 2014 (95% HPDs: 2002–2019 and 2007–2018; NRR 37 [Orf3a and E] and 43 [N]; 347 and 331 nt, respectively), only 5 years prior to its introduction into humans.14 The genome most closely related to SARS-CoV-2 within NRR 37 is RacCS203 (GenBank: MW251308.1), sampled in Thailand in 2020, whereas the genome most closely related to SARS-CoV-2 within NRR 43 is RpYN06 (GenBank: MZ081381.1), sampled in Yunnan in 2021. We note that although these two NRRs are each less than 350 nt, NRR 42 is 456 nt and has a similar time of the closest-inferred ancestor of 2013 (95% HPD: 1999–2019). Within NRR 42, the genomes most closely related to SARS-CoV-2 are BANAL-20-52 (GenBank: MZ937000.1) and BANAL-20-236 (GenBank: MZ937003.1), both sampled in Laos in 2020. When using a faster rate prior based on SARS-CoV-2 genomes sampled in early 2020 (Figure S4B; Table S1), we obtain slightly more recent estimates of the closest-inferred ancestor (Figure S4D; Table S1). Our inferred ages for the closest-inferred ancestors of each of the SARS-CoVs across the NRRs indicate that several of the published sarbecovirus genomes include non-recombinant fragments that are descendant from viruses that circulated likely only several years before the emergence of SARS-CoV-1 and SARS-CoV-2.
The times of closest-inferred ancestor of SARS-CoV-1 across the NRRs are nearer to the date of human emergence than those for SARS-CoV-2 (Figures 1B and 1C): 21 NRRs of the SARS-CoV-1-like viruses have a median time of the closest-inferred ancestor of SARS-CoV-1 within 10 years of the emergence of SARS-CoV-1, whereas only 7 NRRs of the SARS-CoV-2-like viruses have a closest-inferred ancestor of SARS-CoV-2 within 10 years of the emergence of SARS-CoV-2. This pattern could be explained in part by the sampling of a substantially larger number of SARS-CoV-1- than SARS-CoV-2-like genomes in bats. However, using a faster rate prior based on SARS-CoV-2 genomes sampled in early 2020 results in 17 NRRs having a closest-inferred ancestor of SARS-CoV-2 within 10 years of its emergence (Figure S4D). Both SARS-CoV-1-like and SARS-CoV-2-like viruses consistently have older closest-inferred ancestors across the Spike gene (Figures 1B and 1C). The oldest time of the closest-inferred bat virus ancestor within a single NRR of SARS-CoV-1 and SARS-CoV-2 was, respectively, 1963 (95% HPD: 1934–1985; NRR 19 [Orf1b]) and 1944 (95% HPD: 1798–2009; NRR 28 [Spike]). Notably, only 1 NRR among the SARS-CoV-1-like viruses (NRR 19) and 6 NRRs among the SARS-CoV-2-like viruses (NRRs 10, 27, 28, 31, 32, and 38) have a median time of the closest-inferred ancestor more than 50 years earlier than the most recently sampled SARS-CoV-1-like or SARS-CoV-2-like virus, respectively.
Figure S4 Substitution rates for the SARS-CoV-1-like and SARS-CoV-2-like viruses and time of the closest-inferred ancestor using additional genomes or different molecular clock priors, related to Figure 1

Effect of increased sampling on the recombinant common ancestors

To test the influence of sampling, we quantified the extent to which the publication of more viruses closely related to SARS-CoV-1 and SARS-CoV-2 has contributed toward our understanding of the ancestor of each of these human viruses. The respective recombinant common ancestor (“recCA”) of SARS-CoV-1 and SARS-CoV-2 can be understood as the aggregate of the closest-inferred ancestors of each of the SARS-CoVs across each of its NRRs,14 with each closest-inferred ancestor existing at different points in time and with potentially different viruses most closely related to the SARS-CoV. The recCA therefore accounts for the closest relative(s) across all non-recombinant segments. We inferred the genome of the recCA of each SARS-CoV over time, progressively adding published genomes to the reconstruction based on their publication date. Then, we calculated the similarity of each recCA to its respective SARS-CoV as a function of the publication date to examine how additional closely related genomes can affect ancestral reconstruction and reflect our increasing understanding of sarbecoviruses over time (Figure 2A).
Figure 2 The recCA (recombinant common ancestor) over time
We observe that, as more genomes were published, the SARS-CoV-1 recCA and SARS-CoV-2 recCA became more similar to SARS-CoV-1 and SARS-CoV-2, respectively (Figures 2A and 2B). The full bat virus genomes most similar to SARS-CoV-1 and SARS-CoV-2 are, respectively, YN2020E (96.2% genetic identity to SARS-CoV-1; GenBank: OK017855.1) and BANAL-20-52 (96.8% genetic identity to SARS-CoV-2), both published in 2021. By the end of 2022, the recCA of SARS-CoV-2 shares 98.8% genetic identity with SARS-CoV-2 and the recCA of SARS-CoV-1 shares 98.7% genetic identity with SARS-CoV-1, both expectedly exceeding the genetic identity of the bat virus genomes most genetically similar to the human SARS-CoVs. Notably, the recCA repeatedly increases in similarity to its respective SARS-CoV even when newly published genomes share less genetic identity with the recCA than the most similar genome at the time (Figures 2A and 2B). This trend can be observed in detail at the level of individual NRRs: sequentially published genomes increase the genetic identity of some of the NRRs of the recCA to the matching regions of its SARS-CoV (Figures 2C and 2D). The shared genetic identity of the recCA to its respective SARS-CoV is broadly consistent when incorporating additional SARS-CoV-1 genomes with the SARS-CoV-1-like viruses (Figures S5A and S5C) and when using a faster rate prior in the Bayesian time-calibrated phylogenetic inference of the SARS-CoV-2-like viruses (Figures S5B and S5D).
Figure S5 The recCA (recombinant common ancestor) of SARS-CoV-1 over time when including additional SARS-CoV-1 genomes, and SARS-CoV-2 over time when using rate priors based on SARS-CoV-2 genomes sampled from early 2020, related to Figure 2
The patterns observed here are the result of newly added genomes containing fragments that descend from a more closely related ancestor despite not being the most genetically similar overall, and further highlight the importance of inferring NRRs in evolutionary analyses. Additionally, the similar genetic identity of the recCAs to their respective SARS-CoVs, both genome-wide and often at the level of individual NRRs, suggest that the greater number of sampled SARS-CoV-1-like viruses than SARS-CoV-2-like viruses is not solely responsible for the times of closest-inferred ancestor of SARS-CoV-1 being consistently nearer to the date of human emergence than those for SARS-CoV-2.

Phylogeography of bat sarbecoviruses

Sarbecovirus NRR phylogenies have an appreciable degree of geographic structuring linked to their hosts’ ranges.6 To understand the spatiotemporal spread of the SARS-CoV-1-like and SARS-CoV-2-like viruses in bats, we simultaneously fitted a spatially explicit phylogeographic model to each NRR. As the sampling locations of SARS-CoV-1, SARS-CoV-2, and the pangolin sarbecoviruses likely do not represent where their direct bat virus ancestors circulated,15,16,17,18 we excluded their locations from the phylogeographic analyses to avoid the impact of dispersal of non-bat hosts (Figure S2).
Due to substitution saturation, the deeper evolutionary history of rapidly evolving viruses, including sarbecoviruses, are underestimated using standard molecular clock approaches.7,10,19,20,21 To account for the effects of substitution saturation on the SARS-CoV-1-like and SARS-CoV-2-like phylogenies when trying to understand the long-term dispersal patterns of sarbecoviruses, we applied the prisoner of war (PoW) model21—a molecular clock model that estimates divergence times while accounting for time-dependent evolutionary rates—to the posterior phylogenies with phylogeographic estimates scaled in units of genetic distance (Figure S2; see STAR Methods). We find that substitution saturation primarily affected the inferred time of divergence events more than 50 years before the most recently sampled virus in our dataset, with younger internal nodes having relatively unchanged ages and indicating that our inferences of the time of the ancestors of human SARS-CoVs and their closest bat sarbecoviruses are unbiased (Figures 3A and 3B). The effects of substitution saturation can, however, be particularly more pronounced for older divergence estimates (Figures 3A and 3B). When examining the entire phylogenies, the tMRCAs were often hundreds of years old when using a standard clock model, whereas the PoW-transformed phylogenies had tMRCAs that were typically thousands to tens of thousands of years older (Figures 3C and 3D).
Figure 3 Time to the MRCA after PoW transformation
To investigate the dispersal history of the two sets of viruses, we mapped the PoW-transformed phylogeny of each NRR across Asia (Figures 4G and 4H; Data S2). The phylogeographic reconstructions of both sets of viruses suggest a few lineages having dispersed over long distances, but the number of these dispersals is likely an underestimate due to undersampling of sarbecoviruses from various regions in Asia. The inferred dispersal history of viral lineages looks consistent among NRRs (Figures 4G and 4H; Data S2): we observe a similar spatial distribution of the sets of viruses and their connections on the different NRR maps. Moreover, the virus lineages are spatially structured, with phylogenetically similar viruses being generally sampled in geographically similar regions with almost no back and forth travel of viral lineages across relatively large regions of the study area, which we confirmed by computing the isolation-by-distance (IBD) signal22 for each NRR (Table S2). Our analyses indeed reveal a moderate to strong IBD signal when computing the correlation between the patristic distance—the sum of the branch lengths connecting two nodes on a tree—and great-circle geographic distance computed for each pair of bat virus samples: a Spearman correlation coefficient of 0.45 for SARS-CoV-1-like viruses (95% HPD: 0.36–0.60) and of 0.67 for SARS-CoV-2-like viruses (95% HPD: 0.29–0.84) (see Table S2 for a comparison with other viruses).
Figure 4 Estimated evolutionary history of SARS-CoV-1-like (left) and SARS-CoV-2-like (right) viruses

Location of the closest-inferred ancestors of the human SARS-CoVs

Determining where the recent ancestors of SARS-CoV-1 and SARS-CoV-2 circulated could inform future sampling efforts and elucidate how these virus lineages eventually traveled to their provinces of emergence. As the PoW model enforces contemporaneously sampled tips, it can produce unreliable results for the most recent inferred ages. However, the PoW transformation shows that the timing of divergence events in the last ∼50 years inferred from standard molecular clock models are unbiased (Figures 3A and 3B). Therefore, we performed a separate tip-dated phylogeographic analysis to infer where the closest-inferred ancestors of each SARS-CoV circulated in the past 50 years (Figure S2).
We find that the closest-inferred ancestors of SARS-CoV-1 likely circulated in western China (Yunnan, Sichuan, or Guizhou) (Figure 4I). The closest-inferred ancestors of SARS-CoV-2 likely circulated in Yunnan, China, or Northern Laos, overlapping with contiguous karst and cave landscapes extending through these regions23,24 (Figure 4J). Our results indicate that both SARS-CoV-1 and SARS-CoV-2 emerged in humans over a thousand kilometers from where their closest-inferred bat virus ancestors likely circulated, with SARS-CoV-1 emerging in Guangzhou and SARS-CoV-2 in Wuhan. The inferred geographic location of the closest-inferred ancestors of each SARS-CoV overlapped with regions with moderate horseshoe bat species richness (Figure 4B; Data S2), including in regions known to harbor R. affinis, R. pusillus, and R. ferrumequinum (Data S2), and the closest-inferred ancestor of SARS-CoV-2 additionally overlapped with known locations of R. malayanus circulation (Data S2). Although sarbecoviruses have been found in bats from Guangdong and Hubei,13,25,26 the viruses most genetically similar to SARS-CoV-1 and SARS-CoV-2, both at the genome level and among individual NRRs, have been predominantly sampled in and near Southwest China.27,28 Therefore, our findings are not solely a result of localized sampling efforts but reflect a natural pattern in the geographic distribution of bat sarbecoviruses, though more sampling in areas proximal to the Laos and Vietnam borders may identify additional CoVs descendant from relatively recent ancestors circulating in bats.

Sarbecoviruses spread at rates that approximate bat host dispersal

To investigate how fast the SARS-CoV-1-like and SARS-CoV-2-like viruses were recently spreading across Asia, we inferred the weighted diffusion coefficient,29 a measure of the diffusivity of viral transmission, for the 50 years before the most recently sampled genome in each dataset (Figure S2). We found that the SARS-CoV-1-like viruses spread with a weighted diffusion coefficient of 1,666 km2/year (95% HPD: 811–3,399; Figure 4E), and the SARS-CoV-2-like viruses circulated with a weighted diffusion coefficient of 740 km2/year (95% HPD: 116–2,587; Figure 4F). The SARS-CoV-1-like viruses were likely inferred to spread, on average, slightly more quickly than the SARS-CoV-2-like viruses because of the recent dispersal events to South Korea. These results were similar to the diffusion coefficients calculated for the 75 years preceding the most recently sampled genomes (Table S2). Horseshoe bats, the primary hosts of sarbecoviruses, tend to forage within around 2–3 km of their roost30,31,32,33 and have a reported diffusion coefficient of approximately 2,000 km2/year (Henley et al.34). Therefore, both the SARS-CoV-1-like and SARS-CoV-2-like viruses (in the context of host infections) traveled at a roughly similar rate as their hosts, suggesting that these viruses are able to efficiently explore new niches, likely due to the density of horseshoe bat roosts within the karst landscapes of Southern China.

Rapid movement of lineages leading to human SARS-CoVs

To test whether the bat virus ancestors of each human SARS-CoV could have spread to where the SARS-CoVs emerged in humans via dispersal of SARS-related coronaviruses across bat populations, we determined the minimum distance from the location of the closest-inferred bat virus ancestor to the province of emergence (Figures 4I and 4J), and, given the time between emergence and this ancestor (Figures 1B and 1C), we calculated the dispersal velocity this distance implies. We compared the implied dispersal velocity for the tip branch leading to SARS-CoV-2 to the dispersal velocity of bat virus tip branches to obtain a posterior rank distribution (Figure S6). The tip branches leading to SARS-CoV-1 and SARS-CoV-2 had the highest mean posterior dispersal velocity rank (Figure 5A), but there were other tip branches leading to SARS-CoV-1-like viruses with consistently high posterior ranks as well. Strikingly, the posterior rank distributions for the tip branches of SARS-CoV-1 and SARS-CoV-2 exhibit the most extreme skewness away from a rank of 0.5 and toward 1 (Figure 5B; see STAR Methods), with their posterior rank distributions rarely including values below 0.5 (Figure S6). This pattern further indicates the rapid nature of the dispersal velocities associated with the tip branches leading to SARS-CoV-1 and SARS-CoV-2.
Figure 5 Dispersal velocity rank and skewness of SARS-CoV-1-like and SARS-CoV-2-like viruses
Figure S6 Posterior rank distributions of the dispersal velocity of tip branches from the SARS-CoV-1-like and SARS-CoV-2-like virus tip-dated phylogeographies, related to Figure 5
We next explored the contributions of individual NRRs to the relatively fast dispersal velocities (and, therefore, high posterior ranks) required for the closest-inferred ancestors of SARS-CoV-1 and SARS-CoV-2 to reach their respective provinces of emergence. The tip branches leading to SARS-CoV-1 and SARS-CoV-2 have the greatest mean posterior rank in, respectively, 12 and 28 of the NRRs, indicating that it is not only NRRs with recent closest-inferred ancestors of the SARS-CoVs (Figure 1) that result in relatively high dispersal velocities for the tip branches leading to the SARS-CoVs. Considering the large distance separating the closest-inferred bat virus ancestors from where SARS-CoV-1 and SARS-CoV-2 each emerged (Figures 4I and 4J), and the high dispersal velocities necessary to traverse this distance (Figure 5), it is unlikely that the lineages descending from the closest-inferred bat virus ancestors of SARS-CoV-1 and SARS-CoV-2 reached Guangdong and Hubei, respectively, solely via dispersal through their bat reservoirs.

Phylogeography robust to hypothetical unsampled diversity

We evaluated whether our findings could be affected by the hypothetical discovery of viruses closely related to the human SARS-CoVs nearer to their provinces of emergence than has been observed in the empirical data. We conducted tip-dated phylogeographic analyses that incorporated hypothetical unsampled viruses, which represent unsequenced and unobserved viral diversity. These unsampled taxa were positioned between the provinces of emergence of SARS-CoV-1 and SARS-CoV-2 and the locations where the taxa most genetically similar to them were sampled, while also enforcing monophyly between the unsampled taxa and the respective SARS-CoV (Figure S7; see STAR Methods), thereby ensuring that these viruses were more phylogenetically similar to SARS-CoV-1 and SARS-CoV-2 across the entire genome.
Figure S7 Phylogenies including hypothetical unsampled taxa, related to Figure 6
The inclusion of these unsampled taxa resulted in a shift in the inferred location of the closest-inferred ancestor toward the province of emergence of the respective SARS-CoV, although, notably, the 50% and 75% HPDs were consistent (Figure S8). There were slight reductions in the dispersal velocities necessary for the closest-inferred ancestors of SARS-CoV-1 and SARS-CoV-2 to reach their provinces of emergence (Figure S9), primarily due to the shift in the inferred location of the closest-inferred ancestor. However, the tip branches leading to SARS-CoV-1 and SARS-CoV-2 still had the highest mean posterior dispersal velocity rank on average and across many NRRs, as well as the most positive skewness (Figure 6). Other measures to quantify differences in dispersal velocity also generally support the relatively high ranks of the tip branches leading to SARS-CoV-1 and SARS-CoV-2 but are more sensitive to potentially unsampled diversity, in particular for SARS-CoV-1 (Figure S10).
Figure 6 Dispersal velocity rank and skewness of SARS-CoV-1-like and SARS-CoV-2-like viruses when including hypothetical unsampled taxa
Figure S8 Inferred location of the closest-inferred ancestor comparison when including hypothetical unsampled taxa, related to Figures 4, 5, and 6
Figure S9 Comparison of dispersal velocity when including hypothetical unsampled taxa, related to Figures 4, 5, and 6
Figure S10 Additional dispersal velocity rank statistics of SARS-CoV-1-like and SARS-CoV-2-like viruses, related to Figures 5 and 6

Discussion

By reconstructing the evolutionary history of sarbecoviruses while accounting for recombination, we show that the ancestors of SARS-CoV-1 and SARS-CoV-2 likely circulated in horseshoe bat populations hundreds to thousands of kilometers away from the sites of the emergence of these viruses in humans and as recently as one to six years prior to this emergence. Our findings indicate that there would not have been sufficient time for the direct bat virus ancestor to reach the locations of emergence of the human SARS-CoVs via normal dispersal through bat populations alone.
Sarbecoviruses undergo extensive recombination and tend to share breakpoints at specific genome regions—so-called recombination hotspots.4,35 Comparisons of non-recombinant subgenomic segments when less data was available previously suggested there may have been decades of separation between the closest-inferred bat virus ancestor and each of the SARS-CoVs.6,36 However, dating the closest-inferred ancestor across non-recombinant subgenomic regions spanning the entire genome after calibrating a molecular clock on a per region level reveals much more recent ancestry between the SARS-CoVs and their bat-circulating relatives. The non-recombinant fragments, therefore, must be comprehensively analyzed to properly understand the emergence of the human SARS-CoVs.5,6 We additionally show here that adding new genomes to the recombination-aware ancestral reconstruction brought the sequence identity of the recombinant common ancestor of each SARS-CoV to its respective human virus to greater than 98% genetic identity.
The phylogeographic approach implemented here provides important insights into the sarbecoviruses’ extensive historical movement across China and Southeast Asia. Our analysis confirms that both the SARS-CoV-1-like and SARS-CoV-2-like viruses have each been circulating in these regions for at least tens, if not hundreds, of thousands of years. Conversely, the bat species endemic to Japan known to harbor sarbecoviruses, R. cornutus, and its phylogenetically closest mainland species infected by SARS-CoV-2-like viruses, R. pusillus, are estimated to have shared a common ancestor 13.3 million years ago.37 This much more ancient separation between the host species indicates that the viruses did not strictly co-diverge with these bat species. We find that, in more recent history, the viral lineages diffuse at a similar rate to horseshoe bats. These bats typically have small home ranges (2–3 km2 movement per night on average), are rarely migratory,32,38 and have a limited dispersal capacity with scant evidence of longer migrations, with such evidence limited to species in temperate rather than tropical regions.39,40 After accounting for recombination, the closest-inferred bat virus ancestors of SARS-CoV-1 and SARS-CoV-2 are inferred to have circulated in regions spanning from Northern Laos to Western China. Furthermore, our examination of the effects of hypothetical unsampled taxa on the phylogeographic reconstructions suggest that our inference of the location of the closest-inferred bat virus ancestors and their relative dispersal velocities are robust rather than a byproduct of subsampling.
SARS-CoV-1 is thought to have been moved from bats in Yunnan province to Guangdong province via intermediate mammalian hosts.15 The movement of palm civets (Paguma larvata) and raccoon dogs (Nyctereutes procyonoides) as part of the wild and farmed animal trade has been implicated in the emergence of SARS-CoV-1,41 suggesting that human-mediated transport of these infected animals can allow for viral lineages to move faster than their bat hosts. Given (1) the presumed emergence of SARS-CoV-1 through the animal trade, (2) the recency and location of the closest-inferred ancestor relative to SARS-CoV-2, (3) the consistently relatively high dispersal velocity associated with SARS-CoV-2, and (4) the clear evidence that the epicenter of the SARS-CoV-2 pandemic was at one of only four markets in Wuhan that sold live wildlife from plausible intermediate host mammal species,42,43,44 either the closest-inferred ancestor or the direct ancestor of SARS-CoV-2 likely moved from an area in or around Yunnan province, to Hubei province, via the wild and farmed animal trade.
Our phylogeographic model assumes that bat and human sarbecoviruses evolve at similar rates. If bat viruses have slower rates of evolution than those inferred here, the closest-inferred bat virus ancestors likely circulated earlier. However, if this were true, there would be a concomitant decrease in their inferred rates of travel and, therefore, still not enough time for them to reach the provinces of emergence of their respective SARS-CoVs.
The roughly similar rates of diffusion among viral lineages and horseshoe bats suggest these viruses are able to rapidly and efficiently move among distinct horseshoe bat populations. If virus circulation were restricted to the location of each bat population, with minimal transmission of strains to other bats, we would expect a considerably lower diffusion coefficient for the viruses, but there is substantial overlap between the virus lineage and bat diffusion coefficients. Therefore, the sarbecoviruses are likely being transmitted among dense bat populations of various horseshoe bat species. This finding is further supported by the high levels of recombination in bat sarbecoviruses, requiring frequent co-infection and co-circulation of distinct virus strains in a given bat population. Viral movement is likely to take place through horseshoe bats regularly moving among roosts, especially as roost switching tends to happen during breeding season, or if bats travel or migrate to hibernacula.33,45 Furthermore, in hibernacula, horseshoe bats will roost in very dense aggregations with different species,46 providing the potential for spillover between bat species while bodily functions are downregulated and body temperatures can reach very low levels.46,47 Heterothermy during hibernation and variable temperatures among species may also impact vulnerability to pathogens, as metabolic processes are downregulated during this time.48,49
Despite limited evidence for long-distance travel of horseshoe bats, there are several relatively long-distance viral lineage dispersals between nodes in the phylogenies. These geographically distant nodes therefore likely represent a lack of sarbecovirus sampling from areas between the distant locations rather than long-range bat movements, though it could also indicate dispersal from more migratory conspecific bat species, such as Chaerephon, which can share roosts, or Hipposideros, which can disperse longer distances and are phylogenetically close to Rhinolophidae. For example, internal branches of the SARS-CoV-2-like tree represent movement from the east of China (Zhejiang) to the west (Yunnan), to Thailand and Cambodia (Figure 4H). There are likely many more, still-undiscovered SARS-CoV-2-like viruses that would bridge the geographic gap between these regions (e.g., in northern Vietnam, Guizhou, and Guangxi) as the diversity of bats remains high across these regions, such that increased sampling in underexplored and undersampled regions would be helpful (Figure 4B).
Although horseshoe bat ranges extend across most of China, species richness decreases at increasing latitudes,27,50 and virus sampling in bats has largely been limited to higher biodiversity areas, including Yunnan province in China and Laos. Additionally, other large areas with high diversity, such as northern Vietnam, remain unsampled and host a similar community of bat species. The SARS-CoV-1-like bat viruses are still better represented (n = 134) and therefore have more phylogeographic resolution than the SARS-CoV-2-like bat viruses (n = 31). Although there are viruses closely related to the SARS-CoVs across most of the NRRs, the viruses are relatively more divergent across the NRRs that overlap with the Spike gene (Figure 1). This relatively greater divergence suggests that Spike genes that can bind non-bat entry receptors circulate in lower frequencies in wild horseshoe bat populations and, therefore, are less likely to be sampled.
Since the emergence of SARS-CoV-1 and SARS-CoV-2 in humans, the descendants of the bat viruses that gave rise to them have experienced years of evolution and almost certainly recombined with yet-unsampled lineages of the sarbecovirus tree. Hence, we should not expect to find the direct ancestor of either SARS-CoV-1 or SARS-CoV-2 circulating in wild bats in future sampling. Although the search for and analyses of sarbecoviruses related to SARS-CoV-1 and SARS-CoV-2 have often focused on partial genomes, typically the RdRp protein,13,51,52,53,54,55 future sequencing efforts should routinely aim for whole-genome sequences to detect and characterize all genomic fragments that descend from closely related ancestors. The non-recombinant segments of these mosaic genomes are pieces of a complex puzzle that are crucial for understanding the past and future emergence of SARS-CoVs into the human population.

Limitations of the study

Because there is insufficient temporal signal when calibrating a molecular clock using tip dating with sarbecoviruses sampled from bats and pangolins, we needed to calibrate the clock using SARS-CoV-1 and SARS-CoV-2 genomes from viruses sampled from humans and civets. The NRRs reported here are based on distinct inferred evolutionary histories across the sarbecovirus genomes, but, due to the need for a detectable signal, they likely represent a lower estimate of how much recombination has likely occurred among these viruses. Phylogeographic reconstructions across deep evolutionary histories can only capture general patterns and necessarily make abstraction of potentially complex population and habitat dynamics. Due the absence of bat viruses sampled closer to where SARS-CoV-1 and SARS-CoV-2 emerged, the inference of the location of the closest-inferred bat virus ancestors could be biased toward where published, closely related viruses were sampled. Similarly, a lack of sampling and publication of sarbecoviruses in nearby and potentially difficult-to-access regions limits our understanding of the broader ecology of these viruses. Even if sampling efforts would target more regions, host shedding of virus will vary over time because it is likely to increase during periods of increased metabolic stress, such as pregnancy or lactation. Therefore, detectability will vary across the year and there may be temporal sampling biases. However, the lack of sampled bat sarbecoviruses very closely related to SARS-CoV-1 and SARS-CoV-2 in and around their provinces of emergence, despite increased sampling of bat viruses there,13,25,26 suggests that our findings are not solely the result of ascertainment bias. Further sampling of closely related sarbecoviruses, methodological improvements in disentangling complex recombination patterns, and improved molecular clock calibration can allow for the inference of more detailed spatiotemporal trends and recombination patterns, as well as more complete evolutionary histories of each NRR.

Resource availability

Lead contact

Further information and requests for resources and additional data should be directed to and will be fulfilled by the lead contact, Jonathan E. Pekar (jpekar@ed.ac.uk).

Materials availability

This study did not generate new unique reagents or specific biological material. The materials used and generated in this study are available from the lead contact. As a computational project, the input, results, output, and code of this study are publicly available and described below in the “data and code availability” section.

Data and code availability

All genome accessions are available in Data S1. The XML files, code, and Newick trees are available at https://github.com/phylogeography/sarbecovirus_phylogeography.

Acknowledgments

S.D. acknowledges support from the Fonds National de la Recherche Scientifique (F.R.S.-FNRS, Belgium; grant no. F.4515.22) and from the Research Foundation - Flanders (Fonds voor Wetenschappelijk Onderzoek - Vlaanderen, FWO, Belgium; grant no. G098321N). S.D. and P.L. acknowledge support from the European Union Horizon 2020 project MOOD (grant agreement no. 874850). P.L., A.F.M., A.R., and M.A.S. acknowledge support from the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 725422-ReservoirDOCS), from the Wellcome Trust through project 206298/Z/17/Z, and from the National Institutes of Health grants R01 AI153044, R01 AI162611, and U19 AI135995. P.L. also acknowledges support from the Research Foundation - Flanders (Fonds voor Wetenschappelijk Onderzoek – Vlaanderen, G0D5117N and G051322N). T.I.V. acknowledges support from the Branco Weiss Fellowship. D.L.R. and J.H. acknowledge funding from the UK Medical Research Council (MRC, MC_UU_12014/12, MC_UU_00034/5, and MR/V01157X/1). S.L. was funded by an MRC studentship. M.G. is supported by the Biotechnology and Biological Science Research Council (BBSRC), grant no. BB/M011224/1. A.K. acknowledges funding from European Research Council grant 101001623-PALVIREVOL. J.E.P. acknowledges support from NIH (T15LM011271) and the UC San Diego Merkin Fellowship. This work was funded in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health (NIH), Department of Health and Human Services (contract no. 75N93021C00015 to M.W.). J.O.W. acknowledges support from NIH (R01AI135992). We gratefully acknowledge the authors from the originating laboratories and the submitting laboratories, who generated and shared through GISAID, the viral genomic sequences and metadata on which this research is based (Data S1). We thank Kristian Andersen, Alexander Crits-Christoph, Fabian Leendertz, Niema Moshiri, and Richard Neher for fruitful conversations and insight. We gratefully acknowledge support from Advanced Micro Devices, Inc., with the donation of parallel computing resources used for this research.

Author contributions

J.E.P., S.L., D.L.R., M.W., J.O.W., and P.L. conceived and planned the research. J.E.P., S.L., S.D., M.G., A.F.M., E.P., and P.L. analyzed and processed the data. Y.W., X.J., A.F.M., T.I.V., M.A.S., J.H., S.D., J.O.W., and P.L. advised on statistical methodology. J.E.P., S.L., S.D., M.G., and P.L. designed and implemented genomic data processing pipelines. A.C.H. modeled bat distributions. J.E.P., S.L., S.D., and P.L. wrote the first draft of the manuscript. All authors contributed to editing and interpreting the results. D.L.R., S.D., M.W., J.O.W., and P.L. jointly supervised the work.

Declaration of interests

S.L. has received consulting fees from EcoHealth Alliance. M.A.S. receives contracts and grants from the US Food and Drug Administration, the US Department of Veterans Affairs, and Janssen Research and Development unrelated to this research. J.O.W. has received funding from the CDC (ongoing) through contracts or agreements to his institution unrelated to this research. J.E.P., M.A.S., A.R., M.W., and J.O.W. have received consulting fees and/or provided compensated expert testimony on SARS-CoV-2 and the COVID-19 pandemic.

STAR★Methods

Key resources table

REAGENT or RESOURCESOURCEIDENTIFIER
Deposited data
Sarbecovirus genomesNCBI-GenBank, GISAID, National Genomics Data Center of ChinaListed in Data S1
SARS-CoV-1 genomesNCBI-GenBankListed in Data S2
SARS-CoV-2 genomesNCBI-GenBank, GISAID, Crits-Christoph et al.44Listed in Data S2
Bat distributionsZhou et al.27https://www.sciencedirect.com/science/article/pii/S0092867421007091#app2
Bat distributionsGBIF.org (06 May 2023) GBIF Occurrence Download56https://doi.org/10.15468/dl.xc2jvq
Bat distributionsGBIF.org (08 May 2022) GBIF Occurrence Download57https://doi.org/10.15468/dl.82sgat
Bat distributionsTanalgo et al.58https://figshare.com/articles/dataset/Metadata_for_DarkCideS_1_0_a_global_database_for_bats_in_karsts_and_caves/16413405
Bat distributionsBecker et al.59https://datadryad.org/stash/dataset/doi:10.5061/dryad.kkwh70s18
Bat distributionsDi Gregorio et al.60https://www.tandfonline.com/doi/full/10.1080/24750263.2021.1918275#d1e1343
Bat distributionsJargalsaikhan et al.61https://www.sciencedirect.com/science/article/pii/S2287884X2200053X#appsec1
Bat distributionsKruskop et al.62https://bioone.org/journals/acta-chiropterologica/volume-22/issue-2/15081109ACC2020.22.2.002/Genetic-Diversity-of-Mongolian-Long-Eared-Bats-Plecotus-Vespertilionidae-Chiroptera/10.3161/15081109ACC2020.22.2.002.short?tab=ArticleLink
Bat distributionsKuo et al.63https://onlinelibrary.wiley.com/doi/abs/10.1111/mec.12838
Bat distributionsMokrani et al.64https://museucienciesjournals.cat/en/amz/issue/16-2018-amz/rapid-assessment-of-cave-dwelling-bat-diversity-in-the-chebket-es-sellaoua-mountains-eastern-algeria?lang=en
Bat distributionsRaman et al.65https://figshare.com/s/b75dca7cdea7edae7bfd
Bat distributionsHughes et al.66https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-2486.2012.02641.x
Bat distributionsHughes67https://www.sciencedirect.com/science/article/abs/pii/S0006320717303932
Bat distributionsIUCN Red list of Specieshttps://www.iucnredlist.org/
Annual mean temperature, mean diurnal temperature Range, isothermality, temperature seasonality, maximum and minimum temperature, annual precipitation, precipitation seasonality, maximum and minimum precipitationBrun et al.68https://doi.org/10.16904/envidat.332
Thornthwaite aridity index, climate moisture index, continentality, embergers pluviothermic quotient, growing degree days above 0, and both annual mean potential evapotranspiration and seasonality of potential evapotranspirationTitle et al.69https://envirem.github.io/
Actual evapotranspirationTrabucco and Zomer70https://doi.org/10.6084/m9.figshare.7707605.v3
Vegetation canopy height layerLang et al.71https://langnico.github.io/globalcanopyheight/
Forest canopy heightPotapov et al.72https://glad.umd.edu/dataset/gedi/
Tree densityCrowther et al.73http://elischolar.library.yale.edu/yale_fes_data/1/
Distance to bedrock, bulk carbon stockHengl et al.74https://github.com/ISRICWorldSoil/SoilGrids250m/
Software and algorithms
MAFFT v7.453Katoh and Standley75https://mafft.cbrc.jp/alignment/software/
BioEditHall76https://thalljiscience.github.io/
HyPhy v.2.5.29Kosakovsky Pond et al.77https://github.com/veg/hyphy
HMMCleanerDi Franco et al.78https://metacpan.org/dist/Bio-MUST-Apps-HmmCleaner
IQ-TREE v1.4.4Nguyen et al.79http://www.iqtree.org/
ETE 3Huerta-Cepas et al.80https://etetoolkit.org/
BEAST v1.10.5Suchard et al.81https://beast.community/
BEAGLE 3Ayres et al.82https://github.com/beagle-dev/beagle-lib
Tracer v1.7.1Rambaut et al.83https://github.com/beast-dev/tracer
Prisoner of War (PoW) modelGhafari et al.21https://github.com/mg878/PoW_model
TreeAnnotator v1.10Suchard et al.81https://beast.community/
SeraphimDellicour et al.84https://github.com/sdellicour/seraphim
ArcMap 10.8ESRIhttps://www.esri.com/en-us/arcgis/products/arcgis-desktop/resources
MaxentHengl et al.74https://biodiversityinformatics.amnh.org/open_source/maxent/
Custom scripts for this studyhttps://github.com/phylogeography/sarbecovirus_phylogeographyhttps://github.com/phylogeography/sarbecovirus_phylogeography

Experimental model and study participant details

The sources of bioinformatic data analyzed in this study are available in the key resources table.

Method details

Dataset compilation

For the animal sarbecovirus dataset we compiled a set of 250 whole-genome sequences (>28,500 nt) of the subgenus Sarbecovirus sampled primarily in horseshoe bat hosts, including reference genomes of the human viruses SARS-CoV-1 and SARS-CoV-2 and six pangolin-infecting viruses (Data S1). This set contains all publicly available whole-genome sarbecoviruses available on 4th of October 2023 retrieved from the NCBI Genbank, GISAID and National Genomics Data Center of China databases. A single genome was used for SARS-CoV-1 (HSZ-Cc; GenBank: AY394995.1), SARS-CoV-2 (Wuhan-Hu-1; GenBank: MN908947.3) and the Guangdong pangolin sarbecovirus isolate (MP789; GenBank: MT121216.1) which has multiple published genome accessions associated with the same sample.16,18,85,86 Collection date, sampling location and host metadata associated with the sequences were retrieved from the corresponding sequence databases and manually checked with the published papers describing the sequences. For sequences where exact city/cave sampling location information was provided a single longitude and latitude data point corresponding to that location was recorded, while provinces were recorded as sampling locations otherwise. The metadata for these 250 genomes are available in Data S1.
For the SARS-CoV-2 datasets, we used 863 genomes from the early pandemic44 and 1,000 genomes from late 2020.87 For the SARS-CoV-1 dataset, we downloaded the 98 publicly available whole genomes of SARS-CoV-1 from humans and civets from GenBank, with dates and locations matched by metadata or literature review: https://github.com/andersen-lab/SARS-CoV-1_Evolution/blob/main/metadata_extended.csv. A list of all accessions used in this study are available in Data S1. All GISAID accessions are also available through GISAID when using the identifier EPI_SET_241121yn.

Recombination analysis

The 250 whole-genome sequences were aligned using MAFFT (v7.453, localpair option)75 and manually curated using BioEdit.76 To explore the recombination patterns of SARS-CoV-1-like and SARS-CoV-2-like sarbecoviruses separately we selected a number of representatives from each set of genomes on which to perform recombination analysis using the Genetic Algorithm for Recombination Detection (GARD).11 We analysed the two sarbecovirus clades separately to reduce overlapping recombination signal between the sets that may skew the position of identified breakpoints unique to each clade. Even though inter-clade recombinants exist in the dataset, the GARD algorithm does not require potential recombination parents to be in the dataset, but assesses recombination simply by comparing tree topology incongruencies. Thus, signals of SARS-CoV-1-like/SARS-CoV-2-like inter-clade recombinants can be detected in genomes of each set without the parental lineages of recombinant genomes having to be present in each dataset. The genome selection was made by first performing a preliminary phylogenetic analysis of the relatedness between these genomes using the recombination breakpoints presented in Lytras et al.4 Based on the preliminary analysis, clusters of genomes grouping with SARS-CoV-1 and SARS-CoV-2 for at least one of their genomic segments and showing the same overall recombination pattern were manually identified and one representative genome was selected from each cluster for the corresponding downstream GARD analyses. For the SARS-CoV-2-like set, 24 genomes including the SARS-CoV-2 reference Wuhan-Hu-1 (GenBank: MN908947.3) were used in the GARD analysis. For the SARS-CoV-1-like set, 38 genomes including the SARS-CoV-1 HSZ-Cc genome (GenBank: AY394995.1) were used in the analysis (Data S1). The representative sequences for each set were retrieved from the full alignment. The ends of genomes are usually more difficult to sequence, leading to shorter or extended 3’ and 5’ ends in the sequences. Due to the incomplete coverage of these short genomic regions they are unlikely to contain any informative sites for the recombination analysis and for this reason were trimmed from each resulting subset alignment. GARD was performed for each subset alignment using the HyPhy software suite v.2.5.2977 under a general time reversible (GTR) substitution model and a 3-class general discrete distribution (GDD) to account for site-to-site rate variation. The SARS-CoV-2-like GARD analysis finished with a total of 92 consecutive breakpoint models, while the SARS-CoV-1-like analysis finished with 56. Based on these models, we accepted positions as confident recombination breakpoints if they were present in at least 1/2 of the consecutive models for the SARS-CoV-2-like set and in at least 1/3 of the models for the SARS-CoV-1-like set (proportional to the total number of models). Breakpoints closer than 100 alignment positions to one another were merged. The resulting breakpoint positions were mapped back to the full whole-genome alignment which was subsequently split into the corresponding non-recombinant regions (NRR) for each set of breakpoints. Since some NRRs have relatively short lengths, incorrect alignment of sites may have a larger effect on the phylogenetic inference. To avoid alignment errors from biasing the phylogenetic inferences, all NRR alignments were processed using HMMCleaner,78 which removes sequence segments that fit poorly in the rest of the alignment, before performing any downstream analyses.
To separate all genomes into SARS-CoV-1-like and SARS-CoV-2-like viruses we first inferred maximum likelihood phylogenies for every NRR alignment using IQ-TREE (v.1.4.4)79 under a GTR+I+F+G4 substitution model, assessing node support by performing 10,000 replicates of the ultrafast bootstrap approximation for each inference.88 All trees were rooted by the outgroup clade of sarbecovirus genomes sampled in Europe and Africa (GenBank: NC_014470.1, MZ190137.1, KY352407.1, MT726043.1, MT726044.1, MT726045.1, MZ190138.1, MW719567.1) and nodes with ultrafast bootstrap support below 70 were collapsed using the ‘ete3’ python package.80 Based on the resulting topologies, the viruses clustering monophyletically with SARS-CoV-2 but not with SARS-CoV-1 for at least one of the SARS-CoV-2-like GARD NRR trees were placed in the SARS-CoV-2-like viruses and vice versa for the SARS-CoV-1-like virus set. This led to two non-independent sets of 38 SARS-CoV-2-like and 140 SARS-CoV-1-like genomes (each including SARS-CoV-2 and SARS-CoV-1 respectively) used in the downstream dating and phylogeographic analyses; there were 19 genomes shared between these two datasets. The set to which each genome is assigned is listed in Data S1.

Comparison of breakpoint positions

Based on the GARD recombination analysis, 30 and 43 breakpoint positions were detected for the two virus subgroups. Only one SARS-CoV-1 genome was used in both GARD sets (Data S1), so overall breakpoint inferences are expected to represent independent recombination patterns of known SARS-CoV-1-like and SARS-CoV-2-like viruses respectively. The lengths of the resulting NRRs on the alignment with all 250 sarbecovirus genomes had a median of 870 nt and ranged from 309 to 3,414 nt for SARS-CoV-1-like viruses, and a median length of 595 nt and ranged from 168 to 2,706 nt for SARS-CoV-2-like viruses. It should be noted that the last NRR is the longest due to extended ends at the 3' end of the genomes.
We calculated the distances between the closest position pairs between the two breakpoint sets, resulting in 30 pair distances (each SARS-CoV-1-like position matched with its closest SARS-CoV-2-like position). To compare whether the pairs were more or less closely located than one would expect by chance, we performed 100 simulations of position sets randomly sampled across the same alignment length without replacement (one set with 30 positions and one with 43 for each simulation to match the empirical data on the number of breakpoints). We then quantified the distances between closest pairs in the simulated sets as done with the observed breakpoints. Both empirical and simulated distances formed distributions with log-normal shapes, hence 90% confidence intervals and means were estimated using Cox’s method for inferring log-normal confidence intervals.89 Similarly, the significantly closest observed breakpoint pairs were defined as those with distances below the one-tailed lower 90% confidence interval of the observed log-normal distance distribution.

Molecular clock calibration

There is insufficient data to inform a molecular clock when using tip dating with sarbecoviruses sampled from bats and pangolins.6 To identify a suitable rate prior for each NRR identified from the recombination analysis, we conducted a phylogenetic analysis of SARS-CoVs sampled from humans.
Molecular clock calibration across the 44 inferred NRRs of the SARS-CoV-2-like viruses was conducted using a Bayesian approach with BEAST v1.10.581 and using the BEAGLE library v3 to increase computational performance.82 We inferred the substitution rate of SARS-CoV-2 across each of these NRRs by using a subset of SARS-CoV-2 genomes (n=1,000) from our previously published Bayesian phylogenetic analysis of 3,959 genomes sampled up to late 2020 in Europe87 (Data S1), a non-parametric skygrid prior with 19 grid points and a cutoff of 1 year, and a Hasegawa-Kishino-Yano (HKY85) substitution model90 with four gamma categories. The topologies and substitution models were shared across all the NRRs, but assuming an independent and estimable strict clock rate under an uninformative continuous-time Markov chain (CTMC) rate reference prior91 for each NRR. We ran one Markov chain Monte Carlo (MCMC) chain of 500 million generations, subsampling every 50,000 iterations to a continuous parameter log file. The first 10% of the chains were discarded as burn-in. Convergence and mixing was assessed in Tracer v1.7.183 and all relevant effective sample size (ESS) values for clock rates were >100. We also perform an additional clock calibration without any NRR partitioning to infer a single clock rate prior for the phylogeographic analysis (see below). The rest of the parameterizations are identical to the partitioned clock calibration inference.
To calibrate the molecular clock for the SARS-CoV-1-like viruses analysis, we first performed Bayesian phylogenetic inference on 98 human and civet SARS-CoV-1 genomes (Data S1), using an HKY85 substitution model with a gamma site heterogeneity model and 4 gamma categories (G4), a strict molecular clock, and an exponential growth coalescent prior. We ran a single chain of 100 million generations, subsampling every 10 thousand iterations to a continuous parameter log and tree files. We next inferred the substitution rate of SARS-CoV-1 across the 31 inferred NRRs of the SARS-CoV-1-like viruses by using an empirical tree distribution (n=1,000) from these results, sharing the topologies and substitution models across all the NRRs, with independent and estimable clock rates under CTMC rate priors. We ran one chain of one million generations, subsampling every one hundred iterations to a continuous parameter log file. For each step of the inference, the first 15% of the chains were discarded as burn-in. Convergence and mixing was assessed in Tracer and all relevant effective sample size (ESS) values were >200.

Bayesian divergence time estimation

We used BEAST to perform phylogenetic inference and ancestral state reconstruction across the 44 NRRs of the SARS-CoV-2-like viruses. For the primary analysis (Figure S2), we used all of the 44 SARS-CoV-2-like genomes (including SARS-CoV-2), an uncorrelated relaxed clock using calibrated region-specific normally-distributed substitution rate priors for each NRR based on SARS-CoV-2 sampled from humans, a GTR+G4 substitution model, a non-parametric skygrid prior92 with 49 grid points and a cutoff of 1,000 (which translates to 1022 CE), and Hamiltonian Monte Carlo gradient-based sampling for node ages and branch rates.93,94 Specifically, we set the mean and standard deviation of the rate prior for each NRR to equal the posterior mean and standard deviation estimated during molecular clock calibration with SARS-CoV-2 genomes. We tracked the stem age of SARS-CoV-2 (i.e., where SARS-CoV-2 connects to the rest of the phylogeny, also referred to as the time of the closest-inferred ancestor, with the parent node of SARS-CoV-2 referred to as the closest-inferred ancestor). For each NRR, we first ran one chain of 20 million generations, subsampling every two thousand iterations to continuous parameter log and tree files. The first 10% of the chain was discarded as burn-in for each analysis. Convergence and mixing was assessed in Tracer. For analyses of NRRs where any relevant ESS values were <200, we ran a second chain with the same specifications as the first, discarded the first 5–25% of the chain as burn-in, and combined the two chains. All relevant ESS values for the one- and two-chain analyses were >200.
We performed the same analysis for the SARS-CoV-1-like viruses for its 31 NRRs, using all of the 140 SARS-CoV-1-like genomes (including SARS-CoV-1). The BEAST parameterization was identical to the analysis for the SARS-CoV-2-like viruses, except we (1) tracked the stem age of SARS-CoV-1 (i.e., the time of the closest-inferred ancestor of SARS-CoV-1), (2) used the relevant calibrated molecular clock priors for the 31 NRRs of the SARS-CoV-1-like viruses, and (3) ran one chain of 100 million generations, sub-sampling every 10 thousand iterations to continuous parameter log and tree files.

Bayesian divergence time estimation sensitivity analyses

We performed a series of sensitivity analyses for the Bayesian divergence time estimation: (i) using SARS-CoV-2 genomes from early 2020, rather than late 2020, when calibrating the molecular clock prior for the subsequent analyses of SARS-CoV-2-like viruses, and (ii) including two additional human SARS-CoV-1 genomes and two civet SARS-CoV-1 genomes in the SARS-CoV-1-like viruses analyses, representing both the 2002–2003 and 2003–2004 SARS-CoV-1 outbreaks. For (i), we performed molecular clock calibration using the same approach as with the SARS-CoV-2 genomes sampled from late 2020, except here using 863 genomes from the early pandemic (Data S1), sharing the topologies and substitution models across all the NRRs, but assuming an independent and estimable strict clock rate under an uninformative CTMC rate prior91 for each NRR. We ran one MCMC chain of one million generations, subsampling every one thousand iterations to a continuous parameter log file. The first 10–15% of the chains were discarded as burn-in. Convergence and mixing was assessed in Tracer and all relevant effective sample size (ESS) values were >200. There was no temporal overlap between the datasets of SARS-CoV-2 genomes from early 2020 and late 2020; the latest genome in early 2020 dataset was sampled on 14 February 2020 and the earliest genome in the late 2020 dataset was sampled on 22 February 2020.
For (ii), SZ3 (GenBank: AY304486.1) was selected as representative of the earliest sampled detection in the intermediate animal reservoir (market-based civets). ZS-B (GenBank: AY394996.1) was selected to represent the early stages of the human epidemic from late January 2003 onwards. As both of these are from the 2002–2003 SARS-CoV-1 outbreak, we also included the civet virus HC/SZ/266/03 (GenBank: AY545916.1) and the human virus GZ0401 (GenBank: AY568539.1) from the 2003–2004 SARS-CoV-1 outbreak.
Convergence and mixing were assessed in Tracer. For analyses of NRRs where any relevant ESS values were <200, we ran up to two more chains with the same specifications as the initial chain, discarded the first 10% of the chain as burn-in, and combined all chains. All relevant ESS values for the combined chains were >200.

Constructing the recombinant common ancestor

For each NRR of the SARS-CoV-2-like viruses, we reconstructed the most likely nucleotide sequence belonging to the parent node of SARS-CoV-2 based on the tree distribution from the primary Bayesian divergence time estimation (Figure S2). The most likely nucleotide sequences for the 44 NRRs are then concatenated to form the recombinant common ancestor (recCA)14 of SARS-CoV-2.
To analyze how the recCA changed as more genomes were released, we constructed the recCA of SARS-CoV-2 as the genomes of closely related sarbecoviruses were chronologically published. We performed ancestral state reconstruction using the posterior trees with all 38 taxa. We next identified the inferred sequence of the most recent common ancestor of SARS-CoV-2 and its most closely related taxa based on the published genomes as new genomes were sequenced. We then examined the genetic similarity of the recCA, as well as the most closely related published genome, to SARS-CoV-2 as a function of the publication date. The same analysis was performed for SARS-CoV-1 and the SARS-CoV-1-like viruses and each phylogenetic sensitivity analysis (see above).

Bayesian phylogeographic inference

We inferred the evolutionary and geographic history of the 44 NRRs of the SARS-CoV-2-like viruses with BEAST. In order to optimally inform a flexible phylogeographic reconstruction approach, we jointly fit a bivariate continuous diffusion model using a shared estimable precision matrix across all independent random NRRs phylogenies, rescaling this matrix across branches in each tree to model a relaxed random walk (RRW).95 All branch-specific RRW rate scalars are drawn from the same log-normal distribution with a mean of 1 and an estimable standard deviation. In the continuous phylogeographic diffusion model, we exclude the sampling locations for the human and pangolin viruses and integrate over the uncertainty for locations that lack precision. As a tree prior, we specify a joint non-parametric skygrid prior with 49 grid points and a cutoff of 1,000 (which translates to 1022 CE) or a cutoff of 0.5 substitutions per site for trees scaled in units of genetic distance (see below). We further specify a joint HKY85 nucleotide substitution model across NRRs. These relatively simple sequence evolutionary parameterizations were chosen to meet the requirements for the Prisoner of War (PoW) model21 to rescale the evolutionary histories (see below). We perform two versions of this analysis (Figure S2): (i) using a hierarchical strict molecular clock model that allows for the clock rate of each NRR to vary (but must average to the overall rate), with the single inferred clock rate prior from above for the overall clock rate, in order to estimate the trees in units of time and (ii) using independent strict molecular clock models with clock rates fixed to 1 and contemporaneous tips in order to estimate trees in substitution units for PoW transformation. For (i), we also included gamma-distributed among-site rate heterogeneity with a hierarchical prior over the NRR-specific shape parameters. We ran two chains of 500 million generations, subsampling every 10 thousand iterations to continuous parameter log and tree files; the first 10% of the chain was discarded as burn-in. Convergence and mixing was assessed in Tracer, and all relevant ESS values were >200.
We performed the same analysis for the SARS-CoV-1-like virus set for its 31 NRRs. The BEAST parameterization was identical to the analysis for the SARS-CoV-2-like viruses, but here we use the rate priors with standard deviations divided by 5 to assist in convergence. We ran multiple chains for >200 million generations, subsampling every 500 thousand iterations to continuous parameter log and tree files, and discarded the necessary burn-in prior to combining the independent samples. Convergence and mixing was assessed in Tracer, and all relevant ESS values were >100.
To account for the time-dependent decline in the evolutionary rate of sarbecoviruses, we applied the PoW model, a mechanistic evolutionary method for estimating virus substitution rates over long evolutionary timescales. For this, we used the post-burn-in substitution trees inferred from the previous step along with the posterior rate distributions of the SARS-CoV-1-like viruses and the SARS-CoV-2-like viruses to rescale substitution trees into time trees using the PoW transformation (see ref Ghafari et al.21 for more details on the method).
We used TreeAnnotator v1.1081 to obtain and annotate a maximum clade credibility (MCC) tree for each continuous phylogeographic reconstruction. Finally, we used the R package ‘seraphim’84 to extract the spatiotemporal information embedded within 100 trees sampled from each post-burn-in posterior distribution and to visualize the different phylogeographic reconstructions. For each NRR, we estimated the weighted diffusion coefficients29 with the ‘spreadStatistics’ function of the R package ‘seraphim’84.

Bayesian phylogeographic inference sensitivity analysis

To examine the consequences of potentially biased sampling on the phylogeographic inference, we included in the Bayesian inference two hypothetical unsampled taxa that were associated with a specific location and a sampling date of 2020 but no observed sequence data. For the analysis of the SARS-CoV-2-like viruses, one taxon was given a sampling location associated with a limestone karst in Guizhou expected to have high bat diversity, and the other taxon was given a sampling location associated with caves in Hunan also expected to have high bat diversity. Both of these locations are between Hubei, where SARS-CoV-2 emerged, and Yunnan, China and Northern Laos, where the bat viruses most closely related to SARS-CoV-2 were sampled. For the analysis of the SARS-CoV-1-like viruses, both taxa were assigned sampling locations in Guangxi, between Guangdong, where SARS-CoV-1 emerged, and Yunnan, where the bat viruses most closely related to SARS-CoV-1 were sampled. The unsampled taxa were constrained to branch off from the branch connecting the human SARS-CoV and closest-inferred ancestor.

Ranking the dispersal velocity of the lineage descending from the closest-inferred ancestor

Using 100 trees sampled from each post-burn-in posterior tree distribution from each NRR from the tip-dated phylogeography analyses, we calculated the dispersal velocity for each tip branch, excluding the branches leading to the human or pangolin viruses. We next computed the minimum distance between the closest-inferred ancestor and Guangdong province or Hubei province for SARS-CoV-1 or SARS-CoV-2, respectively. The necessary dispersal velocity for the closest-inferred ancestor to reach either Guangdong or Hubei province was calculated using this minimum distance and the branch leading to, respectively, SARS-CoV-1 or SARS-CoV-2 (i.e., the time between closest-inferred ancestor and the SARS-CoV). Tip branches associated with taxa that have parent nodes dated to more than 50 years before the most recently sampled virus were excluded from downstream analyses. If the tip branch was excluded from at least 50% of the samples for an NRR, the tip branch was removed from the NRR entirely. If the tip branch associated with a human SARS-CoV was excluded from at least 50% of the samples for an NRR, the NRR was removed entirely from downstream analyses. The 50-year threshold was chosen based on the observed substitution saturation inflection point (Figure 3A), thereby also focusing primarily on the recent movement of sarbecoviruses.
We then computed the rank of the dispersal velocity for each remaining tip branch in each tree sample. These posterior dispersal velocity ranks were pooled together for each tip across all samples for an overall posterior dispersal velocity rank distribution, or across all samples for each NRR for posterior dispersal velocity rank distributions per NRR.
To calculate how much the posterior dispersal velocity rank distribution deviates from an expected average rank, we use an adjusted formula for skewness with a given “mean” of 0.5:
Skew=n(n1)(n2)i=1n(Xi0.5s)3

Modeling of bat distributions

Our bat distribution database comprised of the data used in recently published studies,27 recent Global Diversity Information Facility (GBIF) data,56,57 and the Darkcide database.58 The Darkcide database included recent data collected by bioacoustic surveys in Chiang Mai and Kanchanaburi in Thailand by A.C. Hughes using an echometer Pro, and multiple published datasets.27,56,57,59,60,61,62,63,64,65,96,97 The data was integrated as we attempted to keep the database complete, even if not all data is used within our particular study. All Rhinolophid bat species, as well as Chaerephon plicatus and Aselliscus stoliczkanus were extracted from the dataset, and then the dataset was clipped to Southeast Asia (India to the West, Mongolia to the North, Japan to the East and Malaysia to the South) to buffer the region of interest with distribution data. Once all duplicate records had been removed, the clipped data for Rhinolophids included 10,254 distribution points for all species across the region, in addition to 235 records for A. stoliczkanus and 108 for C. plicatus.
For environmental data, we based variable selection on those used in Zhou et al.27 and known from our previous work to be important to bats in the region.66,67 All analysis was conducted in ArcMap 10.8 (ESRI) at a resolution of 0.008o, which is approximately 1km2, and all layers were clipped and subsampled to the same resolution using a mask. Climate variables were downloaded from Chelsa68 (available at https://chelsa-climate.org/downloads/) based on a selection from our previous analyses (the bioclimatic variables: Annual Mean Temperature, Mean Diurnal temperature Range, Isothermality, Temperature Seasonality, Maximum and Minimum Temperature, Annual Precipitation, Precipitation seasonality, Maximum and Minimum precipitation).66,67 From Envirem69 (https://envirem.github.io/) we used the Thornthwaite Aridity index, climate moisture index, continentality, embergers pluviothermic quotient, growing degree days above 0, and both annual mean potential evapotranspiration and seasonality of potential evapotranspiration. In addition we used actual evapotranspiration,70 a vegetation canopy height layer,71 forest canopy height,72 tree density,73 distance to bedrock74 (the bdticm and bdricm aggregated grids available at https://github.com/ISRICWorldSoil/SoilGrids250m/), and bulk carbon stock.74 These were all trimmed to the same region that bat data was clipped to and then species distributions were modelled using Maxent98 using the GUI following the same approach as Hughes et al.66 For each species, three replicates were run and the average was used. These probabilistic distribution maps were then reclassified using the 10th percentile cumulative logistic threshold to produce a binary presence-absence map with ArcMap 10.8, and all models had an AUC exceeding 0.85. The models were then clipped to the study region, and any Rhinolophids with no modelled suitable habitat in the clipped study area were removed from analysis. As the projections show suitable habitat but do not account for species biogeography we then cross-checked the 40 remaining Rhinolophid maps against the IUCN database (https://www.iucnredlist.org/), and species not listed were checked against relevant publications on the species. Based on the IUCN mapped ranges species endemic to Japan, or species not found in Japan or at similar latitudes (areas of the South Korean Peninsula are climatically suitable for a number of species, but lack continuous habitat to reach these regions if largely distributed in Southeast Asia) were then clipped to the biogeographically suitable range. The binary maps were then stacked using the Mosaic to New Raster function in ArcMap 10.8 to find the maximum number of species that may be co-distributed.
 
 
 

https://www.cell.com/cell/fulltext/S0092-8674(25)00353-8