As the world debates the origin of SARS-COV-2, most assume the SARS outbreak of 2003 was a natural event. But revisiting the evidence, I found parallels, direct linkages, and many unresolved questions.
Bede Morris ran a successful campaign against ANAHL's insistence, before it was opened, that it needed FMDV to demonstrate the symptoms of infection to vets. Bede's campaign was aimed at the NFF, on the lines that symptoms of stressed animals in ANAHL would probably be atypical, that if FMDV was not in ANAHL then it couldn't get out, and that ELISA had just been invented, so transporting samples to ANAHL for testing was dangerous, better to take the test to the field, and that the way viruses got out of high security labs was more often on people, rather than faulty equipment - though Pirbright reversed that equation, but later.
Sure enough, a couple of years after ANAHL opened, a pathogenic strain of Newcastle Disease Virus almost certainly got out, but fortunately didn't spread. The story, which I've never seen in print, but may be in Trove, was that a technician was harvesting NDV infected allantoic fluid from eggs, and the vacuum collection bottle imploded, and drenched her. NDV grows, I believe, in the human conjunctiva, so the lady would almost certainly have been infected. However it transpired that no plans had been made to handle such an incident, and to quarantine such a person in the building, so she was cautioned to steer clear of chickens, and anyway it was a Friday. Nonetheless next day she went to a party at her uncle's - he was, of course, a chicken farmer!
A few years later a new Director of AAHL (they had by then realised that its first acronym was unsatisfactory) was appointed. He was another one of the Pirbright diaspora who believed that all animal virus research should be based on FMDV. So he immediately started a campaign to import FMDV as essential. He appeared on ABC with his message, and loudly proclaimed that no virus could possibly, ever, get out of such a cleverly built facility. I bailed him up at a meeting, and by letter, and told him that before making such comments he should better inform himself of the history of ANAHL. The NFF also blew their top, and he shut up for a few years, but I expect they now have every bloody exotic virus known to humanity.
Amazing story. I didn't realize Newcastle Disease infected humans (interestingly sounds similar to Adenovirus 37).
"I expect they now have every bloody exotic virus known to humanity"
Yes I expect so, and no longer just animal viruses, hence the more recent name change to ACDP. Is collecting exotic viruses a matter of national prestige? I assumed in Australia we aren't doing much by way of manipulating them (particularly gain of function). I might be wrong?
I may have already told you that I complained to Peter Doherty and Paul Young about their Institutes boastfully telling the world that they had devised a simpler, faster, easier method for cloning big RNA genomes, and published a paper to let the world know, so I conclude that they are well into it. Particularly worrying was the argument I had with Paul Young (hope I've got the name correct) about the term 'Gain of Function' - "gain, not loss" he said "don't you understand", "No" I replied, "as soon as you genetically manipulate a virus you have no idea what you've done, whatever you may claim, because you cannot predict all the phenotypes of any genotype change." It would be much better for the work to be called "GM virology", cos all would know what that means.
Apart from WIV1, the "SNV" trimer is also included 5 times in these sequences which all have about 100 amino acid changes from Tor2: BtRs-BetaCoV/YN2018B, Rs3367, Rs7327, Rs9401, YN2016A, YN2016B, YN2016C, YN2016D, YN2016E.
The Rs and YN sequences are both supposed to come from Chinese horseshoe bats (Rhinolophus sinicus). The Rs sequences were published in a paper from 2016 titled "Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus", where the last author was Zhengli Shi and the other authors included Peter Daszak. The YN (Yunnan) sequences were published in a paper from 2021 titled "A comprehensive survey of bat sarbecoviruses across China for the origin tracing of SARS-CoV and SARS-CoV-2".
You wrote: "Before and since the discovery of BM48-31 several other SARS related viruses have been discovered in Europe, by authentically independent teams, but none have claimed to find similar RBM features." But the SNV trimer is also included three times in some of the European samples from Russia (Khosta-1 and Khosta-2) and the UK (RfGB01, RfGB02, RhGB01, RhGB02, RhGB03, RhGB04, RhGB05, RhGB06, RhGB07, RhGB08):
I didn't find any other trimers which occurred 5 times in the spike protein of SARS-like viruses, apart from "NFN" and "FNF" in Sarbecovirus sp. HN2021D, and "ITP" in 8 sequences like Rs7896 (which all end with a four-digit number that starts with 78 or 79):
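A rough sketch of how repeated trimers can be counted in a shell, in the same awk style as the other commands in this thread. The `trimers` function name and the toy input string are made up for illustration; a real run would pass in a full spike protein sequence instead.

```shell
# Count overlapping 3-residue windows in one protein sequence and list
# the trimers that occur at least MIN times (default 2).
# Alignment gaps ("-") are removed before counting.
trimers() {   # usage: trimers SEQUENCE [MIN]
  printf '%s\n' "$1" | awk -v min="${2:-2}" '{
    gsub(/-/, "")
    for (i = 1; i <= length($0) - 2; i++) n[substr($0, i, 3)]++
    for (k in n) if (n[k] >= min) print n[k] "\t" k
  }' | sort -k1,1nr -k2,2
}

# Toy string, not a real spike sequence: "SNV" appears three times.
trimers SNVAASNVKKSNV 2
```

Windows overlap on purpose, so "NFN" and "FNF" can both be counted in a run like "NFNFN".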
All of these come from one of two sources (which effectively may be just one source). Those in the form RsXXXX are from WIV (an entity of CAS) and EcoHealth, those in the form YNXXXXX are from Institute of Pathogen Biology (another entity of CAS) and EcoHealth. So while it sounds like a lot, none of it is independent of Chinese government control. The presence of EcoHealth remains a mystery.
Khosta-1 and RhGB01 each have an SNV that's unique further downstream, which could be coincidence. More interesting, I think, is that Khosta-1 has the same SNV as BM48-31 in the RBM. The presence of Peng Zhou and Danielle Anderson editing the paper isn't helpful. Neither is the fact that the sequences weren't submitted until 28-May-2021. It raises more questions than it answers.
Two of the interspaced Y residues in the Y?Y?Y pattern of SARS 1 and SARS 2 are also included in many HKU5 and MERS samples and some hedgehog coronaviruses:
SAGEIVQFNYKQDFSNPTCRVLATVPQNLTTI---TKPSNYAYLTECYKTS Pipistrellus bat coronavirus HKU5 isolate 19S|KC522093.1|AGP04932.1
The YNYK motif is repeated twice in SARS 1 with 24 residues in between, but the first YNYK is also included in the Tylonycteris bat coronaviruses above.
I found the sequences by searching BLAST for the 1000 closest matches to Wuhan-Hu-1's spike protein, with SARS 2 excluded from the search results.
Yes, so that first occurrence is conserved in many CoVs, but not in the RBM. Idk if it's important. I'll be taking a look at MERS/HKU4/5/Hedgehog CoVs etc. in future. Handy to know.
My coding skills are very rusty (particularly this shell stuff, an occasional bit of Python is about all I do these days). But I have some ideas - will stop by.
A very interesting study. Have you thought of making comparisons at some other 'informational level'? So instead of amino acid to amino acid comparisons, use some of those grouping metrics used in the early Expasy. I suppose you are actually doing this when you compare structures. Then there is of course, AlphaFold - https://alphafold.ebi.ac.uk/. All strength to your elbow!!
I've been using AlphaFold, a fantastic tool for visualizing these bat viruses, for most of which no structure has been determined. To date it hasn't been useful for protein-protein interactions, such as receptor binding, but AlphaFold2 is a step closer apparently:
Improved prediction of protein-protein interactions using AlphaFold2
Your basic aim, I assume, is to tell the fake sequences from the real ones. The gold standard for that is to have another lab repeat the sequencing, or to use the sequences coming from reliable labs, and assume real sequences will have similar structure, so you need a metric of how close an unreliable sequence is to a reliable one. Apologies, I'm just mulling.
The following code downloads FASTA files for nucleotide and amino acid sequences of SARS-like viruses, aligns the spike protein sequences, and sorts the sequences by their number of mismatches to Tor2 in the region that contains the DATSTGNYNYKYRYLR sequence in Tor2:
brew install mafft seqkit brewsci/bio/snp-dists xmlstarlet
curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW' # 335 aligned sequences of SARS-like viruses from GenBank
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.aa
seqkit grep -nrp spike\|surface sarslike.aa|mafft ->spike.aln
snp-dists sarslike.fa>sarslike.dist
# note: sarslike.xml is not created by any command shown here; the next line assumes it already holds the GenBank XML records (GBSeq elements) for the same accessions
xml fo -D sarslike.xml|xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v GBSeq_definition -o $'\t' -v GBSeq_create-date -o $'\t' -v './/GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' -o $'\t' -v '(.//GBAuthor)[1]' -o ... -v '(.//GBAuthor)[last()]' -o $'\t' -v '(.//GBReference_title[text()!="Direct Submission"])[last()]' -o $'\n'>sarslike.tsv
tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";} # `tab \\t` is like `column -ts$'\t'` but it doesn't get thrown off by empty fields
x=NC_004718.3;seqkit subseq -r490:506 spike.aln|seqkit fx2tab|sed $'s/_prot_[^\t]*//;s/lcl|//'|gawk '{l=length($2);for(i=1;i<=l;i++)a[$1][i]=substr($2,i,1);b[$1]=$2}END{for(i in a){d=0;for(j=1;j<=l;j++)if(a[targ][j]!=a[i][j])d++;print i"\t"b[i]"\t"d}}' targ=$x|awk 'NR==FNR{a[$1]=$2;next}{print$3,$2,a[$1],$1}' {,O}FS=\\t <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -|sort -n|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print a[$4]"\t"$0}' <(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}{print$1 FS$i}' x=$x sarslike.dist) -|sort -n|awk 'NR==FNR{a[$1]=$3 FS$4 FS$5;next}{print$0"\t"a[$NF]}' {,O}FS=\\t sarslike.tsv -|tab \\t
I posted the output of the shell commands here: https://pastebin.com/raw/GDm9PNqD.
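For anyone who finds the one-liner hard to follow, here is a stripped-down sketch of just the mismatch-counting step. It assumes the region has already been cut out of the alignment and arrives as name/sequence pairs separated by a tab (the shape of `seqkit fx2tab` output). The `mismatches` function name and the short labels are made up, and the trailing H on the Tor2 row is an assumption so that all rows have 17 residues; the other three rows are the ones quoted in this comment.

```shell
# For each "name<TAB>aligned_region" line on stdin, count the positions
# that differ from the target sequence, then sort by that count.
mismatches() {   # usage: mismatches TARGET_NAME < name_tab_sequence_lines
  awk -F'\t' -v targ="$1" '
    { name[NR] = $1; seq[NR] = $2; if ($1 == targ) t = $2 }
    END {
      for (r = 1; r <= NR; r++) {
        d = 0
        for (i = 1; i <= length(t); i++)
          if (substr(seq[r], i, 1) != substr(t, i, 1)) d++
        print d "\t" seq[r] "\t" name[r]
      }
    }' | sort -n
}

# Tor2 row: trailing H assumed; other rows as quoted in the comment.
printf '%s\t%s\n' \
  Tor2    DATSTGNYNYKYRYLRH \
  YN2018B DATSTGNHNYKYRYLRH \
  WIV1    DATQTGNYNYKYRSLRH \
  LYRa11  DATSSGNFNYKYRSLRH | mismatches Tor2
```

The full command above does the same comparison and then joins in the whole-genome distances from `snp-dists` and the metadata from the TSV file.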
Eight bat SARS viruses featured the sequence DATSTGNHNYKYRYLRH which has only one mismatch: BtRs-BetaCoV/YN2018B, Rs9401, Rs7327, YN2016C, YN2016D, YN2016E, YN2016A, YN2016B. They all have between 1254 and 1283 nucleotide changes from Tor2. WIV1 has about a hundred fewer nucleotide changes from Tor2 (1150) but it has two mismatches (DATQTGNYNYKYRSLRH). The only genome with three mismatches is "Rhinolophus affinis coronavirus isolate LYRa11" (DATSSGNFNYKYRSLRH), where the number of mismatches is pretty low considering that the whole genome has 2672 nucleotide changes from Tor2. The LYRa11 sequence was published in 2014 as part of a paper titled "Identification of Diverse Alphacoronaviruses and Genomic Characterization of a Novel Severe Acute Respiratory Syndrome-Like Coronavirus from Bats in China".
The Y?Y?Y pattern of three Y residues interspaced by single other residues is also featured in Wuhan-Hu-1: DSKVGGNYNYLYRLFRK. The region is identical in BANAL-52, BANAL-236, and BANAL-103. But in RaTG13 the first four residues are DAKE instead of DSKV. And ZC45 has deletions in the middle of the sequence: "DV---GN--YFYRSHRS".
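A small sketch of how the Y?Y?Y pattern can be checked in a shell, using the fragments quoted above. The `yxyxy` function name is made up for illustration, and gap characters are stripped first, which is an assumption about how deletions should be treated.

```shell
# Print the first Y?Y?Y match (three Y residues separated by single
# residues) in a protein fragment, or "none" if there is no match.
# Alignment gaps ("-") are removed before matching.
yxyxy() {   # usage: yxyxy FRAGMENT
  printf '%s\n' "$1" | awk '{
    gsub(/-/, "")
    if (match($0, /Y.Y.Y/)) print substr($0, RSTART, RLENGTH)
    else print "none"
  }'
}

yxyxy DATSTGNYNYKYRYLR      # Tor2 fragment
yxyxy DSKVGGNYNYLYRLFRK     # Wuhan-Hu-1 fragment
yxyxy 'DV---GN--YFYRSHRS'   # ZC45 fragment with deletions
```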
Thank you for the analysis.
"YN2018B, Rs9401, Rs7327, YN2016C, YN2016D, YN2016E, YN2016A, YN2016B" Yes, I have been looking at these sequences, and they also show very strange "recombination", or engineered, features when viewed using SNAP.
LyRa11 I discuss a little above (there is also LyRa3 posted for some reason only as a protein sequence). These sequences were published by AMMS' Colonel Changchun Tu, a few months after WIV provided Rs3367. It seems a more realistic - I would say - fake. Doesn't jump out so obviously on a SNAP diagram.
More about the BANAL sequences in next installment...
In a paper from 2020 titled "A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein", they described the RmYN02 sequence which was similar to BANAL-116 and BANAL-247 which were published only much later in 2022: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7211627/. The whole genome sequence of RmYN02 is not available from GenBank but only GISAID and NMDC. But the raw reads are available from the SRA, so I posted instructions here on how you use MEGAHIT to assemble the raw reads yourself: https://usmortality.substack.com/p/sars-cov-2-genome-assembly-part-2/comment/14148825. The spike protein of RmYN02 is at GenBank, even though for some reason almost the entire S1 domain is missing: https://www.ncbi.nlm.nih.gov/nuccore/MW201982.1.
In the paper about RmYN02, they tried to demonstrate that the "PRRA" insert had a natural origin because RmYN02 contained a "PAA" sequence around the furin cleavage site which at that point had not been described in other SARS-like viruses, even though it was later featured in BANAL-116 and BANAL-247.
In an alignment of the spike proteins, the region around the furin cleavage site is:
- "SYQTQTNSPRRARSVA" in Wuhan-Hu-1
- "SY----NSPAA-R-VG" in RmYN02, BANAL-116, and BANAL-247
- "SYQTQTNS----RSVA" in RaTG13, BANAL-52, BANAL-103, and BANAL-236
- "SYHTASIL----RSTS" in ZC45
- "SYTHASIL----RSTG" in ZXC21
But if the "PRRA" sequence evolved from "PAA" like the authors of the RmYN02 paper suggested, then it's weird that the "PAA" sequence is not featured in BANAL-52 which is much closer to SARS 2 than BANAL-116 and BANAL-247 are.
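To make the comparison of the aligned rows above concrete, here is a minimal sketch that prints the residues of one row which sit in columns where another row has a gap. The `gap_columns` function name is made up; the two rows passed in are copied from the list above and are assumed to be equal-length rows of the same alignment.

```shell
# Print the residues of an aligned reference row that fall in columns
# where the other row has a gap character ("-").
gap_columns() {   # usage: gap_columns REF_ROW OTHER_ROW
  awk -v ref="$1" -v other="$2" 'BEGIN {
    out = ""
    for (i = 1; i <= length(ref); i++)
      if (substr(other, i, 1) == "-" && substr(ref, i, 1) != "-")
        out = out substr(ref, i, 1)
    print out
  }'
}

# Wuhan-Hu-1 row against the RaTG13/BANAL-52 row.
gap_columns SYQTQTNSPRRARSVA SYQTQTNS----RSVA
```

With these two rows the gapped columns are exactly the four residues of the PRRA insert; with the RmYN02 row the gaps fall in different columns, which is why the alignment itself matters for the "PAA" argument.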
BTW I now also figured out how you can colorize the amino acids in a shell: https://pastebin.com/raw/X9VCj3YZ.
While I will cover this in more detail in a future Substack, you can see some of the weirdness in the BANAL sequences visually in the SNAP diagrams in this Twitter thread. More recombination? I'll also look at RmYN02.
https://twitter.com/breakfast_dogs/status/1603482329850191880?s=20
RmYN02 is indeed suspect, as some authors may be closely related to AMMS. Authors of the paper (and, importantly, those who did the sequencing) Juan Li, Tao Hu, and Hong Zhou had recently worked with the AMMS authors of the paper "Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins". AMMS Major-General Wuchun Cao supervises graduate students at Shandong University. Alice C. Hughes appears to have had no role in sequencing or assembly, only in collecting the initial sample, which she is unable to vouch for.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7211627/
I now compiled a FASTA file of the spike protein sequences of SARS-like viruses from GenBank: https://drive.google.com/uc?export=download&id=1r9TzeL6jaQsV6JChQL8r9-WG9-3Y4Wgw. TSV metadata: https://drive.google.com/uc?export=download&id=1QVurMpmQfbZa2KEe57YSrWHjfvnouwiM. My file includes sequences like RmYN02 and LYRa3 that are missing the whole genome sequence at GenBank.
I basically just entered the accession number of Wuhan-Hu-1's spike protein to protein BLAST (QHD43416): https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins. I entered "SARS-CoV-2 (taxid:2697049)" in the organism field and clicked the "exclude" checkbox next to it. Then I clicked "Algorithm parameters" and I set "Max target sequences" to 500. Then I clicked "BLAST" and I selected "FASTA (complete sequence)" from the "Download" menu. Then I removed the last entries from the file which weren't SARS-like viruses, removed most sequences marked as synthetic constructs, and so on.
I found that there's a new sequence called BtSY2 which was added to GenBank in January 2023: https://www.ncbi.nlm.nih.gov/nuccore/OP963576.1. It doesn't have a full genome at GenBank but only the full CDS. In my alignment of the spike proteins, the number of letter changes from Wuhan-Hu-1 was 20 in BANAL-52, 33 in RaTG13, and 35 in BtSY2, but after that there was a huge gap until Pangolin coronavirus GX_P2V which had 98 letter changes.
If you look at the region of the spike protein from 100 residues before PRRA to 100 residues after PRRA, it has only one amino acid change from Wuhan-Hu-1 in RaTG13 and BtSY2 and two changes in BANAL-52. But most current strains of Omicron have 4 amino acid changes in the same region even though their whole genome has an order of magnitude fewer nucleotide changes from Wuhan-Hu-1.
You can download a FASTA file for a global subsample of about 3000 SARS 2 sequences from NextStrain: https://docs.nextstrain.org/projects/ncov/en/latest/reference/remote_inputs.html. I used Nextclade CLI to generate the protein sequences for all sequences: https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html. I checked the same region of 100 residues before and 100 residues after the PRRA insert to see how many amino acid changes the region has from Wuhan-Hu-1, but when I ignored positions with an X letter, 1917 out of 2929 sequences had 4 changes, 365 had 1 change, 60 had 2 changes, 38 had zero changes, 25 had 5 changes, and so on (but many of the sequences were old samples from 2020 or 2021):
[code omitted]
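As a stand-in for the omitted code, here is a minimal sketch of that kind of tally. It assumes one protein sequence per line, already in the reference's coordinates (no insertions relative to the reference), it skips positions with an X, and it counts a deletion character as a change. The `tally_changes` name and the toy sequences are hypothetical and are not actual Nextclade output.

```shell
# Tally how many sequences have 0, 1, 2, ... amino acid changes from a
# reference inside a window of FLANK residues on each side of a motif.
# Positions where the query has "X" are ignored.
# Output lines are "number_of_sequences<TAB>number_of_changes".
tally_changes() {   # usage: tally_changes REF MOTIF FLANK < one_sequence_per_line
  awk -v ref="$1" -v motif="$2" -v flank="$3" '
    BEGIN {
      p = index(ref, motif)               # 1-based start of the motif
      if (p == 0) { print "motif not found in reference" > "/dev/stderr"; exit 1 }
      lo = p - flank; if (lo < 1) lo = 1
      hi = p + length(motif) - 1 + flank; if (hi > length(ref)) hi = length(ref)
    }
    {
      d = 0
      for (i = lo; i <= hi; i++) {
        c = substr($0, i, 1)
        if (c != "X" && c != substr(ref, i, 1)) d++
      }
      n[d]++
    }
    END { for (d in n) print n[d] "\t" d }' | sort -k1,1nr -k2,2n
}

# Toy data: a 12-residue "reference" with PRRA in the middle, flank of 2.
printf '%s\n' AAAAPRRABBBB AATAPRRABBBB AAXAPRRABCBB AAAAPRRGBBBB |
  tally_changes AAAAPRRABBBB PRRA 2
```

For the real comparison the reference would be the Wuhan-Hu-1 spike protein, the motif PRRA, and the flank 100.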
The code below shows the region around the "Y?Y?Y" pattern in BtSY2 and other sequences. It's "LDSKVGGNYNYLYRLFRKS" in Wuhan-Hu-1, BtSY2, BANAL-52, and a bunch of supposed Pangolin viruses, but RaTG13 actually has five amino acid changes in the same short region: "IDAKEGGNFNYLYRLFRKA". I guess it's recombination again...
[code omitted]
(Substack didn't allow me to post this comment when I included shell code, so I posted a version with the code included here: https://output.jsbin.com/qulozit.)
That's the reason the BANAL sequences had to be "discovered". RaTG13 was very similar to SARS-CoV-2, but not similar enough in the RBD. It was "designed" in a hurry, and it has quite a few flaws that many people have pointed to. BtSY2 is almost certainly another fake. Will they ever stop spamming GenBank with new sequences?
https://www.biorxiv.org/content/10.1101/2022.11.23.517609v1.full.pdf
Your help sifting through this is appreciated, thanks.
Thoroughly enjoyable read. It's refreshing to find so much detail presented in such a clear and accessible form.
Once you see these similarities laid out, it's striking how such a comparative analysis hasn't been attempted before now. I hope this stimulates lots of discussion and even further analysis.
Two of the characters you mention have not gained as much attention as the likes of Fauci, Daszak, Shi or even Baric - I'm referring to Lin-Fa Wang and Gary Crameri - their role in the SARS-COV-2 drama is worthy of scrutiny imho.
Congrats on your first post, welcome to substack and I hope we'll have a chance to read more from you.
Thanks. I would like to know more specifics about Linfa Wang and Gary Crameri's involvement in both outbreaks, it's hard to appreciate their role just from the author list on a paper. Unfortunately, CSIRO keep refusing my FOI requests.
I'm not giving up
Why does the SARS origins timeline begin in 2008? Did the Chinese invent time travel?
Mostly lack of space and the limitations of the free software I was using. Those particular events from 2008-13 relate to the genesis of Rs3367 and RsShC014. But I'll try to make an expanded version, because it will be interesting to see all events/discoveries/publications in context.
Why are you ignoring the fact that Baric published how to make SARS in 2002 and assuming that China released a bioweapon on itself?
Isn't that specious and motivated reasoning, one million lines of text to ignore what's obvious?
It probably would be a good idea to include more early publications on the engineering of coronaviruses. But it doesn't tell you who engineered it; if a method is published, anyone may have used it. And Baric wasn't the only one with an interest in synthetic coronaviruses. If there's something that specifically points to Baric's involvement, please let me know and I'll be happy to include it.
If one is to patent a method that means it's novel enough to patent. There was not enough time between when Baric patented the SARS creation method and the emergence of SARS for anyone else to have possibly done it. Baric began working on SARS reconstruction after it was identified and it took him a year to publish those results and no other team was able to beat him to it.
Therefore it would take anyone else at least one year to copy his methods, but it only appeared six months after he had patented the method. There is only one conclusion and this time it's deductive, Ralph Baric made SARS.
https://patentimages.storage.googleapis.com/b2/32/2f/aa83b26a524941/WO2002086068A2.pdf
https://www.pnas.org/doi/10.1073/pnas.1735582100
My read of that patent is that it's a bit speculative. He doesn't seem to have used those techniques in the SARS cloning and doesn't cite it. It also seems quite generic. He does cite this paper (submitted 31 Jan, 2002):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC136593/
I'll have to read it carefully to see if it might be any more relevant.
See also:
https://journals.asm.org/doi/10.1128/JVI.74.22.10600-10611.2000
Submitted 18 May 2000, published 15 November 2000. The 2002 paper may be the first reference to the "No See'm" technique.
Thanks for sharing those two papers, they make the case stronger and flesh out what he was capable of doing with coronaviruses.
It's simply that no one was even close to the skills Baric was developing at that time. With SARS it's a bit of a litmus test for honesty and integrity: if one is not willing to admit Baric made SARS then there's no point in discussing the origins of any synthetic viruses.
A truly amazing story. I love those SNAP diagrams - so elegantly simple, yet so instructive. Keep up the good work. Adrian
Thanks Adrian. It all came about because of a SNAP diagram. I was looking for a pair of natural sequences to illustrate a "normal" chart and serendipitously picked WIV1 and WIV16. Gazing at that unnatural looking image, a rabbit hole took shape...
Dear DB,
Your rabbit hole has been most productive by any standards.
I have been helping a Ph.D. student in Prague with an analysis of a potyvirus, and got sidetracked into the lineage from which it came. I am now straightening that out, and it will provide some lovely comparisons for SNAPs - I'll let you know how I go. Needless to say a large proportion of the sequences come from China, and I'm starting to feel a Dog's Breakfast coming on!!!
Adrian
PS I presume you know all the historical stuff about the establishment of ANAHL - I helped Bede Morris keep FMDV out of it, at least initially - it was quite a campaign.
"I presume you know all the historical stuff about the establishment of ANAHL"
No I don't. Can you point me to a good source, or share something yourself?
Bede Morris ran a successful campaign against ANAHL's insistence, before it was opened, that it needed FMDV to demonstrate the symptoms of infection to vets. Bede's campaign was aimed at the NFF, along these lines: the symptoms of stressed animals in ANAHL would probably be atypical; if FMDV was not in ANAHL then it couldn't get out; ELISA had just been invented, so transporting samples to ANAHL for testing was dangerous and it was better to take the test to the field; and viruses got out of high-security labs more often on people than through faulty equipment, though Pirbright later reversed that equation.
Sure enough, a couple of years after ANAHL opened, a pathogenic strain of Newcastle Disease Virus almost certainly got out, but fortunately didn't spread. The story, which I've never seen in print, but may be in Trove, was that a technician was harvesting NDV-infected allantoic fluid from eggs, and the vacuum collection bottle imploded and drenched her. NDV grows, I believe, in the human conjunctiva, so the lady would almost certainly have been infected. However it transpired that no plans had been made to handle such an incident, or to quarantine such a person in the building, so she was cautioned to steer clear of chickens, and anyway it was a Friday. Nonetheless next day she went to a party at her uncle's - he was, of course, a chicken farmer!
A few years later a new Director of AAHL (they had by then realised that its first acronym was unsatisfactory) was appointed. He was another one of the Pirbright diaspora who believed that all animal virus research should be based on FMDV. So he immediately started a campaign to import FMDV as essential. He appeared on ABC with his message, and loudly proclaimed that no virus could possibly, ever, get out of such a cleverly built facility. I bailed him up at a meeting, and by letter, and told him that before making such comments he should better inform himself of the history of ANAHL. The NFF also blew their top, and he shut up for a few years, but I expect they now have every bloody exotic virus known to humanity.
Good luck
Adrian
Amazing story. I didn't realize Newcastle Disease infected humans (interestingly sounds similar to Adenovirus 37).
"I expect they now have every bloody exotic virus known to humanity"
Yes I expect so, and no longer just animal viruses, hence the more recent name change to ACDP. Is collecting exotic viruses a matter of national prestige? I assumed in Australia we aren't doing much by way of manipulating them (particularly gain of function). I might be wrong?
I may have already told you that I complained to Peter Doherty and Paul Young about their Institutes boastfully telling the world that they had devised a simpler, faster, easier method for cloning big RNA genomes, and published a paper to let the world know, so I conclude that they are well into it. Particularly worrying was the argument I had with Paul Young (hope I've got the name correct) about the term 'Gain of Function' - "gain, not loss" he said, "don't you understand?" "No," I replied, "as soon as you genetically manipulate a virus you have no idea what you've done, whatever you may claim, because you cannot predict all the phenotypes of any genotype change." It would be much better for the work to be called "GM virology", cos all would know what that means.
I'll send you the NDV paper by email.
Adrian
Apart from WIV1, the "SNV" trimer is also included 5 times in these sequences which all have about 100 amino acid changes from Tor2: BtRs-BetaCoV/YN2018B, Rs3367, Rs7327, Rs9401, YN2016A, YN2016B, YN2016C, YN2016D, YN2016E.
The Rs and YN sequences are both supposed to come from Chinese horseshoe bats (Rhinolophus sinicus). The Rs sequences were published in a paper from 2016 titled "Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus", where the last author was Zhengli Shi and the other authors included Peter Daszak. The YN (Yunnan) sequences were published in a paper from 2021 titled "A comprehensive survey of bat sarbecoviruses across China for the origin tracing of SARS-CoV and SARS-CoV-2".
You wrote: "Before and since the discovery of BM48-31 several other SARS related viruses have been discovered in Europe, by authentically independent teams, but none have claimed to find similar RBM features." But the SNV trimer is also included three times in some of the European samples from Russia (Khosta-1 and Khosta-2) and the UK (RfGB01, RfGB02, RhGB01, RhGB02, RhGB03, RhGB04, RhGB05, RhGB06, RhGB07, RhGB08):
curl -Lso spikes.fa 'https://drive.google.com/uc?export=download&id=1r9TzeL6jaQsV6JChQL8r9-WG9-3Y4Wgw'
dif2x(){ awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");n=0;for(i=1;i<=l;i++)if(a[i]!="X"&&b[i]!="X"&&a[i]!=b[i])n++;print n}' <(seqkit grep -p "$2" "$1"|seqkit seq -s;seqkit seq -s "$1");}
seqkit seq -g spikes.fa|seqkit locate -Pp SNV|sed 1d|cut -f1|sort|uniq -c|sort|awk 'NR==FNR{a[$2]=$0;next}{$2=a[$2]}1' <(dif2x spikes.fa AAP41037.1|paste - <(seqkit seq -n spikes.fa)) -
I didn't find any other trimers which occurred 5 times in the spike protein of SARS-like viruses, apart from "NFN" and "FNF" in Sarbecovirus sp. HN2021D, and "ITP" in 8 sequences like Rs7896 (which all end with a four-digit number that starts with 78 or 79):
seqkit seq -g spikes.fa|seqkit fx2tab|cut -d' ' -f2-|awk -F\\t '{gsub("X","",$2);l=length($2);for(i=1;i<=l-k+1;i++)print substr($2,i,k)"\t"$1}' k=3|awk '$1!~/X/'|LC_ALL=C sort|uniq -c|LC_ALL=C sort -r|head -n100
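For anyone who wants to try other cutoffs without seqkit, here is a rough sketch of the same idea in plain awk. The function name `kmer_repeats` and its arguments are my own invention for illustration, not part of the pipeline above; it assumes an ordinary protein FASTA file, removes alignment gaps, and skips any k-mer containing an unknown residue (X) instead of joining the residues on either side of it.

```shell
# Hypothetical helper: list k-mers that occur at least MIN times within a
# single sequence of a FASTA file. Uses only POSIX awk and sort.
kmer_repeats() { # usage: kmer_repeats FASTA [K] [MIN]
  awk -v k="${2:-3}" -v min="${3:-4}" '
    # Tally the k-mers of the sequence collected so far, then reset.
    function flush(   i, s) {
      gsub(/-/, "", seq)                        # drop alignment gaps
      for (i = 1; i <= length(seq) - k + 1; i++) {
        s = substr(seq, i, k)
        if (index(s, "X") == 0) n[s]++          # skip k-mers with unknown residues
      }
      for (s in n) if (n[s] >= min) print n[s] "\t" s "\t" name
      split("", n)                              # portable way to clear the array
      seq = ""
    }
    /^>/ { if (seen) flush(); name = substr($0, 2); seen = 1; next }
    { seq = seq $0 }                            # sequences may span several lines
    END { if (seen) flush() }
  ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Example: trimers that occur at least 4 times within one spike sequence
# kmer_repeats spikes.fa 3 4
```

Each output line is count, k-mer, and sequence name separated by tabs, with the highest counts first. Overlapping occurrences are counted, so "AAAA" contains "AAA" twice.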
"YN2018B, Rs3367, Rs7327, Rs9401, YN2016A, YN2016B, YN2016C, YN2016D, YN2016E"
All of these come from one of two sources (which effectively may be just one source). Those in the form RsXXXX are from WIV (an entity of CAS) and EcoHealth, those in the form YNXXXXX are from Institute of Pathogen Biology (another entity of CAS) and EcoHealth. So while it sounds like a lot, none of it is independent of Chinese government control. The presence of EcoHealth remains a mystery.
https://www.frontiersin.org/articles/10.3389/fmicb.2019.01900/full
"I didn't find any other trimers which occurred 5 times in the spike protein of SARS-like viruses"
Thanks for this info. How about with 4 occurrences? I'll have a closer look at the UK and Russian CoVs.
I wish that Russian CoV paper hadn't been edited by Peng Zhou (of WIV) and Danielle Anderson (Linfa Wang associate). There's no escaping them.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8779456/
You're not allowed to say Danielle Anderson without adding her official title "the last Western scientist to work at the Wuhan Institute of Virology."
Obligatory! I will add it when I do an update.
Khosta-1 and RhGB01 each have an SNV that's unique further downstream, which could be coincidence. More interesting, I think, is that Khosta-1 has the same SNV as BM48-31 in the RBM. The presence of Peng Zhou and Danielle Anderson editing the paper isn't helpful. Neither is the fact that the sequences weren't submitted until 28-May-2021. It raises more questions than it answers.
Two of the interspaced Y residues in the Y?Y?Y pattern of SARS 1 and SARS 2 are also included in many HKU5 and MERS samples and some hedgehog coronaviruses:
QTGVIADYNYKLPDDFMGC-VLAWNTRNI----DATSTGNYNYKYR----- SARS coronavirus Tor2|NC_004718.3|YP_009825051.1
QTGKIADYNYKLPDDFTGC-VIAWNSNNL----DSKVGGNYNYLYR----- Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1|NC_045512.2|YP_009724390.1
SAEAISMFNYNQDYSNPTCRIHATVTANVSSVMNFTADNNYAYISRCQGTD Betacoronavirus Erinaceus isolate ErinaceusCoV/Italy/50265-11/2019|MW246799.1|QRN68066.1
SAGPISQFNYKQSFSNPTCLILATVPHNLTTI---TKPLKYSYINKCSRLL Middle East respiratory syndrome coronavirus isolate KFMC-8|KT121579.1|AKN24812.1
SAGDIPMYNYKQSFANPTCRVLATVPSNL-TL---VKPAAYGYIQKCSRLS Tylonycteris robustula coronavirus isolate 162275|ON745165.1|UUT43901.1
SAGNIPLYNYKQAFANPTCRVMASVPPNV-TI---TKPEAYGYISKCSRLT Tylonycteris bat coronavirus HKU4 isolate GZ1863|MW218390.1|QPX50171.1
SADRIVRFNYNQDYSNPSCRIHSKVNSSI-GI---SYAGAYSYITNCNYGA Coronavirus Neoromicia/PML-PHE1/RSA/2011|KC869678.4|AGY29650.2
SAGEIVQFNYKQDFSNPTCRVLATVPQNLTTI---TKPSNYAYLTECYKTS Pipistrellus bat coronavirus HKU5 isolate 19S|KC522093.1|AGP04932.1
The YNYK motif is repeated twice in SARS 1 with 24 residues in between, but the first YNYK is also included in the Tylonycteris bat coronaviruses above.
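The spacing can be checked directly from the Tor2 row of the alignment above. This is only a small sketch: `motif_positions` is a made-up helper name, and it assumes the row is passed in as a plain string, with alignment gaps ("-") removed before counting.

```shell
# Hypothetical helper: print the 1-based start positions of MOTIF in an
# alignment row, after removing gap characters. Uses only POSIX awk.
motif_positions() { # usage: motif_positions MOTIF ALIGNED_ROW
  awk -v m="$1" -v s="$2" 'BEGIN {
    gsub(/-/, "", s)                            # positions refer to the ungapped sequence
    n = length(m)
    for (i = 1; i <= length(s) - n + 1; i++)
      if (substr(s, i, n) == m) printf "%s%d", (c++ ? " " : ""), i
    print ""
  }'
}

# Tor2 row copied from the alignment above
motif_positions YNYK 'QTGVIADYNYKLPDDFMGC-VLAWNTRNI----DATSTGNYNYKYR-----'
# prints: 8 36
```

The first YNYK covers residues 8 to 11 of the excerpt and the second starts at residue 36, so residues 12 to 35 lie in between, which is the 24 residues mentioned above. The positions are relative to the excerpt, not to the full spike protein.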
I found the sequences by searching BLAST for the 1000 closest matches to Wuhan-Hu-1's spike protein, with SARS 2 excluded from the search results.
Yes, so that first occurrence is conserved in many CoVs, but not in the RBM. Idk if it's important. I'll be taking a look at MERS/HKU4/5/Hedgehog CoVs etc in future. Handy to know.
My coding skills are very rusty (particularly this shell stuff, an occasional bit of Python is about all I do these days). But I have some ideas - will stop by.
A very interesting study. Have you thought of making comparisons at some other 'informational level'? So instead of amino acid to amino acid comparisons, use some of those grouping metrics used in the early Expasy. I suppose you are actually doing this when you compare structures. Then there is of course, AlphaFold - https://alphafold.ebi.ac.uk/. All strength to your elbow!!
I've been using AlphaFold, a fantastic tool for visualizing these bat viruses, for most of which no structure has been determined. To date it hasn't been useful for protein-protein interactions, such as receptor binding, but AlphaFold2 is a step closer apparently:
Improved prediction of protein-protein interactions using AlphaFold2
https://www.nature.com/articles/s41467-022-28865-w
Of course that might also be of great assistance to wannabe bioterrorists and rogue states.
Your basic aim, I assume, is to recognise the fake sequences from the real ones. The gold standard for that is to have another lab repeat the sequencing, or to use the sequences coming from reliable labs, and assume real sequences will have similar structure, so you need a metric of how close an unreliable sequence is to a reliable one. Apologies, I'm just mulling.
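As a starting point only, the crudest such metric is a count of mismatching columns between two aligned sequences. The sketch below is an assumption about how that could look, not something anyone in this thread has validated: `aln_mismatches` is a made-up name, both inputs must already be aligned to the same length, and columns with a gap ("-") or an unknown residue ("X") in either row are skipped.

```shell
# Hypothetical helper: count mismatching columns between two aligned rows,
# ignoring columns where either row has a gap or an unknown residue.
aln_mismatches() { # usage: aln_mismatches REFERENCE QUERY
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) {
      print "error: rows differ in length" > "/dev/stderr"
      exit 1
    }
    for (i = 1; i <= length(a); i++) {
      x = substr(a, i, 1); y = substr(b, i, 1)
      if (x == "-" || y == "-" || x == "X" || y == "X") continue  # uninformative column
      cmp++                                     # columns actually compared
      if (x != y) mis++
    }
    printf "%d/%d\n", mis, cmp                  # mismatches/compared columns
  }'
}

# Tor2 vs WIV1 in the RBM region quoted earlier in this thread
aln_mismatches DATSTGNYNYKYRYLRH DATQTGNYNYKYRSLRH
# prints: 2/17
```

A plain mismatch count treats every substitution as equal, so it says nothing about whether a change is conservative or structurally important; a grouping or structure-based metric would have to replace the `x != y` test.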