40 Comments
Apr 3, 2023·edited Apr 4, 2023Liked by Dog's Breakfast

The following code downloads FASTA files for nucleotide and amino acid sequences of SARS-like viruses, it aligns the spike protein sequences, and it sorts the sequence by their number of mismatches to Tor2 in the region which features the DATSTGNYNYKYRYLR sequence in Tor2:

brew install mafft seqkit brewsci/bio/snp-dists xmlstarlet

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW' # 335 aligned sequences of SARS-like viruses from GenBank

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.aa

seqkit grep -nrp spike\|surface sarslike.aa|mafft ->spike.aln

snp-dists sarslike.fa>sarslike.dist

xml fo -D sarslike.xml|xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v GBSeq_definition -o $'\t' -v GBSeq_create-date -o $'\t' -v './/GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' -o $'\t' -v '(.//GBAuthor)[1]' -o ... -v '(.//GBAuthor)[last()]' -o $'\t' -v '(.//GBReference_title[text()!="Direct Submission"])[last()]' -o $'\n'>sarslike.tsv

tab(){ awk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h in a){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";} # `tab \\t` is like `column -ts$'\t'` but it doesn't get thrown off by empty fields

x=NC_004718.3;seqkit subseq -r490:506 spike.aln|seqkit fx2tab|sed $'s/_prot_[^\t]*//;s/lcl|//'|gawk '{l=length($2);for(i=1;i<=l;i++)a[$1][i]=substr($2,i,1);b[$1]=$2}END{for(i in a){d=0;for(j=1;j<=l;j++)if(a[targ][j]!=a[i][j])d++;print i"\t"b[i]"\t"d}}' targ=$x|awk 'NR==FNR{a[$1]=$2;next}{print$3,$2,a[$1],$1}' {,O}FS=\\t <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -|sort -n|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print a[$4]"\t"$0}' <(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}{print$1 FS$i}' x=$x sarslike.dist) -|sort -n|awk 'NR==FNR{a[$1]=$3 FS$4 FS$5;next}{print$0"\t"a[$NF]}' {,O}FS=\\t sarslike.tsv -|tab \\t

I posted the output of the shell commands here: https://pastebin.com/raw/GDm9PNqD.

Eight bat SARS viruses featured the sequence DATSTGNHNYKYRYLRH which has only one mismatch: BtRs-BetaCoV/YN2018B, Rs9401, Rs7327, YN2016C, YN2016D, YN2016E, YN2016A, YN2016B. They all have between 1254 and 1283 nucleotide changes from Tor2. WIV1 has about a hundred fewer nucleotide changes from Tor2 (1150) but it has two mismatches (DATQTGNYNYKYRSLRH). The only genome with three mismatches is "Rhinolophus affinis coronavirus isolate LYRa11" (DATSSGNFNYKYRSLRH), where the number of mismatches is pretty low considering that the whole genome has 2672 nucleotide changes from Tor2. The LYRa11 sequence was published in 2014 as part of a paper titled "Identification of Diverse Alphacoronaviruses and Genomic Characterization of a Novel Severe Acute Respiratory Syndrome-Like Coronavirus from Bats in China".

The Y?Y?Y pattern of three Y residues interspaced by single other residues is also featured in Wuhan-Hu-1: DSKVGGNYNYLYRLFRK. The region is identical in BANAL-52, BANAL-236, and BANAL-103. But in RaTG13 the first four residues DAKE instead of DSKV. And ZC45 has deletions in the middle of the sequence: "DV---GN--YFYRSHRS".

Expand full comment
Apr 3, 2023Liked by Dog's Breakfast

Thoroughly enjoyable read. It's refreshing to find so much detail presented in such a clear and accessible form.

Once you see these similarities laid out, it's striking how such a comparative analysis hasn't been attempted before now. I hope this stimulates lots of discussion and even further analysis.

Two of the characters you mention have not gained as much attention as the likes of Fauci, Daszak, Shi or even Baric - I'm referring to Lin-Fa Wang and Garry Crameri - their role in the SARS-COV-2 drama is worthy of scrutiny imho.

Congrats on your first post, welcome to substack and I hope we'll have a chance to read more from you.

Expand full comment
Apr 5, 2023Liked by Dog's Breakfast

Why does the SARS origins timeline begin in 2008. Did the Chinese invent time travel?

Expand full comment
Apr 5, 2023Liked by Dog's Breakfast

A truely amazing story. I love those SNAP diagrams - a elegantly simple, yet so instructive. Keep up the good work. Adrian

Expand full comment
Apr 5, 2023Liked by Dog's Breakfast

Dear DB,

Your rabbit hole has been most productive by any standards.

I have been helping a Ph.D. student in Prague with an analysis of a potyvirus, and got sidetracked into the lineage from which it came. I am now straightening that out, and it will provide some lovely comparisons for SNAPs - I'll let you know how I go. Needless to say a large proportion of the sequences come from China, and I'm starting to feel a Dog's Breakfast coming on!!!

Adrian

PS I presume you know all the historical stuff about the establishment of ANAHL - I helped Bede Morris keep FMDV out of it, at least initially - it was quite a campaign.

Expand full comment
Apr 5, 2023Liked by Dog's Breakfast

Apart from WIV1, the "SNV" trimer is also included 5 times in these sequences which all have about 100 amino acid changes from Tor2: BtRs-BetaCoV/YN2018B, Rs3367, Rs7327, Rs9401, YN2016A, YN2016B, YN2016C, YN2016D, YN2016E.

The Rs and YN sequences are both supposed to come from Chinese horseshoe bats (Rhinolophus sinicus). The Rs sequences were published in a paper from 2016 titled "Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus", where the last author was Zhengli Shi and the other authors included Peter Daszak. The YN (Yunnan) sequences were published in a paper from 2021 titled "A comprehensive survey of bat sarbecoviruses across China for the origin tracing of SARS-CoV and SARS-CoV-2".

You wrote: "Before and since the discovery of BM48-31 several other SARS related viruses have been discovered in Europe, by authentically independent teams, but none have claimed to find similar RBM features." But the SNV trimer is also included three times in some of the European samples from Russia (Khosta-1 and Khosta-2) and the UK (RfGB01, RfGB02, RhGB01, RhGB02, RhGB03, RhGB04, RhGB05, RhGB06, RhGB07, RhGB08):

curl -Lso spikes.fa 'https://drive.google.com/uc?export=download&id=1r9TzeL6jaQsV6JChQL8r9-WG9-3Y4Wgw'

dif2x(){ awk 'NR==1{split($0,a,"");l=length;next}{split($0,b,"");n=0;for(i=1;i<=l;i++)if(a[i]!="X"&&b[i]!="X"&&a[i]!=b[i])n++;print n}' <(seqkit grep -p "$2" "$1"|seqkit seq -s;seqkit seq -s "$1");}

seqkit seq -g spikes.fa|seqkit locate -Pp SNV|sed 1d|cut -f1|sort|uniq -c|sort|awk 'NR==FNR{a[$2]=$0;next}{$2=a[$2]}1' <(dif2x spikes.fa AAP41037.1|paste - <(seqkit seq -n spikes.fa)) -

I didn't find any other trimers which occured 5 times in the spike protein of SARS-like viruses, apart from "NFN" and "FNF" in Sarbecovirus sp. HN2021D, and "ITP" in 8 sequences like Rs7896 (which all end with a four-digit number that starts with 78 or 79):

seqkit seq -g spikes.fa|seqkit fx2tab|cut -d' ' -f2-|awk -F\\t '{gsub("X","",$2);l=length($2);for(i=1;i<=l-k+1;i++)print substr($2,i,k)"\t"$1}' k=3|awk '$1!~/X/'|LC_ALL=C sort|uniq -c|LC_ALL=C sort -r|head -n100

Expand full comment
Apr 5, 2023Liked by Dog's Breakfast

Two of the interspaced Y residues in the Y?Y?Y pattern of SARS 1 and SARS 2 are also included in many HKU5 and MERS samples and some hedgehog coronaviruses:

QTGVIADYNYKLPDDFMGC-VLAWNTRNI----DATSTGNYNYKYR----- SARS coronavirus Tor2|NC_004718.3|YP_009825051.1

QTGKIADYNYKLPDDFTGC-VIAWNSNNL----DSKVGGNYNYLYR----- Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1|NC_045512.2|YP_009724390.1

SAEAISMFNYNQDYSNPTCRIHATVTANVSSVMNFTADNNYAYISRCQGTD Betacoronavirus Erinaceus isolate ErinaceusCoV/Italy/50265-11/2019|MW246799.1|QRN68066.1

SAGPISQFNYKQSFSNPTCLILATVPHNLTTI---TKPLKYSYINKCSRLL Middle East respiratory syndrome coronavirus isolate KFMC-8|KT121579.1|AKN24812.1

SAGDIPMYNYKQSFANPTCRVLATVPSNL-TL---VKPAAYGYIQKCSRLS Tylonycteris robustula coronavirus isolate 162275|ON745165.1|UUT43901.1

SAGNIPLYNYKQAFANPTCRVMASVPPNV-TI---TKPEAYGYISKCSRLT Tylonycteris bat coronavirus HKU4 isolate GZ1863|MW218390.1|QPX50171.1

SADRIVRFNYNQDYSNPSCRIHSKVNSSI-GI---SYAGAYSYITNCNYGA Coronavirus Neoromicia/PML-PHE1/RSA/2011|KC869678.4|AGY29650.2

SAGEIVQFNYKQDFSNPTCRVLATVPQNLTTI---TKPSNYAYLTECYKTS Pipistrellus bat coronavirus HKU5 isolate 19S|KC522093.1|AGP04932.1

The YNYK motif is repeated twice in SARS 1 with 24 residues in between, but the first YNYK is also included in the Tylonycteris bat coronaviruses above.

I found the sequences by searching BLAST for the 1000 closest matches to Wuhan-Hu-1's spike protein so that I excluded SARS 2 from the search results.

Expand full comment
May 6, 2023Liked by Dog's Breakfast

A very interesting study. Have you thought of making comparisons at some other 'informational level'? So instead of amino acid to amino acid comparisons, use some of those grouping metrics used in the early Expasy. I suppose you are actually doing this when you compare structures. Then there is of course, AlphaFold - https://alphafold.ebi.ac.uk/. All strength to your elbow!!

Expand full comment
May 6, 2023Liked by Dog's Breakfast

Your basic aim, I assume, is to recognise the fake sequences from the real ones. The gold standard for that is to have another lab repeat the sequencing, or to use the sequences coming from reliable labs, and assume real sequences will have similar structure, so you need a metric of how close an unreliable sequence is to a reliable one. Apologies,I'm just mulling.

Expand full comment