SARS-1: Evidence of an Artificial Origin

Apr 3, 2023

As the world debates the origin of SARS-CoV-2, most assume the SARS outbreak of 2003 was a natural event. But on revisiting the evidence, I found parallels, direct linkages, and many unresolved questions.


40 Comments

henjin

Apr 3, 2023 (edited)

The following code downloads FASTA files with the nucleotide and amino acid sequences of SARS-like viruses, aligns the spike protein sequences, and sorts the sequences by their number of mismatches to Tor2 in the region that contains the DATSTGNYNYKYRYLR sequence in Tor2:

brew install mafft seqkit gawk brewsci/bio/snp-dists xmlstarlet # gawk is needed for the arrays of arrays used below

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW' # 335 aligned sequences of SARS-like viruses from GenBank

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.aa

seqkit grep -nrp spike\|surface sarslike.aa|mafft ->spike.aln

snp-dists sarslike.fa>sarslike.dist

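curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=gb&retmode=xml&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.xml # assumed missing step: nothing above creates sarslike.xml, and the GBSeq fields read below match GenBank XML

xml fo -D sarslike.xml|xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v GBSeq_definition -o $'\t' -v GBSeq_create-date -o $'\t' -v './/GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' -o $'\t' -v '(.//GBAuthor)[1]' -o ... -v '(.//GBAuthor)[last()]' -o $'\t' -v '(.//GBReference_title[text()!="Direct Submission"])[last()]' -o $'\n'>sarslike.tsv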

tab(){ gawk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h=1;h<=NR;h++){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";} # `tab \\t` is like `column -ts$'\t'` but it doesn't get thrown off by empty fields; it needs gawk for arrays of arrays, and the counted loop keeps rows in input order

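x=NC_004718.3 # Tor2 reference accession
# extract alignment columns 490-506 and count the mismatches to $x for each sequence
seqkit subseq -r490:506 spike.aln|seqkit fx2tab|sed $'s/_prot_[^\t]*//;s/lcl|//'|
gawk '{l=length($2);for(i=1;i<=l;i++)a[$1][i]=substr($2,i,1);b[$1]=$2}END{for(i in a){d=0;for(j=1;j<=l;j++)if(a[targ][j]!=a[i][j])d++;print i"\t"b[i]"\t"d}}' targ=$x|
# add the sequence names from sarslike.fa and sort by mismatch count
awk 'NR==FNR{a[$1]=$2;next}{print$3,$2,a[$1],$1}' {,O}FS=\\t <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -|sort -n|
# prepend the whole-genome nucleotide distance to $x and sort by it
awk -F\\t 'NR==FNR{a[$1]=$2;next}{print a[$4]"\t"$0}' <(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}{print$1 FS$i}' x=$x sarslike.dist) -|sort -n|
# append columns 3-5 of sarslike.tsv (creation date, collection date, authors) and align the columns
awk 'NR==FNR{a[$1]=$3 FS$4 FS$5;next}{print$0"\t"a[$NF]}' {,O}FS=\\t sarslike.tsv -|tab \\t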

I posted the output of the shell commands here: https://pastebin.com/raw/GDm9PNqD.

Eight bat SARS viruses have the sequence DATSTGNHNYKYRYLRH, which has only one mismatch to Tor2: BtRs-BetaCoV/YN2018B, Rs9401, Rs7327, YN2016C, YN2016D, YN2016E, YN2016A, and YN2016B. They all have between 1254 and 1283 nucleotide changes from Tor2. WIV1 has about a hundred fewer nucleotide changes from Tor2 (1150), but it has two mismatches (DATQTGNYNYKYRSLRH). The only genome with three mismatches is "Rhinolophus affinis coronavirus isolate LYRa11" (DATSSGNFNYKYRSLRH), which is a low number of mismatches considering that the whole genome has 2672 nucleotide changes from Tor2. The LYRa11 sequence was published in 2014 as part of a paper titled "Identification of Diverse Alphacoronaviruses and Genomic Characterization of a Novel Severe Acute Respiratory Syndrome-Like Coronavirus from Bats in China".
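The mismatch counts quoted above can be rechecked by hand with a small helper. This is only a sketch: `mismatches` is a hypothetical function name, and the 17-residue Tor2 reference is an assumption (the 16-residue Tor2 sequence quoted earlier plus a trailing H, which all of the quoted bat sequences share).

```shell
# mismatches is a hypothetical helper, not part of the pipeline above.
# It prints the number of positions at which two equal-length peptides differ.
mismatches() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (length(a) != length(b)) { print "sequences differ in length" > "/dev/stderr"; exit 1 }
    d = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) d++
    print d
  }'
}

tor2=DATSTGNYNYKYRYLRH                 # assumed Tor2 reference for the 17-column window
mismatches "$tor2" DATSTGNHNYKYRYLRH   # YN2018B and the seven similar viruses: prints 1
mismatches "$tor2" DATQTGNYNYKYRSLRH   # WIV1: prints 2
mismatches "$tor2" DATSSGNFNYKYRSLRH   # LYRa11: prints 3
```

The helper compares the strings position by position, so it is only meaningful for sequences taken from the same alignment columns.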

The Y?Y?Y pattern of three Y residues separated by single other residues also appears in Wuhan-Hu-1: DSKVGGNYNYLYRLFRK. The region is identical in BANAL-52, BANAL-236, and BANAL-103. But in RaTG13 the first four residues are DAKE instead of DSKV. And ZC45 has deletions in the middle of the sequence: "DV---GN--YFYRSHRS".
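A quick way to test the quoted sequences for the Y?Y?Y pattern is a regular expression. This is a rough sketch: `has_yxyxy` is a hypothetical helper name, and reading "?" as "one residue that is neither Y nor an alignment gap" is my assumption.

```shell
# has_yxyxy is a hypothetical helper: it succeeds if the peptide contains
# Y, one non-Y non-gap residue, Y, one non-Y non-gap residue, Y.
has_yxyxy() {
  printf '%s\n' "$1" | grep -Eq 'Y[^Y-]Y[^Y-]Y'
}

# The sequences are the ones quoted in this comment.
while read -r name seq; do
  if has_yxyxy "$seq"; then hit=yes; else hit=no; fi
  printf '%-12s %-18s %s\n' "$name" "$seq" "$hit"
done <<'EOF'
Tor2 DATSTGNYNYKYRYLR
Wuhan-Hu-1 DSKVGGNYNYLYRLFRK
WIV1 DATQTGNYNYKYRSLRH
LYRa11 DATSSGNFNYKYRSLRH
ZC45 DV---GN--YFYRSHRS
EOF
```

With this reading of the pattern, Tor2, Wuhan-Hu-1, and WIV1 match, while LYRa11 (F in place of the first Y) and ZC45 (deletions) do not.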


Dogstack
