Data sources and preprocessing

GenSpectrum mostly uses open data from the International Nucleotide Sequence Database Collaboration (INSDC), which consists of GenBank, ENA and DDBJ.

  • For influenza, RSV and Mpox, we download the data directly from GenBank using the NCBI Datasets CLI.
  • For SARS-CoV-2, we download the data from Nextstrain, which obtains the data from GenBank and processes it.
  • For the West Nile virus, we use data from Pathoplexus, which also includes data from INSDC.

(For CoV-Spectrum, we also run an instance that uses data from GISAID.)
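
As a rough illustration of the GenBank download step mentioned above, the following Python sketch assembles and runs an NCBI Datasets CLI call of the form `datasets download virus genome taxon`. It is not GenSpectrum's actual pipeline code: the helper names `build_download_command` and `download`, the taxon value and the output file name are assumptions chosen for the example, and the real queries may use additional filters.

```python
import shutil
import subprocess


def build_download_command(taxon: str, output_zip: str) -> list[str]:
    """Build the argument list for an NCBI Datasets virus genome download.

    `taxon` is a taxon name or NCBI Taxonomy ID, and `output_zip` is the
    path of the zip archive to write. Both are supplied by the caller;
    the values GenSpectrum uses are not stated in this document.
    """
    return [
        "datasets", "download", "virus", "genome", "taxon", taxon,
        "--filename", output_zip,
    ]


def download(taxon: str, output_zip: str) -> None:
    """Run the download. Requires the `datasets` executable on PATH."""
    if shutil.which("datasets") is None:
        raise RuntimeError("NCBI Datasets CLI (`datasets`) not found on PATH")
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_download_command(taxon, output_zip), check=True)


# Illustrative values only: 10244 is the NCBI Taxonomy ID for monkeypox
# virus, and "mpox.zip" is a hypothetical output file name.
command = build_download_command("10244", "mpox.zip")
print(" ".join(command))
```

Building the command separately from running it makes it possible to inspect or log the exact call before any data is fetched; `download("10244", "mpox.zip")` would then perform the actual request.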

We use Nextclade as the main tool for preprocessing the sequences. The following sequences are used as references for alignment: