Data sources and preprocessing

GenSpectrum mostly uses open data from the International Nucleotide Sequence Database Collaboration (INSDC), which consists of GenBank, ENA, and DDBJ.

  • For influenza and RSV, we download the data directly from GenBank using the NCBI Datasets CLI. You can explore the data in our Loculus instance.
  • For SARS-CoV-2, we download the data from Nextstrain, which obtains the data from GenBank and processes it. (For CoV-Spectrum, we have an instance that uses data from GISAID.)
  • For West Nile virus and mpox, we use data from Pathoplexus, which also includes data from INSDC.
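
The sourcing described in the list above can be summarized as a small lookup table. The following Python sketch is illustrative only: the dictionary keys, the DataSource class, and the ncbi_download_command helper are hypothetical names, not part of GenSpectrum. The helper assembles a call to the NCBI Datasets CLI using its `datasets download virus genome taxon` subcommand; the exact arguments GenSpectrum passes are not stated here, so the taxon and file name are left as parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    provider: str  # where the data is fetched from
    note: str      # how that provider relates to INSDC, per the list above


# Hypothetical summary of the list above; the keys are illustrative labels.
SOURCES = {
    "influenza": DataSource("GenBank", "downloaded with the NCBI Datasets CLI"),
    "rsv": DataSource("GenBank", "downloaded with the NCBI Datasets CLI"),
    "sars-cov-2": DataSource("Nextstrain", "GenBank data processed by Nextstrain"),
    "west-nile-virus": DataSource("Pathoplexus", "also includes data from INSDC"),
    "mpox": DataSource("Pathoplexus", "also includes data from INSDC"),
}


def ncbi_download_command(taxon: str, out_file: str) -> list[str]:
    """Build (but do not run) an NCBI Datasets CLI call for one virus taxon.

    Assumption: the actual invocation used by GenSpectrum may differ.
    """
    return [
        "datasets", "download", "virus", "genome", "taxon", taxon,
        "--filename", out_file,
    ]
```

The returned list could be passed to subprocess.run on a machine where the datasets executable is installed; the sketch deliberately stops short of running it.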

We use Nextclade as the main tool for preprocessing the sequences. The following sequences are used as a reference for alignment: