GenSpectrum mostly uses open data from the International Nucleotide Sequence Database Collaboration (INSDC), which consists of GenBank, ENA and DDBJ.
- For influenza, RSV and Mpox, we download the data directly from GenBank using the NCBI Datasets CLI.
- For SARS-CoV-2, we download the data from Nextstrain, which obtains the data from GenBank and processes it.
- For West Nile virus, we use data from Pathoplexus, which also includes data from INSDC. (For CoV-Spectrum, we have an instance that uses data from GISAID.)
We use Nextclade as the main tool for preprocessing the sequences. The following sequences are used as references for alignment:
- SARS-CoV-2: NC_045512
- Mpox: NC_063383.1
- RSV-A: EPI_ISL_412866 (available on GISAID; an almost identical sequence is LR699737)
- RSV-B: OP975389.1
- West Nile virus: NC_009942.1
- Influenza A/H5N1: A/goose/Guangdong/1/1996
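
For anyone reproducing the alignment step, the list above can be kept as a small lookup table. The following is a minimal sketch in Python; the names `REFERENCE_SEQUENCES` and `reference_for` are hypothetical and not part of GenSpectrum's code base, and the identifiers are copied verbatim from the list above.

```python
# Hypothetical lookup table: organism name -> reference sequence identifier.
# The identifiers are taken from the list above; the names used here are
# illustrative assumptions, not GenSpectrum's actual configuration.
REFERENCE_SEQUENCES = {
    "SARS-CoV-2": "NC_045512",
    "Mpox": "NC_063383.1",
    "RSV-A": "EPI_ISL_412866",  # GISAID identifier, not an INSDC accession
    "RSV-B": "OP975389.1",
    "West Nile virus": "NC_009942.1",
    "Influenza A/H5N1": "A/goose/Guangdong/1/1996",  # strain name
}


def reference_for(organism: str) -> str:
    """Return the reference identifier listed for an organism.

    Raises KeyError with a descriptive message if the organism is not listed.
    """
    try:
        return REFERENCE_SEQUENCES[organism]
    except KeyError:
        raise KeyError(f"No reference sequence listed for {organism!r}") from None
```

Note that the entries are not all of the same kind: most are INSDC accessions, while the RSV-A entry is a GISAID identifier and the influenza entry is a strain name, so they cannot all be resolved against the same database.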