Data sources and preprocessing

For influenza, we download the data directly from GenBank using the NCBI Datasets CLI. You can explore the data in our Loculus instance.
For SARS-CoV-2, we download the data from Nextstrain, which obtains the data from GenBank and processes it. (For CoV-Spectrum, we have an instance which uses data from GISAID.)
For West Nile virus, Mpox, Ebola, and RSV, we use data from Pathoplexus, which also includes data from INSDC.

We use Nextclade as the main tool for preprocessing the sequences. The following sequences are used as a reference for alignment:

SARS-CoV-2: NC_045512
Mpox: NC_063383.1
RSV-A: PP109421.1
RSV-B: OP975389.1
West Nile virus: NC_009942.1
Ebola Sudan: NC_006432.1
Ebola Zaire: NC_002549.1
Influenza A/H5N1: A/goose/Guangdong/1/1996
Influenza A/H1N1pdm: We use the reference assembly GCF_001343785.1 except for HA (uses CY121680.1) and NA (uses MW626056.1).
Influenza A/H3N2: We use the reference assembly GCF_000865085.1 except for HA (uses CY163680.1) and NA (uses CY114383.1).
Influenza B/Victoria: We use the same references as the official Nextclade dataset. These are:
- seg1: CY115157.1
- seg2: CY115158.1
- seg3: CY115156.1
- seg4: KX058884.1
- seg5: CY115154.1
- seg6: CY073894.1
- seg7: CY115152.1
- seg8: CY115155.1
Dengue 1: NC_001477.1
Dengue 2: NC_001474.2
Dengue 3: NC_001475.2
Dengue 4: NC_002640.1
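
As an illustration of how the references listed above could feed into Nextclade, the following is a minimal sketch and not our actual pipeline. It records the single-segment references in a lookup table and builds a `nextclade run` command line without executing it. The function name `nextclade_command`, the `references/<accession>.fasta` file layout, and the input and output paths are assumptions made for the example; the `--input-ref` and `--output-all` options should be checked against the Nextclade version you have installed.

```python
import shlex

# Reference accessions copied from the list above (single-reference viruses only;
# the segmented influenza references are left out to keep the sketch short).
REFERENCES = {
    "SARS-CoV-2": "NC_045512",
    "Mpox": "NC_063383.1",
    "RSV-A": "PP109421.1",
    "RSV-B": "OP975389.1",
    "West Nile virus": "NC_009942.1",
    "Ebola Sudan": "NC_006432.1",
    "Ebola Zaire": "NC_002549.1",
    "Dengue 1": "NC_001477.1",
    "Dengue 2": "NC_001474.2",
    "Dengue 3": "NC_001475.2",
    "Dengue 4": "NC_002640.1",
}


def nextclade_command(virus, sequences_fasta, output_dir):
    """Build (but do not run) a `nextclade run` invocation for one virus."""
    accession = REFERENCES[virus]
    # Hypothetical layout: one reference FASTA per accession in ./references.
    reference_fasta = f"references/{accession}.fasta"
    return [
        "nextclade", "run",
        "--input-ref", reference_fasta,
        "--output-all", output_dir,
        sequences_fasta,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand.
    print(shlex.join(nextclade_command("Mpox", "mpox.fasta", "out/mpox")))
```

Keeping the accessions in one table means a change of reference only has to be made in a single place, and an unknown virus name fails early with a `KeyError` instead of producing a malformed command.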