We use 8-character sample IDs that encode several pieces of information about each sample, for example '22FAlama'.
This can be interpreted as follows:
22 | Year |
FA | 2-character code representing the school: Forfar Academy |
lama | 4-character code indicating the variety: Lady Margaret Boscawen |
BA | Banchory Academy |
BC | Beaulieu Convent School |
TA | St Thomas of Aquin's High School |
FA | Forfar Academy |
QA | Queen Anne High School |
SP | St Peter the Apostle High School |
SM | St Modan's High School |
cher | Cheerio |
coul | Coulmouny |
cove | Coverack beauty |
empr | Empress |
fofi | Forest Fire |
furn | Furness |
lama | Lady Margaret Boscawen |
lofy | Loch Fyne |
luci | Lucifer |
mihu | Minnie Hume |
morv | Morven |
orna | Ornatus |
poet | Poetarum |
prin | Princeps |
alba | Albatross |
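The ID scheme above can be decoded mechanically. The following is a minimal sketch, assuming the fixed 2 + 2 + 4 character layout described above; the names `SampleID` and `parse_sample_id` are hypothetical and not part of the project's own tooling, and the year is kept as the raw two-character string.

```python
from typing import NamedTuple

# Lookup tables copied from the school and variety code lists above.
SCHOOLS = {
    "BA": "Banchory Academy",
    "BC": "Beaulieu Convent School",
    "TA": "St Thomas of Aquin's High School",
    "FA": "Forfar Academy",
    "QA": "Queen Anne High School",
    "SP": "St Peter the Apostle High School",
    "SM": "St Modan's High School",
}
VARIETIES = {
    "cher": "Cheerio", "coul": "Coulmouny", "cove": "Coverack beauty",
    "empr": "Empress", "fofi": "Forest Fire", "furn": "Furness",
    "lama": "Lady Margaret Boscawen", "lofy": "Loch Fyne",
    "luci": "Lucifer", "mihu": "Minnie Hume", "morv": "Morven",
    "orna": "Ornatus", "poet": "Poetarum", "prin": "Princeps",
    "alba": "Albatross",
}


class SampleID(NamedTuple):
    year: str      # two characters, as written in the ID
    school: str    # full school name
    variety: str   # full variety name


def parse_sample_id(sample_id: str) -> SampleID:
    """Split an 8-character sample ID into year, school and variety."""
    if len(sample_id) != 8:
        raise ValueError(f"expected 8 characters, got {len(sample_id)}")
    year, school_code, variety_code = sample_id[:2], sample_id[2:4], sample_id[4:]
    # Unknown codes raise KeyError rather than being silently accepted.
    return SampleID(year, SCHOOLS[school_code], VARIETIES[variety_code])


print(parse_sample_id("22FAlama"))
```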
The data available from this site have been generated as part of a Royal Society (London) project (The Scottish Daffodil Project) to introduce 16-18 year old school pupils across Scotland to genome sequencing, molecular evolution and ecology.
The data are all freely available for teaching and research but are preliminary prior to formal publication by the Scottish Daffodil Project. The Scottish Daffodil Project is preparing publications that include: (a) the principles and teaching value of practical genome sequencing and genome analysis in schools and (b) the scientific interpretation of the genome data generated in the Project.
The expectation is that the Scottish Daffodil Project will submit articles within 18 months of the first release of raw data. If you wish to use the data in your own publications during this time, then please contact the Scottish Daffodil Project to ensure you have the best quality data to work from. Also, the following acknowledgment should be included: "These data were produced by the Scottish Daffodil Project in collaboration." We request that you notify the Scottish Daffodil Project upon publication so that this information can be included in the final annotation of the data and reporting on the Scottish Daffodil Project.
While the data are still in waiting period status, the assembly and raw sequence reads should not be redistributed or repackaged without permission from the Scottish Daffodil Project. Once moved to unreserved status, the data are freely available for any subsequent use.
This project is a collaboration between Jon Hale (Head of Biology at Beaulieu Convent School, Jersey), the University of Dundee School of Life Sciences and the James Hutton Institute.
Nine schools are working in conjunction with STEM partners from the University of Dundee and the James Hutton Institute to sample various daffodil varieties and carry out DNA sequencing of the chloroplast genome using Oxford Nanopore MinION sequencing. The schools are then analysing the resulting sequence to answer specific research questions.
The resulting sequencing data from all the schools are being collated on this site, along with additional sequences obtained by Beaulieu Convent School, Jersey. This enables the phylogeny of the various varieties to be inferred.
This project has been funded by The Royal Society, with additional funding from the Friends of Dundee University Botanic Garden to enable the University of Dundee Data Analysis Group to provide this centralised resource with consistent analysis processes.
We would be happy for anyone to reuse these data in their own projects, but please see our Data Usage Policy.
The sequences have all been processed using an automated workflow consisting of a number of stages, which takes us from the raw data produced by the MinION sequencer to assembled, annotated sequences.
Basecalling is the process of converting the signal from the MinION sequencer into the DNA bases it represents. The instrument produces thousands of reads of varying length, up to 40-50 kb. Basecalling was carried out using the Oxford Nanopore Guppy software (version 6.1.2) with the appropriate high-accuracy model for the flowcells and sequencing kits used (dna_r9.4.1_450bps_hac).
Every base of sequence generated is assigned a quality score, which is a measure of how likely the base is to have been called correctly. Scores range from 0 to 40 on a logarithmic scale: a score of 10 represents a 1 in 10 chance that the base call is incorrect, 20 a 1 in 100 chance, and 30 a 1 in 1000 chance. Sequences are filtered using NanoFilt to remove those shorter than 300 bp and those with an average quality score below 10.
A large proportion of the DNA in these samples is not daffodil chloroplast DNA, so to make the assembly process easier, contaminants are removed from the sequence data. Kraken 2.1.2 is used to identify the origin of the sequences in each sample. The sequence reads are mapped to a known daffodil chloroplast sequence using Winnowmap 2.03, and those which show no similarity to this reference sequence are discarded.
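The relationship between a quality score and the chance of an incorrect base call can be written as p = 10^(-Q/10). The sketch below illustrates that conversion together with a length and quality filter using the thresholds quoted above. It is not NanoFilt's code: the function names are hypothetical, and averaging via error probabilities (rather than averaging the raw scores) is an assumption of this sketch.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a quality score Q to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def mean_quality(scores: list[int]) -> float:
    """Average read quality (assumption: average the error probabilities,
    then convert the mean back to the logarithmic quality scale)."""
    mean_err = sum(phred_to_error_prob(q) for q in scores) / len(scores)
    return -10 * math.log10(mean_err)


def passes_filter(sequence: str, scores: list[int],
                  min_length: int = 300, min_quality: float = 10.0) -> bool:
    """Keep reads that are long enough and of sufficient average quality."""
    return len(sequence) >= min_length and mean_quality(scores) >= min_quality


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

Because the scale is logarithmic, a few very poor bases pull the average down more strongly under this approach than a plain mean of the scores would.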
Each individual sequence read represents only part of the chloroplast genome, so a process called assembly is used to find where the reads overlap with each other, allowing the sequence of the entire chloroplast genome to be determined. This produces sequences referred to as 'contigs', which contain contiguous stretches of sequence. Ideally we would obtain a single contig representing the entire genome, but in reality we typically see more than one. The assembly was carried out using Flye 2.9. There is typically far more data present in these samples than is required for assembly, so only the longest reads are used, with enough reads to cover the genome 50 times. Once the assembly is complete, the remaining reads are mapped against it to identify and remove errors in a process called polishing. Three iterations of polishing were carried out in this case.
To determine the correct orientation and order of the contigs, they are aligned against an existing daffodil chloroplast genome, which also allows missing regions of the assembly to be identified. This process produces scaffolds, which combine the contig sequences in the correct order and orientation and fill any gaps in the sequence with 'N' bases, each representing an unknown base. Scaffolding was carried out using RagTag 2.1.0, with the MH706763.1 database sequence as a reference.
Since the chloroplast genome is a circular molecule, the linear representation produced by assembly may start at any point in the genome. To try to resolve this, the sequence is aligned against the known chloroplast genome and the location of the start position identified. The sequence is cut at this point and the removed part added back to the end of the sequence.
The positions of the genes on the chloroplast genome were determined by finding similarity to those present in the MH706763 sequence. An initial annotation is carried out using the Rapid Annotation Transfer Tool; once all sequences are available, a more comprehensive annotation is carried out using GeSeq.
The individual assembled chloroplast genomes are combined and a multiple sequence alignment is carried out using the MAFFT aligner, with its L-INS-i iterative pairwise local alignment method. Some samples have produced only partial assemblies, which, if included in the phylogenetic analysis, would greatly reduce the usable sequence. Alignments are therefore carried out both on all the sequences and, separately, on just those chloroplast sequences of which >120 kb has been assembled.
Phylogenetic analysis is carried out using RAxML with the GTRCAT substitution model. The analysis is bootstrapped to provide an estimate of the confidence of the placement of each sequence in the tree. The number of bootstrap iterations is determined automatically using the 'autoMRE' method.
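The idea of keeping only the longest reads up to 50 times coverage can be illustrated as follows. This is a minimal sketch of the selection principle only; how the workflow or Flye performs the selection internally is not described here, and the function name and the example genome size are hypothetical.

```python
def select_longest_reads(read_lengths: list[int], genome_size: int,
                         target_coverage: float = 50) -> list[int]:
    """Pick the longest reads until their combined length reaches
    genome_size * target_coverage bases."""
    target_bases = genome_size * target_coverage
    selected: list[int] = []
    total = 0
    # Walk through the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 16-base "genome" at 50x needs 800 bases of reads.
print(select_longest_reads([100, 500, 300, 200], genome_size=16))
```

Coverage here is simply total selected bases divided by genome size, so a chloroplast genome of roughly 160 kb (an illustrative figure) would need about 8 Mb of reads for 50 times coverage.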
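The ordering, orientation and gap-filling steps can be sketched in a few lines. This is an illustration of the concept, not RagTag's algorithm: the `Placement` record, the half-open reference coordinates, and the choice to size each gap from the reference (with a minimum of one 'N') are all assumptions of this sketch.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    ref_start: int   # where the contig aligns on the reference (inclusive)
    ref_end: int     # end of the alignment on the reference (exclusive)
    sequence: str    # contig sequence as assembled
    reverse: bool    # True if the contig aligned to the reverse strand


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(placements: list[Placement]) -> str:
    """Join contigs in reference order, flipping reversed contigs and
    filling the space between neighbours with 'N' bases."""
    parts: list[str] = []
    prev_end = None
    for p in sorted(placements, key=lambda p: p.ref_start):
        if prev_end is not None:
            # Gap size inferred from the reference; at least one N.
            parts.append("N" * max(p.ref_start - prev_end, 1))
        parts.append(reverse_complement(p.sequence) if p.reverse else p.sequence)
        prev_end = p.ref_end
    return "".join(parts)


contigs = [Placement(8, 12, "AACG", True), Placement(0, 4, "ACGT", False)]
print(build_scaffold(contigs))
```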
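The cut-and-rejoin step is a rotation of the linear sequence. A minimal sketch, assuming the start position has already been found by alignment and is given as a 0-based index (the function name is hypothetical):

```python
def rotate_circular(sequence: str, start: int) -> str:
    """Rotate a linearised circular sequence so that it begins at `start`.

    The part before `start` is cut off and appended to the end, so no
    bases are lost or duplicated.
    """
    if not 0 <= start < len(sequence):
        raise ValueError("start must be a valid 0-based position in the sequence")
    return sequence[start:] + sequence[:start]


print(rotate_circular("GGTTAACC", 4))
```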
A plot of the resulting tree is produced using the R ggtree package.
Albatross has clear white, spreading, broadly ovate perianth segments with wavy margins, and a bowl-shaped, ribbed pale lemon cup with an orange rim. Its seed parent is Ornatus, and its pollen parent Empress.
Total Sequenced Bases (bp) | 1,844,402,915 | Mean Quality | 10.5 |
Mean Read Length (bp) | 2,408.6 | N50 Read Length (bp) | 3,782.0 |
Proportion of Chloroplast Bases (%) | 2.8 | Chloroplast Coverage | 327 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 150604 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 150761 |
Empress has a funnel-shaped deep yellow trumpet, with ovate, white perianth segments.
Total Sequenced Bases (bp) | 154,483,778 | Mean Quality | 11.3 |
Mean Read Length (bp) | 1,906.5 | N50 Read Length (bp) | 3,539.0 |
Proportion of Chloroplast Bases (%) | 17 | Chloroplast Coverage | 165 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 164568 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 165515 |
Lady Margaret Boscawen has a star-shaped flower with a yellow trumpet, and white, broadly ovate perianth segments. Its seed parent is Horsfieldii, and its pollen parent Ornatus.
Total Sequenced Bases (bp) | 1,293,024,147 | Mean Quality | 12.1 |
Mean Read Length (bp) | 1,072.8 | N50 Read Length (bp) | 1,558.0 |
Proportion of Chloroplast Bases (%) | 37 | Chloroplast Coverage | 2,982 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 149092 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 153920 |
Lady Margaret Boscawen has a star-shaped flower with a yellow trumpet, and white, broadly ovate perianth segments. Its seed parent is Horsfieldii, and its pollen parent Ornatus.
Total Sequenced Bases (bp) | 43,051,165 | Mean Quality | 10.2 |
Mean Read Length (bp) | 1,531.3 | N50 Read Length (bp) | 2,428.0 |
Proportion of Chloroplast Bases (%) | 1.5 | Chloroplast Coverage | 3 |
Assembled Contigs | 10 | Assembled Contig Length (bp) | 35625 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 128314 |
Loch Fyne has a yellow trumpet, and ivory white, overlapping ovate perianth segments. Its seed parent was Minnie Hume, and its pollen parent Lady Margaret Boscawen.
Total Sequenced Bases (bp) | 276,232,033 | Mean Quality | 11.9 |
Mean Read Length (bp) | 751.6 | N50 Read Length (bp) | 1,024.0 |
Proportion of Chloroplast Bases (%) | 55 | Chloroplast Coverage | 954 |
Assembled Contigs | 3 | Assembled Contig Length (bp) | 159260 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 142923 |
Lucifer has an orange trumpet, and ivory white, overlapping ovate perianth segments.
Total Sequenced Bases (bp) | 3,793,871,703 | Mean Quality | 10.7 |
Mean Read Length (bp) | 2,344.6 | N50 Read Length (bp) | 4,325.0 |
Proportion of Chloroplast Bases (%) | 1.9 | Chloroplast Coverage | 451 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 138489 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 138598 |
Minnie Hume has a white trumpet, and white, ovate, spreading perianth segments. Its seed parent is Albicans, and its pollen parent N. radiiflorus.
Total Sequenced Bases (bp) | 637,388,752 | Mean Quality | 10.3 |
Mean Read Length (bp) | 1,562.3 | N50 Read Length (bp) | 2,449.0 |
Proportion of Chloroplast Bases (%) | 1.7 | Chloroplast Coverage | 65 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 127041 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 143798 |
Ornatus has white, obovate, spreading perianth segments, with a disc-shaped cup that is green at the base and has a red band at the rim.
Total Sequenced Bases (bp) | 172,644,669 | Mean Quality | 10.3 |
Mean Read Length (bp) | 2,731.9 | N50 Read Length (bp) | 4,175.0 |
Proportion of Chloroplast Bases (%) | 9.9 | Chloroplast Coverage | 106 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 143892 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 149960 |
N. radiiflorus var. poetarum has white, obovate, spreading perianth segments, and a disc-shaped cup with a red rim.
Total Sequenced Bases (bp) | 4,063,677,562 | Mean Quality | 10.2 |
Mean Read Length (bp) | 3,396.6 | N50 Read Length (bp) | 5,846.0 |
Proportion of Chloroplast Bases (%) | 1.4 | Chloroplast Coverage | 353 |
Assembled Contigs | 2 | Assembled Contig Length (bp) | 159765 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 159874 |
Princeps has sulphur white, narrowly ovate perianth segments overlapping only at the base, with a narrowly funnel-shaped rich yellow corona.
Total Sequenced Bases (bp) | 558,705,420 | Mean Quality | 10.2 |
Mean Read Length (bp) | 740.5 | N50 Read Length (bp) | 1,020.0 |
Proportion of Chloroplast Bases (%) | 0.13 | Chloroplast Coverage | 4 |
Assembled Contigs | 6 | Assembled Contig Length (bp) | 44844 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 145145 |
Total Sequenced Bases (bp) | 148,406 | Mean Quality | 10.2 |
---|---|---|---|
Mean Read Length (bp) | 1,613.1 | N50 Read Length (bp) | 2,265.0 |
Proportion of Chloroplast Bases (%) | 11 | Chloroplast Coverage | 0 |
Assembled Contigs | 0 | Assembled Contig Length (bp) | 0 |
Assembled Scaffolds | 0 | Assembled Scaffold Length (bp) | 0 |
Total Sequenced Bases (bp) | 941,653 | Mean Quality | 10.1 |
---|---|---|---|
Mean Read Length (bp) | 973.8 | N50 Read Length (bp) | 1,556.0 |
Proportion of Chloroplast Bases (%) | 11 | Chloroplast Coverage | 0 |
Assembled Contigs | 0 | Assembled Contig Length (bp) | 0 |
Assembled Scaffolds | 0 | Assembled Scaffold Length (bp) | 0 |
Total Sequenced Bases (bp) | 689,139 | Mean Quality | 10.1 |
---|---|---|---|
Mean Read Length (bp) | 1,027.0 | N50 Read Length (bp) | 1,512.0 |
Proportion of Chloroplast Bases (%) | 16 | Chloroplast Coverage | 0 |
Assembled Contigs | 1 | Assembled Contig Length (bp) | 3029 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 3038 |
Total Sequenced Bases (bp) | 16,703,346 | Mean Quality | 10.0 |
---|---|---|---|
Mean Read Length (bp) | 2,058.6 | N50 Read Length (bp) | 4,184.0 |
Proportion of Chloroplast Bases (%) | 3.3 | Chloroplast Coverage | 3 |
Assembled Contigs | 4 | Assembled Contig Length (bp) | 45827 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 131294 |
Total Sequenced Bases (bp) | 660,329 | Mean Quality | 11.0 |
---|---|---|---|
Mean Read Length (bp) | 2,934.8 | N50 Read Length (bp) | 6,953.0 |
Proportion of Chloroplast Bases (%) | 5 | Chloroplast Coverage | 0 |
Assembled Contigs | 0 | Assembled Contig Length (bp) | 0 |
Assembled Scaffolds | 0 | Assembled Scaffold Length (bp) | 0 |
Total Sequenced Bases (bp) | 11,975,706 | Mean Quality | 10.4 |
---|---|---|---|
Mean Read Length (bp) | 1,217.8 | N50 Read Length (bp) | 2,056.0 |
Proportion of Chloroplast Bases (%) | 9 | Chloroplast Coverage | 6 |
Assembled Contigs | 12 | Assembled Contig Length (bp) | 70605 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 127240 |
Total Sequenced Bases (bp) | 13,958,953 | Mean Quality | 10.3 |
---|---|---|---|
Mean Read Length (bp) | 2,510.2 | N50 Read Length (bp) | 5,812.0 |
Proportion of Chloroplast Bases (%) | 6.9 | Chloroplast Coverage | 6 |
Assembled Contigs | 6 | Assembled Contig Length (bp) | 105128 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 130314 |
Total Sequenced Bases (bp) | 4,591,656 | Mean Quality | 10.1 |
---|---|---|---|
Mean Read Length (bp) | 912.5 | N50 Read Length (bp) | 1,442.0 |
Proportion of Chloroplast Bases (%) | 12 | Chloroplast Coverage | 3 |
Assembled Contigs | 6 | Assembled Contig Length (bp) | 43778 |
Assembled Scaffolds | 1 | Assembled Scaffold Length (bp) | 142227 |
Two multiple sequence alignments have been carried out. The first includes all sequences (including partial ones), while the second includes just sequences > 120 kb in length, which were used for the phylogenetic analysis.
The JalviewJS link below for the 'long' sequences will also load the phylogenetic tree in addition to the alignment.
Each identified gene within the assembled chloroplast genomes has been translated to a protein sequence, and a multiple sequence alignment has then been generated for each gene. Alignments were created using MUSCLE 5.1 with default parameters.