The Data Analysis Group

RNA-Seq

RNA-Seq uses modern sequencing technologies to obtain the sequences of RNA molecules within a sample, and has become a standard technique for studying the transcriptome. It can be applied to identifying differentially expressed genes, de novo transcriptome assembly, improving genome annotation and analysing splice variants. When used for differential gene expression, RNA-Seq offers reduced noise and a much wider dynamic range than microarray-based methods. It can also assess all transcripts within a sample, whereas a microarray is restricted to the known gene sequences used in the design of the array. When combined with transcriptome assembly, transcripts can be quantified even in the absence of a reference genome sequence.
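As a rough illustration of what a differential expression measurement involves, the sketch below scales raw read counts to counts per million (CPM) and computes a log2 fold change between two samples. The gene names, counts and pseudocount are invented for illustration, and this is a simplification: real analyses rely on dedicated statistical tools and replicated samples.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of treated to control CPM; the pseudocount avoids log(0)."""
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2(
            (cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount)
        )
        for gene in control
    }

# Hypothetical read counts for three genes in two samples
control = {"geneA": 100, "geneB": 500, "geneC": 400}
treated = {"geneA": 400, "geneB": 500, "geneC": 100}
lfc = log2_fold_change(control, treated)
```

Here `geneA` comes out close to +2 (roughly four-fold up), `geneC` close to -2, and `geneB` unchanged.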

The Data Analysis Group (DAG) has extensive experience in the analysis of RNA-Seq experiments, having collaborated on over 40 projects to date. In addition, the DAG has participated in a large-scale experiment to identify which tools provide the best measures of differential expression, and to assess the effects of sample replication on analysis sensitivity.

Relevant Publications

Schurch, N.J. et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? 2016 RNA 22:839-851
Gierliński, M. et al. Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment 2015 Bioinformatics btv425

ChIP-Seq

ChIP-Seq allows the identification of binding sites of proteins which interact directly with DNA. Proteins bound to the DNA are immunoprecipitated using an antibody specific to the target protein, which leads to the coprecipitation of the bound DNA. This DNA is then isolated and sequenced. In its standard applications, ChIP-Seq is used for defining transcription factor binding sites or histone modifications across the entire genome. When used in combination with RNA-Seq and methylation analysis, it can allow determination of wider regulatory networks. Unlike conventional array-based approaches, no prior knowledge of binding sites is required.

Analysis of ChIP-Seq data typically involves alignment of the sequenced reads against the genome sequence. The reads will be enriched in regions which were bound to the target protein, resulting in peaks in the read alignments. These peaks are first identified (e.g. using MACS2), following which motif discovery may be carried out within the peaks (e.g. with HOMER) to identify conserved binding motifs.
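The idea of peaks in the read alignments can be sketched with a toy example. The code below builds per-base read depth from aligned read intervals and reports regions where depth reaches a fixed threshold. The read coordinates, genome length and threshold are all invented, and the fixed cutoff is an assumption made for simplicity: dedicated peak callers such as MACS2 use statistical models of the background rather than this approach.

```python
def coverage(reads, genome_length):
    """Per-base read depth from (start, end) half-open alignment intervals."""
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth

def call_peaks(depth, min_depth):
    """Return (start, end) half-open regions where depth >= min_depth."""
    peaks = []
    start = None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos                      # entering an enriched region
        elif d < min_depth and start is not None:
            peaks.append((start, pos))       # leaving an enriched region
            start = None
    if start is not None:                    # region runs to the end
        peaks.append((start, len(depth)))
    return peaks

# Hypothetical alignments on a 50 bp toy genome
reads = [(5, 15), (8, 18), (10, 20), (30, 40), (12, 22)]
depth = coverage(reads, 50)
peaks = call_peaks(depth, min_depth=3)
```

With these toy reads, the overlapping alignments produce a single enriched region, while the isolated read at 30 to 40 falls below the threshold.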

The DAG has been involved in numerous ChIP-Seq projects, including investigation of changes in protein binding during DNA replication, analysis of rDNA promoter binding factors, and analysis of SUMO1 distribution between pluripotent and differentiated cells.

Relevant Publications

Natsume, T. et al. Kinetochores coordinate pericentromeric cohesion and early DNA replication by Cdc7-Dbf4 kinase recruitment 2013 Molecular Cell 50:661-674

Variant Analysis

Differences between genome sequences underlie many of the defining characteristics of individual members of a population, and are frequently implicated in disease. Genetic variants can be divided into three categories:

Single Nucleotide Variants (SNVs) - changes to a single base within the genome. When located within coding genes, these may be classed as synonymous, where the change does not alter the resulting protein sequence, or non-synonymous, where the change results in an amino-acid substitution in the protein. The impact of a non-synonymous variant depends on the type of amino acid substituted and the effect this has on the protein structure. SNVs in non-coding regions can also affect gene expression, e.g. when located in promoter or enhancer regions. The most severe effects tend to come from variants which introduce stop codons within a coding sequence, resulting in truncated proteins being produced.
Insertions/Deletions (INDELs) - short insertions or deletions within the genome. INDELs can have considerably more severe effects on an individual than SNVs, since they can shift downstream sequence out of frame, severely disrupting (if not truncating) the protein sequence.
Structural Variations (SVs) - large-scale structural rearrangements within the genome such as translocations, inversions or duplications. These are normally far rarer events than SNVs and INDELs, but have the potential to produce greater impacts through alteration of gene structure or dosage.
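The distinctions drawn above for coding variants can be sketched in a few lines. The code below classifies a single-base change as synonymous, non-synonymous or stop gained using the standard genetic code, and flags whether an INDEL shifts the reading frame. The function names and the example coding sequence are hypothetical, and the sketch assumes an in-frame, forward-strand coding sequence with no introns.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_snv(cds, position, alt_base):
    """Classify a single-base change at a 0-based position in a coding sequence."""
    offset = position % 3
    codon_start = position - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    return "non-synonymous"

def classify_indel(length_change):
    """An INDEL shifts the reading frame unless its length is a multiple of three."""
    return "in-frame" if length_change % 3 == 0 else "frameshift"

# Hypothetical coding sequence: ATG GAA TGG TAA (Met Glu Trp stop)
cds = "ATGGAATGGTAA"
silent = classify_snv(cds, 5, "G")      # GAA -> GAG, both Glu
missense = classify_snv(cds, 3, "A")    # GAA -> AAA, Glu -> Lys
nonsense = classify_snv(cds, 8, "A")    # TGG -> TGA, Trp -> stop
deletion = classify_indel(-2)           # 2 bp deletion
```

A fuller annotation would also need to handle strand, splice sites and non-coding regions, which this sketch deliberately leaves out.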

DAG members have experience of variant analysis encompassing all three classes of variant, in organisms of differing ploidy. These range from individual microbial genomes through non-model organisms to large scale human population studies.

Relevant Publications

Chambers, J.C. et al. The South Asian Genome 2014 PLoS One 9(8) e102645
Corrigan, R.M. et al. c-di-AMP is a new second messenger in Staphylococcus aureus with a role in controlling cell size and envelope stress 2011 PLoS Pathogens 7(9) e1002217

Genome Assembly

Assembly of novel genome sequences was largely considered a solved problem in the early 2000s. However, the development of high-throughput sequencing methodologies drastically changed the assembly problem through a shift to very short reads, necessitating much greater depths of sequencing and the introduction of novel assembly methodologies. While initial assemblies from such approaches were notably less complete and more fragmented than those achieved using traditional methods, incremental improvements in sequencing technology and assembly algorithms have led to the routine production of much improved genome assemblies. The greatest obstacle to obtaining complete assemblies is correctly handling complex repetitive regions. Recent advances in long-read sequencing methods now make it possible to span such regions effectively, and to produce chromosome-level assemblies.
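One common way to put a number on how fragmented an assembly is, although it is not named in the text above, is the N50 statistic: the contig length at which contigs of that size or larger contain at least half of the assembled bases. The sketch below is a minimal illustration, and the contig lengths are invented.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

# Two hypothetical assemblies of the same 1,000 bp total size
fragmented = [100] * 10
contiguous = [600, 300, 100]
n50_fragmented = n50(fragmented)
n50_contiguous = n50(contiguous)
```

Both toy assemblies contain the same number of bases, but the more contiguous one has a much higher N50.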

Once an assembly has been produced, it must be annotated to provide information on gene structures, repetitive sequences, tRNAs, rRNAs and protein functions. Gene prediction can be challenging in organisms where little previous sequencing has been carried out. It typically involves additional transcriptome sequencing to identify expressed genes, and to obtain sufficient full-length gene sequences to train ab initio gene-finding software to correctly identify gene structures within the genome.

DAG members have extensive expertise in genome assembly and annotation, from the production of a pipeline for automated assembly and annotation of bacterial genome sequences, to the assembly of fungal, insect and large complex plant genomes.

Relevant Publications

Spanu, P.D. et al. Genome expansion and gene loss in powdery mildew fungi reveal trade-offs in extreme parasitism 2010 Science 330 (6010) 1543-6
Abbott, J.C. BugBuilder - An automated microbial genome assembly and analysis pipeline 2017 bioRxiv 148783

Metagenomics

Metagenomics involves the study of the collection of genomes within an environmental sample, although analysis is frequently restricted to bacterial or fungal members of the community. Direct analysis of an environmental sample allows relative quantitation of the community members, and is not restricted to organisms which can be effectively cultured. Analysis has frequently focused on sequencing highly conserved marker sequences (such as the bacterial 16S rRNA gene, or the fungal ITS region) to determine community membership. More recently, shotgun sequencing approaches have been applied to metagenomic studies, allowing the genome sequences of all community members to be determined. The resulting sequence reads can be used to assign taxonomic classifications to the community using tools such as MetaPhlAn or Kraken. An alternative approach assembles the metagenome sequence reads into contig sequences, from which coding genes can be identified and functionally classified, enabling the identification of functional enrichment between communities.
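Relative quantitation of community members can be illustrated with a short sketch. The code below takes one taxon label per read and converts the counts into relative abundances, ignoring unclassified reads. The input format and taxon counts are invented for illustration and do not correspond to the actual output format of MetaPhlAn or Kraken.

```python
from collections import Counter

def relative_abundance(read_taxa, unclassified="unclassified"):
    """Fraction of classified reads assigned to each taxon."""
    counts = Counter(taxon for taxon in read_taxa if taxon != unclassified)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical per-read taxonomic assignments
read_taxa = (
    ["Bacteroides"] * 6
    + ["Escherichia"] * 3
    + ["Lactobacillus"] * 1
    + ["unclassified"] * 2
)
abundance = relative_abundance(read_taxa)
```

Read-count fractions like these are only a first approximation, since they ignore differences in genome size and marker copy number between organisms.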

DAG members have experience of studies involving both 16S bacterial analysis and large-scale shotgun metagenome experiments, the latter involving production of an analysis pipeline for assembly-based metagenome reconstruction and analysis.

Relevant Publications

Hoyles, L. et al. Molecular phenomics and metagenomics of hepatic steatosis in non-diabetic obese women 2018 Nature Medicine 24 1070-80