Task 1 - Data analysis genomics
For your coursework, you will analyse next generation sequencing data and spectrometry data produced by different techniques.
The coursework has two sections:
Section A tests your learning in genomics (learning outcome 1).
Section B tests your learning in proteomics (learning outcome 3).
You will be given an individual dataset for each section.
This document contains the brief for Section A. Dr David Smith will provide the brief for Section B.
You must perform the analyses yourself and write methods, results and answers in your own words. This is important because we are assessing whether you can perform the analysis and whether you have understood the ideas you have been learning about.
Section A - Genomics
Your data are from Illumina paired-end sequencing of one human. You should analyse these resequencing data to:
• get a list of genomic variants
• summarise the variants
Part 1: Producing a list of the individual's genomic variants
Instructions
Start by creating a directory for this coursework inside your home directory. All the files you produce should go into this directory. Do not copy the raw data files into it; instead, use the raw data files as input by providing their file path.
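The idea of keeping outputs in the coursework directory while reading the raw data in place can be sketched as follows. This is only an illustration, not a required method: the directory name `genomics_coursework`, the raw data path and the `_results.txt` suffix are hypothetical placeholders, and the same thing can be done directly on the command line.

```python
from pathlib import Path


def make_coursework_dir(home: Path, name: str = "genomics_coursework") -> Path:
    """Create the coursework directory inside `home` (no error if it already exists)."""
    workdir = home / name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def output_path(workdir: Path, raw_file: Path, suffix: str) -> Path:
    """Build an output path inside `workdir`, named after the raw input file.

    The raw file itself is never copied; only its name is reused.
    """
    stem = raw_file.name.split(".")[0]  # "sample_R1.fastq.gz" becomes "sample_R1"
    return workdir / f"{stem}{suffix}"


# Example use (all paths are hypothetical placeholders):
#   workdir = make_coursework_dir(Path.home())
#   raw_reads = Path("/path/to/raw_data/sample_R1.fastq.gz")  # read in place
#   results = output_path(workdir, raw_reads, "_results.txt")  # written here
```

The input is referred to by its full path, so nothing needs to be copied, and every file produced ends up in the one coursework directory.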
Make a text file (in MS WordPad, Apple TextEdit, Notepad++ or other software). On the first line of the text file, state which directory you are working in. Copy each command into the text file as you run it. Check that each command is written exactly as you gave it on the command line, with no typos and no autocorrected spelling.
This analysis is very similar to the one you did with a practice dataset in the computer practicals. You should do the computational steps that are needed to produce a list of variants. Some exercises we did in the practicals were to help you understand the format and contents of file types; you do not need to repeat those exercises.
Part 2: Summary of the variants detected
Instructions
Investigate the types of variants that were found and report how many were of each type (e.g. the number that were SNVs and the number that were indels). There are many ways of classifying variants and you should decide for yourself how to do this (by thinking about which aspects of variants are most interesting). ANNOVAR can annotate variants with a lot of information. In human genomics, we are usually interested in the variants that are most likely to cause disease.
You should present your results in a table. You can classify variants in a few ways, but the table should take up no more than half a page. The counting can be done in Excel, which will be demonstrated in class. Credit will be given for working out a bioinformatics method to do this (such as using software or writing a script).
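As one illustration of what a scripted count might look like, the sketch below tallies variants by type. It assumes the variants are in a VCF-style, tab-separated file with the reference allele in column 4 and the alternate allele(s) in column 5. The function names, the sample lines and the classification rules are illustrative assumptions, not a prescribed method, and you should decide on your own classification.

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (a deliberately simple rule)."""
    # Symbolic or missing alleles cannot be classified by length.
    # Simplification: other special notations (e.g. breakends) are not handled.
    if alt in {".", "*"} or alt.startswith("<"):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "MNV"  # same length, longer than one base


def count_variant_types(vcf_lines) -> Counter:
    """Count variant types over an iterable of VCF lines (e.g. an open file)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # A site with several ALT alleles contributes one count per allele.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts


# Made-up example lines, for illustration only (not real data).
SAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t300\t.\tC\tCGG,T\t50\tPASS\t.",
]

if __name__ == "__main__":
    for variant_type, n in sorted(count_variant_types(SAMPLE_VCF).items()):
        print(f"{variant_type}\t{n}")
```

To run this on a real file, the open file handle can be passed in place of the sample list, for example `with open("variants.vcf") as handle: counts = count_variant_types(handle)`, where `variants.vcf` is a hypothetical file name. Counting each alternate allele separately is a design choice; counting one variant per site would give different totals, so state which rule you used.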