Question: You are given a human gene (Ensembl identifier) and are asked to characterize it bioinformatically. What analyses can you carry out on this gene?
You must present your findings in a report format which should include:
A title, your name and student ID
A brief introduction, stating what your given gene is, setting out the aims of this report and your intended methodology.
The rest of the report should outline your findings. You may choose to describe the output of each analysis tool under its own heading, e.g. "BlastP analysis". Under each heading, along with your results and their interpretation, state the purpose of the analysis and the parameters under which it was carried out, and report the confidence levels of the results if these are available.
Figures and tables must be numbered and captioned.
You must include a citation if you are referencing a paper (but no need to include references for the bioinformatic tools you used). References should be listed in a Bibliography at the end of your report (make use of the referencing software available to you).
Note that some marks are allocated towards the formatting and presentation of the report.
Overall length should be between 25 and 40 pages.
You are encouraged to include a form of bioinformatic analysis that is not listed below.
Some of the obvious things to do are the following:
Start by reporting the gene symbol of the gene and a summary of the function of the gene. What is the location and size of the gene, introns and exons? Are there any regulatory elements or relevant genes in the vicinity of your own? Is there a paper describing how the gene was sequenced?
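The gene size and exon/intron question above comes down to simple arithmetic on exon coordinates. The following is a minimal sketch, assuming you have copied the exon start and end positions of one transcript from a genome browser as 1-based inclusive coordinates. The function name and the example coordinates are hypothetical and do not describe any real gene.

```python
def exon_intron_lengths(exons):
    """Return (gene_length, exon_lengths, intron_lengths).

    exons: list of (start, end) pairs, 1-based inclusive, for ONE transcript.
    The gene length here is the span from the first exon start to the last
    exon end, which is an assumption; a gene with several transcripts may
    span a larger region.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    gene_length = ordered[-1][1] - ordered[0][0] + 1
    return gene_length, exon_lengths, intron_lengths


# Hypothetical coordinates, for illustration only.
example_exons = [(1000, 1199), (1500, 1649), (2000, 2299)]
gene_length, exon_lengths, intron_lengths = exon_intron_lengths(example_exons)
print(gene_length, exon_lengths, intron_lengths)
```

A quick sanity check is that the exon and intron lengths add up to the gene length.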
Can you assess if this gene is conserved in other species such as mouse, rat, dog, elephant and horse using the UCSC genome browser?
Get the protein sequence of this gene. Recommended resources are the UCSC genome browser or the UniProt database.
Find homologs for this human gene among different placental mammals using BLASTP. This exercise is exploratory and requires that you try a few things before you conclude your BLAST search. For instance, you can start the search using the "Non redundant/nr database" and note the number of sequences you retrieve. You could also attempt the same search using the "Refseq_protein" database and see if the results differ. The "nr database" is generally the first choice for any BLAST search, but if you start getting too many hits, you may resort to the "Refseq_protein" database.
In general, the goal of this exercise is to retrieve orthologs of your query sequence from different placental mammals. It could very well be the case that the gene assigned to you is conserved in other vertebrates such as zebrafish or chicken. However, at such larger phylogenetic distances, the sequences start to diverge, and the alignment may not be of high quality.
Therefore, try to keep sequences from placental mammals in your alignment (but you won't be penalized if your alignment also has sequences from other vertebrate genomes).
Build a multiple sequence alignment using Clustal Omega. This is a somewhat hit-and-miss approach. Initially, there is a pool of several sequences available to you. However, some of the sequences will need to be removed because they have a very long N-terminus or C-terminus, which produces alignment columns that are mostly empty. Therefore, you have to build the alignment iteratively: identify "contaminating" sequences, remove them from your FASTA file, and then re-align. Eventually you will generate a multiple sequence alignment, which has to be copied into a text file and submitted together with your assignment.
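One way to get a first list of candidate "contaminating" sequences before re-aligning is to compare sequence lengths. The sketch below is only an illustration: the 20% tolerance around the median length is an arbitrary assumption, the function names are hypothetical, and the example sequences are made up. Flagged sequences should still be checked by eye in the alignment before they are removed.

```python
from statistics import median


def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first word after '>'.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()[0]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def flag_length_outliers(records, tolerance=0.2):
    """Return identifiers whose length differs from the median length
    by more than `tolerance` (as a fraction of the median)."""
    lengths = {name: len(seq) for name, seq in records.items()}
    typical = median(lengths.values())
    return sorted(
        name for name, n in lengths.items()
        if abs(n - typical) > tolerance * typical
    )


# Made-up sequences, for illustration only.
example_fasta = """>human
MKTAYIAKQR
>mouse
MKTAYIAKQR
>dog
MKTAYIAKQRA
>odd
MKTAYIAKQRMKTAYIAKQR
"""
print(flag_length_outliers(parse_fasta(example_fasta)))
```

Length alone does not show whether the extra residues sit at the N-terminus or the C-terminus, so this is a screening step, not a replacement for inspecting the alignment.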
The purpose of MSA is to identify conserved regions as well as regions of diversity.
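Conserved and variable regions can also be summarized numerically once the alignment is final. The following is a minimal sketch, assuming the aligned sequences have been read into equal-length strings with "-" as the gap character. The score used (fraction of sequences sharing the most common residue in a column) is a deliberately simple choice and is not the scoring that Clustal Omega itself reports; the function name and example alignment are hypothetical.

```python
from collections import Counter


def column_conservation(aligned):
    """Return one score per alignment column.

    Score = (count of the most common non-gap residue) / (number of sequences).
    A column containing only gaps scores 0.0.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Made-up alignment, for illustration only.
example_alignment = ["MKT-A", "MKS-A", "MRTGA"]
for position, score in enumerate(column_conservation(example_alignment), start=1):
    print(position, round(score, 2))
```

Columns scoring close to 1 are candidates for conserved positions; runs of low scores point to variable or gap-rich regions.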
Find domains along the length of this protein sequence using InterPro. The output from this exercise should be a screenshot displaying the domains in the protein. Feel free to use other domain detection/prediction tools as well. You can supplement your report with predictions from additional tools. Do the various tools corroborate one another, or are there differences? Do the predicted domains tell you anything about the protein?
Predict the secondary structure of the protein. Use at least 2 different secondary structure prediction tools and discuss if the predicted secondary structures are in agreement with each other or not. What, if any, is the relationship between the predicted secondary structures, conserved regions, and predicted domains?
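Agreement between two secondary structure predictions can be quantified per residue. The sketch below assumes that each tool's output has been reduced by hand to a three-state string of equal length (H for helix, E for strand, C for coil), which is an assumption: real tools use different output formats and sometimes more states. The function names and example strings are hypothetical.

```python
def ss_agreement(pred_a, pred_b):
    """Fraction of residues where the two predictions give the same state."""
    if not pred_a or len(pred_a) != len(pred_b):
        raise ValueError("predictions must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(pred_a, pred_b))
    return matches / len(pred_a)


def disagreement_segments(pred_a, pred_b):
    """Return 1-based inclusive (start, end) ranges where the predictions differ."""
    segments = []
    start = None
    for i, (a, b) in enumerate(zip(pred_a, pred_b), start=1):
        if a != b and start is None:
            start = i
        elif a == b and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(pred_a)))
    return segments


# Made-up predictions, for illustration only.
tool_1 = "CCHHHHCCEEEC"
tool_2 = "CCHHHCCCEEEE"
print(round(ss_agreement(tool_1, tool_2), 2))
print(disagreement_segments(tool_1, tool_2))
```

The disagreement ranges can then be compared against the residue ranges of the conserved regions and predicted domains to see whether the tools differ mainly at element boundaries or inside domains.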
Predict the three-dimensional structure of the protein using Swiss-Model and AlphaFold. Discuss some of the details of the predicted tertiary structures, and assess whether each is a good structure based on the metrics that we have covered, such as the Ramachandran plot. It could also be that someone has already determined the 3D structure of your protein; if so, report this information accordingly.
Report some insights about this gene from databases - what are the functions of this gene, is this gene associated with any disease, does this gene have a mouse homolog and if yes, have researchers created a mouse knockout of this gene to model a disease?
More specifically, I am looking for a two-page summary about the gene assigned to you. GeneCards is a good starting point, but please do not copy the text from GeneCards. The purpose of this task is for you to learn about one human gene: what it does in terms of its molecular function, and whether it is implicated in disease. Furthermore, you could look for information related to your gene in the OMIM database. Is there something about this gene that you find fascinating?
A graphic that describes the function of the gene, its implication in a human disease, and any drugs that target it would be a nice addition to this part of the assignment. If you took the information about your gene from the scientific literature (which should ideally be the case), please also add references at the end of this section of the assignment.
Summary section (conclusion) - You may conclude the assignment by writing a one-page summary of the gene assigned to you. Ideally this section should contain conclusions from tasks 1-9 listed above.
Put all your results together in a Word document, convert it to a PDF file, name it according to the convention StudentID_Assignment2, and submit it to BrightSpace. In addition, you need to submit the multiple sequence alignment file as a ".txt" file.