CSCI 5481 Homeworks 1 to 4 and Final Project solutions

$100.00

Original Work ?

Download Details:

  • Name: Homeworks-rrgzvp.zip
  • Type: zip
  • Size: 25.81 MB

Category: You will Instantly receive a download link upon Payment||Click Original Work Button for Custom work

Description

5/5 - (1 vote)

Homework 1 CSCI 5481, Computational Techniques for Genomics

The purpose of this exercise is to get you prepared to write convenience wrapper scripts that run 3rd party
software tools for performing alignment, assembly, phylogeny inference, secondary structure prediction, etc.
You will learn how to write a script in Python that uses command-line parameters, reads a file, and can
execute a command and retrieve its standard output and error output. You must be using a linux
operating system to complete this assignment.
Tasks
1. Download the Homework1 folder: https://www.dropbox.com/s/lw8a5l4g6b9yfjv/Homework1.zip?dl=1
2. Modify Homework1/template.py for Python 2 or Homework1/template-python3.py for Python 3 to do
the following:
a. Call the BURST command from within python using the following syntax:
./burst -r ref.fna -q query.fna –taxonomy taxonomy.txt -o output.txt
b. Print out the return value, standard output, standard error.
3. Verify that your program runs with:
python homework1.py -q query.fna -r ref.fna -t taxonomy.txt -c /path/to/burst -o output.txt -V
4. The output file shows information for every query sequence that hit a reference sequence at 97%
similarity or above. Answer these questions, using this guide to the meaning of the first 12 columns
in the output.txt file: https://github.com/seqan/lambda/wiki/BLAST-Output-Formats. The 13th column
shows the taxonomic assignment of each query sequence.
a. What fraction of the original input query sequences had a match in the database at 97% or
above? Note that a few matches fell below 97%. You can exclude these. Inspect query.fna,
which is in FASTA format, to ensure that you are counting the correct number of input
sequences.
b. What is the most common bacterial species in the query set? Note that any taxonomy string
ending with something after the last “s__” is a species name. If a string ends with “s__” and
nothing else, then it was ambiguous and does not count as a species.
c. What is the average percent similarity of the matches? You can exclude those below 97%.
Deliverables
1. Please turn in, via moodle:
a. Your well-commented code. Note in your code the names of any people with whom you
discussed the assignment. This includes code for answering question 4 above.
b. The output printed by your program (this is what is printed to standard out, not the BURST
output file).
c. The output.txt file generated by BURST
d. Answers to questions above in txt/doc/pdf

CSCI 5481 Homework 2: Anchored Global Sequence Alignment

This homework assignment is implementation of an anchored version of the standard Needleman-Wunsch
algorithm and application of the algorithm to align PAX and HOX proteins from human and fruit fly. The
anchored global sequence alignment assumes known matched regions between two sequences and
applies Needleman-Wunsch algorithm to align the unaligned regions between the matched ones. Implement
the anchored global sequence alignment algorithm and align the given sequences. (Hint: write
Needleman-Wunsch first — then the anchored algorithm is very simple extension of the Needleman-Wunsch
algorithm and you only need to implement a wrapper function that calls your other function).
Dataset
Download and extract the folder containing the sequences:
https://www.dropbox.com/s/clbbgkq0tem66fj/Homework2-sequences.zip?dl=1. For the matches files, i.e.
“Match_HOX.txt” and “Match_PAX.txt”, the first 2 columns represent the start and end positions of matched
regions for the human sequence, whereas the last 2 columns represent the start and end positions of the
matched regions for the fruit fly sequence.
Input and Output Format:
You can hardcode your substitution matrix and gap penalty values (-3 for mismatches, 1 for a match, -2 for
a gap; ignore the affine gaps). The command line for calling your program should be of the form:
programname seq1.fasta seq2.fasta [matches.txt]. Note that [matches.txt] means the third
file is optional. If the matches.txt is not provided, your program should run standard Needleman-Wunsch.
Output should be both the alignment score for this pair of sequences and the actual alignment itself printed
with gaps. You may also use command-line flags to label your parameters, e.g. python
programname.py -q eq1.fasta -r seq2.fasta [-m matches.txt].
Treat any special characters the same as the ones in alphabet, i.e. use the same match and mismatch
costs.
Tasks
1. (25 points) : Implement the Needleman-Wunsch algorithm (don’t forget your comments).
2. (25 points) : Based on your Needleman-Wunsch algorithm to implement the anchored version (don’t
forget your comments).
3. (25 points) : Use your algorithm to align the provided two pairs of sequences. Report the alignment
and the alignment score.
4. (25 points) : For both pairs of sequences, permute the amino acids in the sequences (use random
library in your chosen language) and repeat the alignment 10,000 times. Report the distribution with
a histogram of the random alignment score and mark the alignment score of the original sequences.
Do this only for the non-anchored version.
5. (5 bonus points) : You are required to align Titin human sequence and Titin mouse sequence using
anchored version you implemented in part 2. Titin is the largest known protein; its human variant
consists of 34,350 amino acids, and its mouse homologue is even larger, comprising 35,213 amino
acids. Hence, global alignment of these two sequences using Needleman-Wunsch will take very
long time and use a lot of memory. However, using anchored-alignment approach will significantly
improve the performance. Download and extract the bonus folder
(https://www.dropbox.com/s/50fx4ovd3zoeps9/Homework2-bonus-TITIN-sequences.zip?dl=1) to get
the Titin human sequence (“TITIN_Human.fa”), Titin mouse sequence (“TITIN_Mouse.fa”), and a
match file “TITIN_Match.txt”, in which the first 2 columns represent the start and end positions of
matched regions for the human sequence, and the last 2 columns represent the start and end
positions of the matched regions for the mouse sequence. Report the alignment and alignment
score.
Deliverables
1. Source file (your code). Please note in your code the names of the people who worked on it.
2. Readme file (text). The readme file should contain instructions on how to compile and run the
program.
3. Alignment results in a single file (text).
4. A pdf file giving the plots of the permuted and observed alignment scores in task 4.

CSCI 5481 Homework 3: Phylogeny Inference

CSCI 5481, Computational Techniques for Genomics
This homework assignment is an implementation of the Nei-Saitou neighbor-joining algorithm for phylogeny
construction, with estimation of bootstrap support.
Datasets
Download and extract the data: https://www.dropbox.com/s/81cqr72rpfgwnhy/Homework3.zip?dl=1.
If you will be using the supplied scripts for visualizing your tree (hw3-plot-newick.r or hw3-plot-edges.r), then
you will need to install R (Google it), then install the “ape” package and the “RColorBrewer” package by
running R, and then entering this command:
install.packages(c(‘ape’,’RColorBrewer’))
hw3.fna
File containing a multiple alignment of 61 bacterial 16S subunit ribosomal RNA sequences.
hw-tip-labels.txt
File containing tab-delimited rows of this format:
seqID Phylum color
There is a subfolder called solution. This contains examples of correct output:
solution/hw3-solution-genetic-distances.txt
Genetic distances (% different) between every pair of sequences in hw3.fna.
solution/hw3-nj-solution-edges.txt
Correct edges for neighbor-joining tree using R implementation, in preorder traversal order. Your edge order
and internal node labels do not need to be exactly the same.
solution/hw3-nj-solution-bootstrap.txt
Example bootstrap support values for the 59 internal nodes in preorder traversal order.
solution/tree.pdf, solution/tree-bootstrap.pdf
Example PDF plot of tree showing colored tips and bootstrap support (bonus) for internal nodes.
solution/hw3-nj-solution-tree.tre
Correct solution output tree in NEWICK format. Your node order does not need to be exactly the same.
Input and Output Format:
Your program should take one command line argument specifying the name of a sequence file. The
command line should be of the form programName sequence.fasta or programName -i sequence.fasta (it is
acceptable to invoke java or another program as part of programName, but only take one argument as
input).
Your program should output two files:
edges.txt
This file is tab delimited. Each row describes an edge in the tree. The first column is the ancestor node; the
second column is the descendant node; the third column is the edge length. Edges should be in preorder
traversal order choosing any internal node as the root. Tips must be indexed starting at 1, with 1
corresponding to the first sequence in the FASTA file, 2 corresponding to the second sequence, and so on.
The internal nodes should begin numbering at the root with ntips + 1 and the numbers should increase
according to preorder traversal.
tree.txt
This file is in NEWICK format with all edge distances and with only tips named. For example, The following
tree would be encoded (A:0.1,B:0.2,(C:0.3,D:0.4):0.5) :
Problems
1. (25 points): Read in the given FASTA file hw3.fna. Calculate the genetic distance (% dissimilarity)
between every pair of sequences, and write this to a tab-delimited file with rows and columns labeled by the
sequence identifiers. For pairwise dissimilarity calculations, you may count a gap in both sequences as a
similarity.
2. (25 points): Implement Nei-Saitou neighbor joining as described on Wikipedia
(https://en.wikipedia.org/wiki/Neighbor_joining) and/or in class notes. Provide extensive comments in your
code. I suggest that you store your tree in this data structure, but you can do it however you want:
edges, A 3-column matrix with column 1 representing an ancestor node, column 2 representing
the descendant node, and the final column representing the edge length. Tips should be indexed
starting at 1. Choose an arbitrary internal node to be the root. The internal nodes should begin
numbering with the root as ntips + 1. An example is shown in solution/hw3-nj-solution-edges.txt.
3. (25 points): Generate the two output files described above for your tree (edges matrix and NEWICK tree
file). For the edges matrix you will need to perform a preorder traversal. For the NEWICK file you will need
to perform a postorder traversal. These should be generated using recursive functions.
4. (25 points): Find a way to visualize your trees from step (3). There are many NEWICK-based viewers
online. Colors for the tips are provided in hw-tip-labels.txt. Colors are nice but optional. You must include
labels for your tips. ASCII art is acceptable but not preferred. If you made the output correctly for step (3),
you can use either of the included R scripts (after installing the “ape” and the “RColorBrewer” package as
described above):
Rscript hw3-plot-newick.r hw3-nj-solution-tree.txt hw3-tip-labels.txt
or
Rscript hw3-plot-edges.r hw3-nj-solution-edges.txt hw3-tip-labels.txt
Here is what my Nei-Saitou tree looks like:

BONUS (10 points): Perform 100 inferences of the tree in step (3) using bootstrap samples. Each bootstrap
sample of the DNA sequences should contain a random rearrangement of the original DNA columns
selected by repeated sampling with replacement until there are as many columns as in the original input.
This means that some columns will appear more than once in a bootstrap sample. Here is a nice depiction
of the resampling that Pin-Kuang showed us in class:
For each node in the original tree, get a list of the tips that are partitioned below it. Count the fraction of
bootstrap trees in which the exact same partition was made. That is the bootstrap confidence for the given
internal node. You may simply print the bootstrap values to a text file in the order of the indices of your
internal nodes as shown in your edges file from step (3). You can also plot the bootstrap confidence using
the supplied R scripts (red is fraction with bootstrap support):
Rscript hw3-plot-edges.r hw3-nj-solution-edges.txt hw3-tip-labels.txt
boots.txt

Deliverables
Source files (your code for Step 1, 2, 3)
Readme file explaining how to use your code (text)
Step 1 distances file
Step 3 outputs (edges.txt, tree.txt)
Visualization of Step 3 tree
Code for and visualization of Bonus (optional)
All files and source code should be added to a folder with your x500 username as the name of the folder.
Then, zip this folder and upload it on Moodle.

CSCI 5481 Homework 4: Metagenomics Classifiers

Background
The purpose of this assignment is to identify variable regions in amplicon sequences, and to compare those
results to the conventional wisdom about the locations of variable regions.
Datasets
Download and extract the data folder:
https://www.dropbox.com/s/zj3cuaca840qmyh/Homework4-seqs.fna.gz?dl=1.
seqs.fna
File containing a multiple alignment of about 5,000 16S rRNA gene sequences.
Input and Output Format:
This is an analysis project. You do not need to produce a standalone program, although you do need to turn
in your code (or commandline commands when using commandline tools).
Problems
1. (25 points): Calculate the conservation rate (average identity, or fraction of most common base) at each
position in the gapped alignment (1,475 positions). Save the identity values to a text file, one per line. A gap
should be treated as non-conserved for the given sequence at the given position. In other words, if a
position has 20% A’s, 5% G’s, and 75% gaps, then the position would be considered 20% conserved.
2. (25 points): Plot the variability from step (2) against the position in the gapped alignment. The plot should
look somewhat like this plot. You will need to perform some smoothing on your data before plotting. It is
your responsibility to decide on an appropriate approach to smoothing and to describe it in your code
comments. Note: this plot is old and based on a small amount of data, and may be using a slightly different
approach to calculating conservation (e.g. ignoring gaps in the denominator), so do not expect your plot to
look exactly like this.
3. (40 points): Find the variable regions. Most biologists accept that there are 9, but you may find a different
number. Use any approach you want, but justify your answer and provide any code used. Write the start
and end coordinates of each v region to a tab-delimited text file with the start in column 1 and the end in
column 2. You can use any approach you want as long as it is explained in your code comments. Here is an
approximate guide to the expected lengths and positions of the variable regions, when excluding gaps in a
representative gene from E. coli:
4. (10 points). Select a random subset of 100 sequences. Extract two variable regions (regions 1 and 4
were suggested, but if region 1 has too many gaps, you can use regions 2 and 4, for example) according to
your variable-region calling in part 3 above. Use any software you want, including your own, to build and
visualize a phylogeny from variable region 1, variable region 4, and the whole 16S. Can you tell which
variable region tree seems closest to the whole-16S tree? Which did you expect?
Deliverables
Source files (any code that you used for Step 1, 2, 3, 4)
Readme file explaining how you used your code (text)
Step 1: File giving start and end position of each variable region
Step 2: File containing identity values
Step 3: Figure showing plot of identity values.
Step 4: Fasta files containing variable regions 1 and 4. Figures of the V1 tree, V4 tree, and whole-16S tree.
Brief text response to questions.
All files and source code should be added to a folder with your x500 username as the name of the folder.
Then, zip this folder and upload it on Moodle.

Final Project CSCI 5481, Computational Techniques for Genomics

Project overview
There are three project options. Each project should culminate in the submission of a short written report in the form of
a scientific journal paper (see deliverables below), and in a short presentation on the last day of class.
This is intended to be a group-based project. Groups should be no more than 4 people. You may work on your own if
you wish. Each member of a group must contribute something critical to the project.
Project option 1
Bring your own data set to analyze. Use two or more techniques that we learned in the class to attempt to answer an
important biological question. Justify your choice of tools and your parameter settings. Include an informative
schematic diagram of the steps that you took, and 3 or more other visualizations. For each analysis step that you
perform, be sure to try at least one alternative way to do the analysis to determine if the results still hold.
Project option 2
Download a chromosome from the human genome (https://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/).
Simulate whole-genome shotgun data by breaking it into random overlapping substrings of length k. Break it down
further into a list of k-mers for some small k. Create the De Bruijn graph using k-mers as nodes, where a Hamiltonian
circuit would be needed to find a valid path through the genome. Then convert this to a k-1-dimensional De Bruijn
graph in which an Eulerian circuit would traverse all k-mers. Implement an algorithm to find an Eulerian circuit, and
report the assembled chromosome. Compare it to the original chromosome by plotting the coordinates of the
assembled location of each base against the known location of each base.
Try this first with a short section of 25-50 base pairs with several different values of k. Generate plots of the
k-dimensional and k-1-dimensional graphs. Then try it on larger and larger portions of the chromosome. Report the
performance (fraction correctly assembled and runtime). Then try simulating sequencing error in the data (1 random
error per 100 bases) and report the effects.
Project option 3
We have around 1 million paired-end RNA-Seq sequences generated from total RNA in two human fecal samples
(500k sequences each). Data available here:
https://www.dropbox.com/s/kxte655hgckwdon/Final-project-option-3-trimmed-fasta.zip?dl=1. The majority of the
sequences are ribosomal RNA. Many are from Bacteria. The goals are to identify and characterize both known and
possibly novel species in the data. You can try to follow these steps, or you can try another way to analyze the data,
but please use at least two of the techniques that we have studied in class. Some suggested analyses. Assembly is
highly recommended as the first step:
1. Use an existing assembly tool (e.g. SPAdes assembler or Meta-SPAdes) to assemble the reads into contigs.
Paired-end reads will be extremely important to help resolve ambiguities. If possible, retain only
high-confidence contigs for the next steps.
2. Use an existing alignment tool (BURST, BWA, Bowtie2) to identify sequences that match known ribosomal
sequences (See file “SILVA_123_SSURef_tax_silva_trunc.fasta.gz” here:
https://www.arb-silva.de/no_cache/download/archive/release_132/Exports/). Report the species that were
identified.
3. Choose a subset of sequences that do not match the database (say at 95% or above), and search them
against all data in BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) to determine if parts of them do match other
ribosomes. Perform a global multiple sequence alignment of these in SeaView
(https://doua.prabi.fr/software/seaview) along with a subset of the known sequences to visualize what regions
are or are conserved in the known and unknown ribosomal sequences.
4. Choose a subset of sequences that do match the database and a subset that do not, and use RNA secondary
structure prediction software to make 2D visualizations of the predicted secondary structure for the
representatives of known and novel species for comparison. Do the “known” and “novel” species have the
same secondary structure over sections of the sequences that should align?
Deliverables
Write-up, due by 10PM Wednesday 12/11 (90%). Please submit a description of your project formatted as
a short research article of 1000-1200 words, containing the following:
1. Abstract/Summary paragraph (200 words)
Follow the structure of the example Nature summary paragraph here:
https://cbs.umn.edu/sites/cbs.umn.edu/files/public/downloads/Annotated_Nature_abstract.pdf
2. Summary of previous findings (200 words)
Describe the previous analysis and findings without critique.
3. Results (300-500 words)
In 1-2 paragraphs, describe what you did and what you found. It is important to consider and discuss
alternative approaches that you could have used (or tried to use) and why your eventual choices were
justified.
4. Conclusion (100 words)
Restate the purpose of your re-analysis. Summarize your findings. Briefly describe future work.
5. Methods (100-200 words)
Describe your analysis with enough detail that it could be reproduced by another researcher.
6. Figures (2-3 figures)
Include two or more figures supporting your findings.
7. Acknowledgements (Not graded)
Describe precisely what parts of the project were contributed by each member of the group.
Presentation, due in class on Monday 12/9 (10%). Present a 5-minute presentation on your project. Each
member of the group should participate.