Description
In this assignment, you will develop a few functions for DNA analysis. These
functions will calculate common measures of DNA similarity, such as the Hamming
distance and the Best Match between two DNA sequences. Each of the DNA
sequences you need for this assignment can be copied from this write-up and stored
in a variable in your program. There are sample DNA sequences for a mouse, a human,
and an unknown species. Your mission is to determine the identity of the unknown
by comparing it to the human and the mouse. If the unknown species is more similar
to the human than it is to the mouse, then you can conclude that the unknown
sequence is from a human. Otherwise, you can conclude that the unknown is from a
mouse.
CSCI 1310 – Assignment 4 Due Friday, Feb 24, by 8:00 am
Your assignment needs to include at least the following functions for full credit:
void calculateSimilarity(double *similarity, string DNA1, string DNA2);
string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2);
What to submit?
The assignment is set up in Moodle. There are three code-runner questions. The first
question takes the full code (directives, prototypes, definitions, and main) and tests
it for correctness. The other two questions, which allow for partial credit, are for
the functions calculateSimilarity and calculateBestMatch.
It is your responsibility to copy these two functions from the full code into the
individual questions. If you need any helper functions, include them in the space
provided, above the function that calls them. If you are not sure, please ask the
instructors before the deadline. Don’t forget that after the deadline, you have the
right to request an interview with your TA to review your work.
Hamming distance and similarity between two strings
Hamming distance is one of the most common ways to measure the similarity
between two strings of the same length. Hamming distance is a position-by-position
comparison that counts the number of positions at which the corresponding
characters in the two strings differ. Two strings with a small Hamming distance
are more similar than two strings with a larger Hamming distance.
Example:
first string = “ACCT”
second string = “ACCG”
A C C T
| | | *
A C C G
In this example, there are three matching characters and one mismatch, so the
Hamming distance is one.
The similarity score for two sequences is then calculated as follows:
similarity_score = (string length - Hamming distance) / string length
similarity_score = (4 - 1) / 4 = 3/4 = 0.75
Two sequences with a high similarity score are more similar than two sequences
with a lower similarity score.
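The calculation above can be sketched in C++ as follows. This is a minimal illustration, not the required solution: the helper names hammingDistance and similarityScore are hypothetical, and returning 0.0 for empty strings is an assumption made only to avoid dividing by zero.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the positions at which two strings differ.
// Assumes a.length() == b.length(); the caller must check this first.
int hammingDistance(const std::string& a, const std::string& b) {
    int distance = 0;
    for (std::size_t i = 0; i < a.length(); ++i) {
        if (a[i] != b[i]) {
            ++distance;
        }
    }
    return distance;
}

// Hypothetical helper: (string length - Hamming distance) / string length.
// Assumption: empty strings score 0.0 so that we never divide by zero.
double similarityScore(const std::string& a, const std::string& b) {
    if (a.empty()) {
        return 0.0;
    }
    double length = static_cast<double>(a.length());
    return (length - hammingDistance(a, b)) / length;
}
```

With the strings from the example, hammingDistance("ACCT", "ACCG") gives 1 and similarityScore("ACCT", "ACCG") gives 0.75. The cast to double matters: dividing two integers would truncate 3/4 to 0.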
The Best Match algorithm extends the Hamming distance calculation by finding the
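best overlap of the two strings. For any two strings, calculate the Hamming distance
between the shorter string (the substring) and the same-length section of the longer
string starting at each position of the longer string. The best match is the section
with the highest similarity score.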
calculateSimilarity(double*, string, string)
The calculateSimilarity() function should take three arguments: a double pointer
that stores the similarity between the strings, followed by the two strings to
compare. You can declare a double pointer just as you would an integer pointer:
double x;
double *dPtr = &x;
The function should calculate the similarity score for the two strings and store
that score in the double that the pointer points to.
Note: when you test calculateSimilarity(), first pass in short strings whose
similarity you can calculate by hand, then move on to real data. That will help you
identify errors in your algorithm.
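One possible shape for this function is sketched below, using the prototype given in this write-up. It is an illustration under stated assumptions rather than the expected solution: in particular, writing 0.0 when the strings are empty or have different lengths is an assumption, since the write-up defines Hamming distance only for strings of the same length.

```cpp
#include <cstddef>
#include <string>
using namespace std;

void calculateSimilarity(double *similarity, string DNA1, string DNA2) {
    // Assumption: empty strings or strings of different lengths score 0.0.
    if (DNA1.empty() || DNA1.length() != DNA2.length()) {
        *similarity = 0.0;
        return;
    }

    // Count the mismatched positions (the Hamming distance).
    int mismatches = 0;
    for (size_t i = 0; i < DNA1.length(); ++i) {
        if (DNA1[i] != DNA2[i]) {
            ++mismatches;
        }
    }

    // Write the result through the pointer; the function returns nothing.
    double length = static_cast<double>(DNA1.length());
    *similarity = (length - mismatches) / length;
}
```

A hand-checkable test would be to declare double score; call calculateSimilarity(&score, "ACCT", "ACCG"); and confirm that score is 0.75.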
calculateBestMatch(double*, int*, string, string)
The calculateBestMatch() function should take four arguments: a double pointer, an
integer pointer, and two strings. The double pointer stores the best similarity
score, and the integer pointer stores the index in the string where the best match
starts. The two string arguments are the two strings to compare: the first string is
the string you are searching, and the second string is the substring to search for.
The function returns a string: the section of the mouse or human DNA sequence that
best matches the user-entered sequence, that is, the one with the highest similarity
score.
Note: you will need to be aware of the end of each string to make sure that you don’t
loop off the end of either string.
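The search described above might look like the following sketch, which uses the prototype from this write-up. Several choices here are assumptions, not requirements: setting the index to -1 and returning an empty string when nothing matches, keeping the earliest section when scores tie, and treating a substring longer than the searched string as having no match.

```cpp
#include <cstddef>
#include <string>
using namespace std;

string calculateBestMatch(double *bestscore, int *index, string DNA1, string DNA2) {
    *bestscore = 0.0;
    *index = -1;  // Assumption: -1 means no matching section was found.
    string bestMatch = "";

    // Assumption: an empty or too-long substring has no valid section to compare.
    if (DNA2.empty() || DNA2.length() > DNA1.length()) {
        return bestMatch;
    }

    size_t windowLength = DNA2.length();
    // The loop condition stops before a section would run past the end of DNA1.
    for (size_t start = 0; start + windowLength <= DNA1.length(); ++start) {
        int matches = 0;
        for (size_t i = 0; i < windowLength; ++i) {
            if (DNA1[start + i] == DNA2[i]) {
                ++matches;
            }
        }
        // matches / length equals (length - Hamming distance) / length.
        double score = static_cast<double>(matches) / windowLength;
        // Strict '>' keeps the earliest section on ties (an assumption).
        if (score > *bestscore) {
            *bestscore = score;
            *index = static_cast<int>(start);
            bestMatch = DNA1.substr(start, windowLength);
        }
    }
    return bestMatch;
}
```

For example, searching "AACCTTGG" for "CCTA" compares five sections; the best is "CCTT" at index 2 with a score of 0.75.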
Functionality in main()
In your main() function, you will need to call the other functions you have written.
Use the mouse and human DNA samples shown below in this write-up; the unknown DNA
sample below is just for testing your program. Your first task is to ask the user to
enter the unknown DNA sequence and store it in a variable. You should output the
results of the function calls in the main() function. After calling
calculateSimilarity(), you need to output the identity of the unknown DNA
sequence.
if the unknownDNA is more similar to the humanDNA
print “Human”
else if the unknownDNA is more similar to the mouseDNA
print “Mouse”
else unknownDNA is equally similar to both mouse and human
print “Identity cannot be determined.”
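The decision above could be sketched in C++ as a small function. The name identifyUnknown and its two parameters are hypothetical; the sketch assumes the two scores have already been computed by calling calculateSimilarity() once with the human DNA and once with the mouse DNA.

```cpp
#include <string>

// Hypothetical helper: picks the message to print from two similarity scores.
// humanScore: similarity between the unknown DNA and the human DNA.
// mouseScore: similarity between the unknown DNA and the mouse DNA.
std::string identifyUnknown(double humanScore, double mouseScore) {
    if (humanScore > mouseScore) {
        return "Human";
    } else if (mouseScore > humanScore) {
        return "Mouse";
    }
    // Neither score is larger, so the two are equal.
    return "Identity cannot be determined.";
}
```

In main(), the result could then be printed with cout << identifyUnknown(humanScore, mouseScore) << endl;. Testing the two "greater than" cases first leaves equality as the only remaining case, so no direct equality comparison of doubles is needed.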
Before calling calculateBestMatch(), you need to prompt the user for a search string.
To compare the search string to the mouse DNA and the human DNA, you would do
something like the following:
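cout << "Enter a substring: ";
getline(cin, subStr);
// Each call returns the best-matching sequence and overwrites
// similarityscore and index, so use the results before the next call.
calculateBestMatch(&similarityscore, &index, mouseDNA, subStr);
calculateBestMatch(&similarityscore, &index, humanDNA, subStr);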
After calling calculateBestMatch(), you need to display the DNA sequence that is the
best match as well as the best similarity score. If there isn’t a match of any character,
print “Match not found.”
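As a rough sketch of the display step, the helper below builds the text to print from the values that calculateBestMatch() produces. The function name describeBestMatch and the exact wording of the two output labels are assumptions; only the "Match not found." message comes from this write-up, so check the required output format before relying on it.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats the result of one calculateBestMatch() call.
// A best score of 0 means that no character matched at any position.
std::string describeBestMatch(const std::string& match, double bestScore) {
    if (bestScore <= 0.0) {
        return "Match not found.";
    }
    std::ostringstream out;
    // Assumption: these labels are illustrative, not a required format.
    out << "Best match: " << match << "\n"
        << "Similarity score: " << bestScore;
    return out.str();
}
```

In main(), this could be used as cout << describeBestMatch(bestMatch, similarityscore) << endl; right after each call, where bestMatch is a variable holding the returned string.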
Here is the skeleton (high-level code) that you need to follow:
#include