## Description

## 1 Overview

For this project, you will develop an algorithm to compute the maximum alignment between a target

sequence and any possible combination of two given subsequences. Designing and implementing

this solution will require you to model the problem using dynamic programming, then understand

and implement your model.

You are only allowed to consult the class slides, the textbook, the TAs, and the professor. In

particular, you are not allowed to use the Internet. This is a group project. The only people you

can work with on this project are your group members. This policy is strictly enforced.

In addition to the group submission, you will also evaluate your teammates’ cooperation and

contribution. These evaluations will form a major part of your grade on this project, so be sure

that you respond to messages promptly, communicate effectively, and contribute substantially to

your group’s solution.

Details for your team evaluations are in Section 6.2. You will submit the

peer evaluations to another assignment on Canvas, labelled “Project 2 (individual).”

A word of warning: this project is team-based, but it is quite extensive and a nontrivial task.

You are highly encouraged to start working on (and start asking questions about) this project early;

teams who wait to start until the week before the due date may find themselves unable to complete

it in time.

## 2 Problem Description

In this problem, you are given three arrays, seq1, seq2, and target, and the solution to this problem

will be the maximum alignment score s of target and a sequence soln, where soln is formed by

combining the values in seq1 and seq2 in some order (i.e., seq1 and seq2 are subsequences of soln).

Moreover, s will be the maximum alignment score of target with any combination of seq1 and seq2.

For this project, we will calculate alignment scores as the unnormalized cosine similarity. In

general, we compute cosine similarity of two normalized vectors or sequences ~a and ~b as a sumproduct (or dot product):

h~a,~bi =

Xn

i=1

ai

· bi

,

where n is the length of ~a and ~b, and ai and bi represent element i in ~a and ~b, respectively.

This

score is only referred to as cosine similarity when the sequences are normalized; that is, they have

the property that h~a, ~ai = 1; however, we will calculate the alignment score using the dot product

1

even if the sequences involved are not normalized. That said, you may assume that the input

sequences you are given for this problem will be designed so that target and any combination of

seq1 and seq2 will be normalized.

3 Worked example

Consider the problem with seq1 = (1, 2, 7), seq2 = (1, 3, 6), and target = (1, 2, 1, 3, 7, 6). Note:

after working through this example, we will discuss what a normalized version of this problem

would look like.

The solution to this problem will be a score of 100, achieved by the sequence (1, 2, 1, 3, 7, 6)

(seq1 and seq2 highlighted in red and blue, respectively). If we compute the dot product of this

sequence with target, we see that it has an alignment score of:

h(1, 2, 1, 3, 7, 6),(1, 2, 1, 3, 7, 6)i =

X

6

i=1

ai

· bi

= 1(1) + 2(2) + 1(1) + 3(3) + 7(7) + 6(6)

= 1 + 4 + 1 + 9 + 49 + 36

= 100

If we compare this score to the alignment score for all of the other unique sequences we could

produce using seq1 and seq2, shown in the table below, we see that this is the highest possible

alignment score with target:

Possible sequence Alignment score

1, 2, 7, 1, 3, 6 72

1, 2, 1, 7, 3, 6 84

1, 2, 1, 3, 7, 6 100

1, 2, 1, 3, 6, 7 99

1, 1, 2, 7, 3, 6 83

1, 1, 2, 3, 7, 6 99

1, 1, 2, 3, 6, 7 98

1, 1, 3, 2, 7, 6 97

1, 1, 3, 2, 6, 7 96

1, 1, 3, 6, 2, 7 80

1, 3, 1, 2, 7, 6 99

1, 3, 1, 2, 6, 7 98

1, 3, 1, 6, 2, 7 82

1, 3, 6, 1, 2, 7 72

Note: there could be more than one sequence that has the maximum alignment score.

As an example of one of the calculations in the table above, the alignment score of the sequence

in the first row with target ((1, 2, 7, 1, 3, 6) with (1, 2, 1, 3, 7, 6)) is:

2

h(1, 2, 7, 1, 3, 6),(1, 2, 1, 3, 7, 6)i = 1(1) + 2(2) + 7(1) + 1(3) + 3(7) + 6(6)

= 1 + 4 + 7 + 3 + 21 + 36

= 72

## Normalization

Cosine similarity should be a value between −1 and 1 indicating how closely aligned two sequences

are, but when the sequences are not normalized, this score can be much larger than 1 or smaller

than −1. A normalized version of this problem might look like seq1 = (0.1, 0.2, 0.7), seq2 = (0.1,

0.3, 0.6), and target = (0.1, 0.2, 0.1, 0.3, 0.7, 0.6).

Note that the cosine similarity of target with itself is 0.1(0.1) + 0.2(0.2) + 0.1(0.1) + 0.3(0.3)

+ 0.7(0.7) + 0.6(0.6) = 0.01 + 0.04 + 0.01 + 0.09 + 0.49 + 0.36 = 1, and this is also the cosine

similarity of the solution with target. Meanwhile, the cosine similarity of target with the incorrect

answer (0.1, 0.2, 0.7, 0.1, 0.3, 0.6) is 0.1(0.1) + 0.2(0.2) + 0.7(0.1) + 0.1(0.3) + 0.3(0.7) + 0.6(0.6)

= 0.01 + 0.04 + 0.07 + 0.03 + 0.21 + 0.36 = 0.72.

You may notice that scaling the input vectors by a constant factor also scales the dot product.

Because of this property and the fact that all possible answers are scaled by the same factor, we

get the same sequence for the original problem above as we do for the scaled problem. Thus,

even though the alignment scores we calculated in the unscaled problem are not technically cosine

similarity, which must always be between −1 and 1, we still identify the same combination of seq1

and seq2 to solve the problem.

All of the test problems you will be given for this project will be normalized, and your final

answers should be real numbers in the range −1 to 1 (up to round-off errors). That said, the

subproblems you evaluate recursively will probably not have this property, and you should not

try to normalize the inputs to the problem in your algorithm.

#### 4 Project report

In your project report, you should include brief answers to 7 questions. Note that you must use

dynamic programming to solve this problem; other solutions will not receive substantial credit.

1. How would you break down the problem of calculating the maximum alignment score for a

combination of two sequences seq1 and seq2 (length n and m, respectively) with the sequence

target (length n+m) into one or more smaller instances of this problem? Your answer should

include what subproblems you are evaluating, as well as how you are using the answers to

these subproblems to solve the original problem.

2. What are the base cases of this recurrence?

3. What data structure would you use to recognize repeated problems? You should describe

both the abstract data structure as well as its implementation.

4. Give pseudocode for a memoized dynamic programming algorithm to find the maximum

alignment score when combining seq1 and seq2 to align with target.

5. Give pseudocode for an iterative algorithm to find the maximum alignment score when combining seq1 and seq2 to align with target. This algorithm does not need to have a reduced

space complexity relative to the memoized solution.

3

6. Can the space complexity of the iterative algorithm be improved relative to the memoized

algorithm? Justify your answer.

7. Give pseudocode for an algorithm that identifies the sequence which will achieve the maximum

alignment score. If there is more than one sequence that yields the maximum score, you may

return any such sequence. Your algorithm may be iterative or recursive.

5 Coding your solutions

In addition to the report, you should implement a dynamic programming algorithm that can

compute a maximum cosine similarity for a combination of two subsequences with a target. Your

code may be iterative or recursive, but it must be a dynamic programming algorithm.

You may

code your solution in C++ or Java, but it must compile and run in a Linux environment. If you

are using C++ and your code cannot be compiled using the command

g++ -o align *.cpp

you should include a Makefile that is capable of compiling the code via the make command.

If you choose to implement your code in Java, you should submit an executable jar file with

your source. In either case, your source code may be split into any number of files.

Your code will not need to handle invalid input (e.g., non-normalized sequences or problems

where the length of target doesn’t equal the length of seq1 and seq2 combined).

5.1 Input format

Your program should read its input from the file input.txt, in the following format. The first line

of the file has two positive integers n and m specifying the lengths of seq1 and seq2, respectively.

The second line will list the n real numbers (in the range −1 to 1) that make up seq1, the third

line will have the m real numbers that make up seq2, and the fourth line will have the n + m real

numbers that make up target. All numbers on a line will be separated by spaces.

5.2 Output

Your program should write its output to the file output.txt. The first line of your output should

contain the maximum cosine similarity for seq1, seq2, and target. The second line of your output should list the solution sequence (n + m) that achieves this maximum cosine similarity, with

consecutive values separated by spaces. (seq1 and seq2 should be subsequences of this solution

sequence.)

#### 6 Submission

Your submission for this project will be in two parts, the group submission and your individual

peer evaluations.

4

6.1 Group submission

The submission for your group should be a zip archive containing 1) your report (described in

Section 4) as a PDF document, and 2) your code (described in Section 5). If your code requires

more than a simple command to compile and run then you must also provide a Makefile and/or

shell script. You should submit this zip archive to the “Project 2 (group)” assignment on Canvas.

Be aware that your project report and code will be checked for plagiarism.

6.2 Teamwork evaluation

The second part of your project grade will be determined by a peer evaluation. Your peer evaluation

should be a text file that includes 1) the names of all of your teammates (including yourself), 2) the

team member responsibilities, 3) whether or not your teammates were cooperative, 4) a numeric

rating indicating the proportional amount of effort each of you put into the project, and 5) other

issues we should be aware of when evaluating your and your teammates’ relative contribution. The

numeric ratings must be integers that sum to 30.

It’s important that you be honest in your evaluation of your peers. In addition to letting your

team members whether they do (or do not) need to work on their teamwork and communication

skills, we will also evaluate your group submission in light of your team evaluations. For example,

a team in which one member refused to contribute would be assessed differently than a team with

three functioning members.

You should submit your peer evaluation to the “Project 2 (individual)” assignment on Canvas.

7 Grading

Report 40 points

Question 1, 4, and 5 8 points

Questions 2, 3, 6, and 7 4 each

Code 30 points

Compiles 5

Uses correct input and output format 5

Computes correct alignment score 10

Computes correct sequence 5

Good coding style 5

Teamwork 30 points

Note that if your algorithm is inefficient, you may lose points for both your pseudocode and

your submission. Also, in extreme cases, the teamwork portion of your grade may become negative

or greater than 30. In particular, if you do not contribute to your group’s solution at all, you can

expect to receive an overall grade of 0 on the project.

5