## Description

### Introduction

Welcome to your second programming assignment of the Algorithms on Strings class! In this programming assignment, you will practice implementing the Burrows–Wheeler transform and suffix arrays. Recall that, starting from this programming assignment, the grader will show you only the first few tests (see questions 6.4 and 6.5 in the FAQ section).

### Learning Outcomes

Upon completing this programming assignment you will be able to:

1. compute the Burrows–Wheeler transform (BWT) of a string;
2. compute the inverse of BWT;
3. use BWT for pattern matching;
4. construct the suffix array of a string.

### Passing Criteria: 2 out of 4

Passing this programming assignment requires passing at least 2 out of 4 code problems from this assignment. In turn, passing a code problem requires implementing a solution that passes all the tests for this problem in the grader and does so under the time and memory limits specified in the problem statement.

### Contents

1. Problem: Construct the Burrows–Wheeler Transform of a String
2. Problem: Reconstruct a String from its Burrows–Wheeler Transform
3. Problem: Implement BetterBWMatching
4. Problem: Construct the Suffix Array of a String
5. General Instructions and Recommendations on Solving Algorithmic Problems
    - 5.1 Reading the Problem Statement
    - 5.2 Designing an Algorithm
    - 5.3 Implementing Your Algorithm
    - 5.4 Compiling Your Program
    - 5.5 Testing Your Program
    - 5.6 Submitting Your Program to the Grading System
    - 5.7 Debugging and Stress Testing Your Program
6. Frequently Asked Questions
    - 6.1 I submit the program, but nothing happens. Why?
    - 6.2 I submit the solution only for one problem, but all the problems in the assignment are graded. Why?
    - 6.3 What are the possible grading outcomes, and how to read them?
    - 6.4 How to understand why my program fails and to fix it?
    - 6.5 Why do you hide the test on which my program fails?
    - 6.6 My solution does not pass the tests? May I post it in the forum and ask for a help?
    - 6.7 My implementation always fails in the grader, though I already tested and stress tested it a lot. Would not it be better if you give me a solution to this problem or at least the test cases that you use? I will then be able to fix my code and will learn how to avoid making mistakes. Otherwise, I do not feel that I learn anything from solving this problem. I am just stuck.

### 1 Problem: Construct the Burrows–Wheeler Transform of a String

#### Problem Introduction

Our goal is to further improve on the memory requirements of the suffix array. Given a string Text, form all possible cyclic rotations of Text; a cyclic rotation is defined by chopping off a suffix from the end of Text and appending this suffix to the beginning of Text. Next, similarly to suffix arrays, order all the cyclic rotations of Text lexicographically to form a |Text| × |Text| matrix of symbols that we call the Burrows–Wheeler matrix and denote by M(Text).

Note that the first column of M(Text) contains the symbols of Text ordered lexicographically. In turn, the second column of M(Text) contains the second symbols of all cyclic rotations of Text, and so it too represents a (different) rearrangement of symbols from Text. The same reasoning applies to show that any column of M(Text) is some rearrangement of the symbols of Text. We are interested in the last column of M(Text), called the Burrows–Wheeler transform of Text, or BWT(Text).

#### Problem Description

**Task.** Construct the Burrows–Wheeler transform of a string.

**Input Format.** A string Text ending with a “$” symbol.

**Constraints.** 1 ≤ |Text| ≤ 1000; except for the last symbol, Text contains symbols A, C, G, T only.

**Output Format.** BWT(Text).

**Time Limits.**

| language | C | C++ | Java | Python | C# | Haskell | JavaScript | Ruby | Scala |
|---|---|---|---|---|---|---|---|---|---|
| time in seconds | 0.5 | 0.5 | 0.75 | 0.5 | 0.75 | 1 | 2.5 | 2.5 | 1.5 |

**Memory Limit.** 512MB.

**Sample 1.**

Input: AA$

Output: AA$

Explanation:

$$
M(\text{Text}) = \begin{bmatrix}
\$ & A & A \\
A & \$ & A \\
A & A & \$
\end{bmatrix}
$$

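**Sample 2.**

Input: ACACACAC$

Output: CCCC$AAAA

Explanation:

$$
M(\text{Text}) = \begin{bmatrix}
\$ & A & C & A & C & A & C & A & C \\
A & C & \$ & A & C & A & C & A & C \\
A & C & A & C & \$ & A & C & A & C \\
A & C & A & C & A & C & \$ & A & C \\
A & C & A & C & A & C & A & C & \$ \\
C & \$ & A & C & A & C & A & C & A \\
C & A & C & \$ & A & C & A & C & A \\
C & A & C & A & C & \$ & A & C & A \\
C & A & C & A & C & A & C & \$ & A
\end{bmatrix}
$$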

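**Sample 3.**

Input: AGACATA$

Output: ATG$CAAA

Explanation:

$$
M(\text{Text}) = \begin{bmatrix}
\$ & A & G & A & C & A & T & A \\
A & \$ & A & G & A & C & A & T \\
A & C & A & T & A & \$ & A & G \\
A & G & A & C & A & T & A & \$ \\
A & T & A & \$ & A & G & A & C \\
C & A & T & A & \$ & A & G & A \\
G & A & C & A & T & A & \$ & A \\
T & A & \$ & A & G & A & C & A
\end{bmatrix}
$$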

#### Starter Files

The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch.

Filename: bwt

#### What To Do

To solve this problem, it is enough to construct BWT(Text) by sorting all cyclic rotations of Text and reading off the last symbol of each sorted rotation.
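The approach above can be sketched in a few lines of Python. This is a minimal illustration rather than the starter file's code: the function name `bwt` is an assumption, and the sketch relies on “$” sorting before A, C, G, T, which holds for ASCII comparison.

```python
def bwt(text: str) -> str:
    """Return BWT(text) by sorting all cyclic rotations of text.

    Assumes text ends with a single "$", which compares smaller than
    the letters A, C, G, T under ASCII ordering.
    """
    n = len(text)
    # Rotation i moves the first i symbols of text to the end.
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("AGACATA$"))  # prints ATG$CAAA, as in Sample 3
```

With |Text| at most 1000, building and sorting all rotations explicitly takes about a million symbols of memory, so this quadratic-space sketch should be adequate for the stated constraints.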

#### Need Help?

Ask a question or see the questions asked by other learners at this forum thread.

### 2 Problem: Reconstruct a String from its Burrows–Wheeler Transform

#### Problem Introduction

In the previous problem, we introduced the Burrows–Wheeler transform of a string Text. In this problem, we give you the opportunity to reverse this transform.

#### Problem Description

**Task.** Reconstruct a string from its Burrows–Wheeler transform.

**Input Format.** A string Transform with a single “$” sign.

**Constraints.** 1 ≤ |Transform| ≤ 1000000; except for the last symbol, Text contains symbols A, C, G, T only.

**Output Format.** The string Text such that BWT(Text) = Transform. (There exists a unique such string.)

**Time Limits.**

| language | C | C++ | Java | Python | C# | Haskell | JavaScript | Ruby | Scala |
|---|---|---|---|---|---|---|---|---|---|
| time in seconds | 2 | 2 | 3 | 10 | 3 | 4 | 10 | 10 | 6 |

**Memory Limit.** 512MB.

**Sample 1.**

Input: AC$A

Output: ACA$

Explanation:

$$
M(\text{Text}) = \begin{bmatrix}
\$ & A & C & A \\
A & \$ & A & C \\
A & C & A & \$ \\
C & A & \$ & A
\end{bmatrix}
$$

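**Sample 2.**

Input: AGGGAA$

Output: GAGAGA$

Explanation:

$$
M(\text{Text}) = \begin{bmatrix}
\$ & G & A & G & A & G & A \\
A & \$ & G & A & G & A & G \\
A & G & A & \$ & G & A & G \\
A & G & A & G & A & \$ & G \\
G & A & \$ & G & A & G & A \\
G & A & G & A & \$ & G & A \\
G & A & G & A & G & A & \$
\end{bmatrix}
$$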

#### Starter Files

The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch.

Filename: bwtinverse

#### What To Do

To solve this problem, it is enough to carefully implement the corresponding algorithm covered in the lectures.
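The lecture algorithm is not reproduced in this statement, so the following is only a sketch of one common way to invert the transform, assumed here for illustration: build a Last-to-First mapping with a stable sort, then walk backwards through Text starting from the row that begins with “$”. The name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(transform: str) -> str:
    """Reconstruct Text from BWT(Text) using a Last-to-First mapping.

    Assumes transform contains exactly one "$", which sorts before
    A, C, G, T under ASCII ordering.
    """
    n = len(transform)
    # Stable sort of last-column positions by symbol: order[j] is the
    # last-column position of the symbol found in row j of the first column.
    order = sorted(range(n), key=lambda i: transform[i])
    last_to_first = [0] * n
    for first_row, last_row in enumerate(order):
        last_to_first[last_row] = first_row

    # Row 0 starts with "$", so its last symbol is the one preceding "$"
    # in Text. Following the mapping yields Text from right to left.
    reversed_text = ["$"]
    row = 0
    for _ in range(n - 1):
        reversed_text.append(transform[row])
        row = last_to_first[row]
    return "".join(reversed(reversed_text))


if __name__ == "__main__":
    print(inverse_bwt("AGGGAA$"))  # prints GAGAGA$, as in Sample 2
```

The stable sort is what keeps the k-th occurrence of a symbol in the last column matched to its k-th occurrence in the first column. For inputs near the upper limit, a counting sort over the five-symbol alphabet may be a faster way to build the same mapping.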

#### Need Help?

Ask a question or see the questions asked by other learners at this forum thread.

### 3 Problem: Implement BetterBWMatching

#### Problem Introduction

The algorithm BWMatching counts the total number of matches of Pattern in Text, where the only information that we are given is FirstColumn and LastColumn = BWT(Text) in addition to the Last-to-First mapping. The pointers top and bottom are updated by the green lines in the following pseudocode (the four assignments to topIndex, bottomIndex, top, and bottom).

    BWMatching(FirstColumn, LastColumn, Pattern, LastToFirst):
        top ← 0
        bottom ← |LastColumn| − 1
        while top ≤ bottom:
            if Pattern is nonempty:
                symbol ← last letter in Pattern
                remove last letter from Pattern
                if positions from top to bottom in LastColumn contain an occurrence of symbol:
                    topIndex ← first position of symbol among positions from top to bottom in LastColumn
                    bottomIndex ← last position of symbol among positions from top to bottom in LastColumn
                    top ← LastToFirst(topIndex)
                    bottom ← LastToFirst(bottomIndex)
                else:
                    return 0
            else:
                return bottom − top + 1

The Last-to-First array, denoted LastToFirst(i), answers the following question: given a symbol at position i in LastColumn, what is its position in FirstColumn? For example, if Text = panamabananas$, BWT(Text) = smnpbnnaaaaa$a, and FirstCol(Text) = $aaaaaabmnnnps, then we can rewrite BWT(Text) = s₁m₁n₁p₁b₁n₂n₃a₁a₂a₃a₄a₅$₁a₆ and FirstCol(Text) = $₁a₁a₂a₃a₄a₅a₆b₁m₁n₁n₂n₃p₁s₁, and now we see that a₃ in BWT(Text) corresponds to a₃ in FirstCol(Text).

If you implement BWMatching, you will probably find the algorithm to be slow. The reason for its sluggishness is that updating the pointers top and bottom is time-intensive, since it requires examining every symbol in LastColumn between top and bottom at each step. To improve BWMatching, we introduce a function Count_symbol(i, LastColumn), which returns the number of occurrences of symbol in the first i positions of LastColumn. For example,

Count_“n”(10, “smnpbnnaaaaa$a”) = 3 and Count_“a”(4, “smnpbnnaaaaa$a”) = 0.

The green lines from BWMatching can be compactly described without the Last-to-First mapping by the following two lines:

    top ← position of symbol with rank Count_symbol(top, LastColumn) + 1 in FirstColumn
    bottom ← position of symbol with rank Count_symbol(bottom + 1, LastColumn) in FirstColumn

Define FirstOccurrence(symbol) as the first position of symbol in FirstColumn. If Text = “panamabananas$”, then FirstColumn is “$aaaaaabmnnnps”, and the array holding all values of FirstOccurrence is [0, 1, 7, 8, 9, 12, 13]. For DNA strings of any length, the array FirstOccurrence contains only five elements. The two lines of pseudocode from the previous step can now be rewritten as follows:

    top ← FirstOccurrence(symbol) + Count_symbol(top, LastColumn)
    bottom ← FirstOccurrence(symbol) + Count_symbol(bottom + 1, LastColumn) − 1
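To see why FirstOccurrence and Count can stand in for the Last-to-First mapping, the sketch below recomputes the quantities from the panamabananas$ example and checks that LastToFirst(i) = FirstOccurrence(c) + Count_c(i, LastColumn) for every position i holding symbol c. The helper names are illustrative assumptions, and `count` is deliberately naive.

```python
from collections import Counter


def first_occurrence(last_column: str) -> dict:
    """Map each symbol to its first position in the sorted first column."""
    totals = Counter(last_column)
    result, position = {}, 0
    for symbol in sorted(totals):
        result[symbol] = position
        position += totals[symbol]
    return result


def count(symbol: str, i: int, last_column: str) -> int:
    """Occurrences of symbol in the first i positions of last_column."""
    return last_column[:i].count(symbol)


def last_to_first(last_column: str) -> list:
    """Last-to-First mapping obtained from a stable sort of positions."""
    order = sorted(range(len(last_column)), key=lambda i: last_column[i])
    mapping = [0] * len(last_column)
    for first_row, last_row in enumerate(order):
        mapping[last_row] = first_row
    return mapping


if __name__ == "__main__":
    last_column = "smnpbnnaaaaa$a"
    first = first_occurrence(last_column)
    print([first[s] for s in sorted(first)])  # [0, 1, 7, 8, 9, 12, 13]
    print(count("n", 10, last_column), count("a", 4, last_column))  # 3 0
    mapping = last_to_first(last_column)
    print(all(mapping[i] == first[c] + count(c, i, last_column)
              for i, c in enumerate(last_column)))  # True
```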


In the process of simplifying the green lines of pseudocode from BWMatching, we have also eliminated the need for both FirstColumn and LastToFirst, resulting in a more efficient algorithm called BetterBWMatching.

    BetterBWMatching(FirstOccurrence, LastColumn, Pattern, Count):
        top ← 0
        bottom ← |LastColumn| − 1
        while top ≤ bottom:
            if Pattern is nonempty:
                symbol ← last letter in Pattern
                remove last letter from Pattern
                if positions from top to bottom in LastColumn contain an occurrence of symbol:
                    top ← FirstOccurrence(symbol) + Count_symbol(top, LastColumn)
                    bottom ← FirstOccurrence(symbol) + Count_symbol(bottom + 1, LastColumn) − 1
                else:
                    return 0
            else:
                return bottom − top + 1
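The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not the starter code: the names `preprocess` and `better_bw_matching` and the choice to store full prefix-count arrays are assumptions. The test "positions from top to bottom contain symbol" is expressed as a comparison of two prefix counts.

```python
def preprocess(bwt: str):
    """Build FirstOccurrence and prefix-count tables for a BWT string.

    count[c][i] is the number of occurrences of c in bwt[:i].
    """
    symbols = sorted(set(bwt))
    count = {c: [0] * (len(bwt) + 1) for c in symbols}
    for i, symbol in enumerate(bwt):
        for c in symbols:
            count[c][i + 1] = count[c][i] + (c == symbol)
    first_occurrence, position = {}, 0
    for c in symbols:
        first_occurrence[c] = position
        position += count[c][len(bwt)]
    return first_occurrence, count


def better_bw_matching(first_occurrence, count, bwt_length, pattern):
    """Count occurrences of pattern in Text using only the tables."""
    top, bottom = 0, bwt_length - 1
    for symbol in reversed(pattern):
        # No occurrence of symbol between top and bottom in LastColumn.
        if symbol not in count or count[symbol][bottom + 1] == count[symbol][top]:
            return 0
        new_top = first_occurrence[symbol] + count[symbol][top]
        bottom = first_occurrence[symbol] + count[symbol][bottom + 1] - 1
        top = new_top
    return bottom - top + 1


if __name__ == "__main__":
    bwt = "ATT$AA"
    first_occurrence, count = preprocess(bwt)
    print([better_bw_matching(first_occurrence, count, len(bwt), p)
           for p in ("ATA", "A")])  # [2, 3], as in Sample 2
```

The tables are built once and reused for every pattern. Full prefix-count arrays are simple but memory-hungry in Python for inputs near the upper limit; a more compact array type or sparser checkpoints might be needed in practice.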

#### Problem Description

**Task.** Implement the BetterBWMatching algorithm.

**Input Format.** A string BWT(Text), followed by an integer n and a collection of n strings Patterns = {p₁, …, pₙ} (on one line separated by spaces).

**Constraints.** 1 ≤ |BWT(Text)| ≤ 10⁶; except for the one $ symbol, BWT(Text) contains symbols A, C, G, T only; 1 ≤ n ≤ 5000; for all 1 ≤ i ≤ n, pᵢ is a string over A, C, G, T; 1 ≤ |pᵢ| ≤ 1000.

**Output Format.** A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.

**Time Limits.**

| language | C | C++ | Java | Python | C# | Haskell | JavaScript | Ruby | Scala |
|---|---|---|---|---|---|---|---|---|---|
| time in seconds | 4 | 4 | 6 | 24 | 6 | 8 | 24 | 24 | 12 |

**Memory Limit.** 512MB.

**Sample 1.**

Input: AGGGAA$ 1 GA

Output: 3

Explanation: In this case, Text = GAGAGA$. The pattern GA appears three times in it.

**Sample 2.**

Input: ATT$AA 2 ATA A

Output: 2 3

Explanation: Text = ATATA$ contains two occurrences of ATA and three occurrences of A.

**Sample 3.**

Input: AT$TCTATG 2 TCT TATG

Output: 0 0

Explanation: Text = ATCGTTTA$ does not contain any occurrences of the two given patterns.

#### Starter Files

The starter solutions for this problem read the input data from the standard input, pass the Burrows–Wheeler transform to a preprocessing procedure that precomputes some useful values, then pass each pattern along with the BWT and the precomputed values to a procedure that counts the number of occurrences of the pattern in the text, and then write the result to the standard output. You are supposed to implement these two procedures, which are left blank, if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch.

Filename: bwmatching

#### What To Do

To solve this problem, it is enough to carefully implement the algorithm described in the lectures. However, don’t forget that you need to preprocess the Text only once and then reuse the results. If you repeat the preprocessing for every pattern, it serves no purpose and saves nothing. But if you do the preprocessing once, save the results, and use them when searching for each pattern, you save a lot on each search.

#### Need Help?

Ask a question or see the questions asked by other learners at this forum thread.

### 4 Problem: Construct the Suffix Array of a String

#### Problem Introduction

We saw that suffix trees can be too memory intensive to apply in practice. In 1993, Udi Manber and Gene Myers introduced suffix arrays as a memory-efficient alternative to suffix trees. To construct SuffixArray(Text), we first sort all suffixes of Text lexicographically, assuming that “$” comes first in the alphabet. The suffix array is the list of starting positions of these sorted suffixes. For example,

SuffixArray(“panamabananas$”) = (13, 5, 3, 1, 7, 9, 11, 6, 4, 2, 8, 10, 0, 12)

#### Problem Description

**Task.** Construct the suffix array of a string.

**Input Format.** A string Text ending with a “$” symbol.

**Constraints.** 1 ≤ |Text| ≤ 10⁴; except for the last symbol, Text contains symbols A, C, G, T only.

**Output Format.** SuffixArray(Text), that is, the list of starting positions (0-based) of sorted suffixes separated by spaces.

**Time Limits.**

| language | C | C++ | Java | Python | C# | Haskell | JavaScript | Ruby | Scala |
|---|---|---|---|---|---|---|---|---|---|
| time in seconds | 1 | 1 | 2 | 1 | 1.5 | 2 | 5 | 5 | 4 |

**Memory Limit.** 512MB.

**Sample 1.**

Input: GAC$

Output: 3 1 2 0

Explanation: Sorted suffixes:

    3 $
    1 AC$
    2 C$
    0 GAC$

**Sample 2.**

Input: GAGAGAGA$

Output: 8 7 5 3 1 6 4 2 0

Explanation: Sorted suffixes:

    8 $
    7 A$
    5 AGA$
    3 AGAGA$
    1 AGAGAGA$
    6 GA$
    4 GAGA$
    2 GAGAGA$
    0 GAGAGAGA$

**Sample 3.**

Input: AACGATAGCGGTAGA$

Output: 15 14 0 1 12 6 4 2 8 13 3 7 9 10 11 5

Explanation: Sorted suffixes:

    15 $
    14 A$
     0 AACGATAGCGGTAGA$
     1 ACGATAGCGGTAGA$
    12 AGA$
     6 AGCGGTAGA$
     4 ATAGCGGTAGA$
     2 CGATAGCGGTAGA$
     8 CGGTAGA$
    13 GA$
     3 GATAGCGGTAGA$
     7 GCGGTAGA$
     9 GGTAGA$
    10 GTAGA$
    11 TAGA$
     5 TAGCGGTAGA$

#### Starter Files

The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch.

Filename: suffix_array

#### What To Do

To solve this problem, it is enough to just sort all suffixes of Text.
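Sorting all suffixes can be written as a sort of starting positions keyed by the suffix that begins there. The following is a minimal sketch with an assumed function name, relying on “$” comparing smaller than A, C, G, T under ASCII ordering.

```python
def build_suffix_array(text: str) -> list:
    """Return the starting positions of the suffixes of text in sorted order."""
    # Each position i is ordered by the suffix text[i:]; since text ends
    # with a unique "$", no suffix is a prefix of another.
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    print(*build_suffix_array("GAGAGAGA$"))  # 8 7 5 3 1 6 4 2 0
```

Materializing every suffix costs memory quadratic in |Text|, which is roughly 50 million symbols at |Text| = 10⁴. That should fit in the stated memory limit, but this naive method would not scale to much longer strings.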


#### Need Help?

Ask a question or see the questions asked by other learners at this forum thread.

### 5 General Instructions and Recommendations on Solving Algorithmic Problems

Your main goal in an algorithmic problem is to implement a program that solves a given computational problem in just a few seconds even on massive datasets. Your program should read a dataset from the standard input and write an answer to the standard output. Below we provide general instructions and recommendations on solving such problems. Before reading them, go through the readings and screencasts in the first module that show a step-by-step process of solving two algorithmic problems: link.

#### 5.1 Reading the Problem Statement

You start by reading the problem statement, which contains the description of a particular computational task, the time and memory limits your solution should fit in, and one or two sample tests. In some problems your goal is just to carefully implement an algorithm covered in the lectures, while in other problems you first need to come up with an algorithm yourself.

#### 5.2 Designing an Algorithm

If your goal is to design an algorithm yourself, it is important to work out the running time your algorithm is expected to have. Usually, you can guess it from the problem statement (specifically, from the subsection called constraints) as follows. Modern computers perform roughly 10⁸–10⁹ operations per second. So, if the maximum size of a dataset in the problem description is n = 10⁵, then an algorithm with quadratic running time will most probably not fit into the time limit (since for n = 10⁵, n² = 10¹⁰), while a solution with running time O(n log n) will fit. However, an O(n²) solution will fit if n is up to 10³ = 1000, and if n is at most 100, even O(n³) solutions will fit. In some cases, the problem is so hard that we do not know a polynomial solution. But for n up to 18, a solution with O(2ⁿn²) running time will probably fit into the time limit.

To design an algorithm with the expected running time, you will of course need to use the ideas covered in the lectures. Also, make sure to carefully go through the sample tests in the problem description.
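The back-of-the-envelope estimates above can be reproduced with a short script. This is only a rough sketch: the constant `OPERATIONS_PER_SECOND` takes the conservative end of the quoted range as an assumption, the function names are hypothetical, and real running times depend heavily on the language and on constant factors.

```python
import math

# Assumption: the conservative end of the 10^8-10^9 operations-per-second range.
OPERATIONS_PER_SECOND = 10 ** 8


def operation_counts(n: int) -> dict:
    """Rough operation counts for common running times at input size n."""
    counts = {
        "n log n": n * math.log2(n),
        "n^2": n ** 2,
        "n^3": n ** 3,
    }
    if n <= 30:  # the exponential bound is only meaningful for tiny n
        counts["2^n * n^2"] = 2 ** n * n ** 2
    return counts


def estimated_seconds(n: int) -> dict:
    """Divide each operation count by the assumed machine speed."""
    return {name: ops / OPERATIONS_PER_SECOND
            for name, ops in operation_counts(n).items()}


if __name__ == "__main__":
    for n in (18, 100, 1000, 10 ** 5):
        print(n, estimated_seconds(n))
```

For n = 10⁵ the quadratic entry comes out at 100 seconds under this assumption, while for n = 18 the exponential entry is about 8.5 × 10⁷ operations, which is under a second.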

#### 5.3 Implementing Your Algorithm

When you have an algorithm in mind, you start implementing it. Currently, you can use the following programming languages to implement a solution to a problem: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, Scala. For all problems, we will be providing starter solutions for C++, Java, and Python3. If you are going to use one of these programming languages, use these starter files. For other programming languages, you need to implement a solution from scratch.

#### 5.4 Compiling Your Program

For solving programming assignments, you can use any of the following programming languages: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, and Scala. However, we will only be providing starter solution files for C++, Java, and Python3. The programming language of your submission is detected automatically, based on the extension of your submission.

We have reference solutions in C++, Java, and Python3 which solve the problem correctly under the given restrictions, and in most cases spend at most 1/3 of the time limit and at most 1/2 of the memory limit. You can also use other languages, and we’ve estimated the time limit multipliers for them. However, we cannot guarantee that a correct solution for a particular problem running under the given time and memory constraints exists in any of those other languages.

Your solution will be compiled as follows. We recommend that when testing your solution locally, you use the same compiler flags for compiling. This will increase the chances that your program behaves in the same way on your machine and on the testing machine (note that a buggy program may behave differently when compiled by different compilers, or even by the same compiler with different flags).

- C (gcc 5.2.1). File extensions: .c. Flags: gcc -pipe -O2 -std=c11