## Description

Introduction Welcome to your third programming assignment of the Algorithms on Strings class! In this programming assignment, you will be practicing implementing very efficient string algorithms.

Learning Outcomes Upon completing this programming assignment you will be able to: 1. find all occurrences of a pattern in text; 2. construct the suffix array efficiently; 3. use suffix arrays for solving the multiple pattern matching problem; 4. construct the suffix tree from the suffix array in linear time.

Passing Criteria: 2 out of 4 Passing thisprogramming assignmentrequires passingat least2out of4code problemsfrom thisassignment. In turn, passing a code problem requires implementing a solution that passes all the tests for this problem in the grader and does so under the time and memory limits specified in the problem statement.

Contents 1 Problem: Find All Occurrences of a Pattern in a String 3

2 Problem: Construct the Suffix Array of a Long String 5

3 Problem: Pattern Matching with the Suffix Array 8

4 Advanced Problem: Construct the Suffix Tree from the Suffix Array 10

1

5 General Instructions and Recommendations on Solving Algorithmic Problems 13 5.1 Reading the Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.2 Designing an Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.3 Implementing Your Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.4 Compiling Your Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.5 Testing Your Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.6 Submitting Your Program to the Grading System . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.7 Debugging and Stress Testing Your Program . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

6 Frequently Asked Questions 16 6.1 I submit the program, but nothing happens. Why? . . . . . . . . . . . . . . . . . . . . . . . . 16 6.2 I submit the solution only for one problem, but all the problems in the assignment are graded. Why? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6.3 What are the possible grading outcomes, and how to read them? . . . . . . . . . . . . . . . . 16 6.4 How to understand why my program fails and to fix it? . . . . . . . . . . . . . . . . . . . . . 17 6.5 Why do you hide the test on which my program fails? . . . . . . . . . . . . . . . . . . . . . . 17 6.6 My solution does not pass the tests? May I post it in the forum and ask for a help? . . . . . 18 6.7 My implementation always fails in the grader, though I already tested and stress tested it a lot. Would not it be better if you give me a solution to this problem or at least the test cases that you use? I will then be able to fix my code and will learn how to avoid making mistakes. Otherwise, I do not feel that I learn anything from solving this problem. I am just stuck. . . 18

2

1 Problem: Find All Occurrences of a Pattern in a String Problem Introduction In this problem, we ask a simple question: how many times one string occurs as a substring of another? Recall that different occurrences of a substring can overlap with each other. For example, ATA occurs three times in CGATATATCCATAG.

Problem Description Task. Find all occurrences of a pattern in a string. Input Format. Strings Pattern and Genome. Constraints. 1 ≤|Pattern|≤ 106; 1 ≤|Genome|≤ 106; both strings are over A, C, G, T. Output Format. All starting positions in Genome where Pattern appears as a substring (using 0-based indexing as usual). Time Limits. language C C++ Java Python C# Haskell JavaScript Ruby Scala time in seconds 1 1 12 4 1.5 2 4 4 24

Memory Limit. 512MB. Sample 1. Input: TACG GT Output:

Explanation: The pattern is longer than the text and hence has no occurrences in the text. Sample 2. Input: ATA ATATA Output: 0 2 Explanation: The pattern appears at positions 1 and 3 (and these two occurrences overlap each other). Sample 3. Input: ATAT GATATATGCATATACTT Output: 1 3 9 Explanation: The pattern appears at positions 1, 3, and 9 in the text.

3

Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: kmp

What To Do Tosolvethisproblem, itisenoughtoimplementcarefullythecorrespondingalgorithmcoveredinthelectures.

Need Help? Ask a question or see the questions asked by other learners at this forum thread.

4

2 Problem: Construct the Suffix Array of a Long String Problem Introduction The goal in this problem is to construct the suffix array of a given string again, but this time for a longer string. This will require you to implement an efficient algorithm.

Problem Description Task. Construct the suffix array of a string. Input Format. A string Text ending with a “$” symbol. Constraints. 1 ≤|Text(Text)|≤ 2·105; except for the last symbol, Text contains symbols A, C, G, T only. Output Format. SuffixArray(Text), that is, the list of starting positions of sorted suffixes separated by spaces. Time Limits. language C C++ Java Python C# Haskell JavaScript Ruby Scala time in seconds 1 1 4 30 1.5 2 30 30 8

Memory Limit. 512MB. Sample 1. Input: AAA$ Output: 3 2 1 0 Explanation: Sorted suffixes: 3 $ 2 A$ 1 AA$ 0 AAA$ Sample 2. Input: GAC$ Output: 3 1 2 0 Explanation: Sorted suffixes: 3 $ 1 AC$ 2 C$ 0 GAC$

5

Sample 3. Input: GAGAGAGA$ Output: 8 7 5 3 1 6 4 2 0 Explanation: Sorted suffixes: 8 $ 7 A$ 5 AGA$ 3 AGAGA$ 1 AGAGAGA$ 6 GA$ 4 GAGA$ 2 GAGAGA$ 0 GAGAGAGA$ Sample 4. Input: AACGATAGCGGTAGA$ Output: 15 14 0 1 12 6 4 2 8 13 3 7 9 10 11 5 Explanation: Sorted suffixes: 15 $ 14 A$ 0 AACGATAGCGGTAGA$ 1 ACGATAGCGGTAGA$ 12 AGA$ 6 AGCGGTAGA$ 4 ATAGCGGTAGA$ 2 CGATAGCGGTAGA$ 8 CGGTAGA$ 13 GA$ 3 GATAGCGGTAGA$ 7 GCGGTAGA$ 9 GGTAGA$ 10 GTAGA$ 11 TAGA$ 5 TAGCGGTAGA$

Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: suffix_array_long

What To Do Tosolvethisproblem, itisenoughtoimplementcarefullythecorrespondingalgorithmcoveredinthelectures.

6

Need Help? Ask a question or see the questions asked by other learners at this forum thread.

7

3 Problem: Pattern Matching with the Suffix Array Problem Introduction In this problem, we will let you use the suffix array to solve the Multiple Pattern Matching Problem.

Problem Description Task. Find all occurrences of a given collection of patterns in a string. Input Format. The first line contains a string Text). The second line specifies an integer n. The last line gives a collection of n strings Patterns = {p1,…,pn} separated by spaces. Constraints. All strings contain symbols A, C, G, T only; 1 ≤|Text|≤ 105; 1 ≤ n ≤ 104;∑︀n i=1|pi|≤ 105. Output Format. All starting positions (in any order) in Text where a pattern appears as a substring (using 0-based indexing as usual). If several patterns occur at the same position of the Text, still output this position only once. Time Limits. language C C++ Java Python C# Haskell JavaScript Ruby Scala time in seconds 1 1 4 12 1.5 2 12 12 8 Memory Limit. 512MB. Sample 1. Input: AAA 1 A Output: 0 1 2 Explanation: The pattern A appears at positions 0, 1, and 2 in the text. Sample 2. Input: ATA 3 C G C Output:

Explanation: There are no occurrences of the patterns in the text Sample 3. Input: ATATATA 3 ATA C TATAT Output: 4 2 0 1 Explanation: The pattern ATA appears at positions 0, 2, and 4, the pattern TATAT appears at position 1.

8

Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: suffix_array_matching

What To Do Tosolvethisproblem, itisenoughtoimplementcarefullythecorrespondingalgorithmcoveredinthelectures.

Need Help? Ask a question or see the questions asked by other learners at this forum thread.

9

4 Advanced Problem: Construct the Suffix Tree from the Suffix Array Westronglyrecommendyoustartsolvingadvancedproblemsonlywhenyouaredonewiththebasicproblems (for some advanced problems, algorithms are not covered in the video lectures and require additional ideas to be solved; for some other advanced problems, algorithms are covered in the lectures, but implementing them is a more challenging task than for other problems).

Problem Introduction SuffixTree(Text) can be constructed in linear time from SuffixArray(Text) by using the longest common prefix (LCP) array of Text, LCP(Text), which stores the length of the longest common prefix shared by consecutive lexicographically ordered suffixes of Text. For example,

LCP(“panamabananas$”) = (0,1,1,3,3,1,0,0,0,2,2,0,0).

Problem Description Task. Construct a suffix tree from the suffix array and LCP array of a string. Input Format. The first line contains a string Text ending with a “$” symbol, the second line contains SuffixArray(Text) as a list of |Text| integers separated by spaces, the last line contains LCP(Text) as a list of |Text|−1 integers separated by spaces. Constraints. 1 ≤|Text(Text)|≤ 2·105; except for the last symbol, Text contains symbols A, C, G, T only. Output Format. The output format in this problem differs from the output format in the problem “Suffix Tree” from the Programming Assignment 2 and is somewhat tricky. It is because this problem is harder: the input string can be longer, so it would take too long to output all the edge labels directly and compare them with the correct ones, as their combined length can be Θ(|Text|2), which is too much when the Text can be as long as 200 000 characters. Output the Text from the input on the first line. Then output all the edges of the suffix tree in a specific order (see below), each on its own line. Output each edge as a pair of integers (start,end), where start is the position in Text corresponding to the start of the edge label substring in the Text and end is the position right after the end of the edge label in the Text. Note that start must be a valid position in the Text, that is, 0 ≤ start ≤|Text|−1, and end must be between 1 and |Text| inclusive. Substring Text[start..end−1] must be equal to the edge label of the corresponding edge. For example, if Text = “ACACAA$” and the edge label is “CA”, you can output this edge either as (1,3) corresponding to Text[1..2] = “CA” or as (3,5) corresponding to Text[3..4] = “CA” — both variants will be accepted. The order of the edges is important here — if you output all the correct edges in the wrong order, your solution will not be accepted. However, you don’t need to construct this order yourself if you write in C++, Java or Python3, because it is implemented for you in the starter files. Output all the edges in the order of sorted suffixes: first, take the leaf of the suffix tree corresponding to the smallest suffix of Text and output all the edges on the path from the root to this leaf. Then take the leaf corresponding to the second smallest suffix of Text and output all the edges on the path from the root to this leaf except for those edges which were printed before. Then take the leaf corresponding to the third smallest suffix, fourth smallest suffix and so on. Print each edge only once — as a part of the path corresponding to the smallest suffix of Text where this edge appears. This way, you will only output O(|Text|) integers. See the examples below for clarification.

10

Time Limits. language C C++ Java Python C# Haskell JavaScript Ruby Scala time in seconds 2 2 20 10 3 4 10 10 40 Memory Limit. 512MB. Sample 1. Input: A$ 1 0 0 Output: A$ 1 2 0 2 Explanation:

i SA[i] LCP[i] suffix 0 1 0 $ 1 0 A$ 1 0 $ A$

Sample 2. Input: AAA$ 3 2 1 0 0 1 2 Output: AAA$ 3 4 0 1 3 4 1 2 3 4 2 4 Explanation:

i SA[i] LCP[i] suffix 0 3 0 $ 1 2 1 A$ 2 1 2 AA$ 3 0 AAA$

3

2

1 0

$ A

A$

$ A$

11

Sample 3. Input: GTAGT$ 5 2 3 0 4 1 0 0 2 0 1 Output: GTAGT$ 5 6 2 6 3 5 5 6 2 6 4 5 5 6 2 6 Explanation:

i SA[i] LCP[i] suffix 0 5 0 $ 1 2 0 AGT$ 2 3 2 GT$ 3 0 0 GTAGT$ 4 4 1 T$ 5 1 TAGT$

5 2

3 0 4 1

$ AGT$ GT T

$ AGT$ $ AGT$

Starter Files The starter solutions for this problem read the input data from the standard input, pass it to the blank procedure which is supposed to build the suffix tree, then print the edges of the resulting suffix tree in the order specified in the output format description. You don’t have to construct the correct order of edges for the output yourself if you use these starter files. You are only supposed to implement the procedure to build the suffix tree given the string, its suffix array and LCP array if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: suffix_tree_from_array

What To Do Tosolvethisproblem, itisenoughtoimplementcarefullythecorrespondingalgorithmcoveredinthelectures.

Need Help? Ask a question or see the questions asked by other learners at this forum thread.

12

5 General Instructions and Recommendations on Solving Algorithmic Problems Your main goal in an algorithmic problem is to implement a program that solves a given computational problem in just few seconds even on massive datasets. Your program should read a dataset from the standard input and write an answer to the standard output. Below we provide general instructions and recommendations on solving such problems. Before reading them, go through readings and screencasts in the first module that show a step by step process of solving two algorithmic problems: link.

5.1 Reading the Problem Statement You start by reading the problem statement that contains the description of a particular computational task as well as time and memory limits your solution should fit in, and one or two sample tests. In some problems your goal is just to implement carefully an algorithm covered in the lectures, while in some other problems you first need to come up with an algorithm yourself.

5.2 Designing an Algorithm If your goal is to design an algorithm yourself, one of the things it is important to realize is the expected running time of your algorithm. Usually, you can guess it from the problem statement (specifically, from the subsection called constraints) as follows. Modern computers perform roughly 108–109 operations per second. So, if the maximum size of a dataset in the problem description is n = 105, then most probably an algorithm with quadratic running time is not going to fit into time limit (since for n = 105, n2 = 1010) while a solution with running time O(nlogn) will fit. However, an O(n2) solution will fit if n is up to 103 = 1000, and if n is at most 100, even O(n3) solutions will fit. In some cases, the problem is so hard that we do not know a polynomial solution. But for n up to 18, a solution with O(2nn2) running time will probably fit into the time limit. To design an algorithm with the expected running time, you will of course need to use the ideas covered in the lectures. Also, make sure to carefully go through sample tests in the problem description.

5.3 Implementing Your Algorithm When you have an algorithm in mind, you start implementing it. Currently, you can use the following programming languages to implement a solution to a problem: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, Scala. For all problems, we will be providing starter solutions for C++, Java, and Python3. If you are going to use one of these programming languages, use these starter files. For other programming languages, you need to implement a solution from scratch.

5.4 Compiling Your Program For solving programming assignments, you can use any of the following programming languages: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, and Scala. However, we will only be providing starter solution files for C++, Java, and Python3. The programming language of your submission is detected automatically, based on the extension of your submission. We have reference solutions in C++, Java and Python3 which solve the problem correctly under the given restrictions, and in most cases spend at most 1/3 of the time limit and at most 1/2 of the memory limit. You can also use other languages, and we’ve estimated the time limit multipliers for them, however, we have no guarantee that a correct solution for a particular problem running under the given time and memory constraints exists in any of those other languages. Your solution will be compiled as follows. We recommend that when testing your solution locally, you use the same compiler flags for compiling. This will increase the chances that your program behaves in the

13

same way on your machine and on the testing machine (note that a buggy program may behave differently when compiled by different compilers, or even by the same compiler with different flags). ∙ C (gcc 5.2.1). File extensions: .c. Flags: gcc -pipe -O2 -std=c11