Programming Assignment 1: Suffix Trees solution


Introduction
Welcome to your first programming assignment of the Algorithms on Strings class! In this programming assignment, you will practice implementing a fundamental data structure, the suffix tree. Once constructed, a suffix tree lets you solve many non-trivial computational problems for a given string (or strings). You will solve one such problem at the end of this assignment.

In this programming assignment, the grader will show you the input data if your solution fails on any of the tests. This is done to help you get used to algorithmic problems in general and gain some experience debugging your programs while knowing exactly which tests they fail on. However, for all the following programming assignments, the grader will show the input data only if your solution fails on one of the first few tests (please review questions 7.4 and 7.5 in the FAQ section for a more detailed explanation of this behavior of the grader).
Learning Outcomes
Upon completing this programming assignment you will be able to:
1. construct a trie from a collection of patterns;
2. use this trie to find all occurrences of patterns in a given text without scanning the text many times;
3. do the same when some patterns are allowed to be prefixes of other patterns;
4. construct the suffix tree of a string;
5. use suffix trees to find the shortest non-shared substring.
Passing Criteria: 3 out of 5
Passing this programming assignment requires passing at least 3 out of 5 code problems from this assignment. In turn, passing a code problem requires implementing a solution that passes all the tests for this problem in the grader and does so under the time and memory limits specified in the problem statement.
Contents
1 Problem: Construct a Trie from a Collection of Patterns
2 Problem: Implement TrieMatching
3 Problem: Extend TrieMatching
4 Problem: Construct the Suffix Tree of a String
5 Advanced Problem: Find the Shortest Non-Shared Substring of Two Strings
6 General Instructions and Recommendations on Solving Algorithmic Problems
    6.1 Reading the Problem Statement
    6.2 Designing an Algorithm
    6.3 Implementing Your Algorithm
    6.4 Compiling Your Program
    6.5 Testing Your Program
    6.6 Submitting Your Program to the Grading System
    6.7 Debugging and Stress Testing Your Program
7 Frequently Asked Questions
    7.1 I submit the program, but nothing happens. Why?
    7.2 I submit the solution only for one problem, but all the problems in the assignment are graded. Why?
    7.3 What are the possible grading outcomes, and how to read them?
    7.4 How to understand why my program fails and to fix it?
    7.5 Why do you hide the test on which my program fails?
    7.6 My solution does not pass the tests? May I post it in the forum and ask for a help?
    7.7 My implementation always fails in the grader, though I already tested and stress tested it a lot. Would not it be better if you give me a solution to this problem or at least the test cases that you use? I will then be able to fix my code and will learn how to avoid making mistakes. Otherwise, I do not feel that I learn anything from solving this problem. I am just stuck.
1 Problem: Construct a Trie from a Collection of Patterns

Problem Introduction
Reads will form a collection of strings Patterns that we wish to match against a reference genome Text. For each string in Patterns, we will first find all its exact matches as a substring of Text (or conclude that it does not appear in Text). When hunting for the cause of a genetic disorder, we can immediately eliminate from consideration areas of the reference genome where exact matches occur.

Multiple Pattern Matching Problem: Find all occurrences of a collection of patterns in a text.
    Input: A string Text and a collection Patterns containing (shorter) strings.
    Output: All starting positions in Text where a string from Patterns appears as a substring.

To solve this problem, we will consolidate Patterns into a directed tree called a trie (pronounced “try”), which is written Trie(Patterns) and has the following properties.
∙ The trie has a single root node with indegree 0, denoted root.
∙ Each edge of Trie(Patterns) is labeled with a letter of the alphabet.
∙ Edges leading out of a given node have distinct labels.
∙ Every string in Patterns is spelled out by concatenating the letters along some path from the root downward.
∙ Every path from the root to a leaf, or node with outdegree 0, spells a string from Patterns.

The most obvious way to construct Trie(Patterns) is by iteratively adding each string from Patterns to the growing trie, as implemented by the following algorithm.

TrieConstruction(Patterns)
    Trie ← a graph consisting of a single node root
    for each string Pattern in Patterns:
        currentNode ← root
        for i from 0 to |Pattern| − 1:
            currentSymbol ← Pattern[i]
            if there is an outgoing edge from currentNode with label currentSymbol:
                currentNode ← ending node of this edge
            else:
                add a new node newNode to Trie
                add a new edge from currentNode to newNode with label currentSymbol
                currentNode ← newNode
    return Trie
Problem Description
Task. Construct a trie from a collection of patterns.
Input Format. An integer n and a collection of strings Patterns = {p1,…,pn} (each string is given on a separate line).
Constraints. 1 ≤ n ≤ 100; 1 ≤ |pi| ≤ 100 for all 1 ≤ i ≤ n; the pi contain only symbols A, C, G, T; no pi is a prefix of pj for all 1 ≤ i ≠ j ≤ n.
Output Format. The adjacency list corresponding to Trie(Patterns), in the following format. If Trie(Patterns) has n nodes, first label the root with 0 and then label the remaining nodes with the integers 1 through n−1 in any order you like. Each edge of the adjacency list of Trie(Patterns) will be encoded by a triple: the first two members of the triple must be the integers u, v labeling the initial and terminal nodes of the edge, respectively; the third member of the triple must be the symbol c labeling the edge; output each such triple in the format u-v:c (with no spaces) on a separate line.
Time Limits.
language         C    C++  Java  Python  C#    Haskell  JavaScript  Ruby  Scala
time in seconds  0.5  0.5  2     2       0.75  1        2           2     4
Memory Limit. 512Mb.

Sample 1.
Input:
1
ATA
Output:
0-1:A
2-3:A
1-2:T
Explanation:
[Figure: the trie is a single path 0 → 1 → 2 → 3 whose edges are labeled A, T, A.]
Sample 2.
Input:
3
AT
AG
AC
Output:
0-1:A
1-4:C
1-3:G
1-2:T
Explanation:
[Figure: the trie has an edge A from node 0 to node 1, and edges T, G, C from node 1 to nodes 2, 3, 4.]
Sample 3.
Input:
3
ATAGA
ATC
GAT
Output:
0-1:A
1-2:T
2-3:A
3-4:G
4-5:A
2-6:C
0-7:G
7-8:A
8-9:T
Explanation:
[Figure: the trie with nodes 0 through 9 and the edges listed in the output above.]
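The TrieConstruction pseudocode and the u-v:c output format can be sketched in Python as follows. This is a minimal illustration, not the starter solution: the names build_trie and trie_edges and the dictionary-of-dictionaries representation are assumptions made here for brevity.

```python
def build_trie(patterns):
    """Return the trie as {node: {symbol: child}}; node 0 is the root."""
    trie = {0: {}}
    for pattern in patterns:
        current = 0
        for symbol in pattern:
            if symbol in trie[current]:
                # Follow the existing edge labeled with this symbol.
                current = trie[current][symbol]
            else:
                # Create a new node and an edge leading to it.
                new_node = len(trie)
                trie[new_node] = {}
                trie[current][symbol] = new_node
                current = new_node
    return trie


def trie_edges(trie):
    """Format every edge of the trie as a 'u-v:c' string."""
    return [f"{u}-{v}:{c}"
            for u, children in trie.items()
            for c, v in children.items()]


if __name__ == "__main__":
    print("\n".join(trie_edges(build_trie(["ATAGA", "ATC", "GAT"]))))
```

Numbering new nodes with len(trie) gives labels 1, 2, … in order of creation, which is one of the labelings the output format allows.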
Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: trie
What To Do
To solve this problem, it is enough to implement carefully the corresponding algorithm covered in the lectures.
Need Help? Ask a question or see the questions asked by other learners at this forum thread.
2 Problem: Implement TrieMatching

Problem Introduction
Given a string Text and Trie(Patterns), we can quickly check whether any string from Patterns matches a prefix of Text. To do so, we start reading symbols from the beginning of Text and see what string these symbols “spell” as we proceed along the path downward from the root of the trie, as illustrated in the pseudocode below. For each new symbol in Text, if we encounter this symbol along an edge leading down from the present node, then we continue along this edge; otherwise, we stop and conclude that no string in Patterns matches a prefix of Text. If we make it all the way to a leaf, then the pattern spelled out by this path matches a prefix of Text. This algorithm is called PrefixTrieMatching.

PrefixTrieMatching(Text, Trie)
    symbol ← first letter of Text
    v ← root of Trie
    while forever:
        if v is a leaf in Trie:
            return the pattern spelled by the path from the root to v
        else if there is an edge (v, w) in Trie labeled by symbol:
            symbol ← next letter of Text
            v ← w
        else:
            output “no matches found”
            return

PrefixTrieMatching finds whether any strings in Patterns match a prefix of Text. To find whether any strings in Patterns match a substring of Text starting at position k, we chop off the first k−1 symbols from Text and run PrefixTrieMatching on the shortened string. As a result, to solve the Multiple Pattern Matching Problem, we simply iterate PrefixTrieMatching |Text| times, chopping the first symbol off of Text before each new iteration.

TrieMatching(Text, Trie)
    while Text is nonempty:
        PrefixTrieMatching(Text, Trie)
        remove first symbol from Text

Note that in practice there is no need to actually chop the first k−1 symbols of Text. Instead, we just read Text from the k-th symbol.
Problem Description
Task. Implement the TrieMatching algorithm.
Input Format. The first line of the input contains a string Text, the second line contains an integer n, each of the following n lines contains a pattern from Patterns = {p1,…,pn}.
Constraints. 1 ≤ |Text| ≤ 10000; 1 ≤ n ≤ 5000; 1 ≤ |pi| ≤ 100 for all 1 ≤ i ≤ n; all strings contain only symbols A, C, G, T; no pi is a prefix of pj for all 1 ≤ i ≠ j ≤ n.
Output Format. All starting positions in Text where a string from Patterns appears as a substring in increasing order (assuming that Text is a 0-based array of symbols).
Time Limits.
language         C  C++  Java  Python  C#   Haskell  JavaScript  Ruby  Scala
time in seconds  1  1    3     7       1.5  2        7           7     6
Memory Limit. 512Mb.

Sample 1.
Input:
AAA
1
AA
Output:
0 1
Explanation: The pattern AA appears at positions 0 and 1. Note that these two occurrences of the pattern overlap.

Sample 2.
Input:
AA
1
T
Output:

Explanation: There are no occurrences of the pattern in the text.

Sample 3.
Input:
AATCGGGTTCAATCGGGGT
2
ATCG
GGGT
Output:
1 4 11 15
Explanation: The pattern ATCG appears at positions 1 and 11, the pattern GGGT appears at positions 4 and 15.
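A possible Python sketch of PrefixTrieMatching and TrieMatching is shown below. It reads Text from position k instead of chopping symbols off, as the note in the problem introduction suggests. The function names and the trie representation ({node: {symbol: child}}) are assumptions of this sketch, and it relies on the constraint that no pattern is a prefix of another.

```python
def build_trie(patterns):
    """Return the trie as {node: {symbol: child}}; node 0 is the root."""
    trie = {0: {}}
    for pattern in patterns:
        current = 0
        for symbol in pattern:
            if symbol not in trie[current]:
                trie[len(trie)] = {}
                trie[current][symbol] = len(trie) - 1
            current = trie[current][symbol]
    return trie


def prefix_trie_matching(text, start, trie):
    """Return True if some pattern matches text beginning at index start."""
    node = 0
    i = start
    while True:
        if not trie[node]:
            # A node without children is a leaf: a whole pattern was spelled.
            return True
        if i < len(text) and text[i] in trie[node]:
            node = trie[node][text[i]]
            i += 1
        else:
            return False


def trie_matching(text, patterns):
    """Return all 0-based starting positions of pattern occurrences."""
    trie = build_trie(patterns)
    return [k for k in range(len(text)) if prefix_trie_matching(text, k, trie)]


if __name__ == "__main__":
    print(*trie_matching("AATCGGGTTCAATCGGGGT", ["ATCG", "GGGT"]))
```

The check i < len(text) covers the case where Text runs out before a leaf is reached, which the pseudocode leaves implicit in “next letter of Text”.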
Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: trie_matching
What To Do
To solve this problem, it is enough to implement carefully the corresponding algorithm covered in the lectures.
Need Help? Ask a question or see the questions asked by other learners at this forum thread.
3 Problem: Extend TrieMatching Problem Introduction The goal in this problem is to extend the algorithm from the previous problem such that it will be able to handle cases when one of the patterns is a prefix of another pattern. In this case, some patterns are spelled in a trie by traversing a path from the root to an internal vertex, but not to a leaf.
Problem Description
Task. Extend the TrieMatching algorithm so that it correctly handles cases when one of the patterns is a prefix of another one.
Input Format. The first line of the input contains a string Text, the second line contains an integer n, each of the following n lines contains a pattern from Patterns = {p1,…,pn}.
Constraints. 1 ≤ |Text| ≤ 10000; 1 ≤ n ≤ 5000; 1 ≤ |pi| ≤ 100 for all 1 ≤ i ≤ n; all strings contain only symbols A, C, G, T; it can be the case that pi is a prefix of pj for some i, j.
Output Format. All starting positions in Text where a string from Patterns appears as a substring in increasing order (assuming that Text is a 0-based array of symbols). If more than one pattern appears starting at position i, output i once.
Time Limits.
language         C  C++  Java  Python  C#   Haskell  JavaScript  Ruby  Scala
time in seconds  1  1    3     7       1.5  2        7           7     6
Memory Limit. 512Mb.

Sample 1.
Input:
AAA
1
AA
Output:
0 1
Explanation: The pattern AA appears at positions 0 and 1. Note that these two occurrences of the pattern overlap.

Sample 2.
Input:
ACATA
3
AT
A
AG
Output:
0 2 4
Explanation: Text contains occurrences of A at positions 0, 2, and 4, as well as an occurrence of AT at position 2. Note that the trie looks as follows in this case:
[Figure: the trie has a single edge A leaving the root, followed by two edges labeled T and G.]
When spelling Text from position 0, we don’t reach a leaf. Still, there is an occurrence of the pattern A at this position.
Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: trie_matching_extended
What To Do To solve this problem, you may want to store in each node of the trie an additional flag indicating whether the path from the root to this node spells a pattern.
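The flag idea can be sketched as follows. This is only one way to do it: the parallel lists children and is_end, and the function names, are assumptions of this sketch rather than part of the starter files.

```python
def build_trie_with_flags(patterns):
    """Return (children, is_end): children[v] maps symbol -> child node,
    is_end[v] is True when the path from the root to v spells a pattern."""
    children = [{}]
    is_end = [False]
    for pattern in patterns:
        node = 0
        for symbol in pattern:
            if symbol not in children[node]:
                children.append({})
                is_end.append(False)
                children[node][symbol] = len(children) - 1
            node = children[node][symbol]
        is_end[node] = True  # mark the node where this pattern ends
    return children, is_end


def trie_matching_extended(text, patterns):
    """Return each 0-based position where at least one pattern starts."""
    children, is_end = build_trie_with_flags(patterns)
    positions = []
    for k in range(len(text)):
        node, i = 0, k
        while True:
            if is_end[node]:
                # Some pattern ends here; report k once and stop.
                positions.append(k)
                break
            if i < len(text) and text[i] in children[node]:
                node = children[node][text[i]]
                i += 1
            else:
                break
    return positions


if __name__ == "__main__":
    print(*trie_matching_extended("ACATA", ["AT", "A", "AG"]))
```

Stopping at the first flagged node is enough here because the output lists each starting position once, regardless of how many patterns begin there.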
Need Help? Ask a question or see the questions asked by other learners at this forum thread.
4 Problem: Construct the Suffix Tree of a String

Problem Introduction
Storing Trie(Patterns) requires a great deal of memory. So let’s process Text into a data structure instead. Our goal is to compare each string in Patterns against Text without needing to traverse Text from beginning to end. In more familiar terms, instead of packing Patterns onto a bus and riding the long distance down Text, our new data structure will be able to “teleport” each string in Patterns directly to its occurrences in Text.

A suffix trie, denoted SuffixTrie(Text), is the trie formed from all suffixes of Text. From now on, we append the dollar-sign (“\$”) to Text in order to mark the end of Text. We will also label each leaf of the resulting trie by the starting position of the suffix whose path through the trie ends at this leaf (using 0-based indexing). This way, when we arrive at a leaf, we will immediately know where this suffix came from in Text.

However, the runtime and memory required to construct SuffixTrie(Text) are both proportional to the combined length of all suffixes of Text. There are |Text| suffixes of Text, ranging in length from 1 to |Text| and having total length |Text|·(|Text|+1)/2, which is Θ(|Text|²). Thus, we need to reduce both the construction time and memory requirements of suffix tries to make them practical.

Let’s not give up hope on suffix tries. We can reduce the number of edges in SuffixTrie(Text) by combining the edges on any non-branching path into a single edge. We then label this edge with the concatenation of symbols on the consolidated edges. The resulting data structure is called a suffix tree, written SuffixTree(Text). To match a single Pattern to Text, we thread Pattern into SuffixTree(Text) by the same process used for a suffix trie. Similarly to the suffix trie, we can use the leaf labels to find starting positions of successfully matched patterns.
Suffix trees save memory because they do not need to store concatenated edge labels from each nonbranching path. For example, a suffix tree does not need ten bytes to store the edge labeled “mabananas\$” in SuffixTree(“panamabananas\$”); instead, it suffices to store a pointer to position 4 of “panamabananas\$”, as well as the length of “mabananas\$”. Furthermore, suffix trees can be constructed in linear time, without having to first construct the suffix trie! We will not ask you to implement this fast suffix tree construction algorithm because it is quite complex.
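To make the pointer-plus-length idea concrete, here is a tiny Python illustration. The helper name edge_label is hypothetical; the point is only that an edge can keep two integers and recover its label from Text on demand.

```python
def edge_label(text, start, length):
    """Recover an edge label from its (start, length) pair."""
    return text[start:start + length]


if __name__ == "__main__":
    text = "panamabananas$"
    # The edge stores only the pair (4, 10) instead of a ten-symbol string.
    print(edge_label(text, 4, 10))
```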
Problem Description
Task. Construct the suffix tree of a string.
Input Format. A string Text ending with a “\$” symbol.
Constraints. 1 ≤ |Text| ≤ 5000; except for the last symbol, Text contains symbols A, C, G, T only.
Output Format. The strings labeling the edges of SuffixTree(Text) in any order.
Time Limits.
language         C  C++  Java  Python  C#   Haskell  JavaScript  Ruby  Scala
time in seconds  1  1    3     10      1.5  2        10          10    6
Memory Limit. 512Mb.
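Sample 1.
Input:
A\$
Output:
A\$
\$
Explanation:
[Figure: the suffix tree of A\$; the root has two leaf edges, A\$ leading to leaf 0 and \$ leading to leaf 1.]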
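Sample 2.
Input:
ACA\$
Output:
\$
A
\$
CA\$
CA\$
Explanation:
[Figure: the suffix tree of ACA\$; the root has edges \$ (leaf 3), CA\$ (leaf 1), and A, and the node below A has edges \$ (leaf 2) and CA\$ (leaf 0).]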
Sample 3.
Input:
ATAAATG\$
Output:
AAATG\$
G\$
T
ATG\$
TG\$
A
A
AAATG\$
G\$
T
G\$
\$
Explanation:
[Figure: the suffix tree of ATAAATG\$ with leaves labeled 0 through 7 and the edge labels listed in the output above.]
Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: suffix_tree
What To Do You can construct a trie from all the suffixes of the initial string as in the first problem. Then you can “compress” it into the suffix tree by deleting all nodes of the trie with only one child, merging the incoming and the outgoing edge of such node into one edge, concatenating the edge labels. However, if you do this and also store the substrings as edge labels directly, this will be too slow and also use too much memory. Use the hint from the lecture to only store the pair (start, length) of the substring of text corresponding to the edge label instead of storing this substring itself. Also note that when you create an edge from a node to a leaf of the tree, you don’t need to go through the whole substring corresponding to this edge character-by-character, you already know the start and the length of the corresponding substring. If it’s still too slow, you’ll need to build the suffix tree directly without building the suffix trie first. To do that, you’ll need to do almost the same, but creating the nodes only when branching happens by breaking the existing edge in the middle.
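The last suggestion, building the suffix tree directly and breaking an edge in the middle when branching happens, could look roughly like the Python sketch below. It is a quadratic-time illustration, not the linear-time algorithm mentioned earlier, and the node layout (a list of dictionaries mapping the first symbol of an edge to a (start, length, child) triple) is an assumption of this sketch. It assumes Text ends with a terminator symbol that occurs nowhere else.

```python
def build_suffix_tree(text):
    """Build the suffix tree of text by inserting suffixes one by one.

    nodes[v] maps the first symbol of an outgoing edge to a triple
    (start, length, child): the edge label is text[start:start + length].
    Assumes text ends with a unique terminator such as '$'.
    """
    nodes = [{}]  # node 0 is the root
    for suffix_start in range(len(text)):
        node, pos = 0, suffix_start
        while pos < len(text):
            symbol = text[pos]
            if symbol not in nodes[node]:
                # No edge starts with this symbol: attach a new leaf.
                nodes.append({})
                nodes[node][symbol] = (pos, len(text) - pos, len(nodes) - 1)
                break
            start, length, child = nodes[node][symbol]
            # Count how many symbols of the edge label the suffix matches.
            matched = 0
            while matched < length and text[start + matched] == text[pos + matched]:
                matched += 1
            if matched == length:
                # The whole edge matches: descend to the child.
                node = child
            else:
                # Mismatch inside the edge: split it at the mismatch point.
                nodes.append({})
                middle = len(nodes) - 1
                nodes[node][symbol] = (start, matched, middle)
                nodes[middle][text[start + matched]] = (
                    start + matched, length - matched, child)
                node = middle
            pos += matched
    return nodes


def edge_labels(text, nodes):
    """Return the labels of all edges, in no particular order."""
    return [text[start:start + length]
            for node in nodes
            for start, length, _ in node.values()]


if __name__ == "__main__":
    sample = "ATAAATG$"
    print("\n".join(edge_labels(sample, build_suffix_tree(sample))))
```

Because every edge stores only two integers and a child index, memory stays linear in |Text| even though the construction time in this sketch is quadratic in the worst case.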
Need Help? Ask a question or see the questions asked by other learners at this forum thread.
5 Advanced Problem: Find the Shortest Non-Shared Substring of Two Strings
We strongly recommend you start solving advanced problems only when you are done with the basic problems (for some advanced problems, algorithms are not covered in the video lectures and require additional ideas to be solved; for some other advanced problems, algorithms are covered in the lectures, but implementing them is a more challenging task than for other problems).
Problem Introduction The longest repeat in a string and the longest substring shared by two strings can be found using a suffix tree. Another such problem is shown below.
Problem Description
Task. Find the shortest substring of one string that does not appear in another string.
Input Format. Strings Text1 and Text2.
Constraints. 1 ≤ |Text1|, |Text2| ≤ 2000; strings have equal length (|Text1| = |Text2|), are not equal (Text1 ≠ Text2), and contain symbols A, C, G, T only.
Output Format. The shortest (non-empty) substring of Text1 that does not appear in Text2. (Multiple solutions may exist, in which case you may return any one.)
Time Limits.
language         C  C++  Java  Python  C#   Haskell  JavaScript  Ruby  Scala
time in seconds  1  1    5     8       1.5  2        8           8     10
Memory Limit. 1024Mb.

Sample 1.
Input:
A
T
Output:
A
Explanation: Text2 does not contain the string A, hence it is clearly a shortest such string.

Sample 2.
Input:
AAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTT
Output:
A
Explanation: Again, Text2 does not contain the string A, so it is a shortest one.

Sample 3.
Input:
CCAAGCTGCTAGAGG
CATGCTGGGCTGGCT
Output:
AA
Explanation: In this case, Text2 contains all symbols A, C, G, T, that is, all substrings of Text1 of length 1. At the same time, Text2 does not contain AA, hence it is a shortest substring of Text1 that does not appear in Text2.

Sample 4.
Input:
ATGCGATGACCTGACTGA
CTCAACGTATTGGCCAGA
Output:
ATG
Explanation: The string ATG is a substring of Text1 and it does not appear in Text2. At the same time, Text2 contains all 16 strings of length 2 and all 4 strings of length 1.
Starter Files The starter solutions for this problem read the input data from the standard input, pass it to a blank procedure, and then write the result to the standard output. You are supposed to implement your algorithm in this blank procedure if you are using C++, Java, or Python3. For other programming languages, you need to implement a solution from scratch. Filename: non_shared_substring
What To Do Hint: construct the suffix tree of a string Text1#Text2\$ (where # and \$ are new symbols).
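Before implementing the suffix tree approach from the hint, it may help to have a simple reference to compare against when stress testing. The Python sketch below is deliberately not the suffix tree method: it is a brute-force check that tries lengths 1, 2, … and collects the substrings of Text2 of each length in a set. The function name is an assumption, and the worst-case running time is roughly cubic, so it is meant for small tests only.

```python
def shortest_non_shared_substring(text1, text2):
    """Brute force: return a shortest substring of text1 absent from text2."""
    for length in range(1, len(text1) + 1):
        # All substrings of text2 that have exactly this length.
        seen = {text2[i:i + length] for i in range(len(text2) - length + 1)}
        for i in range(len(text1) - length + 1):
            candidate = text1[i:i + length]
            if candidate not in seen:
                return candidate
    # Reached only if text1 itself occurs in text2, which the constraints
    # (equal lengths, different strings) rule out.
    return None


if __name__ == "__main__":
    print(shortest_non_shared_substring("ATGCGATGACCTGACTGA", "CTCAACGTATTGGCCAGA"))
```

Since several answers of the same length may exist, a stress test should compare the length of the two answers and check that each one occurs in Text1 but not in Text2, rather than compare the strings themselves.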
Need Help? Ask a question or see the questions asked by other learners at this forum thread.
6 General Instructions and Recommendations on Solving Algorithmic Problems
Your main goal in an algorithmic problem is to implement a program that solves a given computational problem in just a few seconds even on massive datasets. Your program should read a dataset from the standard input and write an answer to the standard output. Below we provide general instructions and recommendations on solving such problems. Before reading them, go through readings and screencasts in the first module that show a step by step process of solving two algorithmic problems: link.
6.1 Reading the Problem Statement You start by reading the problem statement that contains the description of a particular computational task as well as time and memory limits your solution should fit in, and one or two sample tests. In some problems your goal is just to implement carefully an algorithm covered in the lectures, while in some other problems you first need to come up with an algorithm yourself.
6.2 Designing an Algorithm
If your goal is to design an algorithm yourself, one important thing to work out is the expected running time of your algorithm. Usually, you can guess it from the problem statement (specifically, from the subsection called constraints) as follows. Modern computers perform roughly 10⁸–10⁹ operations per second. So, if the maximum size of a dataset in the problem description is n = 10⁵, then most probably an algorithm with quadratic running time is not going to fit into the time limit (since for n = 10⁵, n² = 10¹⁰) while a solution with running time O(n log n) will fit. However, an O(n²) solution will fit if n is up to 10³ = 1000, and if n is at most 100, even O(n³) solutions will fit. In some cases, the problem is so hard that we do not know a polynomial solution. But for n up to 18, a solution with O(2ⁿ·n²) running time will probably fit into the time limit. To design an algorithm with the expected running time, you will of course need to use the ideas covered in the lectures. Also, make sure to carefully go through sample tests in the problem description.
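The back-of-the-envelope estimate above can be reproduced with a few lines of Python. The constant OPS_PER_SECOND is an assumption taken from the lower end of the range quoted in the text, and the operation counts ignore constant factors, so the results are rough orders of magnitude only.

```python
import math

# Assumed speed: the pessimistic end of the 10^8-10^9 range.
OPS_PER_SECOND = 10**8


def estimated_seconds(operations):
    """Rough running time for a given number of elementary operations."""
    return operations / OPS_PER_SECOND


if __name__ == "__main__":
    n = 10**5
    print("n^2       :", estimated_seconds(n**2), "s")
    print("n log2 n  :", estimated_seconds(n * math.log2(n)), "s")
    print("2^n * n^2 :", estimated_seconds(2**18 * 18**2), "s for n = 18")
```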
6.3 Implementing Your Algorithm When you have an algorithm in mind, you start implementing it. Currently, you can use the following programming languages to implement a solution to a problem: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, Scala. For all problems, we will be providing starter solutions for C++, Java, and Python3. If you are going to use one of these programming languages, use these starter files. For other programming languages, you need to implement a solution from scratch.
6.4 Compiling Your Program
For solving programming assignments, you can use any of the following programming languages: C, C++, C#, Haskell, Java, JavaScript, Python2, Python3, Ruby, and Scala. However, we will only be providing starter solution files for C++, Java, and Python3. The programming language of your submission is detected automatically, based on the extension of your submission. We have reference solutions in C++, Java and Python3 which solve the problem correctly under the given restrictions, and in most cases spend at most 1/3 of the time limit and at most 1/2 of the memory limit. You can also use other languages, and we’ve estimated the time limit multipliers for them. However, we cannot guarantee that a correct solution for a particular problem running under the given time and memory constraints exists in any of those other languages. Your solution will be compiled as follows. We recommend that when testing your solution locally, you use the same compiler flags for compiling. This will increase the chances that your program behaves in the same way on your machine and on the testing machine (note that a buggy program may behave differently when compiled by different compilers, or even by the same compiler with different flags).
∙ C (gcc 5.2.1). File extensions: .c. Flags: gcc -pipe -O2 -std=c11