In this final assignment for Data Structures, you will carry out a series of tests on the fundamental data
structures in the Standard Template Library to first hypothesize and then measure the relative performance
(running time & memory usage) of these data structures and solidify your understanding of algorithm
complexity analysis using Big ’O’ Notation.
The five fundamental data structures we will study are: vector,
list, binary search tree (set / map), priority queue, and hash table (unordered set / unordered map).
Be sure to read the entire handout before beginning your implementation.
Overview of Operations
We will consider the following six simple, but moderately compute-intensive, operations that are common
subtasks of many interesting real-world algorithms and applications. We will test these operations using
integers and/or STL strings as noted below.
• Sort – we’ll use the default operator< for the specific data type.
• Remove duplicates from a sequence – without otherwise changing the overall order (keeping the first
occurrence of the element).
• Determine the mode – most frequently occurring element. If there is a tie, you may return any of the
most frequently occurring elements.
• Identify the closest pair of items within the dataset – integer data only. We’ll use operator- to
measure distance. If there is a tie in distance, you may return any of the pairs of closest distance.
• Output the first/smallest f items – a portion of the complete sorted output.
• Determine the longest matching substring between any two elements – STL string data only. For
example, if the input contains the words ‘antelope’, ‘buffalo’ and ‘elephant’, the longest substring
match is ‘ant’ (found within both ‘antelope’ and ‘elephant’). If there is a tie, you may return any of
the longest matching substrings.
See also the provided sample output for each of these operations.
“Rules” for comparison: For each operation, we will analyze the cost of a program/function that reads
the input from an STL input stream object (e.g., std::cin or std::ifstream) and writes the answer to an
STL output stream (e.g., std::cout or std::ofstream). The function should read through the input only
once and construct and use a single instance of the specified STL data structure to compute the output.
function may not use any other data structure to help with the computation (e.g., storing data in a C-style
Your Initial Predictions of Complexity Analysis
Before doing any implementation or testing, think about which data structures are better, similarly good, or
useless for tackling each operation.
Fill in the table on the next page with the big ’O’ notation for both the runtime and memory usage to
complete each operation using that data structure. If it is not feasible/sensible to use a particular data
structure to complete the specified operation put an X in the box. Hint: In the first 3 columns there should
only be 2 X’s! If two or more data structures have the same big ‘O’ notation for one operation, predict and
rank the data structures by faster running time for large data.
We combine set & map (and unordered set
& unordered map) in this analysis, but be sure to specify which datatype of the two makes the most sense
for each operation.
For your answers, n is the number of elements in the input, f is the requested number of values in the output
(only relevant for the ‘first sorted’ operation), and l is the maximum length of each string (only use this
variable for the ‘longest substring match’ operation).
Type your answers into your README.txt file.
You’ll also paste these answers into Submitty for autograding.
We provide a framework to implement and test these operations with each data structure and measure the
runtime and overall memory usage. The input will come from a file, redirected to std::cin on the command
line. Similarly, the program will write to std::cout and we can redirect that to write to a file.
statistics will be printed to std::cerr to help with complexity analysis. Here’s examples of how to compile
and run the provided code:
clang++ -g -Wall -Wextra performance*.cpp -o perf.out
./perf.out vector mode string < small_string_input.txt
./perf.out vector remove_duplicates string < small_string_input.txt > my_out.txt
diff my_out.txt small_string_output_remove_duplicates.txt
./perf.out vector closest_pair integer < small_integer_output_remove_duplicates.txt
./perf.out vector first_sorted string 3 < small_string_input.txt
./perf.out vector longest_substring string < small_string_output_remove_duplicates.txt
./perf.out vector sort string < medium_string_input.txt > vec_out.txt 2> vec_stats.txt
./perf.out list sort string < medium_string_input.txt > list_out.txt 2> list_stats.txt
diff vec_out.txt list_out.txt
The first example reads string input from small string input.txt, uses an STL vector to find the most
frequently occurring value (implemented by first sorting the data), and then outputs that string (the mode)
The second example uses an STL vector to remove the duplicate values (without otherwise changing the
order) from small string input.txt storing the answer in my out.txt, and then uses diff to compare
that file to the provided answer.
The next 3 command lines show examples of how to run the closest pair, first sorted and
longest substring operations. Note that the first sorted operation takes an additional argument, the
number of elements to output from the sorted order. Also note that the closest pair and longest substring
operations are more interesting when the input does not contain duplicate values.
The final example sorts a larger input of random strings first using an STL vector, and then using an STL
list and confirms that the answers match.
Generating Random Input
We provide a small standalone program to generate input data files with random strings. Here’s how you
compile and use this program to generate a file named medium string input.txt with 10,000 strings, each
with 5 random letters (‘a’-‘z’). And also a file named medium integer input.txt with 10,000 integers, each
with 3-5 digits (ranging in value from 100-99999).
clang++ -g -Wall -Wextra generate_input.cpp -o generate_input.out
./generate.out string 10000 5 5 > medium_string_input.txt
./generate.out integer 10000 3 5 > medium_integer_input.txt
First, create and save several large randomly generated input files with different numbers of elements. Test the
vector code for each operation with each of your input files. The provided code uses the clock() function
to measure the processing time of the computation.
The resolution accuracy of the timing mechanism is
system and hardware dependent and may be in seconds, milliseconds, or something else. Make sure you use
large enough inputs so that your running time for the largest test is about a second or more (to ensure the
measurement isn’t just noise).
Record the results in a table like this:
Sorting random 5 letter strings using STL vector
# of strings vector sort operation time (sec)
As the dataset grows, does your predicted big ‘O’ notation match the raw performance numbers? We know
that the running time for sorting with the STL vector sorting algorithm is O(n log2 n) and we can estimate
the coefficient k in front of the dominant term from the collected numbers.
vector sort operation time(n) = kvector sort ∗ n log2 n
Thus, on the machine which produced these numbers, coefficient kvector sort ≈ 2.3 x 10−7
sec. Of course
these constants will be different on different operating systems, different compilers, and different hardware!
These constants will allow us to compare data structures / algorithms with the same big ‘O’ notation. The
STL list sorting algorithm is also O(n log2 n), but what is the estimate for klist sort?
Be sure to try different random string lengths because this number will impact the number of repeated/
duplicate values in the input. The ratio of the number of input strings to number of output strings is
reported to std::cerr with the operation running time. Which operations are impacted by the number of
repeated/duplicate values? What is the relative impact?
Operation Implementation using Different Data Structures
The provided code includes the implementation of each operation (except longest substring) for the vector
datatype. Your implementation task for this assignment is to extend the program to the other data structures
in the table.
You should carefully consider the most efficient way (minimize the running time) to use each
data structure to complete the operation.
Focus on the first three operations from the table first (sort, remove duplicates, and mode). Once those
are debugged and tested, and you’ve analyzed the relative performance, you can proceed to implement the
Estimate of Total Memory Usage
When you upload your code to Submitty, the autograder will measure not only the running time, but also
the total memory usage. Compare the memory used by the different data structures to perform the same
operation on the same input dataset.
Does the total memory usage match your understanding of the relative
memory requirements for the internal representation of each data structure?
You can also run this tool on your local GNU/Linux machine (it may not work on other systems):
clang runstats.c -o runstats.out
./runstats.out ./perf.out vector sort string < medium_string_input.txt > my_out.txt
Results and Discussion
For each data type and each operation, run several sufficiently large tests and collect the operation time output
by the program.
Organize these timing measurements in your README.txt file and estimate the coefficients
for the dominant term of your Big ‘O’ Notation. Do these measurements and the overall performance match
your predicted Big ‘O‘ Notation for the data type and operation? Did you update your initial answers for
the Big ‘O‘ Notation of any cell in the table?
Compare the relative coefficients for different data types that have the same Big ‘O’ Notation for a specific
operation. Do these match your intuition? Are you surprised by any of the results? Will these results impact
your data structure choices for future programming projects?
You must do this assignment on your own, as described in the “Collaboration Policy &
Academic Integrity”. If you did discuss the problem or error messages, etc. with anyone,
please list their names in your README.txt file. Important Note: Do not include any large test datasets
with your submission, because this may easily exceed the submission size. Instead describe any datasets you
created, citing the original source of the data as appropriate.