Description
1 Overview
This project asks you to evaluate four of the sorting algorithms we have
discussed in class on various inputs and relate their theoretical run time
to their empirical (actual) behavior. These algorithms are SelectionSort,
InsertionSort, MergeSort, and QuickSort. In the interests of time, you have
been provided with code that implements all four algorithms and can run
them on inputs of varying sizes and types.
2 Running the provided program
Along with this document, you have been provided with source code and a
Makefile for the various sorting algorithms and input types to test in this
project. This program will run the specified algorithm (SelectionSort, InsertionSort, MergeSort, or QuickSort) on a sorted, random, or constant input
array of a given size.
I expect that you can use an IDE or the Linux make
utility in order to compile the code. Once compiled, the program should
be run from the command line, as it expects three arguments. If fewer than
three arguments are specified, the program will assume default values for any
unspecified arguments.
The first of these three arguments represents the size of the input array. The program only accepts sizes between 1 and 1,000,000,000, inclusive
(default 10,000).
The second argument represents the sorting algorithm you
wish to run. Valid algorithms include SelectionSort, InsertionSort, MergeSort, and QuickSort (default MergeSort), or you can abbreviate the algorithms by their first letter (‘s,’ ‘i,’ ‘m,’, or ‘q’). The last argument represents
1
the type of input to sort. Valid input types are sorted, random, and constant
(or ‘s,’ ‘r’, ‘c’; default ‘r’), where ‘random’ is an unsorted array, ‘sorted’ is
a sorted array (in increasing order), and ‘constant’ is an array where every
entry is identical.
In order to improve the timing stability, the algorithm runs the requested
sort three times and reports the median of the three timing results to you.
In order to get the most accurate timing results, there should be no other
processes running on the machine at the same time, but this may not always
be possible, especially if you are running the program on a lab machine.
3 Project report
Your project report should be divided into two parts, Results and Analysis.
For the Results section, you will need to prepare a data file describing the
performance of your algorithms, and in the Analysis portion, you will need
to prepare a table describing their complexity, as well as a response to these
results.
3.1 Results
Run each of the four sorting algorithms on constant, sorted, and random
arrays of different sizes. For each of the twelve cases, you should record the
following:
1. nmin: the smallest array size that takes 30 milliseconds or more per run
to sort;
2. tmin: the time to sort an array of size nmin;
3. nmax: the largest array that you sorted
4. tmax: the time required to sort nmax elements.
When trying to find nmin and nmax, I recommend starting with an arbitrary input size and multiplying or dividing the array size by something
like 10 or 2, as this should quickly get you to arrays that are large or small
enough. For nmax, I recommend you stop the program if it takes longer than
30 minutes (˜5–10 minutes per call) and try something smaller. Note that
2
for some of the experiments the time will not get close to 30 minutes; just
use the largest inputs that you were able to sort in these cases.
For nmin and tmin, you do not need to find the smallest array that takes
30 milliseconds to run; anything less than 500 ms (ideally less than 100
ms) is acceptable. Also, you can go below 30 milliseconds, though I would
recommend finding a result above 20 ms, as timing values that are too small
may have significant errors due to rounding.
You may also use different systems to perform the tests for different algorithm/input combinations, but obviously, you’ll want all of the runs for a
single experiment (e.g., testing MergeSort on random data) to be on the same
machine, as computing a timing ratio with results from different machines is
virtually meaningless.
Lastly, increasing your stack size before testing QuickSort can help to
sort larger arrays without crashing. If the recursion depth exceeds the
maximum stack size, the program will crash due to “stack overflow”. Different systems may handle this error differently. For example, the program may halt with no message, or it may “hang” without terminating.
You can run ulimit -s unlimited in Linux before executing the algorithm
to increase the available stack space. For g++, the command line options
-Wl,–stack_size,0x20000000 or -Wl,–stack,0x20000000 may work to
set the stack size to 1 GB. The stack size can also be changed in many IDEs
on Mac and Windows; consult the appropriate documentation. Try to play
with it a bit, but ultimately, just use the largest array that you were able to
run successfully; an nmax of 200 million should be sufficient to show how the
time complexity increases if you’re not able to run the program on an array
of size 1 billion.
You should enter your results into a comma-separated value (CSV) file.
The CSV file should contain 5 columns and 13 rows. Your first column should
label the 12 different experiments, while the first row labels the experiment
variables (nmin, tmin, nmax, and tmax).
Your row labels should include the
algorithm name (SelectionSort, InsertionSort, MergeSort, or Quicksort) and
input type (Sorted, Random, or Constant). You may abbreviate these labels
as S, I, M, Q, and S, R, C. (For example, SS represents your SelectionSort
result on a sorted array.) An example table appears below. You may use
Excel (or any other software) to prepare your data.
3
nmin tmin nmax tmax
SC
SS
SR
IC
IS
IR
MC
MS
MR
QC
QS
QR
3.2 Analysis
In this section, you will estimate the complexity of the four algorithms by
comparing the ratio between tmin and tmax to ratios representing the complexity of the algorithm. Specifically, you should compute f(nmax)/f(nmin)
for f1(n) = n, f2(n) = n ln(n), and f3(n) = n
2
. You should round to the
nearest integer when computing these ratios. Finally, you should label each
experiment according to the ratio tmax/tmin most resembles.
For example, if one of your experiments resulted in nmin = 100 and nmax =
10, 000, 000, your ratios would be:
f1(nmax)/f1(nmin) = 10, 000, 000/100
= 100, 000
f2(nmax)/f2(nmin) = 10, 000, 000 ln(10, 000, 000)/(100 ln(100))
= 350, 000
f3(nmax)/f3(nmin) = 10, 000, 0002
/1002
= 10, 000, 000, 000
These three ratios represent the timing increase if the complexity of your
code was exactly n, n lg n, or n
2
, and you should pick out which of these three
ratios is most similar to your actual time increase (tmax/tmin). In reality, the
growth rate of the timing function will include some lower order terms and
may depend on other factors that are hard to model, like caching behavior,
so do not expect the numbers to line up exactly.
As part of your report, you should create a chart that includes the computed ratios as well as the behavior of the algorithm (Linear, n lg n, or
4
Quadratic), across all 12 experiments. An example chart appears below:
tmax/tmin n ratio n ln(n) ratio n
2
ratio Behavior
SC
SS
SR
IC
IS
IR
MC
MS
MR
QC
QS
QR
In addition to the table, you should write a brief summary of (1) how your
results compare to the theoretical analysis for the four algorithms (below),
and (2) why your results make sense or are surprising. You should write a
summary for each experiment. You should spend more time explaining your
results when they are unusual or unexpected.
Best-case Average-case Worst-case
complexity complexity complexity
SelectionSort Θ(n
2
) Θ(n
2
) Θ(n
2
)
InsertionSort Ω(n) Θ(n
2
) O(n
2
)
MergeSort Θ(n lg n) Θ(n lg n) Θ(n lg n)
QuickSort Ω(n lg n) Θ(n lg n) O(n
2
)
4 Submission
For this project, you should submit a zip archive containing (1) a CSV file
containing your results (described in Section 3.1), and (2) your tables and
analysis (described in Section 3.2), in PDF format.
Note: This is an individual project. You are not allowed to submit work
that has been pulled from the Internet, nor work that has been done by your
peers.
Your submitted materials will be analyzed for plagiarism. Project 1
will be evaluated out of 50 points:
5
5 Grading
Data file containing results 15 points
Table with ratios 15 points
Analysis 20 points
Requirements for each portion of the grade are described in Sections 3.1
and 3.2.
6