Description
1 Assignment Overview
This programming assignment covers an application of algorithms and data structures arising in the implementation of database systems. Databases and data management are covered by later courses, such as CSC 370. In this assignment, you will design and implement an efficient and correct algorithm for aggregation, which is a fundamental operation in relational database systems. Since this is not a database course, the algorithm will operate on simple text-based spreadsheet files instead of a database.
The specification for this assignment can be found in Section 2. Sections 1.1 – 1.3 contain a description and several examples of grouping and aggregation, using the following (fictional) table of student numbers and grades. Table grades id course subject course num grade V00123457 CSC 225 78 V00654322 MATH 110 63 V00654322 CSC 110 75 V00314159 CSC 106 85 V00951413 CSC 106 72 V00951413 MATH 110 68 V00654322 CSC 106 70 V00123456 CSC 110 79 V00314159 CSC 110 47 V00654322 CSC 225 91 V00123456 CSC 106 68 V00123456 CSC 225 89 V00951413 CSC 225 40 V00314159 MATH 100 74 V00654322 MATH 100 71
1.1 Counts and Averages by Student
Consider the task of computing the number of courses taken by each student, or the task of computing the average grade for each student. For both of these tasks, the rows of the table need to be divided into groups by student number (which is the id column of the table). Then, the number
1
of courses per student or average grade per student can be computed by counting or averaging the grade values within each group. Applying a function (such as counting or averaging) to collapse a group of rows into a single row is called aggregation. The table on the left below shows the groupings of rows by the id column, with each group receiving a different colour, and the tables on the right show the counts and averages within each group. Notice that the only columns in the result tables are the grouping criteria (in this case, the id column) and the aggregation result. All other columns in the original table are omitted.
Table grades id course subject course num grade V00123457 CSC 225 78 V00654322 MATH 110 63 V00654322 CSC 110 75 V00314159 CSC 106 85 V00951413 CSC 106 72 V00951413 MATH 110 68 V00654322 CSC 106 70 V00123456 CSC 110 79 V00314159 CSC 110 47 V00654322 CSC 225 91 V00123456 CSC 106 68 V00123456 CSC 225 89 V00951413 CSC 225 40 V00314159 MATH 100 74 V00654322 MATH 100 71 Original table, grouped by id column.
Aggregation (Counting Rows) id count(grade) V00123457 1 V00654322 5 V00314159 3 V00951413 3 V00123456 3 Result of aggregation by counting the number of grade values in each group.
Aggregation (Averaging) id avg(grade) V00123457 78 V00654322 74 V00314159 68.66 V00951413 60 V00123456 78.66 Result of aggregation by averaging the grade values in each group.
1.2 Averages by Subject
Consider the task of computing the average grade within each subject (CSC and MATH). In this case, the grouping criteria is the course_subject column and the aggregation result is avg(grade). The tables below show the grouping and aggregation. As above, notice that the other columns are excluded from the result table.
2
Table grades id course subject course num grade V00123457 CSC 225 78 V00654322 MATH 110 63 V00654322 CSC 110 75 V00314159 CSC 106 85 V00951413 CSC 106 72 V00951413 MATH 110 68 V00654322 CSC 106 70 V00123456 CSC 110 79 V00314159 CSC 110 47 V00654322 CSC 225 91 V00123456 CSC 106 68 V00123456 CSC 225 89 V00951413 CSC 225 40 V00314159 MATH 100 74 V00654322 MATH 100 71 Original table, grouped by the course_subject column.
Aggregation Result course subject avg(grade) CSC 72.18 MATH 69 Result of aggregation by averaging the grade values in each group.
1.3 Averages by Course
Now suppose we want to compute the average grade for each course. It is not sufficient to group by the course_num column, since two courses with the same number may not be the same course (for example, consider CSC 110 and MATH 110). The grouping criteria therefore must include two columns: course_subject and course_num. The tables below show the grouping and aggregation. Again, the only columns in the output table are the grouping criteria and the aggregation result.
3
Table grades id course subject course num grade V00123457 CSC 225 78 V00654322 MATH 110 63 V00654322 CSC 110 75 V00314159 CSC 106 85 V00951413 CSC 106 72 V00951413 MATH 110 68 V00654322 CSC 106 70 V00123456 CSC 110 79 V00314159 CSC 110 47 V00654322 CSC 225 91 V00123456 CSC 106 68 V00123456 CSC 225 89 V00951413 CSC 225 40 V00314159 MATH 100 74 V00654322 MATH 100 71 Original table, grouped by the course_subject and course_num columns.
Aggregation Result course subject course num avg(grade) CSC 225 74.5 MATH 110 65.5 CSC 110 67 CSC 106 73.75 MATH 100 72.5 Result of aggregation by averaging the grade values in each group.
2 Specification
Your task for this assignment is to write a Java program Aggregate which reads a table from a text file and performs grouping and aggregation with one of three aggregation functions: count (count the number of rows in each group), sum (add up all elements in the aggregation column within each group) and avg (compute the average of all elements in the aggregation column within each group). To receive full marks, the entire implementation must be correct, allow any number of columns to be used as grouping criteria, and run in O(nlog2 n) time on a table with n rows. See Section 3 for more details on the evaluation process.
2.1 Data Format
Tabular data can be stored in a variety of ways. One simple format for text-based tables and spreadsheets is the comma-separated value (CSV) format. Files in CSV format normally have the extension .csv or .txt. A CSV spreadsheet consists of a line of text for each row of data, with each column separated by a comma. In this assignment, column headings will be given as the first row of the spreadsheet, and no comma will appear after the last column on each line.
The grades table in Section 1 would be represented in CSV format as follows. id,course_subject,course_num,grade V00123457,CSC,225,78 V00654322,MATH,110,63 V00654322,CSC,110,75 V00314159,CSC,106,85
4
V00951413,CSC,106,72 V00951413,MATH,110,68 V00654322,CSC,106,70 V00123456,CSC,110,79 V00314159,CSC,110,47 V00654322,CSC,225,91 V00123456,CSC,106,68 V00123456,CSC,225,89 V00951413,CSC,225,40 V00314159,MATH,100,74 V00654322,MATH,100,71 Formally, the CSV files used in this assignment will conform to the following specification. • Every CSV file must contain at least one non-blank line, which will contain the column headings. • Blank lines (consisting entirely of whitespace characters such as spaces and tabs) will be completely ignored (and will not count as a row of the table). • You may assume that every line contains the same number of columns (an input file with an inconsistent number of columns per line will be considered invalid). • There is no limit to the number of columns (as long as every line has the same number of columns) or the number of rows in the file. • The data in the file may contain letters, numbers, spaces and punctuation, with one exception: comma characters will only appear as column separators (they may not appear in the data). • The number of columns is equal to the number of commas in the line plus one. The data in a column may be blank. For example, the line ‘Hello,,World,’ contains four columns. The second and fourth columns are blank.
2.2 Aggregate program interface
Your Aggregate program will accept command line arguments in the form $ java Aggregate