Description
SENG265 ASSIGNMENT 1
1 Assignment Overview This assignment involves writing a command line application to process streams of text from a file. The program will read lines of text from a given file, compute the frequency of words of certain length from the file and print these frequencies to standard output. The overall goal of this assignment is to introduce C programming in a unix setting, with particular emphasis on C array and string processing. Your eventual submission will consist of the source files for the program, accompanied by test cases. There are three parts to this assignment and Sections 1.2, 1.3 and 1.4 below contain Specifications for each of the three programs. Section 3 describe the Constraints you should consider in your implementation, and Section 2 describes the Testing component. What you should Submit is outlined in Section 4 and the Evaluation scheme is given in Section 5. Your code is expected to compile without warnings in Senjhalla (the virtualbox+vagrant virtual machine) using the -Wall -g and -std=c99 compile flags with the gcc 9.3.0 compiler. Because a later assignment will test your knowledge of dynamic memory allocation, you are asked to not use dynamic memory allocation for this assignment, as it is useful to learn how to rely exclusively on automatic allocation. 1.1 Assignment Package These instructions assume you have completed Lab 2 and you have cloned your git repository to the home directory in your Senjhalla virtual machine. If you have not yet done so, refer to the Lab 2 video and slides. Download the assign1.zip package from Connex and unzip it into the a1 folder of your git repository. The folder structure is shown in Fig. 1. Add your code to the src files provided with your solution to parts A, B, and C. You may create additional files in src but do not rename word_count.c. A suggested template has been provided for you in word_count.h. cases hold the test files for parts A, B, and C. The input files have the format t*.txt and the corresponding expected output format is c*.txt. For example, the expected output for a solution to Part B, using input t04.txt, will be found in cases/B/c04.txt. 1 Figure 1: Assignment package The test folder contains the framework that will be used for evaluation. See section 5 for details on how to run the tests. You may add your own tests to the existing tests, but they will not be used in your evaluation. 1.2 Part A. Frequency of words of all lengths The first part of the assignment is to write a C program, contained in the source files called word_count.c and word_count.h, which counts the number of words of all lengths. You have been provided skeleton template files in the assignment package. The program must compile, with no warnings, and run using the following commands: $ gcc -Wall -std=c99 -o word_count word_count.c $ ./word_count –infile After compiling, a correct implementation will take the name of a word list file as a command line argument and output the frequency of words of all lengths in that file, e.g in the form of a function Count[arg] where arg is the length of the word. For example, consider the following as input file input_file.txt: Tomorrow, and tomorrow, and tomorrow, To the last syllable of recorded time; 2 There are 2 words of length 2: to and of, and so Count[2]=2; There are 3 words of length 3: and (twice) and the, and so Count[3]=3; There are 2 words of length 4: last and time, and so Count[4]=2; And finally there are 5 words of length 8: tomorrow (3 times), syllable and recorded, and so Count[8]=5; Therefore the complete output of the program should be: Count[02]=02; Count[03]=03; Count[04]=02; Count[08]=05; A note on output formatting: For all three parts of the assignment the outputs could be rendered as single 0 padding, e.g. Correct: Count[08]=05; or Incorrect: Count[8] = 5; Incorrect: Count[08]= 5; or even Incorrect: Count[8] = 0005; Tests associated with this part are located in cases/A and test/test_A.h. 1.3 Part B. SORTED Frequency of words of all lengths The second part of the assignment implements the same program as in Part A but the output is sorted by frequency of words, and outputted in descending order of frequency. You will also output the median word length. In the case of an even number of word lengths, take the average between the two middle bucket word lengths. Do not print the median if –print-words argument is also provided (as in Part C). Add an additional optional argument to your Part A code (i.e. do not create a new C source file), that will be run as shown. You cannot assume that the arguments will be run in this order. $ ./word_count –sort –infile For example, for the same input file as above, the output should be: Count[08]=05; Count[03]=03; Count[02]=02; Count[04]=02; Median word length: 3.5 3 In the case of a tie (as shown above) then you do a secondary sort based on the word length bucket/bin value. As in the example, Count[02] is a smaller word length than Count[04] so it is sorted above the longer world length. Tests associated with this part are located in cases/B and test/test_B.h. 1.4 Part C. SORTED Frequency of words of all lengths with Words information The third part of the assignment adds in the option to display the unique words found for each word length in alphanumeric order. Add an additional optional argument to your Part A & B code (i.e. do not create a new C source file), that will be run as shown. You cannot assume that the arguments will be run in this order. $ ./word_count –sort –print-words –infile For example, for the same input file as above, the output should be: Count[08]=05; (words: βTomorrowβ, βrecordedβ, βsyllableβ and βtomorrowβ) Count[03]=03; (words: βandβ and βtheβ) Count[02]=02; (words: βToβ, βofβ and βtoβ) Count[04]=02; (words: βlastβ and βtimeβ) Tests associated with this part are located in cases/C and test/test_C.h. 2 Test Inputs You should test all of your programs with a variety of test inputs, covering as many different use cases as possible, not just the test input provided. You should ensure that your programs handle error cases (such as files which do not exist) appropriately and do not produce errors on valid inputs. Since thorough testing is an integral part of the software engineering process, you will be expected to submit one test input. You have been provided a set of 10 test input files (t01.txt to t10.txt) and 21 (c01.txt to c07.txt) expected output files (7 for each part) located under the cases folder. Your code also needs to be able to handle basic user error. See test/test_input.h for expected behaviour when handling incorrect command-line arguments. Do not use exit() to exit the program. You have also been provided a testing framework that allows you to run these tests using a makefile. To run the makefile: $ make test This will compile and run the test framework in test and create the binary tests.out. You can re-run the tests using ./tests.out. If you modify you code, you will need to re-compile using make first. 4 The output is formatted as follows: TEST_FILE:TEST_FILE_LINE_NUMBER:TEST_FUNCTION:PASS_OR_FAIL_OR_IGNORE Expected: βTEXT_FROM_TEST_CASEβ; Actual : βOUTPUT_FROM_YOUR_PROGRAMβ;. ==> INPUT_FILE Fig.2 shows an example of the test output. When a test fails, you are provided with the function that failed, the truncated Expected (green) output from the c*.txt file and the Actual (red) output from your program, at the point it failed. You can turn tests off while testing by uncommenting TEST_IGNORE() in test function(s) located in test_A.h, test_B.h, test_C.h. There are 9 tests that are ignored that will used during the evaluation of your code. In the provided example, out of the 34 possible tests, 6 tests have failed, and 9 tests ignored. 3 Constraints You are NOT allowed to use malloc/realloc for this assignment. Therefore, you can make the following assumptions in your code: β’ The maximum file size is 5000 bytes β’ The maximum words in a file is 750 β’ The maximum word size is 34 bytes β’ Lower case and upper case words should be treated as different words (i.e. Tomorrow and tomorrow are printed separately) β’ The only allowed special characters to be included in the input file are .,;(). No other special characters are expected to be included in the input file. β’ In the case of a tie in the word frequency/count, then sort normally by word length. β’ Do NOT use exit() to exit the program on an error. 4 What you must submit β’ C source-code name word_count.c and other files which contains your solution for Parts A, B and C of Assignment #1 located in src β’ Ensure your ALL work is committed to your local repository and pushed to the remote before the due date/time. (You may keep extra files used during development within the repository.) 5 Evaluation The teaching staff will primarily mark solutions based on the input files provided for this assignment. Students must adhere to the software requirements (command execution and output formatting) outlined in this assignment. For each assignment, some students will be randomly selected to demo their code to the course markers. 5 Figure 2: Example of test results with some failing cases. 6 Each student will have this opportunity to demo at least one assignment during the semester. Sign-up procedure will be discussed in class. Our grading scheme is relatively simple. β’ βAβ grade: A submission completing ALL Parts and all requirements of the assignment and all tests pass. The word_count programs runs without any problems. β’ βB+β grade: A submission that completes part A & B of the assignment and all tests pass. The word_count programs runs without any problems. β’ βB-/Bβ grade: A submission that completes part A of the assignment and all tests pass. The word_count programs runs without any problems. β’ βCβ grade: A submission that completes part A of the assignment and passes some tests. The word_count programs runs with some problems. β’ βDβ grade: A serious attempt at completing requirements for the assignment. The word_count program compiles and runs with some problems. β’ βFβ grade: Either no submission given (or did not attend demo); submission represents very little work or understanding of the assignment. 7
SENG265 ASSIGNMENT 2
1 Assignment Overview This assignment involves writing a program to solve the same problem as Assignments 1 except using Python 3. As a reminder, your programs will read lines of text from a given file, compute the frequency of words of certain length from the file and print these frequencies to standard output. The overall goal of this assignment is to familiarize yourself with programming in Python. Section 2 contains the Specifications for the program. Section 3 describes the (new) Constraints you should consider in your implementation, and Section 4 describes the Testing component. What you should Submit is outlined in Section 5 and the Evaluation scheme is given in Section 6. Your code is expected to run without warnings in the course virtual machine Senjhalla using Python 3.9.6. 2 Implementation The assignment involves writing a program to solve the same problem as Assignments 1 except using Python 3. It includes optional arguments to sort and print the words of the same length, exactly as in Assignment 1. $ python word_count.py –sort –print-words –infile Recall that there are four possible options to run the program with these arguments: $ python word_count.py –infile or, $ python word_count.py –sort –infile or, $ python word_count.py –print-words –infile or, $ python word_count.py –sort –print-words –infile You may assume that if exists, it is passed immediately after –infile. You cannot assume that the arguments will be run in this order. The same test files will be used during the evaluation so please review the expected output formatting. 1 3 Constraints You may only use python3 modules that are installed using the Vagrantfile for your VM Senjhalla. If a python3 module is not installed, you may not use that module in your code. 4 Test Inputs You should test all of your programs with a variety of test inputs, covering as many different use cases as possible, not just the test input provided. You should ensure that your programs handle error cases (such as files which do not exist) appropriately and do not produce errors on valid inputs. Since thorough testing is an integral part of the software engineering process, you will be provided test input. Provided For You There will a assign2.zip package available on Connex. Download and extract to the a2 folder in your git repository. The folder will consist of: cases, test, run_all.sh, and src/word_count.py. Do not move word count.py from the src directory. Like with Assignment 1, do not modify the main() function description. It is required for the test framework to βhookβ into your solution. For this assignment, you will be using the same test cases as from Assignment 1. You are also provided a test framework to test your code. We will be using the python unittest module to run tests of the code, in the same manner as in Assignment 1. These test files are located in the test folder. There are three test files test_A.py, test_B.py, and test_C.py, corresponding to the parts A,B,C from Assignment 1. There will be no tests provided for argument handling but your solution is expected to work as described in Section 2. Your solution is expected to work for all 4 permutations of –sort, –print-words and –infile. To run an each part (A,B,or C) individually: python test/test_A.py python test/test_B.py python test/test_C.py A BASH script has been provided for you in order to run all the tests: bash run_all.sh Deciphering test output In Fig. 1, we see an example of the results for python test/test_B.py. The summary shows that the solution passed the first 3 three tests, ok. The FAIL indicates that the solution ran without error, but produced incorrect output. An AssertionError is returned by the test framework indicated the two strings (expected and actual output) do not match. The ERROR indicates that the solution could not run to completion, and exited due to an error (exception). The test framework will show the error message and code line number where the error occurred. 2 Figure 1: Test framework summary output Figure 2: Test framework failed test output Fig. 2 shows the result of a FAIL test. The test framework will show you the βdiff outputβ between the expected and actual results. In the βdiff outputβ, you may see these symbols: β-β, a β+β, β?β, or a βΛβ. β’ – (minus) : this line was expected but not produced by your output (you need to add it) β’ + (plus) : this line was produced by your output, but not expected (you need to remove it) β’ ? : the test framework has a suggestion (youβre almost right) β’ ^ : indicate you need to change a character In Fig. 2 we can see that for test_t05_multi_line weβve added the bin Count[02] in the wrong spot and the median calculation is incorrect. (HINT: You can also use diff to compare your results with the provided expected output) 5 What you must submit β’ Python source-code name word_count.py which contains your solution for assignment #2. 3 β’ Ensure your work is committed to your local repository in the provided a2 folder and pushed to the remote before the due date/time. (You may keep extra files used during development within the repository.) 6 Evaluation The teaching staff will primarily mark solutions based on the input files provided for this assignment, as well as additional solution files not provided. Students must adhere to the command execution and output formatting outlined in this assignment. For each assignment, some students will be randomly selected to demo their code to the course markers. Each student will have this opportunity to demo at least one assignment during the semester. Sign-up procedure will be discussed in class. In addition to automated testing, your code will be evaluated based on: β’ Proper handling of different command line argument passing β’ Good coding practices (i.e. good variable/function naming, use of functions when appropriate, limited globals, etc.) Our grading scheme is relatively simple. β’ βA+β grade: A submission completing ALL Parts and all requirements of the assignment and all tests pass. The word_count.py programs runs without any problems. β’ βA/A-β grade: A submission completing ALL Parts and all requirements of the assignment and test 1-7 pass. Most of the 8-10 tests pass. The word_count.py programs runs without any problems. β’ βB+β grade: A submission that completes ALL parts of the assignments and tests 1-7 pass. Some of the 8-10 tests pass. The word_count.py programs runs without any problems. β’ βB-/Bβ grade: A submission that completes part A & part B of the assignment and all tests (1-7) pass. Most or some of the tests 8-10 pass. The word_count.py programs runs without any problems. β’ βCβ grade: A submission that completes part A of the assignment and passes all of the tests (1-7). The word_count.py programs runs with some problems. β’ βDβ grade: A serious attempt at completing requirements for the assignment. The word_count.py program compiles and runs with some problems. β’ βFβ grade: Either no submission given (or did not attend demo); submission represents very little work or understanding of the assignment. 4
SENG265 ASSIGNMENT 3
1 Assignment Overview This assignment involves writing a program to solve the same problem as Assignment 1 Part C except by using dynamic memory allocation and linked lists techniques. As a reminder, the program will read lines of text from a given file, compute the frequency of words of certain length from the file and print these frequencies to standard output. The overall goal of this assignment is to use the dynamic-memory available in C (i.e. malloc(), realloc(), etc.) as well as the concept of linked lists in C. You may only use dynamic-memory for any arrays. For example, you may not use int array[n] syntax. Similar to Assignment 1, your eventual submission will consist of the source file(s) for the program. There are three parts to this assignment and Sections 1.2, 1.3, 1.4 below contain Specifications for each. These sections do not necessarily need to be completed sequentially, but to be developed collectively to produce a final solution. Section 2 describes the (new) Constraints you should consider in your implementation, and Section 3 describes the Testing component. What you should Submit is outlined in Section 4 and the Evaluation scheme is given in Section 5. Your code is expected to compile without warnings in the vagrant VM Senjhalla using the -Wall -g and -std=c99 flags with the gcc 9.3.0 compiler. 1.1 Data Type Constraints In your previous C assignment, you were not allowed to use dynamic memory allocation methods, such as malloc, calloc, realloc, etc. For this assignment, you are NOT allowed to create any static-sized arrays using the syntax int array[10]. In other words, you are not allowed to store any arrays in stack memory. You must use dynamic memory allocation (as discussed in the lecture and labs) to store your arrays in heap memory. To achieve full marks on this assignment, your solution must use a linked list as your data structure for: 1. Your histogram of word frequencies, and; 2. Your list of words in each histogram bin/bucket; 3. Your linked list must be dynamically allocated; 1 The additional behaviour expected by your program is outlined in the next two sections. 1.2 Part A. Printing Words Information The requirements for this assignment are divided into three parts. The first part of the assignment is to write a C program, contained in a source file called word_count.c, which counts the number of words of all lengths. In this assignment you will only implement Part C from Assignment #1 using linked lists and dynamic memory allocation (e.g. malloc). To achieve full marks on this assignment, your solution must use a linked list as your data structure for: 1. Your histogram of word frequencies, and; 2. Your list of words in each histogram bin/bucket; You will output the histogram bin/buckets in the same order as Assignment #1 – Part A (i.e. order by word length bucket/bin size). The parameter –print-words will not be used as it is unnecessary: that is the only behaviour expected by the program. Letβs look at an example that uses input file input.txt as shown below. input.txt: Tomorrow, and tomorrow, and tomorrow, To the last syllable of recorded time; The program can be run in the following manner: $ ./word_count –infile This will produce output that matches the behaviour of assignments 1 and 2 (i.e. using –print-words without –sort). The histogram buckets/bin are printed in ascending order based on the word length. The unique words in each histogram are printed in ascending alphanumeric order. The expected output is shown below, for input file input.txt: Count[02]=02; (words: βToβ and βofβ) Count[03]=03; (words: βandβ and βtheβ) Count[04]=02; (words: βlastβ and βtimeβ) Count[08]=05; (words: βTomorrowβ, βrecordedβ and βsyllableβ) A second parameter –sort is used to change the word list order to descending alphanumeric order. The program can be run in the follow manner: $ ./word_count –infile –sort , or $ ./word_count –sort –infile 2 The expected output is shown below, for input file input.txt: Count[02]=02; (words: βofβ and βToβ) Count[03]=03; (words: βtheβ and βandβ) Count[04]=02; (words: βtimeβ and βlastβ) Count[08]=05; (words: βsyllableβ, βrecordedβ and βtomorrowβ) 1.3 Part B. Error Handling For this assignment, you will be returning error codes when there is an error. These are codes that are sent to the operating system that indicate the state of the program when it finished. In Linux and UNIX the error code standard is zero (0) to indicate the program completely successfully, and 1-255 for any other state. In C, the error code is the integer returned from main or using exit(return_code). You can see the error code by typing echo $? in the terminal, after running a command. Your solution must return the following error codes: 1. Invalid command line arguments: return code = 1; 2. Unable to open –infile : return code = 2; 3. Unable to allocate memory: return code = 3; 4. All error messages should be sent to stderr (and not stdout); You may assume that if exists, it follows –infile. Your solution may ignore invalid switches (i.e. –bad-param). There is no standard for the each error output message, but all error messages should be sent to stderr (i.e. use fprintf). 1.4 Part C. Memory Handling and Debugging When you use dynamic memory allocation, the programmer must release the memory directly using free(). Memory that is not released once it is no longer in use is called a memory leak. For full marks, your solution must produce no memory leaks. In addition to gdb you can use valgrind for debugging memory errors. 1.4.1 Memory Leaks If you have a pointer to a block of memory you allocated, and the pointer is deleted but the block is still tied up (i.e. you have it reserved), itβs called a memory leak. This can happen if, for example, the pointer is a local variable inside a function, which returns without free()ing the associated block of memory. If enough memory becomes tied up in memory leaks, your program could run out of memory and crash. Or, the virtual memory keeps expanding and you run out of memory for your computer! Once you free() a pointer to a block of memory, trying to free() it again will cause an error. Once you allocate some memory, C still doesnβt check if youβre staying within the bounds of the new block; thatβs up to you. 3 1.4.2 Valgrind Valgrind is a special debugger program that checks your program for memory allocation errors. Your program needs to be compiled with -g. Refer to the makefile provided to you. Examples of how to use vagrind: valgrind ./word_count –infile cases/t03.txt, or valgrind –leak-check=full ./word_count –infile cases/t03.txt Your program will run a bit slower than normal, but afterwards, Valgrind will print out a Leak Summary report on any memory leaks or allocation errors. If your program stops with a segfault, Valgrind will tell you where it is. –leak-check=full provides additional information for debugging. To achieve full marks, proper solution will contain no memory leaks (0 bytes). In the Leak Summary report, we will be considering any definitely loss bytes as a memory leak. Any other potential leaks will be ignored during marking. A full breakdown of how to interpret the output of valgrind can be found here. 2 Constraints You are allowed to use any code provided during the lectures or the labs. Code snippets from third-party sources are allowed if appropriately attributed (i.e. include the URL link in the comments of your code). If you are unsure, ask your lab teaching assistant. β’ You cannot assume a maximum length for an input file; β’ You cannot assume a maximum number of lines in a file; β’ You cannot assume a maximum number of words in a file; β’ You must*alloc() (malloc, realloc, calloc, etc.) to allocate memory any arrays or structs; β’ You must use a linked list as your histogram data structure; β’ You must use a linked list as your histogram word list data structure; β’ The only allowed special characters to be included in the input file are .,;(). No other special characters are expected to be included in the input file. 3 Test Inputs For this assignment, you will not be provided a test framework. You are expected to evaluate and test your code using the skills you have learned so far in the course. You may re-use the test framework from assignment #1, however, be aware that you might experience heap fragmentation or your code has memory leaks which may cause malloc() or realloc() failures. The program must compile, with no warnings, and produce no memory leaks. A makefile that when run using make creates the executable word_count has been provided for you. This makefile is in the same format as the one used for evaluation. 4 You may use make word_count to compile your word count.c and any additional C-files required for your solution into a word_count executable. This command will remove any existing word_count executable, and compile the code located in your a3/src folder. This will be the equivalent of running: $ gcc -Wall -g -std=c99 -o word_count word_count.c You will receive 8 test input files used to evaluate assignment 3. Tests t0[1-7].txt are the same tests used in assignments 1 and 2. An additional test valgrind.txt has been provided to allow you to test your program with large input for memory leaks. You may use make valgrind to compile and run: valgrind word_count –infile cases/valgrind.txt [–sort]. Both with and without the –sort parameter is run. Three additional test input files will be used during evaluation that will test the robustness of your solution. These 3 tests will test: β’ A large file size that exceeds the previously used maximum file size; β’ A file with the number of lines that exceeds the previously used maximum line count; β’ A file with the number of words that exceeds the previously used maximum word count; You should test all of your programs with a variety of test inputs, covering as many different use cases as possible, not just the test input provided. You should ensure that your programs handle error cases (such as files which do not exist) appropriately and do not produce errors on valid inputs. Read the assignment carefully to ensure you understand all the requirements of the program. Provided For You For this assignment, you will be provided an assign3.zip containing an empty word_count.c file, makefile, the 7 test inputs and corresponding expected output. An additional test input file valgrind.txt for valgrind memory leak testing. The expected output from part A are located in the folders A (ascending) and D (descending). Folder (A)scending contains output from: $ ./word_count –infile Folder (D)escending contains output from: $ ./word_count –sort –infile 5 4 What you must submit β’ C source-code name word_count.c located in a3/src which contains your solution for Assignment #3. β’ Any additional C (*.c or *.h) files required for your solution in a3/src β’ Ensure your work is committed to your local repository in the provided a3 folder and pushed to the remote before the due date/time. (You may keep extra files used during development within the repository.) 5 Evaluation The teaching staff will primarily mark solutions based on the input files provided for this assignment. Students must adhere to the software requirements (command execution and output formatting) outlined in this assignment. For each assignment, some students will be randomly selected to demo their code to the course markers. Each student will have this opportunity to demo at least one assignment during the semester. Sign-up procedure will be discussed in class. In addition to automated testing, your code will be evaluated based on: – Proper error handling (i.e. memory allocation handling); – Good coding practices; – Adherence to the assignment constraints Good Coding Practices: β’ Your code is properly structured into functions, without too much functionality in any one function; β’ Your variables names are descriptive and easy to infer their functionality; β’ Your function names are descriptive and easy to infer their functionality; β’ Program properly checks for possible error conditions; β’ Program properly checks for user error; β’ Your code doesnβt contain (nearly) duplicate code in different functions; β’ Global variables are used sparingly, and appropriately; β’ Your code is fairly efficient and executes in a reasonable time (i.e. roughly less than 3 minutes); Assignment Requirements 1. All arrays must be allocated using dynamic memory allocation; 2. Word histogram is constructed using a linked list; 3. The word list for each histogram bucket/bin is constructed using a linked list; 4. Your code returns the correct error codes; 5. Your code passes the provided tests (1-7); 6. Your code passes the 3 evaluation tests (8-10); 7. A complete solution will have no memory leaks (as described in Section 1.4) for valgrind.txt; Our grading scheme is relatively simple. See above for requirements. Not following good coding practices will lower your grade. β’ βA/A+β: A submission completing ALL requirements of the assignment. β’ βA-β: A submission completing 1-6 requirements of the assignment. Some or minimal memory leaks in the program. 6 β’ βB/B+β: A submission completing 1-5 requirements of the assignment. β’ βB-β: A submission completing 1-4 requirements. Non-trivial test cases (5-7) pass. β’ βC+β: A submission completing 1-4 requirements. Trivial test cases (1-4) pass. β’ βCβ: A submission completing 1-3 requirements. Some test cases pass and some of the correct error codes are turned. β’ βDβ: A submission that fails to complete the base requirements of the assignment. Trivial or lack of use of linked lists. Dynamic memory is used, but some arrays are also used. No attempt was made to free dynamic memory. β’ βFβ: A submission that fails to complete the base requirements of the assignment. Either no submission given (or did not attend demo); submission represents very little work or understanding of the assignment. 7
SENG265 ASSIGNMENT 4
Assignment Overview This assignment involves writing an interactive program to decode a data set of encrypted ferry schedules from BC Ferries. The program will interactively prompt the user for input which will be used to decrypt and process the appropriate files. The overall goal of this assignment is to familiarize yourself with regex and Python classes. The Section 3 Implementation describes the program and class you are to implement. Section 4 describes the Constraints you should consider in your implementation, and Section 5 describes the Testing component. What you should Submit is outlined in Section 6 and the Evaluation scheme is given in Section 7. Your code is expected to run without warnings in the virtual machine Senjhalla using Python 3.6.9. Read the assignment carefully to ensure you understand all the requirements of the assignment. The assignment might seem large and overwhelming at first (if you are new to Python); take a deep breath and read again, detailed explanations exist in this specification. The assignment is also very modular and individual components can be attempted separately 1.1 Terminology β’ (Shift) Cipher: Encryption algorithm β’ Encrypted or encoded: Text that has been transformed using the algorithm described in section 2; β’ Decrypted or decoded: Original text recovered from the encrypted version, using a password or key; β’ Password or key: The cipher key used to decrypt or encrypt a text; 2 Encrypted Data There are 3 data files in ascii-text format included with the assignment using the naming convention ferryX.out. These files contain information from BC ferries regarding their sailings. The original CSV files have been encrypted using a basic shift cipher implementation (described below). You may assume that every original CSV is properly formatted and every row has the same number of fields. The first row will be the CSV file headers. These files can be decoded into CSV format with the following headers: 1 β’ departure terminal β’ arrival terminal β’ vessel name β’ scheduled departure year β’ scheduled departure month β’ scheduled departure day β’ scheduled departure hour β’ scheduled departure minute β’ actual departure year β’ actual departure month β’ actual departure day β’ actual departure hour β’ actual departure minute β’ arrival year β’ arrival month β’ arrival day β’ arrival hour β’ arrival minute 2.0.1 Encryption Algorithm You are provided three files that were encoded using a shift cipher. As an encryption algorithm with a particular key and alphabet, the shift cipher works as follows to encode the given input text: For each character of the input text and the corresponding character of the password (key), add their value to obtain the shift key. The value is determined by the location or βrankβ (starting at 0) in the encryption alphabet. If the value is too large is not long enough to map directly the alphabet, than take the modulo of the length of your alphabet (i.e. βwrap aroundβ). The encoded text is the character from the encryption alphabet at index value. Consider the following simplified example of the algorithm: For the input text: HELLO WORLD For the key: ZIP For the encryption alphabet: [A-Z], β β (27 characters) First character: H (value=7 in the alphabet) First key: Z (value=25 in the alphabet) Shift key: 7 + 25 = 32 modulo 27 = 5 Encoded character = F (alphabet[5]) Fifth character: O (value=14 in the alphabet) Fifth key (5mod3 = 2): I (value=8 in the alphabet) Shift key: 14 + 8 = 22 Encoded character = W (alphabet[22]) This process is repeated for each character in the input text until the entire input is encrypted. For this example, the full encrypted text is FM JWOUWFJL. 2 The files were encrypted using the following encryption alphabet (in this order: lowercase and uppercase letters, digits, punctuation, space and newline. The alphabet is included for you in the decoder.py file provided to you. Please note that the punctuation is not preserved (i.e. it is also encrypted). The passwords (keys) are provided for you with the format inX.txt. The first line is the encrypted file (i.e. ferry1.out) and the second line is the encryption password or key used to encrypt the file. Below is an example of a inX.txt file: cases/ferry1.out A!00b12$ In order to decrypt the files, you need to perform the inverse (reverse) of the encoding process. HINT: You can use the inX.txt files (and any you create yourself) to automate your testing using redirection. 3 Implementation For this assignment you are expected to write an interactive Python program, contained in a source file called decoder.py, which calculates ferry delays for a CSV file containing ferry schedule data from BC ferries. The program must run, with no errors, using the following commands: $ python3 decode.py 3.1 Program Flow Your program will follow this general process: 1. Ask user for a encrypted filename; 2. Check if file exists, if not, prompt user again; 3. The password (key) to decode the file; 4. Check if the password is in a valid format, if not, prompt the user again; 5. Check if the provided valid password successfully decrypts the provided file, if not, prompt the user for the password again; 6. Iterate through the decrypted rows; 7. Calculate the average monthly delay between scheduled time and when the ferry actually leaves the port; 3.2 User Input The program is required to prompt the user for the encrypted file to decode and the password (key) to use. You must also validate the data upon input. If the data is invalid notify the user of their error and prompt them to enter their selections again. Follow the steps below: The user should also be able to enter q for quit at either prompt. 3 3.2.1 Check Password Validity You will be using regular expressions to determine if the password is in a valid format. A valid password must have: β’ At least 1 uppercase letter; β’ At least 2 numerical digits; β’ Exactly 2 special characters, !@#$&*-_ β’ Password length of 6-8 characters; Use the python module re to determine if the password is a valid password. You may use multiple regular expressions (HINT: it is possible to use only one regular expression), but you can only use a regular expression to validate the password (key) format. For example, you may not use string.count() or subtring in alphabet (this list is not inclusive). Even if the provided password (key) is a valid password, it may not be the password used to encrypt the provided file. Check if the provided valid password successfully decrypts the provided file, if not, prompt the user for the password again. See the next section ( 3.3) for details on the FileDecoder class you are to implement which will decrypt the encrypted file. 3.3 FileDecoder You will create an iterable class called FileDecoder in a file called cipher.py. This class will (a) read the decrypted file, (b) decrypt the file, (c) check if the password successfully decrypted the file, and (d) return each decrypted CSV row iteratively. For evaluation purposes, we will create and use your class separately (i.e. outside of decoder.py). 3.3.1 Constructor Your class will take in three string variables: β’ key: password (key) provided by user; β’ filename: file path to the encrypted file for your class to decode; β’ alphabet: encryption alphabet used for encoding; The following is a simplified example of the constructor call: file_decoder = FileDecoder(key=”ABC”, filename=”file.out”, alphabet=”ABC”) You may use inheritance to extend existing Python classes. file_decoder will be used in the following examples to represent an FileDecoder object. 3.3.2 Decryption Your class will load and decrypt the provided encrypted file using the provided key (password) and alphabet. See Section 2 for details on the encryption algorithm (cipher). 4 3.3.3 Exceptions Your class should raise a custom DecryptException if the password provided does not successfully decode a valid CSV file (as described in Section 2). Your decoder.py solution should successfully handle the exception and continue the functionality described in Section 3.1. 3.3.4 Additional Required Functionality Python provides built-in functions that you may override to provide certain behaviours to your classes. With this in mind, program your class to perform the following functionality. print By default, if you use print(object) on a class object, it will return a class description. For example: This is not very helpful. Instead, we can use __repr__ to provide un-ambiguous description of our class, or __str__ to provide human-readable information about our class. The print method will check for these two functions. Your class should return a unambigious human-readable message string containing the key and filename used in the constructor when print(your_class_object) is used. Example output for print(file_decoder) usage: FileDescriptor(key=βA00!$aβ, file=βferry1.outβ) len You have used the len() function in the past to get the length of a list or a string. Python provides the function __len__ to allow you provide this functionality to your class. Your class should return the number of entries (rows or lines) in the decoded CSV file (including the header). All entries (list or tuple) should be the same size. Expected output for len(file_decoder) usage: 498 Iterable Class Your class will be an iterable class. An example of an iterable class is TextIOWrapper (created when using open(“file.txt”, “r”)). For assignment #2, you may have iterated over a file line by line using a for-loop. Your class will have the same functionality, only it will return a decoded row from the decrypted file that you must implement. Your program will iterate over your class object, returning a list or tuple object containing the contents of a single row from the decoded CSV file. Example iterator usage with the FileDecoder object: 5 for decoded_csv_row in file_decoder: print(decoded_csv_row) break Example of expected output: [βdeparture_terminalβ, βarrival_terminalβ, βvessel_nameβ, βscheduled_departure_yearβ, βscheduled_departure_monthβ, βscheduled_departure_dayβ, βscheduled_departure_hourβ, βscheduled_departure_minuteβ, βactual_departure_yearβ, βactual_departure_monthβ, βactual_departure_dayβ, βactual_departure_hourβ, βactual_departure_minuteβ, βarrival_yearβ, βarrival_monthβ, βarrival_dayβ, βarrival_hourβ, βarrival_minuteβ] 3.4 Program Output You will process the output provided by your FileDecoder (see Section 3.3). The decrypted file will contain information on BC Ferries on their sailing schedules for some months. You will calculate the average departure delay for each month in the file. These files will have data for some, but not all, months. Ignore months (i.e. do not output) for which there is no data. For each month in the file, output the line Average departure delay for MON: AVG, where MON is the 3 letter month abbreviation and AVG is the average delay in minutes rounded to 2 decimal places for each month. β’ The departure delay is calculated by taking the difference between scheduled departure and actual departure. When calculating delay, you can assume that the no ships will be delayed by a day (i.e. delays are only within the same day). A negative departure delay indicates the ship left early, and a positive delay indicates the ship left late. β’ Use the following three letter month abbreviations when printing month MON output: Jan, Feb, Mar, Apr, Jun, Jul, Aug, Sep, Oct, Nov, Dec β’ To make sure our automated testing works properly, the printing of results should be framed between the keywords RESULTS and END (in all caps) β see example below. β’ The example formatting must be followed exactly due to the automatic testing. Submissions that do not follow the output spec, will fail. Here is an example of user input/output with the required formatting. There is no standard for the messages to the user. When the user inputs the following (in1.txt): Enter name of file: cases/ferry1.out Enter password: A00!$a The expected output (out1.txt): RESULTS FileDecoder: FileDecoder(key=βA00!$aβ, file=βferry1.outβ) Entries: 498 6 Average delay for Feb: 3.71 END This particular file only contains information for February. 4 Constraints You may only use python3 modules that are installed on Senjhalla. If an python3 module is not installed, you may not use that module in your code. You may use multiple regular expressions, but you may only use regular expressions to validate the password (key) format. Your FileDecoder the class must to be in the cipher.py file (not in the decoder file). Your class must be an iterable class. Your must do all file handling inside your FileDecoder. For example, you canβt read the file outside your class and pass the data to your class. Iterating through the ferry files must be done in the class using class iterator functions (not outside in decoder.py). 5 Test Inputs You should test all of your programs with a variety of test inputs, covering as many different use cases as possible, not just the test input provided. You should ensure that your programs handle error cases (such as files which do not exist) appropriately and do not produce errors/exceptions. You are expected to evaluate and test your code using the skills you have learned so far in the course. You are provided three encrypted input files and their corresponding passwords and expected output. The original CSV files are not provided. During evaluation, three additional encrypted BC ferry CSV files will be used for testing. These files will be in the same CSV format as those files provided. This assignment is modular by design to allow for individual testing of components. We will load and run your FileDecoder class independent of decoder.py. Provided For You For this assignment, you will be provided an assign3.zip containing decoder.py and an empty cipher.py file located in folder src. Three encrypted files (ferryX.out) and corresponding expected input and output (located in the folder cases) inX.txt and outX.txt. (HINT: Use diff and redirection to compare your results with the provided expected output) 7 6 What you must submit β’ Python source-code name decoder.py which contains your solution for assignment #4. β’ Python source-code name cipher.py which contains your FileDecoder class. β’ Ensure your work is committed to your local repository in the provided a4 folder and pushed to the remote before the due date/time. (You may keep extra files used during development within the repository.) 7 Evaluation The teaching staff will primarily mark solutions based on the input files provided for this assignment. Students must adhere to the software requirements (command execution and output formatting) outlined in this assignment. For this assignment, there will be no student demos. In addition to automated testing, your code will be evaluated based on: – Proper error handling (i.e. exception handling); – Good coding practices; – Adherence to the assignment constraints Good Coding Practices: β’ Your class is self-contained and can be run by any file; β’ Your code is properly structured into functions, without too much functionality in any one function; β’ Your variables names are descriptive and easy to infer their functionality; β’ Your function names are descriptive and easy to infer their functionality; β’ Program properly checks for possible error conditions; β’ Program properly checks for user error; β’ Your code doesnβt contain (nearly) duplicate code in different functions; β’ Global variables are used sparingly, and appropriately; β’ Your code is fairly efficient and executes in a reasonable time (i.e. roughly less than 1 minutes); Assignment Requirements 1. print(FileDecoder) prints human-readable description; 2. FileDecoder raises DecoderException on incorrect password; 3. len(FileDecoder) properly returns number of entries in decoded CSV file; 4. FileDecoder properly iterates through decoded CSV file entries; 5. FileDecoder properly decodes provided files (1-3); 6. Program properly handles user input; 7. Program only uses regular expressions to validate the user password; 8. Program passes output tests (1-3); 9. Program passes output tests (4-6) (evaluation only); 10. Program uses robust regular expressions to validate the user password, passes tests for invalid password formats (1-6) (evaluation only); Our grading scheme is relatively simple. See above for requirements. Not following good coding practices will lower your grade. 8 β’ βA/A+β: A submission completing ALL requirements of the assignment. β’ βA-β: A submission that completes requirements 1-9. Passes some of the regular expression invalid password format tests. β’ βB/B+β: A submission that completes requirements 1-8. Passes some of the evaluation tests (4-6). β’ βB-β: A submission that completes requirements 1-7. Passes some of the output tests (1-6). β’ βC+β: A submission that completes requirements 1-6. Program does not use regular expressions or only partially uses regular expressions. Program does incorrectly states that valid passwords are invalid. β’ βCβ: A submission that completes requirements 1-5. FileDecoder works as expected but decoder.py has issues. β’ βDβ: A serious attempt is made to complete the requirements of the assignment. All class requirements are attempted but do not completely work successfully. Basic decoder.py which loads FileDecoder exists. β’ βFβ: A submission that fails to complete the base requirements of the assignment. No submission given; submission represents very little work or understanding of the assignment. 9