Description
CSCI2250 Lab 1: C
Introduction
Real operating systems are virtually always written in C. We will be using C throughout this course.
This lab lets you revisit some language features specific to C:
Pointers;
Dynamic memory management;
Variadic functions.
Overview
In this lab, you will write an argument manipulator program called nyuc , which stands for Now You
Understand C. Your program will print each command-line argument in its original form, uppercase
form, and lowercase form.
For example, if you run:
It should print:
$ ./nyuc Hello, world
Download
Please download nyuc.tar.xz. You can use the following command to unarchive it:
Your task
The main() function is already given to you in nyuc.c :
[./nyuc] -> [./NYUC] [./nyuc]
[Hello,] -> [HELLO,] [hello,]
[world] -> [WORLD] [world]
$ tar xf nyuc.tar.xz
The main() function invokes two functions, defined in argmanip.h :
Your task is to implement these two functions.
manipulate_args()
#include
#include
#include “argmanip.h”
int main(int argc, const char *const *argv) {
char **upper_args = manipulate_args(argc, argv, toupper);
char **lower_args = manipulate_args(argc, argv, tolower);
for (char *const *p = upper_args, *const *q = lower_args; *p && *q; ++argv, ++p, ++q) {
printf(“[%s] -> [%s] [%s]\n”, *argv, *p, *q);
}
free_copied_args(upper_args, lower_args, NULL);
}
#ifndef _ARGMANIP_H_
#define _ARGMANIP_H_
char **manipulate_args(int argc, const char *const *argv, int (*const manip)(int));
void free_copied_args(char **args, …);
#endif
The manipulate_args() function takes three arguments:
argc and argv are the same as the command-line arguments.
manip is a pointer to a function that manipulates a character, such as toupper() and tolower() .
In this function, you will make a “copy” of the argument list. First, you will call malloc() once for the
overall array in which each element is a string of type char * . Then, you will call malloc() for each
element of that array, each of which will hold the manipulated version of each argument. Of course,
you will have to copy each string character-by-character, passing through the manip function as you
go.
free_copied_args()
The free_copied_args() function takes a variable number of arguments, each of which is a “copied”
argument list returned by manipulate_args() . It is terminated by a null argument.
For each argument list, you must free() everything that you have malloc() ed. First, you need to
free() all individual strings. Then, you need to free() the overall array.
Compiling
We will grade your submission on linserv1.cims.nyu.edu, which runs CentOS Linux release 7.9.2009.
We will compile your program using gcc 9.2.0 with the C17 standard and GNU extensions. You need
to run the following command to load it:
We have provided a Makefile , which turns on a few compilation flags to let gcc catch more
potential issues:
To compile your program, type:
Testing
$ module load gcc-9.2
CC=gcc
CFLAGS=-g -pedantic -std=gnu17 -Wall -Werror -Wextra
.PHONY: all
all: nyuc
nyuc: nyuc.o argmanip.o
nyuc.o: nyuc.c argmanip.h
argmanip.o: argmanip.c argmanip.h
.PHONY: clean
clean:
rm -f *.o nyuc
$ make
We will run your program under Valgrind Memcheck, a tool to detect many memory-related errors
that are common in C programs, such as:
Allocating the wrong size of memory.
Using an uninitialized pointer.
Accessing memory after it was freed.
Overrunning a buffer.
Leaking memory.
…
These errors can lead to crashes and unpredictable behavior.
Valgrind is already installed on the Linux server. To run your program under Memcheck, type:
It will print a lot of information. Here is a sample output:
$ valgrind –leak-check=full ./nyuc Hello, world
In addition to printing the correct output, your program must have “0 errors from 0 contexts,” which
means no illegal memory access, no memory leak, etc. Otherwise, Valgrind will give you detailed
information about the memory errors and potential ways to diagnose the issue. You must fix all of
them. You may find gdb useful for debugging.
Submission
You will submit a single file, argmanip.c .
==12345== Memcheck, a memory error detector
==12345== Copyright (C) 2002-2017, and GNU GPL’d, by Julian Seward et al.
==12345== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==12345== Command: ./nyuc Hello, world
==12345==
[./nyuc] -> [./NYUC] [./nyuc]
[Hello,] -> [HELLO,] [hello,]
[world] -> [WORLD] [world]
==12345==
==12345== HEAP SUMMARY:
==12345== in use at exit: 0 bytes in 0 blocks
==12345== total heap usage: 8 allocs, 8 frees, 108 bytes allocated
==12345==
==12345== All heap blocks were freed — no leaks are possible
==12345==
==12345== For lists of detected and suppressed errors, rerun with: -s
==12345== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
You must not modify any provided files ( nyuc.c , argmanip.h , and Makefile ). We will compile your
code together with our provided files and use automated scripts to grade your submission.
Rubric
The total of this lab is 100 points, mapped to 5% of your final grade of this course.
You will get 40 points as long as you submit something meaningful and your program compiles
successfully.
You will get another 30 points if your program prints the correct output.
You will get the final 30 points if your program prints the correct output and Valgrind reports 0
errors. Otherwise, for each error context, there will be a 5-point deduction, up to 30 points.
Tips
If you are experienced in C programming, this lab shouldn’t be too difficult. However, if you are not
familiar with C (even if you know Java), now is an excellent time to catch up before Lab 1 posts. The C
programming language (2nd ed.) by Brian W. Kernighan and Dennis M. Ritchie is a timeless classic
(NYU students can read it online for free), and this C tutorial is a good alternative.
References: Lee et al., Follow the River and You Will Find the C, SIGCSE’11.
CSCI2250 Lab 2: Shell
Introduction
The shell is the main command-line interface between a user and the operating system, and it is an
essential part of the daily lives of computer scientists, software engineers, system administrators,
and so on. It makes heavy use of many OS features. In this lab, you will build a simplified version of
the Unix shell called the New Yet Usable SHell, or nyush for short.
Objectives
Through this lab, you will:
Familiarize yourself with the Linux programming environment and the shell, of course.
Learn how to write an interactive command-line program.
Learn how processes are created, destroyed, and managed.
Learn how to handle signals and I/O redirection.
Get a better understanding of the OS and system calls.
Be a better C programmer and be better prepared for your future technical job interviews. In
particular, the string parsing skill that you will practice in this lab frequently appears in interview
questions.
Overview
The shell is essentially a command-line interpreter. It works as follows:
1. It prompts you to enter a command.
2. It interprets the command you entered.
If you entered a built-in command (e.g., cd ), then the shell runs that command.
If you entered an external program (e.g., /bin/ls ), or multiple programs connected through
pipes (e.g., ls -l | less ), then the shell creates child processes, executes these programs, and
waits for all these processes to either terminate or be suspended.
If you entered something wrong, then the shell prints an error message.
3. Rinse and repeat until you enter the built-in command exit , at which point it exits.
Specifications
The prompt
The promptis what the shell prints before waiting for you to enter a command. In this lab, you will
use the following format (the final underscore represents your cursor; you should not print that
underscore):
The dir is the basename of the current working directory. For example, if you are in
/home/abc123/os/lab2 , then the prompt should be:
[nyush dir]$ _
Note that there is a space after the dollar sign.
The command
In each iteration, the user inputs a command terminated by the “enter” key (i.e., newline). For
simplicity, we have the following assumptions:
Each command has no more than 1000 characters.
Program arguments, if any, are separated by a single space.
There is no space within the program name or any command-line argument.
A command may contain multiple programs separated by the pipe ( | ) symbol.
In each command, the first program may redirectits input using < , and the last program may redirectits output using > or >> . If there is only one program in the command, it may redirect
both input and output.
In each command, there may be at most one input redirection and one output redirection.
Built-in commands (e.g., cd ) cannot be I/O redirected or piped.
For your reference, here is the grammar for valid commands (don’t worry if you can’t understand it;
just look at the examples below):
[nyush lab2]$ _
Here are some examples of valid commands:
A blank line.
ls -a -l
[command] := “”; or
:= [cd] [arg]; or
:= [exit]; or
:= [fg] [arg]; or
:= [jobs]; or
:= [cmd] ‘<‘ [filename] [recursive]; or
:= [cmd] ‘<‘ [filename] [terminate]; or
:= [cmd] [recursive]; or
:= [cmd] [terminate] < [filename]; or := [cmd] [terminate]. [recursive] := ‘|’ [cmd] [recursive]; or := ‘|’ [cmd] [terminate]. [terminate] := “”; or := ‘>’ [filename].
:= ‘>>’ [filename].
[cmd] := [cmdname] [arg]*
[cmdname] := A string without any space, tab, > (ASCII 62), < (ASCII 60), | (ASCII 124), * (ASCII 42), ! [arg] := A string without any space, tab, > (ASCII 62), < (ASCII 60), | (ASCII 124), * (ASCII 42), ! (AS [filename] := A string without any space, tab, > (ASCII 62), < (ASCII 60), | (ASCII 124), * (ASCII 42),
cat shell.c | grep main | less
cat < input.txt > output.txt
cat < input.txt >> output.txt
cat < input.txt | cat | cat > output.txt
Here are some examples of invalid commands:
cat < cat >
cat |
| cat
cat << input.txt
cat < output.txt < output2.txt
cat < output.txt output2.txt cat > output.txt > output2.txt
cat > output.txt output2.txt
cat > output.txt | cat
cat | cat < input.txt cd / > output.txt
If there is any error in parsing the command, then your shell should print the following error message
to STDERR and prompt for the next command.
Note that there should be a newline at the end of the error message. For example:
Locating programs
You can specify a program by either an absolute path, a relative path, or base name only.
An absolute path begins with a slash ( / ). If the user specifies an absolute path, then your shell must
run the program at that location.
A relative path contains, but not begins with, a slash ( / ). If the user specifies a relative path, then
your shell should locate the program by following the path from the current working directory. For
example, dir1/dir2/program is equivalent to ./dir1/dir2/program .
Otherwise, if the user specifies only the base name without any slash ( / ), then your shell must
search for the program under /bin and /usr/bin (in such order). For example, when the user types
ls , then your shell should try /bin/ls . If that fails, try /usr/bin/ls . If that also fails, it is an error. In
Error: invalid command
[nyush lab2]$ cat <
Error: invalid command
[nyush lab2]$ _
this case, your shell should not search the current working directory. If there is a program named
hello in the current working directory. Entering hello should result in an error, whereas ./hello
runs the program.
In any case, if the program cannot be located, your shell should print the following error message to
STDERR and prompt for the next command.
Process termination and suspension
After creating the processes, your shell must wait until all the processes have stopped running:
either terminated or suspended. Then, your shell should prompt the user for the next command.
Your shell must notleave any zombies in the system when it is ready to read the next command
from the user.
Signal handling
If a user presses Ctrl-C or Ctrl-Z , they don’t expect to terminate or suspend the shell. Therefore,
your shell should ignore the following signals: SIGINT , SIGQUIT , SIGTERM , SIGTSTP . All other signals
not listed here should keep the default signal handlers.
Note that only the shell itself, not the child processes created by the shell, should ignore these
signals. For example,
Error: invalid program
Here, the signal SIGINT generated by Ctrl-C terminates only the process cat , not the shell itself.
As a side note, if your shell ever hangs and you would like to kill the shell, you can still send it the
SIGKILL signal.
I/O redirection
Sometimes, a user would read the input to a program from a file rather than the keyboard, or send
the output of a program to a file rather than the screen. Your shell should be able to redirect the
standard input( STDIN ) and the standard output( STDOUT ). Forsimplicity, you are not required to
redirect the standard error ( STDERR ).
Input redirection
Input redirection is achieved by a < symbol followed by a file name. For example:
If the file does not exist, your shell should print the following error message to STDERR and prompt
for the next command.
[nyush lab2]$ cat
^C
[nyush lab2]$ _
[nyush lab2]$ cat < input.txt Output redirection Output redirection is achieved by > or >> followed by a file name. For example:
If the file does not exist, a new file should be created. If the file already exists, redirecting with >
should overwrite the file (after truncating it), whereas redirecting with >> should append to the
existing file.
Pipe
A pipe ( | ) connects the standard output of the first program to the standard input of the second
program. For example:
The user may invoke n programs chained through (n – 1) pipes. Each pipe connects the output of the
program immediately before the pipe to the input of the program immediately after the pipe. For
example:
Error: invalid file
[nyush lab2]$ ls -l > output.txt
[nyush lab2]$ ls -l >> output.txt
[nyush lab2]$ cat shell.c | wc -l
Here, the output of cat shell.c is the input of grep main , and the output of grep main is the input of
less .
Built-in commands
Every shell has a few built-in commands. When the user issues a command, the shell should first
check if it is a built-in command. If so, it should not be executed like other programs.
In this lab, you will implement four built-in commands: cd , jobs , fg , and exit .
cd [dir]
This command changes the current working directory of the shell. It takes exactly one argument: the
directory, which may be an absolute or relative path.
If cd is called with 0 or 2+ arguments, your shell should print the following error message to STDERR
and prompt for the next command.
If the directory does not exist, your shell should print the following error message to STDERR and
prompt for the next command.
[nyush lab2]$ cat shell.c | grep main | less
Error: invalid command
jobs
This command prints a list of currently suspended jobs to STDOUT in the following format:
[index] command . For example:
A job is the whole command, which may be either one program or multiple programs connected
through pipes. A job may be suspended by Ctrl-Z , the SIGTSTP signal, or the SIGSTOP signal. This list
is sorted by the time each job is suspended (oldest first).
Forsimplicity, you can assume that there are no more than 100 suspended jobs at one time. Besides, you are
not required to handle the case where a suspended job isresumed or terminated by other processes.
The jobs command takes no arguments. If it is called with any arguments, your shell should print the
following error message to STDERR and prompt for the next command.
Error: invalid directory
[nyush lab2]$ jobs
[1] cat
[2] top | cat
[3] cat > output.txt
[nyush lab2]$ _
Error: invalid command
fg [index]
This command resumes a job in the foreground. It takes exactly one argument: the job index, which is
the number in the bracket printed by the jobs command. For example:
This command would resume top | cat in the foreground.
If fg is called with 0 or 2+ arguments, your shell should print the following error message to STDERR
and prompt for the next command.
If the job index does not exist in the list of currently suspended jobs, your shell should print the
following error message to STDERR and prompt for the next command.
exit
[nyush lab2]$ jobs
[1] cat
[2] top | cat
[3] cat > output.txt
[nyush lab2]$ fg 2
Error: invalid command
Error: invalid job
This command terminates your shell. However, if there are currently suspended jobs, your shell
should not terminate. Instead, it should print the following error message to STDERR and prompt for
the next command.
The exit command takes no arguments. If it is called with any arguments, your shell should print the
following error message to STDERR and prompt for the next command.
Compiling
We will grade your submission on linserv1.cims.nyu.edu, which runs CentOS Linux release 7.9.2009.
We will compile your program using gcc 9.2.0. You need to run the following command to load it:
You must provide a Makefile , and by running make , it should generate an executable file named
nyush in the current working directory.
Your program must not callthe system() function. Otherwise, what is the whole point of this lab?
Error: there are suspended jobs
Error: invalid command
$ module load gcc-9.2
Testing
Beat up your own code extensively. Better yet, eat your own dog food. I would happily use nyush as
my main shell (at least for the duration of this lab), so why wouldn’t you?
By popular demand, we are providing some sample test cases. Note that these cases are not
exhaustive. The test cases for final grading will be different from the ones provided and will not be
shared. Commands like job and fg are not tested extensively in these test cases, but they will be
for the final grading.
Submission
You will submit an archive containing all files needed to compile nyush . You can do so with the
following command:
Rubric
The total of this lab is 100 points, mapped to 15% of your final grade of this course.
Compile successfully and can prompt the user for input. (40 points)
Process creation and termination. (20 points)
I/O redirection and pipe. (20 points)
$ tar cvJf nyush-Your_NetID.tar.xz Makefile *.h *.c
Handling suspended jobs ( jobs and fg ). (10 points)
Built-in commands and error handling. (10 points)
You will get 0 points for this lab if you callthe system() function.
Please make sure that your shell prompt and all error messages are as specified in this document.
Any discrepancy may lead to point deductions.
Tips
Don’t procrastinate
This lab requires significant programming effort. Therefore, start as early as possible! Don’t wait
until the last week.
Step by step
Remember to get the basic functionality working first, and build up your shell step-by-step.
Here is how I would tackle this lab:
1. Write a simple command parser.
2. Be able to run a simple program, such as ls .
3. Run a program with arguments, such as ls -l .
4. Handle simple built-in commands ( cd and exit ).
5. Handle output redirection, such as cat > output.txt .
6. Handle input redirection, such as cat < input.txt .
7. Run two programs with one pipe, such as cat | cat .
8. Handle multiple pipes, such as cat | cat | cat .
9. Handle suspended jobs.
10. Handle more built-in commands ( jobs and fg ).
Feel free to rearrange as you see fit.
Keep versions of your code! Use git or similar tools, but don’t make your repository public.
Read man pages
Man pages are of vital importance for programmers working on Linux and such. They contain a trove
of information.
Man pages are divided into sections. Please see man man for the description of each section. In
particular, Section 2 contains system calls. You will need to look them up a lot in this lab.
Sometimes, you need to specify the section number explicitly. For example, man pipe shows the page
in Section 8 by default. If you need to look up the pipe() system call, you need to invoke man 2 pipe .
Useful system calls
Your shell will make many system calls. Here are a few that you may find useful.
Process management: fork() , exec*() , wait() , waitpid() .
I/O redirection and pipe: dup2() , creat() , open() , close() , pipe() .
Signal handling: signal() .
Built-in commands: chdir() , getcwd() , kill() .
You may not need to use all of them, and you are free to use other system calls not mentioned above.
Check the return values of all system calls from the very beginning of your work. This will often
catch errors early, and it is a good programming practice.
Parsing the command line
You may find writing the command parser troublesome. Don’t be frustrated. You are not alone.
However, it is an essential skill for any programmer, and it often appears in software engineer
interviews. Once you get through it, you will never be afraid of it again.
I personally find the strtok_r() function extremely helpful. You don’t have to use it, but why not give
it a try?
Thislab has borrowed some ideasfrom Prof. Arpaci-Dusseau and Dr. T. Y. Wong.
CSCI2250 Lab 3: Encoder
Introduction
Data compression is the process of encoding information using fewer bits than the original
representation. Run-length encoding (RLE) is a simple yet effective compression algorithm:
repeated data are stored as a single data and the count. In this lab, you will build a parallel run-length
encoder called Not Your Usual ENCoder, or nyuenc for short.
Objectives
Through this lab, you will:
Familiarize yourself with multithreaded programming using POSIX threads.
Learn how to implement a thread pool using mutexes and condition variables.
Learn how to use a thread pool to parallelize a program.
Get a better understanding of key IPC concepts and issues.
Be a better C programmer and be better prepared for your future technical job interviews. In
particular, the string encoding technique that you will practice in this lab frequently appears in
interview questions.
Run-length encoding
Run-length encoding (RLE) is quite simple. When you encounter n characters of the same type in a
row, the encoder ( nyuenc ) will turn that into a single instance of the character followed by the count
n.
For example, a file with the following contents:
would be encoded (logically, as the numbers would be in binary format) as:
Note that the exact format of the encoded file is important. You will store the character in ASCII and
the count as a 1-byte unsigned integer in binary format. In other words, the output is actually:
In this example, the original file is 16 bytes, and the encoded file is 6 bytes.
Forsimplicity, you can assume that no character will appear more than 255 timesin a row. In other words,
you can safely store the count in one byte.
Milestone 1: sequential RLE
aaaaaabbbbbbbbba
a6b9a1
a\x06b\x09a\x01
You will first implement nyuenc as a single-threaded program. The encoder reads from one or more
files specified as command-line arguments and writes to STDOUT . Thus, the typical usage of nyuenc
would use shell redirection to write the encoded output to a file.
Note that you can use xxd to inspect a file in binary format.
For example, let’s encode the aforementioned file.
If multiple files are passed to nyuenc , they will be concatenated and encoded into a single
compressed output. For example:
Note that the last a in the first file and the leading a ’s in the second file are merged.
$ echo -n “aaaaaabbbbbbbbba” > file.txt
$ xxd file.txt
0000000: 6161 6161 6161 6262 6262 6262 6262 6261 aaaaaabbbbbbbbba
$ ./nyuenc file.txt > file.enc
$ xxd file.enc
0000000: 6106 6209 6101 a.b.a.
$ echo -n “aaaaaabbbbbbbbba” > file.txt
$ xxd file.txt
0000000: 6161 6161 6161 6262 6262 6262 6262 6261 aaaaaabbbbbbbbba
$ ./nyuenc file.txt file.txt > file2.enc
$ xxd file2.enc
0000000: 6106 6209 6107 6209 6101 a.b.a.b.a.
Forsimplicity, you can assume that there are no more than 100 files, and the totalsize of all filesis no more
than 1GB.
Milestone 2: parallel RLE
Next, you will parallelize the encoding using POSIX threads. In particular, you will implement a
thread pool for executing encoding tasks.
You should use mutexes, condition variables, or semaphores to realize proper synchronization
among threads. Your code must be free of race conditions. You must not perform busy waiting, and
you must not use sleep() , usleep() , or nanosleep() .
Your nyuenc will take an optional command-line option -j jobs , which specifies the number of
worker threads. (If no such option is provided, it runs sequentially.)
For example:
$ time ./nyuenc file.txt > /dev/null
real 0m0.527s
user 0m0.475s
sys 0m0.233s
$ time ./nyuenc -j 3 file.txt > /dev/null
real 0m0.191s
user 0m0.443s
sys 0m0.179s
You can see the difference in running time between the sequential version and the parallel version.
How to parallelize the encoding
Think about what can be done in parallel and what must be done serially by a single thread.
Also, think about how to divide the encoding task into smaller pieces. Note that the input files may
vary greatly in size. Will all the worker threads be fully utilized?
Here is an illustration of a thread pool from Wikipedia:
Task Queue
…
Thread
Pool
Completed Tasks
…
At the beginning of your program, you should create a pool of worker threads (the green boxes in the
figure above). The number of threads is specified by the command-line argument -j jobs .
You should divide the input data into fixed-size (4KB) chunks and submit the tasks (the blue circles in
the figure above) to the task queue, where each task would encode a chunk. Whenever a worker
thread becomes available, it would execute the next task in the task queue.
The main thread should collect the results (the yellow circles in the figure above) and write them to
STDOUT . Note that you may need to stitch the chunk boundaries. For example, if the previous chunk
ends with aaaaa , and the next chunk starts with aaa , instead of writing a5a3 , you should write a8 .
It is important that you synchronize the threads properly so that there are no deadlocks or race
conditions. In particular, there are two things that you need to consider carefully:
The worker thread should wait until there is a task to do.
The main thread should wait until a task has been completed so that it can collect the result.
Compiling
We will grade your submission on one of the compute servers, which runs CentOS Linux release
7.9.2009. We will compile your program using gcc 9.2.0. You need to run the following command to
load it:
You must provide a Makefile , and by running make , it should generate an executable file named
nyuenc in the current working directory. Note that you need to add the compiler option -pthread .
$ module load gcc-9.2
For those of you who choose to use C++, you are not allowed to include , ,
, , or .
Testing
Your code will first be measured for correctness. Parallelism means nothing if it cannot encode files
correctly.
If you pass the correctness tests, your code will be tested for performance. Higher performance will
lead to better scores.
strace
We will use strace to examine the system calls you have invoked. You will suffer from severe score
penalties if you call clone() too many times, which indicates that you do not use a thread pool
properly, or if you call nanosleep() , which indicates that you do not have proper synchronization.
On the other hand, you can expect to find many futex() invocations in the strace log. They are used
for synchronization.
You can use the following command to run your program under strace and dump the log to
strace.txt :
$ strace ./nyuenc -j 3 file.txt > /dev/null 2> strace.txt
Valgrind
We will use two Valgrind tools, namely Helgrind and DRD, to detect thread errors in your code. Both
should report 0 errors from 0 contexts.
You can use the following command to run your program under Helgrind:
You can use the following command to run your program under DRD:
$ valgrind –tool=helgrind –read-var-info=yes ./nyuenc -j 3 file.txt > /dev/null
==12345== Helgrind, a thread error detector
==12345== Copyright (C) 2007-2017, and GNU GPL’d, by OpenWorks LLP et al.
==12345== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==12345== Command: ./nyuenc -j 3 file.txt
==12345==
==12345==
==12345== Use –history-level=approx or =none to gain increased speed, at
==12345== the cost of reduced accuracy of conflicting-access information
==12345== For lists of detected and suppressed errors, rerun with: -s
==12345== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 240 from 89)
Note that Valgrind is very slow; so you should only test it with small files.
The autograder
Please download the autograder to test your program. Note that the test cases are not exhaustive.
However, if you can’t pass these cases, you can’t expect to pass the final grading.
Submission
You will submit an archive containing all files needed to compile nyuenc . You can do so with the
following command:
Rubric
$ valgrind –tool=drd –read-var-info=yes ./nyuenc -j 3 file.txt > /dev/null
==12345== drd, a thread error detector
==12345== Copyright (C) 2006-2017, and GNU GPL’d, by Bart Van Assche.
==12345== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==12345== Command: ./nyuenc -j 3 file.txt
==12345==
==12345==
==12345== For lists of detected and suppressed errors, rerun with: -s
==12345== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 139 from 11)
$ tar cvJf nyuenc-Your_NetID.tar.xz Makefile *.h *.c
The total of this lab is 100 points, mapped to 15% of your final grade of this course.
Meaningful submission that compiles successfully. (40 points)
Milestone 1.
Correctly encode one file. (10 points)
Correctly encode multiple files. (5 points)
Milestone 2.
Correctness. (15 points)
Performance. (15 points)
Free of thread errors (e.g., deadlocks, race conditions). (15 points)
For each error context reported by Helgrind or DRD (we will take the smaller of the two),
there will be 5 points deduction.
Tips
Don’t procrastinate
This lab requires significant programming effort. Therefore, start as early as possible! Don’t wait
until the last week.
How to parse command-line options
The getopt() function is useful for parsing command-line options such as -j jobs . Read its man
page and example.
How to access the input file efficiently
You may have used read() or fread() before, but one particularly efficient way is to use mmap() ,
which maps a file into the memory address space. Then, you can efficiently access each byte of the
input file via pointers. Read its man page and example.
How to debug
Debugging multithreaded code can be frustrating, especially when it hangs. If you are using gdb , you
can press Ctrl-C to stop the running program and use the following command to show what each
thread is doing:
Suppose you want to switch to Thread 2. You can type:
Then, you can use up and down to navigate through the stack frames and use info locals or print
to examine variables.
A particularly useful command is x , which can show a portion of memory in binary format. For
example, to examine the next 8 bytes pointed by ptr , you can use:
(gdb) thread apply all backtrace
(gdb) thread 2
In addition, we have partnered with Undo to provide free educational licenses for the UDB time
travel debugger. You can use udb in place of gdb . Here is a quick reference guide. You can also read
a tutorial on debugging race conditions using udb .
Thislab has borrowed some ideasfrom Prof. Arpaci-Dusseau.
(gdb) x/8bx ptr
CSCI2250 Lab 4: File Recovery
Introduction
FAT32 has been around for 25 years. Because of its simplicity, it is the most widely compatible file
system. Although recent computers have adopted newer file systems, FAT32 is still dominant in SD
cards and USB flash drives due to its compatibility.
Have you ever accidentally deleted a file? Do you know that it could be recovered?In this lab, you
will build a FAT32 file recovery tool called Need You to Undelete my FILE, or nyufile for short.
Objectives
Through this lab, you will:
Learn the internals of the FAT32 file system.
Learn how to access and recover files from a raw disk.
Get a better understanding of key file system concepts.
Be a better C programmer. Learn how to write code that manipulates data at the byte level and
understand the alignment issue.
Overview
In this lab, you will work on the data stored in the FAT32 file system directly, without the OS file
system support. You will implement a tool that recovers a deleted file specified by the user.
For simplicity, you can assume thatthe deleted file is in the root directory. Therefore, you don’t
need to search subdirectories.
Preparation
Unfortunately, non-root users cannot mount a file system on the CIMS compute servers. If you want
to mount a file system, you have to use your own Linux machine or virtual machine.
If you do not already have a Linux system installed, you can follow these steps to install a virtual
machine running CentOS 7.
If your machine has an Intel CPU…
1. Install VirtualBox.
2. Install Vagrant.
3. Install the Vagrant scp plugin, which allows you to transfer files between your host machine and
the virtual machine:
4. Download and extract the configuration file:
$ vagrant plugin install vagrant-scp
5. Start the virtual machine (all Vagrant commands must be run in the centos7 directory):
6. Log into the virtual machine:
7. Now, you can work on this lab in the virtual machine. It already has gcc , gdb , and vim installed.
If you want to install other packages, you can run this command inside the virtual machine:
8. To copy files from your host machine to the virtual machine, run this command on your host
machine:
Conversely, to copy files from the virtual machine to your host machine, run:
$ tar xf centos7.tar.gz
$ cd centos7
$ vagrant up
$ vagrant ssh
$ sudo yum install $ vagrant scp :
9. To pause the virtual machine, run:
To shut down the virtual machine, run:
If you have a Mac with an M1 CPU…
Unfortunately, the virtual machine does not work on a Mac with an M1 CPU. You can run the virtual
machine on one of the CIMS compute servers. VirtualBox and Vagrant are already installed. You just
need to load the module:
Then, follow steps 3–9 above.
Working with a FAT32 disk image
$ vagrant scp :
$ vagrant suspend
$ vagrant halt
$ module load vagrant
Before going through the details of this lab, let’s first create a FAT32 disk image. Follow these steps:
Step 1: create an empty file of a certain size
On Linux, /dev/zero is a special file that provides as many \0 as are read from it. The dd command
performs low-level copying of raw data. Therefore, you can use it to generate an arbitrary-size file
full of zeros.
For example, to create a 256KB empty file named fat32.disk :
Read man dd for its usage. You will use this file as the disk image.
Step 2: format the disk with FAT32
You can use the mkfs.fat command to create a FAT32 file system. The most basic usage is:
You can specify a variety of options. For example:
$ dd if=/dev/zero of=fat32.disk bs=256k count=1
$ mkfs.fat -F 32 fat32.disk
$ mkfs.fat -F 32 -f 2 -S 512 -s 1 -R 32 fat32.disk
Here are the meanings of each option:
-F : type of FAT (FAT12, FAT16, or FAT32).
-f : number of FATs.
-S : number of bytes per sector.
-s : number of sectors per cluster.
-R : number of reserved sectors.
Step 3: verify the file system information
The fsck.fat command can check and repair FAT file systems. You can invoke it with -v to see the
FAT details. For example:
You can see that there are 2 FATs, 512 bytes per sector, 512 bytes per cluster, and 32 reserved
sectors. These numbers match our specified options in Step 2. You can try different options yourself.
Step 4: mount the file system
$ fsck.fat -v fat32.disk
fsck.fat 3.0.20 (12 Jun 2013)
fsck.fat 3.0.20 (12 Jun 2013)
Checking we can access the last sector of the filesystem
Warning: Filesystem is FAT32 according to fat_length and fat32_length fields,
but has only 472 clusters, less than the required minimum of 65525.
This may lead to problems on some systems.
Boot sector contents:
System ID “mkfs.fat”
Media byte 0xf8 (hard disk)
512 bytes per logical sector
512 bytes per cluster
32 reserved sectors
First FAT starts at byte 16384 (sector 32)
2 FATs, 32 bit entries
2048 bytes per FAT (= 4 sectors)
Root directory start at cluster 2 (arbitrary size)
Data area starts at byte 20480 (sector 40)
472 data clusters (241664 bytes)
32 sectors/track, 64 heads
0 hidden sectors
512 sectors total
Checking for unused clusters.
Checking free cluster summary.
fat32.disk: 0 files, 1/472 clusters
You can use the mount command to mount a file system to a mount point. The mount point can be
any empty directory. For example, you can create one at /mnt/disk :
Then, you can mount fat32.disk at that mount point:
Step 5: play with the file system
After the file system is mounted, you can do whatever you like on it, such as creating files, editing
files, or deleting files. In order to avoid the hassle of having long filenames in your directory entries, it
is recommended that you use only 8.3 filenames, which means:
The filename contains at most eight characters, followed optionally by a . and at most three
more characters.
The filename contains only uppercase letters, numbers, and the following special characters:
! # $ % & ‘ ( ) – @ ^ _ ` { } ~ .
For example, you can create a file named HELLO.TXT :
$ sudo mkdir /mnt/disk
$ sudo mount -o umask=0 fat32.disk /mnt/disk
For the purpose of this lab, after you write anything to the disk, make sure to flush the file system
cache using the sync command:
(Otherwise, if you create a file and immediately delete it, the file may not be written to the disk at all
and is unrecoverable.)
Step 6: unmount the file system
When you finish playing with the file system, you can unmount it:
Step 7: examine the file system
You can examine the file system using the xxd command. You can specify a range using the -s
(starting offset) and -l (length) options.
For example, to examine the root directory:
$ echo “Hello, world.” > /mnt/disk/HELLO.TXT
$ mkdir /mnt/disk/DIR
$ touch /mnt/disk/EMPTY
$ sync
$ sudo umount /mnt/disk
To examine the contents of HELLO.TXT :
Note that the offsets may vary depending on how the file system is formatted.
Your tasks
Important: before running your nyufile program, please make sure that your FAT32 disk is
unmounted.
Milestone 1: validate usage
There are several ways to invoke your nyufile program. Here is its usage:
$ xxd -s 20480 -l 96 fat32.disk
00005000: 4845 4c4c 4f20 2020 5458 5420 0000 0000 HELLO TXT ….
00005010: 6e53 6e53 0000 0000 6e53 0300 0e00 0000 nSnS….nS……
00005020: 4449 5220 2020 2020 2020 2010 0000 0000 DIR …..
00005030: 6e53 6e53 0000 0000 6e53 0400 0000 0000 nSnS….nS……
00005040: 454d 5054 5920 2020 2020 2020 0000 0000 EMPTY ….
00005050: 6e53 6e53 0000 0000 6e53 0000 0000 0000 nSnS….nS……
$ xxd -s 20992 -l 14 fat32.disk
0005200: 4865 6c6c 6f2c 2077 6f72 6c64 2e0a Hello, world..
The first argument is the filename of the disk image. After that, the options can be one of the
following:
-i
-l
-r filename
-r filename -s sha1
-R filename -s sha1
You need to check if the command-line arguments are valid. If not, your program should print the
above usage information and exit.
Milestone 2: print the file system information
If your nyufile program is invoked with option -i , it should print the following information about
the FAT32 file system:
Number of FATs;
Number of bytes per sector;
$ ./nyufile
Usage: ./nyufile disk
-i Print the file system information.
-l List the root directory.
-r filename [-s sha1] Recover a contiguous file.
-R filename -s sha1 Recover a possibly non-contiguous file.
Number of sectors per cluster;
Number of reserved sectors.
Your output should be in the following format:
For all milestones, you can assume that nyufile is invoked while the disk is unmounted.
Milestone 3: list the root directory
If your nyufile program is invoked with option -l , it should list all valid entries in the root directory
with the following information:
Filename. Similar to /bin/ls -p , if the entry is a directory, you should append a / indicator.
File size.
Starting cluster.
You should also print the total number of entries at the end. Your output should be in the following
format:
$ ./nyufile fat32.disk -i
Number of FATs = 2
Number of bytes per sector = 512
Number of sectors per cluster = 1
Number of reserved sectors = 32
Here are a few assumptions:
You should notlist entries marked as deleted.
You don’t need to print the details inside subdirectories.
For all milestones, there will be no long filename (LFN) entries. (If you have accidentally created
LFN entries when you test your program, don’t worry. You can just skip the LFN entries and print
only the 8.3 filename entries.)
All files and directories, including the root directory, may span across more than one cluster.
There may be empty files.
Milestone 4: recover a small file
If your nyufile program is invoked with option -r filename , it should recover the deleted file with
the specified name. The workflow is better illustrated through an example:
$ ./nyufile fat32.disk -l
HELLO.TXT (size = 14, starting cluster = 3)
DIR/ (size = 0, starting cluster = 4)
EMPTY (size = 0, starting cluster = 0)
Total number of entries = 3
For all milestones, you only need to recover regular files (including empty files, but not directory
files) in the root directory. When the file is successfully recovered, your program should print
filename: successfully recovered .
$ sudo mount -o umask=0 fat32.disk /mnt/disk
$ ls -p /mnt/disk
DIR/ EMPTY HELLO.TXT
$ cat /mnt/disk/HELLO.TXT
Hello, world.
$ rm /mnt/disk/HELLO.TXT
$ ls -p /mnt/disk
DIR/ EMPTY
$ sudo umount /mnt/disk
$ ./nyufile fat32.disk -l
DIR/ (size = 0, starting cluster = 4)
EMPTY (size = 0, starting cluster = 0)
Total number of entries = 2
$ ./nyufile fat32.disk -r HELLO
HELLO: file not found
$ ./nyufile fat32.disk -r HELLO.TXT
HELLO.TXT: successfully recovered
$ ./nyufile fat32.disk -l
HELLO.TXT (size = 14, starting cluster = 3)
DIR/ (size = 0, starting cluster = 4)
EMPTY (size = 0, starting cluster = 0)
Total number of entries = 3
$ sudo mount -o umask=0 fat32.disk /mnt/disk
$ ls -p /mnt/disk
DIR/ EMPTY HELLO.TXT
$ cat /mnt/disk/HELLO.TXT
Hello, world.
For all milestones, you can assume that no other files or directories are created or modified since the
deletion of the target file. However, multiple files may be deleted.
Besides, for all milestones, you don’t need to update the FSINFO structure because most operating
systems don’t care about it.
Here are a few assumptions specifically for Milestone 4:
The size of the deleted file is no more than the size of a cluster.
At most one deleted directory entry matches the given filename. If no such entry exists, your
program should print filename: file not found .
Milestone 5: recover a large contiguously-allocated file
Now, you will recover a file that is larger than one cluster. Nevertheless, for Milestone 5, you can
assume that such a file is allocated contiguously. You can continue to assume that at most one
deleted directory entry matches the given filename. If no such entry exists, your program should
print filename: file not found .
Milestone 6: detect ambiguous file recovery requests
In Milestones 4 and 5, you assumed that at most one deleted directory entry matches the given
filename. However, multiple files whose names differ only in the first character would end up having
the same name when deleted. Therefore, you may encounter more than one deleted directory entry
matching the given filename. When that happens, your program should print
filename: multiple candidates found and abort.
This scenario is illustrated in the following example:
Milestone 7: recover a contiguously-allocated file with SHA-1 hash
To solve the aforementioned ambiguity, the user can provide a SHA-1 hash via command-line option
-s sha1 to help identify which deleted directory entry should be the target file.
In short, a SHA-1 hash is a 160-bit fingerprint of a file, often represented as 40 hexadecimal digits.
For the purpose of this lab, you can assume that identical files always have the same SHA-1 hash, and
different files always have vastly different SHA-1 hashes. Therefore, even if multiple candidates are
found during recovery, at most one will match the given SHA-1 hash.
This scenario is illustrated in the following example:
$ sudo mount -o umask=0 fat32.disk /mnt/disk
$ echo “My last name is Tang.” > /mnt/disk/TANG.TXT
$ echo “My first name is Yang.” > /mnt/disk/YANG.TXT
$ sync
$ rm /mnt/disk/TANG.TXT /mnt/disk/YANG.TXT
$ sudo umount /mnt/disk
$ ./nyufile fat32.disk -r TANG.TXT
TANG.TXT: multiple candidates found
When the file is successfully recovered with SHA-1, your program should print
filename: successfully recovered with SHA-1 .
Note that you can use the sha1sum command to compute the SHA-1 hash of a file:
Also note that it is possible that the file is empty or occupies only one cluster. The SHA-1 hash for an
empty file is da39a3ee5e6b4b0d3255bfef95601890afd80709 .
If no such file matches the given SHA-1 hash, your program should print filename: file not found .
For example:
$ ./nyufile fat32.disk -r TANG.TXT -s c91761a2cc1562d36585614c8c680ecf5712e875
TANG.TXT: successfully recovered with SHA-1
$ ./nyufile fat32.disk -l
HELLO.TXT (size = 14, starting cluster = 3)
DIR/ (size = 0, starting cluster = 4)
EMPTY (size = 0, starting cluster = 0)
TANG.TXT (size = 22, starting cluster = 5)
Total number of entries = 4
$ sha1sum /mnt/disk/TANG.TXT
c91761a2cc1562d36585614c8c680ecf5712e875 /mnt/disk/TANG.TXT
$ ./nyufile fat32.disk -r TANG.TXT -s 0123456789abcdef0123456789abcdef01234567
TANG.TXT: file not found
The OpenSSL library provides a function SHA1() , which computes the SHA-1 hash of d[0…n-1] and
stores the result in md[0…SHA_DIGEST_LENGTH-1] :
You need to add the compiler option -l crypto to link with the OpenSSL library.
Milestone 8: recover a non-contiguously allocated file
Finally, the clusters of a file are no longer assumed to be contiguous. You have to try every
permutation of unallocated clusters on the file system in order to find the one that matches the SHA1 hash.
The command-line option is -R filename -s sha1 . The SHA-1 hash must be given.
Note that it is possible that the file is empty or occupies only one cluster. If so, -R behaves the same
as -r , as described in Milestone 7.
For Milestone 8, you can assume that the entire file is within the first 12 clusters, so that a bruteforce search is feasible.
#include <openssl/sha.h>
#define SHA_DIGEST_LENGTH 20
unsigned char *SHA1(const unsigned char *d, size_t n, unsigned char *md);
If you cannot find a file that matches the given SHA-1 hash, your program should print
filename: file not found .
FAT32 data structures
For your convenience, here are some data structures that you can copy and paste. Please refer to the
lecture slides for details on the FAT32 file system layout.
Boot sector
#pragma pack(push,1)
typedef struct BootEntry {
unsigned char BS_jmpBoot[3]; // Assembly instruction to jump to boot code
unsigned char BS_OEMName[8]; // OEM Name in ASCII
unsigned short BPB_BytsPerSec; // Bytes per sector. Allowed values include 512, 1024, 2048, and 409
unsigned char BPB_SecPerClus; // Sectors per cluster (data unit). Allowed values are powers of 2,
unsigned short BPB_RsvdSecCnt; // Size in sectors of the reserved area
unsigned char BPB_NumFATs; // Number of FATs
unsigned short BPB_RootEntCnt; // Maximum number of files in the root directory for FAT12 and FAT16
unsigned short BPB_TotSec16; // 16-bit value of number of sectors in file system
unsigned char BPB_Media; // Media type
unsigned short BPB_FATSz16; // 16-bit size in sectors of each FAT for FAT12 and FAT16. For FAT32
unsigned short BPB_SecPerTrk; // Sectors per track of storage device
unsigned short BPB_NumHeads; // Number of heads in storage device
unsigned int BPB_HiddSec; // Number of sectors before the start of partition
unsigned int BPB_TotSec32; // 32-bit value of number of sectors in file system. Either this val
unsigned int BPB_FATSz32; // 32-bit size in sectors of one FAT
unsigned short BPB_ExtFlags; // A flag for FAT
unsigned short BPB_FSVer; // The major and minor version number
unsigned int BPB_RootClus; // Cluster where the root directory can be found
unsigned short BPB_FSInfo; // Sector where FSINFO structure can be found
unsigned short BPB_BkBootSec; // Sector where backup copy of boot sector is located
unsigned char BPB_Reserved[12]; // Reserved
unsigned char BS_DrvNum; // BIOS INT13h drive number
unsigned char BS_Reserved1; // Not used
unsigned char BS_BootSig; // Extended boot signature to identify if the next three values are
unsigned int BS_VolID; // Volume serial number
unsigned char BS_VolLab[11]; // Volume label in ASCII. User defines when creating the file system
unsigned char BS_FilSysType[8]; // File system type label in ASCII
} BootEntry;
#pragma pack(pop)
Directory entry
Compiling
We will grade your submission on a CentOS 7.9 system. We will compile your program using gcc
4.8.5. You must provide a Makefile , and by running make , it should generate an executable file
named nyufile in the current working directory. Note that you need to add the compiler option
-l crypto .
Testing
#pragma pack(push,1)
typedef struct DirEntry {
unsigned char DIR_Name[11]; // File name
unsigned char DIR_Attr; // File attributes
unsigned char DIR_NTRes; // Reserved
unsigned char DIR_CrtTimeTenth; // Created time (tenths of second)
unsigned short DIR_CrtTime; // Created time (hours, minutes, seconds)
unsigned short DIR_CrtDate; // Created day
unsigned short DIR_LstAccDate; // Accessed day
unsigned short DIR_FstClusHI; // High 2 bytes of the first cluster address
unsigned short DIR_WrtTime; // Written time (hours, minutes, seconds
unsigned short DIR_WrtDate; // Written day
unsigned short DIR_FstClusLO; // Low 2 bytes of the first cluster address
unsigned int DIR_FileSize; // File size in bytes. (0 for directories)
} DirEntry;
#pragma pack(pop)
To get started with testing, you can download a sample FAT32 disk and expand it with the following
command:
There are a few files on this disk:
HELLO.TXT – a small text file.
DIR – an empty directory.
EMPTY.TXT – an empty file.
CONT.TXT – a large contiguously-allocated file.
NON_CONT.TXT – a large non-contiguously allocated file.
You should make your own test cases and test your program thoroughly. Make sure to test your
program with disks formatted with different parameters.
We will provide sample test cases and grading script one week before the deadline.
The autograder
Please download the autograder to test your program. Note that the test cases are not exhaustive.
However, if you can’t pass these cases, you can’t expect to pass the final grading.
Submission
$ gunzip fat32.disk.gz
You will submit an archive containing all files needed to compile nyufile . You can do so with the
following command:
Rubric
The total of this lab is 100 points, mapped to 15% of your final grade of this course.
Milestone 1: validate usage. (40 points)
Milestone 2: print the file system information. (5 points)
Milestone 3: list the root directory. (10 points)
Milestone 4: recover a small file. (15 points)
Milestone 5: recover a large contiguously-allocated file. (10 points)
Milestone 6: detect ambiguous file recovery requests. (5 points)
Milestone 7a: recover a small file with SHA-1 hash. (5 points)
Milestone 7b: recover a large contiguously-allocated file with SHA-1 hash. (5 points)
Milestone 8: recover a non-contiguously allocated file. (5 points)
Tips
Don’t procrastinate
This lab requires significant programming effort. Therefore, start as early as possible! Don’t wait
until the last week.
$ tar cvJf nyufile-Your_NetID.tar.xz Makefile *.h *.c
Some general hints
Before you start, use xxd to examine the disk image to get an idea of the FAT32 layout. Keep a
backup of the hexdump.
After you create a file or delete a file, use xxd to compare the hexdump of the disk image against
your backup to see what has changed.
You can also use xxd -r to convert a hexdump back to a binary file. You can use it to “hack” a disk
image. In this way, you can try recovering a file manually before writing a program to do it. You
can also create a non-contiguously allocated file artificially for testing in this way.
Always umount before using xxd or running your nyufile program.
When updating FAT, remember to update all FATs.
Using mmap() to access the file system image is more convenient than read() or fread() .
The milestones have diminishing returns. Easier milestones are worth more points. Make sure
you get them right before trying to tackle the harder ones.
Thislab has borrowed some ideasfrom Dr. T. Y. Wong.