Description
In this first homework, you will use C to implement a rudimentary cache of words, which will
be populated with strings read from one or more input files. Your cache must be a dynamically
allocated hash table of a given fixed size that handles collisions by simply replacing the existing
word.
This hash table is really just a one-dimensional array of char * pointers. These pointers should all
initially be set to NULL, then set to point to dynamically allocated strings for each cached word.
No square brackets allowed!
To emphasize and master the use of pointers and pointer arithmetic, you are not allowed to
use square brackets anywhere in your code! As with our first lecture exercise, if a ‘[‘ or ‘]’
character is detected, including within comments, Submitty will remove that line of code before
running gcc.
To detect square brackets, consider using the command-line grep tool as shown below.
bash$ grep ‘\[‘ hw1.c
…
bash$ grep ‘\]’ hw1.c
…
Can you combine this into one grep call? As a hint, check out the man page for grep.
As shown below, square bracket expressions can generally be rewritten using pointer arithmetic by
removing the square brackets, enclosing the sum of the array variable and the index in parentheses,
then dereferencing the resulting pointer. A few equivalent examples follow:
str[32] = ‘A’;
*(str+32) = ‘A’;
values[i] += 20;
*(values+i) += 20;
results[j] = j * 3.14;
*(result+j) = j * 3.14;
if ( strcmp( a, &b[10] ) == 0 ) { … }
if ( strcmp( a, &(*(b+10)) ) == 0 ) { … }
if ( strcmp( a, b+10 ) == 0 ) { … }
Note that in this last example, we do not need to dereference the pointer since we are then using the
address-of & operator; combined, the dereference * and address-of & operators negate one another.
Command-line arguments and memory allocation
The first command-line argument specifies the size of the cache, which therefore indicates the size
of the dynamically allocated char * array that you must create. Use calloc() to create this array
of “placeholder” pointers. And use atoi() (or strtol()) to convert from a string to an integer on
the command line. The cache size must be a positive integer.
Next, your program must open and read the regular file(s) specified by the remaining command-line
arguments. Your program must parse and extract all words, if any, from each given file, in the
given order the files are listed on the command line.
Here, a word is any string of three or more alphanumeric characters; see below for how to “hash”
a word. And if a collision occurs in your cache, simply replace the existing entry.
(v1.1) To read each input file, you must use open(), read(), and close(); you may also consider
using lseek(). Any calls to library functions that make use of FILE* will be deactivated in
Submitty; this includes fopen(), fscanf(), fgets(), etc.
Initially, your cache is empty, meaning it is an array of NULL pointers. Storing each valid word
therefore also requires dynamic memory allocation. For this, use calloc() if the cache array slot
is empty; otherwise, to replace an existing value, use realloc() if the size of the required memory
differs from what is already allocated.
For words (e.g., “arch2024”), be sure to calculate the number of bytes to allocate as the length of
the given word plus one, since strings in C are implemented as char arrays that end with a ‘\0’
character.
Do not use malloc() or memset() in your code.
Is it a valid word—and how do you “hash” it?
For this assignment, words are defined as containing only alphanumeric characters (see isalnum())
and consisting of at least three characters. All other characters therefore serve as delimiters. And
note that words are case sensitive (e.g., Lion is different than lion).
(v1.2) Note that input files can be of any size and may have valid words at the beginning and/or
end of the file, i.e., the file may begin and/or end in an alphanumeric character.
To simplify your code, you can assume that the maximum valid word length is 128 bytes.
To determine the cache array index for a given word, i.e., to properly “hash” the word, write a
separate function called hash() that calculates the sum of each ASCII character in the given word
as an int variable, then applies the “mod” operator to determine the remainder after dividing by
the cache array size.
As an example, the valid word Meme consists of four ASCII characters, which sum to 77 + 101 +
109 + 101 = 388. If the cache array size was 17, for example, then the array index for Meme would
be the remainder of 388/17 or 14.
Required output
When you execute your program, you must display a line of output for each valid word that
you encounter in the given file. For each word, display the cache array index and whether you
called calloc() or realloc()—or did not need to change the already existing memory allocation.
Given the lion.txt example file, you could run your code as follows:
bash$ ./a.out 17 lion.txt
Below is sample output from the above program execution that shows the format you must follow:
Word “Once” ==> 15 (calloc)
Word “when” ==> 9 (calloc)
Word “Lion” ==> 11 (calloc)
Word “was” ==> 8 (calloc)
Word “asleep” ==> 5 (calloc)
Word “little” ==> 8 (realloc)
Word “Mouse” ==> 11 (realloc)
Word “began” ==> 16 (calloc)
Word “running” ==> 4 (calloc)
Word “and” ==> 1 (calloc)
Word “down” ==> 15 (nop)
Word “upon” ==> 8 (realloc)
Word “him” ==> 12 (calloc)
…
Further, when you have finished processing the input file(s), show the contents of the cache by
displaying a line of output for each non-empty entry in the cache. Use the following format:
Cache:
[0] ==> “they”
[1] ==> “gnawed”
[2] ==> “King”
[3] ==> “LITTLE”
[4] ==> “went”
[5] ==> “PROVE”
[6] ==> “sad”
[7] ==> “tree”
[8] ==> “little”
[9] ==> “said”
[10] ==> “MAY”
[11] ==> “Mouse”
[12] ==> “him”
[13] ==> “FRIENDS”
[14] ==> “GREAT”
[15] ==> “the”
[16] ==> “began”
Error handling
If improper command-line arguments are given, report an error message to stderr and abort further
program execution. In general, if an error is encountered, display a meaningful error message
on stderr by using either perror() or fprintf(), then aborting further program execution. Only
use perror() if the given library or system call sets the global errno variable.
Error messages must be one line only and use the following format:
ERROR: <error-text-here>
Submission instructions
To submit your assignment (and perform final testing of your code), we will use Submitty.
To help make sure that your program executes properly, use the techniques below.
First, make use of the DEBUG_MODE preprocessor technique that helps avoid accidentally displaying
extraneous output in Submitty. Here is an example:
#ifdef DEBUG_MODE
printf( “the value of q is %d\n”, q );
printf( “here12\n” );
printf( “why is my program crashing here?!\n” );
printf( “aaaaaaaaaaaaagggggggghhhh square brackets!\n” );
#endif
And to compile this code in “debug” mode, use the -D flag as follows:
bash$ gcc -Wall -Werror -D DEBUG_MODE hw1.c
Second, output to standard output (stdout) is buffered. To disable buffered output, use setvbuf()
as follows:
setvbuf( stdout, NULL, _IONBF, 0 );
You would not generally do this in practice since this can slow down your program, but to ensure
you see as much output as possible in Submitty, this is a good technique to use.