Description

5/5 - (1 vote)

Overview: In the not–to–distant future, you will be writing a program to create an index on a binary ﬁle. I could just give you the binary ﬁle, or merge the two assignments, but creating the binary ﬁle from a formatted text ﬁle makes for a nice “shake oﬀ the rust” assignment, plus it provides a gentle (?) introduction to binary ﬁle processing. A basic binary ﬁle contains information in the same format in which the information is held in memory. (In a standard text ﬁle, all information is stored as ASCII or UNICODE characters.) As a result, binary ﬁles are generally faster and easier for a program to read and write than are text ﬁles. When your data is wellstructured, doesn’t need to be read directly by people and doesn’t need to be ported to a diﬀerent type of system, binary ﬁles are usually the best way to store information. For this program, we have lines of data items that we need to store, and each line’s items need to be stored as a group (in a database, such a group of ﬁelds is called a record). Making this happen in Java requires a bit of eﬀort; Java wasn’t originally designed for this sort of task. By comparison, “systems” languages like C can interact more directly with the operating system to provide more convenient ﬁle I/O. On the class web page you’ll ﬁnd a sample Java binary ﬁle I/O program, which should help get you started. Assignment: To discourage procrastination, this assignment is in two parts, Part A and Part B: Part A Available from lectura is a ﬁle named jan2018ontime.csv (see also the Data section, below). This is a CSV (Comma–Separated Value) text ﬁle consisting of 570,119 lines of 19 values each about airline ﬂights in January of 2018. Using Java 1.8, write a program named Prog1A.java that creates a binary version of the text ﬁle’s content that is sorted (using a stable sort) in ascending order by the ﬁfth ﬁeld of each record (the ‘FL NUM’ ﬁeld). Additional details: • For an input ﬁle named file.csv, name the binary ﬁle file.bin. (That is, keep the ﬁle name, but change the extension.) Don’t put a path on it; just create it in the current directory. • Field types are limited to int , double, and String. Follow the lead of the data. For example, if values of a ﬁeld are in quotes, assume that the ﬁeld is a string (and don’t store the quote characters). • For each column, all values must consume the same quantity of bytes. This is easy for numeric columns, but for alphanumeric columns we don’t want to waste storage by choosing a huge maximum length. Instead, you need to determine the number of characters in each ﬁeld’s longest value, and use that as the length of each value in that ﬁeld. This must be done for each execution of the program. (Why? Your program needs to work on other ﬂight data ﬁles, too. You may assume that the ﬁeld order, types, and quantity will not change.) When necessary, pad strings on the right with spaces to reach the needed length(s). (For example, “abc “, where ” ” represents the space character. • Because the maximum lengths of the text ﬁelds can be diﬀerent for diﬀerent input ﬁles, you will need to store these lengths in the binary ﬁle so that your Part B program can access them. One possibility is to store the maximum lengths at the end of the binary ﬁle (after the last data record). Repeating the info at the top of this handout: Part A is due in just one week; start today! (Continued…)
DRAFT Part B Write another complete, well-documented Java 1.8 program named Prog1B.java that performs both of the following tasks, in this order: 1. Reads and prints to the screen the UNIQUE CARRIER, FL NUM, ORIGIN, and DEST ﬁeld values of the ﬁrst ﬁve records of data, the middle three records (or middle four records, if the quantity of records is even), and the last ﬁve records of data in the binary ﬁle. Conclude the output with the total number of records in the binary ﬁle, on a new line. 2. Using one or more FL NUM values given by the user, locates within the binary ﬁle (using interpolation search; see below), and displays to the screen, the same four ﬁeld values of all records having the given FL NUM value. Some output details: • Display the data one record per line, with a single space between the ﬁeld values. • seek() the Java API and ye shall ﬁnd a method that will help you read the middle and last records of the binary ﬁle. • Use a loop to prompt the user for the FL NUM values. Terminate the program when a zero is entered. Data: Write your programs to accept the complete data ﬁle pathname as a command line argument (for Prog1A, that will be the pathname of the data ﬁle, and for Prog1B, the pathname of binary ﬁle created by Prog1A). The complete lectura pathname of our data ﬁle is /home/cs460/fall18/jan2018ontime.csv. Prog1A (when running on lectura, of course) can read the ﬁle directly from that directory; just provide that path when you run your program. (That is, there’s no reason to waste disk space by making a copy of the ﬁle in your account.) Each of the lines in the ﬁle contains 19 ﬁelds (columns) of information. Here is one of them: 2018-01-01,”EV”,20366,”N604QX”,”2853″,”SAF”,”DFW”,””,,,””,””,,””,,1.00,”A”,,551.00, Some ﬁle information: • The ﬁeld names can be found in the ﬁrst line of the ﬁle. Because that line contains metadata, its content must not be stored in the binary ﬁle. • In some records, such as this sample, some ﬁeld values are missing (see the adjacent commas?). However, the binary ﬁle requires values to maintain the common record length. For missing numeric ﬁelds, store -1. For missing strings, store a string containing the appropriate quantity of spaces. • Please keep in mind that we have not combed through the data to see that it is all formatted perfectly. This is completely intentional. Corrupt data is a huge problem in data management. We hope that this ﬁle holds a few surprises. We want you to think about how to deal with malformed data, and to ask questions of us as necessary. Output: Output details for each program are stated in the Assignment section, above. Please ask (preferably on Piazza) if you need additional details. Hand In: You are required to submit your completed program ﬁles using the turnin facility on lectura. The submission folder is cs460p1. Instructions are available from the document of submission instructions linked to the class web page. In particular, because we will be grading your program on lectura, it needs to run on lectura, so be sure to test it on lectura. Feel free to split up your code over additional ﬁles if doing so is appropriate to achieve acceptable code modularity. Submit all ﬁles as-is, without packaging or compression. (Continued…) 2
DRAFT Want to Learn More? • https://www.bts.gov/topics/airlines-and-airports-0 — The provided sample data was downloaded from the U.S. Department of Transportation’s Bureau of Transportation Statistics web site. Other ﬁelds are available. If you’re curious: Click the “On-Time” drop down, and choose “Airline On-Time Performance Databases”. On the next page, click “Download”. On the following page, select the desired ﬁelds and click “Download” to create CSV. Other Requirements and Hints: • Don’t “hard–code” values in your program if you can avoid it. For example, don’t assume a certain number of records in the input ﬁle or the binary ﬁle. Your program should automatically adapt to simple changes, such as more or fewer lines in a ﬁle, or changes to the ﬁle names and paths. For example, we may test your program with a ﬁle of just two records, or even no records. We expect that your program will handle such situations gracefully. • Once in a while, a student will think that “create a binary ﬁle” means “convert all the data into the characters ‘0’ and ‘1’.” Don’t do that! The binary I/O functions in Java will read/write the data in binary format automatically. See the example program on the class web page. • Try this: Comment your code according to the style guidelines as you write the code (not an hour before the due date and time!). Explaining in words what your code must accomplish before you write that code is likely to result in better code. The requirements and some examples are available from: https://u.arizona.edu/~mccann/style.html • You can make debugging easier by using only a few lines of data from the data ﬁle for your initial testing. Try running the program on the complete ﬁle only when you can process a few reduced data ﬁles. • Late days can be used on each part of the assignment, if necessary, but we are limiting you to at most two late days on Part A. For example, you could burn one late day by turning in Part A 18 hours late, and three more by turning in Part B two and a half days late. Of course, it’s best if you don’t use any. • Finally: Start early! File processing can be tricky. Interpolation Search Interpolation Search is an enhanced binary search. To be most eﬀective, these conditions must exist: (1) The data is stored in a direct-access data structure (such as a binary ﬁle of uniformly-sized records), (2) the data is in sorted order by the search key, (3) the data is uniformly distributed, and (4) there’s a lot of data. In such situations, the reduction in quantity of probes over binary search is likely to be particularly beneﬁcial given the inherent delay that exists in ﬁle accesses. Our data falls short on (3) and (4), but that’s OK; the search will still work. Interpolation Search is just like binary search, with one change: Instead of probing the data at the midpoint (one-half of low index plus high index), we use the following probe index calculation: probe index = low index + target − key[low index] key[high index] − key[low index] · (high index − low index) For example, consider a binary ﬁle of 60,000 records (indices 0 through 59,999), with keys that range from 100 through 150,000, and a target of 125,000. Thus, low index = 0, high index = 59999, key[low index] = 100, key[high index] = 150000, and target = 125000. Our ﬁrst probe into the ﬁle would be into record number 0 +l125000−100 150000−100 · (59999 − 0)m= 49993. 3

CSc 460 Program #1: Creating and Interpolation–Searching a Binary File solved

Download Details:

Description

CSc 460 Program #1: Creating and Interpolation–Searching a Binary File solved

Download Details:

Description

Related products

CSc 460 Program #3: JDBC solved

Solved CSc 460 Program #1: Creating and Exponentially–Searching a Binary File

Solved CSc 460 Program #2: Linear Hashing Lite