Description
Program Goals
- Get familiar with Python syntax
- Get familiar with a Python development/execution environment
- Implement a basic AI algorithm
Be advised:
Some (many? most?) of you will never have used Python before, so we’re going to keep the first assignment a bit conceptually boring and hold your hand a little bit more than is typical in a 500-level course. We’ll expect more of you pretty quickly, though.
We’ve created a resourceLinks to an external site. Download We’ve created a resourceLinks to an external site. for those of you transitioning from Java to Python, check it out!
Summary
This project includes two separate implementations:
- State space generation for the water jug puzzle (code in a file called p1_statespace.py)
- Weather prediction from historical data (code in a file called p1_weather.py)
We’ll walk you through the concepts and provide quite a bit of scaffolding for each of the implementations, since you will be expected to complete them both in Python, which many of you do not know (or haven’t used in a while).
Part 1: State Space Generation
The code for this portion should be written in a file called p1_statespace.py.
You began generating this state space manually in class on January 23 — the problem of the two water jugs of fixed integer size. We’ll be using this as an excuse to practice using Python lists.
For the purposes of this assignment, we will be representing each state as a list of two ints. Please also maintain a separate list of two ints with the respective capacities of the jugs.
There are three operations you can perform on a jug:
- Fill jug to capacity from faucet
- Empty jug onto ground
- Pour into another jug until that jug is full or this jug is empty
For this component of implementation, please implement four (4) Python functions:
- fill(state, max, which) — return a copy of
state
which fills the jug corresponding to the index inwhich
(0 or 1) to its maximum capacity. Do not modifystate
. - empty(state, max, which) — return a copy of
state
which empties the jug corresponding to the index inwhich
(0 or 1). Do not modifystate
. - xfer(state, max, source, dest) — return a copy of
state
which pours the contents of the jug at indexsource
into the jug at indexdest
, untilsource
is empty ordest
is full. Do not modifystate
. - succ(state, max) — display the list of unique successor states of the current state in any order.
Each of these functions should take as an argument the current state of the two water jugs as a list of two ints, as well as the maximum capacities of each jug as a list of two ints.
These functions should work as follows (though be sure to test them more thoroughly yourself):
>>> max = [5,7] >>> s0 = [0,0] >>> fill(s0,max,1) => [0,7] >>> s0 => [0,0] >>> s1 = fill(s0,max,1) >>> xfer(s1,max,1,0) => [5,2] >>> succ(s0,max) [0,0] [5,0] [0,7]
You may assume our states and capacities will be valid, non-negative integers presented in a valid, two-element list, and indexes will be valid. Error-check if you want the practice, but it will not be part of your score.
Be sure to test your functions with capacities other than 5 and 7! We will 🙂
In the event that a function does not modify the state (e.g. you fill a jug which is already at capacity or empty a jug which is already empty), you should still return a list whose contents are identical to the original state.
>>> s0 == empty(s0,max,1) => True
Part 2: Weather Prediction
The code for this portion should be written in a file called p1_weather.py.
This implementation will require some basic math, file I/O, and use of some built-in Python data structures, and is a little more involved than the previous one.
You’ll be using a version of the k-Nearest Neighbors algorithmLinks to an external site. (we’ll look at it in much more depth later this semester) to predict whether we expect it to be raining in Seattle based on various weather conditions.
For this component of implementation, please implement four (4) Python functions:
- euclidean_distance(data_point1, data_point2) – return the Euclidean distance between two dictionary data points from the data set.
- read_dataset(filename) – return a list of data point dictionaries read from the specified file.
- majority_vote(nearest_neighbors) – return a prediction of whether it is raining or not based on a majority vote of the list of neighbors.
- k_nearest_neighbors(filename, test_point, k) – using the above functions, return the majority vote prediction for whether it’s raining or not on the provided test point.
1. Euclidean distance
This is what you probably think of as a “normal” distance function:
except in this case we’re taking the distance between points in three dimensional space, where those dimensions are the precipitation amount, maximum temperature, and minimum temperature for the day. (You can ignore the date when doing this calculation.)
>>> euclidean_distance({'DATE': '1951-05-19', 'TMAX': 66.0, 'PRCP': 0.0, 'TMIN': 43.0, 'RAIN': 'FALSE'},{'DATE': '1951-01-27', 'TMAX': 33.0, 'PRCP': 0.0, 'TMIN': 19.0, 'RAIN': 'FALSE'}) => 40.80441152620633 >>> euclidean_distance({'DATE': '2015-08-12', 'TMAX': 83.0, 'PRCP': 0.3, 'TMIN': 62.0, 'RAIN': 'TRUE'}, {'DATE': '2014-05-19', 'TMAX': 70.0, 'PRCP': 0.0, 'TMIN': 50.0, 'RAIN': 'FALSE'}) => 17.694349380522585
2. Read dataset
Okay, so why do the data points in the previous example code look like that?
Because you’ll be reading them in from rain.txtLinks to an external site. Download rain.txtLinks to an external site..
Each line of the file looks something like this:
1948-01-01 0.47 51 42 TRUE
where the first entry is the DATE, the second entry is the PRCP (precipitation), the third entry is the TMAX (maximum temperature), the fourth entry is the TMIN (minimum temperature) and the last entry is a boolean representing RAIN (what we’ll be predicting for new data points).
In this function, you must read the file in line by line, split each line out with its spaces, and create a dictionary with the keys listed above and the values on the line, where the numeric values have been converted to floats. The function should return a list with one dictionary for each line in the file.
The sample line provided above, for example, should produce the following dictionary:
{'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.47, 'TMIN': 42.0, 'RAIN': 'TRUE'}
We won’t show you the whole result of testing this function, but here are some things you can try:
>>> dataset = read_dataset('rain.txt') >>> len(dataset) => 25548 >>> dataset[0] => {'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.47, 'TMIN': 42.0, 'RAIN': 'TRUE'}
Again, here’s the file for you to read in: rain.txtLinks to an external site. Download rain.txtLinks to an external site.
3. Majority vote
Very simply, a k-Nearest Neighbor (or kNN) algorithm looks at the classes assigned to the k points closest to the query point, and says that the query point has the same label as most of its neighbors (in this case, whether it was raining or not). This function takes in a list of those neighbors, and returns the string representing whether it’s raining or not. (Note: this function does not return a boolean! Booleans in Python look like True
or False
. We want all-caps.)
If a tie occurs, default to ‘TRUE’ as your answer. This is Seattle, after all.
>>> majority_vote([{'DATE': '2015-08-12', 'TMAX': 83.0, 'PRCP': 0.3, 'TMIN': 62.0, 'RAIN': 'TRUE'}, {'DATE': '2014-05-19', 'TMAX': 70.0, 'PRCP': 0.0, 'TMIN': 50.0, 'RAIN': 'FALSE'}, {'DATE': '2014-12-05', 'TMAX': 55.0, 'PRCP': 0.12, 'TMIN': 44.0, 'RAIN': 'TRUE'}, {'DATE': '1954-09-08', 'TMAX': 71.0, 'PRCP': 0.02, 'TMIN': 55.0, 'RAIN': 'TRUE'}, {'DATE': '2014-08-27', 'TMAX': 84.0, 'PRCP': 0.0, 'TMIN': 61.0, 'RAIN': 'FALSE'}]) => 'TRUE'
4. k-Nearest Neighbors
In this function, you’ll be given the path to a correctly-formatted file (to read using your function), a test point dictionary that contains all of the same keys (except the label ‘RAIN’), and the value of k indicating how many neighbors to select.
Find the closest k neighbors using your Euclidean distance, and return their majority vote on whether it’s raining or not in the test point.
>>> k_nearest_neighbors('rain.txt', {'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.47, 'TMIN': 42.0}, 2) => 'TRUE' >>> k_nearest_neighbors('rain.txt', {'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.00, 'TMIN': 42.0}, 2) => 'FALSE' >>> k_nearest_neighbors('rain.txt', {'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.00, 'TMIN': 42.0}, 10) => 'FALSE' >>> k_nearest_neighbors('rain.txt', {'DATE': '1948-01-01', 'TMAX': 51.0, 'PRCP': 0.05, 'TMIN': 42.0}, 10) => 'TRUE'
You may notice that you could take the ‘PRCP’ element, check whether it’s nonzero, and simply make your prediction on that. Very good: you’re right! This is actually a pretty trivial classification task. In practice, it’s sometimes better (and more accurate) to be less clever with your implementation. For the purposes of this assignment, though, pretend you’re an absent-minded data scientist or just trying to be Very Impressive. 😉
Submission
Please submit both p1_statespace.py and p1_weather.py at the same time so the TAs can grade you easily. You may submit more than once; unless you specify otherwise, we will grade your last submission.