CS 4480:Programming Assignment 1 HTTP Web Proxy Server solution

1 Background

1.1 HTTP Proxies

Ordinarily, HTTP is a client-server protocol. The client (usually your web browser) communicates directly with the server (the web server software). However, in some circumstances it may be useful to introduce an intermediate entity called a proxy. Conceptually, the proxy sits between the client and the server. In the simplest case, instead of sending requests directly to the server, the client sends all its requests to the proxy. The proxy then opens a connection to the server and passes on the client’s request. The proxy receives the reply from the server and then sends that reply back to the client. Notice that the proxy is essentially acting as both an HTTP client (to the remote server) and an HTTP server (to the initial client).

Why use a proxy? There are a few possible reasons:

• Performance: By saving a copy of the pages that it fetches, a proxy can reduce the need to create connections to remote servers. This can reduce the overall delay involved in retrieving a page, particularly if a server is remote or under heavy load.

• Content Filtering and Transformation: While in the simplest case the proxy merely fetches a resource without inspecting it, nothing limits a proxy to blindly fetching and serving files. The proxy can inspect the requested URL and selectively block access to certain domains, reformat web pages (for instance, by stripping out images to make a page easier to display on a handheld or other limited-resource client), or perform other transformations and filtering.

• Privacy: Normally, web servers log all incoming requests for resources. This information typically includes at least the IP address of the client, the browser or other client program that they are using (called the User-Agent), the date and time, and the requested file. If a client does not wish to have this personally identifiable information recorded, routing HTTP requests through a proxy is one solution. All requests coming from clients using the same proxy appear to come from the IP address and User-Agent of the proxy itself, rather than the individual clients. If a number of clients use the same proxy (say, an entire business or university), it becomes much harder to link a particular HTTP transaction to a single computer or individual.

2 Assignment details

The goal of this programming assignment is to develop a simple HTTP Web Proxy Server that is capable of preventing malware from reaching a user’s system. The proxy should be capable of serving multiple concurrent requests. Your proxy only needs to support the HTTP GET method. (Credit: This programming assignment was derived from similar assignments at Stanford (Nick McKeown) and Princeton (Jennifer Rexford – COS-461) and as described in Kurose and Ross.)

The current version of the textbook covers socket programming in Python, and you are required to use Python to complete the assignment. For the basic proxy functionality, you are not allowed to use any libraries other than the standard socket libraries.

It is strongly recommended that you approach the assignment as a multi-step process: First develop a proxy that is capable of receiving a request from a client, passing that through to the origin (real) server and then passing the server’s response to the client. Then extend this basic capability so that your proxy is capable of serving multiple clients concurrently. Then enhance the functionality of the basic multi-client proxy to create a filtering proxy.

2.1 Basic multi-client proxy

Basics. Your first task is to build a web proxy capable of accepting HTTP requests, forwarding requests to remote (origin) servers, and returning response data to a client. The proxy MUST handle concurrent requests, e.g., by using the multiprocessing package in Python. You will only be responsible for implementing the GET method.
All other request methods received by the proxy should elicit a “Not Implemented” (501) error (see RFC 1945 section 9.5 – Server Error). You should not assume that the server will be running on a particular IP address, or use a particular TCP port number, or that clients will be coming from a pre-determined IP. (I.e., your proxy should correctly handle port numbers being specified in the URL.)

Listening. When your proxy starts, the first thing that it will need to do is create a socket that it can use to listen for incoming connections. Your proxy should listen on a port specified from the command line and wait for incoming client connections. Each new client request is accepted, and a new process/thread is spawned to handle the request. There could be a reasonable limit on the number of processes/threads that your proxy can create (e.g., 100). Once a client has connected, the proxy should read data from the client and then check for a properly-formatted HTTP request. Specifically, you should ensure that the proxy receives a request that contains a valid request line:

    <METHOD> <URL> <HTTP VERSION>

All other headers just need to be properly formatted:

    <HEADER NAME>: <HEADER VALUE>
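The request checks can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the names parse_request and RequestError are made up for this example, the set of "valid" methods is taken from RFC 1945 (including its appendix), and the function assumes the caller has already read the request up to the terminating blank line. It uses urllib.parse only for URL splitting; whether that counts as a "standard socket library" for the basic proxy is something to confirm.

```python
from urllib.parse import urlsplit

# Methods named in RFC 1945 (body and appendix). Treating exactly this set
# as "valid" is an assumption made for this sketch.
KNOWN_METHODS = {"GET", "HEAD", "POST", "PUT", "DELETE", "LINK", "UNLINK"}


class RequestError(Exception):
    """Carries the HTTP status the proxy should send back to the client."""

    def __init__(self, status, reason):
        super().__init__(f"{status} {reason}")
        self.status = status
        self.reason = reason


def parse_request(raw):
    """Return (host, port, path, headers) for a valid absolute-URI GET.

    `raw` is the request as bytes, including the blank line that ends the
    headers. Raises RequestError(400) for malformed input and
    RequestError(501) for valid methods other than GET.
    """
    head = raw.decode("iso-8859-1").split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")

    # Request line: <METHOD> <URL> <HTTP VERSION>
    parts = lines[0].split(" ")
    if len(parts) != 3:
        raise RequestError(400, "Bad Request")
    method, url, version = parts
    if not version.startswith("HTTP/") or method not in KNOWN_METHODS:
        raise RequestError(400, "Bad Request")
    if method != "GET":
        raise RequestError(501, "Not Implemented")

    # Only absolute http:// URLs are accepted; the port defaults to 80.
    try:
        parsed = urlsplit(url)
        port = parsed.port or 80
    except ValueError:
        raise RequestError(400, "Bad Request")
    if parsed.scheme != "http" or not parsed.hostname:
        raise RequestError(400, "Bad Request")
    path = parsed.path or "/"
    if parsed.query:
        path += "?" + parsed.query

    # Header lines: <HEADER NAME>: <HEADER VALUE>
    headers = []
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep or not name or " " in name:
            raise RequestError(400, "Bad Request")
        headers.append((name, value.strip()))
    return parsed.hostname, port, path, headers
```

A caller would catch RequestError and reply with a status line built from its status and reason fields, for example "HTTP/1.0 400 Bad Request".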

In this assignment, client requests to the proxy must be in their absolute URI form (see RFC 1945, Section 5.1.2), as your browser will send if properly configured to explicitly use a proxy (as opposed to the transparent on-path proxies that some ISPs deploy, unbeknownst to their users). An invalid request from the client should be answered with an appropriate error code, i.e., “Bad Request” (400), or “Not Implemented” (501) for valid HTTP methods other than GET. Similarly, if headers are not properly formatted for parsing, your proxy should also generate a type-400 message.

Getting Data from the Remote Server. Once the proxy has parsed the URL, it can make a connection to the requested host (using the appropriate remote port, or the default of 80 if none is specified) and send the HTTP request for the appropriate resource. The proxy should always send the request in the relative URL + Host header format regardless of how the request was received from the client:

    Accept from client:
        GET http://www.google.com/ HTTP/1.0

    Send to remote server:
        GET / HTTP/1.0
        Host: www.google.com
        Connection: close
        (Additional client specified headers, if any..)

Note that we always send an HTTP/1.0 request with a “Connection: close” header to the server, so that it will close the connection after its response is fully transmitted, as opposed to keeping open a persistent connection. So while you should pass the client headers you receive on to the server, you should make sure you replace any Connection header received from the client with one specifying “close”, as shown. After the response from the remote server is received, the proxy should send the response message as-is to the client via the appropriate socket.

2.1.1 Testing Your Proxy

Basic test. Assuming your proxy listens on port number <port> and is bound to the localhost interface, then as a basic test of functionality, try requesting a page using telnet:

    telnet localhost <port>
    Trying 127.0.0.1…
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    GET http://www.cs.utah.edu/~kobus/simple.html HTTP/1.0

If your proxy is working correctly, the headers and HTML of a (real) simple HTML document should be displayed on your terminal screen. Notice that here we request the absolute URL, http://www.cs.utah.edu/~kobus/simple.html, instead of just the relative URL. The relative URL can be requested as follows:

    telnet localhost <port>
    Trying 127.0.0.1…
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    GET /~kobus/simple.html HTTP/1.0
    Host: www.cs.utah.edu

A good sanity check of proxy behavior would be to compare the HTTP response (headers and body) obtained via your proxy with the response from a direct telnet connection to the remote server.

Concurrency test. You should also test the multi-client functionality of your proxy by requesting a page using telnet concurrently from two different shells. If you request a very simple page, like the one above, it might be difficult to show concurrency as the page downloads very quickly. A simple way to work around this is to download a large file so that you can verify that the file is concurrently being served to both clients. (For example, you can create a large file and serve it up with your own web server, e.g., using something like the Python SimpleHTTPServer, which is the http.server module in Python 3.) Alternatively, you can verify correct multi-client functionality by using telnet from two different shells (as above), but leaving the first shell in the “telnet connected” state, i.e., not actually issuing the GET request, and then issuing a GET request from the second shell.

More sophisticated test. For a slightly more complex test, you should configure your web browser to use your proxy server as its web proxy. The exact way to configure this will depend on the browser you use.
For example, when using Firefox version 15.0, (i) select Firefox>Preferences (or Edit>Preferences), (ii) select Advanced, (iii) select the Networking tab, (iv) select Settings for “Configure how Firefox connects to the Internet”, (v) select “Manual proxy configuration” and enter the IP address and port number for your proxy. (Version 15.0 is a very old version of Firefox. As detailed below you might want to install that because it still works with HTTP 1.0. Note, however, that such an old version of software will have many security vulnerabilities. You should therefore not use it for general browsing purposes.)

2.2 Malware filtering proxy

Detecting malware is a complex and ever-evolving task, well beyond the scope of this programming assignment. To realize our goal of creating a malware filtering proxy we will use a publicly available “malware lookup service”. Specifically, VirusTotal (https://www.virustotal.com/) runs an open virus scanning service which can be programmatically queried through an API. VirusTotal provides a variety of services, including the ability to upload files to be scanned, retrieving file scan reports, uploading URLs to be scanned, IP address and domain name reports, etc.

[Figure 1: Malware filtering proxy operation. Components: HTTP client (e.g., wget/curl), your filtering proxy server, VirusTotal, and an origin server (e.g., python SimpleHTTPServer) hosting a regular file and a file with malware. The arrows numbered 1 to 7 correspond to the steps below.]

Required functionality. For the purpose of this project we will limit our interaction with VirusTotal to the Retrieving file scan reports API. Specifically, your proxy will have to realize the steps shown in Figure 1 and described below:

1. An HTTP client, configured to go through your proxy, will issue a GET request for a file or object.

2. Your proxy will perform the necessary checks on the request and in turn issue a GET request to the origin (real) server.

3. The origin server will obtain the requested object and return it to your proxy in a response message.

4. If the request is successful, your proxy server will calculate the MD5 checksum of the retrieved object. (If the request was not successful, your proxy will simply send the appropriate response on to the HTTP client.)

5. Your proxy will issue a file scan report (/file/report) request to VirusTotal using the MD5 checksum of the retrieved object as the resource (and your apikey). (E.g., see the VirusTotal API here: https://developers.virustotal.com/reference.)

6. The VirusTotal service will provide a response, which your proxy will have to parse and interpret to determine whether the retrieved object contained malware.

7. Your proxy server will respond back to the HTTP client. The response from the VirusTotal service will determine the appropriate response from your proxy server to the client:

• If the object is deemed not to contain any malware, it is returned to the client in a normal HTTP response message (200 OK).

• However, if your proxy determined that the object contained malware, you should respond with a normal 200 OK HTTP response message, but replace the object with a simple HTML page indicating that the content was blocked because it is suspected of containing malware.

About VirusTotal. VirusTotal provides free (but rate limited) access to their services. You will have to register as a user to obtain an API key, which is required for programmatic access. (On the VirusTotal front page (https://www.virustotal.com), click on Sign in in the top right corner. Then click on Join the community and follow the directions presented.) For this part of the assignment you can simply use the HTTP libraries provided by your implementation language to query the VirusTotal API. For example, with Python, you can use the requests library as shown in the VirusTotal API examples. Your proxy should accept an API key as a commandline option to allow the TAs to use their own VirusTotal API keys to perform the evaluation.

Important notes.
In this assignment we will be dealing with real malware which can cause real damage to your computing environment. You should therefore be extremely careful when dealing with files containing malware. Specifically:

• As depicted in Figure 1, rather than using your browser, you should use a commandline HTTP client, like curl or wget, to retrieve objects that might contain malware.

• You should not directly download content from websites suspected of hosting malware. Rather, as depicted in Figure 1, you should obtain a sample malware file from the web, and host it on a dummy web server which you only use for this project. For example, as shown, you can simply put a sample malware file, together with a known “clean” file (e.g., any document) in a directory, and use the python SimpleHTTPServer to realize your own origin server. Several web sites exist that contain sample malware files for research purposes.

• Note that we will essentially follow a similar approach when we evaluate your filtering proxy server. I.e., we will set up our own origin server, with two files, one clean and one containing malware recognized by VirusTotal, and then verify that your proxy deals with the content in the appropriate manner.

3 Grading and evaluation

To encourage you to start early and systematically work on the assignment, there will be three submissions as outlined below. (Note that the primary purpose of dividing the work like this is to help you to systematically work through the assignment. At the discretion of the instructor, the early parts of the assignment may (or may not) be graded. I.e., you should not expect feedback on a particular part before the next part is due.)

3.1 PA 1 – A: Basic Proxy

For this first part of the assignment, the focus will be on basic (single client) proxy functionality. Specifically, we will evaluate your code by performing a test similar to the Basic test paragraph in Section 2.1.1.
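The core of the basic proxy described in Section 2.1 is rewriting the client's request into the relative URL + Host form and relaying the reply. The sketch below shows one possible way to do that and is not a reference solution. The function names are hypothetical, and dropping the client's own Host and Proxy-Connection headers (so they are not duplicated or forwarded) is an assumption of this example; the assignment itself only requires replacing the Connection header.

```python
import socket


def build_upstream_request(path, host, port, headers):
    """Build the relative-URL request the proxy sends to the origin server.

    `headers` is a list of (name, value) pairs received from the client.
    """
    host_value = host if port == 80 else f"{host}:{port}"
    lines = [f"GET {path} HTTP/1.0", f"Host: {host_value}", "Connection: close"]
    for name, value in headers:
        # Assumption: headers the proxy sets itself are dropped, not repeated.
        if name.lower() in ("host", "connection", "proxy-connection"):
            continue
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("iso-8859-1")


def recv_until_close(sock, chunk_size=4096):
    """Read until the peer closes the connection (recv returns b"")."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def fetch(host, port, request):
    """Send the request bytes to the origin server and return its full reply."""
    with socket.create_connection((host, port), timeout=10) as upstream:
        upstream.sendall(request)
        return recv_until_close(upstream)
```

Because the forwarded request always carries "Connection: close", the end of the response is signalled by the server closing the socket, which is what recv_until_close relies on. The bytes returned by fetch can then be passed to the client socket unchanged with sendall.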
What to hand in

You should submit your completed sub-assignment electronically on CADE by the due date. Your submission should consist of:

1. A single python file which implements your basic proxy. You should use the following naming convention for your python file:

    Firstname_Lastname_UID.py
    e.g., Joe_Doe_u0000000.py

Any special instructions related to your proxy should be made available via -h, --help commandline options.

To electronically submit files while logged in to a CADE machine, use:

    % handin cs4480 assignment_name name_of_file

where cs4480 is the name of the class account and assignment_name (pa1_a, pa1_final etc.) is the name of the appropriate subdirectory in the handin directory. Use pa1_a for this assignment.

3.2 PA 1 – B: Multi-client Proxy

For this part of the assignment you are to develop the multi-client proxy as described in Section 2.1. At this point the emphasis will be on the multi-client aspect of the proxy. Specifically, we will evaluate your code by performing a test similar to that described in the Concurrency test and More sophisticated test paragraphs in Section 2.1.1. We will use the Firefox browser for the latter.

What to hand in

You should submit your completed sub-assignment electronically on CADE by the due date. Your submission should consist of:

1. A single python file which implements your multi-client proxy. You should use the following naming convention for your python file:

    Firstname_Lastname_UID.py
    e.g., Joe_Doe_u0000000.py

Any special instructions related to your proxy should be made available via -h, --help commandline options.

To electronically submit files while logged in to a CADE machine, use:

    % handin cs4480 assignment_name name_of_file

where cs4480 is the name of the class account and assignment_name (pa1_a, pa1_final etc.) is the name of the appropriate subdirectory in the handin directory. Use pa1_b for this assignment.
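For the multi-client behaviour that this part targets, a minimal accept loop might look like the sketch below. It is an illustration only: it uses one thread per client from the threading module rather than the multiprocessing package mentioned in Section 2.1, the names make_listener and serve are hypothetical, and the cap of 100 workers simply follows the "e.g., 100" suggestion in the Listening paragraph.

```python
import socket
import threading

MAX_WORKERS = 100  # assumed cap, following the "e.g., 100" in Section 2.1


def make_listener(port, host="localhost"):
    """Create a TCP socket listening on the given port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    return listener


def serve(listener, handle_client, max_workers=MAX_WORKERS):
    """Accept clients forever, running handle_client(conn, addr) per thread."""
    slots = threading.BoundedSemaphore(max_workers)

    def worker(conn, addr):
        try:
            with conn:  # close the client socket when the handler returns
                handle_client(conn, addr)
        finally:
            slots.release()

    while True:
        slots.acquire()  # wait here when max_workers clients are active
        conn, addr = listener.accept()
        threading.Thread(target=worker, args=(conn, addr), daemon=True).start()
```

With this structure, a client that connects but never sends its request (the first telnet shell in the Concurrency test) only occupies its own thread, so a second client is still served.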
3.3 PA 1 – Final: Complete Assignment

For your final submission you should implement the remaining required functionality. Your final submission will be tested more thoroughly, specifically using the setup described in Section 2.2.

What to hand in

You should submit your completed assignment electronically on CADE by the due date. Your submission should consist of:

1. A single python file which implements your filtering proxy. You should use the following naming convention for your python file:

    Firstname_Lastname_UID.py
    e.g., Joe_Doe_u0000000.py

Any special instructions related to your proxy should be made available via -h, --help commandline options.

To electronically submit files while logged in to a CADE machine, use:

    % handin cs4480 assignment_name name_of_file

where cs4480 is the name of the class account and assignment_name (pa1_a, pa1_final etc.) is the name of the appropriate subdirectory in the handin directory. Use pa1_final for this assignment.

3.4 Grading

    Criteria                          Points
    Sub-assignment: PA 1 – A          10
    Sub-assignment: PA 1 – B          10
    PA 1 – Final
      Program                         70
      Inline documentation            5
      Exception handling in code      5
    Total                             100

Other important points

• Every programming assignment of this course must be done individually by a student. No teaming or pairing is allowed. Note that your code will be checked for similarity against other submissions.

• Your programs will be tested on CADE Lab Linux machines. You can develop your program(s) on any OS platform or machine but it is your responsibility to ensure that it runs on CADE Lab machines. You will not get any credit if the TA is unable to run your program(s).

• When working on CADE Lab machines, use TCP port numbers 2100 through 2120 for this programming assignment. These ports are only accessible from inside the CADE Lab network.

Additional notes

For simplicity your proxy should support version 1.0 of the HTTP protocol, as defined in RFC 1945.

When using a real browser

Only applies to basic proxy functionality.
Do not use a real browser when working with malware samples.

A significant simplification that comes from limiting the proxy functionality to version 1.0 is that your proxy does not need to support persistent connections. Recall that the default behavior for HTTP 1.0 is non-persistent connections. Note, however, that most browsers, including the latest version of Firefox, automatically add the optional “Connection: keep-alive” header line. Also, with the latest version of Firefox it is no longer possible to disable this behavior. I suggest you download an older version of Firefox so that you can disable this. I tested with Firefox 15.0, although other versions might also work. (See notes below.)

IMPORTANT: Note that older versions of browsers might have security vulnerabilities. I therefore strongly suggest that if you get an older version of Firefox to use for this programming assignment, you do not use that for general web browsing, but only for the assignment.

A further implication of using HTTP 1.0 is that you can infer that you have received all the content from the server when the server closes the connection. Also, as specified in the assignment, the request forwarded to the server should contain a “Connection: close” header line to ensure that it closes the connection. (This might imply replacing a “Connection: keep-alive” header if the client had that in its request.)

Notes about configuring Firefox: Type 'about:config' in the title/search bar. You will be presented with a very large number of parameter settings. You should set the following parameters as shown:

    network.http.proxy.version        1.0
    network.http.version              1.0
    network.http.keep-alive           false
    network.http.proxy.keep-alive     false
    network.http.pipelining           false
    network.http.proxy.pipelining     false

When using a commandline HTTP client

Commandline HTTP clients have options to select the version of HTTP used and to force the client to go through a proxy.
For example, for the curl HTTP client the following option will instruct curl to use HTTP version 1.0:

    --http1.0

Similarly, the following option will instruct curl to connect to an explicit inline proxy using the HTTP 1.0 protocol:

    --proxy1.0 <proxyhost[:port]>

Consult the documentation for your selected commandline HTTP client for more details.
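Returning to the filtering steps of Section 2.2 (steps 4 to 7), the decision logic can be sketched as below. Treat every VirusTotal detail here as an assumption to check against the API reference: the endpoint URL, the apikey and resource query parameters, and the response_code and positives fields are written from the v2 file report API as remembered, not taken from this handout. The sketch uses urllib from the standard library instead of the requests library, all function names are hypothetical, and two policy choices are also assumptions: only a 200 status counts as "successful", and an object VirusTotal has no report for is treated as clean.

```python
import hashlib
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the VirusTotal v2 file report API; verify in the docs.
VT_REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

BLOCKED_PAGE = (b"<html><body><h1>Content blocked</h1>"
                b"<p>The requested object is suspected of containing malware."
                b"</p></body></html>")


def split_response(response):
    """Split a raw HTTP response into (header block, body)."""
    head, _, body = response.partition(b"\r\n\r\n")
    return head, body


def is_success(head):
    """True if the status line carries a 200 status code."""
    parts = head.split(b"\r\n", 1)[0].split(b" ", 2)
    return len(parts) >= 2 and parts[1] == b"200"


def md5_hex(body):
    """MD5 checksum of the retrieved object, as a hex string."""
    return hashlib.md5(body).hexdigest()


def fetch_report(api_key, checksum):
    """Query the (assumed) file report endpoint and return the parsed JSON."""
    query = urllib.parse.urlencode({"apikey": api_key, "resource": checksum})
    with urllib.request.urlopen(f"{VT_REPORT_URL}?{query}", timeout=30) as reply:
        return json.load(reply)


def is_malware(report):
    """Assumed fields: response_code 1 = report exists, positives = detections."""
    return report.get("response_code") == 1 and report.get("positives", 0) > 0


def blocked_response():
    """Build the 200 OK reply that replaces an object flagged as malware."""
    head = (
        "HTTP/1.0 200 OK\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(BLOCKED_PAGE)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + BLOCKED_PAGE


def filter_response(response, api_key, lookup=fetch_report):
    """Return the bytes to send to the client for an origin server response."""
    head, body = split_response(response)
    if not is_success(head):
        return response  # unsuccessful requests are passed on unchanged
    report = lookup(api_key, md5_hex(body))
    return blocked_response() if is_malware(report) else response
```

The lookup parameter exists so that the decision logic can be exercised with a stand-in function, without spending the rate-limited API quota or touching a real malware sample.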