Description
Oil Wells Analysis and Visualization
This assignment is the first part of a two-part assignment that gives you practice with PDF text extraction, web scraping, data preprocessing, and visualization.
You will learn how to collect and organize data from PDFs and how to create a web interface to visualize the collected information. In this lab, you will work with your team on PDF text extraction.
You will also preprocess the data to handle missing values, fetch additional data from the web, and store the results in a database.
1) Initial Setup
You may use OCRmyPDF, PyPDF, PyTesseract, requests/selenium, beautifulsoup4, and a MySQL database for this assignment. The installations are similar to the ones used earlier, so you should already have them set up. You will need to use Python scripts and other useful tools in the Linux environment. (Make sure to document any setup steps or requirements for running your scripts in your submission document.) Do not spend much time on installation and setup; invest your time in exploring the concepts and improving your submission.
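One way to verify and document the setup is a small check script. The sketch below is only an illustration: the module and command lists are assumptions about which of the allowed tools a team chose, and the helper names are hypothetical.

```python
import importlib.util
import shutil

# ASSUMPTION: these lists reflect one possible tool choice, not a requirement.
# They are the import names of the packages (beautifulsoup4 imports as bs4).
PYTHON_MODULES = ["ocrmypdf", "pypdf", "pytesseract", "requests", "bs4"]
COMMANDS = ["tesseract", "ocrmypdf"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the command-line tools that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Missing commands:", missing_commands(COMMANDS))
```

The output of a script like this can be pasted into the readme as a record of what the scripts need in order to run.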
2) Data Collection / Storage
For this task, you will focus on creating the database tables, parsing PDF files, and collecting information from the given websites. This set of assignments focuses on information related to oil wells, such as their physical location and specifications. In Part 2, you will create a webpage to plot this information on maps and visualize the collected data.
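A possible table layout is sketched below. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names, and helper functions are assumptions made for illustration; the columns mirror the fields this assignment asks for. With MySQL, the placeholder syntax depends on the driver, and `INSERT OR IGNORE` becomes `INSERT IGNORE`.

```python
import sqlite3

# ASSUMPTION: table and column names are illustrative, not prescribed.
WELLS_SCHEMA = """
CREATE TABLE IF NOT EXISTS wells (
    api_number   TEXT PRIMARY KEY,
    well_name    TEXT,
    latitude     REAL,
    longitude    REAL,
    address      TEXT,
    well_status  TEXT,
    well_type    TEXT,
    closest_city TEXT,
    oil_produced REAL,
    gas_produced REAL
)
"""


def create_tables(conn):
    """Create the wells table if it does not exist yet."""
    conn.execute(WELLS_SCHEMA)
    conn.commit()


def insert_well(conn, record):
    """Store the fields extracted from one PDF; skip duplicates by API#."""
    conn.execute(
        "INSERT OR IGNORE INTO wells "
        "(api_number, well_name, latitude, longitude, address) "
        "VALUES (:api_number, :well_name, :latitude, :longitude, :address)",
        record,
    )
    conn.commit()


def add_scraped_fields(conn, api_number, scraped):
    """Append the web-scraped fields to an existing row."""
    conn.execute(
        "UPDATE wells SET well_status = :well_status, well_type = :well_type, "
        "closest_city = :closest_city, oil_produced = :oil_produced, "
        "gas_produced = :gas_produced WHERE api_number = :api_number",
        {**scraped, "api_number": api_number},
    )
    conn.commit()
```

Keeping the PDF fields and the scraped fields in one row keyed by API# makes it simple to fill in the web data later; a separate table for stimulation data would follow the same pattern.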
3) PDF Extraction
We will use this Drive folder for the assignment: https://drive.google.com/drive/u/4/folders/12g-bhOylyaMoLF5djocnAeZHBx-gsxgY
The folder contains PDFs of scanned images for different oil wells, along with information and specifications related to them. Download a copy of the folder to your local machine.
Your task is to write a Python script that iterates over all the PDFs in the folder, extracts the information from each PDF, and stores it in your database tables. Every PDF has well-specific information and stimulation data (how much proppant chemical was injected after drilling).
Figure 1: Relevant data include API#, longitude, latitude, well name and number, address, and any other relevant fields.
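The extraction loop described above could be sketched as follows. Because the PDFs are scanned images, the sketch first runs the ocrmypdf command-line tool to add a text layer and then reads the text with pypdf. The regular expressions are assumptions about how the field labels appear after OCR, and the API# format (two, three, and five digits separated by hyphens) is also an assumption; inspect a few real PDFs and adjust them. All function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# ASSUMPTION: label wording and number formats are guesses; tune them
# against the OCR output of the actual PDFs.
FIELD_PATTERNS = {
    "api_number": re.compile(
        r"API\s*(?:#|No\.?|Number)?\s*:?\s*(\d{2}-\d{3}-\d{5})", re.I
    ),
    "well_name": re.compile(r"Well Name(?: and Number)?\s*:?\s*(.+)", re.I),
    "latitude": re.compile(r"Latitude\s*:?\s*(-?\d{1,3}\.\d+)", re.I),
    "longitude": re.compile(r"Longitude\s*:?\s*(-?\d{1,3}\.\d+)", re.I),
}


def add_text_layer(src, dst):
    """Run the ocrmypdf CLI so scanned pages gain a searchable text layer."""
    subprocess.run(["ocrmypdf", "--skip-text", str(src), str(dst)], check=True)


def read_pdf_text(path):
    """Concatenate the text of every page of a PDF."""
    from pypdf import PdfReader  # lazy import; requires `pip install pypdf`

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_well_fields(text):
    """Return a dict of extracted fields; fields not found are None."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


def process_folder(folder, ocr_dir):
    """Yield (file name, extracted fields) for every PDF in the folder."""
    ocr_dir = Path(ocr_dir)
    ocr_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(folder).glob("*.pdf")):
        dst = ocr_dir / src.name
        add_text_layer(src, dst)
        yield src.name, parse_well_fields(read_pdf_text(dst))
```

Separating the parsing step from the OCR step means the regular expressions can be tested on saved text without rerunning OCR, which is usually the slowest part.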
Figure 2: For stimulation data, extract all the fields shown in the snapshot above.
4) Additional Web-Scraped Information
In this part, we will use the API# and well name extracted above to gather additional information about the wells from Internet sources.
Your task is to iterate over each row of the database and, for each entry, use the API# and well name to run a search query on this page: https://www.drillingedge.com/search
Once you get the search results, open the well page and gather the well status, well type, closest city, and barrels of oil and gas produced. The required fields are highlighted in yellow in the example snapshot below.
Figure 3: Search results on drillingedge.com
Append the extracted information as additional fields to the existing entries in the database.
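The search-and-scrape step could look roughly like the sketch below. Two things here are assumptions that must be checked against the live site: the query parameter names in `build_search_url`, and the idea that the well page presents its fields as `<th>` label and `<td>` value pairs. To stay runnable without extra installs, the parsing uses the standard library's html.parser in place of beautifulsoup4, which would do the same job with less code. All function and class names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

SEARCH_URL = "https://www.drillingedge.com/search"


def build_search_url(api_number, well_name):
    """Build a search URL for one database row."""
    # ASSUMPTION: parameter names are placeholders; inspect the real
    # search form in the browser and replace them as needed.
    params = {"type": "wells", "api_no": api_number, "well_name": well_name}
    return f"{SEARCH_URL}?{urlencode(params)}"


class LabelValueParser(HTMLParser):
    """Collect <th>label</th><td>value</td> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None      # "th" or "td" while inside such a cell
        self._label = None    # most recent <th> text awaiting a value
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buffer).split())
        if tag == "th":
            self._label = text
        elif self._label is not None:
            self.pairs[self._label] = text
            self._label = None
        self._tag = None


def extract_label_values(page_html):
    """Return a dict mapping each label on the page to its value."""
    parser = LabelValueParser()
    parser.feed(page_html)
    parser.close()
    return parser.pairs


def fetch(url):
    """Download a page and return its HTML."""
    import requests  # lazy import; requires `pip install requests`

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```

A typical run would call `fetch(build_search_url(api, name))`, follow the link to the well page, and pass that page's HTML to `extract_label_values` before writing the selected fields back to the matching database row.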
5) Data Preprocessing
Preprocess the data by removing HTML tags, special characters, and irrelevant information before storing it in the database. Transform the data into a format suitable for analysis, for example by converting timestamps. Replace any missing data with 0 or N/A, for both web-scraped information and text extracted from PDF files.
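The preprocessing rules above could be sketched as three small helpers. This is a minimal illustration using only the standard library: the set of characters treated as "special", the accepted date formats, and the function names are all assumptions to adapt to your data.

```python
import html
import re
from datetime import datetime

TAG_RE = re.compile(r"<[^>]+>")
# ASSUMPTION: which characters count as "special" is a judgment call;
# everything outside this allow-list is dropped.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s.,#/&:()'-]")


def clean_text(value, missing="N/A"):
    """Strip HTML tags and special characters; use N/A for missing text."""
    if value is None:
        return missing
    text = html.unescape(TAG_RE.sub(" ", str(value)))
    text = SPECIAL_RE.sub("", text)
    text = " ".join(text.split())  # collapse runs of whitespace
    return text or missing


def clean_number(value, missing=0):
    """Pull the first number out of a string; use 0 for missing values."""
    if value is None:
        return missing
    match = re.search(r"-?\d[\d,]*\.?\d*", str(value))
    return float(match.group().replace(",", "")) if match else missing


def to_iso_date(value, formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Convert a date string to YYYY-MM-DD; use N/A if it cannot be parsed."""
    if not value:
        return "N/A"
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return "N/A"
```

Applying these helpers just before each database write keeps the stored values consistent regardless of whether they came from OCR text or from a web page.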
6) Resources
Extracting text from PDFs using PyPDF2: https://automatetheboringstuff.com/chapter13/
OCRmyPDF: https://github.com/ocrmypdf/OCRmyPDF
Understanding PyTesseract: https://nanonets.com/blog/ocr-with-tesseract/
7) Team Discussions
Your team is expected to meet in person or virtually each day of the week to discuss the assignment progress and next steps. Document and compile the minutes of all meetings in a separate file called ‘meeting_notes_A5_P1_.pdf’.
8) Submission
Make one submission per team. Each team must submit all the code files for the working solution, a readme document in PDF format that explains how to run the code, and a document in PDF format that contains the minutes of all team meetings. Provide one video per team that demonstrates the entire working solution, explains how the data tables were loaded, demonstrates query results, and discusses your design decisions and the reasoning behind them. Also include details about how your team preprocessed the data. Please include the team name and the names of all three team members in the video. All late submissions incur a 50% penalty.