Description
Scenario
A regular expression (RE) is a sequence of characters that forms a search pattern. REs can be used for string searching and manipulation tasks such as finding, replacing, or validating text. Regular expressions are a powerful tool in many languages for handling text data, and they are useful in data cleaning, parsing, and text preprocessing.
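As a quick illustration of the three uses named above (finding, replacing, validating), here is a minimal sketch using Python's built-in `re` module. The sample sentence and the patterns are made up for illustration and are deliberately simple; they are not robust email or date validators.

```python
import re

# Made-up sample text, for illustration only.
text = "Contact alice@example.com or bob@example.org by 2024-03-15"

# Finding: collect every substring that looks like an email address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)

# Replacing: mask anything shaped like YYYY-MM-DD.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<DATE>", text)

# Validating: fullmatch succeeds only if the whole string fits the pattern.
def looks_like_iso_date(value):
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is not None

print(emails)
print(masked)
print(looks_like_iso_date("2024-03-15"), looks_like_iso_date("15/03/2024"))
```

Note that `re.fullmatch` checks shape only: a string such as `2024-99-99` would still pass this simple pattern.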
Task
This assignment has two parts:
Part A) You are given a small CSV file with five short stories, one per row. The file also contains empty columns with header labels. Use RE to extract the information needed to fill in the empty columns.
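One possible shape for Part A is sketched below. Everything specific in it is an assumption: the column names `story`, `year`, and `city`, the two sample stories, and the patterns are hypothetical stand-ins, since the real file's headers and contents will differ. The idea is simply to read each row, run one pattern per empty column against the story text, and store the first match.

```python
import csv
import io
import re

# Hypothetical stand-in for the assignment file; real headers and stories will differ.
SAMPLE_CSV = '''story,year,city
"In 1987, Maria opened a bakery in Lisbon.",,
"Tom moved to Oslo in 2003 to study music.",,
'''

# One pattern per empty column (both are naive, illustrative guesses).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")           # a four-digit year, 1900-2099
CITY_RE = re.compile(r"\b(?:in|to) ([A-Z][a-z]+)\b")  # capitalized word after "in"/"to"

def first_match(pattern, text, group=0):
    """Return the first match (or the requested group), or '' if nothing matches."""
    m = pattern.search(text)
    return m.group(group) if m else ""

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    row["year"] = first_match(YEAR_RE, row["story"])
    row["city"] = first_match(CITY_RE, row["story"], group=1)
    rows.append(row)

for row in rows:
    print(row["year"], row["city"])
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, and each empty column would get its own pattern tuned to how that information is phrased in the stories.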
Part B) Download all 5 volumes of “A System of Practical Medicine” from the Project Gutenberg library. Then use RE searches to count how many times each of the most common modern health conditions is mentioned in each volume. Your objective is to create a DataFrame (df) with five rows, one for each volume. The df should contain one column per health condition, holding that condition's frequency within the volume. Here are the most frequent health conditions:
- Heart disease
- Cancer
- Stroke
- Respiratory diseases
- Alzheimer’s disease
- Diabetes
- Influenza and Pneumonia
- Kidney diseases
- Septicemia
- Liver disease
- Hypertension
- Parkinson’s disease
- Chronic lower respiratory disease
- Accidents/injuries
- Osteoporosis
- Asthma
- Depression
- Oral health issues
- HIV/AIDS
- Tuberculosis
- Malaria
- Dengue fever
- Hepatitis
- Epilepsy
- Multiple sclerosis
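The counting step for the conditions listed above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the two short strings stand in for the downloaded volume texts, only a handful of the conditions are shown, and the patterns (including which spelling variants to accept) are illustrative choices, not a required solution.

```python
import re

# Hypothetical stand-ins for the downloaded volume texts.
volumes = {
    "Volume 1": "Asthma and pneumonia were common. "
                "Diseases of the heart often followed influenza.",
    "Volume 2": "Cancer of the liver; diabetes. ASTHMA again, and malarial fever.",
}

# One pattern per condition; the variants accepted here are assumptions.
patterns = {
    "Heart disease": r"\bheart disease\b|\bdiseases? of the heart\b",
    "Cancer": r"\bcancer(?:s|ous)?\b",
    "Diabetes": r"\bdiabet(?:es|ic)\b",
    "Influenza and Pneumonia": r"\binfluenza\b|\bpneumonia\b",
    "Asthma": r"\basthma(?:tic)?\b",
    "Malaria": r"\bmalarial?\b",
}

# Compile once, case-insensitively, so "ASTHMA" and "asthma" both count.
compiled = {name: re.compile(p, re.IGNORECASE) for name, p in patterns.items()}

def count_conditions(text):
    """Count non-overlapping matches of each condition pattern in one text."""
    return {name: len(rx.findall(text)) for name, rx in compiled.items()}

# One dict per volume: a row-per-volume, column-per-condition layout.
rows = [{"volume": vol, **count_conditions(text)} for vol, text in volumes.items()]

for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows).set_index("volume")` would give the one-row-per-volume table the task asks for. Bear in mind that a 19th-century text may use older terms for some conditions (for example "phthisis" or "consumption" for tuberculosis, "apoplexy" for stroke), so a literal search for the modern name can return zero; whether to add such variants is a design decision worth documenting in the notebook.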
Expected Output
Please submit a fully executed Jupyter notebook that clearly identifies each question number and the steps taken. Make sure to add proper commentary to your solution.