Description
1. (Machine Learning (Classification))
a. Choose one of the toy classification datasets bundled with sklearn other than the digits dataset.
b. Train three distinct sklearn classification estimators for the chosen dataset and compare the results to see which one performs the best when using 2-fold cross-validation. Note that you should use three distinct classification models here (not just tweak underlying parameters). A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) — but make sure you only use classifiers! Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.
c. Repeat part b using 20-fold cross-validation. Explain in a paragraph the difference in your results (if any) when using 20-fold rather than 2-fold cross-validation.
d. Construct a confusion matrix for your most accurate model across the three estimators and two cross-validation options. Which class in your dataset does the best classifier predict most accurately, and which class is most likely to be confused with one or more of the other classes? (A minimal code sketch covering parts b through d follows this problem.)
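A minimal sketch of parts b through d, assuming the wine dataset as the chosen toy dataset and logistic regression, k-nearest neighbors, and Gaussian naive Bayes as the three estimators (these are illustrative choices, not required ones); the confusion-matrix step should plug in whichever model and fold count scored best in your own runs:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

models = {
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GaussianNB": GaussianNB(),
}

# Parts b and c: compare the three classifiers under 2-fold and 20-fold CV.
for folds in (2, 20):
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=folds)  # default scoring: accuracy
        print(f"{folds:>2}-fold  {name:<22} mean accuracy = {scores.mean():.3f}")

# Part d: confusion matrix for the best estimator/fold combination
# (LogisticRegression with 20 folds is only a placeholder here).
best = LogisticRegression(max_iter=5000)
y_pred = cross_val_predict(best, X, y, cv=20)
print(confusion_matrix(y, y_pred))
```

Because cross_val_score uses stratified folds for classifiers when cv is an integer, 20-fold splitting still leaves every class represented in each fold for this dataset.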
2 (Option I). (Trends, Searches, and Sentiment)
a. Use the Twitter Trends API to determine the available trending topics for a city of your choice, assigning a tweet volume of 5000 to any trend with no volume provided.
b. After sorting the trends in descending order by volume, create a bar graph with each (sorted) trend on the x-axis against its volume on the y-axis.
c. Use the Twitter Search API to find 20 tweets for each of the three most popular trends in the chosen city, and preprocess their associated tweet text (preferring extended tweet text, if available) in a manner appropriate for tweets.
d. Use TextBlob to determine the sentiment for each set of 20 tweets.
i. Do you notice a substantial difference in the proportion of positive and negative sentiment for the three trends? Try to theorize why or why not.
ii. Do you believe the sentiment analysis to be reliable for any or all of the trends? Explain why or why not. (A code sketch covering parts a through d follows this problem.)
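A hedged sketch of parts a through d, assuming Tweepy 4.x with standard Twitter API v1.1 access; the method names get_place_trends and search_tweets follow Tweepy 4.x, and the WOEID and credentials are placeholders to adjust for your own city and keys:

```python
import re

import matplotlib.pyplot as plt
import tweepy
from textblob import TextBlob

# Placeholder credentials and city; substitute your own keys and WOEID.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
NYC_WOEID = 2459115  # Yahoo! Where On Earth ID for New York City

# Part a: fetch the trends, assigning a volume of 5000 where none is reported.
trends = api.get_place_trends(NYC_WOEID)[0]["trends"]
for trend in trends:
    if trend.get("tweet_volume") is None:
        trend["tweet_volume"] = 5000

# Part b: sort descending by volume and plot trend vs. volume.
trends.sort(key=lambda t: t["tweet_volume"], reverse=True)
plt.figure(figsize=(12, 6))
plt.bar([t["name"] for t in trends], [t["tweet_volume"] for t in trends])
plt.xticks(rotation=90)
plt.ylabel("Tweet volume")
plt.tight_layout()
plt.show()

# Part c: light tweet-oriented cleanup (URLs, mentions, hashtags, whitespace).
def clean(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[@#]\w+", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Parts c and d: 20 tweets per top-3 trend, preferring extended text,
# then TextBlob polarity to tally positive vs. negative sentiment.
for trend in trends[:3]:
    tweets = api.search_tweets(q=trend["name"], count=20, tweet_mode="extended")
    texts = [clean(getattr(t, "full_text", t.text)) for t in tweets]
    polarities = [TextBlob(t).sentiment.polarity for t in texts]
    positive = sum(p > 0 for p in polarities)
    negative = sum(p < 0 for p in polarities)
    print(f"{trend['name']}: {positive} positive, {negative} negative of {len(texts)}")
```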
2 (Option II). (Machine Learning (Regression))
a. Locate a non-proprietary, small-scale dataset suitable for regression online. There are countless sources and repositories that you can use for this task, but if you have trouble finding one, I recommend starting via Kaggle (https://www.kaggle.com/code/rtatman/datasets-for-regression-analysis/notebook). Explain briefly what the dataset represents, what target variable you will be using, and what other features are present. You may want or need to apply preprocessing to your data to ensure it can be used properly with the regression models (e.g., making every feature numeric through transformation, or dropping non-numeric features).
b. Train three distinct sklearn regression estimators for the chosen dataset and compare the results to see which one performs the best when using 10-fold cross-validation, utilizing the R-Squared score to gauge performance. Note that you should use three distinct regression models here (not just tweak underlying parameters). A relatively complete listing of the available estimators can be found here (https://scikit-learn.org/stable/supervised_learning.html) — but make sure you only use regression models! Unless you have an inclination to do otherwise, I recommend using the model default parameters when available.
c. Repeat part b utilizing the Mean Squared Error (MSE) to gauge performance. Briefly research the difference between the two metrics (MSE and R2), and explain in a paragraph or two (i) the difference between them and (ii) when each one is the preferable metric to use. (A code sketch covering parts b and c follows.)
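A sketch of parts b and c, assuming a hypothetical housing.csv with a numeric price column as the target (the file name, target column, and preprocessing are placeholders for whatever dataset you choose), with linear regression, a decision tree, and a random forest as the three regressors:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Placeholder dataset and target; replace with your chosen CSV and column names.
df = pd.get_dummies(pd.read_csv("housing.csv").dropna())  # drop NAs, one-hot encode
X = df.drop(columns=["price"])
y = df["price"]

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(),
    "RandomForestRegressor": RandomForestRegressor(),
}

# Part b: 10-fold CV scored with R-Squared.
# Part c: the same folds scored with MSE; sklearn returns it negated
# ("neg_mean_squared_error"), so the sign is flipped for reporting.
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=10, scoring="r2")
    neg_mse = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    print(f"{name:<22} mean R2 = {r2.mean():.3f}   mean MSE = {-neg_mse.mean():.3f}")
```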