Description

Scenario

Integrating clustering techniques with dimension reduction is a rich area of study in unsupervised learning. Dimension reduction, which maps complex, high-dimensional datasets into a more manageable form, is essential for efficient data analysis and visualization. Techniques such as Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) are instrumental in this context. Applying clustering methods such as k-means, c-means, hierarchical clustering, DBSCAN, HDBSCAN, and Expectation-Maximization (EM) to dimensionally reduced datasets shows how effectively these algorithms identify patterns and groupings. This approach gives practical experience with the algorithms and deepens understanding of how they work together in unsupervised data analysis.

Tasks

In this assignment, you are provided with 40,000 physician notes authored by test-takers of the USMLE. These notes were written for ten standardized patients, so they naturally form ten clusters: the patients are the same for all note writers. This makes the task a good example of unsupervised learning in which the ground truth can be used for post-hoc analysis. Your tasks are as follows:

Data Preprocessing:
1. Begin by preprocessing the physician notes. This should include:
  1. Case Conversion
  2. Removing Punctuation and Special Characters
  3. Correcting Typos and Spelling. Think of faster ways of doing this.
  4. Standardizing Formats for dates, numbers, currencies, etc.
  5. Handling Contractions: expanding contractions such as “can’t” to “cannot”.
  6. Optionally, stemming and lemmatization.
2. Apply a stop word list to filter out unnecessary words.
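
The preprocessing steps can be sketched with the standard library alone. The snippet below is a minimal illustration, not a reference solution: the `CONTRACTIONS` and `STOP_WORDS` tables and the sample note are hypothetical and deliberately tiny, and spelling correction and date or number standardization are left out.

```python
import re

# Hypothetical, deliberately small lookup tables; a real run would use fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "doesn't": "does not"}
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "with", "for"}

def preprocess(note):
    """Return a list of cleaned tokens for one physician note."""
    text = note.lower()                        # case conversion
    text = text.replace("’", "'")              # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():   # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special characters
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]  # stop word filtering

# Made-up sample note, used only to show the effect of each step.
tokens = preprocess("Pt can’t sleep; chest pain for 2 days, denies SOB!")
print(tokens)
```

Contractions are expanded before punctuation is stripped, since removing the apostrophe first would turn “can’t” into the two tokens “can” and “t”.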

Document-Term Matrix (DTM) Creation:
1. Create a Document-Term Matrix with appropriate weighting, choice of n-grams, and other hyperparameters. This matrix will form the basis for your subsequent analyses.
2. Describe the DTM from a DataFrame (df) perspective: its size, memory usage, etc.
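
To make the DTM step concrete, here is a small dependency-free sketch that builds a TF-IDF weighted matrix over unigrams and bigrams and then reports its shape, density, and a rough memory estimate. The function name `build_dtm`, the toy documents, and the IDF variant `ln(N / df) + 1` are assumptions chosen for illustration; for the full 40,000-note corpus a sparse representation from a library such as scikit-learn's `TfidfVectorizer` would likely be more practical than the dense lists used here.

```python
import math
from collections import Counter

def build_dtm(token_lists, ngram_range=(1, 2)):
    """Build a dense TF-IDF document-term matrix from tokenized documents."""
    lo, hi = ngram_range
    doc_counts = []
    for tokens in token_lists:
        # Collect every n-gram of each requested length for this document.
        grams = [" ".join(tokens[i:i + n])
                 for n in range(lo, hi + 1)
                 for i in range(len(tokens) - n + 1)]
        doc_counts.append(Counter(grams))
    vocab = sorted(set().union(*doc_counts))
    n_docs = len(doc_counts)
    # Document frequency: number of documents containing each term.
    df = Counter(term for counts in doc_counts for term in counts)
    # Assumed IDF variant: ln(N / df) + 1, so ubiquitous terms keep weight 1.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    matrix = [[counts[t] * idf[t] for t in vocab] for counts in doc_counts]
    return vocab, matrix

# Toy tokenized notes (hypothetical).
docs = [["chest", "pain"], ["chest", "pain", "cough"], ["knee", "pain"]]
vocab, matrix = build_dtm(docs)

# Describe the matrix: size, sparsity, and a dense-storage estimate.
shape = (len(matrix), len(vocab))
nonzero = sum(1 for row in matrix for value in row if value != 0)
density = nonzero / (shape[0] * shape[1])
dense_bytes = shape[0] * shape[1] * 8  # assuming 8-byte floats
print(shape, nonzero, round(density, 3), dense_bytes)
```

The density figure is the one to watch: text DTMs are overwhelmingly zeros, which is why the dense estimate grows quickly as the vocabulary and n-gram range increase.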

Machine Learning Implementation:
1. ML Pipeline: This should include normalization of your data to ensure uniformity and outlier analysis to identify and address any anomalies in your dataset.
2. Dimension Reduction: Apply dimension reduction techniques. The aim here is to reduce the complexity of your data while retaining its essential characteristics. This step is crucial for effective clustering.
3. Clustering: With your data preprocessed and dimensionally reduced, apply clustering techniques. Choose appropriate clustering algorithms to identify patterns and groupings in your dataset.
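
As a rough illustration of how the normalization and clustering steps fit together, and of how the known patient labels can be used afterwards, the sketch below L2-normalizes a few document vectors, clusters them with a plain k-means (Lloyd's algorithm), and scores the result with cluster purity. Everything here is an assumption made for the example: the toy vectors, the patient labels "A" and "B", seeding centroids with the first `k` points, and the choice of purity as the post-hoc metric. The dimension reduction step is left out to keep the sketch dependency-free; PCA, t-SNE, or UMAP would sit between normalization and clustering.

```python
import math
from collections import Counter

def l2_normalize(rows):
    """Scale each document vector to unit Euclidean length."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard all-zero rows
        normalized.append([x / norm for x in row])
    return normalized

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=50):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def cluster_purity(true_labels, cluster_ids):
    """Fraction of notes that share their cluster's majority true label."""
    by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(true_labels)

# Toy term-weight vectors for four notes about two hypothetical patients.
rows = [[3.0, 0.3], [0.2, 2.0], [4.0, 0.1], [0.1, 5.0]]
patients = ["A", "B", "A", "B"]

unit_rows = l2_normalize(rows)
labels, centroids = kmeans(unit_rows, k=2)
purity = cluster_purity(patients, labels)
print(labels, purity)
```

Normalizing to unit length first means k-means compares the direction of each note's term profile rather than its length, which keeps long notes from dominating the distances. Purity is only one possible post-hoc check; it rewards over-splitting, so it is most informative when the number of clusters is fixed at the known number of patients.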

Expected Output

Please submit a fully executed Jupyter notebook that identifies the question number and steps. Make sure to add comments to your solution.

Solved HW Assignment 1 (A1): Bag of Words CS6120
