Description
Project Objectives
You will learn how to crawl social media data, consider privacy and data usage implications,
process, model and analyze the data. You will write a detailed written report and give a short
oral presentation summarizing your results.
Project Outline
1. Data Collection
2. Data Visualization
3. Network Measures Calculation
Guidelines
Data Collection – Your initial task is to choose a social media platform to collect data from.
Some example platforms include Instagram, DBLP, Reddit, arXiv, ResearchGate, Stack Overflow,
Stack Exchange, Wikipedia, etc. Figure out how you can crawl data from these websites. Some
of these platforms provide an API for collecting data. Make sure you have the credentials
needed for collecting the data (e.g., an API key).
You should collect enough data to create a social network with 100-500 nodes. Some
representative network types are described as follows.
• Friendship Network. A user’s friendship network can be represented as a graph in which
the nodes are users and an edge indicates a friendship relationship between two
users. Example: Users and connections in LinkedIn.
• Co-authorship Network. The nodes are scientists and two scientists are connected if they
have co-authored a paper. Example: A co-authorship network in the Computer Science
category of papers in arXiv.
• Diffusion Network. A node represents an entity which can publish, receive and
propagate information. A directed edge between nodes represents the direction of
information propagation. Example: Fake news propagation where the nodes are users
and the edges are re-tweets/replies/likes.
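As an illustration of the co-authorship case, the sketch below turns a list of paper records into weighted, undirected edges. The record layout (a dictionary with an "authors" list) and the sample names are assumptions made for this example; the fields returned by your chosen platform will differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample records; a real crawl would fill this list.
papers = [
    {"title": "Paper A", "authors": ["Ada", "Bo", "Cy"]},
    {"title": "Paper B", "authors": ["Ada", "Bo"]},
    {"title": "Paper C", "authors": ["Dee"]},
]


def coauthor_edges(papers):
    """Count how many papers each unordered pair of authors shares."""
    weights = Counter()
    for paper in papers:
        # Sort and de-duplicate so (Ada, Bo) and (Bo, Ada) are the same edge.
        authors = sorted(set(paper["authors"]))
        for a, b in combinations(authors, 2):
            weights[(a, b)] += 1
    return weights


edges = coauthor_edges(papers)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

A single-author paper such as "Paper C" produces no edge, so its author would appear only as an isolated node if you choose to keep it.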
Your report will include a description of how you crawled your chosen platform to collect the
data. Please also describe any challenges you faced, how you overcame the challenges and how
the challenges impacted the data that you were ultimately able to collect. Your report should
also include the user privacy policy for your chosen social media platform and data usage policy.
If you cannot find these policies, please describe where you looked for them.
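One possible way to keep a crawl within the 100-500 node range is a breadth-first ("snowball") crawl with a node budget. The sketch below is only an outline under stated assumptions: get_neighbors is a hypothetical function that you would implement on top of your platform's API, and FAKE_GRAPH is made-up data standing in for real responses. A real crawler would also need to respect the platform's rate limits and data usage policy.

```python
from collections import deque


def snowball_crawl(seed, get_neighbors, max_nodes=500):
    """Breadth-first crawl from `seed` that keeps at most `max_nodes` nodes."""
    seen = {seed}
    edges = set()
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for other in get_neighbors(user):
            if other == user:
                continue  # ignore self-loops
            if other not in seen:
                if len(seen) >= max_nodes:
                    continue  # budget used up: skip users we have not seen yet
                seen.add(other)
                queue.append(other)
            # Store each undirected edge once, with its endpoints in sorted order.
            edges.add(tuple(sorted((user, other))))
    return seen, edges


# Made-up stand-in for API responses (assumption for this example only).
FAKE_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}

nodes, edges = snowball_crawl("a", lambda u: FAKE_GRAPH.get(u, []), max_nodes=4)
print(sorted(nodes))
print(sorted(edges))
```

Because the crawl stops adding users once the budget is reached, the result depends on the seed and on the order in which neighbors are returned. That is the kind of impact on the collected data that the report asks you to describe.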
Data Visualization – Once the data is collected, the next step is to use a graph analysis
software package to visualize your network as a graph. There are many software packages
available, including NetworkX [link], SNAP [link], Gephi [link], NodeXL [link] and
graph-tool [link]. Choose one and read the instructions to determine how to input and
visualize your graph. Each package may require a particular format (e.g., adjacency matrix,
adjacency list, edge list) for input of the graph data.
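To make the three input formats concrete, the sketch below represents the same small undirected graph as an edge list, an adjacency list and an adjacency matrix using only the Python standard library. The node names are invented for the example, and the exact file layout each package expects should be taken from that package's own documentation.

```python
# Edge list: one (source, target) pair per edge.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Fix a node order so that matrix rows and columns are well defined.
nodes = sorted({n for edge in edges for n in edge})
index = {n: i for i, n in enumerate(nodes)}

# Adjacency list: each node maps to the list of its neighbors.
adj_list = {n: [] for n in nodes}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)

# Adjacency matrix: entry [i][j] is 1 if nodes i and j are connected.
adj_matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:
    adj_matrix[index[u]][index[v]] = 1
    adj_matrix[index[v]][index[u]] = 1  # undirected graph, so symmetric

print("edge list:", edges)
print("adjacency list:", adj_list)
print("adjacency matrix (rows/columns ordered as", nodes, "):")
for row in adj_matrix:
    print(row)
```

For a sparse network of a few hundred nodes, the edge list is usually the most compact of the three. The matrix grows with the square of the node count regardless of how many edges exist.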
Your report will include a short description of the graph analysis software that you used, your
reasoning for choosing the software and the format of the data input file. You will include a
screenshot of your visualized graph along with any information needed for the reader to
understand the visualization.
Network Measures – You will learn different network measures in class (Degree Distribution,
Clustering Coefficient, PageRank, Diameter, Closeness, Betweenness, etc.). Use your chosen
graph analysis software to obtain the degree distribution and plot it as a histogram. In
addition, choose two other network measures from those that we’ve learned about and report
on them in an appropriate format.
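The degree distribution can be sketched in a few lines. The example below counts degrees from an edge list and prints a text histogram; the edge list is made up for the example, and in your project you would normally obtain the degrees from your chosen graph analysis software and draw the histogram with a plotting tool.

```python
from collections import Counter

# Made-up undirected edge list (assumption for this example only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

# Degree of a node = number of edges touching it.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution = how many nodes have each degree value.
distribution = Counter(degree.values())

# Text histogram: one row per degree value, one '#' per node.
for k in sorted(distribution):
    print(f"degree {k}: {'#' * distribution[k]} ({distribution[k]} nodes)")
```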
Your report will include a description of how you used the graph analysis software to get each of
the three measures along with the measures and corresponding visualizations as appropriate.
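As an example of what two additional measures might look like, the sketch below computes the local clustering coefficient and closeness of a node by hand. The formulas used here are common textbook definitions and may differ in detail from the ones given in class or implemented by your software, so treat the code as an illustration and not as a reference implementation. The edge list is made up.

```python
from collections import deque
from itertools import combinations

# Made-up undirected edge list (assumption for this example only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs to check
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


def closeness(adj, node):
    """Reachable node count divided by the sum of shortest-path distances."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        current = queue.popleft()
        for nxt in adj[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


for node in sorted(adj):
    print(node, round(local_clustering(adj, node), 3), round(closeness(adj, node), 3))
```

Comparing hand-computed values on a tiny graph like this one with the output of your chosen software is a simple way to check which definition the software uses.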
Discussion of Results – Your report will include a discussion of the results of the data
visualization and network measures. What insights do these results provide? What further
questions do these results raise? What would be your next step to investigate further?
Reference – Your report will cite all tutorials, packages, software and libraries you used in your
data collection and analysis.
Video – Each team will submit a video (no longer than 4 minutes) where each team member
talks about the most significant challenge they faced working on the project.
Submission
We will run your code to see if it works for all of the steps. You should put all of your files,
including your raw data, your cleaned data, source code files, a report in PDF format and your
short video, into a .zip folder named LASTNAME1_LASTNAME2_PJ1 (replace LASTNAME1 and
LASTNAME2 with the last name of each team member). Submit your zip folder to Blackboard.
One submission per team.