ECE 232E Project 2 Social Network Mining solution

$29.99

Category:

Description

5/5 - (2 votes)

In this project, we will study the various properties of social networks.
In the first part of the project, we will study an undirected social network (Facebook). In the second part of the project, we will study a
directed social network (Google +).
1 Facebook network
In this project, we will be using the dataset given below:
http://snap.stanford.edu/data/egonets-Facebook.html
The Facebook network can be created from the edgelist file (facebook
combined.txt)
1.1 Structural properties of the facebook network
Having created the facebook network, we will study some of the structural properties of the network. To be specific, we will study
1
• Connectivity
• Degree distribution
Question 1: Is the facebook network connected? If not, find the giant
connected component (GCC) of the network and report the size of the
GCC.
Question 2: Find the diameter of the network. If the network is not
connected, then find the diameter of the GCC.
Question 3: Plot the degree distribution of the facebook network and
report the average degree.
Question 4: Plot the degree distribution of question 3 in a log-log scale.
Try to fit a line to the plot and estimate the slope of the line.
1.2 Personalized network
A personalized network of an user vi is defined as the subgraph induced
by vi and it’s neighbors. In this part, we will study some of the structural properties of the personalized network of the user whose graph
node ID is 1 (node ID in edgelist is 0). From this point onwards, whenever we are refering to a node ID we mean the graph node ID which is
1 + node ID in edgelist.
Question 5: Create a personalized network of the user whose ID is 1.
How many nodes and edges does this personalized network have?
Question 6: What is the diameter of the personalized network? Please
state a trivial upper and lower bound for the diameter of the personalized network.
Question 7: In the context of the personalized network, what is the
meaning of the diameter of the personalized network to be equal to the
2
upper bound you derived in question 6. What is the meaning of the diameter of the personalized network to be equal to the lower bound you
derived in question 6?
1.3 Core node’s personalized network
A core node is defined as the nodes that have more than 200 neighbors.
For visualization purpose, we have displayed the personalized network
of a core node below.
An example of a personal network. The core node is shown in black.
In this part, we will study various properties of the personalized network of the core nodes.
Question 8: How many core nodes are there in the Facebook network.
What is the average degree of the core nodes?
3
1.3.1 Community structure of core node’s personalized network
In this part, we study the community structure of the core node’s personalized network. To be specific, we will study the community structure of the personalized network of the following core nodes:
• Node ID 1
• Node ID 108
• Node ID 349
• Node ID 484
• Node ID 1087
Question 9: For each of the above core node’s personalized network,
find the community structure using Fast-Greedy, Edge-Betweenness,
and Infomap community detection algorithms. Compare the modularity scores of the algorithms. For visualization purpose, display the community structure of the core node’s personalized networks using colors.
Nodes belonging to the same community should have the same color
and nodes belonging to different communities should have different
color. In this question, you should have 15 plots in total.
1.3.2 Community structure with the core node removed
In this part, we will explore the effect on the community structure of a
core node’s personalized network when the core node itself is removed
from the personalized network.
Question 10: For each of the core node’s personalized network(use same
core nodes as question 9), remove the core node from the personalized
network and find the community structure of the modified personalized
network. Use the same community detection algorithm as question 9.
4
Compare the modularity score of the community structure of the modified personalized network with the modularity score of the community
structure of the personalized network of question 9. For visualization
purpose, display the community structure of the modified personalized
network using colors. In this question, you should have 15 plots in
total.
1.3.3 Characteristic of nodes in the personalized network
In this part, we will explore characteristics of nodes in the personalized network using two measures. These two measures are stated and
defined below:
• Embeddedness of a node is defined as the number of mutual
friends a node shares with the core node.
• Dispersion of a node is defined as the sum of distances between
every pair of the mutual friends the node shares with the core node.
The distances should be calculated in a modified graph where the
node (whose dispersion is being computed) and the core node are
removed.
For further details on the above characteristics, you can read the paper
below:
http://arxiv.org/abs/1310.6753
Question 11: Write an expression relating the Embeddedness of a
node to it’s degree.
Question 12: For each of the core node’s personalized network (use the
same core nodes as question 9), plot the distribution of embeddedness
and dispersion. In this question, you will have 10 plots.
Question 13: For each of the core node’s personalized network, plot
5
the community structure of the personalized network using colors and
highlight the node with maximum dispersion. Also, highlight the edges
incident to this node. To detect the community structure, use FastGreedy algorithm. In this question, you will have 5 plots.
Question 14: Repeat question 13, but now highlight the node with
maximum embeddedness and the node with maximum dispersion
embeddedness . Also,
highlight the edges incident to these nodes
Question 15: Use the plots from questions 13 and 14 to explain the
characteristics of a node revealed by each of this measure.
1.4 Friend recommendation in personalized networks
In many social networks, it is desirable to predict future links between
pairs of nodes in the network. In the context of this Facebook network
it is equivalent to recommending friends to users. In this part of the
project, we will explore some neighborhood-based measures for friend
recommendation. The network that we will be using for this part is the
personalized network of node with ID 415.
1.4.1 Neighborhood based measure
In this project, we will be exploring three different neighborhood-based
measures. Before we define these measures, let’s introduce some notation:
• Si is the neighbor set of node i in the network
• Sj is the neighbor set of node j in the network
Then, with the above notation we define the three measures below:
6
• Common neighbor measure between node i and node j is defined
as
CommonNeighbors(i, j) = |Si ∩ Sj
|
• Jaccard measure between node i and node j is defined as
Jaccard(i, j) = |Si ∩ Sj
|
|Si ∪ Sj
|
• Adamic-Adar measure between node i and node j is defined as
AdamicAdar(i, j) = X
k∈Si∩Sj
1
log(|Sk|)
1.4.2 Friend recommendation using neighborhood based measures
We can use the neighborhood based measures defined in the previous
section to recommend new friends to users in the network. Suppose we
want to recommend t new friends to some user i in the network using
Jaccard measure. We follow the steps listed below:
1. For each node in the network that is not a neighbor of i, compute
the jaccard measure between the node i and the node not in the
neighborhood of i
Compute Jaccard(i, j) ∀j ∈ S
C
i
2. Then pick t nodes that have the highest jaccard measure with node
i and recommend these nodes as friends to node i
1.4.3 Creating the list of users
Having defined the friend recommendation procedure, we can now apply it to the personalized network of node ID 415. Before we apply
7
the algorithm, we need to create the list of users who we want to recommend new friends to. We create this list by picking all nodes with
degree 24. We will denote this list as Nr.
Question 16: What is |Nr|?
1.4.4 Average accuracy of friend recommendation algorithm
In this part, we will apply the 3 different types of friend recommendation algorithms to recommend friends to the users in the list Nr. We
will define an average accuracy measure to compare the performances
of the friend recommendation algorithms.
Suppose we want to compute the average accuracy of the friend recommendation algorithm. This task is completed in two steps:
1. Compute the average accuracy for each user in the list Nr
2. Compute the average accuracy of the algorithm by averaging across
the accuracies of the users in the list Nr
Let’s describe the procedure for accomplishing the step 1 of the task.
Suppose we want to compute the average accuracy for user i in the list
Nr. We can compute it by iterating over the following steps 10 times
and then taking the average:
1. Remove each edge of node i at random with probability 0.25. In
this context, it is equivalent to deleting some friends of node i. Let’s
denote the list of friends deleted as Ri
2. Use one of the three neighborhood based measures to recommend
|Ri
| new friends to the user i. Let’s denote the list of friends recommended as Pi
3. The accuracy for the user i for this iteration is given by |Pi∩Ri
|
|Ri
|
8
By iterating over the above steps for 10 times and then taking the average gives us the average accuracy of user i. In this manner, we compute
the average accuracy for each user in the list Nr. Once we have computed them, then we can take the mean of the average accuracies of the
users in the list Nr. The mean value will be the average accuracy of the
friend recommendation algorithm.
Question 17: Compute the average accuracy of the friend recommendation algorithm that uses:
• Common Neighbors measure
• Jaccard measure
• Adamic Adar measure
Based on the average accuracy values, which friend recommendation
algorithm is the best?
2 Google+ network
In this part, we will explore the structure of the Google+ network. The
dataset for creating the network can be found in the link below:
http://snap.stanford.edu/data/egonets-Gplus.html
Create directed personal networks for users who have more than 2 circles. The data required to create such personal networks can be found
in the file named gplus.tar.gz.
Question 18: How many personal networks are there?
Question 19: For the 3 personal networks (node ID given below), plot
the in-degree and out-degree distribution of these personal networks.
9
Do the personal networks have a similar in and out degree distribution. In this question, you should have 6 plots.
• 109327480479767108490
• 115625564993990145546
• 101373961279443806744
2.1 Community structure of personal networks
In this part of the project, we will explore the community structure
of the personal networks that we created and explore the connections
between communities and user circles.
Question 20: For the 3 personal networks picked in question 19, extract the community structure of each personal network using Walktrap community detection algorithm. Report the modularity scores and
plot the communities using colors. Are the modularity scores similar?
In this question, you should have 3 plots.
Having found the communities, now we will explore the relationship
between circles and communities. In order to explore the relationship,
we define two measures:
• Homogeneity
• Completeness
Before, we state the expression for homogeneity and completeness, let’s
introduce some notation:
• C is the set of circles, C = {C1, C2, C3, · · · }
10
• K is the set of communities, K = {K1, K2, K3, · · · }
• ai is the number of people in circle Ci
• bi is the number of people in community Ki with circle information
• N is the total number of people with circle information
• Cji is the number of people belonging to community j and circle i
Then, with the above notation, we have the following expressions for
the entropy
H(C) = −
X
|C|
i=1
ai
N
log( ai
N
) (1)
H(K) = −
X
|K|
i=1
bi
N
log( bi
N
) (2)
and conditional entropy
H(C|K) = −
X
|K|
j=1
X
|C|
i=1
Cji
N
log(Cji
bj
) (3)
H(K|C) = −
X
|C|
i=1
X
|K|
j=1
Cji
N
log(Cji
ai
) (4)
Now we can state the expression for homogeneity, h as
h = 1 −
H(C|K)
H(C)
(5)
and the expression for completeness, c as
c = 1 −
H(K|C)
H(K)
(6)
Question 21: Based on the expression for h and c, explain the meaning
of homogeneity and completeness in words.
Question 22: Compute the h and c values for the community structures
of the 3 personal network (same nodes as question 19). Interpret the
values and provide a detailed explanation.
11
3 Submission
Please submit a zip file containing your codes and report to
CCLE. The zip file should be named as “Project2_UID1_…_UIDn.zip”
where UIDx are student ID numbers of team members. If you
had any questions you can post on piazza.
12