Homophily#

Homophily is the tendency of nodes that share properties or attributes to be, or to become, linked to one another. It is sometimes also called assortative mixing.

To test whether a network is homophilous, you need node attributes to compare. The example below uses the node attributes in the Quaker network from the Six Degrees of Francis Bacon project.

Importing data#

# Import NetworkX and key data science libraries
import networkx as nx
import pandas as pd
import numpy as np
import altair as alt
# Import edge table as normal
edges = pd.read_csv("../data/quaker-edges.csv")
edges
| | Source | Target |
| 0 | George Keith | William Bradford |
| 1 | George Keith | George Whitehead |
| 2 | George Keith | George Fox |
| 3 | George Keith | William Penn |
| 4 | George Keith | Franciscus Mercurius van Helmont |
| ... | ... | ... |
| 157 | Joseph Besse | Samuel Bownas |
| 158 | Joseph Besse | Richard Claridge |
| 159 | Silvanus Bevan | Daniel Quare |
| 160 | John Penington | Mary Penington |
| 161 | Lewis Morris | Sir Charles Wager |

162 rows × 2 columns

# Import node table
nodes = pd.read_csv("../data/quaker-nodes.csv")
nodes
| | Id | Label | historical significance | gender | birthdate | deathdate | other_id |
| 0 | George Keith | George Keith | Quaker schismatic and Church of England clergyman | male | 1638 | 1716 | 10006784 |
| 1 | Robert Barclay | Robert Barclay | religious writer and colonial governor | male | 1648 | 1690 | 10054848 |
| 2 | Benjamin Furly | Benjamin Furly | merchant and religious writer | male | 1636 | 1714 | 10004625 |
| 3 | Anne Conway Viscountess Conway and Killultagh | Anne Conway Viscountess Conway and Killultagh | philosopher | female | 1631 | 1679 | 10002755 |
| 4 | Franciscus Mercurius van Helmont | Franciscus Mercurius van Helmont | physician and cabbalist | male | 1614 | 1698 | 10005781 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | Elizabeth Leavens | Elizabeth Leavens | Quaker missionary | female | 1555 | 1665 | 10007246 |
| 92 | Lewis Morris | Lewis Morris | politician in America | male | 1671 | 1746 | 10008534 |
| 93 | Sir Charles Wager | Sir Charles Wager | naval officer and politician | male | 1666 | 1743 | 10012403 |
| 94 | William Simpson | William Simpson | Quaker preacher | male | 1627 | 1671 | 10011114 |
| 95 | Thomas Aldam | Thomas Aldam | Quaker preacher and writer | male | 1616 | 1660 | 10000099 |

96 rows × 7 columns

# Add edges to graph object
quakers = nx.from_pandas_edgelist(edges, source="Source", target="Target")
print(quakers)
Graph with 96 nodes and 162 edges
# Add node attributes for gender
nx.set_node_attributes(quakers, dict(zip(nodes.Id, nodes.gender)), 'gender')

Calculating mixed edge probability#

In a network that is not homophilous, the expected proportion of mixed edges is twice the product of the proportion of nodes in the first group (p) and the proportion of nodes in the second group (q): \(2pq\). Each edge has two endpoints, and a mixed edge can pair the groups in either order, which gives \(pq + qp = 2pq\).

# Calculate proportion of male people in the Quaker graph, using pandas
p = nodes.gender.value_counts()["male"]/nodes.gender.count()
p
0.84375
# Calculate proportion of female people in the Quaker graph, using pandas
q = nodes.gender.value_counts()["female"]/nodes.gender.count()
q
0.15625
# Calculate expected proportion of mixed edges
2*p*q
0.263671875
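
The \(2pq\) figure treats the two endpoints of an edge as independent draws from the node population. A minimal sketch of that arithmetic follows, using a hypothetical helper `expected_mixed_fraction` that is not part of NetworkX or pandas. It computes one minus the chance that both endpoints fall in the same group, which reduces to \(2pq\) for two groups and also covers attributes with more than two values. The label counts are illustrative, chosen only to reproduce the proportions above.

```python
from collections import Counter

def expected_mixed_fraction(labels):
    """Expected share of mixed edges if endpoints were drawn independently.

    Hypothetical helper for illustration; `labels` is any iterable of
    group labels, one per node.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Chance that both endpoints land in the same group, summed over groups
    same_group = sum((n / total) ** 2 for n in counts.values())
    return 1 - same_group

# Illustrative counts that give p = 0.84375 and q = 0.15625
labels = ["male"] * 81 + ["female"] * 15
expected = expected_mixed_fraction(labels)
print(expected)
```

For two groups, `1 - (p**2 + q**2)` equals `2*p*q` because `p + q = 1`.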

Comparing to the observed number of mixed edges#

If the network were not homophilous, we would expect around 26% of its edges to be mixed. To test this, we first count the mixed edges actually present and compare their share of all edges with the expected proportion above.

# Find the total number of mixed edges in the network
mixed_edges = len([(s,t) for s,t in quakers.edges if quakers.nodes[s]['gender'] != quakers.nodes[t]['gender']])
mixed_edges
32
# Get the percentage of mixed edges in the network
mixed_edges/quakers.number_of_edges()
0.19753086419753085

Let’s define a “homophily” measure as the difference between the expected proportion of mixed edges and the observed proportion of mixed edges.

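# Expected share of mixed edges (2pq) minus the observed share.
# Relies on p, q, and quakers defined above.
def homophily(mixed_edges):
    return 2*p*q - mixed_edges/quakers.number_of_edges()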
obs_homophily = homophily(mixed_edges)
obs_homophily
0.06614101080246915
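
The `homophily` function above reads `p`, `q`, and `quakers` from the surrounding notebook. As a hedged illustration of the same measure on data small enough to check by hand, the sketch below passes everything in explicitly. The function name `homophily_score`, the toy edge list, and the group labels are all assumptions made for this example and are not part of the Quaker dataset.

```python
def homophily_score(edges, groups):
    """Expected minus observed share of mixed edges for a two-group network.

    Hypothetical helper: `edges` is a list of (source, target) pairs and
    `groups` maps each node to one of exactly two labels.
    """
    labels = list(groups.values())
    first = labels[0]
    # Proportion of nodes in the first group; the rest are in the second
    p = sum(1 for label in labels if label == first) / len(labels)
    q = 1 - p
    # Edges whose endpoints carry different labels
    mixed = sum(1 for s, t in edges if groups[s] != groups[t])
    return 2 * p * q - mixed / len(edges)

# Toy data: two nodes per group, one mixed edge out of three
toy_groups = {"a": "x", "b": "x", "c": "y", "d": "y"}
toy_edges = [("a", "b"), ("c", "d"), ("a", "c")]
score = homophily_score(toy_edges, toy_groups)
print(score)
```

Here the expected share is 2 × 0.5 × 0.5 = 0.5 and the observed share is 1/3, so the score is positive: there are fewer mixed edges than the baseline predicts. A negative score would mean more mixed edges than expected.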

Our observed homophily measure is about 0.07: the share of mixed edges in this graph is roughly 7 percentage points lower than we would expect in a graph that is not homophilous. But how do we know whether this difference is significant?

Hypothesis testing#

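To see whether our graph is significantly homophilous, we can set up a resampling procedure to create a hypothesis test for the homophily statistic: shuffle the gender labels across the nodes many times, recompute the statistic for each shuffle, and compare the observed value with the resulting distribution. This is similar to how you would set up a permutation test for a difference in means.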

# Create a simulation function
def simulate_mixed_edges(data, attribute, id_attr, graph):
    shuffled = data[attribute].sample(frac=1).to_numpy() # Reshuffle column
    labels = dict(zip(data[id_attr], shuffled)) # Map each node to a shuffled value, leaving the graph's own attributes untouched
    mixed_edges = sum(labels[s] != labels[t] for s,t in graph.edges) # Get number of mixed edges
    return mixed_edges
# Simulate homophily 5000 times
sim_homophily = pd.DataFrame().assign(sim_homophily=[homophily(simulate_mixed_edges(nodes, 'gender', 'Id', quakers)) for i in range(5000)])
sim_homophily
sim_homophily
0 0.078487
1 0.016758
2 -0.069661
3 0.053795
4 0.047622
... ...
4995 -0.001760
4996 0.103178
4997 -0.038797
4998 -0.100526
4999 -0.001760

5000 rows × 1 columns

# Plot the results of the permutation test
# plt = sns.histplot(x=sim_homophily)
# plt.axvline(x=obs_homophily, color="red", ls="--")

alt.data_transformers.disable_max_rows()
histogram = alt.Chart(sim_homophily).mark_bar().encode(
    x=alt.X("sim_homophily:Q").bin(maxbins=20),
    y=alt.Y("count():Q")
)

sim_homophily = sim_homophily.assign(obs_homophily=obs_homophily)
observed_difference = alt.Chart(sim_homophily).mark_rule(color="red", strokeDash=(8,4)).encode(
    x=alt.X("obs_homophily")
)

histogram + observed_difference
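# Calculate a p-value from the simulated column only; sim_homophily now also
# holds the constant obs_homophily column, which would otherwise enter the mean
p_value = np.mean(sim_homophily["sim_homophily"] > obs_homophily)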
p_value
0.0555

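Now we can use the histogram and the p-value to judge whether the homophily we see in the Quaker graph is statistically significant. The p-value is the share of shuffled networks whose homophily measure exceeds the observed one. A value above the chosen significance level (commonly 0.05), as here, means the observed homophily could plausibly have arisen by chance.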
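
The resampling logic does not depend on pandas or NetworkX. The sketch below runs the same kind of permutation test on a toy network using only the standard library. The names `count_mixed` and `permutation_p_value`, the toy data, the iteration count, and the fixed seed are assumptions for illustration. It shuffles a copy of the labels rather than changing the original mapping, and it counts shuffles that are at least as extreme as the observed value, which is one common convention for a one-sided p-value.

```python
import random

def count_mixed(edges, groups):
    """Number of edges whose endpoints carry different labels."""
    return sum(1 for s, t in edges if groups[s] != groups[t])

def permutation_p_value(edges, groups, n_iter=2000, seed=0):
    """One-sided p-value for homophily via label shuffling (illustrative sketch).

    Shuffling keeps the group sizes fixed, so 2pq is constant and a higher
    homophily score is equivalent to a lower mixed-edge count.
    """
    rng = random.Random(seed)
    nodes = list(groups)
    labels = [groups[n] for n in nodes]
    observed = count_mixed(edges, groups)
    at_least_as_extreme = 0
    for _ in range(n_iter):
        shuffled = labels[:]  # copy, so the original labels are untouched
        rng.shuffle(shuffled)
        if count_mixed(edges, dict(zip(nodes, shuffled))) <= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_iter

# Toy network: two triangles joined by a single mixed edge
toy_groups = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
p_val = permutation_p_value(toy_edges, toy_groups)
print(p_val)
```

Only 2 of the 20 ways to assign three `x` labels to six nodes leave a single mixed edge in this toy network, so the estimate should land near 0.1.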