Homophily#

Homophily is the network principle that nodes sharing a property or attribute are more likely to be, or to become, linked to one another. It is sometimes also called assortative mixing.
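As a small illustration of assortative mixing, the sketch below builds two hypothetical four-node graphs (the node names and the `group` attribute are made up for this example) and scores them with NetworkX's `attribute_assortativity_coefficient`. The score approaches 1 when edges stay within groups and -1 when every edge crosses groups.

```python
import networkx as nx

# Hypothetical toy data: two groups, "x" and "y"
groups = {"a": "x", "b": "x", "c": "y", "d": "y"}

# Every edge joins two nodes of the same group
same = nx.Graph([("a", "b"), ("c", "d")])
nx.set_node_attributes(same, groups, "group")

# Every edge joins nodes of different groups
mixed = nx.Graph([("a", "c"), ("b", "d")])
nx.set_node_attributes(mixed, groups, "group")

r_same = nx.attribute_assortativity_coefficient(same, "group")
r_mixed = nx.attribute_assortativity_coefficient(mixed, "group")
print(r_same, r_mixed)
```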

To see whether a network is homophilous, its nodes must have attributes for you to investigate. The example below uses the node attributes in the Quaker network from the Six Degrees of Francis Bacon project.

Importing data#

# Import NetworkX and key data science libraries
import networkx as nx
import pandas as pd
import numpy as np
import altair as alt

# Import edge table as normal
edges = pd.read_csv("../data/quaker-edges.csv")
edges

	Source	Target
0	George Keith	William Bradford
1	George Keith	George Whitehead
2	George Keith	George Fox
3	George Keith	William Penn
4	George Keith	Franciscus Mercurius van Helmont
...	...	...
157	Joseph Besse	Samuel Bownas
158	Joseph Besse	Richard Claridge
159	Silvanus Bevan	Daniel Quare
160	John Penington	Mary Penington
161	Lewis Morris	Sir Charles Wager

162 rows × 2 columns

# Import node table
nodes = pd.read_csv("../data/quaker-nodes.csv")
nodes

	Id	Label	historical significance	gender	birthdate	deathdate	other_id
0	George Keith	George Keith	Quaker schismatic and Church of England clergyman	male	1638	1716	10006784
1	Robert Barclay	Robert Barclay	religious writer and colonial governor	male	1648	1690	10054848
2	Benjamin Furly	Benjamin Furly	merchant and religious writer	male	1636	1714	10004625
3	Anne Conway Viscountess Conway and Killultagh	Anne Conway Viscountess Conway and Killultagh	philosopher	female	1631	1679	10002755
4	Franciscus Mercurius van Helmont	Franciscus Mercurius van Helmont	physician and cabbalist	male	1614	1698	10005781
...	...	...	...	...	...	...	...
91	Elizabeth Leavens	Elizabeth Leavens	Quaker missionary	female	1555	1665	10007246
92	Lewis Morris	Lewis Morris	politician in America	male	1671	1746	10008534
93	Sir Charles Wager	Sir Charles Wager	naval officer and politician	male	1666	1743	10012403
94	William Simpson	William Simpson	Quaker preacher	male	1627	1671	10011114
95	Thomas Aldam	Thomas Aldam	Quaker preacher and writer	male	1616	1660	10000099

96 rows × 7 columns

# Add edges to graph object
quakers = nx.from_pandas_edgelist(edges, source="Source", target="Target")
print(quakers)

Graph with 96 nodes and 162 edges

# Add node attributes for gender
nx.set_node_attributes(quakers, dict(zip(nodes.Id, nodes.gender)), 'gender')
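`nx.set_node_attributes` only labels nodes whose name appears as a key in the dictionary, so a name spelled differently in the edge and node tables would leave a node without a `gender` value and raise a `KeyError` when the attribute is read later. A quick check, sketched here on a hypothetical three-node graph rather than the Quaker data, could look like this:

```python
import networkx as nx

# Hypothetical toy graph; "Carol" is missing from the attribute table
G = nx.Graph([("Alice", "Bob"), ("Bob", "Carol")])
gender = {"Alice": "female", "Bob": "male", "Dave": "male"}  # "Dave" is not in G

# Keys that are not nodes in G are silently skipped
nx.set_node_attributes(G, gender, "gender")

# List the nodes that ended up without the attribute
missing = [n for n, data in G.nodes(data=True) if "gender" not in data]
print(missing)  # ['Carol']
```

An empty list means every node in the graph received a value.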

Calculating mixed edge probability#

In a network that is not homophilous, the expected proportion of mixed edges (edges joining nodes from different groups) is twice the product of the proportion of nodes in the first group (\(p\)) and the proportion of nodes in the second group (\(q\)): \(2pq\).
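The factor of 2 comes from the two ways an edge can be mixed. As a short derivation, assuming the group of each endpoint is drawn independently, with probability \(p\) for the first group and \(q\) for the second:

\[
P(\text{mixed}) = P(\text{first endpoint in group 1, second in group 2}) + P(\text{first endpoint in group 2, second in group 1}) = pq + qp = 2pq
\]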

# Calculate the proportion of male people in the Quaker network, using pandas
p = nodes.gender.value_counts()["male"]/nodes.gender.count()
p

0.84375

# Calculate the proportion of female people in the Quaker network, using pandas
q = nodes.gender.value_counts()["female"]/nodes.gender.count()
q

0.15625

# Calculate probability of mixed edges
2*p*q

0.263671875
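The \(2pq\) figure treats the two endpoints of an edge as independent draws. When labels are shuffled among a fixed set of nodes, the endpoints are effectively drawn without replacement, so the expected share is slightly different. The sketch below compares the two, assuming the counts of 81 male and 15 female nodes implied by the proportions above (81/96 = 0.84375).

```python
from fractions import Fraction

# Assumed counts, implied by the proportions above: 0.84375 * 96 = 81, 0.15625 * 96 = 15
n_male, n_female = 81, 15
n = n_male + n_female

# Independent draws (with replacement): 2pq
with_replacement = 2 * Fraction(n_male, n) * Fraction(n_female, n)

# Two distinct nodes (without replacement): 2 * n_male * n_female / (n * (n - 1))
without_replacement = Fraction(2 * n_male * n_female, n * (n - 1))

print(float(with_replacement))     # 0.263671875
print(float(without_replacement))  # about 0.2664
```

For a network of this size the gap is small, which is why \(2pq\) serves as a reasonable baseline.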

Comparing to the observed number of mixed edges#

If the network were not homophilous, we would expect around 26% of its edges to be mixed. To test this, we can first count the mixed edges actually present and then compare their share of all edges to the expected value above.

# Find the total number of mixed edges in the network
mixed_edges = len([(s,t) for s,t in quakers.edges if quakers.nodes[s]['gender'] != quakers.nodes[t]['gender']])
mixed_edges

# Get the percentage of mixed edges in the network
mixed_edges/quakers.number_of_edges()

0.19753086419753085

Let’s define a “homophily” measure as the difference between the expected percentage of mixed edges and the observed percentage of mixed edges.

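# Expected share of mixed edges (2pq) minus the observed share;
# relies on p, q, and quakers defined above
def homophily(mixed_edges):
    return 2*p*q - mixed_edges/quakers.number_of_edges()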

obs_homophily = homophily(mixed_edges)
obs_homophily

0.06614101080246915

Our observed homophily measure is about 0.07: the share of mixed edges in this graph is roughly 7 percentage points lower than we would expect in a graph that is not homophilous. But how do we know whether this difference is significant?
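To make the arithmetic concrete, the sketch below restates the comparison in edge counts. It assumes 32 mixed edges, the count implied by the observed share of about 0.1975 out of 162 edges.

```python
n_edges = 162
expected_share = 0.263671875   # 2pq from above
observed_mixed = 32            # assumed: 32 / 162 = 0.19753...

observed_share = observed_mixed / n_edges
expected_mixed = expected_share * n_edges

print(round(expected_mixed, 1))                   # 42.7 mixed edges expected
print(round(expected_share - observed_share, 3))  # 0.066
```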

Hypothesis testing#

To see whether our graph is significantly homophilous, we can set up a resampling procedure to create a hypothesis test for the homophily statistic. This is similar to how you would set up a permutation test for a difference in means.
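As a self-contained sketch of this kind of permutation test, the example below uses a hypothetical six-node edge list with made-up group labels (none of it is Quaker data) and only the standard library. It shuffles a copy of the labels, so the original assignment is left untouched.

```python
import random

# Hypothetical toy network: two triangles joined by one bridge edge
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
labels = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}

def count_mixed(edges, labels):
    """Count edges whose endpoints carry different labels."""
    return sum(labels[s] != labels[t] for s, t in edges)

def permutation_counts(edges, labels, n_sims, seed=0):
    """Mixed-edge counts after shuffling the labels among the nodes."""
    rng = random.Random(seed)
    nodes = list(labels)
    values = list(labels.values())
    counts = []
    for _ in range(n_sims):
        rng.shuffle(values)  # shuffles the copy, not `labels`
        counts.append(count_mixed(edges, dict(zip(nodes, values))))
    return counts

observed = count_mixed(edges, labels)
sims = permutation_counts(edges, labels, n_sims=2000)

# Share of shuffles with at most as many mixed edges as observed
p_value = sum(c <= observed for c in sims) / len(sims)
print(observed, p_value)
```

With three nodes per group there are only 20 distinct labelings, two of which give a single mixed edge, so the estimated p-value should settle near 0.1.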

# Create a simulation function
# Note: this overwrites `attribute` on `graph` with the shuffled values
def simulate_mixed_edges(data, attribute, id_attr, graph):
    # Reshuffle the attribute column
    attr_column = data[attribute].sample(frac=1).reset_index(drop=True)
    # Assign the shuffled values to the nodes
    nx.set_node_attributes(graph, dict(zip(data[id_attr], attr_column)), attribute)
    # Count the edges whose endpoints have different attribute values
    mixed_edges = len([(s,t) for s,t in graph.edges if graph.nodes[s][attribute] != graph.nodes[t][attribute]])
    return mixed_edges

# Simulate homophily 5000 times
sim_homophily = pd.DataFrame().assign(sim_homophily=[homophily(simulate_mixed_edges(nodes, 'gender', 'Id', quakers)) for i in range(5000)])
sim_homophily

	sim_homophily
0	0.078487
1	0.016758
2	-0.069661
3	0.053795
4	0.047622
...	...
4995	-0.001760
4996	0.103178
4997	-0.038797
4998	-0.100526
4999	-0.001760

5000 rows × 1 columns

# Plot the results of the permutation test
# plt = sns.histplot(x=sim_homophily)
# plt.axvline(x=obs_homophily, color="red", ls="--")

alt.data_transformers.disable_max_rows()
histogram = alt.Chart(sim_homophily).mark_bar().encode(
    x=alt.X("sim_homophily:Q").bin(maxbins=20),
    y=alt.Y("count():Q")
)

sim_homophily = sim_homophily.assign(obs_homophily=obs_homophily)
observed_difference = alt.Chart(sim_homophily).mark_rule(color="red", strokeDash=(8,4)).encode(
    x=alt.X("obs_homophily")
)

histogram + observed_difference

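# Calculate a p-value: the share of simulated values larger than the observed one
# Select the simulated column only, since sim_homophily now also holds the constant obs_homophily column
p_value = np.mean(sim_homophily["sim_homophily"] > obs_homophily)
p_value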

0.0555

We can now use the histogram and the p-value to judge whether the amount of homophily we see in the Quaker graph is statistically significant. The p-value is the share of shuffled networks whose homophily measure was larger than the observed one.
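One common refinement, offered here as a sketch and not as part of the analysis above, is to count ties with `>=` and add one to both the numerator and the denominator, so that the estimated p-value can never be exactly zero. The function name and the simulated values below are made up for illustration.

```python
def permutation_p_value(simulated, observed):
    """One-sided p-value: share of simulated statistics >= observed, with add-one correction."""
    at_least = sum(s >= observed for s in simulated)
    return (at_least + 1) / (len(simulated) + 1)

# Made-up simulated statistics and observed value, for illustration only
simulated = [-0.10, -0.04, 0.00, 0.02, 0.05, 0.08, 0.10, -0.07, 0.01]
observed = 0.07

print(permutation_p_value(simulated, observed))  # 0.3
```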