{ "cells": [ { "cell_type": "markdown", "id": "60c2dc9c-8ee0-4a5b-98d3-34025807db4e", "metadata": {}, "source": [ "# Homophily\n", "\n", "Homophily is the network principle that nodes with common properties or attributes are more likely to be, or to become, linked to one another. It is sometimes also called assortative mixing.\n", "\n", "To test whether a network is homophilous, its nodes must carry attributes you can investigate. The example below uses the gender attribute of the nodes in the Quaker network from the *Six Degrees of Francis Bacon* project.\n", "\n", "## Importing data" ] }, { "cell_type": "code", "execution_count": 2, "id": "d84c8c1b-8c3e-4a16-be2d-8c2a7cc7f50c", "metadata": {}, "outputs": [], "source": [ "# Import NetworkX and key data science libraries\n", "import networkx as nx\n", "import pandas as pd\n", "import numpy as np\n", "import altair as alt" ] }, { "cell_type": "code", "execution_count": 3, "id": "ad653da6-8449-4dd9-a508-dd1273b4f225", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Source</th><th>Target</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>George Keith</td><td>William Bradford</td></tr>\n",
"    <tr><th>1</th><td>George Keith</td><td>George Whitehead</td></tr>\n",
"    <tr><th>2</th><td>George Keith</td><td>George Fox</td></tr>\n",
"    <tr><th>3</th><td>George Keith</td><td>William Penn</td></tr>\n",
"    <tr><th>4</th><td>George Keith</td><td>Franciscus Mercurius van Helmont</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>157</th><td>Joseph Besse</td><td>Samuel Bownas</td></tr>\n",
"    <tr><th>158</th><td>Joseph Besse</td><td>Richard Claridge</td></tr>\n",
"    <tr><th>159</th><td>Silvanus Bevan</td><td>Daniel Quare</td></tr>\n",
"    <tr><th>160</th><td>John Penington</td><td>Mary Penington</td></tr>\n",
"    <tr><th>161</th><td>Lewis Morris</td><td>Sir Charles Wager</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>162 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " Source Target\n", "0 George Keith William Bradford\n", "1 George Keith George Whitehead\n", "2 George Keith George Fox\n", "3 George Keith William Penn\n", "4 George Keith Franciscus Mercurius van Helmont\n", ".. ... ...\n", "157 Joseph Besse Samuel Bownas\n", "158 Joseph Besse Richard Claridge\n", "159 Silvanus Bevan Daniel Quare\n", "160 John Penington Mary Penington\n", "161 Lewis Morris Sir Charles Wager\n", "\n", "[162 rows x 2 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import edge table as normal\n", "edges = pd.read_csv(\"../data/quaker-edges.csv\")\n", "edges" ] }, { "cell_type": "code", "execution_count": 4, "id": "94518ba4-3666-4975-b439-ea466c3767c4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Id</th><th>Label</th><th>historical significance</th><th>gender</th><th>birthdate</th><th>deathdate</th><th>other_id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>George Keith</td><td>George Keith</td><td>Quaker schismatic and Church of England clergyman</td><td>male</td><td>1638</td><td>1716</td><td>10006784</td></tr>\n",
"    <tr><th>1</th><td>Robert Barclay</td><td>Robert Barclay</td><td>religious writer and colonial governor</td><td>male</td><td>1648</td><td>1690</td><td>10054848</td></tr>\n",
"    <tr><th>2</th><td>Benjamin Furly</td><td>Benjamin Furly</td><td>merchant and religious writer</td><td>male</td><td>1636</td><td>1714</td><td>10004625</td></tr>\n",
"    <tr><th>3</th><td>Anne Conway Viscountess Conway and Killultagh</td><td>Anne Conway Viscountess Conway and Killultagh</td><td>philosopher</td><td>female</td><td>1631</td><td>1679</td><td>10002755</td></tr>\n",
"    <tr><th>4</th><td>Franciscus Mercurius van Helmont</td><td>Franciscus Mercurius van Helmont</td><td>physician and cabbalist</td><td>male</td><td>1614</td><td>1698</td><td>10005781</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>91</th><td>Elizabeth Leavens</td><td>Elizabeth Leavens</td><td>Quaker missionary</td><td>female</td><td>1555</td><td>1665</td><td>10007246</td></tr>\n",
"    <tr><th>92</th><td>Lewis Morris</td><td>Lewis Morris</td><td>politician in America</td><td>male</td><td>1671</td><td>1746</td><td>10008534</td></tr>\n",
"    <tr><th>93</th><td>Sir Charles Wager</td><td>Sir Charles Wager</td><td>naval officer and politician</td><td>male</td><td>1666</td><td>1743</td><td>10012403</td></tr>\n",
"    <tr><th>94</th><td>William Simpson</td><td>William Simpson</td><td>Quaker preacher</td><td>male</td><td>1627</td><td>1671</td><td>10011114</td></tr>\n",
"    <tr><th>95</th><td>Thomas Aldam</td><td>Thomas Aldam</td><td>Quaker preacher and writer</td><td>male</td><td>1616</td><td>1660</td><td>10000099</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>96 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " Id \n", "0 George Keith \\\n", "1 Robert Barclay \n", "2 Benjamin Furly \n", "3 Anne Conway Viscountess Conway and Killultagh \n", "4 Franciscus Mercurius van Helmont \n", ".. ... \n", "91 Elizabeth Leavens \n", "92 Lewis Morris \n", "93 Sir Charles Wager \n", "94 William Simpson \n", "95 Thomas Aldam \n", "\n", " Label \n", "0 George Keith \\\n", "1 Robert Barclay \n", "2 Benjamin Furly \n", "3 Anne Conway Viscountess Conway and Killultagh \n", "4 Franciscus Mercurius van Helmont \n", ".. ... \n", "91 Elizabeth Leavens \n", "92 Lewis Morris \n", "93 Sir Charles Wager \n", "94 William Simpson \n", "95 Thomas Aldam \n", "\n", " historical significance gender birthdate \n", "0 Quaker schismatic and Church of England clergyman male 1638 \\\n", "1 religious writer and colonial governor male 1648 \n", "2 merchant and religious writer male 1636 \n", "3 philosopher female 1631 \n", "4 physician and cabbalist male 1614 \n", ".. ... ... ... \n", "91 Quaker missionary female 1555 \n", "92 politician in America male 1671 \n", "93 naval officer and politician male 1666 \n", "94 Quaker preacher male 1627 \n", "95 Quaker preacher and writer male 1616 \n", "\n", " deathdate other_id \n", "0 1716 10006784 \n", "1 1690 10054848 \n", "2 1714 10004625 \n", "3 1679 10002755 \n", "4 1698 10005781 \n", ".. ... ... 
\n", "91 1665 10007246 \n", "92 1746 10008534 \n", "93 1743 10012403 \n", "94 1671 10011114 \n", "95 1660 10000099 \n", "\n", "[96 rows x 7 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import node table\n", "nodes = pd.read_csv(\"../data/quaker-nodes.csv\")\n", "nodes" ] }, { "cell_type": "code", "execution_count": 5, "id": "5fbb10bd-1375-4cb7-b9df-aa8ca8f18a80", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Graph with 96 nodes and 162 edges\n" ] } ], "source": [ "# Add edges to graph object\n", "quakers = nx.from_pandas_edgelist(edges, source=\"Source\", target=\"Target\")\n", "print(quakers)" ] }, { "cell_type": "code", "execution_count": 6, "id": "15eca6e5-051b-4c94-9d15-d967026a4618", "metadata": {}, "outputs": [], "source": [ "# Add node attributes for gender\n", "nx.set_node_attributes(quakers, dict(zip(nodes.Id, nodes.gender)), 'gender')" ] }, { "cell_type": "markdown", "id": "e9d723f2-529e-4a8a-ae86-4b3b8de64c51", "metadata": {}, "source": [ "## Calculating mixed edge probability\n", "\n", "In a network that is *not* homophilous, edges form without regard to group membership, so the chance that an edge is mixed depends only on the group sizes. If `p` is the proportion of nodes in the first group and `q` is the proportion in the second group, an edge is mixed when its first endpoint falls in the first group and its second endpoint in the second group (probability $pq$) or the other way around (probability $qp$). The expected proportion of mixed edges is therefore $2pq$." 
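, "

The reasoning behind $2pq$ can be checked numerically. The sketch below is illustrative only: `expected_mixed_fraction` and `simulated_mixed_fraction` are hypothetical helper names that do not appear elsewhere in this notebook, and the simulation assumes that both endpoints of every edge are drawn independently, which simplifies how a real network forms.

```python
import random


def expected_mixed_fraction(p):
    # Closed form: an edge is mixed with probability p*q + q*p = 2pq, where q = 1 - p
    q = 1 - p
    return 2 * p * q


def simulated_mixed_fraction(p, n_edges=100_000, seed=0):
    # Assumption: each endpoint lands in the first group independently with probability p
    rng = random.Random(seed)
    mixed = 0
    for _ in range(n_edges):
        first_in_group = rng.random() < p
        second_in_group = rng.random() < p
        if first_in_group != second_in_group:
            mixed += 1
    return mixed / n_edges


# 0.84375 is the proportion of male nodes in the Quaker node table (81 of 96)
print(expected_mixed_fraction(0.84375))
print(simulated_mixed_fraction(0.84375))
```

The closed form gives exactly 0.263671875 for this proportion, and the simulated fraction should land close to it.
"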
] }, { "cell_type": "code", "execution_count": 7, "id": "521e7b0d-8878-4526-94dd-962f1ab62391", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.84375" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the proportion of men in the Quaker graph, using pandas\n", "p = nodes.gender.value_counts()[\"male\"]/nodes.gender.count()\n", "p" ] }, { "cell_type": "code", "execution_count": 8, "id": "9484ca73-98d4-430f-a9ec-a1bf7d86e5fd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.15625" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the proportion of women in the Quaker graph, using pandas\n", "q = nodes.gender.value_counts()[\"female\"]/nodes.gender.count()\n", "q" ] }, { "cell_type": "code", "execution_count": 9, "id": "a8666454-66d3-4543-b99d-324d1932eb95", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.263671875" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate the expected probability of mixed edges\n", "2*p*q" ] }, { "cell_type": "markdown", "id": "bbf2a7f9-9892-418a-9b7e-88463ea3b32c", "metadata": {}, "source": [ "## Comparing to the observed number of mixed edges\n", "\n", "If the network were not homophilous, we would expect around 26% of its edges to be mixed edges. To test this, we first count the mixed edges actually present in the network and then compare their share of all edges to the probability above." 
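, "

As a small illustration of this comparison, the sketch below counts mixed edges in a made-up six-node network using plain Python. The names `mixed_edge_fraction`, `toy_edges` and `toy_group` are hypothetical and exist only for this example; the Quaker data is handled with NetworkX in the cells that follow.

```python
def mixed_edge_fraction(edges, group):
    # Share of edges whose two endpoints carry different group labels
    mixed = sum(1 for s, t in edges if group[s] != group[t])
    return mixed / len(edges)


# Hypothetical toy network: two groups of three nodes, six edges
toy_edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('d', 'e'), ('e', 'f'), ('c', 'd')]
toy_group = {'a': 'x', 'b': 'x', 'c': 'x', 'd': 'y', 'e': 'y', 'f': 'y'}

observed = mixed_edge_fraction(toy_edges, toy_group)  # only c-d is mixed
expected = 2 * 0.5 * 0.5                              # both groups hold half the nodes
print(observed, expected)
```

Here one edge in six is mixed against an expected one in two, so the toy network has far fewer mixed edges than a non-homophilous network of the same composition would.
"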
] }, { "cell_type": "code", "execution_count": 10, "id": "3cc88e5e-4d04-481a-a14d-b51761f95012", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find the total number of mixed edges in the network\n", "mixed_edges = len([(s,t) for s,t in quakers.edges if quakers.nodes[s]['gender'] != quakers.nodes[t]['gender']])\n", "mixed_edges" ] }, { "cell_type": "code", "execution_count": 11, "id": "2a0541c4-1625-4e3b-8095-ff5c2b3b3096", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.19753086419753085" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the proportion of mixed edges in the network\n", "mixed_edges/quakers.number_of_edges()" ] }, { "cell_type": "markdown", "id": "8c6bef0f-bbcf-425d-a32e-d4d55347a885", "metadata": {}, "source": [ "Let's define a \"homophily\" measure as the difference between the expected proportion of mixed edges and the observed proportion of mixed edges. A positive value means the network has fewer mixed edges than expected." ] }, { "cell_type": "code", "execution_count": 12, "id": "1702e30c-e073-4e99-973e-f757e637fbb8", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Expected minus observed proportion of mixed edges\n", "# Relies on the global p, q and quakers defined above\n", "def homophily(mixed_edges):\n", "    return 2*p*q - mixed_edges/quakers.number_of_edges()" ] }, { "cell_type": "code", "execution_count": 13, "id": "a829bf8d-febe-4faf-a305-19865b3bd73d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.06614101080246915" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs_homophily = homophily(mixed_edges)\n", "obs_homophily" ] }, { "cell_type": "markdown", "id": "b338d76e-449d-4ba1-b5c2-3aaca4ff829a", "metadata": {}, "source": [ "Our observed homophily measure is about 0.07: the proportion of mixed edges in this graph (about 20%) is roughly 7 percentage points lower than the 26% we would expect to see in a graph that is *not* homophilous.
But how do we know if this measure is significant?\n", "\n", "## Hypothesis testing\n", "\n", "To see whether our graph is *significantly* homophilous, we can set up a resampling procedure to create a hypothesis test for the homophily statistic. This is similar to how you would set up a permutation test for a difference in means." ] }, { "cell_type": "code", "execution_count": 14, "id": "63e93e7c-71e3-4542-b09e-e0d470f6153b", "metadata": {}, "outputs": [], "source": [ "# Create a simulation function\n", "def simulate_mixed_edges(data, attribute, id_attr, graph):\n", " attr_column = data[attribute].sample(frac=1).reset_index(drop=True) # Reshuffle column\n", " nx.set_node_attributes(graph, dict(zip(data[id_attr], attr_column)), attribute) # Set node attribute\n", " mixed_edges = len([(s,t) for s,t in graph.edges if graph.nodes[s][attribute] != graph.nodes[t][attribute]]) # Get number of mixed edges\n", " return mixed_edges" ] }, { "cell_type": "code", "execution_count": 16, "id": "bb505900-1ea1-498a-8c38-e1c7406489ff", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>sim_homophily</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.022931</td></tr>\n",
"    <tr><th>1</th><td>0.072314</td></tr>\n",
"    <tr><th>2</th><td>-0.044970</td></tr>\n",
"    <tr><th>3</th><td>0.016758</td></tr>\n",
"    <tr><th>4</th><td>-0.020279</td></tr>\n",
"    <tr><th>...</th><td>...</td></tr>\n",
"    <tr><th>4995</th><td>-0.131390</td></tr>\n",
"    <tr><th>4996</th><td>-0.069661</td></tr>\n",
"    <tr><th>4997</th><td>0.022931</td></tr>\n",
"    <tr><th>4998</th><td>0.029104</td></tr>\n",
"    <tr><th>4999</th><td>0.059968</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5000 rows × 1 columns</p>\n",
"</div>
" ], "text/plain": [ " sim_homophily\n", "0 0.022931\n", "1 0.072314\n", "2 -0.044970\n", "3 0.016758\n", "4 -0.020279\n", "... ...\n", "4995 -0.131390\n", "4996 -0.069661\n", "4997 0.022931\n", "4998 0.029104\n", "4999 0.059968\n", "\n", "[5000 rows x 1 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Simulate homophily 5000 times\n", "sim_homophily = pd.DataFrame().assign(sim_homophily=[homophily(simulate_mixed_edges(nodes, 'gender', 'Id', quakers)) for i in range(5000)])\n", "sim_homophily" ] }, { "cell_type": "code", "execution_count": 19, "id": "d5c958f4-ac8c-49cc-a10a-e0e2c443e33b", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Plot the results of the permutation test\n", "alt.data_transformers.disable_max_rows()\n", "histogram = alt.Chart(sim_homophily).mark_bar().encode(\n", "    x=alt.X(\"sim_homophily:Q\").bin(maxbins=20),\n", "    y=alt.Y(\"count():Q\")\n", ")\n", "\n", "# Add the observed value as a constant column so it can be drawn as a vertical rule\n", "sim_homophily = sim_homophily.assign(obs_homophily=obs_homophily)\n", "observed_difference = alt.Chart(sim_homophily).mark_rule(color=\"red\", strokeDash=(8,4)).encode(\n", "    x=alt.X(\"obs_homophily\")\n", ")\n", "\n", "histogram + observed_difference" ] }, { "cell_type": "code", "execution_count": 20, "id": "68d0ac48-c616-441c-8e4d-312fbfba8f35", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.1166" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate a p-value: the share of simulations with more homophily than observed\n", "# Select the simulated column explicitly: sim_homophily now also holds the constant\n", "# obs_homophily column, and averaging over both columns would halve the result\n", "p_value = np.mean(sim_homophily[\"sim_homophily\"] > obs_homophily)\n", "p_value" ] }, { "cell_type": "markdown", "id": "5bc79d7f-eb45-4071-b049-d871182e6f45", "metadata": {}, "source": [ "Now we can use the resulting histogram and p-value to determine whether the amount of homophily we see in the Quaker graph is statistically significant. The p-value is the share of shuffled networks whose homophily measure exceeds the observed one. Here it is about 0.12, above the conventional 0.05 threshold, so this test does not show the gender homophily in the Quaker graph to be statistically significant." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }