07: The Shape of the Web
n.b. BECAUSE THIS ASSIGNMENT IS DUE AFTER SPRING BREAK, YOU CANNOT RECEIVE AN EXTENSION.
This week we’ll explore a large directed information network: a snapshot of the Web. Download the Stanford Web Graph from the Stanford Large Network Dataset Collection, which is also available on the Datasets page in the textbook. In this dataset, nodes represent pages and edges represent hyperlinks between those pages.
Keep in mind that this network has a huge number of nodes and edges; it’s far larger than any network we’ve used so far. Some of your code might take a little longer to run than usual, but if it’s taking more than 5 minutes or so, something else is probably wrong. Also, you do not need to unzip the txt.gz file. You can read the file directly without unzipping it, and that will save space on your computer.
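As a minimal sketch of reading the compressed file directly, the function below uses only Python's built-in gzip module. It assumes the file is a whitespace-separated edge list with comment lines that begin with "#"; the function name read_edge_list and the filename in the usage comment are illustrative assumptions, and the library you use in class may offer its own way to do this.

```python
import gzip


def read_edge_list(path):
    """Read a gzipped edge list into a list of (source, target) integer pairs."""
    edges = []
    # Mode "rt" decompresses on the fly as text, so nothing is unzipped on disk
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and comment lines (assumed to start with "#")
            if not line or line.startswith("#"):
                continue
            source, target = line.split()[:2]
            edges.append((int(source), int(target)))
    return edges


# Hypothetical usage; the filename is an assumption about what you downloaded:
# edges = read_edge_list("web-Stanford.txt.gz")
```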
In a short Jupyter notebook report, answer the following questions about this network. Don’t simply calculate the answers: make sure you’re fully explaining (in writing) the metrics and visualizations that you generate. Consider the Criteria for Good Reports as a guide. You can create markdown cells with section headers to separate the different sections of the report. Rather than number the report as if you’re answering distinct questions, use the questions as a guide to do some data storytelling, i.e. explain this network’s data in an organized way.
What are the average in-degree and the average out-degree of all nodes in this network? What do you notice when you calculate these, and how does that match with what you know about these measures? Also create two histograms showing the distributions of in-degree and out-degree. Make sure each of these histograms has a good bin size, a title, and a label for the x-axis. Describe each visualization completely: what do these distributions tell you about this network?
n.b. You may need to add axis limits to make these plots functional.
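To make the degree measures concrete, here is a small sketch on a toy edge list (made-up data, not the Stanford graph). It uses only the standard library; the variable names are illustrative, and your own notebook will likely rely on the graph library from class instead.

```python
from collections import Counter

# Toy directed edge list (hypothetical data, not the Stanford Web Graph)
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

nodes = {n for edge in edges for n in edge}
out_degree = Counter(source for source, _ in edges)  # links leaving each node
in_degree = Counter(target for _, target in edges)   # links arriving at each node

# Average over every node, including nodes whose degree is zero
avg_out = sum(out_degree[n] for n in nodes) / len(nodes)
avg_in = sum(in_degree[n] for n in nodes) / len(nodes)

# Degree distribution: how many nodes have each in-degree value
in_degree_distribution = Counter(in_degree[n] for n in nodes)

print(avg_in, avg_out)
print(sorted(in_degree_distribution.items()))
```

The distribution counts are the same numbers a histogram would display, so they can be a quick check on whether your bins and axis limits are hiding most of the nodes.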
Is this network strongly connected? Is it weakly connected? What does that tell you about the graph? How many strongly connected components does the network have, and what is the size of the largest strongly connected component? What have you learned about the network from these measures?
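The connectivity questions above can be illustrated on a toy graph. The sketch below is a plain-Python version of Kosaraju's algorithm for strongly connected components plus a simple weak-connectivity check; it is only meant to show what the measures mean. The function names and the toy data are assumptions, and a graph library's built-in functions will be far more practical on a network of this size.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Return a list of sets, one per strongly connected component (Kosaraju)."""
    forward, backward = defaultdict(list), defaultdict(list)
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)

    # Pass 1: record nodes in order of DFS completion on the original graph
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbors explored: this node is finished
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in reverse completion order
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def is_weakly_connected(nodes, edges):
    """True if every node is reachable from every other when direction is ignored."""
    undirected = defaultdict(set)
    for source, target in edges:
        undirected[source].add(target)
        undirected[target].add(source)
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in undirected[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen == nodes


# Toy graph (hypothetical data): a 3-cycle, a 2-cycle, and one extra page
toy_nodes = [1, 2, 3, 4, 5, 6]
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1)]

components = strongly_connected_components(toy_nodes, toy_edges)
largest = max(components, key=len)
strongly_connected = len(components) == 1  # one component covering every node
weakly_connected = is_weakly_connected(toy_nodes, toy_edges)
```

In this toy graph the nodes split into three strongly connected components even though the graph is weakly connected, which is the kind of contrast worth explaining in your report.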
Does this network have a “bow tie” structure like the one mentioned in Chapter 13 of your book? How would you find this out? You’ll need to calculate a few different measures here and report your findings. (Hint: focus on just the three largest strongly connected components.)
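One common way to describe a bow tie, sketched here on toy data, is to take the largest strongly connected component as the core and then find which nodes can reach it (IN) and which nodes it can reach (OUT). This is not necessarily the approach the hint has in mind; the core set, the helper name reachable, and the edge list are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def reachable(starts, adjacency):
    """Return every node reachable from `starts` by following `adjacency` (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy directed graph (hypothetical data, not the Stanford Web Graph)
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (6, 1), (7, 8)]
nodes = {n for edge in edges for n in edge}

forward, backward = defaultdict(list), defaultdict(list)
for source, target in edges:
    forward[source].append(target)
    backward[target].append(source)

# Assumed here: the largest strongly connected component, found separately
core = {1, 2, 3}

out_part = reachable(core, forward) - core   # reachable from the core
in_part = reachable(core, backward) - core   # can reach the core
other = nodes - core - out_part - in_part    # tendrils and disconnected pieces

print(len(core), len(in_part), len(out_part), len(other))
```

Comparing the sizes of these four groups is one way to judge how closely a network resembles the bow tie picture.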
When you’re finished, remove these instructions from the top of the file, leaving only your own writing and code. Export the notebook as an HTML file, check to make sure everything is formatted correctly, and submit your HTML file to Sakai.