When we've worked with data in this class, we've typically used a spreadsheet, a prepared text corpus, or a CSV file. In the humanities, data does sometimes come in a ready-made form, even when it needs considerable "cleaning", and the formats we've used already are often the forms that hand-collected data will take. But the authors we're reading for this week don't collect data by hand. When studying the web, especially social media platforms, the data almost always has to be gathered from the sites themselves.
In this tutorial, we'll look at two techniques, web scraping and application programming interfaces, that allow us to pull information from different kinds of web sources.
Say you've found a webpage that has data you would like to study. If you're lucky, the folks who made the page will give you an easy way to download the data directly, but that's often not the case. You could copy-paste the data off of the page into a txt file, but you might miss data this way (or wind up with wrongly-formatted data, missing words, etc.).
The best way to be sure that you're getting all the information on a webpage is to save the page itself. In its simplest form, this is what "web scraping" means. Every website is made of HTML, and all browsers let you save or download that HTML directly. You can do this by hitting CTRL-S or CMD-S (just like saving any other kind of document), or you can find a save button in your browser's menu.
Let's say you'd like to save the page for our course schedule. Here's what that would look like in Firefox:
Once you've saved the HTML, you can process it in a number of ways (and we'll talk about those below). But obviously, this method would only be convenient if all the data you need is on a single webpage, and that is seldom the case. Thankfully, there are more efficient ways of saving webpages.
You already experimented with one way back in the Command Line Workshop. You can download webpages directly on the command line using the wget command. Just type wget followed by the URL of the page you'd like to download. Let's download our course schedule again with wget. Type the one-line command below into your terminal. (Don't include the % sign, which is a special notation just for Jupyter notebooks.)
%alias wget wget #Don't type in this line in your terminal! It just tells Jupyter what to expect.
%wget jrladd.com/dh2020/schedule
--2020-11-08 11:55:57--  http://jrladd.com/dh2020/schedule
Resolving jrladd.com (jrladd.com)... 185.199.111.153, 185.199.108.153, 185.199.109.153, ...
Connecting to jrladd.com (jrladd.com)|185.199.111.153|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://jrladd.com/dh2020/schedule/ [following]
--2020-11-08 11:55:57--  http://jrladd.com/dh2020/schedule/
Reusing existing connection to jrladd.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 13387 (13K) [text/html]
Saving to: ‘schedule’

schedule            100%[===================>]  13.07K  --.-KB/s    in 0s

2020-11-08 11:55:57 (131 MB/s) - ‘schedule’ saved [13387/13387]
The wget command pulled down the HTML file for our page and saved it to your computer in a file called "schedule". You can look at the first part of that file by typing:
%alias head head #Again, ignore this line
%head schedule
<!DOCTYPE html> <html lang="en"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <link rel="stylesheet" href="/dh2020/assets/main.css"><link type="application/atom+xml" rel="alternate" href="https://jrladd.com/dh2020/feed.xml" title="Introduction to Digital Humanities 2020" /></head> <body><header class="site-header" role="banner"> <div class="wrapper"><a class="site-title" rel="author" href="/dh2020/">Introduction to Digital Humanities 2020</a><nav class="site-nav"> <input type="checkbox" id="nav-trigger" class="nav-trigger" />
If you had a list of webpages, you could loop through it and easily download each one without having to manually save each individual page.
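Here's a minimal sketch of that loop in Python, using the requests package (introduced below). The list of URLs and the filename scheme are just for illustration:

```python
import requests

# A short, illustrative list of pages to archive
urls = [
    "https://jrladd.com/dh2020/schedule",
    "https://jrladd.com/dh2020/assignments",
]

for url in urls:
    # Derive a filename from the last part of the URL, e.g. "schedule.html"
    filename = url.rstrip("/").split("/")[-1] + ".html"
    html = requests.get(url).text  # request the page over HTTP
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)  # save the HTML locally
```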
But notice that none of the text for our schedule is in this first part of the file! Instead, there's a lot of HTML markup, encoding that tells the browser how to display the page.
If you're only interested in the content, it's possible to "parse" the HTML and retrieve just the text. There are lots of different tools you can use for this, but an HTML or XML parser like lxml or BeautifulSoup does the job well in Python.
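As a quick illustration of what parsing buys you, here's lxml stripping the markup from a small, made-up HTML snippet and leaving only the text:

```python
from lxml import etree

# A made-up HTML snippet standing in for a saved page
snippet = "<html><body><h1>Schedule</h1><p>Week 1: <a href='/intro'>Intro</a></p></body></html>"

root = etree.fromstring(snippet, etree.HTMLParser())
# itertext() walks the document and yields only the text, not the markup
text = " ".join(root.itertext())
print(text)
```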
You could retrieve HTML files with wget and then process them in Python. Or you could retrieve the HTML directly in Python using the requests package. It's a very simple Python package that lets you request pages using HTTP (the HyperText Transfer Protocol). This is the same protocol that wget uses, and it's what web browsers use to display HTML pages.
Here's a quick sample of requesting our schedule page in Python:
import requests # Import requests module
html = requests.get('https://jrladd.com/dh2020/schedule') # Request page
print(html.text[:1000]) # Print beginning of results
<!DOCTYPE html> <html lang="en"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <link rel="stylesheet" href="/dh2020/assets/main.css"><link type="application/atom+xml" rel="alternate" href="https://jrladd.com/dh2020/feed.xml" title="Introduction to Digital Humanities 2020" /></head> <body><header class="site-header" role="banner"> <div class="wrapper"><a class="site-title" rel="author" href="/dh2020/">Introduction to Digital Humanities 2020</a><nav class="site-nav"> <input type="checkbox" id="nav-trigger" class="nav-trigger" /> <label for="nav-trigger"> <span class="menu-icon"> <svg viewBox="0 0 18 15" width="18px" height="15px"> <path d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z M18,7.516C18,8.335,17.335,9,16.516,9H
You can see that though the syntax is slightly different, the result is exactly the same as with wget. We requested the page and returned its HTML.
Now we can parse, or process, that HTML using lxml. Parsing allows Python to recognize the nested data structure of an HTML or XML document, and it will let us select individual parts of those documents.
Parsing usually requires a little knowledge of the markup structure. It's typically best to take a look at the raw document beforehand to get a sense of how to parse.
Let's say we wanted to create a spreadsheet of all the links on the schedule page. We can see in the HTML snippet above that there is at least one <a> HTML tag with information about a link. Each <a> element has text, which describes the link, and an href attribute, which contains the URL itself.
You don't need to worry about the Python details of parsing, but here's what parsing out the link text and URLs with lxml looks like:
from lxml import etree
parser = etree.HTMLParser()
root = etree.fromstring(html.text, parser)
for a in root.iter("a"):
    print(a.text, ",", a.get("href"))
Introduction to Digital Humanities 2020 , /dh2020/
Assignments , /dh2020/assignments/
Credits , /dh2020/credits/
Course Policies , /dh2020/policies/
Schedule & Readings , /dh2020/schedule/
Workshops , /dh2020/workshops/
“What Is Digital Humanities and What’s It Doing in English Departments?” , https://dhdebates.gc.cuny.edu/read/untitled-88c11800-9446-469b-a3be-3fdb36bfbd1e/section/f5640d43-b8eb-4d49-bc4b-eb31a16f3d06#ch01
“Toward a Critical Black Digital Humanities” , https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/5aafe7fe-db7e-4ec1-935f-09d8028a2687#ch02
“‘This is Why We Fight’: Defining the Values of the Digital Humanities” , https://dhdebates.gc.cuny.edu/read/untitled-88c11800-9446-469b-a3be-3fdb36bfbd1e/section/9e014167-c688-43ab-8b12-0f6746095335
Google Ngram Viewer , https://books.google.com/ngrams
EEBO Ngram Browser , https://earlyprint.org/lab/tool_ngram_browser.html
“Introduction: Why Data Needs Feminism” , https://data-feminism.mitpress.mit.edu/pub/frfa9szd/release/3
6. The Numbers Don’t Speak for Themselves , https://data-feminism.mitpress.mit.edu/pub/czq9dfs5/release/2
“How did they make that?” , http://miriamposner.com/blog/how-did-they-make-that/
Voyant , https://voyant-tools.org/
Palladio , http://hdlab.stanford.edu/projects/palladio/
RAWGraphs , https://rawgraphs.io/
The Data-Sitters Club , https://datasittersclub.github.io/site/
“Josephine Miles and the Origins of Distant Reading” , https://modernismmodernity.org/forums/posts/search-and-replace
“The History of Humanities Computing” , http://www.digitalhumanities.org/companion/view?docId=blackwell/9781405103213/9781405103213.xml&chunk.id=ss1-2-1
Wordhoard , http://wordhoard.northwestern.edu/userman/index.html
The Rosetti Archive , http://www.rossettiarchive.org/index.html
Sea and Spar Between , https://nickm.com/montfort_strickland/sea_and_spar_between/
“Machine Reading the , http://www.digitalhumanities.org/dhq/vol/10/4/000268/000268.html
“Alien Reading: Text Mining, Language Standardization, and the Humanities” , https://dhdebates.gc.cuny.edu/read/untitled/section/4b276a04-c110-4cba-b93d-4ded8fcfafc9#ch18
The Princeton Prosody Archive , https://prosody.princeton.edu/
Digital Harlem , http://digitalharlem.org/
Viral Texts , https://viraltexts.org/
Colored Conventions , https://coloredconventions.org/
“Spaces of Meaning: Conceptual History, Vector Semantics, and Close Readings” , https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/4ce82b33-120f-423f-ba4c-40620913b305
Cultural Analytics , https://culturalanalytics.org/
Open Syllabus , https://opensyllabus.org/
“America’s Next Top Novel” , https://post45.org/2020/04/americas-next-top-novel/
“Demystifying Networks, Parts I & II” , http://journalofdigitalhumanities.org/1-1/demystifying-networks-by-scott-weingart/
“If Everything is a Network, Nothing is a Network” , https://visualisingadvocacy.org/node/739.html
Six Degrees of Francis Bacon , http://sixdegreesoffrancisbacon.com/
LinkedJazz , https://linkedjazz.org/
Mapping the Republic of Letters , http://republicofletters.stanford.edu/
Network Analysis + Digital Art History , https://sites.haa.pitt.edu/na-dah/
All One People Under One King , https://maevekane.net/wmq-uc/
“Humanities Approaches to Graphical Display” , http://digitalhumanities.org/dhq/vol/5/1/000091/000091.html
Chapter 7: “Show Your Work” , https://data-feminism.mitpress.mit.edu/pub/0vgzaln4/release/2
“A Layered Grammar of Graphics” , http://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf
RAWGraphs , https://rawgraphs.io/
StoryMap , https://storymap.knightlab.com/
TimelineJS , https://timeline.knightlab.com/
The Decolonial Atlas , https://decolonialatlas.wordpress.com/
Two Plantations , http://twoplantations.com/
West River Inscriptions , http://digital.wustl.edu/westriver/index.html
“Reconstitute the World” , http://nowviskie.org/2018/reconstitute-the-world/
Change Us, Too , http://nowviskie.org/2019/change-us-too/
Torn Apart/Separados , http://xpmethod.columbia.edu/torn-apart/volume/1/index
Mapping Police Violence , https://mappingpoliceviolence.org/
Collections as Data , https://collectionsasdata.github.io/
DH Syllabi Collection , /dh2020/credits
Directory of Caribbean Digital Scholarship Data Sheet , https://docs.google.com/spreadsheets/d/1PfgI0GrQR60gwRFVIZmZtWae9JyAMpZNFOZRe5xsMsg/edit?usp=sharing
Alex Gil’s tweets , https://mobile.twitter.com/elotroalex/status/1320770230805762050
“Tweets of a Native Son: The Quotation and Recirculation of James Baldwin from Black Power to #BlackLivesMatter” , https://muse-jhu-edu.turing.library.northwestern.edu/article/704336
“The Route of a Text Message, A Love Story” , https://www.vice.com/en_us/article/kzdn8n/the-route-of-a-text-message-a-love-story
Documenting the Now , https://www.docnow.io/
Tweets of a Native Son , https://tweetsofanativeson.com/
Algorithmic Accountability: A Primer , https://datasociety.net/library/algorithmic-accountability-a-primer/
jrladd@northwestern.edu , mailto:jrladd@northwestern.edu
None , https://github.com/jrladd
None , https://www.twitter.com/johnrladd
You can see above that the code has output a list of links, with each link's text separated from its URL by a comma. We could easily put this information into a CSV file or spreadsheet.
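Python's built-in csv module can handle that last step. The filename and the sample pairs below are just illustrative:

```python
import csv

# A few of the (text, URL) pairs printed above
links = [
    ("Assignments", "/dh2020/assignments/"),
    ("Voyant", "https://voyant-tools.org/"),
]

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "url"])  # header row
    writer.writerows(links)           # one row per link
```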
Note that, as is very common with web scraping, the data needs some further processing or cleaning. Some of the links don't have accompanying text at all (the ones labeled "None"). Other links refer to relative URL paths (e.g. "/dh2020/credits") rather than absolute URL paths (i.e. anything that begins with http). Usually after web scraping you wind up with a data set that needs more work before analysis can be done, and for that you might return to a tool like OpenRefine.
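For the relative-path problem in particular, the standard library's urljoin can convert relative URLs to absolute ones, as long as you know the page they were scraped from:

```python
from urllib.parse import urljoin

base = "https://jrladd.com/dh2020/schedule/"  # the page the links came from
links = ["/dh2020/credits", "https://voyant-tools.org/"]  # one relative, one absolute

# urljoin resolves relative paths against the base and leaves absolute URLs alone
absolute = [urljoin(base, link) for link in links]
print(absolute)
```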
A final, very important note on web scraping: While technically speaking there's nothing to stop you from requesting any web page in the manner shown above, scraping a page that contains proprietary information is sometimes illegal and has led to legal action in the past. Use extreme caution when web scraping! If the data seems restricted or proprietary, check to see if the site already has a public API, or reach out to the site's owner.
Web scraping is all well and good if the information you need is available in static HTML on one page or a few pages. The problem is that this isn't how most modern websites work, especially the ones we all visit most frequently (social media sites, search engines, etc.). These sites are web applications, in which HTML templates are constantly being updated with new data. This is how you wind up with timeline streams, profile pages, search results, and all the things we've come to expect from the modern web. In these cases, the HTML on a given page isn't consistent, and requesting just one page or set of pages won't ever get you access to the underlying data.
A modern web application uses an application programming interface (API) to exchange data between the pages you see in a browser and a web server. When you perform a Google search, the site requests data via an API, and the same thing happens when your timeline updates on Twitter or Facebook.
APIs are usually invisible to end users, but in some cases sites make public APIs available so that users can request data directly, bypassing the front-end website. This is even true among some digital humanities projects. (Six Degrees of Francis Bacon has a public API, for example.)
The process of requesting data via APIs is similar to requesting webpages. APIs work by creating URLs for specific kinds of database queries. Typically you input the URL for an API followed by a "query string" that includes information about what data you would like it to return.
This is best understood by example. We'll use the API for Wordnik, a popular dictionary site whose open API is often used for poetry generation projects, including by Darius Kazemi for his many Twitter bots and online projects.
Go to wordnik.com and search for "cat". You'll get their information on the word, which includes definitions, examples, etymologies, synonyms, rhymes, and more. You could scrape the information off of this page, but it would take a lot of time to process the data into a usable form. (And you might get in trouble with Wordnik for doing so.)
Instead, you can sign up for the Wordnik API. Their free account lets you make 100 API requests per hour, more than enough for basic uses. By signing up you get an API Key, basically a password that allows you to pull data directly from the API. (For the purposes of this tutorial, I'm using my personal API key, which I've saved in a file called "wordnik" and am importing into this code.)
To make a request to the Wordnik API, you construct the appropriate URL. For example, the URL for all of Wordnik's definitions of the word cat would be:
http://api.wordnik.com/v4/word.json/cat/definitions?api_key=YOUR_API_KEY
You could change the word "cat" in the URL to any other word to get definitions for a different word. You could change "definitions" in the URL to "etymologies," "frequency," or even "scrabbleScore" to get different kinds of information about your chosen word. And Wordnik also lets you make requests for lists of words all at once.
Everything that comes after the question mark in the URL is the query string. Query strings can get quite complicated, but in this case the query contains just one variable, the API key. You would replace what comes after the equals sign with your unique API key. Though it's not the case with Wordnik, sometimes the query string contains the main information for your API query. One could imagine an API similar to Wordnik's where the query string was ?word=cat&info=definitions.
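If you ever need to build longer query strings, Python's standard library can assemble them from a dictionary. Here's that imaginary word/info query again:

```python
from urllib.parse import urlencode

# The imaginary query described above, as a dictionary
params = {"word": "cat", "info": "definitions"}
query_string = "?" + urlencode(params)
print(query_string)  # ?word=cat&info=definitions
```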
If you follow the URL above (with an appropriate API key), you'll see this:
This is JSON, JavaScript Object Notation, the most popular file format for APIs. [n.b. This is how it looks in Firefox, but it might display differently depending on your browser.] You can see that the word and its definitions are organized into various fields. Every API will be different, which is why it's important to read documentation and look carefully at initial results. But once you've done that once or twice, you're ready to request and process API data in Python.
Here's a Python example, requesting the JSON data for the definitions of the word "cat."
from wordnik import API_KEY #Import API key from separate file to keep it secret
import requests
word = "cat"
params = {'api_key': API_KEY}
result = requests.get(f"http://api.wordnik.com/v4/word.json/{word}/definitions", params=params)
print(result.text[:1000])
[{"id":"C5155700-1","partOfSpeech":"noun","attributionText":"from The American Heritage® Dictionary of the English Language, 5th Edition.","sourceDictionary":"ahd-5","sequence":"1","score":0,"labels":[],"citations":[],"word":"cat","relatedWords":[],"exampleUses":[],"textProns":[],"notes":[],"attributionUrl":"https://ahdictionary.com/","wordnikUrl":"https://www.wordnik.com/words/cat"},{"id":"C5155700-2","partOfSpeech":"noun","attributionText":"from The American Heritage® Dictionary of the English Language, 5th Edition.","sourceDictionary":"ahd-5","text":"A small domesticated carnivorous mammal <em>(Felis catus),</em> kept as a pet and as catcher of vermin, and existing in a variety of breeds.","sequence":"2","score":0,"labels":[],"citations":[],"word":"cat","relatedWords":[],"exampleUses":[],"textProns":[],"notes":[],"attributionUrl":"https://ahdictionary.com/","wordnikUrl":"https://www.wordnik.com/words/cat"},{"id":"C5155700-3","partOfSpeech":"noun","attributionText":"from The American
Above, I've printed just the first part of the resulting JSON data. It may look like nonsense if you're not familiar with JSON, but it's far more organized than the data that could have been scraped from the HTML page. Here's an example of parsing this data into a list of definitions:
for definition in result.json():
    try:
        print(definition['text'])
        print()
    except KeyError:
        pass
A small domesticated carnivorous mammal <em>(Felis catus),</em> kept as a pet and as catcher of vermin, and existing in a variety of breeds. Any of various other carnivorous mammals of the family Felidae, including the lion, tiger, leopard, and lynx. The fur of a domestic cat. A woman who is regarded as spiteful. A person, especially a man. A player or devotee of jazz music. A cat-o'-nine-tails. A catfish. A cathead. A device for raising an anchor to the cathead. A catboat. A catamaran. To hoist an anchor to (the cathead). To look for sexual partners; have an affair or affairs. (<em>let the cat out of the bag</em>) To let a secret be known. An abbreviated form of <internalXref urlencoded="catamaran">catamaran</internalXref>. To draw (an anchor) up to the cat-head. To fill with soft clay, as the intervals between laths: as, a chimney well <em>catted.</em> To fish for catfish. The form of <internalXref urlencoded="cata-">cata-</internalXref> before a vowel. To act after the manner of soft clay or mortar in filling crevices. An abbreviation of <em>Catalan</em>: [<em>lowercase</em>] of <em>catalogue</em>; of <em>catechism.</em> In <em>medieval warfare</em>, a machine resembling the pluteus, under the protection of which soldiers worked in sapping walls and fosses. <em>plural</em> In <em>mining</em>, burnt clay used for tamping. Same as <internalXref urlencoded="channel-cat">channel-cat</internalXref>. A domesticated carnivorous quadruped of the family <em>Felidæ</em> and genus <em>Felis, F. domestica.</em> In general, any digitigrade carnivorous quadruped of the family <em>Felidæ</em>, as the lion, tiger, leopard, jaguar, etc., especially of the genus <em>Felis</em>, and more particularly one of the smaller species of this genus; and of the short-tailed species of the genus <em>Lynx.</em> A ferret. A gossipy, meddlesome woman given to scandal and intrigue. A catfish. 
A whip: a contraction of <em>cat-o'-nine-tails.</em> A double tripod having six feet: so called because it always lands on its feet, as a cat is proverbially said to do. In the middle ages, a frame of heavy timber with projecting pins or teeth, hoisted up to the battlements, ready to be dropped upon assailants. Also called <internalXref urlencoded="prickly%20cat">prickly cat</internalXref>. A piece of wood tapering to a point at both ends, used in playing tip-cat. The game of tip-cat. Also called <internalXref urlencoded="cat-and-dog">cat-and-dog</internalXref>. In <em>faro</em>, the occurrence of two cards of the same denomination out of the last three in the deck. In <em>coal-mining</em>, a clunchy rock. See <internalXref urlencoded="clunch">clunch</internalXref>. [Apparently in allusion to the sly and deceitful habits of the cat.] A mess of coarse meal, clay, etc., placed on dovecotes, to allure strangers. In <em>plastering</em>, that portion of the first rough coat which fills the space between the laths, often projecting at the back, and serving to hold the plaster firmly to the walls. The salt which crystallizes about stakes placed beneath the holes in the bottom of the troughs in which salt is put to drain. A ship formed on the Norwegian model, having a narrow stern, projecting quarters, and a deep waist. <em>Nautical</em>, a tackle used in hoisting an anchor from the hawse-hole to the cat-head. To bring to the cathead. See <xref urlencoded="anchor">anchor</xref>. Any animal belonging to the natural family <ex>Felidae</ex>, and in particular to the various species of the genera Felis, Panthera, and Lynx. The domestic cat is <spn>Felis domestica</spn>. The European wild cat (<spn>Felis catus</spn>) is much larger than the domestic cat. In the United States the name <stype>wild cat</stype> is commonly applied to the bay lynx (<spn>Lynx rufus</spn>). 
The larger felines, such as the lion, tiger, leopard, and cougar, are often referred to as <ex>cats</ex>, and sometimes as big cats. See <xref urlencoded="wild%20cat">wild cat</xref>, and <xref urlencoded="tiger%20cat">tiger cat</xref>. A strong vessel with a narrow stern, projecting quarters, and deep waist. It is employed in the coal and timber trade. A strong tackle used to draw an anchor up to the cathead of a ship.
So the code above gives you every definition of the word "cat" that Wordnik has recorded. With a deep dive into the Wordnik API documentation, you could request data on all sorts of features about various words.
Getting data from social media sites or other APIs works in exactly the same way. Pulling data from Twitter, for example, requires signing up for a number of Twitter API keys and learning the specific API queries to make. The Google Maps API is another example of a popular web API made by a large tech company that's used often in digital humanities projects. (You can use the Google Maps API to assign geocoordinates to place names.)
The requests library is an all-purpose tool for requesting data from URLs. But many sites have dedicated software libraries for working with their APIs in a way that might be even easier than using requests. These libraries are sometimes made by the sites themselves, but they're often made by third-party developers. Wordnik offers a long list of libraries for a wide variety of programming languages and platforms. There are lots of libraries for the most popular APIs, especially for social networks. The twarc tool from Documenting the Now is a specialized API library for archiving tweets that works on the command line.
While DH scholars are most concerned with using APIs to retrieve information from web platforms, you can also send information to websites using APIs. This is how Twitter bots work: small code snippets make requests to Twitter's API to send tweets, react to posts, follow users, and more.
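Here's a hedged sketch of what sending data looks like with requests. The endpoint, field names, and key are entirely made up, since every platform defines its own:

```python
import requests

# Build (but don't send) a POST request to a made-up API endpoint
req = requests.Request(
    "POST",
    "https://api.example.com/v1/statuses",        # hypothetical endpoint
    data={"status": "Hello from a bot!"},         # hypothetical field name
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
prepared = req.prepare()  # assembles the method, URL, headers, and body
print(prepared.method, prepared.url)
# To actually send it: requests.Session().send(prepared)
```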
APIs are a flexible tool for interacting with modern web applications, and with a few basic techniques it's relatively easy for DH scholars to take advantage of the information these APIs provide, in order to archive and critique these platforms.