Getting Data from the Web

When we've worked with data in this class, we've typically used a spreadsheet, a prepared text corpus, or a CSV file. In the humanities, data does sometimes come in a ready-made form like this, even when it needs considerable "cleaning," and these are often the forms that hand-collected data will take. But the authors we're reading for this week don't collect data by hand. When studying the web, especially social media platforms, the data almost always has to be gathered from the web itself.

In this tutorial, we'll look at two techniques, web scraping and application programming interfaces, that allow us to pull information from different kinds of web sources.

Web Scraping

Say you've found a webpage that has data you would like to study. If you're lucky, the folks who made the page will give you an easy way to download the data directly, but that's often not the case. You could copy-paste the data off of the page into a txt file, but you might miss data this way (or wind up with wrongly-formatted data, missing words, etc.).

The best way to be sure that you're getting all the information on a webpage is to save the page itself. In its simplest form, this is what "web scraping" means. Every website is made of HTML, and all browsers let you save or download that HTML directly. You can do this by hitting CTRL-S or CMD-S (just like saving any other kind of document). Or you can find a save button.

Let's say you'd like to save the page for our course schedule. In Firefox, for example, you'd choose "Save Page As" (or hit CTRL-S/CMD-S) and save the page's HTML to your computer.

Once you've saved the HTML, you can process it in a number of ways (and we'll talk about those below). But obviously, this method is only convenient if all the data you need is on a single webpage, and that is seldom the case. Thankfully there are more efficient ways of saving webpages.

You already experimented with one way back in the Command Line Workshop. You can download webpages directly on the command line using the wget command. Just type wget followed by the URL of the page you'd like to download. Let's download our course schedule again with wget. Type the one-line command below into your terminal. (Don't include the % sign, which is a special notation just for Jupyter notebooks).
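Here's a minimal sketch of that command, shown without the Jupyter % prefix and with a placeholder URL standing in for the schedule page's real address:

```bash
# Download a single page; swap in the actual URL of the course schedule.
wget https://your-course-site.example/dh2020/schedule
```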

The wget command pulled down the HTML file for our page and saved it to your computer in a file called "schedule". You can look at the first part of that file by typing:
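One way to do that (assuming, as above, that wget saved the page under the name schedule):

```bash
# Print the first 20 lines of the downloaded HTML file.
head -n 20 schedule
```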

If you had a list of webpages, you could loop through it and easily download each one without having to manually save each individual page.
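For example, if the URLs were listed one per line in a text file, a short shell loop (or wget's -i flag, which reads URLs from a file) could fetch them all. A sketch, assuming a hypothetical file called urls.txt:

```bash
# Download every page listed (one URL per line) in urls.txt.
# Equivalent shortcut: wget -i urls.txt
while read -r url; do
    wget "$url"
done < urls.txt
```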

But notice that none of the text for our schedule is in this first part of the file! Instead, there's a lot of HTML markup: code that tells the browser how to display the page.

If you're only interested in the content, it's possible to "parse" the HTML and retrieve just the text. There are lots of different tools you can use for this, but an HTML or XML parser like lxml or BeautifulSoup does the job well in Python.

You could retrieve HTML files with wget and then process them in Python. Or you could retrieve the HTML directly in Python using the requests package, a very simple Python package that lets you request pages using HTTP (the HyperText Transfer Protocol). This is the same protocol that wget uses, and it's what web browsers use to display HTML pages.

Here's a quick sample of requesting our schedule page in Python:
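A minimal sketch, again with a placeholder URL standing in for the schedule page's real address:

```python
import requests

# Request the page over HTTP; the URL here is a placeholder for the course schedule.
response = requests.get("https://your-course-site.example/dh2020/schedule")

# The page's raw HTML arrives as a string in response.text.
html = response.text
print(html[:500])  # print just the first part of the HTML
```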

You can see that though the syntax is slightly different, the result is exactly the same as with wget. We requested the page and got back its HTML.

Now we can parse, or process, that HTML using lxml. Parsing allows Python to recognize the nested data structure of an HTML or XML document, and it will let us select individual parts of those documents.

Parsing usually requires a little knowledge of the markup structure. It's typically best to take a look at the raw document beforehand to get a sense of how to parse.

Let's say we wanted to create a spreadsheet of all the links on the schedule page. Looking at the HTML, we can see that every link on the page is marked up with an <a> tag. Each <a> element has text, which describes the link, and an href attribute, which contains the URL itself.

You don't need to worry about the Python details of parsing, but here's what parsing out the link text and URLs using lxml looks like:
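Here's a sketch of that step, reusing the placeholder URL from above and lxml's html module to find every <a> tag:

```python
import requests
from lxml import html as lxml_html

# Re-request the schedule page (placeholder URL) and parse the HTML into a tree.
page = requests.get("https://your-course-site.example/dh2020/schedule")
tree = lxml_html.fromstring(page.content)

# Find every <a> element and print its text alongside its href attribute.
for link in tree.findall(".//a"):
    print(link.text, link.get("href"), sep=",")
```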

You can see above that the code has output a list of data, where each link's text and URL are separated by a comma. We could easily put this information into a CSV file or spreadsheet.

Note that, as is very common with web scraping, the data needs some further processing or cleaning. Some of the links don't have accompanying text at all (the ones labeled "None"). Other links use relative URL paths (e.g. "/dh2020/credits") rather than absolute URL paths (i.e. ones that begin with http). Usually after web scraping you wind up with a data set that needs more work before analysis can be done, and for that you might return to a tool like OpenRefine.

A final, very important note on web scraping: while technically speaking there's nothing to stop you from requesting any web page in the manner shown above, scraping a page that contains proprietary information is sometimes illegal and has led to legal action in the past. Use extreme caution when web scraping! If the data seems restricted or proprietary, check to see if the site already has a public API, or reach out to the site's owner.

APIs

Web scraping is all well and good if the information you need is available in static HTML on one page or a few pages. The problem is that this isn't how most modern websites work, especially the ones we all visit most frequently (social media sites, search engines, etc.). These sites are web applications, in which HTML templates are constantly being updated with new data. This is how you wind up with timeline streams, profile pages, search results, and all the things we've come to expect from the modern web. In these cases, the HTML on a given page isn't consistent, and requesting just one page or set of pages won't ever get you access to the underlying data.

A modern web application uses an application programming interface (API) to exchange data between the pages you see in a browser and a web server. When you perform a Google search, the site requests data via an API, and the same thing happens when your timeline updates on Twitter or Facebook.

APIs are usually invisible to end users, but in some cases sites make public APIs available so that users can request data directly, bypassing the front-end website. This is even true among some digital humanities projects. (Six Degrees of Francis Bacon has a public API, for example.)

The process of requesting data via APIs is similar to requesting webpages. APIs work by creating URLs for specific kinds of database queries. Typically you input the URL for an API followed by a "query string" that includes information about what data you would like it to return.

This is best understood by example. We'll use the API for Wordnik, a popular dictionary site whose open API is often used for poetry generation projects, including by Darius Kazemi for his many Twitter bots and online projects.

Wordnik

Go to wordnik.com and search for "cat". You'll get the site's information on the word, which includes definitions, examples, etymologies, synonyms, rhymes, and more. You could scrape the information off of this page, but it would take a lot of time to process the data into a usable form. (And you might get in trouble with Wordnik for doing so.)

Instead, you can sign up for the Wordnik API. Their free account lets you make 100 API requests per hour, more than enough for basic uses. By signing up you get an API Key, basically a password that allows you to pull data directly from the API. (For the purposes of this tutorial, I'm using my personal API key, which I've saved in a file called "wordnik" and am importing into this code.)
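Reading in the key might be as simple as this sketch, which assumes the file named "wordnik" contains nothing but the key itself:

```python
# Read the API key from a local file called "wordnik" (assumed to hold only the key).
with open("wordnik") as f:
    api_key = f.read().strip()
```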

To make a request to the Wordnik API, you construct the appropriate URL. For example, the URL for all of Wordnik's definitions of the word cat would be:

http://api.wordnik.com/v4/word.json/cat/definitions?api_key=YOUR_API_KEY

You could change the word "cat" in the URL to any other word to get definitions for a different word. You could change "definitions" in the URL to "etymologies," "frequency," or even "scrabbleScore" to get different kinds of information about your chosen word. And Wordnik also lets you make requests for lists of words all at once.

Everything that comes after the question mark in the URL is the query string. Query strings can get quite complicated, but in this case the query contains just one variable, the API key. You would replace what comes after the equals sign with your unique API key. Though it's not the case with Wordnik, sometimes the query string contains the main information for your API query. One could imagine an API similar to Wordnik's where the query string was ?word=cat&info=definitions.
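If you're building requests in Python, you don't have to assemble query strings by hand: the requests library will construct them from a dictionary of parameters. Here's a small illustration using that imaginary ?word=cat&info=definitions query (the endpoint URL is made up):

```python
import requests

# Build (but don't send) a request to an imaginary API, just to see how
# requests turns a params dictionary into a query string.
req = requests.Request(
    "GET",
    "https://api.example.com/lookup",
    params={"word": "cat", "info": "definitions"},
).prepare()

print(req.url)  # https://api.example.com/lookup?word=cat&info=definitions
```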

If you follow the Wordnik URL above (with an appropriate API key), you'll see the response:

This is JSON (JavaScript Object Notation), the most popular file format for APIs. [n.b. This is how it looks in Firefox, but it might display differently depending on your browser.] You can see that the word and its definitions are organized into various fields. Every API will be different, which is why it's important to read the documentation and look carefully at initial results. But once you've done that once or twice, you're ready to request and process API data via Python.

Here's a Python example, requesting the JSON data for the definitions of the word "cat."
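A sketch of that request, using the URL pattern shown above and the API key read in from the "wordnik" file:

```python
import requests

# Assumes the API key was saved in a file called "wordnik", as described above.
with open("wordnik") as f:
    api_key = f.read().strip()

word = "cat"
url = f"http://api.wordnik.com/v4/word.json/{word}/definitions"

# Request the definitions and decode the JSON response into Python data structures.
response = requests.get(url, params={"api_key": api_key})
definitions_data = response.json()

# Peek at just the first part of the result.
print(str(definitions_data)[:500])
```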

Above, I've printed just the first part of the resulting JSON data. It may look like nonsense if you're not familiar with JSON, but it's far more organized than the data that could have been scraped from the HTML page. Here's an example of parsing this data into a list of definitions:
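Here's a sketch of that parsing step. It assumes the API returns a list of definition objects and that each one stores its definition in a "text" field; those field names are an assumption, so check the response you actually get (or the Wordnik documentation) and adjust accordingly:

```python
# Pull the definition text out of each item, skipping any entries that
# don't include a "text" field (an assumed field name; verify against
# the actual Wordnik response).
definitions = [item["text"] for item in definitions_data if "text" in item]

for definition in definitions:
    print(definition)
```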

So the code above gives you every definition of the word "cat" that Wordnik has recorded. With a deep dive into the Wordnik API documentation, you could request data on all sorts of features about various words.

Getting data from social media sites or other APIs works in exactly the same way. Pulling data from Twitter, for example, requires signing up for a number of Twitter API keys and learning the specific API queries to make. The Google Maps API is another example of a popular web API made by a large tech company that's used often in digital humanities projects. (You can use the Google Maps API to assign geocoordinates to place names.)

The requests library is an all-purpose tool for requesting data from URLs. But many APIs also have dedicated software libraries that can be even easier to work with than raw requests calls. These libraries are sometimes made by the sites themselves, but they're often made by third-party developers. Wordnik offers a long list of libraries for a wide variety of programming languages and platforms. There are lots of libraries for the most popular APIs, especially for social networks. The twarc tool from Documenting the Now is a specialized API library for archiving tweets that works on the command line.

While DH scholars are most concerned with using APIs to retrieve information from web platforms, you can also send information to websites using APIs. This is how Twitter bots work: small code snippets make requests to Twitter's API to send tweets, react to posts, follow users, and more.

APIs are a flexible tool for interacting with modern web applications, and with a few basic techniques it's relatively easy for DH scholars to take advantage of the information these APIs provide, in order to archive and critique these platforms.