2.6 Using Application Programming Interfaces (APIs) to Retrieve Data

Sometimes it is useful for a program on one computer to be able to speak to a program on another computer, via the web. This is achieved through the use of ‘application programming interfaces’ (APIs). One example you may have come across is sharing a photo from your phone to a social media website: the program on your phone communicates with the site (say, Twitter) and makes that post for you. An API can be thought of as a series of commands, or even a kind of language, that allows programmatic access to the other computer: perhaps a way of specifying how to download the results of an image search, or to retrieve every record in an archaeological database that fits a particular pattern, in a format that you can then analyze.

One excellent archaeological database that uses an API to open its records is Open Context. Open Context publishes archaeological data after a rigorous editorial process. Its API is designed to allow faceted search. That is to say, it provides a way to summarize data so that we can understand aggregate patterns in the data. It also provides a path for finding individual records. Like most APIs, it provides the data in JSON format to allow further computational work. JSON is a human-readable text format that organizes data in attribute-value pairs, or lists of such pairs. (Contrast that with the CSV or Excel files you are probably used to, where each row is an entity and each column is an attribute of that entity.)
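To make the contrast with a table concrete, here is a small Python sketch. The record is invented for illustration (it is not real Open Context output, and the field names are assumptions). It shows how attribute-value pairs can nest inside one another, and how the same information has to be flattened before it fits a single CSV row.

```python
import csv
import io
import json

# A made-up record, NOT real Open Context output; field names are hypothetical.
json_text = """
{
  "label": "Bone 101",
  "project": "Example Excavation",
  "measurements": [
    {"type": "length", "value": 42.5, "unit": "mm"},
    {"type": "width", "value": 17.0, "unit": "mm"}
  ]
}
"""

record = json.loads(json_text)  # parse the text into Python dicts and lists

# Attribute-value pairs are reached by name; items in a list by position.
first_measurement = record["measurements"][0]

# Flatten to one row per record, as in a CSV file: each nested measurement
# has to become a column of its own.
row = {"label": record["label"], "project": record["project"]}
for m in record["measurements"]:
    row[f"{m['type']}_{m['unit']}"] = m["value"]

# Write the row to an in-memory buffer so nothing is saved to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

The JSON version can hold any number of measurements without changing shape, whereas the CSV version needs a new column for each one.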

Many websites that have data of interest to archaeologists do not have an API. For many use cases, it is sufficient to expose data as simple downloads of tables. An API is valuable, however, because it can be used as part of a workflow, passing data to some other application or website as it comes in. An API can be very useful in cases where multiple people are updating or contributing data. If the amount of data in the database is very large, an API might be the only practical way of opening the data to the world.

In the Jupyter notebooks, we show you some code for retrieving data from a few different APIs. The first is the ‘Chronicling America’ website, a repository of historical newspapers from across the United States kept by the Library of Congress. We developed our code for accessing the API by repurposing code that Tim Sherratt used for querying the Trove API from the National Library of Australia (which incidentally underlines again the value of sharing code!). Our version is in this repo. If you study that code, you can see how we built the query in Python.

We:

  • defined the location of the API’s ‘endpoint’, that is to say, the URL where we can pass commands to the API;
  • identified one of the parameters that the API uses, proxtext, and assigned our search value to it;
  • specified what format to return the results in (e.g., JSON);
  • and then we gave the Python module ‘requests’ the complete URL, which specifies not just the endpoint but also all of the parameters for our search. The ‘requests’ module handles the actual getting of information for us, so we don’t need to program that from scratch.

The remainder of the code gets the data and puts it into a variable that we can either examine or save to file.
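The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's own code: it swaps in the standard library's urllib so that it has no dependencies, the endpoint URL is an assumption that should be checked against the Library of Congress documentation, and the function names are hypothetical. Only proxtext and the JSON format setting come from the description above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Chronicling America page search; verify it
# against the Library of Congress documentation before relying on it.
API_SEARCH_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_query_url(search_text, endpoint=API_SEARCH_URL):
    """Return the complete URL: the endpoint plus the encoded parameters."""
    params = {
        "proxtext": search_text,  # the search parameter named in the text
        "format": "json",         # ask for JSON rather than a web page
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def fetch_results(url, timeout=30):
    """Request the URL and parse the JSON response into Python objects."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def save_results(data, path):
    """Write the parsed results to a JSON file for later study."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


url = build_query_url("archaeology")
print(url)

# These two calls need a network connection, so they are left commented out:
# data = fetch_results(url)
# save_results(data, "results.json")
```

Printing the URL before fetching anything is a useful habit: you can paste it into a browser to check that the query does what you expect.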

APIs can appear intimidating at first. This is partly because many developers use automatic tools to generate the documentation for their APIs in the first place! The result is a dense soup of terms and jargon that is largely impenetrable to the rest of us. One of the nice things about the Open Context API and its developers is that they also provide examples of how to use the API for some common tasks. An example of the kind of scholarship that is enabled when we have large-scale data repositories available for programmatic querying is Anderson et al. (2017).

2.6.1 Exercises

Launch the Jupyter binder.

  1. Open the ‘Chronicling America API’ notebook. Run through its various steps so that you end up with a JSON file of results. Imagine that you are writing a paper on the public reception of archaeology in the 19th century in the United States. Alter the notebook so that you can find primary source material for your study. Going further: find another API for historical newspapers somewhere else in the world. Duplicate the notebook, and alter it to search this other API so that you can have material for a cross-cultural comparison.
  2. Open the ‘Open Context API’ notebook. Notice how similar it is to the first notebook! Run through the steps so that you can see it work. Study the Open Context API documentation. Modify the search to return materials from a particular project or site.
  3. The final notebook, ‘Open Context Measurements’, is a much more complicated series of calls to the Open Context API (courtesy of Eric Kansa). In this notebook, we are searching for zoological data held in Open Context, using standardised vocabularies from that field that describe faunal remains. Examine the code carefully: do you see a series of nested ‘if’ statements? Remember that data is often described using JSON attribute:value pairs. These can be nested within one another, like Russian dolls. This series of ‘if’ statements works down through these nested levels, looking for the faunal information. Open Context uses an ontology, or formal description of how the data are categorized (which you can see here), that enables interoperability with various Linked Open Data schemes. Run each section of the code. Do you see the section that defines how to make a plot? This code is called later in the notebook, enabling us to plot the counts of the different kinds of faunal data. Try plotting different categories.
  4. The notebooks above were written in Python. We can also interact with APIs using the R statistical programming language. The Portable Antiquities Scheme database also has an API. Launch this binder and open the ‘Retrieving Data from the Portable Antiquities Scheme Database’ notebook (courtesy of Daniel Pett). This notebook is in two parts. The first frames a query and then writes the result to a CSV file for you. Work out how to make the query search for medieval materials, and how to write a CSV file that keeps more of the data fields.
  5. The second part of the notebook that interacts with the Portable Antiquities Scheme database uses that CSV file to determine where the images for each item are located on the Scheme’s servers, and to download them. You might find this useful for building an image classifier as described in section 4.2.
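As a warm-up for exercise 3, the sketch below shows the general pattern of nested ‘if’ statements working down through nested JSON, followed by a simple count of the kind one might plot. The records and their keys are invented for illustration; they only mimic the Russian-doll shape and are not Open Context's actual schema or the notebook's code.

```python
from collections import Counter


def make_record(label, taxon=None):
    """Build a made-up record; the keys are hypothetical, not a real schema."""
    record = {"label": label}
    if taxon is not None:
        record["observations"] = [
            {"source": "example-table", "properties": {"taxon": taxon}}
        ]
    return record


def extract_taxa(item):
    """Collect taxon names, checking each nested level before descending."""
    taxa = []
    if "observations" in item:                 # outermost doll
        for obs in item["observations"]:
            if "properties" in obs:            # next doll in
                props = obs["properties"]
                if "taxon" in props:           # innermost doll
                    taxa.append(props["taxon"])
    return taxa


records = [
    make_record("Specimen 7", "Bos taurus"),
    make_record("Specimen 8"),                 # no faunal information at all
    make_record("Specimen 9", "Ovis aries"),
    make_record("Specimen 10", "Bos taurus"),
]

# Tally how often each taxon appears across all records.
counts = Counter(taxon for r in records for taxon in extract_taxa(r))
print(counts)
```

Checking for each key before using it means that a record with missing levels is skipped quietly rather than stopping the whole run with an error.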

Going further: the Programming Historian has a lesson on creating a web API. Follow that lesson and build a web API that serves some archaeological data that you’ve created or have access to. One idea might be to extend the Digital Atlas of Egyptian Archaeology, a gazetteer created by Anthropology undergraduates at Michigan State University. The source data may be found here.

References

Anderson, David G., Thaddeus G. Bissett, Stephen J. Yerka, Joshua J. Wells, Eric C. Kansa, Sarah W. Kansa, Kelsey Noack Myers, R. Carl DeMuth, and Devin A. White. 2017. “Sea-Level Rise and Archaeological Site Destruction: An Example from the Southeastern United States Using DINAA (Digital Index of North American Archaeology).” PLOS ONE 12 (11): e0188142. https://doi.org/10.1371/journal.pone.0188142.