4.5 Computer Vision and Archaeology
In recent years it has become practical to use neural networks to identify objects, people, and places in photographs. The use of neural networks on imagery has advanced rapidly since 2012, when ‘convolutional’ neural networks came to prominence (Deshpande (2016) provides an accessible guide to this literature). Neural networks in general, however, have appeared sporadically in the archaeological literature since the 1990s; Baxter (2014) provides a useful overview. Interesting recent uses include Benhabiles and Tabia (2016), who use the approach to enhance pottery databases, and Wang et al. (2017) on the stylistic analysis of statuary as an aid to restoration. In this section, we provide a gentle introduction to how convolutional neural networks work, and then two Jupyter binders that may be repurposed or expanded with more data to create actual working classifiers.
4.5.1 Convolutional Neural Networks
Neural networks are built on a biological metaphor: the architecture of the eye, the optic nerve, and the brain. When the retina of the eye is exposed to light, different structures within the retina react to different aspects of the image being projected against it. These ‘fire’ with greater or lesser strength, depending on what is being viewed. Subsequent neurons fire if the signal(s) they receive are strong enough. These differential cascades of firing neurons ‘light up’ the brain in particular, repeatable ways when exposed to different images. Computational neural networks aim to achieve a similar effect. In the same way that a child eventually learns to recognize this pattern of shapes and colour as an ‘apple’ and that pattern as an ‘orange’, we can train the computer to ‘know’ that a particular pattern of activations should be labelled ‘apple’.
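The firing behaviour described above can be sketched in a few lines of code. This is a toy illustration only (the inputs, weights, and threshold are invented for the example): a single artificial ‘neuron’ sums its weighted inputs and fires only when the combined signal is strong enough.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial 'neuron': a weighted sum of its inputs,
    passed through an activation that fires only when the combined
    signal is strong enough (here, a ReLU)."""
    signal = np.dot(inputs, weights) + bias
    return max(0.0, signal)

# A strong combined signal fires the neuron...
strong = neuron(np.array([1.0, 0.5]), np.array([0.8, 0.4]), -0.5)
# ...a weak one does not.
weak = neuron(np.array([0.1, 0.1]), np.array([0.8, 0.4]), -0.5)
print(strong, weak)
```

Real networks chain thousands of such units together, which is where the ‘cascades’ of firing come from.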
A ‘convolutional’ neural network begins by ‘looking’ at an image in terms of its most basic features: curves or areas of contiguous colour. As the information percolates through the network, the layers become sensitive to more and more abstract aspects of the image, until the image is described along some 2048 different dimensions of information. English does not have words for what, precisely, some (most) of these dimensions are responding to, although if you have seen any of the ‘Deep Dream’ artworks [SG insert figure here] you have seen a visualization of some of those dimensions of data. The final layer of neurons predicts, from the 2048 dimensions, what the image is supposed to be. When we are training such a network, we know at the outset what the image depicts; if, at the end, the network does not correctly predict ‘apple’, this error causes the network to shift its weighting of connections between neurons backwards through the network (‘backpropagation’) to increase the chances of a correct response. This process of calculation, guess, evaluation, and adjustment goes on until no more improvement seems to occur.
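The cycle of guess, evaluation, and adjustment can be illustrated with a deliberately tiny example: a ‘network’ with a single connection weight, nudged by gradient descent until its prediction matches the known answer. This is a sketch of the principle only, not of how a real CNN is trained.

```python
# Toy illustration of the guess-evaluate-adjust cycle: one input, one
# connection weight, and one known correct answer ('apple' = 1.0).
x, target = 2.0, 1.0
w = 0.1                           # the network's initial, wrong, weighting

for step in range(100):
    prediction = w * x            # calculate a guess
    error = prediction - target   # evaluate it against the known label
    w -= 0.1 * error * x          # adjust the weight to reduce the error

print(round(w * x, 3))            # the prediction converges on the target
```

A real network repeats this adjustment across millions of weights at once, propagating each image's error backwards layer by layer.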
Neural networks like this can have very complicated architectures to increase their speed, their accuracy, or some other feature of interest to the researcher. In general, such networks are composed of four kinds of layers. The first is the convolutional layer. This is a kind of filter that responds to different aspects of an image; it moves across the image from left to right, top to bottom (this moving filter is the ‘convolution’ that gives the network its name). The next layer, the activation layer, reacts to the information provided by the filter. The neural network is dealing with an astounding amount of information at this point, and so the third layer, the pooling layer, performs a kind of mathematical reduction or compression to strip out the noise and leave only the most important features in the data. Any particular neural network might have several such ‘sandwiches’ of neurons arranged in particular ways. The last layer is the fully connected layer, which holds the information concerning the labels. These neurons run a kind of ‘vote’ on whether or not the 2048-dimension representation of the image ‘belongs’ to their particular category. This vote is expressed as a percentage, and is typically what we see as the output of a CNN applied to the problem of image identification.
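To make the ‘sandwich’ of layers concrete, here is a minimal sketch of a single convolutional filter sliding across an image, followed by an activation and a pooling step. The image and filter are invented toy data; a real network learns its filters, but here we simply hand it one that responds to vertical edges.

```python
import numpy as np

def convolve(image, kernel):
    """Slide a filter across the image left to right, top to bottom,
    recording how strongly each patch matches the filter."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Compress the feature map, keeping only the strongest response
    in each size-by-size block."""
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    trimmed = feature_map[:h*size, :w*size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# A tiny 4x4 'image' containing a vertical edge, and a hand-made
# filter that responds strongly to exactly that feature.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1, 1],
                        [-1, 1]], dtype=float)

convolved = convolve(image, edge_filter)       # convolutional layer
activated = np.maximum(convolved, 0)           # activation layer (ReLU)
pooled = max_pool(activated)                   # pooling layer
print(pooled)
```

The strong response survives pooling while the flat regions fall away; in a full network, a fully connected layer would then ‘vote’ on such pooled features.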
Training a neural network to recognize categories of objects is massively computationally intensive. Google’s Inception3 model - that is, the final trained state of Google’s neural network - took the resources of a massive company, and millions of images, to put together. However, Google has released the model to the public: anyone can now take that finished pattern of weights and neurons and use it in their own applications. But Google did not train the model on archaeological materials, so it is reasonable to wonder whether such a model has any value for us.
It turns out that it does, thanks to an interesting by-product of the way the model was trained and created. ‘Transfer learning’ allows us to take the high-dimensional ways-of-seeing that the Inception3 model has learned and apply them to a new, retrained final voting layer. We can give the computer mere thousands of images and tell it to learn those categories: in this way we can train an image classifier on different kinds of pottery relatively quickly. Google has also released a version of Inception3 called MobileNet that is much smaller (only 1001 dimensions, or ways-of-seeing) and can be used in conjunction with a smartphone. We can apply transfer learning to the smaller model as well, and so create a smartphone application trained to recognize Roman pottery fabrics, for instance.
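The logic of transfer learning can be sketched without Inception3 itself. In this toy example everything is invented: a randomly initialized ‘pretrained’ feature extractor stands in for the frozen base of the network, and only a new final voting layer (a logistic regression) is trained on a small number of examples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend 'pretrained' base: frozen weights mapping raw inputs to a
# feature vector (standing in for Inception3's 2048 dimensions).
frozen_weights = rng.normal(size=(4, 8))

def extract_features(x):
    """The frozen base of the network: these weights are never updated."""
    return np.maximum(x @ frozen_weights, 0)

# Two toy classes of 'images' (4-pixel inputs), a few dozen examples each.
class_a = rng.normal(loc=2.0, size=(20, 4))
class_b = rng.normal(loc=-2.0, size=(20, 4))
X = extract_features(np.vstack([class_a, class_b]))
y = np.array([1] * 20 + [0] * 20)

# Transfer learning: train ONLY the new final 'voting' layer on top
# of the frozen features.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
    grad = p - y                          # error signal
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

accuracy = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
print(accuracy)
```

Because only the small final layer is trained, the process takes seconds rather than weeks - which is exactly why retraining Inception3 or MobileNet on a few thousand sherd photographs is feasible.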
The focus on identifying objects in photographs obscures another interesting aspect of the model: there are useful things we can do if we set the labelling aside. The second-to-last layer of the neural network is a numerical representation of the feature-map of the image. We do not need to know what the image depicts in order to make use of that information. We can instead feed these representations of the images into various kinds of k-means, nearest-neighbour, t-SNE, or other statistical tools to look for patterns and structure in the data. If our images are tourist photos of archaeological sites uploaded to Flickr, we might use such tools to understand how tourists are framing their photos (and so, their archaeological consciousness). Huffer and Graham (2018) use this approach to identify visual tropes in photographs connected to the communities of people who buy, sell, and collect photos of human remains on Instagram. Historians are using the approach to understand patterns in 19th-century photographs; others are looking at the evolution of advertising in print media.
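This label-free approach can be sketched with invented feature vectors standing in for the second-to-last layer of the network (in a real workflow these would be extracted from the trained model). A minimal k-means then groups the images by visual similarity alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-ins for penultimate-layer feature vectors: two visual
# 'tropes' of photograph, which we pretend not to have labels for.
trope1 = rng.normal(loc=0.0, size=(30, 16))
trope2 = rng.normal(loc=5.0, size=(30, 16))
features = np.vstack([trope1, trope2])

def kmeans(data, k=2, steps=20):
    """A minimal k-means: repeatedly assign each vector to its nearest
    centroid, then move each centroid to the mean of its members."""
    # seed the two centroids with the first and last vectors, for simplicity
    centroids = np.array([data[0], data[-1]])
    for _ in range(steps):
        distances = np.linalg.norm(data[:, None] - centroids[None], axis=2)
        labels = distances.argmin(axis=1)
        centroids = np.array([data[labels == c].mean(axis=0) for c in range(k)])
    return labels

labels = kmeans(features)
# Without ever being told the labels, the two tropes separate cleanly.
print(set(labels[:30].tolist()), set(labels[30:].tolist()))
```

In practice one would use a library implementation (e.g. scikit-learn's KMeans or TSNE) on the real 2048-dimensional vectors, but the principle is the same: structure emerges without any labelling at all.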
These technologies are also rapidly being put to uses that we regard as deeply unethical. Amazon, for instance, has a facial recognition service called ‘Rekognition’ that it has been trying to sell to police services (Winfield 2018), and which can be considered a kind of digital ‘carding’ or profiling by race. In China, massive automated computer vision is deployed to keep control over minority populations (“China Has Turned Xinjiang into a Police State Like No Other” 2018). Various software companies promise to identify ‘ethnicity’ or ‘race’ from store security camera footage in order to increase sales (a quick search of the internet will find them for you). In Graham and Huffer’s Bone Trade project, one mooted outcome is to use computer vision to determine the descendant communities to which the human bones being traded online belong. Given that many of these bones were probably removed from graves in the first place to ‘prove’ deplorable theories of race (see Redman (2016) on the origins), such a use of computer vision runs the risk of re-creating the sins of the past.
Before deploying computer vision - or indeed any technology - in the service of archaeology, one should always ask how the technology could be abused: who could this hurt?
Build an image classifier. The code for this exercise is in our repo; launch the binder and work carefully through the steps. Pay attention to the various ‘flags’ that you can set for the training script. Google them; what do they do? Can you improve the speed of the transfer learning? The accuracy? Use what you’ve learned in section 2.5 to retrieve more data upon which you might build a classifier (hint: there’s a script in the repo that might help you with that).
Classify similar images. The code for this exercise is in Shawn Graham’s repo; launch the binder and work through the steps. Add more image data so that the results are clearer.
If you are feeling adventurous, explore Matt Harris’ signboardr, an R package that uses computer vision to identify and extract text from archaeological photos containing a sign board, and then writes that text into the photos’ metadata. Harris’ code is a good example of the power of R and computer vision for automating what would otherwise be a time-consuming task.
Deshpande, Adit. 2016. “The 9 Deep Learning Papers You Need to Know About.” https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html.
Baxter, Mike. 2014. “Neural Networks in Archaeology.” https://www.academia.edu/8434624/Neural_networks_in_archaeology.
Benhabiles, Halim, and Hedi Tabia. 2016. “Convolutional Neural Network for Pottery Retrieval.” Journal of Electronic Imaging 26. https://doi.org/10.1117/1.JEI.26.1.011005.
Wang, Haiyan, Zhongshi He, Yongwen Huang, Dingding Chen, and Zexun Zhou. 2017. “Bodhisattva Head Images Modeling Style Recognition of Dazu Rock Carvings Based on Deep Convolutional Network.” Journal of Cultural Heritage 27: 60–71. https://doi.org/10.1016/j.culher.2017.03.006.
Huffer, Damien, and Shawn Graham. 2018. “Fleshing Out the Bones: Studying the Human Remains Trade with Tensorflow and Inception.” Journal of Computer Applications in Archaeology 1 (1): 55–63.
Winfield, Nick. 2018. “Amazon Pushes Facial Recognition to Police. Critics See Surveillance Risk.” https://www.nytimes.com/2018/05/22/technology/amazon-facial-recognition.html.
“China Has Turned Xinjiang into a Police State Like No Other.” 2018. https://www.economist.com/briefing/2018/05/31/china-has-turned-xinjiang-into-a-police-state-like-no-other.
Redman, Samuel J. 2016. Bone Rooms: From Scientific Racism to Human Prehistory in Museums. Harvard University Press.