3.2 Statistical Computing with R and Python Notebooks; Reproducible code
When one of us (Graham) was a graduate student, he was tasked with teaching undergraduates how to do a chi-squared test of archaeological data. This was in the late 1990s. He duly opened up Excel, and began to craft a template there. While it was possible to use Excel’s automatic chi-square stastical test on a table of data, simply showing the students how to use the function seemed undesirable. If he used the function, would the students really know what the test was doing? In any event, after several hours of working with the function, he determined that he couldn’t actually figure out what exactly the function was doing. The dataset was from Mike Fletcher and Gary R. Lock’s very clear Digging Numbers: Elementary Statistics for Archaeologists (spearheads and the presence/absence of a loop feature; Fletcher and Lock, 1991: 115-125). The answer Excel was providing was not the answer that Fletcher and Lock demonstrated. Was there some setting somewhere in Excel that was affecting the function? Time was ticking. Graham elected instead to build the spreadsheet so that each step - each calculation - in the statistic was its own cell. He had the students perform these steps on their own spreadsheet. When even that went awry, he saved his sheet onto a floppy disc, and loaded it one computer at a time for the students. This time, working from the original spreadshet, the results agreed with the text. The association between material and period was stronger than the association of material and presence of a loop.
Excel is a black box. When we use it, we have to take on faith that its statistics do what they say they are doing. Most of the time, this is not an issue - but Excel is a very complicated piece of software. Biologists, for instance, have known for years that Excel’s automatic date-time conversions (that is, Excel detects the kind of data in a cell and changes its formatting or the actual data itself) could affect gene names (Zeeberg et al 2004) link.
(Marwick 2017) discusses additional issues that occur when using Excel to manipulate our archaeological data. Archaeological data can seldom be used in the format in which they are first recorded. Much effort has to be expended to clean the data up into the neat and tidy tables that a spreadsheet requires. What decisions were made in the tidying process? What bits and pieces were left out? What bits and pieces were transformed as they were entered into the spreadsheet? If someone else were to try to confirm the results of the study - to reproduce it - and there is no record of the manipulations of the data (the specific transformations), reproducibility is unlikely. When researchers discuss a new method, without the data being available and the kinds of transformations and analyses not clearly and fully specified, we have created a barrier to moving archaeology forward.
As an exercise right now to understand just how difficult a problem this is, make a table in Excel. Visualize the following data: 12 bronze spear heads, 18 iron spear heads; 6 of the bronze spear heads have a loop, 10 of the iron spear heads have a loop. Write down all of the steps you took to visualize this information. Save your work; close the spreadsheet. Now hand your list over to a peer. Have them follow the steps exactly, but do not clarify or otherwise explain the list.
Compare what they’ve created, with what you created. How close of a match is there? Did they reproduce your visualization? What elements or steps did you forget to include in your list?
Now you might begin to see some of the issues involved in relying on a point-and-click graphical user interface and a black-box like Excel with your archaeological data.
Marwick (2017) writes with regard to this problem,
This is a substantial barrier to progress in archaeology, both in establishing the veracity of previous claims and promoting the growth of new interpretations. Furthermore, if we are to contribute to contemporary conversations outside of archaeology (as we are supposedly well-positioned to do, cf. Kintigh et al. (2014)), we need to become more efficient, interoperative, and flexible in our research. We have to be able to invite researchers from other fields into our research pipelines to collaborate in answering interesting and broad questions about past societies.
Marwick lists four principles for reproducible archaeology. Note that ‘reproducible’ here means being able to re-do the analysis and obtain the same results; ‘replicable’ means using the same methods on a new data set collected and analyzed with the published methods of an earlier study. The principles that we should follow are:
- make the data and the methods that generated the results openly available
- script the stastical analysis (that is, no more point-and-click statistics). More on this below.
- use version control
- record the computational environment employed
3.2.1 Reproducible Methods
To enable reproducible methods, we need to ‘script’ our analyses. That means we write a sequence of commands in a language like R or Python so that other people can examine our code, and can run the code on our data (to reproduce our research) or on their own data (to reproduce our method). We also should include comments in our code explaining what each subsequent chunk is meant to achieve. This is the heart of ‘literate programming’, where the code is also annotated with the results. This is the philosophy behind the Jupyter Notebooks that we built to support this textbook. A Jupyter notebook is a step up from having your script annotated with comments. Each code section in a Jupyter notebook can be run (if you have Jupyter installed or are running a notebook within a binder instance) on its own, from top to bottom. Blocks of text that discuss the results or the code can be inserted before or after code-blocks. In this way you can walk your reader through your analyses. Such notebooks can be shared via Github and archived with the Zenodo service or the Open Science Foundation, both of which provide a ‘digital object identifier’ or DOI. A DOI enables your code and analyses to be cited, separate from your discussion (which you might explore in an article or book chapter). The Open Science Foundation also enables multiple collaborators to work on a project. Jupyter notebooks can be used with a variety of languages. ‘Jupyter’ itself is an amalgam of ‘julia - python - R’, which are the three standard languages, although more language ‘kernels’ can be added (see the Jupyter Project).
Good practice therefore is to set up a ‘research compendium’ for your project, with a sensible arrangement of folders that keep your source data separate from the data you actually analyze, and keeps your scripts in a single place. It might be set up like this:
| |-xrf-analysis-roman-bricks | |-raw-data | |-original-data.csv |-cleaned-data | |-copy-original-data.csv | |-cleaned-data.csv |-results | |-descriptive-stats.csv | |-pca-roman-bricks.png |-scripts | |-cleaning.R | |-descriptive-statistics.R | |-exploratory-visualizations.R |-article |-draft-for-jas |-public-discussion-of-research-for-blog.ipynb
Never, ever, do your analysis on the original data that you have collected. Always keep a copy of that separate from everything else, and do your cleaning and analyses on the copy. When writing analytical scripts, try to have one script doing one job. Document what works, and what fails. Note that last file in the ‘article’ folder above, that ends with .ipynb. That’s a jupyter notebook file; these can be deposited in a Github repository (for instance) and Github will understand how to display it. This is but one way you might keep an open notebook of your research.
At this point, you or your instructor might wish to explore some of the ways the R language can be used for archaeological research. Rather than re-invent the wheel, we will suggest that you explore t How to Do Archaeological Science Using R, Marwick (ed) as you try your hand at building a research compendium.
- Use our notebook containing the ‘archdata’ package for R - a collection of archaeological datasets, set up your own research compendium within that environment. Build it around one of the demo datasets. Try to incorporate some of the techniques from the How to Do Archaeological Science book. Download your compendium when you are finished.
- Explore our version of Marwick’s research compendium package, and note the ways it is different from what you produced. What would be the effect of consistently using Marwick’s package on archaeological research?
Marwick, Ben. 2017. “Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation.” Journal of Archaeological Method and Theory, 1–27. http://link.springer.com/article/10.1007/s10816-015-9272-9.