2.3 Cleaning Data

We often talk about data as being “clean”, “messy,” “raw,” “processed,” or some combination of all of those adjectives. Clean data is easy to understand, manipulate, and distill. Messy data can take a lot of work to refine into something more useful. “Raw” data is a little more complicated. We make consequential decisions at every step of the way about what kinds of information we’re going to collect, what units of measurement we’re using, which terms we choose to describe our observations and they each transform the data. With those facts in mind, data can’t ever truly be “raw”(Gitelman 2013). For a field archaeology example, we make a choice to measure and draw all artefacts and inclusions in a soil layer, or only some (leaving out exhaustive detail about categories like “rubble”). We make a choice to record dimensions of artefacts and stratigraphy visually with a ruler to the centimeter, or to the half centimeter, or with precision instrumentation to fragments of millimeters. We might excavate natural soil layers or by arbitrary levels. Each of these decisions influences the resulting data.

When we’re working with someone else’s pre-existing, or “legacy” data, it’s important to think about what kinds of choices and assumptions may have been made by the original investigator. What were the original project’s goals? What didn’t the original creator attempt to capture?

The way you choose to clean your data will be informed by the structure you want the dataset to take. Have a look at the following section on arranging and storing data before you settle on a methodology for data cleaning. And note that a key element here, as in the prior sections on useful data, preservation, and data collection, is planning.

2.3.1 Tools

You can clean data using regular expressions, or in Microsoft Excel using built in functionality. Or you could use a programming language like Python or R to text operations. But one of the most accessible and powerful, free tools for cleaning up messy data is OpenRefine.

Before you go any further, take a few moments to watch the OpenRefine videos, download the program, and learn how it works. You won’t be sorry, but you may shed a tear for time wasted manually cleaning up spreadsheet data in the past.

2.3.2 Cleaning Tasks

Gathering from many datasets
When working with legacy data, or even on your own large projects, you’ll need to begin by gathering many datasets into a harmonious collection. Don’t transform “original” data. Any cleanup should happen on derivative copies. Make copies of everything at the very beginning and store them somewhere safe.

Building on ideas discussed in the previous section on Designing Data Collection, documentation and transparency are key concepts. Where did you get your datasets? Write that down. Did you do anything (like transcription) to make them computer-readable? Write that down too, describing your methods.

Are you bringing data from multiple places together? Will they fit nicely, or will they need to be massaged a bit in order to be unified?

Examine your dataset(s) closely. Does each row describe one thing? What is that thing? Sometimes articulating the structure of data is more difficult than it seems. For example, take this dataset describing National Register of Historic Places listings in Virginia.

Name                                                      Listing_Date ID      
<chr>                                                     <chr>        <chr>   
1 Fifth Street Historic District                            2/8/2012     118-5318
2 Armstrong School                                          8/22/2012    118-5320
3 Pierce Street Historic District                           8/25/2014    118-5238
4 Doyle Florist Inc./ H. R. Schenkel Inc. Greenhouse Range  11/19/2014   118-5294
5 Lynchburg Hosiery Mill #1                                 2/21/2017    118-0126

On first glance, each row in the dataset might appear to be describing a building, historic district, or archaeological site. But on closer examination, it becomes clear that each row is describing a document which then itself describes a historic property. Whether you created a dataset or you’re working with inherited data, it is worthwhile to ensure that your data are describing what you think they are.

Define your metadata. Create a document that explains each column heading. Avoid calculated values inside your datasets themselves; leave that bit for the analysis.

Normalization refers to each fact being expressed in one place. The objective is to divide up your information into as many standard subsets as is practical. However, atomic specificity and perfection are impossible and not going to help anyone. Getting to granular may make a huge, unwieldy dataset. Ultimately, analysis will likely require recombining data together again, but that task will be straightforward if your data is normal. Whether you’re working in a relational database or performing analyses on derived tables, appropriate normalization may vary. But considering normalization of data from the start can keep things clean. A spreadsheet that mixes variables between rows and columns may be useful at a glance for visual interpretation, but the same data may not work very easily for analysis. A collection of over-normalized tables might be quite challenging to simply read and interpret (and have a computing performance cost later on if you’re working with large amounts of information).

Generally, each row should represent an observation and each column should contain a variable.¹ Watch out for many kinds of information in one column or field. Splitting these data into multiple columns will make things much easier to work with.

Following are several steps to prepare data for analysis and reuse, using examples from several different kinds of messy datasets.

Characterize the messiness
Before doing any real cleaning or reshaping of your data, take some time to explore. Are there leading or spaces in any of your values? Are there typos or nonstandard formatting? Do dates and numbers look the way they are supposed to? Don’t make any changes immediately, but note what you’d like to do. After a bit of exploration, the best order of operations will become more clear. Handling spaces and formatting issues at the outset is usually good advise, however.

NAME                                BIRTH (date, location)     DEATH       SPOUSE(S)                                                      
<chr>                               <chr>                        <chr>       <chr>                                                            
1 Bacchus, Ellen P.                   abt 1863 (Richmond)          9 May 1916  n/a                                                              
2 Ballard (Reese), Bessie E.          Abt 1879 (Richmond)          29 Dec 1928 John L. Ballard Sr. (√) (m. 18 Sep 1898, in Manhattan)           
3 Ballard, Dora Maude [went by Maude] 16 Mar 1898 (North Carolina) 31 Oct 1973 John L. Ballard Sr. (√)                                          
4 Ballard, John Lee, Sr.              15 Jun 1877                  9 Sep 1949  Dora M. — (√); Bessie E. Reese (√) (m. 18 Sep 1898, in Manhattan)

In the example above, notice that we’ve got multiple types of data stored in individual columns. In the BIRTH column we have information about both the date, and birth place within parenthesis, but we’ve also got “abt” indicating an approximate date. Some dates include the day, month, and year, but others only include a year. The SPOUSE(S) column includes not only one or more names, but a “√” symbol to indicate something, and potentially a marriage date and location. Each of those elements of information deserve their own column or row. Even the name can be separated into First, Middle, Last, Maiden, Other, Honorific, Suffix, etc. The lengths to which you need to go depend on the end goals of your project.

cf_ppl_surname cf_ppl_first_name cf_ppl_middle_name cf_ppl_other_name cf_ppl_honorif cf_ppl_birth_month cf_ppl_birth_year cf_ppl_birth_day
  <chr>          <chr>             <chr>              <chr>             <chr>          <chr>                          <int>            <int>
1 Ballard        Dora              Maude              NA                NA             NA                              1898               16
2 Ballard        Bessie            E.                 Reese             NA             NA                              1879               NA
3 Bacchus        Ellen             P.                 NA                NA             NA                              1850               NA
4 Ballard        John              Lee                NA                NA             June                            1877               15

Consider multiple related tables.
The data below come from a project to create a finding aid for a large collection of archaeological field notes. Notice that the “Associated Site Numbers” column contains a mix of individual site numbers, ranges, and multiple values separated by commas. This was a practical choice when data was collected, since multiple people were entering data via a simple KoboToolbox form on a mobile device. But to move these data into a relational database or perform any analysis, we’ll need to split up those values into a table about the recordsets and one about the sites.

Box ID Recordset ID Recordset Name                        Year Associated Site Numbers
<chr>    <chr>          <chr>                                  <int> <chr>                                       
1 FN097    RS 57214       Dulles International Airport Live Fir…  1991 44LD0500                                    
2 FN097    RS 90948       Dulles International Airport Satellit…  1992 44LD0423                                    
3 FN097    RS 22491       Fall HIll Avenue Bridge Replacement     2010 44SP0637                                    
4 FN097    RS 40525       Bluestone Ph I                          2007 44TZ0157-44TZ0168                           
5 FN097    RS 25592       Bluestone Ph II                         2008 44TZ0160-44TZ0162, 44TZ0165                 
6 FN097    RS 1583       Route 5 Improvements                    2007 44HE0057, 44HE1079, 44HE1080, 44HE1081

Again, there are lots of methods to parse those data, but in this case we used OpenRefine. After processing, we end up with two tables, field_notes_info and field_notes_sites:

Box ID Recordset ID Recordset Name                                               Year
<chr>    <chr>          <chr>                                                  <int>
1 FN097    RS 57214       Dulles International Airport Live Fire and Police/Fire  1991
2 FN097    RS 90948       Dulles International Airport Satellite Parking          1992
3 FN097    RS 22491       Fall HIll Avenue Bridge Replacement                     2010
4 FN097    RS 40525       Bluestone Ph I                                          2007
5 FN097    RS 25592       Bluestone Ph II                                         2008
6 FN097    RS 1583       Route 5 Improvements                                    2007

Table 2.1.x, field_notes_info

`Recordset ID` Split_Site_Nums
<chr>          <chr>          
1 RS 57214       44LD0500       
2 RS 90948       44LD0423       
3 RS 22491       44SP0637       
4 RS 40525       44TZ0157       
5 RS 40525       44TZ0158       
6 RS 40525       44TZ0159       
7 RS 40525       44TZ0160       
8 RS 40525       44TZ0161       
9 RS 40525       44TZ0162       
10 RS 40525       44TZ0163       
11 RS 40525       44TZ0164       
12 RS 40525       44TZ0165       
13 RS 40525       44TZ0166       
14 RS 40525       44TZ0167       
15 RS 40525       44TZ0168       
16 RS 25592       44TZ0160       
17 RS 25592       44TZ0161       
18 RS 25592       44TZ0162       
19 RS 25592       44TZ0165       
20 RS 1583       44HE0057       
21 RS 1583       44HE1079       
22 RS 1583       44HE1080       
23 RS 1583       44HE1081

Table 2.1.x, field_notes_sites

Bring them together

We can now join those two tables on the “Recordset ID” column appearing in both tables. You can find more about data structure and joining in the Databases section

2.3.3 Takeaways

Don’t clean or transform your original data. Always work on copies and keep track of your methods.
Take time to explore your data to develop a cleaning strategy. Ask a lot of questions of your data.
Don’t do cleanup manually if you don’t have to. Learning tools/languages like OpenRefine, Python, and R will pay off in time saved and mistakes avoided.

2.3.4 Further Reading

Cleaning Data with OpenRefine, from The Programming Historian

Formatting Problems, from Data Carpentry

Broman, Karl W., and Kara H. Woo 2018 Data Organization in Spreadsheets. The American Statistician 72(1):2–10. DOI:[10.1080/00031305.2017.1375989](https://doi.org/10.1080/00031305.2017.1375989).

Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal 2017 Good enough practices in scientific computing. PLOS Computational Biology 13(6):e1005510. DOI:[10.1371/journal.pcbi.1005510](https://doi.org/10.1371/journal.pcbi.1005510).

1: These are generally reflective of Hadley Wickham’s principles of “tidy data” for use in R. See (Hadley Wickham 2014)

References

Gitelman, Lisa, ed. 2013. Raw Data Is an Oxymoron. Cambridge, Massachusetts ; London, England: The MIT Press.

Hadley Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59 (September). doi:10.18637/jss.v059.i10.