Example Use Case: Data Preparation
This tutorial is based on the paper on stringi that was published in the Journal of Statistical Software; see [2]. To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [3].
What follows is a short case study where we prepare a web-scraped data set for further processing.
Assume we wish to gather and analyse climate data for major cities around the world based on information downloaded from Wikipedia. For each location from a given list of settlements (e.g., fetched from one of the pages linked under https://en.wikipedia.org/wiki/Lists_of_cities), we would like to harvest the relevant temperature and precipitation data. Without loss of generality, let us focus on the city of Melbourne, VIC, Australia.
The parsing of the city’s Wikipedia page can be done by means of functions from the xml2 and rvest packages.
library("xml2")
library("rvest")
First, let us load and parse the HTML file downloaded on 2020-09-17 (see https://github.com/gagolews/stringi/tree/master/docs/_static/vignette):
f <- read_html("20200917_wikipedia_melbourne.html")
Second, we extract all table elements and gather them in a list of HTML nodes, all_tables. We then extract the underlying raw text data and store them in a character vector named text_tables.
all_tables <- html_nodes(f, "table")
text_tables <- sapply(all_tables, html_text)
str(text_tables, nchar.max=65, vec.len=5, strict.width="wrap") # preview
## chr [1:45] "MelbourneVictoriaFrom top, left to right: Flinde"| __truncated__
## "Mean max temp\n Mean min temp\n Annual rainfal"| __truncated__ "This
## section needs additional citations for veri"| __truncated__ "Climate data
## for Melbourne Regional Office (1991"| __truncated__ "Country of Birth
## (2016)[178]Birthplace[N 1]\nPop"| __truncated__ ...
Most Wikipedia pages related to particular cities include a table labelled as “Climate data”. We need to pinpoint it amongst all the other tables. For this, we will rely on stringi’s stri_detect_fixed() function. In the configuration below, the matching is case-insensitive and the search stops as soon as the first occurrence is found (max_count=1); wrapping the call in which() then gives the index of the relevant table.
library("stringi")
(idx <- which(stri_detect_fixed(text_tables, "climate data",
case_insensitive=TRUE, max_count=1)))
## [1] 4
Let’s use rvest’s html_table() to convert the above table to a data frame object.
(x <- as.data.frame(html_table(all_tables[[idx]], fill=TRUE)))
## Climate data for Melbourne Regional Office (1991–2015)
## 1 Month
## 2 Record high °C (°F)
## 3 Average high °C (°F)
## 4 Daily mean °C (°F)
## 5 Average low °C (°F)
## 6 Record low °C (°F)
## 7 Average rainfall mm (inches)
## 8 Average rainy days (≥ 1mm)
## 9 Average afternoon relative humidity (%)
## 10 Mean monthly sunshine hours
## 11 Source: Bureau of Meteorology.[85][86][87]
## Climate data for Melbourne Regional Office (1991–2015).1 ...
## 1 Jan ...
## 2 45.6(114.1) ...
## 3 27.0(80.6) ...
## 4 21.6(70.9) ...
## 5 16.1(61.0) ...
## 6 5.5(41.9) ...
## 7 44.2(1.74) ...
## 8 5.6 ...
## 9 47 ...
## 10 279 ...
## 11 Source: Bureau of Meteorology.[85][86][87] ...
## Climate data for Melbourne Regional Office (1991–2015).3
## 1 Year
## 2 46.4(115.5)
## 3 20.8(69.4)
## 4 16.2(61.2)
## 5 11.6(52.9)
## 6 −2.8(27.0)
## 7 600.9(23.66)
## 8 90.6
## 9 51
## 10 2,191
## 11 Source: Bureau of Meteorology.[85][86][87]
It is evident that this object requires some significant cleansing and transforming before it can be subjected to any statistical analyses. First, for the sake of convenience, let us convert it to a character matrix so that the processing of all the cells can be vectorised (a matrix in R is just a single “long” vector, whereas a data frame is a list of many atomic vectors).
x <- as.matrix(x)
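As an aside, this “long vector” view is easy to verify on a toy example (not part of the pipeline): a single index reaches into the matrix’s underlying columnwise storage, which is precisely what lets string operations be applied to all cells at once.

(m <- matrix(1:6, nrow=2))  # a toy matrix, filled columnwise
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m[5] == m[1, 3]  # single index vs (row, column) pair: the same cell
## [1] TRUE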
The as.numeric() function will have difficulty parsing the Unicode MINUS SIGN (U+2212, “−”), therefore let us call the transliterator first in order to replace it (and other potentially problematic characters) with its simpler ASCII equivalent:
x[, ] <- stri_trans_general(x, "Publishing-Any; Any-ASCII")
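For instance, applying the same transform to a standalone string featuring the MINUS SIGN should yield its plain-ASCII rendition (a quick sanity check of ours; compare the cleansed “-2.8” further below):

stri_trans_general("\u22122.8", "Publishing-Any; Any-ASCII")  # "−2.8" with U+2212
## [1] "-2.8"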
Note that it is the first row of the matrix that defines the column names. Moreover, the last row just gives the data source and hence may be removed.
dimnames(x) <- list(x[, 1], x[1, ]) # row, column names
x <- x[2:(nrow(x) - 1), 2:ncol(x)] # skip 1st/last row and 1st column
x[, c(1, ncol(x))] # example columns
## Jan Year
## Record high °C (°F) "45.6(114.1)" "46.4(115.5)"
## Average high °C (°F) "27.0(80.6)" "20.8(69.4)"
## Daily mean °C (°F) "21.6(70.9)" "16.2(61.2)"
## Average low °C (°F) "16.1(61.0)" "11.6(52.9)"
## Record low °C (°F) "5.5(41.9)" "-2.8(27.0)"
## Average rainfall mm (inches) "44.2(1.74)" "600.9(23.66)"
## Average rainy days (>= 1mm) "5.6" "90.6"
## Average afternoon relative humidity (%) "47" "51"
## Mean monthly sunshine hours "279" "2,191"
Commas that are used as thousands separators (commas surrounded by digits) should be dropped:
x[, ] <- stri_replace_all_regex(x, "(?<=\\d),(?=\\d)", "")
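The look-behind “(?<=\d)” and look-ahead “(?=\d)” assertions match positions, not characters, hence only the comma itself is deleted. On a toy string (our example):

stri_replace_all_regex("2,191 hours, total", "(?<=\\d),(?=\\d)", "")
## [1] "2191 hours, total"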
The numbers given in parentheses, which express the measurements in alternative units (°F, inches), are redundant, therefore these should be taken care of as well:
x[, ] <- stri_replace_all_regex(x,
"(\\d+(?:\\.\\d+)?)\\(\\d+(?:\\.\\d+)?\\)", "$1")
dimnames(x)[[1]] <- stri_replace_all_fixed(dimnames(x)[[1]],
c(" (°F)", " (inches)"), c("", ""), vectorise_all=FALSE)
Finally, as.numeric() can be used to reinterpret all the strings as numbers:
x <- structure(as.numeric(x), dim=dim(x), dimnames=dimnames(x))
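Note that structure() endows the new numeric vector with the original dim and dimnames attributes in one step. As an extra defensive measure (our addition, not part of the original pipeline), we may assert that no cell failed to parse:

stopifnot(!anyNA(x))  # all cells were successfully converted to numbers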
x[, c(1, 6, ncol(x))] # example columns
## Jan Jun Year
## Record high °C 45.6 22.4 46.4
## Average high °C 27.0 15.1 20.8
## Daily mean °C 21.6 11.7 16.2
## Average low °C 16.1 8.2 11.6
## Record low °C 5.5 -2.2 -2.8
## Average rainfall mm 44.2 49.5 600.9
## Average rainy days (>= 1mm) 5.6 8.6 90.6
## Average afternoon relative humidity (%) 47.0 61.0 51.0
## Mean monthly sunshine hours 279.0 108.0 2191.0
We now have a cleansed matrix at our disposal. We can, for instance, compute the monthly temperature ranges:
x["Record high °C", -ncol(x)] - x["Record low °C", -ncol(x)]
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 40.1 41.9 38.9 33.4 29.8 24.6 26.1 28.6 31.9 36.8 38.4 39.3
or the average daily precipitation:
sum(x["Average rainfall mm", -ncol(x)]) / 365.25
## [1] 1.6463
and so forth.
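As one more example (ours; output omitted), the month with the highest average rainfall could be pinpointed with:

names(which.max(x["Average rainfall mm", -ncol(x)]))  # wettest month, on average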
For the climate data on other cities, very similar operations will need to be performed; the whole process of scraping and cleansing data can be automated quite easily, e.g., by wrapping the above steps in a single function, as in the sketch below.
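Here is a minimal sketch of such a wrapper. The function name is ours, and it assumes (as was the case for Melbourne) that the page embeds exactly one table labelled “Climate data” whose layout matches the one processed above:

get_climate_data <- function(path) {
    # parse the HTML file and locate the "Climate data" table
    f <- read_html(path)
    all_tables <- html_nodes(f, "table")
    text_tables <- sapply(all_tables, html_text)
    idx <- which(stri_detect_fixed(text_tables, "climate data",
        case_insensitive=TRUE, max_count=1))
    x <- as.matrix(as.data.frame(html_table(all_tables[[idx]], fill=TRUE)))
    # cleanse the cells exactly as above
    x[, ] <- stri_trans_general(x, "Publishing-Any; Any-ASCII")
    dimnames(x) <- list(x[, 1], x[1, ])  # row, column names
    x <- x[2:(nrow(x) - 1), 2:ncol(x)]   # skip 1st/last row and 1st column
    x[, ] <- stri_replace_all_regex(x, "(?<=\\d),(?=\\d)", "")
    x[, ] <- stri_replace_all_regex(x,
        "(\\d+(?:\\.\\d+)?)\\(\\d+(?:\\.\\d+)?\\)", "$1")
    dimnames(x)[[1]] <- stri_replace_all_fixed(dimnames(x)[[1]],
        c(" (°F)", " (inches)"), c("", ""), vectorise_all=FALSE)
    structure(as.numeric(x), dim=dim(x), dimnames=dimnames(x))
}
x <- get_climate_data("20200917_wikipedia_melbourne.html")  # as before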