General Design Principles¶
This tutorial is based on the paper on stringi that will appear in the Journal of Statistical Software.
The API of the early releases of stringi was designed to be fairly compatible with that of version 0.6.2 of the stringr package [Wic10] (dated 2012) [1], with some fixes in the consistency of the handling of missing values and zero-length vectors, amongst others. However, instead of being merely thin wrappers around base R [2] functions, which we have identified as not necessarily portable across platforms and not really suitable for natural language processing tasks, all the functionality has been implemented from the ground up, with the use of ICU services wherever applicable. Since the initial release, an abundance of new features has been added, and the package can now be considered a comprehensive workhorse for text data processing.
Note that the stringi API is stable. Future releases are aiming for as much backward compatibility as possible so that other software projects can safely rely on it.
Naming¶
Function and argument names use a combination of lowercase letters and underscores (and no dots). To avoid namespace clashes, all function names feature the “stri_” prefix. Names are fairly self-explanatory, e.g., stri_locate_first_regex and stri_locate_all_fixed find, respectively, the first match to a regular expression and all occurrences of a substring as-is.
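For instance, a minimal sketch of what these two functions return (the locate functions yield matrices of start/end indices; the print layout shown below is indicative):
stri_locate_first_regex("spam, eggs, spam", "\\w+")
##      start end
## [1,]     1   4
stri_locate_all_fixed("spam, eggs, spam", "spam")
## [[1]]
##      start end
## [1,]     1   4
## [2,]    13  16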
Vectorisation¶
Individual character (or code point) strings can be entered using double quotes or apostrophes:
"spam" # or 'spam'
## [1] "spam"
However, as the R language does not feature any classical scalar types, strings are wrapped into atomic vectors of type “character”:
typeof("spam") # object type; see also is.character() and is.vector()
## [1] "character"
length("spam") # how many strings are in this vector?
## [1] 1
Hence, we will be using the terms “string” and “character vector of length 1” interchangeably.
Not having a separate scalar type is very convenient; the so-called vectorisation strategy encourages writing code that processes whole collections of objects, all at once, regardless of their size.
For instance, given the following character vector:
pythons <- c("Graham Chapman", "John Cleese", "Terry Gilliam",
"Eric Idle", "Terry Jones", "Michael Palin")
we can separate the first and the last names from each other (assuming for simplicity that no middle names are given), using just a single function call:
(pythons <- stri_split_fixed(pythons, " ", simplify=TRUE))
## [,1] [,2]
## [1,] "Graham" "Chapman"
## [2,] "John" "Cleese"
## [3,] "Terry" "Gilliam"
## [4,] "Eric" "Idle"
## [5,] "Terry" "Jones"
## [6,] "Michael" "Palin"
Due to vectorisation, we can generally avoid writing explicit for- and while-loops (“for each string in a vector, …”), which makes the code much more readable, maintainable, and faster to execute.
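For comparison, here is a sketch of how the same task could be expressed with an explicit loop (on a small, hypothetical input vector); the vectorised call above is both shorter and typically faster:
full_names <- c("Graham Chapman", "John Cleese")  # hypothetical input
out <- matrix(NA_character_, nrow=length(full_names), ncol=2)
for (i in seq_along(full_names))  # process each string separately (not recommended)
    out[i, ] <- stri_split_fixed(full_names[i], " ")[[1]]
out
##      [,1]     [,2]
## [1,] "Graham" "Chapman"
## [2,] "John"   "Cleese"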
Acting Elementwise with Recycling¶
Binary and higher-arity operations in R are oftentimes vectorised with respect to all arguments (or at least to the crucial, non-optional ones). As a prototype, let’s consider the binary arithmetic, logical, or comparison operators (and, to some extent, paste(), strrep(), and, more generally, mapply()), for example, multiplication:
c(10, -1) * c(1, 2, 3, 4) # == c(10, -1, 10, -1) * c(1, 2, 3, 4)
## [1] 10 -2 30 -4
Calling “x * y” multiplies the corresponding components of the two vectors elementwise. If one operand is shorter than the other, it is recycled as many times as necessary to match the length of the longer one (with a warning if the longer length is not a multiple of the shorter one). Also, acting on a zero-length input always yields an empty vector.
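Both rules can be verified directly in base R:
c(10, -1) * numeric(0)   # a zero-length operand gives a zero-length result
## numeric(0)
c(10, -1, 5) * c(1, 2)   # partial recycling triggers a warning
## Warning in c(10, -1, 5) * c(1, 2): longer object length is not a multiple
## of shorter object length
## [1] 10 -2  5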
All functions in stringi follow this convention (with some obvious exceptions, such as the collapse argument in stri_join(), locale in stri_datetime_parse(), etc.). In particular, all string search functions are vectorised with respect to both the haystack and the needle arguments (and, e.g., the replacement string, if applicable).
Some users, unaware of this rule, might find this behaviour unintuitive at first and thus miss out on how powerful it is. Therefore, let’s enumerate the most noteworthy scenarios that are possible thanks to the arguments’ recycling, using calls to stri_count_fixed(haystack, needle) (which looks for a needle in a haystack) as an illustration:
many strings – one pattern:
stri_count_fixed(c("abcd", "abcabc", "abdc", "dab", NA), "abc")
## [1] 1 2 0 0 NA
(there is 1 occurrence of "abc" in "abcd", 2 in "abcabc", and so forth);
one string – many patterns:
stri_count_fixed("abcdeabc", c("def", "bc", "abc", NA))
## [1] 0 2 2 NA
("def" does not occur in "abcdeabc", "bc" can be found therein twice, etc.);
each string – its own corresponding pattern:
stri_count_fixed(c("abca", "def", "ghi"), c("a", "z", "h"))
## [1] 2 0 1
(there are two "a"s in "abca", no "z" in "def", and one "h" in "ghi");
each row in a matrix – its own corresponding pattern:
(haystack <- matrix(  # example input
    do.call(stri_join, expand.grid(
        c("a", "b", "c"), c("a", "b", "c"), c("a", "b", "c")
    )), nrow=3))
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
## [1,] "aaa" "aba" "aca" "aab" "abb" "acb" "aac" "abc" "acc"
## [2,] "baa" "bba" "bca" "bab" "bbb" "bcb" "bac" "bbc" "bcc"
## [3,] "caa" "cba" "cca" "cab" "cbb" "ccb" "cac" "cbc" "ccc"
needle <- c("a", "b", "c")
matrix(stri_count_fixed(haystack, needle),  # call to stringi
    nrow=3, dimnames=list(needle, NULL))
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## a    3    2    2    2    1    1    2    1    1
## b    1    2    1    2    3    2    1    2    1
## c    1    1    2    1    1    2    2    2    3
(this looks for "a" in the 1st row of haystack, "b" in the 2nd row, and "c" in the 3rd; in particular, there are 3 "a"s in "aaa", 2 in "aba", and 1 "b" in "baa"; this is possible due to the fact that matrices are represented as “flat” vectors of length nrow*ncol, whose elements are read in the column-major (Fortran) order; therefore, here, pattern "a" is sought in the 1st, 4th, 7th, … string in haystack, i.e., "aaa", "aba", "aca", …; pattern "b" in the 2nd, 5th, 8th, … string; and "c" in the 3rd, 6th, 9th, … one);
On a side note, to match a different pattern against each column, we can (amongst others) apply the matrix transpose twice: t(stri_count_fixed(t(haystack), needle)).
all strings – all patterns:
haystack <- c("aaa", "bbb", "ccc", "abc", "cba", "aab", "bab", "acc")
needle <- c("a", "b", "c")
structure(
    outer(haystack, needle, stri_count_fixed),
    dimnames=list(haystack, needle))  # add row and column names
##     a b c
## aaa 3 0 0
## bbb 0 3 0
## ccc 0 0 3
## abc 1 1 1
## cba 1 1 1
## aab 2 1 0
## bab 1 2 0
## acc 1 0 2
(which computes the counts over the Cartesian product of the two arguments)
This is equivalent to:
matrix(
    stri_count_fixed(rep(haystack, each=length(needle)), needle),
    byrow=TRUE, ncol=length(needle),
    dimnames=list(haystack, needle))
##     a b c
## aaa 3 0 0
## bbb 0 3 0
## ccc 0 0 3
## abc 1 1 1
## cba 1 1 1
## aab 2 1 0
## bab 1 2 0
## acc 1 0 2
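As a side note, vectorisation in the search-and-replace functions also covers the replacement strings; a minimal sketch (elementwise matching of strings, patterns, and replacements):
stri_replace_all_fixed(
    c("lovely spam", "wonderful spam"),
    c("lovely", "wonderful"),
    c("tasty", "glorious"))
## [1] "tasty spam"    "glorious spam"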
Missing Values¶
Some base R string processing functions, e.g., paste(), treat missing values as literal "NA" strings. stringi, however, enforces the consistent propagation of missing values (just like arithmetic operations do):
paste(c(NA_character_, "b", "c"), "x", 1:2) # base R
## [1] "NA x 1" "b x 2" "c x 1"
stri_join(c(NA_character_, "b", "c"), "x", 1:2) # stringi
## Warning in stri_join(c(NA_character_, "b", "c"), "x", 1:2): longer object
## length is not a multiple of shorter object length
## [1] NA "bx2" "cx1"
For dealing with missing values, we may rely on convenience functions such as stri_omit_na() or stri_replace_na().
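For instance (a brief sketch; the replacement label below is arbitrary):
x <- c("spam", NA, "bacon")
stri_replace_na(x, replacement="(missing)")  # substitute a placeholder
## [1] "spam"      "(missing)" "bacon"
stri_omit_na(x)                              # or drop the missing entries
## [1] "spam"  "bacon"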
Data Flow¶
All vector-like arguments (including factors and objects) in stringi are treated in the same manner: for example, if a function expects a character vector on input and an object of another type is provided, as.character() is called first (we can see that in the example above, where “1:2” is treated as c("1", "2")).
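For instance, a factor is accepted as well (a minimal sketch):
f <- factor(c("spam", "bacon", "spam"))  # a factor, not a character vector
stri_detect_fixed(f, "spam")             # coerced with as.character() first
## [1]  TRUE FALSE  TRUE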
Following [Wic10], stringi makes sure the output data types are consistent and that different functions are interoperable. This makes operation chaining easier and less error prone.
For example, stri_extract_first_regex() finds the first occurrence of a pattern in each string; therefore, the output is a character vector of the same length as the input (with the recycling rule in place if necessary).
haystack <- c("bacon", "spam", "jam, spam, bacon, and spam")
stri_extract_first_regex(haystack, "\\b\\w{1,4}\\b")
## [1] NA "spam" "jam"
Note that a no-match (here, we have been looking for words of at most 4 characters) is marked with a missing string. This makes the output vector size consistent with the length of the inputs.
On the other hand, stri_extract_all_regex() identifies all occurrences of a pattern, whose counts may differ from input to input; therefore, it yields a list of character vectors.
stri_extract_all_regex(haystack, "\\b\\w{1,4}\\b", omit_no_match=TRUE)
## [[1]]
## character(0)
##
## [[2]]
## [1] "spam"
##
## [[3]]
## [1] "jam" "spam" "and" "spam"
If omit_no_match=TRUE had not been specified, a no-match would be represented by a missing value (for consistency with the previous function).
Also, care is taken so that the “data” or “x” argument is most often listed as the first one (e.g., in base R we have grepl(needle, haystack), whereas here it is stri_detect(haystack, needle)). This makes the functions more intuitive to use, but also more forward-pipe-operator friendly (whether using “|>”, introduced in R 4.1, or “%>%” from magrittr).
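For instance, a small sketch of such a pipeline, reusing the haystack vector defined above (the native pipe requires R 4.1 or newer):
haystack |>
    stri_extract_first_regex("\\b\\w{1,4}\\b") |>
    stri_omit_na()
## [1] "spam" "jam"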
Furthermore, for increased convenience, some functions have been added despite the fact that they can trivially be reduced to a series of other calls. In particular, writing:
stri_sub_all(haystack,
stri_locate_all_regex(haystack, "\\b\\w{1,4}\\b", omit_no_match=TRUE))
yields the same result as in the previous example, but refers to haystack twice.
Further Deviations from Base R¶
stringi can be used as a replacement for the existing string processing functions. It also offers many facilities not available in base R. Apart from being fully vectorised with respect to all crucial arguments, propagating missing values and empty vectors consistently, and following coherent naming conventions, our functions deviate from their classic counterparts even further.
Following Unicode Standards. Thanks to the comprehensive coverage of the most important services provided by ICU, stringi users gain access to collation, pattern searching, normalisation, transliteration, etc., that follow the recent Unicode standards for text processing in any locale. Due to this, as we state in Character Encodings, all inputs are converted to Unicode, and outputs are always in UTF-8.
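For instance, here is a quick illustration of one such service, Unicode normalisation (a sketch assuming a UTF-8-capable console):
stri_trans_nfc("A\u030a") == "\u00c5"  # compose "A" + combining ring above into "Å"
## [1] TRUE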
Portability Issues in Base R.
As we have mentioned in the introduction, base R string operations have traditionally been limited in scope. There might also be some issues with regard to their portability, and the reasons for this are plentiful. For instance, different versions of the PCRE (8.x or 10.x) pattern-matching library may be linked to during the compilation of R. On Windows, there is a custom implementation of iconv whose set of character encoding IDs is not fully compatible with that on GNU/Linux: to select the Polish locale, we are required to pass "Polish_Poland" to Sys.setlocale() on Windows, whereas on Linux it is "pl_PL". Interestingly, R can be built against the system ICU so that it uses its Collator for comparing strings (e.g., via the “<=” operator); however, this is only optional and does not provide access to any other Unicode services.
For example, let us consider the matching of “all letters” by means of the built-in gregexpr() function and the TRE (perl=FALSE) and PCRE (perl=TRUE) libraries, using a POSIX-like and a Unicode-style character set (see Regular Expressions for more details):
x <- "AEZaezĄĘŻąęż" # "AEZaez\u0104\u0118\u017b\u0105\u0119\u017c"
stri_sub(x, gregexpr("[[:alpha:]]", x, perl=FALSE)[[1]], length=1)
stri_sub(x, gregexpr("[[:alpha:]]", x, perl=TRUE)[[1]], length=1)
stri_sub(x, gregexpr("\\p{L}", x, perl=TRUE)[[1]], length=1)
On Ubuntu Linux 20.04 (UTF-8 locale), the respective outputs are:
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
On Windows, when x is marked as UTF-8 (see Character Encodings), the current author obtained:
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
And again on Windows, using the Polish locale but with x marked as natively-encoded (CP-1250 in this case):
## [1] "A" "E" "Z" "a" "e" "z" "Ę" "ę"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
## [1] "A" "E" "Z" "a" "e" "z" "Ę" "ę"
As we mention in Collation, when stringi links to ICU built from sources (install.packages("stringi", configure.args="--disable-pkg-config")), we are always guaranteed to obtain the same results on every platform.
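For comparison, a stringi-based counterpart of the above letter matching (a sketch; by design, it yields the same result on every platform):
stri_extract_all_regex(x, "\\p{L}")[[1]]
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"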
High Performance of stringi. For the aforementioned reasons, functions in stringi do not refer to their base R counterparts. The operations that do not rely on ICU services have been rewritten from scratch with speed and portability in mind. For example, here are some timings of string concatenation:
x <- stri_rand_strings(length(LETTERS) * 1000, 1000)
microbenchmark::microbenchmark(
join2=stri_join(LETTERS, x, sep="", collapse=", "),
join3=stri_join(x, LETTERS, x, sep="", collapse=", "),
r_paste2=paste(LETTERS, x, sep="", collapse=", "),
r_paste3=paste(x, LETTERS, x, sep="", collapse=", ")
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## join2 35.055 35.843 47.088 37.069 53.653 85.487 100
## join3 79.598 80.815 85.678 81.562 91.717 128.963 100
## r_paste2 90.511 91.802 108.026 97.838 112.744 162.961 100
## r_paste3 190.622 193.449 229.653 244.515 251.249 280.253 100
Another example – timings of fixed pattern searching:
x <- stri_rand_strings(100, 100000, "[actg]")
y <- "acca"
microbenchmark::microbenchmark(
fixed=stri_locate_all_fixed(x, y),
regex=stri_locate_all_regex(x, y),
coll=stri_locate_all_coll(x, y),
r_tre=gregexpr(y, x),
r_pcre=gregexpr(y, x, perl=TRUE),
r_fixed=gregexpr(y, x, fixed=TRUE)
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## fixed 4.8786 4.9576 5.0491 5.0353 5.0934 5.7554 100
## regex 113.6029 114.8261 115.5398 115.2599 115.5966 121.5098 100
## coll 388.7710 392.9170 396.2839 394.3113 396.7988 427.4617 100
## r_tre 126.5528 127.3853 128.9907 127.8981 128.9863 141.2107 100
## r_pcre 73.8114 74.4104 75.1880 74.7540 75.1607 80.1232 100
## r_fixed 52.4524 53.0642 53.6158 53.3177 53.6758 57.4354 100
Different Default Arguments and Greater Configurability.
Some functions in stringi have different, more natural default arguments, e.g., paste() has sep=" ", whereas stri_join() has sep="". Also, as there is no one-size-fits-all solution to every problem, many arguments have been introduced to allow for more detailed tuning.
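A quick illustration of the differing defaults:
paste("spam", "bacon")       # base R: sep=" "
## [1] "spam bacon"
stri_join("spam", "bacon")   # stringi: sep=""
## [1] "spambacon"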
Preserving Attributes.
Generally, stringi preserves no object attributes whatsoever, but users can make sure themselves that they are retained, e.g., by calling “x[] <- stri_...(x, ...)” or “`attributes<-`(stri_...(x, ...), attributes(x))”.
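For instance, a minimal sketch using a named character vector (names are attributes too):
x <- c(a="spam", b="bacon")   # a named character vector
stri_trans_toupper(x)         # the names are dropped
## [1] "SPAM"  "BACON"
x[] <- stri_trans_toupper(x)  # the x[] <- ... idiom keeps them
x
##      a       b
## "SPAM" "BACON"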
- [1] Interestingly, in 2015 the aforementioned stringr package was rewritten as a set of wrappers around some of the stringi functions instead of the base R ones. In Section 14.7 of R for Data Science [WG17] we read: “stringr is useful when you’re learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 250 functions to stringr’s 49”.
- [2] See also the new stringx package, which supplies a set of portable and efficient replacements for and enhancements of the base R functions (based on stringi).