General Design Principles¶
This tutorial is based on the paper on stringi that was published in the Journal of Statistical Software; see [2]. To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [3].
The API of the early releases of stringi has been designed to be fairly compatible with that of the 0.6.2 version of the stringr package [11] (dated 2012[1]), with some fixes in the consistency of the handling of missing values and zero-length vectors, amongst others. However, instead of being merely thin wrappers around base R[2] functions, which we have identified as not necessarily portable across platforms and not really suitable for natural language processing tasks, all the functionality has been implemented from the ground up, with the use of ICU services wherever applicable. Since the initial release, an abundance of new features has been added, and the package can now be considered a complete workhorse for text data processing.
Note that the stringi API is stable. Future releases will aim for as much backward compatibility as possible so that other software projects can safely rely on it.
Naming¶
Function and argument names use a combination of lowercase letters and underscores (and no dots). To avoid namespace clashes, all function names feature the “stri_” prefix. Names are quite self-explanatory, e.g., stri_locate_first_regex and stri_locate_all_fixed find, respectively, the first match to a regular expression and all occurrences of a substring as-is.
Vectorisation¶
Individual character (or code point) strings can be entered using double quotes or apostrophes:
"spam" # or 'spam'
## [1] "spam"
However, as the R language does not feature any classical scalar types, strings are wrapped around atomic vectors of type “character”:
typeof("spam") # object type; see also is.character() and is.vector()
## [1] "character"
length("spam") # how many strings are in this vector?
## [1] 1
Hence, we will be using the terms “string” and “character vector of length 1” interchangeably.
Not having a separate scalar type is very convenient; the so-called vectorisation strategy encourages writing code that processes whole collections of objects, all at once, regardless of their size.
For instance, given the following character vector:
pythons <- c("Graham Chapman", "John Cleese", "Terry Gilliam",
"Eric Idle", "Terry Jones", "Michael Palin")
we can separate the first and the last names from each other (assuming for simplicity that no middle names are given), using just a single function call:
(pythons <- stri_split_fixed(pythons, " ", simplify=TRUE))
## [,1] [,2]
## [1,] "Graham" "Chapman"
## [2,] "John" "Cleese"
## [3,] "Terry" "Gilliam"
## [4,] "Eric" "Idle"
## [5,] "Terry" "Jones"
## [6,] "Michael" "Palin"
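For instance, the second column of this matrix now holds the surnames alone (a small base-R subsetting note, not part of the original example):
pythons[, 2]
## [1] "Chapman" "Cleese"  "Gilliam" "Idle"    "Jones"   "Palin"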
Due to vectorisation, we can generally avoid using for- and while-loops (“for each string in a vector…”). This can make the code much more readable, maintainable, and faster to execute.
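For comparison, here is a rough sketch of the loop-based equivalent (for illustration only; monty below is merely a hypothetical copy of the original vector, as pythons now stores the matrix):
library("stringi")
monty <- c("Graham Chapman", "John Cleese", "Terry Gilliam",
    "Eric Idle", "Terry Jones", "Michael Palin")
out <- matrix(NA_character_, nrow=length(monty), ncol=2)
for (i in seq_along(monty))  # "for each string in the vector..."
    out[i, ] <- stri_split_fixed(monty[i], " ")[[1]]
out  # the same matrix as the one produced by the single vectorised call above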
Acting Elementwise with Recycling¶
Binary and higher-arity operations in R are oftentimes vectorised with respect to all arguments (or at least to the crucial, non-optional ones). As a prototype, let’s consider the binary arithmetic, logical, or comparison operators (and, to some extent, paste(), strrep(), and more generally mapply()), for example, multiplication:
c(10, -1) * c(1, 2, 3, 4) # == c(10, -1, 10, -1) * c(1, 2, 3, 4)
## [1] 10 -2 30 -4
Calling “x * y” multiplies the corresponding components of the two vectors elementwise. As one operand happens to be shorter than the other, the former is recycled as many times as necessary to match the length of the latter (there would be a warning if partial recycling occurred). Also, acting on a zero-length input always yields an empty vector.
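For instance, in base R (an extra illustration of the two remarks above):
c(10, -1, 5) * c(1, 2, 3, 4)  # partial recycling: c(10, -1, 5, 10) * c(1, 2, 3, 4)
## Warning in c(10, -1, 5) * c(1, 2, 3, 4): longer object length is not a
## multiple of shorter object length
## [1] 10 -2 15 40
c(10, -1) * numeric(0)  # zero-length input
## numeric(0)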
All functions in stringi follow this convention (with some obvious exceptions, such as the collapse argument in stri_join(), locale in stri_datetime_parse(), etc.). In particular, all string search functions are vectorised with respect to both the haystack and the needle arguments (and, e.g., the replacement string, if applicable).
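For example, the replacement strings are recycled alongside the other arguments (a small sketch added for illustration; it is not part of the original text):
stri_replace_first_fixed(c("aXa", "bXb", "cXc"), "X", c("1", "2", "3"))
## [1] "a1a" "b2b" "c3c"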
Some users, unaware of this rule, might find this behaviour unintuitive
at the beginning and thus miss out on how powerful it is. Therefore, let’s
enumerate the most noteworthy scenarios that are possible thanks to
the arguments’ recycling, using the call to
stri_count_fixed(haystack, needle)
(which looks for a needle in a
haystack) as an illustration:
many strings – one pattern:

stri_count_fixed(c("abcd", "abcabc", "abdc", "dab", NA), "abc")
## [1]  1  2  0  0 NA

(there is 1 occurrence of "abc" in "abcd", 2 in "abcabc", and so forth);

one string – many patterns:

stri_count_fixed("abcdeabc", c("def", "bc", "abc", NA))
## [1]  0  2  2 NA

("def" does not occur in "abcdeabc", "bc" can be found therein twice, etc.);

each string – its own corresponding pattern:

stri_count_fixed(c("abca", "def", "ghi"), c("a", "z", "h"))
## [1] 2 0 1

(there are two "a"s in "abca", no "z" in "def", and one "h" in "ghi");

each row in a matrix – its own corresponding pattern:

(haystack <- matrix(  # example input
    do.call(stri_join, expand.grid(
        c("a", "b", "c"), c("a", "b", "c"), c("a", "b", "c")
    )), nrow=3))
##      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
## [1,] "aaa" "aba" "aca" "aab" "abb" "acb" "aac" "abc" "acc"
## [2,] "baa" "bba" "bca" "bab" "bbb" "bcb" "bac" "bbc" "bcc"
## [3,] "caa" "cba" "cca" "cab" "cbb" "ccb" "cac" "cbc" "ccc"
needle <- c("a", "b", "c")
matrix(stri_count_fixed(haystack, needle),  # call to stringi
    nrow=3, dimnames=list(needle, NULL))
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## a    3    2    2    2    1    1    2    1    1
## b    1    2    1    2    3    2    1    2    1
## c    1    1    2    1    1    2    2    2    3

(this looks for "a" in the 1st row of haystack, "b" in the 2nd row, and "c" in the 3rd; in particular, there are 3 "a"s in "aaa", 2 in "aba", and 1 "b" in "baa"; this is possible because matrices are represented as “flat” vectors of length nrow*ncol, whose elements are read in a column-major (Fortran) order; therefore, here, the pattern "a" is being sought in the 1st, 4th, 7th, … string in haystack, i.e., "aaa", "aba", "aca", …; pattern "b" in the 2nd, 5th, 8th, … string; and "c" in the 3rd, 6th, 9th, … one);

On a side note, to match different patterns with respect to each column, we can (amongst others) apply matrix transpose twice (t(stri_count_fixed(t(haystack), needle))).

all strings – all patterns:

haystack <- c("aaa", "bbb", "ccc", "abc", "cba", "aab", "bab", "acc")
needle <- c("a", "b", "c")
structure(
    outer(haystack, needle, stri_count_fixed),
    dimnames=list(haystack, needle))  # add row and column names
##     a b c
## aaa 3 0 0
## bbb 0 3 0
## ccc 0 0 3
## abc 1 1 1
## cba 1 1 1
## aab 2 1 0
## bab 1 2 0
## acc 1 0 2

(which computes the counts over the Cartesian product of the two arguments)

This is equivalent to:

matrix(
    stri_count_fixed(rep(haystack, each=length(needle)), needle),
    byrow=TRUE, ncol=length(needle),
    dimnames=list(haystack, needle))
##     a b c
## aaa 3 0 0
## bbb 0 3 0
## ccc 0 0 3
## abc 1 1 1
## cba 1 1 1
## aab 2 1 0
## bab 1 2 0
## acc 1 0 2
Missing Values¶
Some base R string processing functions, e.g., paste(), treat missing values as literal "NA" strings. stringi, however, enforces the consistent propagation of missing values (just like arithmetic operations do):
paste(c(NA_character_, "b", "c"), "x", 1:2) # base R
## [1] "NA x 1" "b x 2" "c x 1"
stri_join(c(NA_character_, "b", "c"), "x", 1:2) # stringi
## Warning in stri_join(c(NA_character_, "b", "c"), "x", 1:2): longer object
## length is not a multiple of shorter object length
## [1] NA "bx2" "cx1"
For dealing with missing values, we may rely on convenience functions such as stri_omit_na() or stri_replace_na().
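For instance (a brief sketch of how these two helpers behave):
x <- c("spam", NA, "bacon")
stri_omit_na(x)                  # drop the missing values
## [1] "spam"  "bacon"
stri_replace_na(x, "(unknown)")  # substitute them with a given string
## [1] "spam"      "(unknown)" "bacon"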
Data Flow¶
All vector-like arguments (including factors and objects) in stringi are treated in the same manner. For example, if a function expects a character vector on input and an object of another type is provided, as.character() is called first (we see that in the example above, where “1:2” is treated as c("1", "2")).
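For example (a quick illustration of this coercion rule, using stri_length(), which counts the code points in each string):
stri_length(factor(c("spam", "bacon")))  # the factor is passed through as.character() first
## [1] 4 5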
Following [11], stringi makes sure the output data types are consistent and that different functions are interoperable. This makes operation chaining easier and less error-prone.
For example, stri_extract_first_regex() finds the first occurrence of a pattern in each string. Therefore, the output is a character vector of the same length as the input (with the recycling rule in place if necessary).
haystack <- c("bacon", "spam", "jam, spam, bacon, and spam")
stri_extract_first_regex(haystack, "\\b\\w{1,4}\\b")
## [1] NA "spam" "jam"
Note that a no-match (here, we have been looking for words of at most 4 characters) is marked with a missing string. This makes the output vector size consistent with the length of the inputs.
On the other hand, stri_extract_all_regex() identifies all occurrences of a pattern, whose counts may differ from input to input; therefore, it yields a list of character vectors.
stri_extract_all_regex(haystack, "\\b\\w{1,4}\\b", omit_no_match=TRUE)
## [[1]]
## character(0)
##
## [[2]]
## [1] "spam"
##
## [[3]]
## [1] "jam" "spam" "and" "spam"
If the 3rd argument were not specified, a no-match would be represented by a missing value (for consistency with the previous function).
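Indeed, calling the same function with omit_no_match left at its default gives:
stri_extract_all_regex(haystack, "\\b\\w{1,4}\\b")
## [[1]]
## [1] NA
##
## [[2]]
## [1] "spam"
##
## [[3]]
## [1] "jam" "spam" "and" "spam"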
Also, care is taken so that the “data” or “x” argument is most often listed as the first one (e.g., in base R we have grepl(needle, haystack) vs stri_detect(haystack, needle) here). This makes the functions more intuitive to use, but also more forward pipe operator-friendly (either when using “|>” introduced in R 4.1 or “%>%” from magrittr).
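For example, a pipe-based rewrite of an earlier call might look as follows (assuming R >= 4.1 and the haystack vector defined above):
haystack |> stri_extract_first_regex("\\b\\w{1,4}\\b")
## [1] NA "spam" "jam"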
Furthermore, for increased convenience, some functions have been added despite the fact that they can trivially be reduced to a series of other calls. In particular, writing:
stri_sub_all(haystack,
stri_locate_all_regex(haystack, "\\b\\w{1,4}\\b", omit_no_match=TRUE))
yields the same result as in the previous example, but refers to haystack twice.
Further Deviations from Base R¶
stringi can be used as a replacement for the existing string processing functions. Also, it offers many facilities not available in base R. Apart from being fully vectorised with respect to all crucial arguments, propagating missing values and empty vectors consistently, and following coherent naming conventions, our functions deviate from their classic counterparts even further.
Following Unicode Standards. Thanks to the comprehensive coverage of the most important services provided by ICU, its users gain access to collation, pattern searching, normalisation, transliteration, etc., that follow the current Unicode standards for text processing in any locale. Due to this, as we state in Character Encodings, all inputs are converted to Unicode. Furthermore, all outputs are always in UTF-8.
Portability Issues in Base R.
As mentioned in the introduction, base R string operations have traditionally been limited in scope. There also might be some issues with regard to their portability, the reasons for which may be plentiful. For instance, varied versions of the PCRE (8.x or 10.x) pattern matching libraries may be linked to during the compilation of R. On Windows, there is a custom implementation of iconv that has a set of character encoding IDs not fully compatible with that on GNU/Linux: to select the Polish locale, we are required to pass "Polish_Poland" to Sys.setlocale() on Windows, whereas "pl_PL" works on Linux. Interestingly, R can be built against the system ICU so that it uses its Collator for comparing strings (e.g., using the “<=” operator). However, this is only optional and does not provide access to any other Unicode services.
For example, let us consider the matching of “all letters” by means of the built-in gregexpr() function and the TRE (perl=FALSE) and PCRE (perl=TRUE) libraries using a POSIX-like and Unicode-style character set (see Regular Expressions for more details):
x <- "AEZaezĄĘŻąęż" # "AEZaez\u0104\u0118\u017b\u0105\u0119\u017c"
stri_sub(x, gregexpr("[[:alpha:]]", x, perl=FALSE)[[1]], length=1)
stri_sub(x, gregexpr("[[:alpha:]]", x, perl=TRUE)[[1]], length=1)
stri_sub(x, gregexpr("\\p{L}", x, perl=TRUE)[[1]], length=1)
On Ubuntu Linux 20.04 (UTF-8 locale), the respective outputs are:
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
On Windows, when x is marked as UTF-8 (see Character Encodings), the current author obtained:
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
And again on Windows, using the Polish locale but with x marked as natively-encoded (CP-1250 in this case):
## [1] "A" "E" "Z" "a" "e" "z" "Ę" "ę"
## [1] "A" "E" "Z" "a" "e" "z" "Ą" "Ę" "Ż" "ą" "ę" "ż"
## [1] "A" "E" "Z" "a" "e" "z" "Ę" "ę"
As we mention in Collation, when stringi links to ICU built from sources (install.packages("stringi", configure.args="--disable-pkg-config")), we are always guaranteed to get the same results on every platform.
High Performance of stringi. Because of the aforementioned reasons, functions in stringi do not refer to their base R counterparts. The operations that do not rely on ICU services have been rewritten from scratch with speed and portability in mind. For example, here are some timings of string concatenation:
x <- stri_rand_strings(length(LETTERS) * 1000, 1000)
microbenchmark::microbenchmark(
join2=stri_join(LETTERS, x, sep="", collapse=", "),
join3=stri_join(x, LETTERS, x, sep="", collapse=", "),
r_paste2=paste(LETTERS, x, sep="", collapse=", "),
r_paste3=paste(x, LETTERS, x, sep="", collapse=", ")
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## join2 39.153 40.157 54.064 41.688 52.681 109.94 100
## join3 84.818 88.149 92.345 90.330 94.587 138.78 100
## r_paste2 83.995 87.088 104.517 90.834 105.171 183.10 100
## r_paste3 176.490 182.296 228.608 243.960 258.631 343.04 100
Another example – timings of fixed pattern searching:
x <- stri_rand_strings(100, 100000, "[actg]")
y <- "acca"
microbenchmark::microbenchmark(
fixed=stri_locate_all_fixed(x, y),
regex=stri_locate_all_regex(x, y),
coll=stri_locate_all_coll(x, y),
r_tre=gregexpr(y, x),
r_pcre=gregexpr(y, x, perl=TRUE),
r_fixed=gregexpr(y, x, fixed=TRUE)
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## fixed 4.9861 5.0658 5.1497 5.088 5.1666 5.4805 100
## regex 98.5586 99.1162 99.5049 99.342 99.6874 109.1632 100
## coll 261.6454 262.9504 263.4606 263.463 263.9660 265.3722 100
## r_tre 117.0878 117.4128 117.8794 117.561 117.7429 127.0573 100
## r_pcre 70.8115 71.2062 71.8814 71.410 71.5560 89.8478 100
## r_fixed 23.7709 23.9305 24.0572 24.061 24.1899 24.3638 100
Different Default Arguments and Greater Configurability.
Some functions in stringi have different, more natural default arguments, e.g., paste() has sep=" ", but stri_join() has sep="". Also, as there is no one-size-fits-all solution to all problems, many arguments have been introduced for more detailed tuning.
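For instance (a two-line illustration of the differing sep defaults):
paste("spam", "bacon")      # sep=" "
## [1] "spam bacon"
stri_join("spam", "bacon")  # sep=""
## [1] "spambacon"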
Preserving Attributes.
Generally, stringi preserves no object attributes whatsoever; however, users can take care of that themselves, e.g., by calling “x[] <- stri_...(x, ...)” or “`attributes<-`(stri_...(x, ...), attributes(x))”.