Collation¶
This tutorial is based on the paper on stringi that was published in the Journal of Statistical Software; see [2]. To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [3].
Historically, the code-pointwise comparison had been used in most string comparison activities, especially when strings in ASCII (i.e., English) were involved. However, nowadays, this does not necessarily constitute the most suitable approach to processing natural-language texts. In particular, a code-pointwise matching neither takes accented and conjoined letters nor ignorable punctuation and case into account.
The ICU Collation Service provides the basis for string comparison activities such as string sorting and searching, or determining if two strings are equivalent. This time, though, due to its conformance to the Unicode Collation Algorithm, we may expect that the generated results will meet the requirements of the culturally correct natural language processing in any locale.
Locales¶
String collation is amongst many locale-sensitive operations available in stringi. However, before proceeding any further, we should first discuss how we can parameterise the ICU services to deliver the results that reflect the expectations of a specific user community, such as the speakers of different languages and their various regional variants.
Specifying Locales¶
A locale specifier[1] is of the form "Language"
,
"Language_Country"
, or "Language_Country_Variant"
, where:
Language
is, most frequently, a two- or three-letter code that conforms to the ISO-639-1 or ISO-630-2 standard, respectively; e.g.,"en"
or"eng"
for English,"es"
or"spa"
for Spanish,"zh"
or"zho"
for Chinese, and"mas"
for Masai (which lacks the corresponding two-letter code); however, more specific language identifiers may also be available, e.g.,"zh_Hans"
for Simplified- and"zh_Hant"
for Traditional-Chinese or"sr_Cyrl"
for Cyrillic- and"sr_Latn"
for Latin-Serbian;Country
is a two-letter code following the ISO-3166 standard that enables different language conventions within the same language; e.g., the US-English ("en_US"
) and Australian-English ("en_AU"
) not only observe some differences in spelling and vocabulary but also in the units of measurement;Variant
is an identifier indicating a preference towards some convention within the same country; e.g.,"de_DE_PREEURO"
formats currency values using the pre-2002 Deutsche Mark (DEM).
Moreover, following the “@
” symbol, semicolon-separated “key=value
”
pairs can be appended to the locale specifier, in order to customise
some locale-sensitive services even further (see below for an example
using “@collation=phonebook
” and
Parsing and Formatting Date and Time for “@calendar=hebrew
”, amongst others).
Listing Locales¶
To list the available locale identifiers, we call stri_locale_list()
.
length(stri_locale_list())
## [1] 784
As the number of supported locales is very high, here we’ll display only 5 randomly chosen ones:
sample(stri_locale_list(), 5)
## [1] "nl_CW" "pt_CH" "ff_Latn_SL" "en_PH" "en_HK"
Querying for Locale-Specific Services¶
The availability of locale-specific services can only be determined during the request for a particular resource (for more details, see the ICU User Guide on Locales), which may depend on the ICU library version actually in use as well as the way the ICU Data Library (icudt) has been packaged. Therefore, for maximum portability, it is best to rely on the ICU library bundle shipped with stringi. This is the case on Windows and macOS, whose users typically download the pre-compiled versions of the package from CRAN. However, on various flavours of GNU/Linux and other Unix-based systems, the system ICU is used more eagerly[2]. To force building ICU from sources, we may call:
install.packages("stringi", configure.args="--disable-pkg-config")
Overall, if a requested service is unavailable in a given locale, the best possible match is returned.
Default Locale¶
Each locale-sensitive operation in stringi selects the current default
locale if no locale has been explicitly requested, i.e., when a
function’s locale
argument is left alone in its “NULL
” state. The
default locale is initially set to match the system locale on the
current platform, and may be changed with stri_locale_set()
, e.g., in
the sporadic case of improper automatic locale detection.
As we have stated in the introduction, in this paper we use:
stri_locale_get()
## [1] "en_AU"
i.e., the Australian-English locale (which formats dates like “29 September 2021” and uses metric units of measurement).
Testing String Equivalence¶
In Unicode, some characters may have multiple representations. For instance, “LATIN SMALL LETTER A WITH OGONEK” (“ą”) can be stored as a single code point U+0105 or as a sequence that is comprised of the letter “LATIN SMALL LETTER A”, U+0061, and the “COMBINING OGONEK”, U+0328 (when rendered properly, they should appear as if they were identical glyphs). This is an example of canonical equivalence of strings.
Testing for the Unicode equivalence between strings can be performed by
calling %s==%
and, more generally, stri_cmp_equiv()
, or their
negated versions, %s!=%
and stri_cmp_nequiv()
.
In the example below, we have: a followed by ogonek (two code points) vs a with ogonek (single code point).
"a\u0328" %s==% "ą" # a, ogonek == a with ogonek
## [1] TRUE
There are also functions for removing and indicating duplicated elements in a character vector:
x <- c("Gągolewski", "Gagolewski", "Ga\u0328golewski")
stri_unique(x)
## [1] "Gągolewski" "Gagolewski"
stri_duplicated(x) # from_last=FALSE
## [1] FALSE FALSE TRUE
Moreover, stri_duplicated_any()
returns the index of the first
non-unique element.
Linear Ordering of Strings¶
Operators such as %s<%
, %<=%
, etc., and the corresponding
functions stri_cmp_lt()
(“less than”), stri_cmp_le()
(“less than or
equal”), etc., implement locale-sensitive linear orderings of strings.
Moreover, stri_sort()
returns the lexicographically-sorted version of
a given input vector, stri_order()
yields the corresponding (stable)
ordering permutation, and stri_rank()
ranks strings within a vector.
For instance, here’s a comparison in the current default locale (Australian-English):
"chaotic" %s<% "hard" # c < h
## [1] TRUE
Similar comparison in Polish:
stri_cmp_lt("chłodny", "hardy", locale="pl_PL") # c < h
## [1] TRUE
And now for something completely different – the Slovak language:
stri_cmp_lt("chladný", "hladný", locale="sk_SK") # ch > h
## [1] FALSE
This is an example of the locale-aware comparison that is
context-sensitive and which goes beyond the simple code-pointwise
comparison. In the example above, a contraction occurred: in Slovak, two
code points “ch
” are treated as a single entity and are sorted after
“h
”:
Compare the ordering of Polish and Slovak words:
stri_sort(c("chłodny", "hardy", "cichy", "cenny"), locale="pl_PL")
## [1] "cenny" "chłodny" "cichy" "hardy"
stri_sort(c("cudný", "chladný", "hladný", "čudný"), locale="sk_SK")
## [1] "cudný" "čudný" "hladný" "chladný"
An opposite situation is called an expansion:
german_k_words <- c("können", "kondensieren", "kochen", "korrelieren")
stri_sort(german_k_words, locale="de_DE")
## [1] "kochen" "kondensieren" "können" "korrelieren"
stri_sort(german_k_words, locale="de_DE@collation=phonebook")
## [1] "kochen" "können" "kondensieren" "korrelieren"
In the latter example, where we used the German phone-book order, "ö"
is treated as "oe"
.
Collator Options¶
The table below lists the options that can be passed to
stri_opts_collator()
via the dot-dot-dot parameter, “...
”, in all
the functions that rely on the ICU Collator.
Below we play with some of them.
Option |
Purpose |
---|---|
|
a string specifying the locale to use; |
|
an integer in {1,2,3,4} defining collation strength; 1 for the most permissive collation rules, 4 for the strictest ones; defaults to 3 |
|
logical; |
|
logical; if |
|
logical; if |
|
logical; if |
|
logical; if |
|
logical; |
Collation Strength¶
The Unicode Collation Algorithm can go beyond simple canonical equivalence: it can treat some other (depending on the context) differences as negligible too.
The strength
option controls the Collator’s “attention to detail”. For
instance, it can be used to make the ligature “ff” (U+FB00) compare
equal to the two-letter sequence “ff”:
stri_cmp_equiv("\ufb00", "ff", strength=2)
## [1] TRUE
which is not the case at the default strength level (3).
Generally, four (nested) levels of inter-string differences can be distinguished:
A primary difference – the strongest one – occurs where there is a mismatch between base characters (e.g.,
"a"
vs"b"
).Some character accents can be considered a secondary difference in many languages. However, in other ones, an accented letter is considered a unique letter.
Distinguishing between upper- and lower case typically happens on the tertiary level (see, however, the
case_level
option).If
alternate_shifted
isTRUE
, differences in punctuation can be determined at the quaternary level. This is also meaningful in the processing of Hiragana text.
Ignoring Case¶
Note which strings are deemed equivalent when considering different collation strengths:
x <- c("gro\u00df", "gross", "GROSS", "Gro\u00df", "Gross", "GRO\u1e9e")
stri_unique(x, strength=1) # ß == ss, case insensitive
## [1] "groß"
stri_unique(x, strength=2) # ß != ss, case insensitive
## [1] "groß" "gross"
Hence, strength equal to 1 takes only primary differences into account. Strength of 2 will also be sensitive to secondary differences (distinguishes between “ß” and “ss” above), but will ignore tertiary differences (case).
Also, introducing an extra case level yields a case sensitive comparison that ignores secondary differences:
stri_unique(x, strength=1, case_level=TRUE) # ß == ss, case sensitive
## [1] "groß" "GROSS" "Groß"
Ignoring Some Punctuation¶
Here are some effects of changing the alternate_shifted
option, which
allows for ignoring some punctuation marks:
x <- c("code point", "code-point", "codepoint", "CODE POINT", "CodePoint")
stri_unique(x, alternate_shifted=TRUE) # strength=3
## [1] "code point" "CODE POINT" "CodePoint"
Here, when strength = 3
is used (the default), punctuation
differences are ignored, but the letter case is deemed significant.
stri_unique(x, alternate_shifted=TRUE, strength=2)
## [1] "code point"
In this case, all strings are considered equivalent.
Ignoring case but not punctuation yields:
stri_unique(x, strength=2)
## [1] "code point" "code-point" "codepoint"
Backward Secondary Sorting¶
The French Canadian Sorting Standard CAN/CSA Z243.4.1 (historically, this
had been the default for all French locales) requires the word ordering
with respect to the last accent difference. Such a behaviour can be
applied either by setting the French-Canadian locale, or by passing the
french=TRUE
option to the Collator.
stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_FR")
## [1] "cote" "coté" "côte" "côté"
stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_CA") # french=TRUE
## [1] "cote" "côte" "coté" "côté"
Sorting Numerals¶
By default, just like in base R and most other programming languages, a lexicographic ordering is used: the corresponding code points are being compared one by one, from left to right, and once a difference is detected, the result is returned immediately.
x <- c("a1", "a2", "a11", "a1", "a99", "a10", "a100", "a2", "a9", "a2")
stri_sort(x)
## [1] "a1" "a1" "a10" "a100" "a11" "a2" "a2" "a2" "a9" "a99"
For example, "a99"
is ordered after "a100"
, because "a" == "a"
(first characters are equal) but then "9" > "1"
(second characters are
already different).
Let us, however, note the effect of setting the numeric
option on the
sorting of strings that involves numbers:
stri_sort(x, numeric=TRUE)
## [1] "a1" "a1" "a2" "a2" "a2" "a9" "a10" "a11" "a99" "a100"
However, the limitation is that only natural (nonnegative, integer) numbers will be dealt with correctly.
Here is an example of sorting a data frame with respect to two criteria:
X <- data.frame(a=x, b=runif(length(x)))
X[order(-stri_rank(X$a, numeric=TRUE), X$b), ]
## a b
## 7 a100 0.528105
## 5 a99 0.940467
## 3 a11 0.408977
## 6 a10 0.045556
## 9 a9 0.551435
## 10 a2 0.456615
## 2 a2 0.788305
## 8 a2 0.892419
## 1 a1 0.287578
## 4 a1 0.883017
The object is now ordered by the first column decreasingly (using a “numeric” order) and ties are resolved based on increasing values in the second column.
A Note on Compatibility Equivalence¶
In Normalising Strings we describe different ways to normalise canonically equivalent code point sequences so that they are represented by the same code points, which can account for some negligible differences (as in the “a with ogonek” example above).
Apart from ignoring punctuation and case, the
Unicode Standard Annex #15
also discusses the so-called compatibility
equivalence of strings. This is a looser form of similarity; it is
observed where there is the same abstract content, yet displayed by
means of different glyphs, for instance “¼” (U+00BC) vs “1/4
” or
“ℝ” vs “R”. In the latter case, whether these should be
treated as equal, depends on the context (e.g., this can be the set of
real numbers vs one’s favourite programming language). Compatibility
decompositions (NFKC, NFKD) mentioned in the section on
Normalising Strings or other types of transliteration can be
used to normalise strings so that such differences are not accounted
for.
Also, for “fuzzy” matching of strings, the stringdist package might be utile.
Searching for Fixed Strings Revisited¶
The ICU Collator can also be utilised where there is a need to locate
the occurrences of simple textual patterns. The counterparts of the
string search functions described in the section on
Code-Pointwise Comparing have
their names ending with *_coll()
. Albeit slower than the *_fixed()
functions, they are more appropriate in natural language
processing activities.
For instance:
stri_detect_coll("Er ist so groß.", "GROSS", strength=1, locale="de_AT")
## [1] TRUE
stri_detect_coll("On je chladný", "chladny", strength=1, locale="sk_SK")
## [1] TRUE