Collation

This tutorial is based on the paper on stringi that will appear in the Journal of Statistical Software.

Historically, code-pointwise comparison had been used in most string comparison activities, especially when strings in ASCII (i.e., English) were involved. However, nowadays this does not necessarily constitute the most suitable approach to the processing of natural-language texts. In particular, a code-pointwise matching neither takes accented and conjoined letters nor ignorable punctuation and case into account.

The ICU Collation Service provides the basis for string comparison activities such as string sorting and searching, or determining if two strings are equivalent. This time, though, due to its conformance to the Unicode Collation Algorithm, we may expect that the generated results will meet the requirements of the culturally correct natural language processing in any locale.

Locales

String collation is amongst many locale-sensitive operations available in stringi. Before proceeding any further, we should first discuss how we can parameterise the ICU services so as to deliver the results that reflect the expectations of a specific user community, such as the speakers of different languages and their various regional variants.

Specifying Locales

A locale specifier1 is of the form "Language", "Language_Country", or "Language_Country_Variant", where:

  • Language is, most frequently, a two- or three-letter code that conforms to the ISO-639-1 or ISO-630-2 standard, respectively; e.g., "en" or "eng" for English, "es" or "spa" for Spanish, "zh" or "zho" for Chinese, and "mas" for Masai (which lacks the corresponding two-letter code); however, more specific language identifiers may also be available, e.g., "zh_Hans" for Simplified- and "zh_Hant" for Traditional-Chinese or "sr_Cyrl" for Cyrillic- and "sr_Latn" for Latin-Serbian;

  • Country is a two-letter code following the ISO-3166 standard that enables different language conventions within the same language; e.g., the US-English ("en_US") and Australian-English ("en_AU") not only observe some differences in spelling and vocabulary but also in the units of measurement;

  • Variant is an identifier indicating a preference towards some convention within the same country; e.g., "de_DE_PREEURO" formats currency values using the pre-2002 Deutsche Mark (DEM).

Moreover, following the “@” symbol, semicolon-separated “key=value” pairs can be appended to the locale specifier, in order to customise some locale-sensitive services even further (see below for an example using “@collation=phonebook” and Parsing and Formatting Date and Time for “@calendar=hebrew”, amongst others).

Listing Locales

To list the available locale identifiers, we call stri_locale_list().

length(stri_locale_list())
## [1] 784

As the number of supported locales is very high, here we’ll display only 5 randomly chosen ones:

sample(stri_locale_list(), 5)
## [1] "nl_CW"      "pt_CH"      "ff_Latn_SL" "en_PH"      "en_HK"

Querying for Locale-Specific Services

The availability of locale-specific services can only be determined during the request for a particular resource (for more details, see the ICU User Guide on Locales), which may depend on the ICU library version actually in use as well as the way the ICU Data Library (icudt) has been packaged. Therefore, for maximum portability, it is best to rely on the ICU library bundle that is shipped with stringi. This is the case on Windows and macOS, whose users typically download the pre-compiled versions of the package from CRAN. However, on various flavours of GNU/Linux and other Unix-based systems, the system ICU is used more eagerly2. To force building ICU from sources, we may call:

install.packages("stringi", configure.args="--disable-pkg-config")

Overall, if a requested service is unavailable in a given locale, the best possible match is returned.

Default Locale

Each locale-sensitive operation in stringi selects the current default locale if no locale has been explicitly requested, i.e., when a function’s locale argument is left alone in its “NULL” state. The default locale is initially set to match the system locale on the current platform, and may be changed with stri_locale_set(), e.g., in the very rare case of improper automatic locale detection.

As we have stated in the introduction, in this paper we use:

stri_locale_get()
## [1] "en_AU"

i.e., the Australian-English locale (which formats dates like “29 September 2021” and uses metric units of measurement).

Testing String Equivalence

In Unicode, some characters may have multiple representations. For instance, “LATIN SMALL LETTER A WITH OGONEK” (“ą”) can be stored as a single code point U+0105 or as a sequence that is comprised of the letter “LATIN SMALL LETTER A”, U+0061, and the “COMBINING OGONEK”, U+0328 (when rendered properly, they should appear as if they were identical glyphs). This is an example of canonical equivalence of strings.

Testing for the Unicode equivalence between strings can be performed by calling %s==% and, more generally, stri_cmp_equiv(), or their negated versions, %s!=% and stri_cmp_nequiv().

In the example below we have: a followed by ogonek (two code points) vs a with ogonek (single code point).

"a\u0328" %s==% "ą"             # a, ogonek == a with ogonek
## [1] TRUE

There are also functions for removing and indicating duplicated elements in a character vector:

x <- c("Gągolewski", "Gagolewski", "Ga\u0328golewski")
stri_unique(x)
## [1] "Gągolewski" "Gagolewski"
stri_duplicated(x)  # from_last=FALSE
## [1] FALSE FALSE  TRUE

Moreover, stri_duplicated_any() returns the index of the first non-unique element.

Linear Ordering of Strings

Operators such as %s<%, %<=%, etc., and the corresponding functions stri_cmp_lt() (“less than”), stri_cmp_le() (“less than or equal”), etc., implement locale-sensitive linear orderings of strings. Moreover, stri_sort() returns the lexicographically-sorted version of a given input vector, stri_order() yields the corresponding (stable) ordering permutation, and stri_rank() ranks strings within a vector.

For instance, here’s a comparison in the current default locale (Australian-English):

"chaotic" %s<% "hard"  # c < h
## [1] TRUE

Similar comparison in Polish:

stri_cmp_lt("chłodny", "hardy", locale="pl_PL")  # c < h
## [1] TRUE

And now for something completely different – the Slovak language:

stri_cmp_lt("chladný", "hladný", locale="sk_SK") # ch > h
## [1] FALSE

This is an example of the locale-aware comparison that is context-sensitive and which goes beyond the simple code-pointwise comparison. In the example above, a contraction occurred: in Slovak, two code points “ch” are treated as a single entity and are sorted after “h”:

Compare the ordering of Polish and Slovak words:

stri_sort(c("chłodny", "hardy", "cichy", "cenny"), locale="pl_PL")
## [1] "cenny"   "chłodny" "cichy"   "hardy"
stri_sort(c("cudný", "chladný", "hladný", "čudný"), locale="sk_SK")
## [1] "cudný"   "čudný"   "hladný"  "chladný"

An opposite situation is called an expansion:

german_k_words <- c("können", "kondensieren", "kochen", "korrelieren")
stri_sort(german_k_words, locale="de_DE")
## [1] "kochen"       "kondensieren" "können"       "korrelieren"
stri_sort(german_k_words, locale="de_DE@collation=phonebook")
## [1] "kochen"       "können"       "kondensieren" "korrelieren"

In the latter example, where we used the German phone-book order, "ö" is treated as "oe".

Collator Options

The table below lists the options that can be passed to stri_opts_collator() via the dot-dot-dot parameter, “...”, in all the functions that rely on the ICU Collator. Below we we play with some of them.

Option

Purpose

locale

a string specifying the locale to use; NULL (default) or "" for the current default locale as indicated by stri_locale_get()

strength

an integer in {1,2,3,4} defining collation strength; 1 for the most permissive collation rules, 4 for the strictest ones; defaults to 3

uppercase_first

logical; NA (default) orders upper and lower case letters in accordance to their tertiary weights, TRUE forces upper case letters to sort before lower case letters, FALSE does the opposite

numeric

logical; if TRUE, a collation key for the numeric value of substrings of digits is generated; this is a way to make "100" ordered after "2"; defaults to FALSE

case_level

logical; if TRUE, an extra case level (positioned before the third level) is generated; defaults to FALSE

normalisation

logical; if TRUE, then an incremental check is performed to see whether input data are in the FCD (“fast C or D”) form; if data are not in the FCD form, the incremental NFD normalisation is performed, see Normalising Strings; defaults to FALSE

alternate_shifted

logical; if FALSE (default), all code points with non-ignorable primary weights are handled in the same way; TRUE causes the code points with primary weights that are less than or equal to the variable top value to be ignored on the primary level and moved to the quaternary level; this can be used to, e.g., ignore punctuation, see the examples provided

french

logical; TRUE results in secondary weights being considered backwards, i.e., ordering according to the last accent difference – nowadays only used in Canadian-French; defaults to FALSE

Collation Strength

The Unicode Collation Algorithm can go beyond simple canonical equivalence: it can treat some other (depending on the context) differences as negligible too.

The strength option controls the Collator’s “attention to detail”. For instance, it can be used to make the ligature “ff” (U+FB00) compare equal to the two-letter sequence “ff”:

stri_cmp_equiv("\ufb00", "ff", strength=2)
## [1] TRUE

which is not the case at the default strength level (3).

Generally, four (nested) levels of inter-string differences can be distinguished:

  1. A primary difference – the strongest one – occurs where there is a mismatch between base characters (e.g., "a" vs "b").

  2. Some character accents can be considered a secondary difference in many languages. However, in other ones, an accented letter is considered a unique letter.

  3. Distinguishing between upper- and lower case typically happens on the tertiary level (see, however, the case_level option).

  4. If alternate_shifted is TRUE, differences in punctuation can be determined at the quaternary level. This is also meaningful in the processing of Hiragana text.

Hence, strength equal to 1 takes only primary differences into account. Strength of 2 will also be sensitive to secondary differences (distinguishes between “ß” and “ss” above), but will ignore tertiary differences (case).

Also, introducing an extra case level yields a case sensitive comparison that ignores secondary differences:

stri_unique(x, strength=1, case_level=TRUE) # ß == ss, case sensitive
## [1] "Gągolewski"

Ignoring Some Punctuation

Here are some effects of changing the alternate_shifted option, which allows for ignoring some punctuation marks:

x <- c("code point", "code-point", "codepoint", "CODE POINT", "CodePoint")
stri_unique(x, alternate_shifted=TRUE)  # strength=3
## [1] "code point" "CODE POINT" "CodePoint"
stri_unique(x, alternate_shifted=TRUE, strength=2)
## [1] "code point"

In the latter case, all strings are considered equivalent. Ignoring case but not punctuation yields:

stri_unique(x, strength=2)
## [1] "code point" "code-point" "codepoint"

Backward Secondary Sorting

The French Canadian Sorting Standard CAN/CSA Z243.4.1 (historically this had been the default for all French locales) requires the word ordering with respect to the last accent difference. Such a behaviour can be applied either by setting the French-Canadian locale or by passing the french=TRUE option to the Collator.

stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_FR")
## [1] "cote" "coté" "côte" "côté"
stri_sort(c("cote", "côte", "coté", "côté"), locale="fr_CA") # french=TRUE
## [1] "cote" "côte" "coté" "côté"

Sorting Numerals

By default, just like in base R and most other programming languages, a lexicographic ordering is used: the corresponding code points are being compared one by one, from left to right, and once a difference is detected, the result is returned immediately.

x <- c("a1", "a2", "a11", "a1", "a99", "a10", "a100", "a2", "a9", "a2")
stri_sort(x)
##  [1] "a1"   "a1"   "a10"  "a100" "a11"  "a2"   "a2"   "a2"   "a9"   "a99"

For example, "a99" is ordered after "a100", because "a" == "a" (first characters are equal) but then "9" > "1" (second characters are already different).

Let us, however, note the effect of setting the numeric option on the sorting of strings that involves numbers:

stri_sort(x, numeric=TRUE)
##  [1] "a1"   "a1"   "a2"   "a2"   "a2"   "a9"   "a10"  "a11"  "a99"  "a100"

Here is an example of sorting a data frame with respect to two criteria:

X <- data.frame(a=x, b=runif(length(x)))
X[order(-stri_rank(X$a, numeric=TRUE), X$b), ]
##       a        b
## 7  a100 0.528105
## 5   a99 0.940467
## 3   a11 0.408977
## 6   a10 0.045556
## 9    a9 0.551435
## 10   a2 0.456615
## 2    a2 0.788305
## 8    a2 0.892419
## 1    a1 0.287578
## 4    a1 0.883017

The object is now ordered by the first column decreasingly (using a “numeric” order) and ties are resolved based on increasing values in the second column.

A Note on Compatibility Equivalence

In Normalising Strings we describe different ways to normalise canonically equivalent code point sequences so that they are represented by the same code points, which can account for some negligible differences (as in the “a with ogonek” example above).

Apart from ignoring punctuation and case, the Unicode Standard Annex #15 also discusses the so-called compatibility equivalence of strings. This is a looser form of similarity; it is observed where there is the same abstract content, yet displayed by means of different glyphs, for instance “¼” (U+00BC) vs ”1/4” or “ℝ” vs “R”. In the latter case, whether these should be treated as equal, depends on the context (e.g., this can be the set of real numbers vs one’s favourite programming language). Compatibility decompositions (NFKC, NFKD) mentioned in the section on Normalising Strings or other types of transliteration can be used to normalise strings so that such differences are not accounted for.

Also, for “fuzzy” matching of strings, the stringdist package might be useful.

Searching for Fixed Strings Revisited

The ICU Collator can also be utilised where there is a need to locate the occurrences of simple textual patterns. The counterparts of the string search functions described in the section on Code-Pointwise Comparing have their names ending with *_coll(). They are slower than them, but are more appropriate in NLP activities.

For instance:

stri_detect_coll("Er ist so groß.", "GROSS", strength=1, locale="de_AT")
## [1] TRUE
stri_detect_coll("On je chladný", "chladny", strength=1, locale="sk_SK")
## [1] TRUE

1

Locale specifiers in ICU are platform-independent. This is not the case for their base-R counterparts, see help("locales"), e.g., we have "Polish_Poland" on Windows vs "pl_PL" on Linux.

2

See, e.g., software packages libicu-dev on Debian/Ubuntu or libicu-devel on RHL/Fedora/OpenSUSE. For more details regarding the configure/build process of stringi, refer to the INSTALL file.