stringi: Fast and Portable Character String Processing in R

stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding.

Thanks to ICU – International Components for Unicode, stringi fully supports a wide range of Unicode standards (see also this video).

library("stringi")  # attach the package before running the examples
stri_extract_all(regex="\\p{Emoji}",
    c("歡迎 欢迎! Χαίρετε! Bienvenidos! 😃❤🌍", "spam, spam, 🥓, 🍳, and spam"))
## [[1]]
## [1] "😃" "❤"  "🌍"
##
## [[2]]
## [1] "🥓" "🍳"

stri_count_fixed("ACATGAACGGGTACACACTG", "ACA", overlap=TRUE)
## [1] 3

stri_sort(c("cudný", "chladný", "hladný", "čudný"), locale="sk_SK")
## [1] "cudný"   "čudný"   "hladný"  "chladný"

stringi comes with numerous functions related to data cleansing, information extraction, and natural language processing:

  • string concatenation, padding, wrapping, and substring extraction,

  • pattern searching (e.g., with ICU Java-like regular expressions),

  • collation, sorting, and ranking,

  • random string generation,

  • string transliteration, case mapping and folding,

  • Unicode normalisation,

  • date-time formatting and parsing,

and many more.
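
As a rough illustration of a few of the categories listed above, here is a minimal sketch. The input strings are made up for this example, and the calls shown are only a small sample of the package's API.

```r
library("stringi")

# Concatenation, padding, and substring extraction
stri_join("spam", "eggs", sep=" & ")
## [1] "spam & eggs"
stri_pad_left("7", width=3, pad="0")
## [1] "007"
stri_sub("stringi", from=1, to=6)
## [1] "string"

# Pattern searching with an ICU regular expression
stri_detect_regex(c("spam123", "eggs"), "[0-9]+")
## [1]  TRUE FALSE

# Case mapping and transliteration ("\u0142" is the letter l with stroke)
stri_trans_toupper("stringi")
## [1] "STRINGI"
stri_trans_general("Bart\u0142omiej", "Latin-ASCII")
## [1] "Bartlomiej"

# Unicode normalisation: "a" followed by a combining acute accent
# (2 code points) is composed into a single code point under NFC
stri_length(stri_trans_nfc("a\u0301"))
## [1] 1

# Random string generation: the strings vary between runs,
# so only their lengths are shown
x <- stri_rand_strings(2, 8, pattern="[a-z]")
stri_length(x)
## [1] 8 8
```

Date-time formatting, collation options, and text wrapping are not covered by this sketch; the package's reference manual documents them.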

stringi is among the most often downloaded R packages.

You can obtain it from CRAN by calling:

install.packages("stringi")
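
Once installation finishes, a quick check along the following lines can confirm that the package loads. This is only a sketch, and the reported versions will differ between machines, so no output is shown.

```r
library("stringi")

# Installed package version (depends on your installation)
packageVersion("stringi")

# One-line summary of the stringi build, default locale, and ICU version
stri_info(short=TRUE)
```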

stringi’s source code is hosted on GitHub. It is distributed under the open source BSD-3-clause license.

The package’s API was inspired by that of the early (pre-tidyverse; v0.6.2) version of Hadley Wickham’s stringr package; since v1.0.0, released in 2015, stringr has been powered by stringi. Moreover, Hadley suggested quite a few new package features. The contributions from Bartłomiej Tartanus and many others are greatly appreciated. Thanks!

See also: stringx – a set of wrappers around stringi with a base R-compatible API.

Note

To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [3].

Citation: Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1–59, doi:10.18637/jss.v103.i02.