stringi: Fast and Portable Character String Processing in R
stringi (pronounced “stringy”, IPA [strinɡi]) is THE R package for very fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding.
Thanks to ICU, stringi fully supports a wide range of Unicode standards (see also this video).
library(stringi)
stri_extract_all(regex="\\p{Emoji}",
    c("歡迎 欢迎! Χαίρετε! Bienvenidos! 😃❤🌍", "spam, spam, 🥓, 🍳, and spam"))
## [[1]]
## [1] "😃" "❤" "🌍"
##
## [[2]]
## [1] "🥓" "🍳"
stri_count_fixed("ACATGAACGGGTACACACTG", "ACA", overlap=TRUE)
## [1] 3
stri_sort(c("cudný", "chladný", "hladný", "čudný"), locale="sk_SK")
## [1] "cudný" "čudný" "hladný" "chladný"
stringi provides you with plenty of functions related to data cleansing, information extraction, and natural language processing:
string concatenation, padding, wrapping, and substring extraction,
pattern searching (e.g., with ICU Java-like regular expressions),
collation, sorting, and ranking,
random string generation,
string transliteration, case mapping and folding,
Unicode normalisation,
date-time formatting and parsing,
and many more.
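As a rough illustration of a few of the categories listed above, the following sketch calls one function per category. The input strings, the variable names, and the chosen date are made up for this example; the functions themselves are listed in the Reference Manual.

```r
library(stringi)

# Concatenation, padding, and substring extraction
joined <- stri_join("id", "42", sep="-")           # "id-42"
padded <- stri_pad_left("7", width=3, pad="0")     # "007"
part   <- stri_sub("stringi", from=1, to=6)        # "string"

# Case mapping and transliteration to ASCII
upper <- stri_trans_toupper("stringi")             # "STRINGI"
ascii <- stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")

# Unicode normalisation: "a" + combining acute accent becomes one code point
decomposed <- "a\u0301"
composed   <- stri_trans_nfc(decomposed)
stri_length(decomposed)                            # 2
stri_length(composed)                              # 1

# Random string generation: three strings of eight lower-case ASCII letters
ids <- stri_rand_strings(3, 8, "[a-z]")

# Date-time creation and formatting (ICU pattern syntax)
d     <- stri_datetime_create(2022, 7, 11, tz="UTC")
stamp <- stri_datetime_format(d, "uuuu-MM-dd", tz="UTC")  # "2022-07-11"
```

Most of these functions are vectorised over their arguments, so the same calls apply unchanged to whole character vectors.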
stringi is among the most often downloaded R packages.
You can obtain it from CRAN by calling:
install.packages("stringi")
stringi’s source code is hosted on GitHub. It is distributed under the open source BSD-3-clause license.
The package’s API was inspired by that of the early (pre-tidyverse; v0.6.2) version of Hadley Wickham’s stringr package (and since v1.0.0, released in 2015, stringr has been powered by stringi). Moreover, Hadley suggested quite a few new package features. The contributions from Bartłomiej Tartanus and many others are greatly appreciated. Thanks!
See also: stringx – a set of wrappers around stringi with a base R-compatible API.
Note
To learn more about R, check out Marek’s open-access (free!) textbook Deep R Programming [Gag23].
Citation: Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1–59, doi:10.18637/jss.v103.i02.
Tutorial
Reference Manual
- R Package stringi Reference
- about_arguments:
- about_encoding:
- about_locale:
- about_search_boundaries:
- about_search_charclass:
- about_search_coll:
- about_search_fixed:
- about_search_regex:
- about_search: String Searching
- about_stringi: Fast and Portable Character String Processing in R
- operator_add: Concatenate Two Character Vectors
- operator_compare: Compare Strings with or without Collation
- operator_dollar: C-Style Formatting with stri_sprintf as a Binary Operator
- stri_compare: Compare Strings with or without Collation
- stri_count_boundaries: Count the Number of Text Boundaries
- stri_count: Count the Number of Pattern Occurrences
- stri_datetime_add: Date and Time Arithmetic
- stri_datetime_create: Create a Date-Time Object
- stri_datetime_fields: Get Values for Date and Time Fields
- stri_datetime_format: Date and Time Formatting and Parsing
- stri_datetime_fstr: Convert strptime-Style Format Strings
- stri_datetime_now: Get Current Date and Time
- stri_datetime_symbols: List Localizable Date-Time Formatting Data
- stri_detect: Detect Pattern Occurrences
- stri_dup: Duplicate Strings
- stri_duplicated: Determine Duplicated Elements
- stri_enc_detect: Detect Character Set and Language
- stri_enc_detect2: [DEPRECATED] Detect Locale-Sensitive Character Encoding
- stri_enc_fromutf32: Convert From UTF-32
- stri_enc_info: Query a Character Encoding
- stri_enc_isascii: Check If a Data Stream Is Possibly in ASCII
- stri_enc_isutf16: Check If a Data Stream Is Possibly in UTF-16 or UTF-32
- stri_enc_isutf8: Check If a Data Stream Is Possibly in UTF-8
- stri_enc_list: List Known Character Encodings
- stri_enc_mark: Get Declared Encodings of Each String
- stri_enc_set:
- stri_enc_toascii: Convert To ASCII
- stri_enc_tonative: Convert Strings To Native Encoding
- stri_enc_toutf32: Convert Strings To UTF-32
- stri_enc_toutf8: Convert Strings To UTF-8
- stri_encode: Convert Strings Between Given Encodings
- stri_escape_unicode: Escape Unicode Code Points
- stri_extract_boundaries: Extract Data Between Text Boundaries
- stri_extract: Extract Pattern Occurrences
- stri_flatten: Flatten a String
- stri_info:
- stri_isempty: Determine if a String is of Length Zero
- stri_join_list: Concatenate Strings in a List
- stri_join: Concatenate Character Vectors
- stri_length: Count the Number of Code Points
- stri_list2matrix: Convert a List to a Character Matrix
- stri_locale_info: Query Given Locale
- stri_locale_list: List Available Locales
- stri_locale_set:
- stri_locate_boundaries: Locate Text Boundaries
- stri_locate: Locate Pattern Occurrences
- stri_match: Extract Regex Pattern Matches, Together with Capture Groups
- stri_na2empty: Replace NAs with Empty Strings
- stri_numbytes: Count the Number of Bytes
- stri_opts_brkiter: Generate a List with BreakIterator Settings
- stri_opts_collator: Generate a List with Collator Settings
- stri_opts_fixed: Generate a List with Fixed Pattern Search Engine’s Settings
- stri_opts_regex: Generate a List with Regex Matcher Settings
- stri_order: Ordering Permutation
- stri_pad: Pad (Center/Left/Right Align) a String
- stri_rand_lipsum: A Lorem Ipsum Generator
- stri_rand_shuffle: Randomly Shuffle Code Points in Each String
- stri_rand_strings: Generate Random Strings
- stri_rank: Ranking
- stri_read_lines: Read Text Lines from a Text File
- stri_read_raw: Read Text File as Raw
- stri_remove_empty: Remove All Empty Strings from a Character Vector
- stri_replace_na: Replace Missing Values in a Character Vector
- stri_replace_rstr: Convert gsub-Style Replacement Strings
- stri_replace: Replace Pattern Occurrences
- stri_reverse: Reverse Each String
- stri_sort_key: Sort Keys
- stri_sort: String Sorting
- stri_split_boundaries: Split a String at Text Boundaries
- stri_split_lines: Split a String Into Text Lines
- stri_split: Split a String By Pattern Matches
- stri_sprintf: Format Strings
- stri_startsendswith: Determine if the Start or End of a String Matches a Pattern
- stri_stats_general: General Statistics for a Character Vector
- stri_stats_latex: Statistics for a Character Vector Containing LaTeX Commands
- stri_sub_all: Extract or Replace Multiple Substrings
- stri_sub: Extract a Substring From or Replace a Substring In a Character Vector
- stri_subset: Select Elements that Match a Given Pattern
- stri_timezone_info: Query a Given Time Zone
- stri_timezone_list: List Available Time Zone Identifiers
- stri_timezone_set:
- stri_trans_casemap: Transform Strings with Case Mapping or Folding
- stri_trans_char: Translate Characters
- stri_trans_general: General Text Transforms, Including Transliteration
- stri_trans_list: List Available Text Transforms and Transliterators
- stri_trans_nf: Perform or Check For Unicode Normalization
- stri_trim: Trim Characters from the Left and/or Right Side of a String
- stri_unescape_unicode: Un-escape All Escape Sequences
- stri_unique: Extract Unique Elements
- stri_width: Determine the Width of Code Points
- stri_wrap: Word Wrap Text to Format Paragraphs
- stri_write_lines: Write Text Lines to a Text File
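
The search-related entries above come in several engine-specific variants (fixed, regex, coll, charclass, and boundaries). As a minimal sketch of how they differ, the example below counts matches in one made-up sentence with each engine; the input string and variable names are illustrative assumptions only.

```r
library(stringi)

x <- "Spam, spam, SPAM and eggs"

# Fixed patterns: exact code point comparison, case-sensitive by default
n_fixed    <- stri_count_fixed(x, "spam")                          # 1
n_fixed_ci <- stri_count_fixed(x, "spam", case_insensitive=TRUE)   # 3

# ICU regular expressions: whole-word matches, ignoring case
n_regex <- stri_count_regex(x, "\\bspam\\b", case_insensitive=TRUE)  # 3

# Collator-based search: strength=2 ignores case differences
n_coll <- stri_count_coll(x, "spam", strength=2)                   # 3

# Character classes: count upper-case letters (S, S, P, A, M)
n_upper <- stri_count_charclass(x, "\\p{Lu}")                      # 5

# Text boundary analysis: count words
n_words <- stri_count_words(x)                                     # 5
```

The extra named arguments (case_insensitive, strength) are forwarded to the corresponding stri_opts_fixed, stri_opts_regex, and stri_opts_collator settings lists.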