stri_encode: Convert Strings Between Given Encodings
These functions convert strings between encodings. They aim to serve as a more portable and faster replacement for R's own iconv.
stri_encode(str, from = NULL, to = NULL, to_raw = FALSE)
stri_conv(str, from = NULL, to = NULL, to_raw = FALSE)
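str: a character vector, a raw vector, or a list of raw vectors to be converted
from: the input encoding; NULL or '' to use the default encoding or the strings' encoding marks, otherwise a single string with an encoding name
to: the target encoding; NULL or '' for the default encoding, otherwise a single string with an encoding name
to_raw: a single logical value; indicates whether a list of raw vectors rather than a character vector should be returned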
stri_conv is an alias for stri_encode.
If from is either missing, '', or NULL, and if str is a character vector, then the marked encodings are used (see stri_enc_mark); in such a case, bytes-declared strings are disallowed. Otherwise, i.e., if str is a raw-type vector or a list of raw vectors, we assume that the input encoding is the current default encoding as given by stri_enc_get.
However, if from is given explicitly, the internal encoding declarations are always ignored.
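The difference between relying on the encoding mark and passing from explicitly can be sketched as follows. This is only an illustration: the single byte 0xB1 and the encoding names are chosen for the example (0xB1 is the plus-minus sign in ISO-8859-1 and the letter a-with-ogonek in ISO-8859-2), and it assumes both converters are available in the installed ICU.

```r
library(stringi)

# One byte, 0xB1, declared as latin1 via R's encoding mark
x <- rawToChar(as.raw(0xb1))
Encoding(x) <- "latin1"

# from omitted: the latin1 mark decides how the byte is read
y_marked <- stri_encode(x, to = "UTF-8")

# from given explicitly: the latin1 mark is ignored and the byte
# is read as ISO-8859-2 instead
y_explicit <- stri_encode(x, from = "ISO-8859-2", to = "UTF-8")

y_marked == "\u00b1"    # plus-minus sign
y_explicit == "\u0105"  # a with ogonek
```

The same input bytes thus yield two different strings, which is why an explicit from should only be supplied when the marks are known to be wrong or absent.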
For to_raw=FALSE, the output strings always have the encodings marked according to the target converter used (as specified by to) and the current default Encoding (ASCII, latin1, UTF-8, native, or bytes in all other cases).
Note that some issues might occur if to indicates, e.g., UTF-16 or UTF-32, as the output strings may have embedded NULs. In such cases, please use to_raw=TRUE and consider specifying a byte order marker (BOM) for portability reasons (e.g., set UTF-16 or UTF-32, which automatically adds the BOMs).
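A minimal sketch of the UTF-16 case, assuming the ICU converter names "UTF-16" (BOM-prefixed) and "UTF-16LE" (fixed byte order, no BOM). The byte order chosen for plain "UTF-16" is not asserted here, since it may depend on the platform.

```r
library(stringi)

# UTF-16 output contains NUL bytes, so ask for raw vectors
with_bom <- stri_encode("a", to = "UTF-16", to_raw = TRUE)
no_bom <- stri_encode("a", to = "UTF-16LE", to_raw = TRUE)

# A list with one raw vector per input string
length(with_bom[[1]])  # BOM (2 bytes) plus one 2-byte code unit
no_bom[[1]]            # just the code unit: 61 00

# The BOM lets the bytes be decoded again without stating the byte order
back <- stri_encode(with_bom, from = "UTF-16", to = "UTF-8")
```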
Note that stri_encode(as.raw(data), 'encodingname') is a clever substitute for rawToChar.
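To illustrate the comparison with rawToChar, here is a small sketch that uses hypothetical input bytes (the UTF-8 encoding of "café"). The target encoding is set to UTF-8 explicitly so that the result does not depend on the session's default encoding.

```r
library(stringi)

# UTF-8 bytes for "café" (illustrative data)
bytes <- as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9))

# Base R: the bytes are copied as-is and the result carries no encoding mark
s_base <- rawToChar(bytes)
Encoding(s_base)     # "unknown"

# stringi: the bytes are decoded from the stated encoding and the result is marked
s_stringi <- stri_encode(bytes, from = "UTF-8", to = "UTF-8")
Encoding(s_stringi)  # "UTF-8"
```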
In the current version of stringi, if an incorrect code point is found on input, it is replaced with the default (for that target encoding) 'missing/erroneous' character (with a warning), e.g., the SUBSTITUTE character (U+001A) or the REPLACEMENT one (U+FFFD). Occurrences thereof can be located in the output string to diagnose the problematic sequences, e.g., by calling: stri_locate_all_regex(converted_string, '[\\ufffd\\u001a]').
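The diagnostic step described above might look like the following sketch. The input bytes are made up for the example (0xFF can never occur in valid UTF-8), and the exact substitute character produced is left unasserted, because it depends on the converters involved.

```r
library(stringi)

# "a", an invalid byte, "b" (illustrative malformed UTF-8 input)
bad <- as.raw(c(0x61, 0xff, 0x62))

# The conversion is expected to warn and insert a substitute character
converted <- suppressWarnings(stri_encode(bad, from = "UTF-8", to = "UTF-8"))

# Locate U+FFFD or U+001A in the output to find the damaged positions
hits <- stri_locate_all_regex(converted, "[\\ufffd\\u001a]")
hits[[1]]  # matrix of start/end positions, one row per occurrence
```

In real use, the warning should not be suppressed; it is silenced here only to keep the sketch quiet.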
Because of the way this function is currently implemented, the maximal size of a single string to be converted cannot exceed ~0.67 GB.
If to_raw is FALSE, then a character vector with encoded strings (and appropriate encoding marks) is returned. Otherwise, a list of vectors of type raw is produced.
Conversion – ICU User Guide, https://unicode-org.github.io/icu/userguide/conversion/
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02