What Is New in stringi
1.7.7 (2022-07-02)
[DOCUMENTATION] Paper on stringi has been published in the Journal of Statistical Software; see https://doi.org/10.18637/jss.v103.i02.
[BUGFIX] #473, #397: Fixed buffer overflow in stri_dup. stri_dup, stri_paste, … fail more gracefully on attempts to generate strings of length >= 2^31 each.
[BUILD TIME] #480: Using Rf_isNull instead of isNull.
[DOCUMENTATION] #462: That the numeric=TRUE collator does not handle negative numbers correctly is now mentioned in the manual.
1.7.6 (2021-11-29)
[BUILD TIME] #463: Added loongarch support in ICU’s double conversion (@liuxiang88).
[BUGFIX] #467: The UCRT build on Windows was not marking strings as latin1.
1.7.5 (2021-10-04)
[DOCUMENTATION] Paper on stringi has been accepted for publication in the Journal of Statistical Software, see https://stringi.gagolewski.com/_static/vignette/stringi.pdf for a draft version.
[DOCUMENTATION] The stringi website at https://stringi.gagolewski.com now features a comprehensive tutorial based on the aforementioned paper.
[DOCUMENTATION] The ICU Project site has been moved to https://icu.unicode.org/.
[BUILD TIME] #457: The autoconf macros AC_LANG_CPLUSPLUS and AC_TRY_COMPILE were obsolete.
[BUGFIX] #458: Passing ALTREP objects no longer yields ‘embedded nul in string’ errors.
1.7.4 (2021-08-12)
[BUGFIX] #449: Fixed segfaults generated by stri_sprintf.
[BUILD TIME] No longer defining USE_RINTERNALS and R_NO_REMAP.
1.7.3 (2021-07-15)
[BUGFIX] Fixed the previous patch of ICU55 causing a build failure on, amongst others, CRAN’s Solaris-based target.
1.7.2 (2021-07-14)
[BUGFIX] Workaround for a bug in tools::checkFF failing when NA_character_ is passed to .Call.
1.7.1 (2021-07-14)
[BACKWARD INCOMPATIBILITY] %s$% and %stri$% now use the new stri_sprintf (see below) function instead of base::sprintf.
[BACKWARD INCOMPATIBILITY, NEW FEATURE] In stri_sub<- and stri_sub_all<-, providing a negative length from now on does not result in the corresponding input string being altered.
[BACKWARD INCOMPATIBILITY, NEW FEATURE] In stri_sub and stri_sub_all, negative length results in the corresponding output being NA or not extracted at all, depending on the setting of the new argument ignore_negative_length.
[BACKWARD INCOMPATIBILITY, BUGFIX, NEW FEATURE] In stri_subset* and their replacement versions, pattern and value cannot be longer than str (but now they are recycled if necessary).
[BACKWARD INCOMPATIBILITY, NEW FEATURE] stri_sub* now accept the from argument being a matrix like cbind(from, length=length). Unnamed columns or any other names are still interpreted as cbind(from, to). Also, the new argument use_matrix can be used to disable the special treatment of such matrices.
[DOCUMENTATION] It has been clarified that the syntax of *_charclass (e.g., used in stri_trim*) differs slightly from regex character classes.
[NEW FEATURE] #420: stri_sprintf (alias: stri_string_format) is a Unicode-aware replacement for and enhancement of the base sprintf: it adds a customised handling of NAs (on demand), computing field size based on code point width, outputting substrings of at most given width, variable width and precision (both at the same time), etc. Moreover, stri_printf can be used to display formatted strings conveniently.
[NEW FEATURE] #153: stri_match_*_regex now extract capture group names.
[NEW FEATURE] #25: stri_locate_*_regex now have a new argument, capture_groups, which allows for extracting positions of matches to parenthesised subexpressions.
[NEW FEATURE] stri_locate_* now have a new argument, get_length, whose setting may result in generating from-length matrices (instead of from-to ones).
[NEW FEATURE] #438: stri_trans_general now supports rule-based as well as reverse-direction transliteration.
[NEW FEATURE] #434: stri_datetime_format and stri_datetime_parse are now vectorised also with respect to the format argument.
[NEW FEATURE] stri_datetime_fstr has a new argument, ignore_special, which defaults to TRUE for backward compatibility.
[NEW FEATURE] stri_datetime_format, stri_datetime_add, and stri_datetime_fields now call as.POSIXct more eagerly.
[NEW FEATURE] stri_trim* now have a new argument, negate.
[NEW FEATURE] stri_replace_rstr converts gsub-style replacement strings to stri_replace-style.
[INTERNAL] stri_prepare_arg* have been refactored; buffer overruns in the exception handling subsystem are now avoided.
[BUGFIX] A few functions (stri_length, stri_enc_toutf32, etc.) did not throw an exception on an invalid UTF-8 byte sequence (and merely issued a warning instead).
[BUGFIX] stri_datetime_fstr did not honour NA_character_ and did not parse format strings such as "%Y%m%d" correctly. It has now been completely rewritten (in C).
[BUGFIX] stri_wrap did not recognise the width of certain Unicode sequences correctly.
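Example (an illustrative sketch, not part of the original release notes): assuming stringi 1.7.1 or later is installed, the calls below show the new formatting function, the operator built on it, and the from-length matrix convention. The variable names are made up for the illustration.

```r
library(stringi)

# Field widths are computed from code point widths.
padded <- stri_sprintf("%5s|", "abc")

# Missing values propagate by default; na_string asks for a custom rendering.
na_default <- stri_sprintf("%s", NA_character_)
na_custom  <- stri_sprintf("%s", NA_character_, na_string="<NA>")

# The %s$% operator passes a list of arguments on to stri_sprintf.
formatted <- "%s=%d" %s$% list("n", 3L)

# A two-column matrix whose second column is named "length" is read as
# (from, length); a matrix with other column names is read as (from, to).
from <- 4
by_length <- stri_sub("abc123", cbind(from, length=3))
by_to     <- stri_sub("abc123", cbind(from, 6))
```

Both stri_sub calls are meant to select the same three trailing digits; only the meaning of the second column differs.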
1.6.2 (2021-05-14)
[BACKWARD INCOMPATIBILITY] In stri_enc_list(), simplify now defaults to TRUE.
[NEW FEATURE] #425: The outputs of stri_enc_list(), stri_locale_list(), stri_timezone_list(), and stri_trans_list() are now sorted.
[NEW FEATURE] #428: In stri_flatten, na_empty=NA now omits missing values.
[BUILD TIME] #431: Pre-4.9.0 GCC has ::max_align_t, but not std::max_align_t; added a (possible) workaround, see the INSTALL file.
[BUGFIX] #429: stri_width() misclassified the width of certain code points (including grave accent, Eszett, etc.); General category Sk (Symbol, modifier) is no longer of width 0, and UCHAR_EAST_ASIAN_WIDTH of U_EA_AMBIGUOUS is no longer of width 2.
[BUGFIX] #354: ALTREP CHARSXPs were not copied, and thus could have been garbage collected in the so-called meanwhile (with thanks to @jimhester).
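Example (an illustrative sketch, not part of the original release notes): the three settings of na_empty in stri_flatten, assuming stringi 1.6.2 or later.

```r
library(stringi)

x <- c("a", NA, "b")

# Default: a single missing value makes the whole result missing.
flat_default <- stri_flatten(x, collapse=",")

# na_empty=TRUE treats missing values as empty strings.
flat_empty <- stri_flatten(x, collapse=",", na_empty=TRUE)

# na_empty=NA drops missing values before concatenating.
flat_omit <- stri_flatten(x, collapse=",", na_empty=NA)
```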
1.6.1 (2021-05-05)
[GENERAL] #401: stringi is now bundled with ICU4C 69.1 (upgraded from 61.1), which is used on most Windows and OS X builds as well as on *nix systems not equipped with system ICU. However, if the C++11 support is disabled, stringi will be built against the battle-tested ICU4C 55.1. The update to ICU brings Unicode 13.0 and CLDR 39 support.
[DOCUMENTATION] A draft version of a paper on stringi is now available at https://stringi.gagolewski.com/_static/vignette/stringi.pdf.
[GENERAL] stringi now requires R >= 3.1 (CXX_STD of CXX11 or CXX1X).
[NEW FEATURE] #408: stri_trans_casefold() performs case folding; this is different from case mapping, which is locale-dependent. Folding makes two pieces of text that differ only in case identical. This can come in handy when comparing strings.
[NEW FEATURE] #421: stri_rank() ranks strings in a character vector (e.g., for ordering data frames with regard to multiple criteria, the ranks can be passed to order(), see #219).
[NEW FEATURE] #266: stri_width() now supports emojis.
[NEW FEATURE] %s$% and %stri$% are now vectorised with respect to both arguments.
[BUGFIX] stri_sort_key() now outputs bytes-encoded strings.
[BUGFIX] #415: locale='' was not equivalent to locale=NULL in stri_opts_collator().
[INTERNAL] #414: Use LEVELS(x) macro instead of accessing (x)->sxpinfo.gp directly (@lukaszdaniel).
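Example (an illustrative sketch, not part of the original release notes): case folding for caseless comparison and stri_rank as a sort criterion inside order(), assuming stringi 1.6.1 or later. The data frame and its column names are invented for the illustration.

```r
library(stringi)

# Two strings that differ only in case become identical after folding.
same_after_folding <-
    stri_trans_casefold("STRINGI") == stri_trans_casefold("stringi")

# Order rows by descending score, breaking ties by collated name.
df <- data.frame(
    name=c("b", "a", "c"),
    score=c(1, 2, 2),
    stringsAsFactors=FALSE
)
ordered <- df[order(-df$score, stri_rank(df$name)), ]
```

Passing ranks to order() is one way of combining a locale-aware string criterion with numeric ones, which order() alone does not offer.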
1.5.3 (2020-09-04)
[DOCUMENTATION] stringi home page has moved to https://stringi.gagolewski.com and now includes a comprehensive reference manual.
[NEW FEATURE] #400: %s$% and %stri$% are now binary operators that call base R’s sprintf().
[NEW FEATURE] #399: The %s*% and %stri*% operators can be used in addition to stri_dup(), for the very same purpose.
[NEW FEATURE] #355: stri_opts_regex() now accepts the time_limit and stack_limit options so as to prevent malformed or malicious regexes from running for too long.
[NEW FEATURE] #345: stri_startswith() and stri_endswith() are now equipped with the negate parameter.
[NEW FEATURE] #382: Incorrect regexes are now reported to ease debugging.
[DEPRECATION WARNING] #347: Any unknown option passed to stri_opts_fixed(), stri_opts_regex(), stri_opts_coll(), and stri_opts_brkiter() now generates a warning. In the future, the ... parameter will be removed, so that will be an error.
[DEPRECATION WARNING] stri_duplicated()’s fromLast argument has been renamed from_last. fromLast is now its alias scheduled for removal in a future version of the package.
[DEPRECATION WARNING] stri_enc_detect2() is scheduled for removal in a future version of the package. Use stri_enc_detect() or the more targeted stri_enc_isutf8(), stri_enc_isascii(), etc., instead.
[DEPRECATION WARNING] stri_read_lines(), stri_write_lines(), stri_read_raw(): use con argument instead of fname now. The argument fallback_encoding is scheduled for removal and is no longer used. stri_read_lines() does not support encoding="auto" anymore.
[DEPRECATION WARNING] nparagraphs in stri_rand_lipsum() has been renamed n_paragraphs.
[NEW FEATURE] #398: Alternative, British spelling of function parameters has been introduced, e.g., stri_opts_coll() now supports both normalization and normalisation.
[NEW FEATURE] #393: stri_read_bin(), stri_read_lines(), and stri_write_lines() are no longer marked as draft API.
[NEW FEATURE] #187: stri_read_bin(), stri_read_lines(), and stri_write_lines() now support connection objects as well.
[NEW FEATURE] #386: New function stri_sort_key() for generating locale-dependent sort keys which can be ordered at the byte level and return an equivalent ordering to the original string (@DavisVaughan).
[BUGFIX] #138: stri_encode() and stri_rand_strings() now can generate strings of much larger lengths.
[BUGFIX] stri_wrap() did not honour indent correctly when use_width was TRUE.
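Example (an illustrative sketch, not part of the original release notes): the duplication operator, the negate parameter, and a regex time limit, assuming stringi 1.5.3 or later. The limit value of 1000 is an arbitrary choice for the illustration.

```r
library(stringi)

# %s*% duplicates a string, just like stri_dup().
tripled <- "ab" %s*% 3

# negate=TRUE flags the strings that do NOT start with the pattern.
not_a <- stri_startswith_fixed(c("apple", "banana"), "a", negate=TRUE)

# A time limit guards against regexes that would otherwise run for too long;
# a simple pattern like this one finishes well within it.
opts <- stri_opts_regex(time_limit=1000L)
found <- stri_detect_regex("aaa", "a+", opts_regex=opts)
```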
1.4.6 (2020-02-17)
[BACKWARD INCOMPATIBILITY] #369: stri_c() now returns an empty string when input is empty and collapse is set.
[BUGFIX] #370: Fixed an issue in stri_prepare_arg_POSIXct() reported by rchk.
[DOCUMENTATION] #372: Documented arguments not in \usage in documentation object stri_datetime_format: ...
1.4.5 (2020-01-11)
[BUGFIX] #366: The fix for #363 required ICU >= 55.
1.4.4 (2020-01-06)
[BUGFIX] #348: Avoid copying 0 bytes to a nil-buffer in stri_sub_all().
[BUGFIX] #362: Removed configure variable CXXCPP as it is now deprecated.
[BUGFIX] #318: PROTECTing objects from gcing as reported by rchk.
[BUGFIX] #344, #364: Removed compiler warnings in icu61/common/cstring.h.
[BUGFIX] #363: Status of RegexMatcher is now checked after its use.
1.4.3 (2019-03-12)
[NEW FEATURE] #30: New function stri_sub_all() - a version of stri_sub() accepting list from/to/length arguments for extracting multiple substrings from each string in a character vector.
[NEW FEATURE] #30: New function stri_sub_all<-() (and its %>%-friendly version, stri_sub_replace_all()) - for replacing multiple substrings with corresponding replacement strings.
[NEW FEATURE] In stri_sub_replace(), value parameter has a new alias, replacement.
[NEW FEATURE] New convenience functions based on stri_remove_empty(): stri_omit_empty_na(), stri_remove_empty_na(), stri_omit_empty(), and also stri_remove_na(), stri_omit_na().
[BUGFIX] #343: stri_trans_char() did not yield correct results for overlapping pattern and replacement strings.
[WARNFIX] #205: configure.ac is now included in the source bundle.
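Example (an illustrative sketch, not part of the original release notes): extracting several substrings per string with stri_sub_all and filtering with the new convenience functions, assuming stringi 1.4.3 or later. The positions are chosen arbitrarily.

```r
library(stringi)

x <- c("abcdef", "xyz")

# One vector of start positions and one of lengths per input string.
pieces <- stri_sub_all(x,
    from=list(c(1, 4), 2),
    length=list(c(2, 3), 1)
)

y <- c("a", "", NA, "b")
kept  <- stri_omit_empty_na(y)  # drops both "" and NA
no_na <- stri_omit_na(y)        # drops NA only
```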
1.3.1 (2019-02-10)
[BACKWARD INCOMPATIBILITY] #335: A fix to #314 prevented (by design) the use of the system ICU if the library had been compiled with U_CHARSET_IS_UTF8=1. However, this is the default setting in libicu >= 61. From now on, in such cases the system ICU is used more eagerly, but stri_enc_set() issues a warning stating that the default (UTF-8) encoding cannot be changed.
[NEW FEATURE] #232: All stri_detect_* functions now have the max_count argument that allows for, e.g., stopping at the first pattern occurrence.
[NEW FEATURE] #338: stri_sub_replace() is now an alias for stri_sub<-() which makes it much more easily pipable (@yutannihilation, @BastienFR).
[NEW FEATURE] #334: Added missing icudt61b.dat to support big-endian platforms (thanks to Dimitri John Ledkov @xnox).
[BUGFIX] #296: Out-of-the-box build used to fail on CentOS 6; upgraded configure to --disable-cxx11 more eagerly at an early stage.
[BUGFIX] #341: Fixed possible buffer overflows when calling strncpy() from within ICU 61.
[BUGFIX] #325: Made configure more portable so that it works under /bin/dash now.
[BUGFIX] #319: Fixed overflow in stri_rand_shuffle().
[BUGFIX] #337: Search functions (e.g., stri_split_regex() and stri_count_fixed()) used to raise too many warnings on empty search patterns.
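Example (an illustrative sketch, not part of the original release notes): stopping a detection scan after the first hit and using stri_sub_replace as a functional form of the replacement operator. The replacement argument name assumes stringi 1.4.3 or later.

```r
library(stringi)

# With max_count=1 the scan stops once one match has been found;
# elements after that point are not inspected.
hits <- stri_detect_fixed(c("b", "a", "a", "a"), "a", max_count=1)
first_hit <- which(hits)[1]

# stri_sub_replace returns the modified copy, so it can sit in a pipeline.
replaced <- stri_sub_replace("abcdef", from=1, to=3, replacement="X")
```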
1.2.4 (2018-07-20)
[BUGFIX] #314: Testing U_CHARSET_IS_UTF8 in configure when using pkg-build.
[BUILD TIME] #317: Included icudt61l.zip in the source bundle to solve the frequent icudt download failed error (also on CRAN’s windows-release and windows-oldrel). (Reverted in version 1.3.1; the winbuilder errors were caused by a build chain bug.)
1.2.3 (2018-05-16)
[BUGFIX] #296: Fixed the behaviour of the configure script on CentOS 6.
[BUGFIX] Fixed broken Windows build by updating the icudt mirror list.
1.2.2 (2018-05-01)
[GENERAL] #193: stringi is now bundled with ICU4C 61.1, which is used on most Windows and OS X builds as well as on *nix systems not equipped with ICU. However, if the C++11 support is disabled, stringi will be built against ICU4C 55.1. The update to ICU brings Unicode 10.0 support, including new emoji characters.
[BUGFIX] #288: stri_match() did not return the correct number of columns when input was empty.
[NEW FEATURE] #188: stri_enc_detect() now returns a list of data frames.
[NEW FEATURE] #289: stri_flatten() now has na_empty and omit_empty arguments.
[NEW FEATURE] New functions: stri_remove_empty(), stri_na2empty().
[NEW FEATURE] #285: Coercion from a non-trivial list (one that consists of atomic vectors, each of length 1) to an atomic vector now issues a warning.
[WARN] Removed -Wparentheses warnings in icu55/common/cstring.h:38:63 and icu55/i18n/windtfmt.cpp in the ICU4C 55.1 bundle.
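Example (an illustrative sketch, not part of the original release notes): the helpers for empty and missing strings and the list-of-data-frames result of stri_enc_detect, assuming a recent stringi release.

```r
library(stringi)

x <- c("a", "", NA, "b")

nonempty <- stri_remove_empty(x)  # drops "" but keeps NA by default
filled   <- stri_na2empty(x)      # turns NA into ""

# omit_empty=TRUE skips empty strings when concatenating.
joined <- stri_flatten(c("a", "", "b"), collapse="+", omit_empty=TRUE)

# One data frame of encoding guesses per input string.
guess <- stri_enc_detect("Hello, world!")
```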
1.1.7 (2018-03-06)
[BUGFIX] Fixed ICU4C 55.1 generating some significant warnings (icu55/i18n/winnmfmt.cpp) and suppressing important diagnostics (src/icu55/i18n/decNumber.c).
1.1.6 (2017-11-10)
[WINDOWS SPECIFIC] #270: Strings marked with latin1 encoding are now converted internally to UTF-8 using the WINDOWS-1252 codec. This fixes problems with, among others, displaying the Euro sign.
[NEW FEATURE] #263: Added support for custom rule-based break iteration, see ?stri_opts_brkiter.
[NEW FEATURE] #267: omit_na=TRUE in stri_sub<-() now ignores missing values in any of the arguments provided.
[BUGFIX] Fixed unPROTECTed variable names and stack imbalances as reported by rchk.
1.1.5 (2017-04-07)
[GENERAL] stringi now requires ICU4C >= 52.
[BUGFIX] Fixed errors pointed out by clang-UBSAN in stri_brkiter.h.
[GENERAL] stringi now requires R >= 2.14.
[BUILD TIME] #238, #220: Now trying standard ICU4C build flags if a call to pkg-config fails.
[BUILD TIME] #258: Use CXX11 instead of CXX1X on R >= 3.4.
[BUILD TIME, BUGFIX] #254: dir.exists() is available only in R >= 3.2.
1.1.3 (2017-03-21)
[REMOVE DEPRECATED] stri_install_check() and stri_install_icudt() marked as deprecated in stringi 0.5-5 are no longer being exported.
[BUGFIX] #227: Incorrect behaviour of stri_sub() and stri_sub<-() if the empty string was the result.
[BUILD TIME] #231: The configure (Linux/Unix only) script now reads the following environment variables: STRINGI_CFLAGS, STRINGI_CPPFLAGS, STRINGI_CXXFLAGS, STRINGI_LDFLAGS, STRINGI_LIBS, STRINGI_DISABLE_CXX11, STRINGI_DISABLE_ICU_BUNDLE, STRINGI_DISABLE_PKG_CONFIG, PKG_CONFIG; see INSTALL for more information.
[BUILD TIME] #253: Call to R_useDynamicSymbols() added.
[BUILD TIME] #230: icudt is now being downloaded by configure (*NIX only) before building.
[BUILD TIME] #242: _COUNT/_LIMIT enum constants have been deprecated as of ICU 58.2; stringi code has been upgraded accordingly.
1.1.2 (2016-09-30)
[BUGFIX] round() and snprintf() are not part of C++98.
1.1.1 (2016-05-25)
[BUGFIX] #214: Allow a regex pattern like .* to match an empty string.
[BUGFIX] #210: stri_replace_all_fixed(c("1", "NULL"), "NULL", NA) now results in c("1", NA).
[NEW FEATURE] #199: stri_sub<-() now allows for ignoring NA locations (a new omit_na argument added).
[NEW FEATURE] #207: stri_sub<-() now allows for substring insertions (via length=0).
[NEW FUNCTION] #124: stri_subset<-() functions added.
[NEW FEATURE] #216: stri_detect(), stri_subset(), stri_subset<-() now all have the negate argument.
[NEW FUNCTION] #175: stri_join_list() concatenates all strings in a list of character vectors. Useful in conjunction with, e.g., stri_extract_all_regex(), stri_extract_all_words(), etc.
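Example (an illustrative sketch, not part of the original release notes): substring insertion, NA-tolerant replacement, negated subsetting, and list joining, assuming a recent stringi release. The sample strings are arbitrary.

```r
library(stringi)

# length=0 selects an empty range, so the value is inserted, not substituted.
x <- "abcdef"
stri_sub(x, 3, length=0) <- "-"

# With omit_na=TRUE, a missing location leaves the string untouched.
y <- c("abc", "def")
stri_sub(y, c(1, NA), length=1, omit_na=TRUE) <- "Z"

fruit <- c("apple", "banana", "cherry")
without_an <- stri_subset_fixed(fruit, "an", negate=TRUE)

# The replacement form overwrites the matching elements in place.
stri_subset_fixed(fruit, "an") <- "matched"

# Each character vector in the list is concatenated separately.
joined <- stri_join_list(list(c("a", "b"), "c"), sep="-")
```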
1.0-1 (2015-10-22)
[GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see https://github.com/gagolews/ExampleRcppStringi for an example.
[BUGFIX] #183: Floating point exception raised in stri_sub() and stri_sub<-() when to or length was a zero-length numeric vector.
[BUGFIX] #180: stri_c() warned incorrectly (recycling rule) when using more than two elements.
0.5-5 (2015-06-28)
[BACKWARD INCOMPATIBILITY] stri_install_check() and stri_install_icudt() are now deprecated. From now on they are supposed to be used only by the stringi installer.
[BUGFIX] #176: A patch for sys/feature_tests.h is no longer included (the original file was copyrighted by Sun Microsystems); fixed the "Compiler or options invalid for pre-Unix 03 X/Open applications and pre-2001 POSIX applications" error by forcing (conditionally) _XPG6 conformance.
[BUGFIX] #174: stri_paste() did not generate any warning when the recycling rule was violated and sep=="".
[BUGFIX] #170: icu::setDataDirectory is no longer called if our ICU source bundle is not used (this used to cause build problems on openSUSE).
[BUILD TIME] #169: configure now tries to switch to the standard C++ compiler if a C++11 one is not configured correctly.
[BUILD TIME] configure.win (Biarch: TRUE) now mimics autoconf’s AC_SUBST and AC_CONFIG_FILES so that the build process is now more similar across different platforms.
[NEW FEATURE] stri_info() now also gives information about which version of ICU4C is in use (system or bundle).
0.5-2 (2015-06-21)
[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
[GENERAL] #69: stringi is now bundled with ICU4C 55.1.
[NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
[NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.
[NEW FUNCTION] #8: stri_width() approximates the width of a string in a more Unicode-ish fashion than nchar(..., "width").
[NEW FEATURE] #149: stri_pad() and stri_wrap() are now (by default) based on code point widths instead of the number of code points. Moreover, the default behaviour of stri_wrap() is now such that it does not get rid of non-breaking, zero width, etc., spaces.
[NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
[NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
[NEW FUNCTIONS] #137: Date-time formatting/parsing: stri_timezone_list() - lists all known time zone identifiers; stri_timezone_set(), stri_timezone_get() - manage the current default time zone; stri_timezone_info() - basic information on a given time zone; stri_datetime_symbols() - gives localizable date-time formatting data; stri_datetime_fstr() - converts a strptime-like format string to an ICU date/time format string; stri_datetime_format() - converts date/time to string; stri_datetime_parse() - converts string to date/time object; stri_datetime_create() - constructs date-time objects from numeric representations; stri_datetime_now() - returns current date-time; stri_datetime_fields() - returns date-time fields’ values; stri_datetime_add() - adds specific number of date-time units to a date-time object.
[GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).
[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast, e.g., on glibc using the SSE2/3/4 instruction set.
[BUILD TIME] #141: A local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
[BUILD TIME] #165: The configure option --disable-icu-bundle forces the use of system ICU when building the package.
[BUGFIX] Locale specifiers are now normalized in a more intelligent way: e.g., @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
[BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
[BUGFIX] #132: Incorrect behaviour in stri_locate_regex() for matches of zero lengths.
[BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with simplify=FALSE argument.
[BUGFIX] #164: Using libicu-dev failed on Ubuntu (LIBS shall be passed after LDFLAGS and the list of .o files).
[BUGFIX] #168: Build now fails if icudt is not available.
[BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type which is not part of the C++98 standard.
[BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
[BUGFIX] Force ICU u_init() call on the stringi dynlib load.
[BUGFIX] #157: Many overfull hboxes in the package PDF manual have been corrected.
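Example (an illustrative sketch, not part of the original release notes): character translation, width-based padding, and the date-time functions, assuming a recent stringi release. The date and the UTC time zone are arbitrary choices made so that the result does not depend on the local settings.

```r
library(stringi)

# Like chartr(): "e" becomes "i" and "l" becomes "p".
hippo <- stri_trans_char("hello", "el", "ip")

# Padding targets a display width, given by the width argument.
padded <- stri_pad_left("7", width=3, pad="0")
w <- stri_width("abc")

# Build a date-time object and format it with an ICU pattern.
d <- stri_datetime_create(2015, 6, 21, tz="UTC")
iso <- stri_datetime_format(d, "yyyy-MM-dd", tz="UTC")
```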
0.4-1 (2014-12-11)
[IMPORTANT CHANGE] n_max argument in stri_split_*() has been renamed n.
[IMPORTANT CHANGE] simplify=TRUE in stri_extract_all_*() and stri_split_*() now calls stri_list2matrix() with fill="". fill=NA_character_ may be obtained by using simplify=NA.
[IMPORTANT CHANGE, NEW FUNCTIONS] #120: stri_extract_words() has been renamed stri_extract_all_words(), stri_locate_boundaries() has been renamed stri_locate_all_boundaries(), and stri_locate_words() has been renamed stri_locate_all_words(). New functions are now available: stri_locate_first_boundaries(), stri_locate_last_boundaries(), stri_locate_first_words(), stri_locate_last_words(), stri_extract_first_words(), stri_extract_last_words().
[IMPORTANT CHANGE] #111: opts_regex, opts_collator, opts_fixed, and opts_brkiter can now be supplied individually via the ... argument. In other words, you may now simply call, e.g., stri_detect_regex(str, pattern, case_insensitive=TRUE) instead of stri_detect_regex(str, pattern, opts_regex=stri_opts_regex(case_insensitive=TRUE)).
[NEW FEATURE] #110: Fixed pattern search engine’s settings can now be supplied via opts_fixed argument in stri_*_fixed(), see stri_opts_fixed(). A simple (not suitable for natural language processing) yet very fast case_insensitive pattern matching can be performed now. stri_extract_*_fixed() is again available.
[NEW FEATURE] #23: stri_extract_all_fixed(), stri_count(), and stri_locate_all_fixed() may now also look for overlapping pattern matches, see ?stri_opts_fixed.
[NEW FEATURE] #129: stri_match_*_regex() gained a cg_missing argument.
[NEW FEATURE] #117: stri_extract_all_*(), stri_locate_all_*(), stri_match_all_*() gained a new argument: omit_no_match. Setting it to TRUE makes these functions compatible with their stringr equivalents.
[NEW FEATURE] #118: stri_wrap() gained indent, exdent, initial, and prefix arguments. Moreover, Knuth’s dynamic word wrapping algorithm now assumes that the cost of printing the last line is zero, see #128.
[NEW FEATURE] #122: stri_subset() gained an omit_na argument.
[NEW FEATURE] stri_list2matrix() gained an n_min argument.
[NEW FEATURE] #126: stri_split() is now also able to act just like stringr::str_split_fixed().
[NEW FEATURE] #119: stri_split_boundaries() now has n, tokens_only, and simplify arguments. Additionally, stri_extract_all_words() is now equipped with the simplify argument.
[NEW FEATURE] #116: stri_paste() gained a new argument: ignore_null. Setting it to TRUE makes this function more compatible with paste().
[OTHER] #123: useDynLib is used to speed up symbol look-up in the compiled dynamic library.
[BUGFIX] #114: stri_paste() could return the result in an incorrect order.
[BUGFIX] #94: Run-time errors on Solaris caused by setting -DU_DISABLE_RENAMING=1 - memory allocation errors in, among others, the ICU UnicodeString. This setting also caused some ASAN sanity check failures within ICU code.
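Example (an illustrative sketch, not part of the original release notes): passing search options directly, limiting and simplifying splits, and omitting non-matches, assuming a recent stringi release.

```r
library(stringi)

# Options may be passed directly instead of through stri_opts_regex().
short <- stri_detect_regex("ABC", "abc", case_insensitive=TRUE)
long  <- stri_detect_regex("ABC", "abc",
    opts_regex=stri_opts_regex(case_insensitive=TRUE))

# n caps the number of pieces; the last piece keeps the remainder.
limited <- stri_split_fixed("a,b,c", ",", n=2)

# simplify=TRUE pads the matrix with "", simplify=NA pads with NA.
as_matrix    <- stri_split_fixed(c("a,b", "c"), ",", simplify=TRUE)
as_matrix_na <- stri_split_fixed(c("a,b", "c"), ",", simplify=NA)

# With omit_no_match=TRUE a non-match yields an empty character vector.
no_match <- stri_extract_all_regex("abc", "[0-9]", omit_no_match=TRUE)
```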
0.3-1 (2014-11-06)
[IMPORTANT CHANGE] #87: %>% overlapped with the pipe operator from the magrittr package; now each operator like %>% has been renamed %s>%.
[IMPORTANT CHANGE] #108: Now the BreakIterator (for text boundary analysis) may be more easily controlled via stri_opts_brkiter() (see options type and locale which aim to replace now-removed boundary and locale parameters to stri_locate_boundaries(), stri_split_boundaries(), stri_trans_totitle(), stri_extract_words(), and stri_locate_words()).
[NEW FUNCTIONS] #109: stri_count_boundaries() and stri_count_words() count the number of text boundaries in a string.
[NEW FUNCTIONS] #41: stri_startswith_*() and stri_endswith_*() determine whether a string starts or ends with a given pattern.
[NEW FEATURE] #102: stri_replace_all_*() now all have the vectorize_all parameter, which defaults to TRUE for backward compatibility.
[NEW FUNCTION] #91: Added stri_subset_*() - a convenient and more efficient substitute for str[stri_detect_*(str, ...)].
[NEW FEATURE] #100: stri_split_fixed(), stri_split_charclass(), stri_split_regex(), stri_split_coll() gained a tokens_only parameter, which defaults to FALSE for backward compatibility.
[NEW FUNCTION] #105: stri_list2matrix() converts lists of atomic vectors to character matrices, useful in conjunction with stri_split() and stri_extract().
[NEW FEATURE] #107: stri_split_*() now allow setting an omit_empty=NA argument.
[NEW FEATURE] #106: stri_split() and stri_extract_all() gained a simplify argument (if TRUE, then stri_list2matrix(..., byrow=TRUE) is called on the resulting list).
[NEW FUNCTION] #77: stri_rand_lipsum() generates a (pseudo)random dummy lorem ipsum text.
[NEW FEATURE] #98: stri_trans_totitle() gained an opts_brkiter parameter; it indicates which ICU BreakIterator should be used when case mapping.
[NEW FEATURE] stri_wrap() gained a new parameter: normalize.
[BUGFIX] #86: stri_*_fixed(), stri_*_coll(), and stri_*_regex() could give incorrect results if one of the search strings was of length 0.
[BUGFIX] #99: stri_replace_all() did not use the replacement arg.
[BUGFIX] #112: Some of the objects were not PROTECTed from garbage collection - this could have led to spontaneous SEGFAULTS.
[BUGFIX] Some collator’s options were not passed correctly to ICU services.
[BUGFIX] Memory leaks as detected by valgrind --tool=memcheck --leak-check=full have been removed.
[DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
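Example (an illustrative sketch, not part of the original release notes): prefix tests, subsetting, sequential replacement, list-to-matrix conversion, and word counting, assuming a recent stringi release. The sample strings are arbitrary.

```r
library(stringi)

starts <- stri_startswith_fixed(c("stringi", "stringr", "base"), "string")

# Same result as x[stri_detect_regex(x, "[0-9]")], in one call.
kept <- stri_subset_regex(c("a1", "b", "c2"), "[0-9]")

# vectorize_all=FALSE applies every pattern/replacement pair to each string.
seq_rep <- stri_replace_all_fixed("abc", c("a", "b"), c("1", "2"),
    vectorize_all=FALSE)

# Shorter rows are padded with NA by default.
m <- stri_list2matrix(list(c("a", "b"), "c"), byrow=TRUE)

n_words <- stri_count_words("Hello, brave new world!")
```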
0.2-5 (2014-05-16)
Some examples are no longer run if icudt is not available (this was reverted in a later version, though).
0.2-4 (2014-05-15)
[BUGFIX] Fixed issues with loading of misaligned addresses in stri_*_fixed().
0.2-3 (2014-05-14)
[IMPORTANT CHANGE] stri_cmp*() now do not allow for passing opts_collator=NA. From now on, stri_cmp_eq(), stri_cmp_neq(), and the new operators %===%, %!==%, %stri===%, and %stri!==% are locale-independent operations, which are based on code point comparisons. New functions stri_cmp_equiv() and stri_cmp_nequiv() (and from now on also %==%, %!=%, %stri==%, and %stri!=%) test for canonical equivalence.
[IMPORTANT CHANGE] stri_*_fixed() search functions now perform a locale-independent exact (byte-wise, of course after conversion to UTF-8) pattern search. All the Collator-based, locale-dependent search routines are now available via stri_*_coll(). The reason behind this is that ICU’s USearch currently has very poor performance. What is more, in many search tasks exact pattern matching is sufficient anyway.
[GENERAL] stri_*_fixed now use a tweaked Knuth-Morris-Pratt search algorithm which improves the search performance drastically.
[IMPORTANT CHANGE] stri_enc_nf*() and stri_enc_isnf*() function families have been renamed stri_trans_nf*() and stri_trans_isnf*(), respectively; they deal with text transforming, and not with character encoding. Note that all of these may be performed by ICU’s Transliterator too (see below).
[NEW FUNCTION] stri_trans_general() and stri_trans_list() give access to ICU’s Transliterator: they may be used to perform some generic text transforms, like Unicode normalisation, case folding, etc.
[NEW FUNCTION] stri_split_boundaries() uses ICU’s BreakIterator to split strings at specific text boundaries. Moreover, stri_locate_boundaries() indicates positions of these boundaries.
[NEW FUNCTION] stri_extract_words() uses ICU’s BreakIterator to extract all words from a text. Additionally, stri_locate_words() locates start and end positions of words in a text.
[NEW FUNCTION] stri_pad(), stri_pad_left(), stri_pad_right(), and stri_pad_both() pad a string with a specific code point.
[NEW FUNCTION] stri_wrap() breaks paragraphs of text into lines. Two algorithms (greedy and minimal raggedness) are available.
[IMPORTANT CHANGE] stri_*_charclass() search functions now rely solely on ICU’s UnicodeSet patterns. All the previously accepted charclass identifiers became invalid. However, new patterns should now be more familiar to the users (they are regex-like). Moreover, we observe a very nice performance gain.
[IMPORTANT CHANGE] stri_sort() now does not include NAs in output vectors by default, for compatibility with sort(). Moreover, currently none of the input vector’s attributes are preserved.
[NEW FUNCTION] stri_unique() extracts unique elements from a character vector.
[NEW FUNCTIONS] stri_duplicated() and stri_duplicated_any() determine duplicate elements in a character vector.
[NEW FUNCTION] stri_replace_na() replaces NAs in a character vector with a given string, useful for emulating, e.g., R’s paste() behaviour.
[NEW FUNCTION] stri_rand_shuffle() generates a random permutation of code points in a string.
[NEW FUNCTION] stri_rand_strings() generates random strings.
[NEW FUNCTIONS] New functions and binary operators for string comparison: stri_cmp_eq(), stri_cmp_neq(), stri_cmp_lt(), stri_cmp_le(), stri_cmp_gt(), stri_cmp_ge(), %==%, %!=%, %<%, %<=%, %>%, %>=%.
[NEW FUNCTION] stri_enc_mark() reads declared encodings of character strings as seen by stringi.
[NEW FUNCTION] stri_enc_tonative(str) is an alias to stri_encode(str, NULL, NULL).
[NEW FEATURE] stri_order() and stri_sort() now have an additional argument na_last (defaults to TRUE and NA, respectively).
[NEW FEATURE] stri_replace_all_charclass(), stri_extract_all_charclass(), and stri_locate_all_charclass() now have a new argument, merge (defaults to FALSE for backward-compatibility). It may be used to, e.g., replace sequences of white spaces with a single space.
[NEW FEATURE] stri_enc_toutf8() now has a new validate argument (which defaults to FALSE for backward-compatibility). It may be used in a (rare) case where a user wants to fix an invalid UTF-8 byte sequence. stri_length() (among others) now detects invalid UTF-8 byte sequences.
[NEW FEATURE] All binary operators %???% now also have aliases %stri???%.
[GENERAL] Performance improvements in StriContainerUTF8 and StriContainerUTF16 (they affect most other functions).
[GENERAL] Significant performance improvements in stri_join(), stri_flatten(), stri_cmp(), stri_trans_to*(), and others.
[GENERAL] Added 3rd mirror site for our icudt binary distribution. U_MISSING_RESOURCE_ERROR message in StriException now suggests calling stri_install_check().
[BUGFIX] UTF-8 BOMs are now silently removed from input strings.
[BUGFIX] No more attempts to re-encode UTF-8 encoded strings if native encoding is UTF-8 in StriContainerUTF8.
[BUGFIX] Possible memory leaks when throwing errors via Rf_error().
[BUGFIX] stri_order() and stri_cmp() could return incorrect results for opts_collator=NA.
[BUGFIX] stri_sort() did not guarantee to return strings in UTF-8.
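Example (an illustrative sketch, not part of the original release notes): code-point equality versus canonical equivalence, plus a few of the helper functions listed for this release. It assumes a recent stringi release, so it uses current function names; the sample strings are arbitrary.

```r
library(stringi)

precomposed <- "\u00e5"   # a with ring above, as a single code point
decomposed  <- "a\u030a"  # "a" followed by a combining ring above

# Code point comparison sees two different strings...
equal_codepoints <- stri_cmp_eq(precomposed, decomposed)
# ...while the equivalence test treats them as the same text.
equivalent <- stri_cmp_equiv(precomposed, decomposed)

uniq   <- stri_unique(c("a", "b", "a"))
dups   <- stri_duplicated(c("a", "b", "a"))
filled <- stri_replace_na(c("a", NA), "?")

# merge=TRUE collapses each run of white space into one replacement.
squeezed <- stri_replace_all_charclass("a  \t b", "\\p{WSPACE}", " ",
    merge=TRUE)

padded <- stri_pad_both("hi", 6, pad="*")
```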
0.1-25 (2014-03-12)
LICENSE tweaks.
First CRAN release.
0.1-24 (2014-03-11)
Fixed bugs detected with ASAN and UBSAN, e.g., fixed CharClass::gcmask type (enum -> uint32_t) (reported by UBSAN).
Fixed array over-runs detected with valgrind in string8.h.
Fixed uninitialised class fields in StriContainerUTF8 (reported by valgrind).
0.1-23 (2014-03-11)
License changed to BSD-3-clause, COPYRIGHTS updated.
icudt is not shipped with stringi anymore; it is now downloaded in install.libs.R from one of our servers.
New functions: stri_install_check(), stri_install_icudt().
0.1-22 (2014-02-20)
System ICU is used on systems which do have one (version >= 50 needed). ICU is auto-detected with pkg-config in configure. Pass '--disable-pkg-config' to configure to force building ICU from sources.
icudt52b (custom subset) is now shipped with stringi (for big-endian, ASCII systems).
0.1-21 (2014-02-19)
Fixed some issues on Solaris while preparing stringi for CRAN submission.
0.1-20 (2014-02-17)
ICU4C 52.1 sources included (common, i18n, stubdata + icu52dt.dat loaded dynamically). Compilation via Makevars.
stringi does not depend on any external libraries anymore.
0.1-11 (2013-11-16)
ICU4C is now statically linked on Windows.
First OS X binary build.
The package is being intensively tested by our students at Warsaw University of Technology.
0.1-10 (2013-11-13)
Using pkg-config via configure to look for ICU4C libs.
0.1-6 (2013-07-05)
First Windows binary build.
Compilation passed on Oracle Sun Studio compiler collection.
By now we have implemented most of the functionality scheduled for milestone 0.1.
0.1-1 (2013-01-05)
The stringi project has been started.