stri_split_boundaries: Split a String at Text Boundaries
Description
This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.
Usage
stri_split_boundaries(
    str,
    n = -1L,
    tokens_only = FALSE,
    simplify = FALSE,
    ...,
    opts_brkiter = NULL
)
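Arguments
| Argument | Description |
|---|---|
| str | character vector or an object coercible to one |
| n | integer vector, maximal number of strings to return |
| tokens_only | single logical value; may affect the result if n is positive, see Details |
| simplify | single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value |
| ... | additional settings for opts_brkiter |
| opts_brkiter | a named list with ICU BreakIterator’s settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break |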
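The relationship between the ... and opts_brkiter arguments can be illustrated with a minimal sketch. It assumes the stringi package is installed; the input string and the variable names via_dots and via_opts are illustrative only.

```r
library(stringi)

x <- "Spam, spam, eggs, bacon, and spam."

# break iterator settings passed directly through `...`
via_dots <- stri_split_boundaries(x, type = "word", skip_word_none = TRUE)

# the same settings bundled into a named list by stri_opts_brkiter()
via_opts <- stri_split_boundaries(
    x,
    opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

# both calls are expected to yield the same list of tokens
identical(via_dots, via_opts)
```

Passing settings through ... is convenient for one-off calls, whereas a prepared opts_brkiter list can be reused across several boundary-related functions.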
Details
Vectorized over str and n.
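As a rough illustration of the vectorization, the sketch below pairs each input string with its own n. It assumes the stringi package is installed; the strings are made up for this example.

```r
library(stringi)

x <- c("a b c", "d e f")

# the first string is split fully (n = -1), the second into at most 2 pieces
res <- stri_split_boundaries(x, n = c(-1L, 2L), type = "line_break")

# number of pieces obtained for each element of x
lengths(res)
```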
If n is negative (the default), then all text pieces are extracted.
Otherwise, if tokens_only is FALSE (this is the default, for compatibility with the stringr package), then n-1 tokens are extracted (if possible) and the n-th string gives the (non-split) remainder (see Examples). On the other hand, if tokens_only is TRUE, then only full tokens (up to n pieces) are extracted.
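The effect of n and tokens_only might be explored with a short sketch like the following. It assumes the stringi package is installed; the input string is illustrative, and the comments describe what the rules above lead one to expect rather than verified output.

```r
library(stringi)

x <- "alpha beta gamma delta"

# n < 0 (the default): every piece between line-break opportunities
all_pieces <- stri_split_boundaries(x, type = "line_break")

# n = 2, tokens_only = FALSE: one token, then the unsplit remainder
with_rest <- stri_split_boundaries(x, n = 2, type = "line_break")

# n = 2, tokens_only = TRUE: at most two full tokens, remainder dropped
tokens <- stri_split_boundaries(x, n = 2, tokens_only = TRUE,
                                type = "line_break")

print(all_pieces)
print(with_rest)
print(tokens)
```

Comparing with_rest and tokens side by side shows whether the text beyond the requested tokens is kept in the last piece or discarded.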
For more information on text boundary analysis performed by ICU’s BreakIterator, see stringi-search-boundaries.
Value
If simplify=FALSE (the default), then the function returns a list of character vectors.
Otherwise, stri_list2matrix with byrow=TRUE and n_min=n arguments is called on the resulting object. In such a case, a character matrix with length(str) rows is returned. Note that stri_list2matrix’s fill argument is set to an empty string for simplify=TRUE and to NA for simplify=NA.
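The three return shapes can be compared with a small sketch. It assumes the stringi package is installed; the input vector is illustrative and chosen so that the elements yield different numbers of pieces.

```r
library(stringi)

x <- c("alpha beta gamma", "delta")

# simplify = FALSE (default): a list with one character vector per string
as_list <- stri_split_boundaries(x, type = "line_break")

# simplify = TRUE: a character matrix, short rows padded with ""
as_matrix <- stri_split_boundaries(x, type = "line_break", simplify = TRUE)

# simplify = NA: a character matrix, short rows padded with NA
as_matrix_na <- stri_split_boundaries(x, type = "line_break", simplify = NA)

dim(as_matrix)  # one row per element of x
```

The NA-padded variant makes missing cells easy to tell apart from pieces that are genuinely empty strings.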
See Also
Other search_split: about_search, stri_split_lines(), stri_split()
Other locale_sensitive: %s<%(), about_locale, about_search_boundaries, about_search_coll, stri_compare(), stri_count_boundaries(), stri_duplicated(), stri_enc_detect2(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_collator(), stri_order(), stri_sort_key(), stri_sort(), stri_trans_tolower(), stri_unique(), stri_wrap()
Other text_boundaries: about_search_boundaries, about_search, stri_count_boundaries(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_brkiter(), stri_split_lines(), stri_trans_tolower(), stri_wrap()
Examples
test <- 'The\u00a0above-mentioned features are very useful. ' %s+%
'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
stri_split_boundaries(test, type='word')
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type='sentence')
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
stri_split_boundaries(test, type='character')
# a filtered break iterator (available with ICU >= 56 only):
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only