stri_split_boundaries: Split a String at Text Boundaries

Description

This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.

Usage

stri_split_boundaries(
  str,
  n = -1L,
  tokens_only = FALSE,
  simplify = FALSE,
  ...,
  opts_brkiter = NULL
)

Arguments

str

character vector or an object coercible to one

n

integer vector; the maximum number of strings to return for each element of str, see Details

tokens_only

single logical value; may affect the result if n is positive, see Details

simplify

single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value

...

additional settings for opts_brkiter

opts_brkiter

a named list with ICU BreakIterator’s settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break
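As a small illustration of the last two arguments, the sketch below passes the same break iterator setting in two ways: directly through ... and as a list built with stri_opts_brkiter(). The sample string is invented for this example; the two calls are expected to give identical results.

```r
library(stringi)

# Sample input, invented for this illustration
x <- "Spam, spam, eggs."

# Setting passed directly through `...`
via_dots <- stri_split_boundaries(x, type = "word")

# The same setting wrapped in a list created by stri_opts_brkiter()
via_opts <- stri_split_boundaries(
  x,
  opts_brkiter = stri_opts_brkiter(type = "word")
)

# Compare the two results
identical(via_dots, via_opts)
```

Passing settings through ... is convenient for one-off calls, while a prepared list may be easier to reuse across several calls.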

Details

Vectorized over str and n.

If n is negative (the default), then all text pieces are extracted.

Otherwise, if tokens_only is FALSE (the default), then n-1 tokens are extracted (if possible) and the n-th string gives the unsplit remainder. If tokens_only is TRUE, then only full tokens (up to n pieces) are extracted.
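The effect of n and tokens_only can be sketched as follows. The input string and the variable names are invented for this illustration; the comments describe the behavior stated above rather than exact printed output.

```r
library(stringi)

# Sample input with three sentences, invented for this illustration
x <- "One. Two. Three."

# n = 2, tokens_only = FALSE: one token plus the unsplit remainder
with_rest <- stri_split_boundaries(x, n = 2, type = "sentence")

# n = 2, tokens_only = TRUE: at most two full tokens, no remainder
tokens <- stri_split_boundaries(x, n = 2, tokens_only = TRUE,
                                type = "sentence")

# n is vectorized: a different limit for each input string
# (1 keeps the string whole, -1 extracts all pieces)
per_string <- stri_split_boundaries(c(x, x), n = c(1, -1),
                                    type = "sentence")

with_rest
tokens
per_string
```

With tokens_only = FALSE, the pieces still concatenate back to the original string, which may matter when no text should be lost.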

For more information on text boundary analysis performed by ICU’s BreakIterator, see stringi-search-boundaries.

Value

If simplify=FALSE (the default), then the function returns a list of character vectors.

Otherwise, stri_list2matrix with byrow=TRUE and n_min=n arguments is called on the resulting object. In such a case, a character matrix with length(str) rows is returned. The fill argument of stri_list2matrix is set to an empty string if simplify is TRUE and to NA if simplify is NA.
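A minimal sketch of the two matrix variants, using a made-up input in which the strings yield a different number of words, so that padding is needed:

```r
library(stringi)

# Sample input, invented for this illustration: 3 words and 2 words
x <- c("a b c", "d e")

# simplify = TRUE: shorter rows are padded with empty strings
m_empty <- stri_split_boundaries(x, type = "word",
                                 skip_word_none = TRUE, simplify = TRUE)

# simplify = NA: shorter rows are padded with NA
m_na <- stri_split_boundaries(x, type = "word",
                              skip_word_none = TRUE, simplify = NA)

dim(m_empty)   # one row per element of x
m_empty[2, 3]  # padding cell when simplify = TRUE
m_na[2, 3]     # padding cell when simplify = NA
```

Padding with NA may be preferable when an empty string could be mistaken for a genuine token.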

Author(s)

Marek Gagolewski and other contributors

See Also

The official online manual of stringi at https://stringi.gagolewski.com/

Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02

Other search_split: about_search, stri_split_lines(), stri_split()

Other locale_sensitive: %s<%(), about_locale, about_search_boundaries, about_search_coll, stri_compare(), stri_count_boundaries(), stri_duplicated(), stri_enc_detect2(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_collator(), stri_order(), stri_rank(), stri_sort_key(), stri_sort(), stri_trans_tolower(), stri_unique(), stri_wrap()

Other text_boundaries: about_search_boundaries, about_search, stri_count_boundaries(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_brkiter(), stri_split_lines(), stri_trans_tolower(), stri_wrap()

Examples

test <- 'The\u00a0above-mentioned    features are very useful. ' %s+%
   'Spam, spam, eggs, bacon, and spam. 123 456 789'
stri_split_boundaries(test, type='line')
## [[1]]
##  [1] "The above-"    "mentioned    " "features "     "are "         
##  [5] "very "         "useful. "      "Spam, "        "spam, "       
##  [9] "eggs, "        "bacon, "       "and "          "spam. "       
## [13] "123 "          "456 "          "789"
stri_split_boundaries(test, type='word')
## [[1]]
##  [1] "The"       " "         "above"     "-"         "mentioned" "    "     
##  [7] "features"  " "         "are"       " "         "very"      " "        
## [13] "useful"    "."         " "         "Spam"      ","         " "        
## [19] "spam"      ","         " "         "eggs"      ","         " "        
## [25] "bacon"     ","         " "         "and"       " "         "spam"     
## [31] "."         " "         "123"       " "         "456"       " "        
## [37] "789"
stri_split_boundaries(test, type='word', skip_word_none=TRUE)
## [[1]]
##  [1] "The"       "above"     "mentioned" "features"  "are"       "very"     
##  [7] "useful"    "Spam"      "spam"      "eggs"      "bacon"     "and"      
## [13] "spam"      "123"       "456"       "789"
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_letter=TRUE)
## [[1]]
## [1] "123" "456" "789"
stri_split_boundaries(test, type='word', skip_word_none=TRUE, skip_word_number=TRUE)
## [[1]]
##  [1] "The"       "above"     "mentioned" "features"  "are"       "very"     
##  [7] "useful"    "Spam"      "spam"      "eggs"      "bacon"     "and"      
## [13] "spam"
stri_split_boundaries(test, type='sentence')
## [[1]]
## [1] "The above-mentioned    features are very useful. "
## [2] "Spam, spam, eggs, bacon, and spam. "              
## [3] "123 456 789"
stri_split_boundaries(test, type='sentence', skip_sentence_sep=TRUE)
## [[1]]
## [1] "The above-mentioned    features are very useful. "
## [2] "Spam, spam, eggs, bacon, and spam. "
stri_split_boundaries(test, type='character')
## [[1]]
##  [1] "T" "h" "e" " " "a" "b" "o" "v" "e" "-" "m" "e" "n" "t" "i" "o" "n" "e" "d"
## [20] " " " " " " " " "f" "e" "a" "t" "u" "r" "e" "s" " " "a" "r" "e" " " "v" "e"
## [39] "r" "y" " " "u" "s" "e" "f" "u" "l" "." " " "S" "p" "a" "m" "," " " "s" "p"
## [58] "a" "m" "," " " "e" "g" "g" "s" "," " " "b" "a" "c" "o" "n" "," " " "a" "n"
## [77] "d" " " "s" "p" "a" "m" "." " " "1" "2" "3" " " "4" "5" "6" " " "7" "8" "9"
# a filtered break iterator (requires a recent version of ICU):
stri_split_boundaries('Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.', type='sentence', locale='en_US@ss=standard') # ICU >= 56 only
## [[1]]
## [1] "Mr. Jones and Mrs. Brown are very happy.\n"
## [2] "So am I, Prof. Smith."