about_search_boundaries:¶
Description¶
Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text.
Details¶
Examples of the boundary analysis process include:
Locating positions to word-wrap text to fit within specific margins while displaying or printing, see stri_wrap and stri_split_boundaries.
Counting characters, words, sentences, or paragraphs, see stri_count_boundaries.
Making a list of the unique words in a document, see stri_extract_all_words and then stri_unique.
Capitalizing the first letter of each word or sentence, see also stri_trans_totitle.
Locating a particular unit of the text (for example, finding the third word in the document), see stri_locate_all_boundaries.
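The tasks listed above can be sketched in R as follows. This is a minimal illustration, assuming stringi is installed; the sample text and the variable names are made up for the example and do not come from the manual.

```r
library(stringi)

# Illustrative sample text (an assumption, not taken from the manual)
x <- "The quick brown fox. The lazy dog sleeps."

# Word-wrap to a suggested width of 20; one string per output line
wrapped <- stri_wrap(x, width = 20)

# Count sentences, and words (skipping spaces and punctuation)
n_sentences <- stri_count_boundaries(x, type = "sentence")
n_words <- stri_count_boundaries(x, type = "word", skip_word_none = TRUE)

# List the unique words in the text
words <- stri_extract_all_words(x)[[1]]
unique_words <- stri_unique(words)

# Capitalize the first letter of each word
titled <- stri_trans_totitle("the quick brown fox")

# Find the third word: locate all words, then take the third row
pos <- stri_locate_all_boundaries(x, type = "word", skip_word_none = TRUE)[[1]]
third_word <- stri_sub(x, pos[3, "start"], pos[3, "end"])
```

Here `skip_word_none = TRUE` is passed through to stri_opts_brkiter so that only word-like segments are counted or located.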
Generally, text boundary analysis is a locale-dependent operation. For example, Japanese and Chinese do not separate words with spaces, and a line break can occur even in the middle of a word. These languages also have punctuation and diacritical marks that cannot start or end a line, which must be taken into account as well.
stringi uses ICU’s BreakIterator to locate specific text boundaries. Note that the BreakIterator’s behavior may be controlled in some cases, see stri_opts_brkiter.
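As a rough sketch of how such options can be supplied, the snippet below passes break iterator settings either inline or through an explicit stri_opts_brkiter object. The sample string and the `en_US` locale are assumptions chosen for illustration only.

```r
library(stringi)

# Illustrative input (an assumption, not from the manual)
x <- "Hello, world!"

# Default word iterator: every segment between two boundaries is returned,
# including spaces and punctuation
all_segments <- stri_split_boundaries(x, type = "word")[[1]]

# skip_word_none = TRUE drops segments that are not words
words_only <- stri_split_boundaries(x, type = "word",
                                    skip_word_none = TRUE)[[1]]

# The same settings via an explicit options object; the locale is illustrative
opts <- stri_opts_brkiter(type = "word", skip_word_none = TRUE,
                          locale = "en_US")
words_only2 <- stri_split_boundaries(x, opts_brkiter = opts)[[1]]
```

Without any skip rule the returned segments tile the whole input, so pasting them back together reproduces the original string.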
The character boundary iterator tries to match what a user would think of as a “character” – a basic unit of a writing system for a language – which may be more than just a single Unicode code point.
The word boundary iterator locates the boundaries of words, for purposes such as “Find whole words” operations.
The line_break iterator locates positions that would be appropriate to wrap lines when displaying the text.
The break iterator of type sentence locates sentence boundaries.
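The four iterator types can be compared side by side. The following is a small sketch with made-up input strings, assuming the default locale behaves like English for this simple text.

```r
library(stringi)

# "a" followed by a combining acute accent:
# two code points, but one user-perceived character
s <- "a\u0301"
n_code_points <- stri_length(s)
n_characters <- stri_count_boundaries(s, type = "character")

# Illustrative text (an assumption, not from the manual)
txt <- "One fish. Two fish."

# Words only (spaces and punctuation skipped)
word_segments <- stri_split_boundaries(txt, type = "word",
                                       skip_word_none = TRUE)[[1]]

# Chunks between positions where a line could be wrapped
line_segments <- stri_split_boundaries(txt, type = "line_break")[[1]]

# Whole sentences
sentence_segments <- stri_split_boundaries(txt, type = "sentence")[[1]]
```

The character case shows why counting code points with stri_length can differ from counting character boundaries.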
For technical details on different classes of text boundaries refer to the ICU User Guide, see below.
References¶
Boundary Analysis – ICU User Guide, https://unicode-org.github.io/icu/userguide/boundaryanalysis/
See Also¶
The official online manual of stringi at https://stringi.gagolewski.com/
Gagolewski M., stringi: Fast and portable character string processing in R, Journal of Statistical Software 103(2), 2022, 1-59, doi:10.18637/jss.v103.i02
Other locale_sensitive: %s<%(), about_locale, about_search_coll, stri_compare(), stri_count_boundaries(), stri_duplicated(), stri_enc_detect2(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_collator(), stri_order(), stri_rank(), stri_sort(), stri_sort_key(), stri_split_boundaries(), stri_trans_tolower(), stri_unique(), stri_wrap()
Other text_boundaries: about_search, stri_count_boundaries(), stri_extract_all_boundaries(), stri_locate_all_boundaries(), stri_opts_brkiter(), stri_split_boundaries(), stri_split_lines(), stri_trans_tolower(), stri_wrap()
Other stringi_general_topics: about_arguments, about_encoding, about_locale, about_search, about_search_charclass, about_search_coll, about_search_fixed, about_search_regex, about_stringi