c - Cannot distinguish single-character words with libunibreak
When I use set_word_breaks_utf32() from the libunibreak library to navigate through words, I see single-letter words (e.g. 'a' in English, '北' in Chinese, ...) disappear, because they evaluate as WORDBREAK_BREAK and are consequently indistinguishable from the surrounding whitespace. The following code demonstrates the issue:
    #include <stdio.h>
    #include <stdint.h>
    #include "wordbreak.h"

    int main(int argc, const char* argv[])
    {
        size_t i;
        uint32_t text[] = { 't', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', '.', '\n' };
        char breaks[1024];
        size_t length = sizeof(text) / sizeof(text[0]);

        /* Fill breaks[] with one word-break value per code point. */
        set_word_breaks_utf32(text, length, "", breaks);

        /* Print the text, then the break value for each position. */
        for (i = 0; i < length; i++)
            putchar(text[i]);
        for (i = 0; i < length; i++)
            putchar(breaks[i] + '0');
        putchar('\n');

        return 0;
    }

The output of this code shows that the letter 'a' is indistinguishable from the surrounding whitespace:
    this is a test.
    1110010000111000

What can I do to ensure that the boundaries of single-letter words are distinguishable in the set_word_breaks_utf32() output?
[Apologies for using the line-breaks tag; the word-break tag is related to a CSS property.]
Unicode Standard Annex #29 isn't designed for that. set_word_breaks_utf32() only finds each word boundary.
    this is a test.
    1110010000111000

      t   h   i   s  ' '  i   s  ' '  a  ' '  t   e   s   t   .  '\n'
    |   _   _   _   |   |   _   |   |   |   |   _   _   _   |   |    |

Each | above is a word boundary, which can be helpful for finding words, but it is not a complete solution. Note that there is an implicit word boundary at the beginning of the string. A complete word detection algorithm would have to determine whether the character between each pair of adjacent word boundaries is a Unicode letter, and mark that character as part of a word accordingly.
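The post-processing step described above could be sketched roughly as follows. This is only an illustration, not part of libunibreak: mark_words() and is_letter_sketch() are hypothetical names, the letter test covers only ASCII letters and the basic CJK Unified Ideographs block (a real implementation would consult the Unicode general category), and each segment is classified by its first code point alone. The sketch takes the breaks array already filled in by set_word_breaks_utf32(), so it does not call the library itself.

```c
#include <stddef.h>
#include <stdint.h>

/* libunibreak's wordbreak.h defines WORDBREAK_BREAK; this fallback only
   exists so the sketch compiles without the header. The value 0 matches
   the output shown in the question. */
#ifndef WORDBREAK_BREAK
#define WORDBREAK_BREAK 0
#endif

/* Hypothetical, deliberately incomplete letter test (an assumption):
   ASCII letters plus CJK Unified Ideographs U+4E00..U+9FFF. */
static int is_letter_sketch(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return 1;
    if (cp >= 0x4E00 && cp <= 0x9FFF)
        return 1;
    return 0;
}

/* Hypothetical helper: for each code point, set is_word[i] to 1 if it
   belongs to a segment that starts with a letter, otherwise 0.
   breaks[i] == WORDBREAK_BREAK means there is a boundary after text[i];
   the start of the string is treated as an implicit boundary. */
static void mark_words(const uint32_t *text, const char *breaks,
                       size_t len, char *is_word)
{
    size_t start = 0;
    size_t i, j;

    for (i = 0; i < len; i++) {
        /* A segment ends at a boundary or at the end of the text. */
        if (breaks[i] == WORDBREAK_BREAK || i == len - 1) {
            char word = (char)is_letter_sketch(text[start]);
            for (j = start; j <= i; j++)
                is_word[j] = word;
            start = i + 1;
        }
    }
}
```

Called with the text and breaks arrays from the question, this marks "this", "is", "a" and "test" as words and leaves the spaces, the full stop and the newline unmarked, so the single letter 'a' is no longer confused with the whitespace around it.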