c - Cannot distinguish single character words with libunibreak


When I use set_word_breaks_utf32() from the libunibreak library to navigate through words, I see that single-letter words (e.g. 'a' in English, '北' in Chinese, ...) disappear because they evaluate to WORDBREAK_BREAK and are consequently indistinguishable from the surrounding whitespace. The following code demonstrates the issue:

    #include <stdint.h>
    #include <stdio.h>
    #include "wordbreak.h"

    int main(int argc, const char* argv[])
    {
        size_t i;
        uint32_t text[] = { 't', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', '.', '\n' };
        char breaks[1024];
        size_t length = sizeof(text) / sizeof(text[0]);

        /* Fill breaks[] with one word-break value per code point. */
        set_word_breaks_utf32(text, length, "", breaks);

        /* Print the text, then each break value as a digit. */
        for (i = 0; i < length; i++) putchar(text[i]);
        for (i = 0; i < length; i++) putchar(breaks[i] + '0');
        putchar('\n');
        return 0;
    }

The output of this code shows that the letter 'a' is indistinguishable from the surrounding whitespace:

    this is a test.
    1110010000111000

What can I do to ensure that the boundaries of single-letter words are distinguishable in the output of set_word_breaks_utf32()?

[Apologies for using the line-breaks tag; the word-break tag is related to the CSS property.]

Unicode Standard Annex #29 isn't designed for that. set_word_breaks_utf32() finds each word boundary; it does not say which of the segments between boundaries are words.

    this is a test.
    1110010000111000

      t   h   i   s  ' '  i   s  ' '  a  ' '  t   e   s   t   .  '\n'
    |   _   _   _   |   |   _   |   |   |   |   _   _   _   |   |     |

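Each | above is a word boundary (a 0 in the output, i.e. WORDBREAK_BREAK after that character), and each _ means no boundary. The boundaries can be helpful for finding words, but they are not a complete solution. Note that there is an implicit word boundary at the beginning of the string. A complete word detection algorithm will have to determine whether the character between each pair of adjacent word boundaries is a Unicode letter, and mark the character as a word accordingly.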
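
A minimal sketch of that second pass, under some assumptions: is_word_char() is a hypothetical stand-in for a real Unicode letter test (it only knows ASCII letters and digits and the CJK Unified Ideographs block), mark_words() is a name invented here, and BREAK_AFTER mirrors the value 0 that the output above shows for WORDBREAK_BREAK so the sketch compiles without libunibreak. It classifies each segment by its first character only.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match WORDBREAK_BREAK from wordbreak.h (printed as 0 above). */
#define BREAK_AFTER 0

/* Hypothetical stand-in for a Unicode letter test: ASCII letters and digits,
   plus the CJK Unified Ideographs block. A real implementation would consult
   the Unicode character database instead. */
static int is_word_char(uint32_t c)
{
    if (c < 0x80)
        return isalnum((int)c) != 0;
    return c >= 0x4E00 && c <= 0x9FFF;
}

/* Sets is_word[i] to 1 for every character inside a word and to 0 otherwise.
   breaks[] is the array filled in by set_word_breaks_utf32().
   Returns the number of words found. */
static size_t mark_words(const uint32_t *text, const char *breaks,
                         size_t length, char *is_word)
{
    size_t start = 0;   /* implicit boundary at the beginning of the string */
    size_t words = 0;
    size_t i, j;

    for (i = 0; i < length; i++) {
        int word;

        /* A segment ends at a break, or at the end of the text. */
        if (breaks[i] != BREAK_AFTER && i + 1 < length)
            continue;

        /* Classify the whole segment by its first character. */
        word = is_word_char(text[start]);
        for (j = start; j <= i; j++)
            is_word[j] = (char)word;
        words += (size_t)word;
        start = i + 1;
    }
    return words;
}
```

Applied to the text and breaks array from the question, this marks "this", "is", "a" and "test" as words (four in total) and leaves the spaces, the full stop and the newline unmarked, so the single letter 'a' is no longer lost among the whitespace.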

