.Dd January 18, 2024
.Dt U8TOR 3
.Os
.Sh NAME
.Nm u8tor ,
.Nm u8tor_uc
.Nd decode UTF-8 into a rune
.Sh LIBRARY
.Lb librune
.Sh SYNOPSIS
.In mbstring.h
.Ft int
.Fn u8tor "rune *ch" "const char8_t *s"
.Ft int
.Fn u8tor_uc "rune *ch" "const char8_t *s"
.Sh DESCRIPTION
The
.Fn u8tor
and
.Fn u8tor_uc
functions decode the first rune in the UTF-8 buffer
.Fa s ,
storing the result in the rune pointed to by
.Fa ch .
Both functions return the number of bytes that make up the decoded
UTF-8 sequence.
.Pp
The two functions are nearly identical, but
.Fn u8tor_uc
performs fewer range checks than
.Fn u8tor ,
allowing it to process data more efficiently.
When provided with invalid UTF-8, however, the behavior of
.Fn u8tor_uc
is undefined.
The
.Fn u8tor
function, on the other hand, handles invalid UTF-8 by storing
.Dv RUNE_ERROR
in
.Fa ch
and returning 1.
.Sh RETURN VALUES
The
.Fn u8tor
and
.Fn u8tor_uc
functions return the number of bytes from
.Fa s
decoded into
.Fa ch .
.Pp
The
.Fn u8tor
function returns 1 on invalid UTF-8.
.Sh EXAMPLES
The following call to
.Fn u8tor
attempts to decode the first UTF-8 codepoint in
.Va buf .
.Bd -literal -offset indent
/* Implementation of read_codepoint() omitted */
int w;
rune ch;
char8_t *buf = read_codepoint(stdin);

/* w is the number of bytes of buf that were decoded into ch */
w = u8tor(&ch, buf);
if (ch == RUNE_ERROR) {
	fputs("Got invalid UTF-8 codepoint\en", stderr);
	exit(EXIT_FAILURE);
}
printf("Got rune ‘%.*s’\en", w, (char *)buf);
.Ed
.Sh SEE ALSO
.Xr rtou8 3 ,
.Xr u8chk 3 ,
.Xr u8next 3 ,
.Xr unicode 7 ,
.Xr utf\-8 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Sh AUTHORS
.An Thomas Voss Aq Mt mail@thomasvoss.com