Writing UTF-8 Programs in Plan 9

This post was written using 9front/amd64 “COMMUNITY VS INFRASTRUCTURE” as a reference.

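Programs are built on Plan 9 using the 2c(1) compiler suite, in which each target architecture has its own compiler and linker, named by a one-character prefix.

Thus, on amd64, where the prefix is 6: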

tenshi% 6c rsize.c
tenshi% 6l rsize.6
tenshi% 6.out
わ 4 4
tenshi% 

Remember, in C, a char is always 1 byte, and a byte is almost always 8 bits, which is enough to hold one ASCII character.

This document presumes that a non-rune char is one 8-bit byte.

UTF-8 in Plan 9

Plan 9 is full of UTF-8 aware programs, which makes sense, given that Plan 9 was the system UTF-8 was designed on/for.

Wc(1) has a rune flag (very useful); below, 447 is the rune count and 737 is the byte count:

tenshi% wc -c -r < /lib/hiragana
    447     737
tenshi% 

Tr(1) works on runes:

tenshi% echo 'αβψ' | tr -d 'β'
αψ
tenshi% 

The compiler suite permits UTF-8 characters in source code, as seen in ☺☻☹.c below:

#include <u.h>
#include <libc.h>
typedef int ☺☹☻;
typedef void ☹☺☻;
enum
{
	☹☻☺ = sizeof(☺☹☻),
	☻☺☹ = 0,
};

☹☺☻
main(☹☺☻)
{
	☺☹☻ ☺☻☹;

	for(☺☻☹=☻☺☹; ☺☻☹<☹☻☺; ☺☻☹++)
		print("☺☻☹ = %d\n", ☺☻☹);
	exits(☻☺☹);
}

tenshi% 6.out
☺☻☹ = 0
☺☻☹ = 1
☺☻☹ = 2
☺☻☹ = 3
tenshi% 

Plan 9 also happens to be the source of a Unicode-aware troff implementation, if not the only one.

The list goes on.

Runes

Dracula: What is a string? A miserable little pile of bytes.

What is a rune?

On Plan 9 it’s:

tenshi% grep Rune /$objtype/include/u.h
typedef	uint		Rune;
tenshi% 

A uint of course being:

tenshi% grep uint /$objtype/include/u.h
typedef unsigned int	uint;

Okay, but how big is that actually?

A quick program to be sure:

#include <u.h>
#include <libc.h>

void
main(int, char*[])
{
	Rune r = L'わ';
	print("%C %d %d\n", r, sizeof r, sizeof (Rune));
	exits(nil);
}

Yielding:

tenshi% 6.out
わ 4 4
tenshi% 

So a rune is 32 bits.

What’s stored in a rune?

The utf(6) manual says it best:

In Plan 9, a rune is a 32-bit quantity representing a
Unicode character.  Internally, programs may store characters
as runes.  However, any external manifestation of textual
information, in files or at the interface between programs,
uses a machine-independent, byte-stream encoding called UTF.

UTF is designed so the 7-bit ASCII set (values hexadecimal
00 to 7F), appear only as themselves in the encoding.  Runes
with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.

So a rune is truly 32 bits.
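
The byte layout that utf(6) describes can be sketched in a few lines. The block below is portable C rather than Plan 9 C, with uint32_t standing in for Rune, and sketch_runetochar is a hypothetical name made up for this illustration; on Plan 9 the real routine is runetochar() from rune(2). It is a minimal sketch that assumes the caller passes a valid code point and does no surrogate or range checking.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for Plan 9's runetochar(): encode one code
 * point r as UTF-8 into buf (at least 4 bytes) and return the number
 * of bytes written. Assumes r is a valid code point <= 0x10FFFF.
 */
int
sketch_runetochar(unsigned char *buf, uint32_t r)
{
	if(r < 0x80){			/* 7-bit ASCII appears as itself */
		buf[0] = r;
		return 1;
	}
	if(r < 0x800){			/* 110xxxxx 10xxxxxx */
		buf[0] = 0xC0 | (r >> 6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	if(r < 0x10000){		/* 1110xxxx 10xxxxxx 10xxxxxx */
		buf[0] = 0xE0 | (r >> 12);
		buf[1] = 0x80 | ((r >> 6) & 0x3F);
		buf[2] = 0x80 | (r & 0x3F);
		return 3;
	}
	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	buf[0] = 0xF0 | (r >> 18);
	buf[1] = 0x80 | ((r >> 12) & 0x3F);
	buf[2] = 0x80 | ((r >> 6) & 0x3F);
	buf[3] = 0x80 | (r & 0x3F);
	return 4;
}
```

Every byte written for a rune above 7F has its high bit set, which is why ASCII bytes only ever appear as themselves. For instance, わ is U+308F and comes out as the three bytes E3 82 8F.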

Runes in C

As seen in ‘C Programming in Plan 9’, Plan 9 introduces a few pleasantries to make writing rune-friendly programs a bit easier.

From the same document:

#include <u.h>
#include <libc.h>

void
main()
{
    print("3 %C 4\n", L'≤');
    print("%S\n", L"Άρχιμήδης"); /* Archimedes */
}

Yielding:

tenshi% 6.out
3 ≤ 4
Άρχιμήδης
tenshi% 

There are a few patterns that can save you from overthinking the rune↔char conversion.

In most cases in C, input arrives as bytes, that is, as a char[].

A char[] is not an array of runes, however, but just bytes. A single rune can span multiple bytes, up to 4 bytes wide once encoded.

So operating on individual written characters (runes) is not necessarily as simple as iterating through the bytes.

Thankfully, getting from a char[] to a Rune[] is fairly straightforward.

smprint(2) and runesmprint(2) (or other print(2) functions as appropriate) permit the emission of a malloc(2)’d buffer of bytes and runes, respectively.

So, in practice, we can convert an entire string into a new buffer:

#include <u.h>
#include <libc.h>

void
main(int, char*[])
{
	Rune	*rstr = L"☺☹☻",	*rout;
	char	*bstr = "abc123",	*bout;

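	/* %s reads UTF-8 bytes into runes; %S writes runes back out as UTF-8 */
	rout = runesmprint("%s", bstr);
	bout = smprint("%S", rstr);
	if(rout == nil || bout == nil)
		sysfatal("out of memory: %r");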
	print("%S (%d)\n%s (%d)\n", rstr, runestrlen(rstr), bout, strlen(bout));
	print("%s (%d)\n%S (%d)\n", bstr, strlen(bstr), rout, runestrlen(rout));

	free(rout);
	free(bout);
	exits(nil);
}

tenshi% 6.out
☺☹☻ (3)
☺☹☻ (9)
abc123 (6)
abc123 (6)
tenshi% 

Iterating through a Rune[] lets us work with individual, legible Unicode characters just as we would with bytes in traditional C.
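
To make the difference between stepping through bytes and stepping through runes concrete, here is a rough sketch of the decoding that has to happen along the way. It is portable C rather than Plan 9 C, uint32_t stands in for Rune, and the sketch_ names are hypothetical; on Plan 9 you would reach for chartorune() and utflen() from rune(2) instead. The sketch returns U+FFFD for malformed input and, as an assumption to keep it short, does not reject overlong encodings or surrogates.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for Plan 9's chartorune(): decode one UTF-8
 * sequence from s into *r and return the number of bytes consumed.
 * Malformed input yields U+FFFD and consumes one byte.
 */
int
sketch_chartorune(uint32_t *r, const char *s)
{
	const unsigned char *p = (const unsigned char*)s;
	int i, n;
	uint32_t c;

	if(p[0] < 0x80){		/* ASCII */
		*r = p[0];
		return 1;
	}
	if((p[0] & 0xE0) == 0xC0){
		n = 2; c = p[0] & 0x1F;
	}else if((p[0] & 0xF0) == 0xE0){
		n = 3; c = p[0] & 0x0F;
	}else if((p[0] & 0xF8) == 0xF0){
		n = 4; c = p[0] & 0x07;
	}else
		goto bad;		/* stray continuation or invalid lead byte */

	for(i = 1; i < n; i++){
		if((p[i] & 0xC0) != 0x80)	/* also catches a premature \0 */
			goto bad;
		c = (c << 6) | (p[i] & 0x3F);
	}
	*r = c;
	return n;
bad:
	*r = 0xFFFD;
	return 1;
}

/* Count the runes (not bytes) in a NUL-terminated UTF-8 string. */
size_t
sketch_utflen(const char *s)
{
	uint32_t r;
	size_t n = 0;

	while(*s != '\0'){
		s += sketch_chartorune(&r, s);
		n++;
	}
	return n;
}
```

Run over the UTF-8 bytes of ☺☹☻, the counting loop takes three steps across nine bytes, which is the same 3 versus 9 that runestrlen() and strlen() reported above.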

Case Study: K&R exercise 1-19

‘The C Programming Language’ has exercises in it for completion by the reader. Exercise 1-19 asks the reader to implement a program that reverses individual lines of its input using a function reverse(s).

The prompt was almost certainly written with only ASCII input in mind, so reversal is, in theory, trivial. In the spirit of reimplementing K&R exercises in Plan 9 C, I decided to pursue UTF-8 compliance.

In-place array reversal in C is more or less well understood, but what we’ll be given as our input is probably going to be plain bytes.

My solution used the bio(2) library, specifically Brdstr(2) for reading input, so each read returns a malloc(2)’d char* containing the line of bytes until the next \n, with the \n replaced by a \0 null delimiter.

Since the atomic unit of our reversal may span multiple bytes, reverse() should operate on a Rune* rather than a char*.

Here’s a trimmed down version of the program:

#include <u.h>
#include <libc.h>
#include <bio.h>

Rune*
reverse(Rune *in, Rune *out, usize len)
{
	/* walk in from its last rune to its first, filling out front to back */
	int to = 0, from = len-1;
	while(from >= 0){
		out[to] = in[from];
		from--;
		to++;
	}

	return out;
}

void
main(int, char*[])
{
	Biobuf	*in, *out;
	in = Bfdopen(0, OREAD);
	out = Bfdopen(1, OWRITE);

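	while(1){
		long rlen;
		char *line;
		Rune *rev, *rstr;

		/* read one line; the trailing \n is replaced by \0 */
		line = Brdstr(in, '\n', 1);
		if(line == nil)
			break;

		/* decode the UTF-8 bytes into runes so we reverse characters, not bytes */
		rstr = runesmprint("%s", line);
		if(rstr == nil)
			sysfatal("runesmprint: %r");
		rlen = runestrlen(rstr);
		/* calloc zeroes the buffer, so rev is already terminated */
		rev = calloc(rlen+1, sizeof (Rune));
		if(rev == nil)
			sysfatal("calloc: %r");
		reverse(rstr, rev, rlen);

		Bprint(out, "%S\n", rev);

		free(line);
		free(rev);
		free(rstr);
	}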

	Bterm(in);
	Bterm(out);

	exits(nil);
}

Here’s an example of the program running on /lib/hiragana:

term% 8.out < /lib/hiragana
anagariH


-	k	s	t	n	h	m	y	r	w

a  あ	か	さ	た	な	は	ま	や	ら	わ
i  い	き	ihsし	ihcち	に	ひ	み		り	ゐ
u  う	く	す	ustつ	ぬ	ufふ	む	ゆ	る	
e  え	け	せ	て	ね	へ	め		れ	ゑ
o  お	こ	そ	と	の	ほ	も	よ	ろ	oを

m/nん
g	z	d		b	p			

a	が	ざ	だ		ば	ぱ			
i	ぎ	ijじ	ijぢ		び	ぴ			
u	ぐ	ず	uzづ		ぶ	ぷ			
e	げ	ぜ	で		べ	ぺ			
o	ご	ぞ	ど		ぼ	ぽ			


yk	hs	hc	yn	yh	ym		yr	

a	ゃき	ゃし	ゃち	ゃに	ゃひ	ゃみ		ゃり	
u	ゅき	ゅし	ゅち	ゅに	ゅひ	ゅみ		ゅり	
o	ょき	ょし	ょち	ょに	ょひ	ょみ		ょり	


yg	j	j		yb	yp			

a	ゃぎ	ゃじ	ゃぢ		ゃび	ゃぴ			
u	ゅぎ	ゅじ	ゅぢ		ゅび	ゅぴ			
o	ょぎ	ょじ	ょぢ		ょび	ょぴ			
term% 

The above program is much more complex (and less time/memory efficient) than its byte-oriented counterpart. In exchange, the program actually works with languages other than English, which is most languages.
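
Some of that memory cost is optional. As one possible variation (not what the program above does), the runes can be swapped in place once they have been decoded, so no second buffer is needed. The sketch below is portable C with uint32_t standing in for Rune, and sketch_reverse is a hypothetical name.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Reverse the first len elements of s in place by swapping from both
 * ends toward the middle. uint32_t stands in for Plan 9's Rune here.
 */
void
sketch_reverse(uint32_t *s, size_t len)
{
	size_t i, j;
	uint32_t tmp;

	if(len < 2)
		return;
	for(i = 0, j = len-1; i < j; i++, j--){
		tmp = s[i];
		s[i] = s[j];
		s[j] = tmp;
	}
}
```

Passing the length without the terminator leaves the trailing 0 where it is, so the result is still a valid rune string.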

The complete K&R 1-19 solution source: https://shithub.us/henesy/kandr/HEAD/ex/ex1-19.c/f.html

An aside on whitespace

Does multi-byte (UTF-8-only) whitespace exist? At first glance, the special-cased list in Go’s unicode.IsSpace() suggests not:

package main

import (
	"fmt"
)

func main() {
	fmt.Println("Max:", ^byte(0))
	// From Go's "unicode" package's `unicode.IsSpace()`
	// https://github.com/golang/go/blob/go1.17.7/src/unicode/graphic.go#L126
	for _, r := range []int{'\t', '\n', '\v', '\f', '\r', ' ', 0x85, 0xA0} {
		fmt.Println(r)
	}
}

Max: 255
9
10
11
12
13
32
133
160

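All of these code points fit under the 1-byte maximum of 255. That list is only the Latin-1 special case, though: for larger runes, unicode.IsSpace() consults the Unicode White_Space property, which includes characters such as U+3000 (ideographic space). Even U+0085 and U+00A0 take two bytes each once encoded as UTF-8, so multi-byte whitespace does exist.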
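
A code point that fits in one byte as an integer can still need more than one byte as UTF-8, so it is worth checking the encoded length rather than the value. The helper below is a small sketch in portable C; sketch_runelen is a hypothetical name (Plan 9 provides runelen() in rune(2)), and it assumes a valid code point.

```c
#include <stdint.h>

/*
 * Number of bytes UTF-8 needs for code point r.
 * Assumes r is a valid code point <= 0x10FFFF.
 */
int
sketch_runelen(uint32_t r)
{
	if(r < 0x80)
		return 1;	/* ASCII, including ' ', \t, \n */
	if(r < 0x800)
		return 2;	/* includes U+0085 and U+00A0 */
	if(r < 0x10000)
		return 3;	/* includes U+3000, ideographic space */
	return 4;
}
```

By this measure only the six ASCII entries in the list above are single-byte whitespace; 0x85 and 0xA0 each encode to two bytes.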

Functions to know

In the interest of making it easier to find what you want when writing your UTF-8-friendly programs, a few manuals:

Rune and UTF-8 conversion → rune(2).

String utilities, but implemented against Rune* → runestrcat(2).

Unicode character (rune) functions, such as the rune equivalent of tolower() → isalpharune(2).

While mostly dealing with bytes, bio(2) includes some Rune-specific functions:

long  Bgetrune(Biobufhdr *bp)
int   Bungetrune(Biobufhdr *bp)
int   Bputrune(Biobufhdr *bp, long c)

Bio’s Bprint() also respects %S and %C, of course.

References

☦ If you want to see me struggle with this particular exercise, it’s on video: https://youtu.be/kWeeQnO0DNc?t=1022