This is the first of (I hope) many "debug logs" that I'm going to start putting up. These are unpolished, stream-of-consciousness posts that are meant to approximate my thought process as I'm working on a problem. Writing it down helps me think. I will leave in any dead-ends and mistakes as well.


Debug log: Plan9port on Wayland

Besides my web browser, most of my time on my computer is spent in front of the Acme text editor from Plan 9. In the same way many people use emacs, I use Acme for just about everything–to write prose, to write code, as a mail client, an IRC client, as a terminal–if it's text based, I'm probably using it from an acme window. I use Linux as my daily driver, and oscillate between using Acme on a Plan 9 VM using drawterm(8) and using a port of Acme and other plan9 utilities, to Unix-like systems is available through Plan 9 from user space. The hardware I was running Plan 9 is currently on a boat in the middle of the Atlantic ocean, so I've been relying on plan9port on my laptop.

I want to start using Wayland, since Xorg is in maintenance mode, and Wayland does improve upon X11 in certain ways. Applications that use X11 can continue to run using Xwayland, a compatibility layer most Wayland compositors, unfortunately, acme, when running under Sway's Xwayland implementation, is not able to warp the mouse pointer, making acme much less pleasant to use. I found a few attempts at implementing devdraw support for wayland. This one seems to be the most complete and up-to-date.

I use Guix as my linux distribution of choice. While Guix has a plan9port package, I figured I would need to check if this wayland support worked first before I adapted the package to build from this git branch, so I built it the "normal" way.

$ git clone https://github.com/eaburns/plan9port
$ cd plan9port
$ git checkout -b wayland origin/wayland
$ echo WSYSTYPE=wayland >> LOCAL.config
$ ./INSTALL

There were a few changes I had to make to get it to build.

During the build, I was bombarded with warnings like this:

9c var.c
egrep: warning: egrep is obsolescent; using grep -E

How annoying! Seriously, don't the coreutils developers have better things to do? I created $HOME/bin/egrep, which has higher precedence in my $PATH:

#!/bin/sh
exec grep -E "$@"

After the INSTALL script completed, I tried running a graphical program:

$ 9term
xdg_surface#13: error 3: xdg_surface has never been configured
$

Uh oh! Let's try that again with WAYLAND_DEBUG set:


$ WAYLAND_DEBUG=1 9term
[ ... snip ...]
[2216142.152] {Default Queue} wl_callback#12.done(170396)
[2216142.282] {Default Queue}  -> wl_compositor#5.create_surface(new id wl_surface#12)
[2216142.297] {Default Queue}  -> xdg_wm_base#7.get_xdg_surface(new id xdg_surface#13, wl_surface#12)
[2216142.305] {Default Queue}  -> xdg_surface#13.get_toplevel(new id xdg_toplevel#14)
[2216142.312] {Default Queue}  -> xdg_toplevel#14.set_title("9term")
[2216142.317] {Default Queue}  -> wl_surface#12.frame(new id wl_callback#15)
[2216142.323] {Default Queue}  -> wl_seat#10.get_pointer(new id wl_pointer#16)
[2216142.329] {Default Queue}  -> wl_compositor#5.create_surface(new id wl_surface#17)
[2216142.387] {Default Queue}  -> wl_shm#4.create_pool(new id wl_shm_pool#18, fd 7, 1024)
[2216142.399] {Default Queue}  -> wl_shm_pool#18.create_buffer(new id wl_buffer#19, 0, 16, 16, 64, 0)
[2216142.408] {Default Queue}  -> wl_shm_pool#18.destroy()
[2216142.436] {Default Queue}  -> wl_surface#17.attach(wl_buffer#19, 0, 0)
[2216142.447] {Default Queue}  -> wl_surface#17.damage_buffer(0, 0, 16, 16)
[2216142.455] {Default Queue}  -> wl_surface#17.commit()
[2216142.462] {Default Queue}  -> wl_seat#10.get_keyboard(new id wl_keyboard#20)
[2216142.542] {Default Queue}  -> zxdg_decoration_manager_v1#8.get_toplevel_decoration(new id zxdg_toplevel_decoration_v1#21, xdg_toplevel#14)
[2216142.554] {Default Queue}  -> zxdg_toplevel_decoration_v1#21.set_mode(2)
[2216142.804] {Default Queue}  -> wl_surface#12.set_buffer_scale(1)
[2216142.813] {Default Queue}  -> wl_surface#12.commit()
[2216143.485] {Default Queue}  -> wl_shm#4.create_pool(new id wl_shm_pool#22, fd 8, 1228800)
[2216143.497] {Default Queue}  -> wl_shm_pool#22.create_buffer(new id wl_buffer#23, 0, 640, 480, 2560, 1)
[2216143.508] {Default Queue}  -> wl_shm_pool#22.destroy()
[2216144.858] {Default Queue}  -> wl_surface#12.attach(wl_buffer#23, 0, 0)
[2216144.871] {Default Queue}  -> wl_surface#12.damage_buffer(0, 0, 640, 480)
[2216144.879] {Default Queue}  -> wl_surface#12.commit()
[2216147.428] {Display Queue} wl_display#1.delete_id(18)
[2216147.455] {Default Queue}  -> wl_shm#4.create_pool(new id wl_shm_pool#24, fd 8, 1228800)
[2216147.469] {Default Queue}  -> wl_shm_pool#24.create_buffer(new id wl_buffer#25, 0, 640, 480, 2560, 1)
[2216147.480] {Default Queue}  -> wl_shm_pool#24.destroy()
[2216147.493] {Display Queue} wl_display#1.delete_id(22)
[2216147.501] {Display Queue} wl_display#1.error(xdg_surface#13, 3, "xdg_surface has never been configured")
xdg_surface#13: error 3: xdg_surface has never been configured

What does this mean? I've highlighted the lines that I think are important. Each object has an integer ID. The error message is for xdg_surface#13, which was created here:

[2216142.297] {Default Queue}  -> xdg_wm_base#7.get_xdg_surface(new id xdg_surface#13, wl_surface#12)

We have the following hierarchy of objects:

wl_surface#12
└xdg_surface#13
 └xdg_toplevel#14

There is another set of surfaces derived from wl_surface#17 that appears to be related to the mouse cursor. I will ignore those, because the error does not reference them. The error seems to come from the compositor, so I can search the sway and wlroots source code to find out more. The guix build -S command will acquire and print out the path to the sources for any package, which is handy.

$ grep -n -r 'xdg_surface has never been configured' $(guix build -S sway) $(guix build -S wlroots)
/gnu/store/kcvqa2al535jqrryq4gh0k8cxfkap056-wlroots-0.18.0-checkout/types/xdg_shell/wlr_xdg_surface.c:265:			XDG_SURFACE_ERROR_UNCONFIGURED_BUFFER, "xdg_surface has never been configured");

Here is the code that raises the error:

static void xdg_surface_role_client_commit(struct wlr_surface *wlr_surface) {
	struct wlr_xdg_surface *surface = wlr_xdg_surface_try_from_wlr_surface(wlr_surface);
	assert(surface != NULL);

	if (wlr_surface_state_has_buffer(&wlr_surface->pending) && !surface->configured) {
		wlr_surface_reject_pending(wlr_surface, surface->resource,
			XDG_SURFACE_ERROR_UNCONFIGURED_BUFFER, "xdg_surface has never been configured");
		return;
	}

So the error is raised when a surface is being commited after it has a buffer attached, but before it is configured. What does it mean to be "configured"?

static void xdg_surface_handle_ack_configure(struct wl_client *client,
		struct wl_resource *resource, uint32_t serial) {
	surface->configured = true;

So it's configured when the ack_configure request is completed. Searching the web for this RPC and the error message, I found this blurb from the wayland.app site, which is a very handy aggregation of the wayland protocol and all of its extensions:

After creating a role-specific object and setting it up, the client must perform an initial commit without any buffer attached. The compositor will reply with initial wl_surface state such as wl_surface.preferred_buffer_scale followed by an xdg_surface.configure event. The client must acknowledge it and is then allowed to attach a buffer to map the surface.

Going back to the WAYLAND_DEBUG output:

[2216142.152] {Default Queue} wl_callback#12.done(170396)
[2216142.297] {Default Queue}  -> xdg_wm_base#7.get_xdg_surface(new id xdg_surface#13, wl_surface#12)
[2216142.305] {Default Queue}  -> xdg_surface#13.get_toplevel(new id xdg_toplevel#14)
[2216142.312] {Default Queue}  -> xdg_toplevel#14.set_title("9term")
[2216142.813] {Default Queue}  -> wl_surface#12.commit()
[2216143.497] {Default Queue}  -> wl_shm_pool#22.create_buffer(new id wl_buffer#23, 0, 640, 480, 2560, 1)
[2216144.858] {Default Queue}  -> wl_surface#12.attach(wl_buffer#23, 0, 0)
[2216144.871] {Default Queue}  -> wl_surface#12.damage_buffer(0, 0, 640, 480)
[2216144.879] {Default Queue}  -> wl_surface#12.commit()
[2216147.501] {Display Queue} wl_display#1.error(xdg_surface#13, 3, "xdg_surface has never been configured")

After the first wl_surface.commit(), and before the second wl_surface.commit(), the client should wait for a configure request from the compositor. Taking a look at the devdraw code, if I search for the set_title RPC I find the rpc_attach function:


Memimage *rpc_attach(Client *c, char *label, char *winsize) {
	...

	WaylandClient *wl = calloc(1, sizeof(WaylandClient));

	...

	wl->wl_surface = wl_compositor_create_surface(wl_compositor);
	wl->xdg_surface = xdg_wm_base_get_xdg_surface(xdg_wm_base, wl->wl_surface);
	xdg_surface_add_listener(wl->xdg_surface, &xdg_surface_listener, c);

	wl->xdg_toplevel = xdg_surface_get_toplevel(wl->xdg_surface);
	xdg_toplevel_add_listener(wl->xdg_toplevel, &xdg_toplevel_listener, c);
	xdg_toplevel_set_title(wl->xdg_toplevel, label);

	...

	wl_surface_set_buffer_scale(wl->wl_surface, wl_output_scale_factor);
	wl_surface_commit(wl->wl_surface);

	qunlock(&wayland_lock);
	return wl->memimage;
}

This function appears to match the sequence of RPC in the WAYLAND_DEBUG output up to the first wl_surface.commit(). There's no buffer attached here. That happens elsewhere, in the rpc_flush function:

static void rpc_flush(Client *c, Rectangle r) {
	WaylandClient *wl = (WaylandClient*) c->view;
	qlock(&wayland_lock);

	int w = Dx(wl->memimage->r);
	int h = Dy(wl->memimage->r);
	WaylandBuffer *b = get_xrgb8888_buffer(w, h);
	memcpy(b->data, (char*) wl->memimage->data->bdata, b->size);
	wl_surface_attach(wl->wl_surface, b->wl_buffer, 0, 0);
	wl_surface_damage_buffer(wl->wl_surface, r.min.x, r.min.y, Dx(r), Dy(r));
	wl_surface_commit(wl->wl_surface);
	wl_display_flush(wl_display);

	qunlock(&wayland_lock);
}

This function needs to be guarded by a check that the initial configure request has been completed. I added a configured field to the WaylandClient struct, set it in the callback for the configure handler, and used it to guard the rpc_flush method:

 static void rpc_flush(Client *c, Rectangle r) {
 	WaylandClient *wl = (WaylandClient*) c->view;
 	qlock(&wayland_lock);

+	if (!wl->configured) goto done;
+
 	int w = Dx(wl->memimage->r);
 	int h = Dy(wl->memimage->r);
 	WaylandBuffer *b = get_xrgb8888_buffer(w, h);
 	memcpy(b->data, (char*) wl->memimage->data->bdata, b->size);
 	wl_surface_attach(wl->wl_surface, b->wl_buffer, 0, 0);
 	wl_surface_damage_buffer(wl->wl_surface, r.min.x, r.min.y, Dx(r), Dy(r));
 	wl_surface_commit(wl->wl_surface);
 	wl_display_flush(wl_display);

+done:
 	qunlock(&wayland_lock);
 }

Now, I don't get the error. I don't get anything, actually. No error, no additional logging, no window. WAYLAND_DEBUG output looks like this:

[2533241.366] {Default Queue} wl_callback#12.done(180570)
[2533241.570] {Default Queue}  -> wl_compositor#5.create_surface(new id wl_surface#12)
[2533241.588] {Default Queue}  -> xdg_wm_base#7.get_xdg_surface(new id xdg_surface#13, wl_surface#12)
[2533241.598] {Default Queue}  -> xdg_surface#13.get_toplevel(new id xdg_toplevel#14)
[2533241.605] {Default Queue}  -> xdg_toplevel#14.set_title("9term")
[2533242.148] {Default Queue}  -> wl_surface#12.set_buffer_scale(1)
[2533242.157] {Default Queue}  -> wl_surface#12.commit()
[2533245.711] {Default Queue}  -> xdg_toplevel#14.set_title("rc 17355")

The premature wl_surface_attach call is gone, but we never get a configure request from the compositor. Why? I started a sub-sway with WAYLAND_DEBUG:

echo 'exec 9term' > swayconfig
WAYLAND_DEBUG=1 sway --debug -c swayconfig

This doesn't show me anything new, really, just the same set of calls that the client makes, and no configure request, still. Recompiling devdraw with the wayland_debug constant set to 1 doesn't tell me much more, either, just confirming that the xdg_surface_configure callback is never called. To figure out why the configure request never comes from the compositor, I have to go back to the wlroots and sway source. With some grepping, I find the xdg_surface_send_configure function called by wlroots. Working backwards from there, I come to this handler in sway:

static void handle_commit(struct wl_listener *listener, void *data) {
	...

	if (xdg_surface->initial_commit) {
		if (view->xdg_decoration != NULL) {
			set_xdg_decoration_mode(view->xdg_decoration);
		}
		wlr_xdg_surface_schedule_configure(xdg_surface);

So a configure request is queued when a client commits a surface whose initial_commit field is true. I inspected the locations where this field is set and couldn't find any way for it to be flipped erroneously. Besides, other wayland clients like foot were working just fine, so I went back to looking at devdraw.

Zooming out a bit, devdraw runs with two threads; one to handle draw(3) RPC requests from clients like 9term and acme, and another to interact with the windowing system. It's this latter thread which is responsible for sending and receiving requests to and from the wayland compositor. It spends most of its time in gfx_main, here:

	while(wl_display_dispatch(wl_display))
		;

I wondered if the main loop was working, so I put in a print statement:

	while(wl_display_dispatch(wl_display))
		DEBUG("dispatch\n");

I never saw the print statement, so wl_display_dispatch was blocking, which its docs say it will do when there are no new incoming events. It's about time to load up the debugger. 9term and the other graphical processes run devdraw in a child process and hook it up to the actual program, so I have to use the follow-fork-mode child setting in gdb.

$ WAYLAND_DEBUG=1 gdb 9term
(gdb) set follow-fork-mode child
(gdb) break gfx_main
Function "gfx_main" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (gfx_main) pending.
(gdb) run
[Switching to Thread 0x7ffff7c57b80 (LWP 10592)]

Thread 3.1 "devdraw" hit Breakpoint 1, gfx_main () at wayland.c:814
814		DEBUG("gfx_main called\n");
(gdb)

while single-stepping from there, to my surprise, a window popped up! Here's the WAYLAND_DEBUG output, interleaved with my steps, edited for brevity:

854		gfx_started();
(gdb)
[New Thread 0x7ffff7c006c0 (LWP 11173)]

The gfx_started function creates another thread in srv.c which will request the new window via the rpc_attach function.

rpc_attach(9term)
[3433831.918] {Default Queue}  -> wl_compositor#5.create_surface(new id wl_surface#12)
[3433831.965] {Default Queue}  -> xdg_wm_base#7.get_xdg_surface(new id xdg_surface#13, wl_surface#12)
[3433831.989] {Default Queue}  -> xdg_surface#13.get_toplevel(new id xdg_toplevel#14)
[3433832.008] {Default Queue}  -> xdg_toplevel#14.set_title("9term")
...
856		while(wl_display_dispatch(wl_display))
(gdb)
[3434050.760] {Default Queue}  -> wl_surface#12.set_buffer_scale(1)
[3434050.873] {Default Queue}  -> wl_surface#12.commit()

rpc_attach configures and commits a new surface without a buffer, like it's supposed to. After the first call to wl_display_dispatch, the commit is sent.

[3434056.855] {Display Queue} wl_display#1.delete_id(18)
[3434056.939] {Default Queue} wl_buffer#19.release()
[3434056.952] {Default Queue}  -> wl_buffer#19.destroy()
[3434056.962] {Default Queue} wl_keyboard#20.keymap(1, fd 6, 63738)
wl_keyboard_keymap
[3434061.826] {Default Queue} xdg_wm_base#7.ping(244929)
[3434061.906] {Default Queue}  -> xdg_wm_base#7.pong(244929)
[3434061.925] {Default Queue} xdg_toplevel#14.configure(0, 0, array[0])
xdg_toplevel_configure(width=0, height=0)
[3434061.951] {Default Queue} xdg_surface#13.configure(244928)
xdg_surface_configure
[3434061.969] {Default Queue}  -> xdg_surface#13.ack_configure(244928)

The compositor replies with an xdg_wm_ping, then a configure request for the xdg_toplevel surface and its parent xdg_surface. This already looked different from what I was seeing before, so I input the continue command to let the program run freely. To my surprise, a window popped up!

9term runs after single-stepping through gdb

So why does it work when I slow everything down, but it not when it runs at full speed? I slowed down the gfx_main thread, but the other thread was running at normal speed. There must be a race condition here. Looking at gfx_main:

	entered_gfx_loop = 1;
	gfx_started(); <-- this creates a thread which calls rpc_attach

	while(wl_display_dispatch(wl_display)) <-- this sends & receives wayland events & requests
		;

The rpc_attach and wl_display_dispatch functions are running in different threads. It could be that the wl_commit in rpc_attach needs to happen before wl_display_dispatch flushes messages out to the compositor. A really low-tech way to test this theory would be to insert a sleep right before we enter the main loop:

	entered_gfx_loop = 1;
	gfx_started();
	sleep(1)
	while(wl_display_dispatch(wl_display))
		;

This works! I'm able to run 9term, acme, and sam. Now we're getting somewhere, but the sleep() call is a hack -- how do we really fix this? Without the sleep(), the debug output indicates that wl_display_dispatch is reached before rpc_attach begins:

$ 9term
gfx_main called
wl_output_scale(factor=1)
gfx_main: entering loop
rpc_attach(9term)

So it appears when wl_display_dispatch blocks, new messages will not be sent. Reading the code of wl_display_dispatch, it appears to do the following:

  1. Call wl_display_flush to write all pending requests to the compositor.
  2. Read 1 or more events from the compositor, blocking if necessary.

So the wl_commit is buffered, after wl_display_dispatch reaches step 2. We can fix this by throwing in an explicit wl_display_flush at the end of the rpc_attach function:

Memimage *rpc_attach(Client *c, char *label, char *winsize) {
	...
	wl_surface_commit(wl->wl_surface);
	wl_display_flush(wl_display);

	qunlock(&wayland_lock);
	return wl->memimage;
}

We've got a window!

Acme running natively on a wayland compositor

At first impression, mouse warping looks like it works, but there's a problem:

You can see, when I right-click on the text "zxdg_toplevel_decoration_v1", Acme warps the pointer cursor successfully to the next location in the buffer matching that string. However, when I press the mouse button again, without moving the cursor, the button press is registered at the old location, before the pointer warp took place. This is actually even more disruptive than the behavior on xwayland, which didn't warp the cursor at all, but it can't be that hard to fix, right?

The pointer is moved programatically in the rpc_setmouse function:

static void rpc_setmouse(Client *c, Point p) {
	WaylandClient *wl = (WaylandClient*) c->view;
	qlock(&wayland_lock);

	struct zwp_locked_pointer_v1 *lock = zwp_pointer_constraints_v1_lock_pointer(
		pointer_constraints, wl->wl_surface, wl->wl_pointer, NULL,
		ZWP_POINTER_CONSTRAINTS_V1_LIFETIME_PERSISTENT);
	int x = wl_fixed_from_int(p.x / wl_output_scale_factor);
	int y = wl_fixed_from_int(p.y / wl_output_scale_factor);
	zwp_locked_pointer_v1_set_cursor_position_hint(lock, x, y);
	wl_surface_commit(wl->wl_surface);
	zwp_locked_pointer_v1_destroy(lock);

	qunlock(&wayland_lock);
}

comparing it with the x11 implementation in x11-screen.c, it's similar, just with a little more ceremony. I suspect that in x11, the XWarpPointer function generates an event with the new pointer position, but the wayland compositor does not. We can fix this by updating Acme's idea of the pointer position directly in rpc_setmouse:

struct zwp_locked_pointer_v1 *lock = zwp_pointer_constraints_v1_lock_pointer(
	pointer_constraints, wl->wl_surface, wl->wl_pointer, NULL,
	ZWP_POINTER_CONSTRAINTS_V1_LIFETIME_PERSISTENT);
int x = wl_fixed_from_int(p.x / wl_output_scale_factor);
int y = wl_fixed_from_int(p.y / wl_output_scale_factor);
zwp_locked_pointer_v1_set_cursor_position_hint(lock, x, y);
wl_surface_commit(wl->wl_surface);
zwp_locked_pointer_v1_destroy(lock);

wl->mouse_x = wl_fixed_to_int(x) * wl_output_scale_factor;
wl->mouse_y = wl_fixed_to_int(y) * wl_output_scale_factor;
int b = wl->buttons;

qunlock(&wayland_lock);
gfx_mousetrack(c, x, y, b, nsec() / 1000000);

Success!

(You might notice that the pointer in the video looks strange, like it's missing transparency outside of the shaded area. This artifact only shows up in the video generated by wf-recorder. On my screen, the pointer looks fine)

I've opened a pull request with these changes. There are still a few things to improve:

That said, it is now usable enough for me to move to Wayland as my daily driver.

Since I am relying on the XDG and the Pointer Constraints wayland protocol extensions, both of which are not part of the core wayland protocol, an update to a compositor could break devdraw in the future. At least for the XDG protocol, it is marked stable, so a breaking change is less likely. It would be nice to have some automated tests in place to run before upgrading.

Future work

In addition to fixing the minor issues above, there are a few potential projects that I'd be interested in working on:

References