Editor's note: this is a project that I started several years ago and left unfinished, so there is a large time skip in the middle of the article, and the project has changed considerably. Some parts at the beginning may contradict later parts. Sorry. If you just want to see the code, it's here.

Building a network device manager in OCaml

I'm building a suite of services for running virtual machines in my home network. I am turning my home PC, my laptop, and a few machines I plan to rent from cloud providers into a cohesive, low cost, uniform system for building, deploying and maintaining software. I will write more about it in the future. For this post, I just wanted to walk through the process of using OCaml to call C functions, with the purpose of writing local services and utilities to run on a Unix system.

This will be more of a "build log", where I've tried to include any dead ends, errors, misconceptions, or other problems I encountered. My intention is to show others how they can dig themselves out of any holes they fall into. I've included most of the error messages I've encountered to help anyone doing a web search for error strings.

I'm also using this as a learning opportunity to get more comfortable with OCaml, so some of the abstractions I make may seem unnecessary; they probably are. But I'm deliberately trying to bump up against the areas of OCaml and its ecosystem that I don't understand very well yet, in the hope that in the future, if I do need to go there, I'll be able to fall back on this experience.

I'll be building two tools:

tapalloc - a privileged service that provisions network devices and an unprivileged client that retrieves file descriptors for those devices over a unix socket.
tapautoconf - a service that uses netlink(7) to snoop on DHCP/ICMPv6 traffic coming from a tap device to configure IPv4/IPv6 addresses on it so they can receive traffic, allowing ipvlan-based devices to configure themselves.

The goal of these tools is to enable me to create VMs and containers with their own "real" IP addresses, on-demand, in a way that I find palatable.

Allocating network devices

The tapallocd programs allocate different kinds of TAP devices on behalf of a client process, and pass file descriptor(s) for the device to the client over a unix socket. These file descriptors can then be used by a hypervisor like qemu or firecracker as a VM's network interface. They are intended as an alternative to helper programs such as qemu-bridge-helper, and do not require a setuid-root binary. Ultimately I plan to re-implement the following pattern for acquiring other kinds of system resources:

Client A connects to unix socket for allocator service B
Allocator service B checks some policy to decide if A is allowed access to the resource. It can query the SO_PEERCRED option to learn the uid, gid, and pid of the calling process.
If the check is successful, B allocates the resource, opens the resource files, and passes the open file descriptor to A over the unix socket.
B closes its copy of the file descriptor, so that A retains the sole reference to the resource. Once A closes the file descriptor, explicitly or by exiting, the resource is collected.

This gives me the flexibility to perform permissions checks of arbitrary complexity, and to really restrict the actions an unprivileged process is allowed to take. As a comparison to the capabilities(7) system in Linux, giving a process the CAP_NET_ADMIN capability required to provision and configure network interfaces grants permission to do far more, like add or modify entries in the system's routing table.
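As a rough illustration of the policy check in the second step, here is a minimal sketch of a decision function over the peer's credentials. The peer record, the decision type, and the allowed_gids parameter are hypothetical names chosen for this example and are not part of tapalloc. Fetching the credentials themselves would need a small C stub, since as far as I know OCaml's standard Unix module does not expose SO_PEERCRED.

```ocaml
(* Credentials of the connecting process, as SO_PEERCRED reports them.
   The record and its field names are hypothetical. *)
type peer = { pid : int; uid : int; gid : int }

type decision = Allow | Deny of string

(* Example policy (an assumption, not tapalloc's real policy): root is
   always allowed; anyone else must belong to one of the allowed groups. *)
let check ~allowed_gids peer =
  if peer.uid = 0 then Allow
  else if List.mem peer.gid allowed_gids then Allow
  else
    Deny
      (Printf.sprintf "uid %d (pid %d) is not in an allowed group"
         peer.uid peer.pid)

let vm_user = check ~allowed_gids:[ 100 ] { pid = 4242; uid = 1000; gid = 100 }
let stranger = check ~allowed_gids:[ 100 ] { pid = 4243; uid = 1001; gid = 50 }
```

Because the check is an ordinary function, it can grow to consult configuration files, rate limits, or anything else, without changing what the client is able to do on its own.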

On Linux, network devices are created and configured using the netlink(7) interface. The Linux kernel supports an additional type of address family for sockets, AF_NETLINK. A process that wants to create a network interface opens a socket of this type and sends and receives messages over the file descriptor. I will be interfacing with the libmnl companion library, which will perform most of the nitty-gritty work of marshalling and unmarshalling these messages.

Traditionally, when you wanted to call a C function from OCaml, you would manually write a stub: a small C library with intermediate functions that convert parameters from their representation in the OCaml runtime to the required representation for the system call (such as a char* or a struct stat), and convert the return value in the reverse direction. It is not unlike the process of extending Python with C. Here is an example of a stub I wrote to expose the vectored IO system calls.

There is an OCaml package called ctypes that partially automates the generation of these "stub" files for you. Rather than writing the stub files directly, you write a smaller description of the C functions you would like to call, as an OCaml module, and the stubs are then generated from your description.

Project setup

I use GNU guix to provision my development environment. In the new repository I write the file guix.scm with a package definition, following this example:

(package
  (name "tapalloc")
  (version "dev")
  (source (local-file "." #f #:recursive? #t))
  (synopsis "") (description "") (home-page "")
  (build-system dune-build-system)
  (inputs
    (list
      ocaml-ctypes
      ocaml-fmt
      linux-libre-headers
      libmnl
      gcc-toolchain
      pkg-config))
  (license isc))

This describes a package built from sources in the current directory using the dune build system. The inputs list describes the dependencies needed to build this package. With such a file in the project's root directory, I can run the command

$ guix shell -D

which drops me into a new shell with the listed input packages available. With OCaml, most new projects manage their dependencies and packaging with opam. I use guix instead of opam because guix is agnostic to any one programming language, it is easy to extend, it is easy to modify, patch, or replace dependencies, and it is easier to use it to get reproducible builds; unless I really mess up, someone else should be able to follow the same process I did and produce the exact same binary. Later on, I want to use guix as part of an automated build and test system. That said, the use of guix is still optional, and the README file contains instructions for building the package without guix.

With my dependencies loaded, I create the directory structure I want:

The libmnl/ directory will contain source for the Libmnl OCaml module, which is a thin wrapper around libmnl.
The tapalloc/ directory will have the source for the executable programs tapalloc-macvtap and tapallocd-ipvtap.
The contrib/ directory will contain an example script to run these programs under s6-ipcserver.

My usual process when starting a new project is to begin with the smallest meaningful goal; a "hello, world" of sorts. This lets me validate the build process without getting bogged down trying to write the whole program from the start.

To start, then, I write a simple program, tapalloc/hello.ml, whose only purpose is to open a netlink socket using the mnl_socket_open function, and exit.

(* hello.ml *)
let () =
  match Libmnl.(socket_open NETLINK_ROUTE) with
  | exception _ -> Fmt.epr "failure%!"
  | sock -> Fmt.epr "success%!"

I can then put build instructions in tapalloc/dune:

(test
 (name hello)
 (modules hello)
 (libraries libmnl fmt))

And start a test loop with the command

dune test -w

This test will re-run itself every time it detects a change in its dependencies. I keep this running in a visible window while I work on the project. In general, when I am working on a project in any language, I always have a visible window that is automatically updated with test or build results, so I can get constant feedback.

Unsurprisingly, the test fails with the error:

Entering directory '/src/dev.arroyo.cc/tapalloc'
File "tapalloc/dune", line 4, characters 12-18:
4 |  (libraries libmnl fmt))
                ^^^^^^
Error: Library "libmnl" not found.
-> required by _build/default/tapalloc/hello.exe
-> required by alias tapalloc/runtest in tapalloc/dune:2
Had errors, waiting for filesystem changes...

As it says, the libmnl library does not exist. We can write libmnl/dune:

(library
 (name libmnl))

Now the error changes to

Error: Unbound value socket_open

It's time to stop beating around the bush and define this library. With ctypes, there are 2 ways to call functions in a C library:

Loading the C library at runtime using dlopen() and jumping into it. The function must first be described with the foreign function before it can be used. Assuming your description is accurate, the OCaml compiler will prevent you from calling the function with bad input. However, if your description is incorrect, you may have a run-time error, or worse. It is also not possible to use this method in a statically-linked binary.
Using the Cstubs sub-package which will generate a .c file that you must compile and link into your binary, along with the external library. The benefit is that if your description is incorrect, the C compiler is likely to throw an error, because the generated .c file will have an error. This method is also amenable to static linking. The drawback is increased complexity in your build process.

I prefer to statically link my binaries when feasible, and I want to catch as many errors as I can at build time, so I am using the second option of generating .c files. Luckily, the dune build tool makes this quite easy. With dune 3.0 and above, it even has a dedicated (ctypes ...) stanza that will build the stub files for you. It requires you to define two modules:

A module named Functions describing functions in the external C library
A module named Types describing constants

I put these modules in a single file, libmnl/c_stub.ml:

open Ctypes

module Functions (F : Ctypes.FOREIGN) = struct
  open F

  type mnl_socket
  let mnl_socket : mnl_socket structure typ = structure "mnl_socket"
  let mnl_socket_open = foreign "mnl_socket_open" (int @-> returning (ptr mnl_socket))
end

module Types (F : Ctypes.TYPE) = struct
  open F
  let netlink_route = constant "NETLINK_ROUTE" int
  let netlink_generic = constant "NETLINK_GENERIC" int
end

The libmnl library defines the mnl_socket struct in its header like so:

struct mnl_socket;

That is, external users of the library do not have access to the fields of the struct; it is a so-called "opaque struct". Such a struct maps well to OCaml as an abstract type (a type with a hidden implementation):

type mnl_socket

To describe a C function for the ctypes module, we make use of the @-> operator and some helper functions. "@->" is meant to look like the -> syntax that is used in ordinary OCaml function signatures. If you squint, the description

let mnl_socket_open = int @-> returning (ptr mnl_socket)

Looks almost like an OCaml function signature, such as

val mnl_socket_open : int -> mnl_socket ptr

It is slightly different, though: in the ctypes description, the words int, returning, ptr and mnl_socket are ordinary OCaml values, whereas the corresponding words in the OCaml function signature are types. The line

let mnl_socket : mnl_socket structure typ = structure "mnl_socket"

Unites the value mnl_socket with the type mnl_socket.
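To see what it means for a description to be a value, here is a toy model of a ctypes-style description language. It borrows the names int, ptr, returning and @-> so the description reads the same, but it is a stand-alone sketch and not the real ctypes implementation; the prototype function is my own invention for the demonstration.

```ocaml
(* Toy model only: mirrors the look of ctypes, not its implementation. *)
type _ typ =
  | Int : int typ
  | Ptr : string -> nativeint typ (* pointer to a named opaque struct *)

type _ fn =
  | Returns : 'a typ -> 'a fn
  | Function : 'a typ * 'b fn -> ('a -> 'b) fn

let int = Int
let ptr name = Ptr name
let returning t = Returns t
let ( @-> ) arg rest = Function (arg, rest)

let c_type : type a. a typ -> string = function
  | Int -> "int"
  | Ptr name -> "struct " ^ name ^ " *"

(* Walk the description, collecting argument types and the return type. *)
let rec split : type a. a fn -> string list * string = function
  | Returns t -> ([], c_type t)
  | Function (arg, rest) ->
    let args, ret = split rest in
    (c_type arg :: args, ret)

(* Because a description is data, a program can inspect it, for example
   to print the C prototype it stands for. *)
let prototype name fn =
  let args, ret = split fn in
  Printf.sprintf "%s %s(%s);" ret name (String.concat ", " args)

let mnl_socket_open = int @-> returning (ptr "mnl_socket")
let proto = prototype "mnl_socket_open" mnl_socket_open
```

Here proto evaluates to the string "struct mnl_socket * mnl_socket_open(int);". A stub generator can work in the same spirit: it walks a description and emits C source from it.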

The two modules have the names Functions and Types. The rules implementing dune's ctypes stanza are hard-coded to look for modules with these names.

The Types module only holds two constants for now; I will add more as needed.

In libmnl/dune I add the (ctypes ...) stanza:

(library
 (name libmnl)
 (ctypes
  (external_library_name libmnl)
  (headers (include "libmnl/libmnl.h" "linux/netlink.h"))

  (type_description
   (instance Const)  ;; symbols accessed under C.Const
   (functor c_stub)) ;; look for c_stub.Types

  (function_description
   (instance Libmnl) ;; symbols accessed under C.Libmnl
   (functor c_stub)) ;; look for c_stub.Functions

  (generated_entry_point C))) ;; nest modules under module name "C"

At this point I got a nasty surprise in my test loop:

********** NEW BUILD (libmnl/c_stub.ml changed) **********

File "libmnl/_unknown_", line 1, characters 0-0:
In file included from /gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes/ctypes_cstubs_internals.h:13,
                 from /src/dev.arroyo.cc/tapalloc/_build/default/libmnl/libmnl__c_cout_generated_types.c:4:
/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes/ctypes_primitives.h:16:10: fatal error: ocaml_integers.h: No such file or directory
   16 | #include "ocaml_integers.h"
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.
Had errors, waiting for filesystem changes...

What is "ocaml_integers.h" and what did I ever do to it? Since the header name is fairly unique, I can search my system for it:

$ find /gnu/store -name 'ocaml_integers.h'
/gnu/store/g48m9g91754zssif06mn66gmjswq619i-ocaml-integers-0.7.0/lib/ocaml/site-lib/integers/ocaml_integers.h

So it is part of the ocaml-integers package. It is present in my dev shell:

$ ocamlfind query integers
/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/integers

So what's wrong? Looking at the dune build log, at $project_root/_build/log, here is the command that dune attempts to run:

gcc \
	-I/gnu/store/dj7kw3mqasw0rxdbm1gkajgsznhw8b4h-libmnl-1.0.4/include \
	-I/gnu/store/sjkj3bi7nx8bd02i8z0d43b85xm76rj5-ocaml-4.14.0/lib/ocaml \
	-I/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes \
	-o /src/dev.arroyo.cc/tapalloc/_build/default/libmnl/libmnl__c_cout_generated_types.exe \
	libmnl/libmnl__c_cout_generated_types.c

The problem is that the .../lib/ocaml/site-lib/integers directory is not added to gcc's header search path. This does not happen when I am using opam instead of guix. After a bit of trial and error I identified the problem; the ctypes package, during its installation process, copies this header file from the integers package into its own installation directory, and the guix package was overriding that behavior. I sent a patch to Guix that removes this override, and for the time being, did a local override like so:

(use-modules
  (ice-9 match)
  (guix transformations))

(define ocaml-ctypes-fixbuild
  (package
    (inherit ocaml-ctypes)
    (arguments
      (match (package-arguments ocaml-ctypes)
            ((#:make-flags flags . rest) rest)))))

((package-input-rewriting `((,ocaml-ctypes . ,ocaml-ctypes-fixbuild)))
 (package
   (name "tapalloc")
   (version "dev")
   (source (local-file "." #f #:recursive? #t))
   (synopsis "Allocates ephemeral TAP devices on demand.")
   (description "")
   (home-page "")
   (build-system dune-build-system)
   ;; gcc is needed for stdarg.h
   (inputs (list ocaml-ctypes ocaml-fmt linux-libre-headers libmnl gcc-toolchain))
   (native-inputs (list pkg-config))
   (license expat)))

With that digression out of the way, we get a new error message:

File "libmnl/libmnl__c_generated_types.ml", line 25, characters 24-31:
25 | let enum (type a) name ?typedef ?unexpected (alist : (a * int64) list) =
                             ^^^^^^^
Error (warning 27 [unused-var-strict]): unused variable typedef.

The dune manual warns us about these errors in its example:

; ctypes backward compatibility shims warn sometimes; suppress them
(flags (:standard -w -9-27))

I hope that in the future, dune can automatically make these warnings non-fatal, or that ctypes can stop triggering them, because the current solution encourages developers to copy build directives they don't understand, and they are unlikely to know when the suppression is no longer necessary. With the warnings suppressed, we finally get a real error:

File "tapalloc/hello.ml", line 2, characters 16-27:
2 |   match Libmnl.(socket_open NETLINK_ROUTE) with
                    ^^^^^^^^^^^
Error: Unbound value socket_open

The build of the C stub files completes, and we're now back to calling functions that do not exist yet. Just to confirm things are working, I change the test to call the generated C function directly:

(* tapalloc/hello.ml *)
let () =
  match Libmnl.C.Libmnl.mnl_socket_open 0 with
  | _ -> Fmt.pr "success!"

The next error is

ld: libmnl/liblibmnl_stubs.a(libmnl__c_cout_generated_functions__c_stub__libmnl.o): in function `libmnl_stubs_1_mnl_socket_open':
_build/default/libmnl/libmnl__c_cout_generated_functions__c_stub__libmnl.c:6: undefined reference to `mnl_socket_open'
collect2: error: ld returned 1 exit status
File "caml_startup", line 1:
Error: Error during linking (exit code 1)

The error complains about an undefined reference to mnl_socket_open, a C function in libmnl. This shouldn't happen, as the library (.cmxa) file generated for the libmnl module should contain the necessary directives for the C linker:

$ ocamlobjinfo _build/default/libmnl/libmnl.cmxa  | head
File _build/default/libmnl/libmnl.cmxa
Extra C object files: -lmnl -llibmnl_stubs -L/gnu/store/dj7kw3mqasw0rxdbm1gkajgsznhw8b4h-libmnl-1.0.4/lib -lmnl

And the libmnl.so library itself does indeed have a public symbol mnl_socket_open:

$ readelf -sW $(guix build libmnl)/lib/libmnl.so| grep mnl_socket_open
    33: 0000000000002340    93 FUNC    GLOBAL DEFAULT   13 mnl_socket_open@@LIBMNL_1.0
    73: 00000000000023a0    85 FUNC    GLOBAL DEFAULT   13 mnl_socket_open2@@LIBMNL_1.2

So what gives?

I find the best thing to do in times like these is to find an example build that does work, and iteratively change that example, one piece at a time, until it looks like the build that does not work. At some point, the build will break, and that will bring you closer to the root cause. So, starting with the example in dune's documentation, I changed from building a libmnl library to building a libmnl executable:

;; libmnl/dune
(executable
 (name libmnl)
 (libraries fmt)
 (flags (:standard -w -9-27 -verbose))
 (ctypes
  (external_library_name libmnl)
  (headers (include "libmnl/libmnl.h"))

  (type_description
   (instance Const)  ;; symbols accessed under C.Const
   (functor c_stub)) ;; look for c_stub.Types

  (function_description
   (instance Libmnl) ;; symbols accessed under C.Libmnl
   (functor c_stub)) ;; look for c_stub.Functions

  (generated_entry_point C))) ;; nest modules under module name "C"

and embedding tapalloc/hello.ml into this executable:

(* libmnl/libmnl.ml *)
open C

type bus =
  | NETLINK_ROUTE
  | NETLINK_GENERIC

let bus_to_int = function
  | NETLINK_ROUTE -> Const.netlink_route
  | NETLINK_GENERIC -> Const.netlink_generic

let socket_open bus = Libmnl.mnl_socket_open (bus_to_int bus)

let () =
  match socket_open NETLINK_ROUTE with
  | _ -> Fmt.pr "opened a netlink socket!\n"

This works!

$ dune exec libmnl/libmnl.exe
opened a netlink socket!

So it can build the executable when the c stubs are defined in the same stanza in dune, but triggers a linker error when the executable and library are separate. What's going on? Adding (flags (:standard -verbose)) to tapalloc/dune tells the OCaml compiler to print the external commands it runs to link the program. Here's the command, with a bit of formatting added and unimportant flags omitted:

 1 gcc -o 'tapalloc/hello.exe' \
 2   '-L$GUIX_ENVIRONMENT/lib/ocaml' \
 3   '-L$GUIX_ENVIRONMENT/lib/ocaml/site-lib/integers' \
 4   '-L$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes' \
 5   '-Llibmnl' \
 6   '/tmp/build_1e9f8a_dune/camlstartup085117.o' \
 7   '$GUIX_ENVIRONMENT/lib/ocaml/std_exit.o' \
 8   'tapalloc/.hello.eobjs/native/dune__exe__Hello.o' \
 9   'libmnl/libmnl.a' \
10   '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes/cstubs.a' \
11   '$GUIX_ENVIRONMENT/lib/ocaml/str.a' \
12   '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes/ctypes.a' \
13   '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/integers/integers.a' \
14   '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/bigarray-compat/bigarray_compat.a' \
15   '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/fmt/fmt.a' \
16   '$GUIX_ENVIRONMENT/lib/ocaml/stdlib.a' \
17   '-llibmnl_stubs' \
18   '-L$GUIX_ENVIRONMENT/lib' \
19   '-lmnl' \
20   '-Wl,--no-as-needed' \
21   '-lcamlstr' \
22   '-lctypes_stubs' \
23   '-lintegers_stubs' \
24   '-Wl,--no-as-needed' \
25   '-lintegers_stubs' \
26   '$GUIX_ENVIRONMENT/lib/ocaml/libasmrun.a' \
27   -lm \
28   -ldl

Do you see the problem? If you don't, don't feel bad; it was only after I had dug into this for the better part of a day, almost given up, and taken a break (and a drink) that I realized what was wrong. When I build the libmnl module as a library instead of an executable, dune builds the archive _build/default/libmnl/libmnl.a. Then, when it comes time to build hello.exe, the archive's parent directory is added to gcc's library search path with the -Llibmnl directive (line 5), ahead of the directive that adds the directory of the "real" library, -L$GUIX_ENVIRONMENT/lib (line 18). As a result, the -lmnl directive tells gcc to link _build/default/libmnl/libmnl.a, not $GUIX_ENVIRONMENT/lib/libmnl.so. The local library, which shadows the external one while also depending on it, obviously doesn't have a symbol for mnl_socket_open.

Changing the library name from libmnl to mnl does the trick. We've reached the first milestone, and our major build troubles should be behind us.
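The shadowing can be reproduced with a toy model of the search the linker performs for -lNAME: walk the -L directories in order and, in each one, look for libNAME.so and then libNAME.a. This is a simplified sketch of that lookup, not a description of everything ld does, and the directory /env/lib is a made-up stand-in for $GUIX_ENVIRONMENT/lib.

```ocaml
(* Simplified model of -lNAME resolution: first match in -L order wins. *)
let resolve ~search_path ~exists name =
  let candidates dir =
    [ dir ^ "/lib" ^ name ^ ".so"; dir ^ "/lib" ^ name ^ ".a" ]
  in
  List.find_opt exists (List.concat_map candidates search_path)

(* Pretend filesystems before and after renaming the dune library.
   The paths are illustrative only. *)
let files_before = [ "libmnl/libmnl.a"; "/env/lib/libmnl.so" ]
let files_after = [ "mnl/mnl.a"; "/env/lib/libmnl.so" ]

let before =
  resolve "mnl"
    ~search_path:[ "libmnl"; "/env/lib" ]
    ~exists:(fun f -> List.mem f files_before)

let after =
  resolve "mnl"
    ~search_path:[ "mnl"; "/env/lib" ]
    ~exists:(fun f -> List.mem f files_after)
```

In this model, before is Some "libmnl/libmnl.a" (the local archive wins) and after is Some "/env/lib/libmnl.so", because an archive named mnl.a no longer matches the libNAME pattern for -lmnl.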

The programs in the tapalloc project are pretty simple; the client reads a file descriptor from a unix socket. The servers provision a single interface, and send an open file descriptor referencing it over said unix socket. So, in addition to the libmnl library, I will need functions to read and write file descriptors over a unix socket.

Error handling

In C, like in OCaml, functions can only return a single value. This raises the question of how to return an error from a C function. In OCaml, you can raise an exception, or use a sum type like result:

# type ('ok, 'err) result =
    Ok of 'ok
  | Error of 'err

# let safe_incr x =
  if x = Int.max_int then
    Error "overflow"
  else
    Ok (x + 1)
;;
val safe_incr : int -> (int, string) result

While you could use a C union to mimic a result type, it would require unsafe practices or the declaration of a union type for each unique return type, and may impose additional memory management on the caller of the function. Instead, C functions typically return a single value like an integer or a pointer, and use a return value outside of the function's range to signify an error. The caller can then check the (sorta) global variable errno to learn more about the nature of the error. For instance here is the signature of the read(2) function:

ssize_t read(int fd, void *buf, size_t count);

If an I/O error occurs, read should return -1 and set errno to the value EIO. -1 is outside of the range of read(2) because it is impossible to read a negative number of bytes. Functions that return a pointer typically use NULL to signify an error. Functions that return a floating point number may return HUGE_VAL or NaN (see the glibc manual).

By default, the code generated by ctypes does not check for errors, and just returns the return value of the C function call. You can specify an errno_policy of return_errno to signal to the Cstubs code generator that functions should sample the value of errno after calling a C function and return it. So, to add error handling for the mnl_socket_open function, I can change its wrapper to this:

(* mnl/mnl.ml *)
let socket_open bus =
  match Libmnl.mnl_socket_open (bus_to_int bus) with
  | (p, errno) when Ctypes.is_null p -> Error errno
  | (sock, _) -> Ok sock

To make the code a little nicer, I can change the signature of mnl_socket_open to:

- foreign "mnl_socket_open" (int @-> returning (ptr mnl_socket))
+ foreign "mnl_socket_open" (int @-> returning (ptr_opt mnl_socket))

This way, NULL is returned as None, and a non-NULL becomes Some p. Then my wrapper becomes:

let socket_open bus =
  match Libmnl.mnl_socket_open (bus_to_int bus) with
  | (None, errno)  -> Error errno
  | (Some sock, _) -> Ok sock

There is another roadblock for us: ctypes provides the errno_policy parameter for its code generator, but dune does not expose it in its (ctypes ...) stanza. I wrote a patch and added it to my local dune by modifying guix.scm like so. With the patched dune, I can augment the dune file as follows:

   (function_description
+   (errno_policy return_errno)
    (instance Libmnl)
    (functor c_stub))

Unfortunately, I'm met with a new error 🙃

ld: mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o: warning: relocation \
  against `Caml_state' in read-only section `.text'
ld: mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o: relocation R_X86_64_PC32 \
  against undefined symbol `Caml_state' can not be used when making a shared object; \
  recompile with -fPIC
ld: final link failed: bad value
collect2: error: ld returned 1 exit status

Here is a partial diff of the .c file that builds the .o file referenced in the error, with ignore_errno vs return_errno:

 value libmnl_stubs_1_mnl_socket_open(value x1)
 {
    int x2 = Long_val(x1);
+   errno = 0;
    struct mnl_socket* x5 = mnl_socket_open(x2);
-   return CTYPES_FROM_PTR(x5);
+   value x6 = CTYPES_FROM_PTR(x5);
+   return ctypes_pair_with_errno(x6);
 }

With the return_errno policy, the C function now has to call ctypes_pair_with_errno to build a (value, errno) tuple. This function, in turn, calls caml_alloc_tuple, which calls caml_alloc, which calls Alloc_small, which calls Alloc_small_with_profinfo, which references the Caml_state symbol. Caml_state appears to be a block of memory holding all of the state for the current runtime domain. I can see that the object file contains the following relocation records:

$ objdump -r _build/default/mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE
0000000000000015 R_X86_64_PC32     Caml_state-0x0000000000000004
0000000000000027 R_X86_64_PC32     Caml_state-0x0000000000000004
0000000000000039 R_X86_64_PC32     Caml_state-0x0000000000000004
0000000000000078 R_X86_64_PC32     Caml_state-0x0000000000000004
000000000000008d R_X86_64_PC32     Caml_state-0x0000000000000004

and these are not present when using the ignore_errno policy. The error message suggests recompiling with the -fPIC option. I can add it to the dune files:

(library
  (name mnl)
  (ocamlopt_flags (:standard -fPIC)))

However, this doesn't seem to make any difference; I still get the same error. To investigate further, I can re-run the build with strace to snoop on the ld command that gets run:

$ strace --trace=execve \
  --string-limit=1000 \
  --follow-forks \
  --successful-only \
  dune build

First, I can see that the specific gcc command that fails is this one:

gcc -shared  -g -o mnl/dllmnl_stubs.so mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o

And if I look further in the strace output, I can confirm that, indeed,

mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o

was not compiled with -fPIC. If I rebuild this object file by hand with -fPIC, the failing command now works:

$ gcc -I$GUIX_ENVIRONMENT/{include,lib/ocaml,lib/ocaml/site-lib/bigarray-compat,lib/ocaml/site-lib/ctypes,lib/ocaml/site-lib/integers,ocaml/site-lib/stdlib-shims} -fPIC -g -o mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o -c mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.c
$ gcc -shared  -g -o mnl/dllmnl_stubs.so mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
$ file mnl/dllmnl_stubs.so
mnl/dllmnl_stubs.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped

And you can see the relocation type in the object file changed from R_X86_64_PC32 to R_X86_64_REX_GOTPCRELX:

$ objdump -r mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE
0000000000000015 R_X86_64_REX_GOTPCRELX  Caml_state-0x0000000000000004
000000000000002a R_X86_64_REX_GOTPCRELX  Caml_state-0x0000000000000004
000000000000003f R_X86_64_REX_GOTPCRELX  Caml_state-0x0000000000000004
...

So why was this only a problem with the return_errno policy? Honestly, I am not sure; it seems like dumb luck, because with the ignore_errno policy, the -fPIC flag is not used, but there are no relocations for which gcc decides to use the R_X86_64_PC32 relocation type:

$ objdump -r mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE
000000000000001c R_X86_64_PLT32    mnl_socket_open-0x0000000000000004
000000000000002c R_X86_64_PLT32    caml_copy_nativeint-0x0000000000000004

I came up with a simple reproduction and submitted a bug report for dune. In the meantime, I can manually specify the flags for building the object files:

;; mnl/dune
(library
 (name mnl)
 (libraries integers)
 (flags (:standard -w -9-27))
 (ctypes
  (external_library_name libmnl)
  (headers (include "libmnl/libmnl.h"))

  ;; https://github.com/ocaml/dune/issues/5809
  (build_flags_resolver
   (vendored
    (c_flags (:standard -fPIC -I%{env:GUIX_ENVIRONMENT=/usr}/include))
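    (c_library_flags (:standard -lmnl -L%{env:GUIX_ENVIRONMENT=/usr}/lib))))

  ;; type_description, function_description and generated_entry_point
  ;; follow here, as in the earlier listings
  ))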

I don't like having to put guix-related environment variables in the dune file, as I want to keep the build files agnostic of the package manager. But it will have to do for now. With this, dune build succeeds! I've tried to submit fixes for the speed bumps I encountered, so eventually no one else will have to hit them. As of 3 Jun 2022:

Guix patch 55715, fixing the missing ocaml_integers.h error, has been merged upstream.
A fix for the -fPIC issue #5809 by @mbacarella has been merged, and is available in the 3.3.0 release.
My PR #5827, adding errno_policy support to dune's (ctypes) stanza, has been merged and will be available in the 3.4.0 release.

With build stuff out of the way, let's write the damn program ☺.

Exposing a subset of `libmnl`

Since I don't plan on making my libmnl bindings a standalone library, but rather an implementation detail of the tapalloc programs, I will only expose those functions which I need to implement my services. I will also expose a few identifiers from netlink.h through the Mnl module, even though they are not technically part of libmnl. For that, I need to add "linux/netlink.h" and "linux/if_link.h" to the includes set in mnl/dune.

First, it's nice to define some helper functions to reduce some of the repetition for error handling:

(* mnl/mnl.ml *)
open struct
  type errno = Signed.SInt.t

  (* Turn a raw errno value into a human-readable message. *)
  let string_of_errno n =
    let i = Signed.SInt.to_int n in
    Unix.(error_message (EUNKNOWNERR i))

  (* For C functions that return -1 on failure. *)
  let check_neg1 : (int * errno) -> (int, string) result =
    function
      | -1, e -> Error (string_of_errno e)
      | v, _ -> Ok v

  (* For C functions that return NULL on failure (described with ptr_opt). *)
  let check_none : ('a option * errno) -> ('a, string) result =
    function
      | None, e -> Error (string_of_errno e)
      | Some v, _ -> Ok v
end

The open struct ... end idiom is a scoping trick; the symbols defined within the anonymous block are available in the rest of the file, as if they were defined at the top level, but will not be present in the public interface of the module, and so cannot be used from other modules, and will not show up in automatically-generated documentation. It is a way for me to hide helper functions without maintaining an interface file. With these helpers in place, the socket_open and socket_close functions now become:

let socket_open   bus = Libmnl.mnl_socket_open (bus_to_int bus) |> check_none
let socket_close sock = Libmnl.mnl_socket_close sock            |> check_neg1
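As an aside on the open struct ... end idiom used for the helpers, here is a tiny stand-alone demonstration of the scoping. The Counter module and its contents are made up for the example, and the idiom needs OCaml 4.08 or later.

```ocaml
module Counter = struct
  open struct
    (* Private helpers: usable below, absent from Counter's signature. *)
    let state = ref 0

    let bump () =
      incr state;
      !state
  end

  (* The only value the module exports. *)
  let next () = bump ()
end

let first = Counter.next ()
let second = Counter.next ()

(* Uncommenting the next line is a compile-time error, because bump is
   not part of the module's interface:
   let _ = Counter.bump () *)
```

The inferred signature of Counter contains only next, which is the same effect the helpers in mnl/mnl.ml rely on.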

Now, regarding the creation of interfaces, the netlink API is complex enough that, instead of reading it from top to bottom, I want to scope my search with some examples. Recently, the ever-useful strace has learned how to parse netlink messages, so I can simply run

doas strace ip link add ipv0 link wlan0 type ipvtap

And get back a readable sequence of messages that I can mimic in my program. The ip tool sends the following netlink message to create the link, given the command above:

sendmsg(3,
  {
    msg_name={
      sa_family=AF_NETLINK,
      nl_pid=0,
      nl_groups=00000000
    },
    msg_namelen=12,
    msg_iov=[{
      iov_base=[
        {
          nlmsg_len=68,
          nlmsg_type=RTM_NEWLINK,
          nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE,
          nlmsg_seq=1654360314,
          nlmsg_pid=0
        },
        {
          ifi_family=AF_UNSPEC,
          ifi_type=ARPHRD_NETROM,
          ifi_index=0,
          ifi_flags=0,
          ifi_change=0
        },
        [
          [{nla_len=8,  nla_type=IFLA_LINK}, 4],
          [{nla_len=9,  nla_type=IFLA_IFNAME}, "ipv0"],
          [{nla_len=16, nla_type=IFLA_LINKINFO}, [
            {nla_len=10, nla_type=IFLA_INFO_KIND}, "ipvtap"...]]
        ]
      ],
      iov_len=68
    }],
    msg_iovlen=1,
    msg_controllen=0,
    msg_flags=0
  },
  0)

The message consists of a struct nlmsghdr, indicating an RTM_NEWLINK request, followed by a struct ifinfomsg which is blank (both AF_UNSPEC and ARPHRD_NETROM are defined as 0), followed by 3 struct nlattr items describing the type of interface. The netlink(7) protocol, like other network protocols, is layered, and message payloads can carry messages in another sub-protocol. In this case, the sub-protocol is rtnetlink(7). We need to surface some of the structs, enums, and constants found in the <linux/rtnetlink.h> header, after which we can compose a message with the right combination of flags and attributes seen in the trace.
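As a sanity check on the trace, the nlmsg_len of 68 can be reproduced by hand. The sketch below is plain arithmetic, not a libmnl call; the names are made up for illustration, and the payload sizes are read off the trace (note that strace reports nla_len=10 for "ipvtap", so no terminating NUL is counted there). Netlink pads each header and attribute to a 4-byte boundary.

```ocaml
(* Netlink pads headers and attributes to 4-byte boundaries. *)
let align4 n = (n + 3) land lnot 3

let nlmsghdr_len = 16  (* sizeof (struct nlmsghdr) *)
let ifinfomsg_len = 16 (* sizeof (struct ifinfomsg) *)
let nla_hdrlen = 4     (* sizeof (struct nlattr) *)

(* nla_len covers the attribute header plus its unpadded payload. *)
let nla_len payload_len = nla_hdrlen + payload_len

let ifla_link = nla_len 4                              (* u32 index: 8 *)
let ifla_ifname = nla_len (String.length "ipv0" + 1)   (* with NUL: 9 *)
let ifla_info_kind = nla_len (String.length "ipvtap")  (* 10 *)
let ifla_linkinfo = nla_len (align4 ifla_info_kind)    (* nested: 16 *)

(* Each top-level attribute occupies its padded length in the message. *)
let nlmsg_len =
  nlmsghdr_len + ifinfomsg_len
  + List.fold_left (fun acc len -> acc + align4 len) 0
      [ ifla_link; ifla_ifname; ifla_linkinfo ]

let () = Printf.printf "nlmsg_len = %d\n" nlmsg_len
```

The three attributes take 8 + 12 + 16 = 36 bytes after padding, which together with the two 16-byte headers gives the 68 bytes seen in the trace.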

Laying out messages in memory

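Netlink is a datagram-based protocol. Each message starts with a standard header, struct nlmsghdr, which carries the message's total length, type, flags, sequence number, and sender port ID.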

The standard header is then followed by a payload, which can have varying sizes and fields. Several of these messages can be laid out next to each other in a single datagram. You could model a protocol like this in OCaml like so:

type 'a nlmsg = {
  nlmsg_len : int32;
  nlmsg_type : int;
  nlmsg_flags : int;
  nlmsg_seq : int32;
  nlmsg_pid : int32;
  payload : 'a;
}

However, while superficially it looks like a netlink message, the memory layout is completely different; the int32 fields are actually pointers to blocks containing an integer value, just as payload will be for any non-int payload. As I was browsing the iproute2 source code, I came across this pattern which was used a lot:

struct {
	struct nlmsghdr	 n;
	struct ifinfomsg i;
	char             buf[1024];
} req = {
	.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
	.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
	.n.nlmsg_type = RTM_NEWLINK,
	.i.ifi_family = AF_UNSPEC,
};

While here it's mostly a syntactical construct, what struck me was the textual separation of the layout information and the actual message data. It made me think, "can I separate the layout information from the actual data?" As it turns out, you can, using Generalized algebraic data types (GADT).

Although netlink is a datagram-oriented protocol, a single datagram can contain multiple messages of varying types. Netlink sockets, like other sockets, support the vectored sendmsg(2) and recvmsg(2) system calls, which would allow me to spread parts of a message over multiple memory buffers, but the libmnl library lays out messages in a single buffer.

The nlmsghdr structure in the netlink(7) API is, as the hdr suffix implies, a header. That is, it shows up at the beginning of some piece of memory that is potentially longer than just the header itself, and the header implies some layout on the memory that follows it. The mnl library provides a function, mnl_nlmsg_put_extra_header, that, in C, takes the header and a size_t and returns a void pointer which you cast into the appropriate structure type.

The rtnl-link-set.c example shows how the mnl_*_put functions place sections of a message adjacent to each other by allocating from a shared buffer:

char buf[MNL_SOCKET_BUFFER_SIZE];
...
nlh = mnl_nlmsg_put_header(buf);
...
ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
...
mnl_attr_put_str(nlh, IFLA_IFNAME, argv[1]);

I adapted this function to take a GADT, which determined the return type, instead:

type _ extra_header =
  | Ifinfomsg : Ifinfomsg.t ptr extra_header
  | Ifaddrmsg : Ifaddrmsg.t ptr extra_header

val mnl_nlmsg_put_extra_header : Nlmsghdr.t ptr -> 'a extra_header -> 'a

I mirrored this in my adaptation of the example:

...
let nlh = mnl_nlmsg_put_header buf in
let ifi = mnl_nlmsg_put_extra_header nlh Ifinfomsg in
mnl_attr_put_str nlh IFLA_IFNAME Sys.argv.(1);
...

The mnl library takes care of updating the nlmsg_len field when you add an extra header.
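The way a GADT constructor can pick the return type is easier to see without Ctypes in the picture. The following is a simplified sketch, not the actual binding: the record types are plain OCaml stand-ins for the C structs, and size_of and put_extra_header are illustrative names.

```ocaml
(* Plain OCaml stand-ins for the C structs; the real bindings would use
   Ctypes structures instead. *)
type ifinfomsg = { mutable ifi_family : int; mutable ifi_index : int }
type ifaddrmsg = { mutable ifa_family : int; mutable ifa_prefixlen : int }

type _ extra_header =
  | Ifinfomsg : ifinfomsg extra_header
  | Ifaddrmsg : ifaddrmsg extra_header

(* Size in bytes of the corresponding C struct on Linux. *)
let size_of : type t. t extra_header -> int = function
  | Ifinfomsg -> 16
  | Ifaddrmsg -> 8

(* The constructor passed in decides the return type. *)
let put_extra_header : type t. t extra_header -> t = function
  | Ifinfomsg -> { ifi_family = 0; ifi_index = 0 }
  | Ifaddrmsg -> { ifa_family = 0; ifa_prefixlen = 0 }

let () =
  let ifi = put_extra_header Ifinfomsg in
  ifi.ifi_index <- 4;
  let ifa = put_extra_header Ifaddrmsg in
  ifa.ifa_prefixlen <- 24;
  Printf.printf "%d %d\n" ifi.ifi_index ifa.ifa_prefixlen
```

The caller never casts anything: asking for Ifinfomsg yields a value whose fields are those of an ifinfomsg, and the compiler rejects an attempt to touch ifa_prefixlen on it.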

Unexpected (non-)constants

While building this example, I hit another error:

% dune test
File "mnl/_unknown_", line 1, characters 0-0:
libmnl__c_cout_generated_types.c: In function ‘main’:
libmnl__c_cout_generated_types.c:84:13: error: enumerator value for ‘check_MNL_SOCKET_BUFFER_SIZE_const’ is not an integer constant
   84 |      enum { check_MNL_SOCKET_BUFFER_SIZE_const = (int)MNL_SOCKET_BUFFER_SIZE };
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(exit status 1)

Let's see what that constant is:

$ grep -h -r MNL_SOCKET_BUFFER_SIZE $(guix build libmnl)
#define MNL_SOCKET_BUFFER_SIZE (sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L)

Reading

_build/default/mnl/libmnl__c_cout_generated_types.c

the code generates an expression for each constant, like this:

enum { check_RTM_SETLINK_const = (int)RTM_SETLINK };

but the expression

sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L

is not a constant; sysconf(3) is a function. I tried many ways to fix this, but ended up re-implementing the expression in OCaml, and caching its result at program startup:

(* Same as the C macro: the page size, capped at 8192 bytes. *)
let mnl_socket_buffer_size =
  match F.sysconf T.sc_pagesize with
  | n, _ when n < 8192 -> n
  | _ -> 8192
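For reference, the C macro evaluates to the smaller of the page size and 8192. Factoring that comparison into a pure function makes it easy to check against a few page sizes without calling sysconf. The name buffer_size_for is made up here and is not part of libmnl or my bindings.

```ocaml
(* Mirrors the C macro: the page size, capped at 8 KiB. *)
let buffer_size_for ~page_size = min page_size 8192

let () =
  List.iter
    (fun page_size ->
      Printf.printf "%d -> %d\n" page_size (buffer_size_for ~page_size))
    [ 4096; 8192; 65536 ]
```

On a system with 4 KiB pages the buffer is one page; with 64 KiB pages it stays at 8192 bytes.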

Refactoring

At this point I have learned enough and used the interface I've defined enough that I want to stop forward progress, go back and refactor things before they get out of hand. Here's a summary of the changes, in broad strokes:

I changed the Mnl module name to Netlink, and put the library name prefixes back in the function names. Thus
```
  Mnl.(socket_open NETLINK_ROUTE)
```
became
```
  Netlink.(mnl_socket_open NETLINK_ROUTE)
```
or rather, the intended:
```
  open Netlink
  mnl_socket_open NETLINK_ROUTE
```
I did this because the module ended up including identifiers from multiple header files, not just libmnl.h, and since, in C-fashion, those identifiers' names had prefixes hinting where they came from, the module could be opened without the reader getting confused about where some name came from.
I made the API more OCaml-friendly: using variants instead of integers for enums, using int for all integers that fit in 63 bits, and so on. Sometimes this meant re-implementing some of the things the mnl library was doing.

I also added another module, Sockfd, exposing the ability to pass file descriptors over a unix socket. Unlike the libmnl bindings, I wrote this one as a conventional C binding, with a hand-written C stub.

Creating an interface

Let's try to create an interface. One of the first things I noticed in the libmnl docs was the text

You can access several example files under examples/ in the libmnl source code tree.

To the examples! I can download the libmnl source with guix like so:

$ guix build -S libmnl
/gnu/store/ihwm7imzz5cpwwpxlvnq4a59f2vs29xy-libmnl-1.0.5.tar.bz2
$ tar xf $(guix build -S libmnl)
$ cd libmnl-1.0.5

The file examples/rtnl/rtnl-link-set.c contains an example of setting a link up or down. It shouldn't be too much of a stretch to replicate this example, and then iteratively change it until it does what I actually want.

I need to craft an RTM_NEWLINK message, described in rtnetlink(7). It should have the following attributes:

IFLA_LINK=(index of parent interface)
IFLA_IFNAME="tap0"
IFLA_LINKINFO=[
	IFLA_INFO_KIND="ipvtap"
	IFLA_INFO_DATA=[
		IFLA_IPVLAN_MODE=IPVLAN_MODE_L2
		IFLA_IPVLAN_FLAGS=0
	]
]

After a bunch of hacking, trial-and-error, I got to the point where I had a test setup using the dummy device driver like this:

An interesting property is that, although the dummy device driver is designed to drop all packets, the design of the ipvlan and macvlan drivers is such that communication between child interfaces is possible even when the parent interface is a dummy device! This has interesting potential for local-only networks, potentially with enormous (64k) MTUs, which I could explore in the future, you know, for !!fun!!

The test was pretty fun to write, as it involved lots of forking and exec-ing back into itself in order to set up the ephemeral namespaces using unshare(1). It requires user namespaces to be supported on the system. I think in the future I will set up a User-mode Linux integration test so it can run against a specific kernel without requiring the host machine to support user namespaces (some admins disable this feature out of caution).

I wrote a bunch of pretty-printing functions so I could see the messages I was creating. One funny error I ran into looked like this:

->{nlmsg_len: 44 nlmsg_type: RTM_GETLINK ...} IFLA_IFNAME=dummy0
sent RTM_GETLINK message for dummy0
<-{nlmsg_len: 1452 nlmsg_type: RTM_NEWLINK ...}
  {ifi_family: AF_UNSPEC ifi_type: ARPHRD_ETHER ifi_flags: 0x100c3 ifi_change: 0x0000 ifi_index: 2}
  IFLA_IFNAME=dummy0 ... IFLA_INFO_KIND=dummy
<- {nlmsg_len: 36 nlmsg_type: NLMSG_ERROR ...}{error: 0 msg: {nlmsg_len: 44 nlmsg_type: RTM_GETLINK ...}
parent device "dummy0" has index 44
->{nlmsg_len: 92 nlmsg_type: RTM_NEWLINK nlmsg_flags: NLM_F_REQUEST|NLM_F_CREATE|NLM_F_EXCL...}
  IFLA_LINK=44 IFLA_IFNAME=ipvtap19962 IFLA_INFO_KIND=ipvtap
<-{nlmsg_len: 36 nlmsg_type: NLMSG_ERROR ...}{error: -19 (No such device) ...}
req#1163551169 failed: No such device

This happened because I re-used the buffer that held the response for my RTM_GETLINK request to receive the acknowledgement (an NLMSG_ERROR message with error=0). The offset of the ifi_index field in the ifinfomsg struct just happens to line up with the msg.nlmsg_len field in the nlmsgerr struct, so the stale "index" I read back was really the length (44) of my original request. It was a silly error, and easily fixed by making a copy. Eventually I will change the program so that it doesn't read into sections of a buffer that are still in use.
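The overlap is easy to reproduce with a plain byte buffer. In this sketch the offsets follow the Linux definitions of struct ifinfomsg and struct nlmsgerr; the variable names are made up, and the buffer stands in for the payload that follows the outer nlmsghdr.

```ocaml
(* struct ifinfomsg: ifi_family (1 byte), padding (1), ifi_type (2),
   then ifi_index at offset 4. *)
let ifi_index_offset = 4

(* struct nlmsgerr: error (4 bytes), then the echoed struct nlmsghdr,
   whose first field is nlmsg_len, also at offset 4. *)
let nlmsgerr_nlmsg_len_offset = 4

(* Payload of an acknowledgement: error = 0, echoing a 44-byte request. *)
let ack_payload =
  let b = Bytes.make 20 '\000' in
  Bytes.set_int32_ne b 0 0l;
  Bytes.set_int32_ne b nlmsgerr_nlmsg_len_offset 44l;
  b

(* Reading the same bytes as if they were still a struct ifinfomsg. *)
let stale_ifi_index =
  Int32.to_int (Bytes.get_int32_ne ack_payload ifi_index_offset)

let () = Printf.printf "parent index read as %d\n" stale_ifi_index
```

The value read back is 44, matching the bogus index in the log rather than the real index of 2.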

After a bit more trial and error, my test was successfully creating an ipvtap device! 🎉

$ nsenter --target $(pgrep -f tapallocd_ipvtap) --net ip link
1: lo:  mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0:  mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 1a:4b:25:01:45:dd brd ff:ff:ff:ff:ff:ff
3: ipvtap21559@dummy0:  mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 500
    link/ether 1a:4b:25:01:45:dd brd ff:ff:ff:ff:ff:ff

But we're not done yet. We have to find the character device and yield file descriptors for it...

Getting the file descriptors

Typically, when you create a macvtap or ipvtap device, udev will create files under /sys/class/net/$name and a character device at /dev/tap$index. Depending on your configuration, it will probably be owned by root. I installed the following udev rule:

SUBSYSTEMS=="ipvtap", OWNER="tapallocd", MODE="0600"

I could make the rule a little more targeted, perhaps matching some prefix of the interface name, or the parent device. This cooperation from udev also makes the testing story a little more complicated, as an automated test will now depend on a device manager. There is also now a race between the device node being created by udev and tapallocd attempting to open it; tapallocd will have to wait for some notification that /dev/tap* has been created. Another challenge is that the device manager needs to be running in the same network namespace as tapallocd; you cannot rely on the default device manager.

I want this daemon to "just work" and be able to operate mostly on its own, in its own network namespace. So my approach is to listen to a NETLINK_KOBJECT_UEVENT socket for the device creation event to learn its major+minor device numbers and call mknod(2) within tapallocd to create it. This would also allow us to create the character device file somewhere else, so we can unlink the device without having write permissions to /dev.

This sounds like a lot of work, adding an entirely new protocol to the program. I had trouble finding documentation about the protocol format, so I decided to go ahead and just see what it looked like myself. I made a simple loop that ran in a background thread during the test and just dumped everything it read from a NETLINK_KOBJECT_UEVENT socket to stderr:

let hexdump : int -> buffer Fmt.t =
  fun len dst buf ->
  (Fmt.hex ()) dst (len, CArray.get buf)

(* Runs forever, dumping every uevent datagram to stderr. *)
let dump_nl () =
  let nl = mnl_socket_open NETLINK_KOBJECT_UEVENT in
  let* _ = mnl_socket_bind nl NETLINK_KOBJECT_UEVENT 0 1 in
  let buf = create_buffer 32768 in
  let rec loop () =
    match mnl_socket_recvfrom nl buf with
    | Error e -> Error (Unix.error_message e)
    | Ok len ->
      Fmt.epr "got kobject msg@.";
      Fmt.epr "%a@." (hexdump len) buf;
      loop ()
  in
  loop ()

Running this, the add events were printed in my test's output:

got kobject msg
00: 6164 6440 2f64 6576 6963 6573 2f76 6972  add@/devices/vir
10: 7475 616c 2f6e 6574 2f69 7076 7461 7035  tual/net/ipvtap5
20: 3735 332f 7175 6575 6573 2f74 782d 3000  753/queues/tx-0.
30: 4143 5449 4f4e 3d61 6464 0044 4556 5041  ACTION=add.DEVPA
40: 5448 3d2f 6465 7669 6365 732f 7669 7274  TH=/devices/virt
50: 7561 6c2f 6e65 742f 6970 7674 6170 3537  ual/net/ipvtap57
60: 3533 2f71 7565 7565 732f 7478 2d30 0053  53/queues/tx-0.S
70: 5542 5359 5354 454d 3d71 7565 7565 7300  UBSYSTEM=queues.
80: 5345 514e 554d 3d32 3638 3332 00         SEQNUM=26832.
got kobject msg
00: 6164 6440 2f64 6576 6963 6573 2f76 6972  add@/devices/vir
10: 7475 616c 2f6e 6574 2f69 7076 7461 7035  tual/net/ipvtap5
20: 3735 332f 6970 7674 6170 2f74 6170 3300  753/ipvtap/tap3.
30: 4143 5449 4f4e 3d61 6464 0044 4556 5041  ACTION=add.DEVPA
40: 5448 3d2f 6465 7669 6365 732f 7669 7274  TH=/devices/virt
50: 7561 6c2f 6e65 742f 6970 7674 6170 3537  ual/net/ipvtap57
60: 3533 2f69 7076 7461 702f 7461 7033 0053  53/ipvtap/tap3.S
70: 5542 5359 5354 454d 3d69 7076 7461 7000  UBSYSTEM=ipvtap.
80: 4d41 4a4f 523d 3233 3400 4d49 4e4f 523d  MAJOR=234.MINOR=
90: 3100 4445 564e 414d 453d 7461 7033 0053  1.DEVNAME=tap3.S
a0: 4551 4e55 4d3d 3236 3833 3300            EQNUM=26833.
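Reading the dump, each datagram looks like a run of NUL-terminated strings: a summary of the form action@devpath, followed by KEY=VALUE pairs. As a rough sketch, such a datagram could be parsed as follows. The helpers parse_kv and parse_uevent are hypothetical and written only against the two events captured above, not against any specification of the format.

```ocaml
(* Split "KEY=VALUE" at the first '='; strings without '=' (such as the
   leading "add@/devices/..." summary) yield None. *)
let parse_kv s =
  match String.index_opt s '=' with
  | None -> None
  | Some i ->
    Some (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

(* A uevent datagram is a run of NUL-terminated strings. *)
let parse_uevent datagram =
  String.split_on_char '\000' datagram |> List.filter_map parse_kv

(* The second event from the dump, written out as an OCaml string. *)
let example =
  "add@/devices/virtual/net/ipvtap5753/ipvtap/tap3\000ACTION=add\000\
   DEVPATH=/devices/virtual/net/ipvtap5753/ipvtap/tap3\000\
   SUBSYSTEM=ipvtap\000MAJOR=234\000MINOR=1\000DEVNAME=tap3\000\
   SEQNUM=26833\000"

let () =
  parse_uevent example
  |> List.iter (fun (k, v) -> Printf.printf "%s = %s\n" k v)
```

The result is an association list, so looking up a property is a List.assoc (or List.assoc_opt) away.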

It's a simple text-based protocol, with one event per datagram, and fields delimited by a NUL byte. You can see that the second event, which is adding the tap3 character device, includes the MAJOR (234) and MINOR (1) variables, and the device name. This should be all I need to call mknod(2). To create the device I need the CAP_MKNOD capability. I can give it to the process using setpriv(1):

$ exec doas setpriv \
	--ambient-caps +mknod \
	--inh-caps +mknod \
	--securebits +keep_caps_locked,+no_setuid_fixup_locked \
	--reuid $(id -u) --regid $(id -g) --init-groups \
	env -i $(env) guix shell -D
$ setpriv --dump | grep capabilities
Inheritable capabilities: mknod
Ambient capabilities: mknod
$ dune test -w --no-buffer

Unfortunately, this doesn't work:

mknod /tmp/ipvtap5023: Operation not permitted

I suspect that the test loses the CAP_MKNOD capability when it uses unshare(1) to transition from the default namespace into its own user namespace, since if I use mknod(1) to manually create the character device outside of the namespace, it is successful.

An unforeseen leak

While I was pondering what to do, I noticed another problem. One of the core assumptions of this whole system is that I would be able to unlink the device node from the file system, and then use open file descriptors as references that would prevent the device from being deleted until they were closed. With the approach I took of manually creating the device nodes, I am able to create the nodes in directories where I have write access, and unlink them without requiring write access to the /dev directory.

Imagine my surprise when I found a /dev/tap3 file during my test! At first I suspected udev was doing this somehow, but that seemed impossible; udevd listens for device events on a NETLINK_KOBJECT_UEVENT socket, and those should not receive events from other network namespaces.

I wrote a quick tool to simply dump everything it could read from a netlink socket to stdout, similar to the testing in previous sections. As I expected, I did not see events for the ipvtap devices unless I ran the tool within the same network namespace. The device permissions were also wrong:

$ ls -l /dev/tap3
crw------- 1 root root 234, 1 Jan  5 16:15 /dev/tap3

My udev rules would have changed the node permissions to 0660 and owner to tapalloc. Using strace on udevd more or less proved its innocence; it wasn't calling mknod to create the file. After a bit more research I realized it was devtmpfs; back in the 2000s there was a stretch of time where devfs was removed from the kernel, and udev was responsible for populating /dev. I guess that state of affairs was too difficult, because the concept was re-introduced in 2010 in the form of devtmpfs; the kernel will post device nodes there for any driver that asks for it, and it's mounted by default at /dev on my system.

That throws a wrench in my plan. I'm not ready to configure my system to run without devtmpfs. I can think of a few options:

Run a device manager in the same namespace that removes the nodes from devtmpfs.
Give tapallocd the privilege to remove these nodes itself.
Patch the kernel to introduce a device analogous to /dev/net/tun, but for ipvtap devices. It would have to be a per-parent device, and new ioctls would be needed to configure the ipvlan-specific flags.
Find some other way to track when the interface is no longer in use.

None of these options are great, but the least bad option, I think, is to rely on a device manager. So setup instructions will include steps to add udev/mdev/mdevd rules that fix the permissions of tap devices as they come online.

Because I want to keep my tests self contained, I will move the mknod(2) calls out of tapallocd and into the test, where it will run a child process that emulates a device manager, including the re-broadcasting behavior that both udev and mdevd support. I use this behavior in tapallocd as a synchronization tool, to wait until the device node is ready. It runs as a child process of the test, forked before a new user+network namespace is created, and looks like this:

let prefix = Bytes.of_string "add@/devices/virtual/net/" in

let rec loop () =
  let n = Unix.recv uevent buf 0 len [] in
  Fmt.(epr "got %a@." (hex ()) (n, Bytes.get buf));

  if not (Bytes.starts_with ~prefix buf) then loop ()
  else process (Bytes.sub_string buf 0 n)

and process s =
  let props =
    String.split_on_char '\000' s |> List.filter_map parse_kv
  in
  let name = List.assoc "DEVNAME" props in
  let major = List.assoc "MAJOR" props |> int_of_string in
  let minor = List.assoc "MINOR" props |> int_of_string in

  let dev = makedev major minor in
  let* () = mknodat devdir name 0o660 S_IFCHR dev in
  Fmt.epr "udev: created device node %s@." name;

  rebroadcast s

and rebroadcast s =
  let _ = Unix.send_substring donefd s 0 (String.length s) [] in
  loop ()
in
loop ()

This didn't quite work. It turns out the libmnl library doesn't really allow you to send a multicast message to a netlink socket, because the mnl_socket_sendto routine does not pass an nl_groups field in the sockaddr_nl structure. I had to use sendmsg directly and construct my own:

and rebroadcast n =
  (* mnl_socket_sendto blanks out the groups field, so we
     have to bypass it with our own msghdr *)
  let open Ctypes in
  let sa = make Sockaddr_nl.t in
  Sockaddr_nl.(setf sa nl_family (int_of_family AF_NETLINK));
  Sockaddr_nl.(setf sa nl_pid 0);
  Sockaddr_nl.(setf sa nl_groups nl_group_id);

  let iov = make Iovec.t in
  Iovec.(setf iov iov_base (CArray.start buf |> to_voidp));
  Iovec.(setf iov iov_len  n);

  let msg = make Msghdr.t in
  Msghdr.(setf msg msg_iov     (addr iov));
  Msghdr.(setf msg msg_iovlen  1);
  Msghdr.(setf msg msg_name    ((addr sa) |> to_voidp));
  Msghdr.(setf msg msg_namelen (sizeof Sockaddr_nl.t));
  Msghdr.(setf msg msg_controllen 0);
  Msghdr.(setf msg msg_flags 0);

  sendmsg fdw (addr msg)

As an aside, since I rewrote the tests from scratch many times throughout the course of this project, I eventually forgot why I did this and spent a while rediscovering the same solution. The lesson here is that I should make more frequent check-ins to source control with commit messages and comments describing the issues I encountered.

Another issue I ran into was with sending file descriptors over a Unix socket; I did this in a C stub because of how arcane the API was, relying on lots of pointer casting and macros. The C stubs implement two OCaml functions in the Sockfd module:

(** [send fds sock] sends the file descriptors in [fds] to the
    process on the other end of the socket [sock]. *)
val send : Unix.file_descr array -> Unix.file_descr -> unit

(** [recv n sock] receives up to [n] file descriptors from [sock],
    which must be a unix socket *)
val recv : int -> Unix.file_descr -> Unix.file_descr array

However, I was getting strange behavior. I would send two descriptors:

Sockfd.send [| fd1; fd2 |] sock;

But on the receiving side, I would only receive one:

let fds = Sockfd.recv 2 sock in
(* Array.length fds = 1 ??? *)

After some experimentation and research (strace was helpful here), I found the problem: I was sending two control messages with one descriptor each, but on the receiving end, they were merged into one control message with two descriptors. I don't know if this is Linux-specific behavior, so I rewrote the send function to put all of the fds into one message, and the recv function to support any arrangement of fds (up to SCM_MAX_FD) within any number of messages.

After lots more silly bugs, reading docs, reading the kernel code, I finally got to the end of the test:

tapallocd: char device tap3 for ipvtap21701 is ready
tapallocd: removing reference to device
tapallocd: opened 1 fds to tap3
tapallocd: sending fd 6
tapalloc: received fds [3]
tapalloc: moving fds [3] to [8]
exec /home/droyo/src/dev.arroyo.cc/tapalloc/_build/default/test/test_dummy.exe stage2
mac address is 12:b2:e1:ac:a9:db
got tap device with stats {dev: 66306, ino: 7262473, perm: 0640, uid: 65534,
                           gid: 65534, rdev: 59905}
ok
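The rdev value in the stat output packs the character device's major and minor numbers into a single integer. Assuming glibc's dev_t encoding on Linux and a 64-bit OCaml int (both are assumptions here, and the function names are only illustrative), 59905 decodes back to the MAJOR=234 and MINOR=1 announced in the uevent:

```ocaml
(* glibc's dev_t layout: the low 12 bits of the major and low 8 bits of
   the minor share the bottom 20 bits; the remaining bits sit above. *)
let makedev major minor =
  ((major land 0xfff) lsl 8)
  lor (minor land 0xff)
  lor ((major land lnot 0xfff) lsl 32)
  lor ((minor land lnot 0xff) lsl 12)

let major_of dev = ((dev lsr 8) land 0xfff) lor ((dev lsr 32) land lnot 0xfff)
let minor_of dev = (dev land 0xff) lor ((dev lsr 12) land lnot 0xff)

let () =
  Printf.printf "makedev 234 1 = %d\n" (makedev 234 1);
  Printf.printf "59905 -> major %d, minor %d\n"
    (major_of 59905) (minor_of 59905)
```

In hexadecimal, 59905 is 0xEA01: major 0xEA (234) in the second byte and minor 0x01 in the first.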

Unfortunately I still wasn't done; I found that the kernel was not destroying the network device after its device node and all references to it were gone. I thought about it some more and found that the most suitable option is the one taken by all the fashionable container managers out there; place the network interface in the client's network namespace. I did consider this initially and hesitated for a number of reasons:

I was concerned it would make it harder for me to implement my dhcp/slaac helper as it would need to use NETLINK_NETFILTER to snoop on packets from these interfaces, which are now in different namespaces. I think this can be addressed with the NETLINK_LISTEN_ALL_NSID option.
I didn't want to give the client, which probably had CAP_NET_ADMIN privileges in their namespace, control over the device if I didn't have to. Upon reflection, I was already giving them the ability to write raw ethernet frames into the device. Giving them the ability to bring the sub-interface up or down or assign addresses is not a big deal. Besides, I control the clients.

The work I did before to emulate a device manager is not wasted; obtaining file descriptors for the tap device will still be a privilege reserved for the server. It's also one less dependency for automated tests. Moving the device into the clients' namespaces is probably for the best, as it will establish some symmetry between VMs (which will use ipvtap devices) and containers (which will use ipvlan devices).

To start, I tried using the IFLA_NET_NS_PID attribute in my RTM_NEWLINK request:

mnl_ifla_put nlh IFLA_NET_NS_PID peer_pid;

With this attribute, my test hangs here:

kernel accepted RTM_NEWLINK
waiting for device manager to create char device

If I peek into the client's namespace, the link exists:

$ doas nsenter --target $(pgrep -f tapalloc.exe) --net -- ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ipvtap8447: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 500
    link/ether d2:09:23:6f:b7:bd brd ff:ff:ff:ff:ff:ff

So I suspect the "udev" emulator isn't getting notifications about this device because it's not in our namespace. I tried setting the NETLINK_LISTEN_ALL_NSID flag. From netlink(7):

"When set, this socket will receive netlink notifications from all network namespaces that have an nsid assigned into the network namespace where the socket has been opened. The nsid is sent to user space via an ancillary data."

Unfortunately, it did not work; the device manager listening in tapallocd's namespace did not get any uevents. I could not tell if that was because the kernel did not consider the two namespaces to be linked (due to one ns having a sub-interface of a device in the other ns), or because the uevents are not properly tagged when being broadcast. Another downside of this approach is that in a real deployment, it would require the device manager to set the NETLINK_LISTEN_ALL_NSID socket option, which may require patching the software to support socket options. Perhaps I can fix that in the future. For now, I'll do the following:

Bring the device up in tapallocd's namespace and obtain file descriptors
Move the device to the client's namespace after it's ready

This two-step approach opens the possibility of leaking devices if something goes wrong between the device creation and its transfer to the client namespace. For example, the client could close the connection, or it could be in a namespace that the server doesn't have permission to create interfaces in. I am comfortable with this risk; I can write the server carefully to clean up the device if it can't be transferred, and have some periodic sweep for "dangling" interfaces. I can probably also work out some scheme to limit the total number of interfaces in the server's namespace at any given time.

Doing it this way, I finally get what I want; the network device is allocated, the client has open file descriptors to send and receive ethernet frames with, and the device's lifecycle is the same as the client's. I no longer need to perform any trickery with unlinking files. The proof of concept is complete!

In all honesty, this service could have been an execline script with the help of the programs in fdtools. This was really more of a warm-up; an easy, tractable problem to get me accustomed to the quirks and oddities of netlink and ctypes before getting to my real goal.

Along the way I finally understood how to use the Ctypes view function. It basically registers handlers that convert between C and OCaml values whenever you pass a value to, or receive a value from, a C function. The nfnetlink structures, for instance, contain several fields which are documented as holding values in network byte order (i.e. big-endian):

struct nfqnl_msg_packet_hdr {
	__be32		packet_id;	/* unique ID of packet in queue */
	__be16		hw_protocol;	/* hw protocol (network order) */
	__u8	hook;		/* netfilter hook */
} __attribute__ ((packed));

I adapted this struct like so:

module Nfqnl_msg_packet_hdr = struct
  type t = [`Nfqnl_msg_packet_hdr] structure
  let t:t typ = structure "nfqnl_msg_packet_hdr"

  let packet_id   = field t "packet_id"   be32
  let hw_protocol = field t "hw_protocol" be16
  let hook        = field t "hook"        int
  let () = seal t
end

What are be32 and be16 above? They're views over int:

let be16 = view int ~read:ntohs ~write:htons
let be32 = view int ~read:ntohl ~write:htonl

Meaning, whenever I read a field of type be32 (through getf or the !@ operator for pointer dereferencing), I get back (ntohl value), and whenever I write such a field, (htonl value) is written instead. I had to write the ntoh* and hton* functions myself, although they may be in one of the modules I'm already using. They were easy enough:

let mirror width x =
  let rec loop dst src = function
    | 0 -> dst
    | n ->
      let lsb = src land 0xff in
      let dst' = (dst lsl 8) lor lsb in
      let src' = (src lsr 8) in
      loop dst' src' (n-1)
  in
  loop 0 x width

let htons = if Sys.big_endian then Fun.id else mirror 2
let htonl = if Sys.big_endian then Fun.id else mirror 4
let ntohs = htons
let ntohl = htonl
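A quick way to gain confidence in the byte-swapping helper is to compare it against the standard library's explicit big-endian accessors. This is only a sanity-check sketch: mirror and ntohs are repeated so the snippet runs on its own, and the two-byte buffer stands in for a __be16 field holding 0x0800 (the IPv4 ethertype).

```ocaml
(* Same byte-reversal helper as in the post, repeated so the snippet is
   self-contained. *)
let mirror width x =
  let rec loop dst src = function
    | 0 -> dst
    | n -> loop ((dst lsl 8) lor (src land 0xff)) (src lsr 8) (n - 1)
  in
  loop 0 x width

let ntohs = if Sys.big_endian then Fun.id else mirror 2

(* 0x0800 as it appears on the wire, most significant byte first. *)
let wire = Bytes.of_string "\x08\x00"

(* What a native-order 16-bit read of that field sees. *)
let raw = Bytes.get_uint16_ne wire 0

(* Converting to host order should agree with a big-endian read. *)
let host = ntohs raw

let () =
  Printf.printf "raw=0x%04x host=0x%04x be=0x%04x\n"
    raw host (Bytes.get_uint16_be wire 0)
```

On either endianness, host comes out as 0x0800, the same value Bytes.get_uint16_be reports.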

After understanding view better I started using it in other ways, for example for some enum-type fields that convert variants to and from integers, but weren't usable with the TYPE.enum function for one reason or another. For example, defining a family view for the address family constants looks like this:

let family = view int ~write:family_to_enum
    ~read:(fun x -> family_of_enum x |> Option.value ~default:AF_UNSPEC)

And using it as a field in a structure allows me to do things like

let v = mnl_nlmsg_put_extra_header nlh Nfgenmsg in
Nfgenmsg.(v |-> nfgen_family) <-@ AF_UNSPEC;
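The family_to_enum and family_of_enum functions are not shown here, so the following is a guess at their shape rather than the real code. It covers only a handful of address families with their Linux values, and read_family is a made-up name for the ~read half of the view.

```ocaml
type family = AF_UNSPEC | AF_INET | AF_INET6 | AF_NETLINK

(* Linux values for these address families. *)
let family_to_enum = function
  | AF_UNSPEC -> 0
  | AF_INET -> 2
  | AF_INET6 -> 10
  | AF_NETLINK -> 16

let family_of_enum = function
  | 0 -> Some AF_UNSPEC
  | 2 -> Some AF_INET
  | 10 -> Some AF_INET6
  | 16 -> Some AF_NETLINK
  | _ -> None

(* The ~read side of the view: unknown values fall back to AF_UNSPEC
   instead of raising. *)
let read_family x = family_of_enum x |> Option.value ~default:AF_UNSPEC
```

The fallback keeps reads total: an address family the program does not know about shows up as AF_UNSPEC rather than crashing the parser, at the cost of losing the original number.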

Helping VMs help themselves - tapautoconf

Ipvlan (and ipvtap) devices are sub-interfaces of some other device. Sub-interfaces behave as if they're attached to a switch, and are able to communicate efficiently without the participation of the underlying device. If a packet is destined to an address that does not belong to a sub-interface, it will be transmitted by the underlying device. Similarly, when the underlying device receives a packet, if it's destined for a sub-interface, that interface receives it, otherwise, the packet is dropped.

The core challenge with using ipvlan devices is that all sub-interfaces share the ethernet (MAC) address of the underlying device. Traffic is multiplexed among the sub-interfaces based on the layer 3 (IPv4/IPv6) addresses. That means, in order to receive unicast traffic, an ipvlan device must have an IP address assigned to it. Without an address, they can only receive broadcast and multicast traffic.

I don't want to manually assign IP addresses to my VMs, and I don't want to invent my own IP address management automation. I want my VMs to acquire their own addresses from the existing automation in whatever network I happen to be in. That means supporting DHCP for IPv4 and SLAAC for IPv6. Since I also plan to use these VMs on my laptop, I also have the stipulation that I may not have control of the network, so I cannot rely on special DHCP configuration.

My strategy will be to implement a service that will "learn" the IP address(es) that an interface wants to use by snooping for DHCP and ICMPv6 traffic, and assign those addresses to the interface on its behalf. Because I need to know what interface the packet came from, simply listening for broadcast & multicast packets is not enough. I can use a feature of netfilter that allows a filtered packet to be queued to a userspace program. The program can even modify and re-inject the packet, which I will use later for an extra feature. This allows me to only intercept the packets I actually need, and netfilter will include metadata about where the packet originated, allowing me to associate a packet with the interface it came from.

Test setup

I need a test. The test is my "north star", some marker that will show me if I'm going in the right direction. This setup will be slightly more complicated than the tapalloc test from earlier.

Since I can't find a dhcp client that would work with tap devices and isn't a full-fledged VM, I'll write a variant of tapallocd that allocates ipvlan instead of ipvtap devices. I was planning to do so anyway.

Dnsmasq acts as a stand-in for the dhcp/radvd services that run on the router in my home network, and is pre-configured with an IPv4 address pool and an IPv6 prefix which it will advertise for stateless autoconfig (no DHCPv6).

The dhclient process runs with the -n flag, which tells dhclient not to actually configure the interface after obtaining an address. The namespace containing the master dummy0 interface will have netfilter rules installed that intercept ICMPv6 and DHCP traffic and enqueue it to a netlink queue, which tapautoconf will listen on. The test will be successful when tapautoconf is able to configure the IPv4 and IPv6 addresses that dhclient negotiates with dnsmasq.

To start, I will construct this topology from an automated test, set up some netfilter rules to intercept traffic, and write a stand-in for tapautoconf that just logs every packet it receives and what metadata is available for it.

While setting up the test I hit an issue; I was attempting to create ipvlan devices directly in the client's namespace, since the dance required to obtain file descriptors for ipvtap devices was not necessary. I did this by setting the IFLA_NET_NS_PID attribute to the pid of the client in the RTM_NEWLINK message:

let (peer_pid, gid, uid) = Sockfd.peercred Unix.stdin in
...
mnl_ifla_put nlh IFLA_NET_NS_PID peer_pid

The peer pid can be retrieved with the SO_PEERCRED socket option. Despite setting this attribute, the network device was created in tapallocd's network namespace, rather than the client's. I racked my brain over this problem for a long time. I was especially befuddled because in my prior test, before changing ipvtap to a 2-stage allocation method, I was able to create devices directly in the peer namespace. I knew this was supposed to work. I pored through the kernel source. I tried all kinds of experiments. I added more debug statements. I mechanically looked for differences between the working and non-working program, going so far as to re-arrange attributes in some desperate terror that the order would matter.

Ultimately, the issue was quite simple. The tapallocd program queries the SO_PEERCRED socket option, which is a struct that looks like this:

    struct ucred {
        pid_t pid;    /* process ID of the sending process */
        uid_t uid;    /* user ID of the sending process */
        gid_t gid;    /* group ID of the sending process */
    };

It uses the peer's process ID to identify the network namespace to insert the network interface into. The problem was that peer_pid was not the pid of the process on the other end of the socket. How is that possible? Well.

The tapallocd programs are intended to run under a super-server such as inetd, socat(1), or s6-ipcserver(8). The server listens on the unix socket, and for each accepted connection, forks an instance of tapallocd. A single tapallocd process serves a single request.

I wanted my test to require minimal external dependencies, and when possible, I did not want it to write files to a file system, or to bind to network addresses on the system. The solution I came up with looked something like this:

let well_known_port = 42

let start () =
  let fd = bg Sys.argv.(0) [| Sys.argv.(0); "listen" |] in
  Unix.dup2 fd well_known_port

(* runs in a separate process *)
let listen () =
  let buf = Bytes.create 1 in
  while Unix.(read stdin buf 0 1) > 0 do
    let srv = bg "tapallocd" [| "tapallocd"; "--verbosity=debug"; "dummy0" |] in
    Sockfd.send [| srv |] stdout
  done

let connect () =
  ignore (Unix.write_substring well_known_port " " 0 1);
  let fd = (Sockfd.recv 1 well_known_port).(0) in
  Unix.dup2 fd (fd_of_int 6);
  Unix.dup2 fd (fd_of_int 7)
...
let test () =
  let c = connect () in
  Unix.execvp "tapalloc" [| "tapalloc"; ... |]

let main () =
  match Sys.argv.(1) with
  | "listen" -> listen ()
  | "test" -> test ()
  ...

I've elided some details, but essentially, the test program repeatedly execs itself, using its first argument to choose what function to call. A socketpair connects the file descriptor 42 to the standard input and output of a child process that starts instances of tapallocd and shares one end of another socketpair that is connected to tapallocd's standard input and output. The test can then execute into the tapalloc client, which expects the tapallocd service to be connected to file descriptors 6 (for read) and 7 (for write). The numbers 6 and 7 come from the UCSPI convention.

What I learned was, the SO_PEERCRED option contains the pid (and uid, and gid) of the process which opened the socket, not the process that happens to be writing to it. In my ipvtap tests, which did not require a forking server since there was only one request, these processes were one and the same. But with the "server" implementation above, the process that creates the socket (in the implementation of the bg function) is the process running the listen function, not the process that calls connect. I changed the implementation to look like this:

(* runs in background process *)
let rec listen () =
  match Sockfd.recv 1 Unix.stdin with
  | [| c |] -> (
    (* tell client we've received the fd *)
    ignore (Unix.write_substring c "." 0 1);

    match Unix.fork () with
    | 0 -> (* new child process; fork returns 0 in the child *)
      Unix.(dup2 c stdin);
      Unix.(dup2 c stdout);
      Unix.execvp "tapallocd" [| "tapallocd"; "--verbosity=debug"; "dummy0" |]
    | _ -> Unix.close c; listen ())
  | _ -> exit 0

(* called by test *)
let connect () =
  let client, server = Unix.(socketpair PF_UNIX SOCK_STREAM 0) in
  Sockfd.sendfd [| server |] well_known_port;

  (* read one byte as a sign that server has received the fd *)
  ignore (Unix.read client (Bytes.create 1) 0 1);

  Unix.dup2 client (fd_of_int 6);
  Unix.dup2 client (fd_of_int 7)

So rather than listen opening a socketpair on behalf of the process calling connect, connect opens its own socketpair and shares one end with listen. In hindsight, it makes perfect sense that the credentials in SO_PEERCRED would be those of the process that opened the socket. After all, the same socket could be dup'd to any number of processes in any number of namespaces, and treating SO_PEERCRED as "the currently speaking user" is inherently racy.

With that cleared up, the topology, at least, is in place. However, dhclient -4 cannot obtain a lease, even though dnsmasq is offering it one:

DHCPDISCOVER on nic0 to 255.255.255.255 port 67 interval 4
Jan 10 20:11:11 dnsmasq-dhcp[27106]: DHCPDISCOVER(nic0) 0e:96:ef:9b:98:7b
Jan 10 20:11:11 dnsmasq-dhcp[27106]: DHCPOFFER(nic0) 192.168.0.119 0e:96:ef:9b:98:7b
No DHCPOFFERS received.
Unable to obtain a lease on first try.  Exiting.

A packet capture makes things obvious:

62:9c:ad:b6:86:f2 > ff:ff:ff:ff:ff:ff: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 62:9c:ad:b6:86:f2
62:9c:ad:b6:86:f2 > 62:9c:ad:b6:86:f2: 192.168.0.1.67 > 192.168.0.121.68: BOOTP/DHCP, Reply, length 300

The DHCPDISCOVER request is broadcast, but the DHCPOFFER reply is unicast, addressed at the L2 layer to the mac address of the parent interface, and at the L3 layer to the IPv4 address being offered. Since the client does not have an L3 address configured, as an ipvlan device it cannot receive the reply. It's a chicken and egg problem. The strategy for making DHCP work, then, is to intercept these DHCPOFFER packets, configure the address they offer on the interface, then re-inject them. While netfilter would allow us to identify the interface that the DHCPDISCOVER request originated from, it can't tell us what interface the DHCPOFFER was intended for. The tapautoconf program will have to match some information in the DHCPDISCOVER packet, such as the DUID, to information in the DHCPOFFER packet.
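To make the matching idea concrete, here is a minimal sketch of the bookkeeping it implies. It assumes the shared key is the DHCP transaction id (xid), which a server copies from the request into its reply; the function names and the use of a plain Hashtbl are hypothetical and only illustrate the shape of the lookup.

```ocaml
(* Hypothetical sketch: correlate DHCP replies with requests by xid.
   [xid] and [ifindex] stand in for values lifted from intercepted packets. *)
let pending : (int32, int) Hashtbl.t = Hashtbl.create 16

(* Called for a DHCPDISCOVER/DHCPREQUEST: remember which interface asked. *)
let note_request ~xid ~ifindex = Hashtbl.replace pending xid ifindex

(* Called for a DHCPOFFER/DHCPACK: find the interface the reply is for.
   Returns [None] when the reply matches no request we have seen. *)
let interface_for_reply ~xid = Hashtbl.find_opt pending xid

(* Called once the exchange is over, so the table does not grow forever. *)
let forget ~xid = Hashtbl.remove pending xid
```

A real implementation would also need to expire entries for exchanges that never complete.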

Let's insert tapautoconf into the mix. It is inspired by examples in the libmnl and libnetfilter_queue C libraries. At this point, it simply binds to a netfilter queue, and prints out every message it receives. This is the main loop, so to speak:

  let logmsg nlh =
    Logs.info (fun m -> m "%a" pp_nlmsg nlh);
  in
  let rec loop () =
    match Event.(sync (receive firehose)) with
    | Ok q -> Queue.iter logmsg q; loop ()
    | Error _ as err -> err
  in

I have 3 threads set up; one for reading messages from netlink, one for writing messages to netlink, and one for processing. They communicate via Event channels from OCaml's threading library. I really enjoyed setting up the event loop. I think in any language or scenario, event loops are incredibly fun to write, as you have to think about how to write it in a way that keeps the system responsive to new events, while also avoiding unbounded resource usage. The Event library is quite nice; it gives you channels with send and receive operations, and a select function that lets you choose between them:

val send : 'a channel -> 'a -> unit event
val receive : 'a channel -> 'a event
val select : 'a event list -> 'a

let event =
  select [
    receive channel1;
    receive channel2;
    receive channel3;
  ]
in
...

That may seem fairly limiting initially, as it only allows you to select between channels that produce values of the same type. Luckily, there is also the wrap function that can convert an event into another type without waiting for the event to occur.

val wrap : 'a event -> ('a -> 'b) -> 'b event
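As a small, self-contained illustration of why wrap matters, the sketch below selects between an int channel and a string channel by lifting both into one variant type. The msg type and the next and demo functions are made up for this example; it needs the threads library (for instance `threads.posix` in a dune file).

```ocaml
(* Hypothetical common type for events coming from different channels. *)
type msg = Num of int | Text of string

(* Wait on two channels of different types at once. [wrap] lifts each
   receive event into [msg], so [select] can accept both in one list. *)
let next ints strs =
  Event.select [
    Event.wrap (Event.receive ints) (fun n -> Num n);
    Event.wrap (Event.receive strs) (fun s -> Text s);
  ]

let demo () =
  let ints = Event.new_channel () and strs = Event.new_channel () in
  (* Each sender blocks in [sync] until a receiver takes its value. *)
  let t1 = Thread.create (fun () -> Event.sync (Event.send ints 42)) () in
  let t2 = Thread.create (fun () -> Event.sync (Event.send strs "hello")) () in
  let a = next ints strs in
  let b = next ints strs in
  Thread.join t1;
  Thread.join t2;
  (* Arrival order is not deterministic, so sort before returning. *)
  List.sort compare [ a; b ]
```

Channels in the Event module are synchronous, which is why the senders run in their own threads here.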

So, I have a type that describes the different kinds of events that can occur:

type request = (Nlmsghdr.t * nlmsg_reply Event.channel)
type ioevent =
  | Request of request (** request from another thread *)
  | Response of nlmsg_reply (** response from netlink *)
  | Delivered of mailbox (** a thread received its reply *)
  | Flush (** messages were flushed to netlink *)

And helper functions convert send and receive actions into this ioevent type:

let ioevent_of_req r = Request r
let ioevent_of_rsp r = Response r
let ioevent_of_nop _ = Flush
let ioevent_of_ack m () = Delivered m

let recv_rsp ch = Event.(wrap (receive ch) ioevent_of_rsp)
let recv_req ch = Event.(wrap (receive ch) ioevent_of_req)
let flush_req wq ch = Event.(wrap (send ch (Queue.copy wq)) ioevent_of_nop)
let deliver m = Event.(wrap (send m.ch (Ok (Queue.copy m.q))) (ioevent_of_ack m))

So that my event loop looks like this:

let events =
  [ recv_rsp from_netlink ]
  |> (if not full then List.cons (recv_req request) else Fun.id)
  |> (if pending then List.cons (flush_req wq to_netlink) else Fun.id)
  |> Hashtbl.fold try_deliver inflight
in
match Event.select events with
| Flush -> Queue.clear wq; ioloop ()
| Request (nlh, ch) -> handle_req nlh ch; ioloop ()
| Response (Ok q) -> Queue.iter handle_rsp q; ioloop ()
| Delivered mb -> handle_ack mb; ioloop ()
| Response (Error s) -> handle_err s; ()

Requests come in as a message plus an event channel to receive replies on. The event loop generates a sequence number for them and puts them into a hash table. When a reply comes in, the sequence number is used to look up the sender of the request. The events list is dynamically rebuilt on each run based on the state of the system, but it always checks for new replies from netlink, as neglecting that channel could lead to dropped messages. The events that handle messages deliver a queue of messages rather than a single message. This is to set up for the use of sendmmsg and recvmmsg in the future if performance ever becomes an issue.
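The sequence-number bookkeeping can be sketched in isolation. This is a simplified stand-in rather than the loop's real code: requesters are plain callbacks instead of Event channels, and it assumes exactly one reply per request, which is not true of multi-part netlink replies.

```ocaml
(* Hypothetical in-flight table: sequence number -> who to notify. *)
type 'a inflight = {
  mutable next_seq : int;
  waiting : (int, 'a -> unit) Hashtbl.t;
}

let create () = { next_seq = 1; waiting = Hashtbl.create 16 }

(* Assign a fresh sequence number and remember the requester. *)
let register t on_reply =
  let seq = t.next_seq in
  t.next_seq <- seq + 1;
  Hashtbl.replace t.waiting seq on_reply;
  seq

(* Route a reply to its requester. Returns false for an unknown sequence
   number, for example an unsolicited notification from the kernel. *)
let dispatch t seq reply =
  match Hashtbl.find_opt t.waiting seq with
  | Some on_reply ->
    Hashtbl.remove t.waiting seq;
    on_reply reply;
    true
  | None -> false
```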

While the event loop was fun, actually getting messages to flow was a trial verging on an existential crisis. "Why am I here? Why am I doing this?" It wasn't terribly difficult to get the right bits in the right places, but the utter lack of feedback from netlink made it difficult to understand why packets wouldn't flow, because the NFNETLINK socket was just ACK'ing my messages without telling me what was wrong with them.

I had to install the following nftables rules to select the traffic I wanted delivered to the queue:

table inet filter {
  chain PREROUTING {
    type filter hook prerouting priority 0
    policy accept

    # DHCPv4 traffic
    counter meta protocol ip udp sport 67-68 udp dport 67-68 queue num 1

    # DHCPv6 traffic
    counter meta protocol ip6 udp sport 546-547 udp dport 546-547 queue num 1

    # Router solicitations to learn guests' lladdr
    counter icmpv6 type { nd-router-solicit, nd-router-advert } queue num 1
  }
}

Eventually I would like the tapautoconf program to install these rules itself by default, so there is one less thing to set up. Doing so would not require increasing its permissions, since it already has the CAP_NET_ADMIN capability required to read from the netfilter queue.

After that, and a lot of twiddling and logging and experimentation, I was finally rewarded with the message:

dnsmasq-dhcp[2787]: DHCPDISCOVER(nic0) f2:5a:9b:84:98:1e
dnsmasq-dhcp[2787]: DHCPOFFER(nic0) 192.168.0.146 f2:5a:9b:84:98:1e
TAPAUTOCONF: <-{nlmsg_len=408, nlmsg_type=NFQNL_MSG_PACKET, nlmsg_flags=0x0000, nlmsg_seq=0, nlmsg_pid=0}{nfgen_family=AF_UNSPEC, version=NFNETLINK_V0, res_id=0x0001} NFQA_PACKET_HDR={packet_id=5, hw_protocol=0x0800} NFQA_IFINDEX_INDEV=33554432 NFQA_HWADDR={hw_addrlen=6, hw_addr=f2:5a:9b:84:98:1e}  NFQA_PAYLOAD=[45c0 0148 3b8d 0000 4011 bb74 c0a8 0001 c0a8 0092 0043 0044 0134 4292 0201 0600 6b02 0168 002d 0000 0000 0000 c0a8 0092 c0a8 0001 0000 0000 f25a 9b84 981e 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 6382 5363 3501 0236 04c0 a800 0133 0400 000e 103a 0400 0007 083b 0400 000c 4e01 04ff ffff 001c 04c0 a800 ff03 04c0 a800 0106 04c0 a800 01ff 0000 0000 0000 0000]

I was beginning to think that my entire idea was flawed and that I could not intercept packets the way I wanted to. So it was a big relief to see this message, even if I could not immediately understand it. I have to go about parsing the packet. I would like to use the Cstruct package which comes with a nice preprocessor that allows binary data to be parsed easily. It's also a popular package, used extensively in the Mirage project, and, interestingly, in the Charrua project, which is a DHCP server, client, and protocol library. Cstruct's core data types are a buffer, which is just a Bigarray:

type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

And a view into that buffer:

type t = private {
    buffer : buffer;
    off : int;
    len : int;
}

So I would like to treat the payload of NFQA_PAYLOAD attributes as a Cstruct.t, or at least a Bigarray. It's currently constructed like this, as part of the function that iterates over message attributes:

let data: unit ptr = mnl_attr_get_payload nla in
let len: int = mnl_attr_get_payload_len nla in
let p = Ctypes.(from_voidp char data) in
Ctypes.CArray.from_ptr p len

Ctypes does have some support for bigarrays, but the documentation is intimidating:

type 'a bigarray_class = 'a Ctypes_static.bigarray_class
(** The type of Bigarray classes. There are four instances, one for each of the Bigarray submodules. *)

val array1 : < element : 'a ; layout : 'l ; ba_repr : 'b ; bigarray : ('a, 'b, 'l) Bigarray_compat.Array1.t ; carray : 'a carray ; dims : int > bigarray_class
(** The class of Bigarray.Array1.t values *)

val bigarray_of_ptr : < element : 'a ; layout : Bigarray_compat.c_layout ; ba_repr : 'f ; bigarray : 'b ; carray : _ ; dims : 'i > bigarray_class -> 'i -> ('a, 'f) Bigarray_compat.kind -> 'a ptr -> 'b
(** bigarray_of_ptr c dims k p converts the C pointer p to a C-layout bigarray value. No copy is made; the bigarray references the memory pointed to by p. *)

What the heck do those brackets mean? What is Bigarray_compat? Why is the type definition so big? Not letting confusion get the better of me, I tried squinting at the signature of bigarray_of_ptr:

val bigarray_of_ptr : 'dims scary_thing -> 'dims -> 'a other_scary_thing -> 'a ptr -> 'b

Then I saw that scary_thing is a bigarray_class, and other_scary_thing is Bigarray.kind. So our call will look something like:

bigarray_of_ptr array1 ??? Bigarray.Char myptr

The final piece, dims, I guess from its name should be the size of the array. So going back to the NFQA_PAYLOAD attribute, I can try to make a bigarray like so:

bigarray_of_ptr array1 len Bigarray.Char (from_voidp char data)

This gives me the error:

File "netlink/netlink.ml", line 908, characters 40-105:
908 |     | 10 -> next @@ f init NFQA_PAYLOAD (bigarray_of_ptr array1 len Bigarray.Char (from_voidp char data))
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
         (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout)
         Bigarray.Array1.t
       but an expression was expected of type buffer = char carray

Which is actually exactly what I wanted! I just need to change the type definition of this attribute from this:

type _ nlattr_type =
  ...
  | NFQA_PAYLOAD : char carray ifla_type

to this:

type _ nlattr_type =
  ...
  | NFQA_PAYLOAD : (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t ifla_type

but since I plan to use the bigarray through the Cstruct.t type, I may as well make it

type _ nlattr_type =
  ...
  | NFQA_PAYLOAD : Cstruct.t ifla_type
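The reason a one-line type change ripples through the code is that the attribute type appears to be a GADT: each constructor pins down the OCaml type of the value it carries. Here is a cut-down sketch of that pattern, with made-up constructors and bytes standing in for the payload type:

```ocaml
(* Hypothetical, cut-down attribute type: each constructor fixes the
   OCaml type of the value it carries. *)
type _ attr =
  | Ifindex : int attr
  | Hwaddr : string attr
  | Payload : bytes attr

(* Because the value type is tied to the constructor, every function that
   handles [Payload] must agree that it carries [bytes]. Changing the
   constructor's index makes the compiler flag each such match arm. *)
let describe : type a. a attr -> a -> string =
 fun attr v ->
  match attr with
  | Ifindex -> Printf.sprintf "IFINDEX=%d" v
  | Hwaddr -> Printf.sprintf "HWADDR=%s" v
  | Payload -> Printf.sprintf "PAYLOAD=[%d bytes]" (Bytes.length v)
```

Changing `Payload : bytes attr` to some other type would turn the last arm of describe into a type error, which is the feedback loop the compiler provides here.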

I can then address the errors generated by the compiler, one by one, to get things working. Forget test-driven development, how about Compiler-driven development? Reading the payload becomes:

let cstruct =
  from_voidp char data
  |> bigarray_of_ptr array1 len Bigarray.Char
  |> Cstruct.of_bigarray

and writing becomes:

let buf = Cstruct.to_bigarray value in
let len = Cstruct.length value in
mnl_attr_put nlh T.nfqa_payload len (bigarray_start array1 buf |> to_voidp)

The Cstruct library contains its own function to dump a buffer, so pretty-printing becomes:

Fmt.(pf ppf "NFQA_PAYLOAD=[%a]" (octets ()) (Cstruct.length v, Cstruct.get v))

And we're back at feature-parity with the previous CArray-based payloads, only now we have access to the packet-processing features of the Cstruct library, and we can interact with other libraries!

Cstruct comes with a nice pre-processor that allows you to describe a packet as an OCaml record type, with appropriate field types, and generate functions to get/set fields in that packet. However, we don't need to parse every field in a packet. We only care about the following information:

The src/dst ethernet address, to filter packets that match the parent device.
The source and destination IPv4/IPv6 address, to learn the address the user of the device wants to use (or that dhcp wants to give it).
The DHCP message type (DHCPOFFER, DHCPDISCOVER, etc)
The ICMPv6 message type (neighbor/router solicit, router advertisement)

So rather than defining each field, we'll use cstruct's parsing helpers to jump through a packet and just lift the fields we need. There's a bunch of functions for slicing and dicing Cstruct.t values, and reading integers from them in either byte order. I defined a record type that would contain just the information I'm interested in:

type packet = {
  ts: Ptime.t;
  l2_src : bytes;
  l2_dst : bytes;
  l3_src : bytes;
  l3_dst : bytes;
  payload : pkt_payload;
}
and pkt_payload =
  | DHCPDISCOVER of { xid: string; client_id: string }
  | DHCPOFFER    of { xid: string; yiaddr: bytes; address_time: int }
  | DHCPREQUEST  of { xid: string; address_request: bytes }
  | DHCPDECLINE  of { xid: string }
  | DHCPACK      of { xid: string }
  | DHCPNAK      of { xid: string }
  | DHCPRELEASE  of { xid: string; ciaddr: bytes }
  | IGNORE

Then a sequence of functions parse the packet in layers, advancing the start of the Cstruct.t to the end of their respective headers before calling the next parser:

let rec ethernet pkt cs =
  ...
  let pkt = { pkt with l2_src; l2_dst } in
  match ethertype with
  | 0x0800 -> ipv4 pkt cs
  | 0x86DD -> ipv6 pkt cs
  | _ -> { pkt with payload = IGNORE }

and ipv6 pkt cs = { pkt with payload = IGNORE } (* TODO *)
and ipv4 pkt cs =
  ...
  let pkt = { pkt with l3_src; l3_dst } in
  match protocol with
  | 17 -> udp pkt cs
  | _  -> { pkt with payload = IGNORE }

and udp pkt cs =
  ...
  match src_port, dst_port with
  | 67, 68 | 68, 67 -> dhcp pkt cs
  | _ -> { pkt with payload = IGNORE }

and dhcp pkt cs =
  ...
  { pkt with payload =
    match message_type with
    | 1 -> DHCPDISCOVER { xid; client_id }
    | 2 -> DHCPOFFER    { xid; yiaddr; address_time }
    | 3 -> DHCPREQUEST  { xid; address_request }
    | 4 -> DHCPDECLINE  { xid }
    | 5 -> DHCPACK      { xid }
    | 6 -> DHCPNAK      { xid }
    | 7 -> DHCPRELEASE  { xid; ciaddr }
    | _ -> IGNORE (* DHCPINFORM and anything unrecognised *)
  }
in

match mnl_get_attr nlh NFQA_PAYLOAD with
| exception Not_found -> None
| cs ->
  let ts = Ptime_clock.now () in
  let pkt = make_packet ~ts () in
  try
    Some (ethernet pkt cs)
  with Invalid_argument _ ->
    Logs.debug (fun m -> m "out-of-bounds parsing %a" Cstruct.hexdump_pp cs);
    None

The first issue I encountered was that NFQA_PAYLOAD only contained the payload from the l3 header onwards, so my parsing failed. I suspect the contents of the payload depend on the netfilter rules used to enqueue the packets. I have to pull the l2 addresses from the NFQA_HWADDR attributes, and the ethertype from the NFQA_PACKET_HDR attribute. After making those adjustments, and fixing various parsing errors, I was able to parse the DHCPOFFER message coming from the dnsmasq process.

parsed { ts = 2025-01-19 19:16:01 +00:00;
         hwaddr = 2a:39:42:42:3c:f8; l3_src = 192.168.0.1;
         l3_dst = 192.168.0.78;
         payload =
         DHCPOFFER {xid = 2ebc 897a; yiaddr = 192.168.0.78;
           address_time = 3600}
         }
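For readers who want to see the "jump through the packet and lift a few fields" approach end to end, here is a minimal sketch that starts at the IPv4 header, as NFQA_PAYLOAD does. It is not the parser from this project: stdlib Bytes stands in for Cstruct.t, the summary record and function names are hypothetical, and only the fixed offsets of the IPv4, UDP and BOOTP headers are relied upon.

```ocaml
(* Sketch only: stdlib Bytes stands in for Cstruct.t, and [summary] is a
   hypothetical, smaller cousin of the [packet] record. *)
type summary = {
  src : string;          (* IPv4 source, dotted quad *)
  dst : string;          (* IPv4 destination, dotted quad *)
  xid : int32;           (* DHCP transaction id *)
  yiaddr : string;       (* address offered by the server *)
  msg_type : int option; (* DHCP option 53, if present *)
}

let ipv4_at buf off =
  Printf.sprintf "%d.%d.%d.%d"
    (Bytes.get_uint8 buf off) (Bytes.get_uint8 buf (off + 1))
    (Bytes.get_uint8 buf (off + 2)) (Bytes.get_uint8 buf (off + 3))

(* Walk the DHCP options (code, length, value) looking for option 53. *)
let rec find_msg_type buf off =
  match Bytes.get_uint8 buf off with
  | 255 -> None                         (* end of options *)
  | 0 -> find_msg_type buf (off + 1)    (* pad: no length byte *)
  | 53 -> Some (Bytes.get_uint8 buf (off + 2))
  | _ -> find_msg_type buf (off + 2 + Bytes.get_uint8 buf (off + 1))

(* [buf] starts at the IPv4 header. Any out-of-bounds read raises
   Invalid_argument, which is treated as "not a packet we understand". *)
let parse_dhcp buf =
  try
    let udp = (Bytes.get_uint8 buf 0 land 0x0f) * 4 in (* IHL in bytes *)
    let dhcp = udp + 8 in
    let ports =
      (Bytes.get_uint16_be buf udp, Bytes.get_uint16_be buf (udp + 2))
    in
    match Bytes.get_uint8 buf 9, ports with
    | 17, ((67, 68) | (68, 67)) ->
      Some {
        src = ipv4_at buf 12;
        dst = ipv4_at buf 16;
        xid = Bytes.get_int32_be buf (dhcp + 4);
        yiaddr = ipv4_at buf (dhcp + 16);
        (* options follow the 236-byte header and 4-byte magic cookie *)
        msg_type = find_msg_type buf (dhcp + 240);
      }
    | _ -> None
  with Invalid_argument _ -> None
```

The Cstruct equivalents would replace the Bytes getters with Cstruct's big-endian accessors and use Cstruct.shift to advance past each header.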

From here, it's simply a matter of assigning the address in yiaddr to the relevant interface. But I have a problem; tapautoconf never saw the initial DHCPDISCOVER or DHCPREQUEST packet, so it doesn't know where the request came from! If I change the test so that tapautoconf runs from the dhcp server's network namespace, I can see the request, but not the reply:

parsed { ts = 2025-01-19 19:36:48 +00:00;
         hwaddr = f2:bc:8c:0c:42:ef; l3_src = 0.0.0.0;
         l3_dst = 255.255.255.255;
         payload =
         DHCPDISCOVER {xid = 159f 076c;
           client_id = ff8c 0c42 ef00 0100 012f 2010 d0f2 bc8c
           0c42 ef}
         }

Moving tapautoconf to the client's namespace, I get neither. Reading Netfilter hooks, I think the problem is that I am only enqueuing packets in the PREROUTING hook, which will only see packets coming into an interface. I need to collect packets at the OUTPUT hook to capture packets that were generated by applications in the network namespace. I suspect that, if I were actually testing with tap devices, I wouldn't have to do this, but it doesn't hurt.

table inet filter {
  chain PREROUTING { ... }
  chain OUTPUT {
    type filter hook output priority 0
    policy accept
    counter meta protocol ip udp sport 67-68 udp dport 67-68 queue num 1
  }
}

Running from the namespace with the dhcp server, I get both:

{ ts = 2025-01-19 19:56:08 +00:00; indev = 2; outdev = -1;
  hwaddr = a6:d7:34:2e:7f:19; l3_src = 0.0.0.0; l3_dst = 255.255.255.255;
  payload =
  DHCPDISCOVER {xid = ef56 5b4d;
    client_id = ff34 2e7f 1900 0100 012f 2015 58a6 d734 2e7f 19}
  }
{ ts = 2025-01-19 19:56:08 +00:00; indev = -1; outdev = 2; hwaddr = ;
  l3_src = 192.168.0.1; l3_dst = 192.168.0.95;
  payload =
  DHCPOFFER {xid = ef56 5b4d; yiaddr = 192.168.0.95; address_time = 3600} }

I added two fields, indev and outdev, corresponding to the NFQA_IFINDEX_INDEV and NFQA_IFINDEX_OUTDEV attributes, respectively. From the perspective of the dhcp server namespace, the DHCPDISCOVER packet comes in on ifindex 2 (nic0), and the DHCPOFFER packet goes out on the same interface.

I cannot run tapautoconf from the dhcp server namespace because, in real usage, the dhcp server is on a separate machine outside of my control. If I run tapautoconf from the namespace with the parent device, I only see the DHCPOFFER packet. If I run tapautoconf from the client namespace, I see neither. I believe this is because the ipvlan interfaces are in l2 mode. From IPVLAN Driver HOWTO:

"In this mode TX processing happens on the stack instance attached to the slave device and packets are switched and queued to the master device to send out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) as well."

I do want to support broadcast/multicast traffic, because they are necessary for IP autoconfiguration on both IPv4 and IPv6. And I do want to allow cross-talk between sub-interfaces. But the description makes me think that I should be using the layer 2 bridge family in nftables in order to see these packets. Before going that route (heh), there is a bigger problem; the interface index on the intercepted packets seems to be the index of the interface in the network namespace tapautoconf runs from. This makes total sense, as that's where the netfilter stack is intercepting, and where the netlink socket is opened.

However, it means that in order for tapautoconf to associate a packet with the interface that it needs to add an address to, it needs to have the nftables rules installed in every namespace that it has allocated a device for, and needs netlink sockets open for each namespace. I don't want to do that, as it raises all sorts of other problems around opening lots of netlink sockets, dealing with nftables rules being modified by processes in the peer namespaces, and so on.

I am once again confronted with the decision over where to put the tap devices. Initially they were all in one namespace, and clients only got file descriptors for them. Then I waffled and moved the tap devices into the namespace of the client process, so that they would automatically be destroyed when the client namespace was destroyed. Now I am again considering putting them all in the same namespace so that I can intercept their DHCP/ICMPv6 traffic in a scalable way. It would mean that I need to come up with a reliable way to destroy these interfaces when they're no longer in use.

It would also mean that I would need to modify my test setup, because the client namespace will no longer have an interface. I could create a veth pair to carry traffic from the client namespace to tapalloc/tapautoconf's namespace, but would need to install routes to get traffic from one end of the veth into the ipvlan interface. I'm opting instead to change the test to use ipvtap devices, since that is my intended use case anyway. It means I need a DHCP client that receives and sends packets through a file descriptor, which dhclient(8) cannot do. I could run it in a VM using qemu, but instead I decided to use Charrua's Dhcp_client module to implement a dhcp client for the test. Doing this will save me work later when I extend the tapalloc idea to support ipvlan devices, given to containers, instead of just ipvtap devices given to VMs. I'll talk about that later ☺.

There is no guix package for charrua, but guix can automatically import a package definition from the OCaml package repository (OPAM), including definitions for its dependencies:

$ guix import opam --recursive charrua-client

This didn't build right away. After some tedious work getting things to build, I took a more mechanical approach; for each package definition ocaml-NAME, I added the field

(arguments (list #:package "NAME"))

The issue was that the opam importer would generate dependencies for a single opam package, but the dune-build-system would try to build all packages by default, so the generated package definitions were insufficient, and the resulting packages had more than I needed. After doing that, there were only two problems to correct; a missing build-time (native-inputs) dependency of charrua on ocaml-menhir, and a missing definition for the ocaml-ipaddr package in the generated output.

After listing ocaml-charrua-client as a dependency of my tapalloc package, guix shell -D makes the package available in my environment. I can now list charrua-client as a dependency for a module in my dune files to use the Dhcp_client module.

A break for some housekeeping

It seems like there is a lot left to do, but I can feel that I am nearing the end of this project. At least, the path forward seems clear. Since I am building a new test, I want to take a brief interlude to do some tidying up in the project, to make it closer to something I'd actually want to run. It's not a major refactor, just some reorganization.

tapallocd_ipvlan and tapallocd_ipvtap are merged into a single program, that can create either according to the request.
tapalloc_ipvlan and tapalloc are merged into one program that takes a flag to decide which type of device to request.
The tapalloc* programs are renamed to ontap. I want to evoke the imagery of a VM swaggering up to a bar in a saloon and the barkeep pouring him a fresh, cold ~~brew~~ device queue. Another name candidate was patchpanel but to me that gave too much of an impression of 1:1 physical to virtual links rather than the reality of 1:many or small:many. It also gives the impression that the devices could be moved between physical devices, which isn't an explicit goal.
The client is split into two components: ontapc, which is the client, and ontap which is a convenience wrapper that connects to the unix socket without the need to use s6-ipcclient(8)
The client-server protocol is reworked to be a little more flexible and extendable; Requests are now a single line of the form
```
  type=ipvtap name=nic0 queues=1\n
```
and the response is either
```
  ok
  KEY1=VAL1
  KEY2=VAL2
  <end-of-file>
```
in the successful case or
```
  error
  a detailed error message
  <end-of-file>
```
in the failure case.
tapautoconf is renamed to ontap-l3sync
The miscellaneous libc calls stuffed at the end of the Netlink module are split out into a separate Syscall module.
The Sockfd module is renamed to Fdpass.
Chunks of the tests, like the stand-in for udev and s6-ipcserver, are factored out into separate modules under test/test_helpers.
The man pages are updated to reflect the changes.
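The reworked client-server protocol in the list above is simple enough to parse with the standard library alone. The following is a sketch of one way to do it, not the code used by ontap; the function names are hypothetical, and it assumes keys and values contain no spaces.

```ocaml
(* Split "key=value" at the first '='; [None] if there is no '='. *)
let split_kv s =
  match String.index_opt s '=' with
  | Some i ->
    Some (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))
  | None -> None

(* Parse a request line such as "type=ipvtap name=nic0 queues=1\n". *)
let parse_request line =
  let fields =
    String.split_on_char ' ' (String.trim line)
    |> List.filter (fun w -> w <> "")
    |> List.map split_kv
  in
  if List.mem None fields then
    Error ("malformed request: " ^ String.trim line)
  else Ok (List.filter_map Fun.id fields)

(* Parse a whole response, read up to end-of-file. The first line is the
   status; the rest is KEY=VAL pairs or an error message. *)
let parse_response text =
  let lines =
    String.split_on_char '\n' text
    |> List.map String.trim
    |> List.filter (fun l -> l <> "")
  in
  match lines with
  | "ok" :: rest -> Ok (List.filter_map split_kv rest)
  | "error" :: rest -> Error (String.concat "\n" rest)
  | _ -> Error "malformed response"
```

Returning a result from both functions lets the caller treat a malformed message and a server-reported error the same way.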

Strange behavior with mount namespaces

I encountered a strange issue while building the test. Because this test is using tap devices this time, it needs to create character device nodes like my first test did. That test created device nodes in the current working directory. In general I don't like creating real files in a test because you have to think about cleaning them up, about having a writable file system, and so on. I wanted to improve upon the old test by creating the files in a tmpfs, mounted in a new mount namespace. This way, when the test processes exit, the mount namespace would be destroyed along with the file system.

But I had a problem; as the diagram shows above, the "device mgr" runs in the default namespace, outside of the user and mount namespaces where the tmpfs gets mounted. I can use the mknodat system call, which takes as its first argument a file descriptor to a directory, to create the file from outside of the namespace:

I thought I was very clever putting this together, but I was surprised when I saw the error:

ontapd: [DEBUG] device manager notification for tap3 received
ontapd: [ERROR] could not allocate device: open(/tmp/tap3): Permission denied

I ticked off the obvious causes:

The file was owned by the same uid:gid in both namespaces
The file had read-write permissions (0666)
The tmpfs did not have the "nodev" option set

And ran a number of additional experiments to explore the problem:

Creating a regular file worked (the file could be read within the namespace)
Creating the node outside of the tmpfs, in a directory shared between both namespaces, worked.
Creating and opening a node in the same namespace worked.

I started reading the tmpfs implementation in the Linux kernel, but felt like I was getting too sidetracked, so I wrote a minimal implementation in C and drafted an email to the linux-fsdevel list. However, as I did so, I found enough indicators that this was expected behavior that I sent the email to my personal tracker instead. I will look into it later.

In the meantime, I can create the device in the current working directory, as dune test seems to perform some cleanup on my behalf anyway.

A simple dhcp client

So I've added the charrua dhcp library to my project. The Dhcp_client module does not make any assumptions about how to send or receive packets; it lets you create a Dhcp_client.t, which is a state machine. You feed it a packet via the input function, and it will return its new state, and a suggestion for what to do next:

[ `Response of t * Dhcp_wire.pkt (** send the provided packet *)
| `New_lease of t * Dhcp_wire.pkt (** you've got a lease, celebrate! *)
| `Noop  (** don't do anything *)
]

It can be driven by a simple loop:

open Cstruct
let rec wait (fd:Unix.file_descr) (cs:Cstruct.t) (client:Dhcp_client.t) =
  (* read one packet into the buffer and keep only the bytes received *)
  let msg =
    Unix.read_bigarray fd cs.buffer cs.off cs.len |> Cstruct.sub cs 0
  in
  match Dhcp_client.input client msg with
  | `Response (client', pkt) -> send_request fd cs client' pkt
  | `New_lease (_, pkt) -> Ok pkt.yiaddr
  | `Noop -> wait fd cs client

and send_request fd cs client pkt =
  let len = Dhcp_wire.pkt_into_buf pkt cs in
  Unix.send_bigarray fd cs.buffer cs.off len |> ignore;
  wait fd cs client

It needs a few modifications:

We need to set a unique xid and client id for our requests. I do this by seeding a random number generator with the current process id:
```
  let xidgen =
    let prng = Random.State.make [| Unix.getpid () |] in
    fun () -> Random.State.int32 prng Int32.max_int
```
By default, it sets the BROADCAST flag in its requests, which instructs the dhcp server to send its replies to the broadcast address, rather than the unicast address. While this makes things easier for us, since ipvlan devices in l2 mode can receive broadcast packets without an address, I don't want to rely on that. Most dhcp clients will be configured not to ask for broadcast replies, and I don't want to have to change the defaults.

The modifications can go into the send_request function:

 and send_request fd cs client pkt =
+  let pkt = { pkt with flags = Dhcp_wire.Unicast; xid = xidgen () } in

A real client would also be accounting for the passage of time, requesting a new lease before the current one expires, and it would have to react to changes in the network interface state, system suspend/resume, and so on. But getting a lease, one time, is sufficient for this test.

Throwing in some print statements, the current state of the test looks like this:

dhclient: ->{...XID 1976429091 options Client ID ID 0 343130383831313335, Message type DHCP DISCOVER...}
dnsmasq-dhcp[23960]: DHCPDISCOVER(ipv0) 8e:ec:6e:5f:d4:e1
dnsmasq-dhcp[23960]: DHCPOFFER(ipv0) 192.168.0.36 8e:ec:6e:5f:d4:e1
ontap_l3sync: [DEBUG] { ts = ...; indev = 2; outdev = -1; hwaddr = 8e:ec:6e:5f:d4:e1; l3_src = 192.168.0.1; l3_dst = 192.168.0.36; payload = DHCPOFFER {xid = 1976429091l; yiaddr = 192.168.0.36; address_time = 3600} }

So ontap_l3sync can see the unicast DHCPOFFER coming from the DHCP server, but it doesn't see the initial DHCPDISCOVER request coming out of the tap device. We have come back to the problem we faced before we switched the test from ipvlan to ipvtap interfaces.

I know that ipvlan (and macvlan) interfaces are "special", in that traffic bypasses most of the networking stack of the parent interface's network namespace. But in this case, the parent and sub-interfaces are in the same namespace! With some experimentation I find that if I run the helper in its own namespace with its own ipvlan device, the netfilter rules can intercept both packets, but then I have the same problem of not being able to tell what interface the packets originated from.

I'm again starting to worry that I cannot make this work in a sustainable way. I do have the option of going back to the previous model of putting ipvtap devices in clients' namespaces, then obtaining a netlink queue from the namespace of every sub-interface. That will certainly work, I just worry about holding too many netlink sockets open. There is also, of course, the option of binding a raw socket to each interface and filtering DHCP/ICMPv6 traffic that way. I was hoping not to do that, as I don't want to open one socket for every interface.

I found myself getting frustrated, so I sent an e-mail to the netfilter mailing list asking for help, and took a side-quest to implement a ring buffer to avoid the rampant copying of packet buffers that was happening within the tools. While these programs are in the path for protocols that normally will see only a few packets per hour, and performance is not really a concern, I needed to work on something without external dependencies and the result will be useful in the future. It also had the benefit of simplifying the programs, which were doing inter-thread communication using a hodge-podge of Event channels and queues of pointers.

Coming back to this after a short break, I read drivers/net/tap.c and drivers/net/ipvlan/ipvlan_core.c more closely. When an application like qemu wants to transmit a frame, it calls sendmsg on the file descriptor it has for the tap device. In the kernel, tap_sendmsg runs and calls tap_get_user, which copies the packet data from user space into an sk_buff structure and calls dev_queue_xmit. That in turn calls the device's ndo_start_xmit handler (skb->dev->netdev_ops->ndo_start_xmit), which, for an ipvlan/ipvtap device, is set to ipvlan_start_xmit, which finally gets to ipvlan_queue_xmit. From here there are 3 major paths to transmit the packet:

If the destination is another ipvlan device on the same parent device, look it up (by its l3 address) and call its receive function (ipvlan_rcv_frame).
If the packet is a multicast or broadcast packet, and the ipvlan is in l2 mode, pass it to ipvlan_multicast_enqueue.
If the packet has an external destination, call dev_queue_xmit on the parent device.

How does netfilter factor into this? The dev_queue_xmit function passes the sk_buff structure to the egress hook, so it will show up on its way out of the ipvtap device, and on its way from the ipvtap device to its parent device in the egress hook. That's the only hook; by accepting ethernet frames, tap devices essentially skip almost all of the Linux kernel's network stack. In more conventional setups like a tap+bridge combo, or using routes & IP forwarding, you have the option of intercepting packets as they pass through some other element like a bridge. But macvtap and ipvtap devices provide almost direct access to physical device queues, which has good performance implications, but makes the transmit path especially stealthy.

That explains why I was able to log the packet using an egress hook, but no other hooks. As far as I can see, that is the only hook the packet will pass through. If it is distributed to another ipvlan device, it will show up on the INPUT and PREROUTING hooks for that device. For a multicast packet, it is scrubbed before being enqueued for transmission, to avoid leaking information to other network namespaces. This foils any plans to learn the originating interface by adding an extra ipvlan device for snooping.

The netdev family of netfilter hooks do not support queuing packets to user space. So what does that mean for this project? Not all hope is lost! We've hit another obstacle, but we have some options. Two strategies come to mind:

Tap (à-la tcpdump) every ipvtap device with a filter for dhcp/icmpv6 packets and feed them to my program.
Since netfilter can still emit logs or traces of packets, we can correlate those logs with broadcast/multicast packets which we can receive normally (by creating a socket with the SO_BROADCAST option).

I had hesitated to pursue #1 because I did not want to maintain a socket for every ipvtap device. However, reading packet(7), it looks like that concern is misplaced; the sockaddr_ll structure includes the interface index of the link the packet originated from. So I would only need to maintain one socket per network namespace. In my setup I'm planning to keep the ipvtap devices all in the same namespace, so it's very doable.

While I need to add code to set up an AF_PACKET socket and install a filter for DHCP & ICMPv6 packets, I can re-use the existing architecture of the helper, which consisted of 3 parts:

One thread that synchronized a local cache of existing links and the addresses assigned to them, fed by an rtnetlink socket.
One thread that learned new addresses to assign to links, fed by an nfnetlink_queue socket, and queuing commands to an rtnetlink socket.
One thread that handled I/O to and from the netlink sockets and application queue using select(2)

Most of the program can stay as-is; I just need to replace the nfnetlink_queue socket with an AF_PACKET socket and slightly change the parsing functions, since the input packets will now include the ethernet frame (nfnetlink only gave me the packet from the L3 header up).

In hindsight, I wish I was less stubborn about using netfilter queues, but I also feel like I'm better for having struggled to use them here, as I have a better mental model of how things work.

I wrote a quick proof-of-concept program, snoop.c, that sets up an AF_PACKET socket, attaches a BPF filter generated by

tcpdump -dd udp portrange 68-67 or udp portrange 546-547 or icmp6

locks the filter so it cannot be changed or detached, and then dumps everything received by the socket to standard output, while logging the sockaddr_ll structure populated by recvmsg(2). I'm able to see the DHCP and SLAAC-related packets. When they are emitted by the ipvtap device they have packet type PACKET_OUTGOING, and when they are forwarded to any other ipvlan devices on the same parent, or forwarded to the parent itself, they have packet type PACKET_BROADCAST.

Because opening an AF_PACKET socket requires a new privilege, CAP_NET_RAW, I will write a separate program that does the following:

Opens an AF_PACKET socket
Attaches a BPF filter for ip configuration packets and locks the filter
Executes into another command line

I'm separating things like this so I can drop privileges between setting up the socket and executing the helper program.

After some more hacking and bug squashing, I can finally achieve my goal:

A Plan 9 VM, using the host wireless interface through an ipvtap device, acquiring an IPv4 address from the local network. Catclock can't bear to watch.

The helper program is able to observe the DHCP requests and replies. It configures a short-lived address (60 seconds) on observing a DHCPOFFER, then updates the lease time when the DHCPREQUEST/DHCPACK transaction is completed. For IPv6, it intercepts Router and Neighbor solicitations to learn and configure the link-local v6 addresses, then whenever it sees a router advertisement, it configures addresses for all prefixes with the proper bits set, masking the link-local address with the prefix.
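
As a rough illustration of "masking the link-local address with the prefix", here is a minimal sketch that keeps the advertised prefix bits and takes the remaining bits from the link-local address. The name slaac_address and the choice to represent addresses as 16-byte strings in network order are assumptions made for this example, not the helper's actual code.

```ocaml
(* Combine an advertised prefix with the interface identifier of a
   link-local address. Both addresses are 16-byte strings in network
   byte order. [slaac_address] is a hypothetical helper. *)
let slaac_address ~prefix ~prefix_len ~link_local =
  if String.length prefix <> 16 || String.length link_local <> 16 then
    invalid_arg "slaac_address: addresses must be 16 bytes";
  if prefix_len < 0 || prefix_len > 128 then
    invalid_arg "slaac_address: bad prefix length";
  String.init 16 (fun i ->
      (* Number of leading bits of byte [i] that belong to the prefix. *)
      let bits = max 0 (min 8 (prefix_len - (8 * i))) in
      let mask = (0xff lsl (8 - bits)) land 0xff in
      let p = Char.code prefix.[i] land mask in
      let l = Char.code link_local.[i] land (lnot mask land 0xff) in
      Char.chr (p lor l))
```

With a /64 prefix this simply takes the first eight bytes from the router advertisement and the last eight from the link-local address.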

While this is a success, there are a couple problems I've observed:

The client does not receive the first DHCPOFFER, because the address it's offering has not been configured on the interface yet. The client has to send a retry. This could be avoided by replaying the packet after the address is configured, or using netfilter queues, which allow you to hold the packet back until you know it can be delivered.
My home router, which I got from my ISP, ignores the Client-identifier option when allocating an IP address. This means my VM receives the same IPv4 address that my host would. It's unfortunate, but I did expect that some routers would do this. IPv6 will still work fine, since clients are able to choose their own address.

I don't mind going ipv6-only for now. In the future I can arrange for clients to share an IPv4 address for outbound requests using NAT. I planned to do so anyway to avoid consuming too many IPv4 addresses on busy public networks, which would be rude.

Now that I've proven the concept works, I can afford to do more extensive cleanup, deleting dead code and such. For this project I tried as hard as I could to defer this stage, so a lot of cruft had built up, and it felt really good to clean it out.

I started using the Cstruct.t type everywhere. It's a simple (buffer, offset, length) triplet that makes it harder to make mistakes when working with regions of bigarrays.
I removed a ton of dead code. All of the netfilter-related code was removed, since that was a road I chose not to take.

I also added more features to ontap-l3sync:

A blacklist of addresses that we should never assign to an ipvtap device, including any addresses with the "default" eui-64 address that is derived from the (shared) MAC address.
A periodic poll of ipvtap links and any addresses assigned to them, to detect changes that occur while ontap-l3sync is not running or a notification was lost (Netlink can drop messages).
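
For reference, the "default" EUI-64 interface identifier is derived from a MAC address as described in RFC 4291, Appendix A: flip the universal/local bit of the first byte and insert ff:fe in the middle. A minimal sketch of computing the address to refuse follows; the function names and the string representation are assumptions for this example.

```ocaml
(* Modified EUI-64 interface identifier for a 6-byte MAC address
   (RFC 4291, Appendix A). [eui64_of_mac] is a hypothetical name. *)
let eui64_of_mac mac =
  if String.length mac <> 6 then invalid_arg "eui64_of_mac: need 6 bytes";
  (* Flip the universal/local bit (0x02) of the first byte. *)
  let first = Char.chr (Char.code mac.[0] lxor 0x02) in
  String.make 1 first ^ String.sub mac 1 2 ^ "\xff\xfe" ^ String.sub mac 3 3

(* The address to keep off an ipvtap device: the first 8 bytes of an
   advertised /64 prefix followed by the identifier above. *)
let default_slaac_address ~prefix mac =
  if String.length prefix <> 8 then
    invalid_arg "default_slaac_address: need an 8-byte prefix";
  prefix ^ eui64_of_mac mac
```

Because every ipvtap device shares the parent's MAC address, every guest that uses this derivation would arrive at the same address, which is why it belongs on the list.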

Revisiting the cleanup problem

I've waffled back and forth on which netns to put ipvtap devices in. My current implementation requires all of the ipvtap devices managed by ontap-l3sync to live in the same network namespace, so I cannot use namespaces as a cleanup mechanism to destroy the ipvtap devices when the processes using them are gone. I have to think of something else.

What I came up with is to have ontapd use flock(2) to associate a shared lock with the open file descriptor(s) for the tap device before passing it to the client. Then, a new program, ontap-sweep, attempts to take exclusive locks on any ipvtap devices connected to the specified parent device. For devices which are still in use, this will block, and for devices which are no longer in use, this will succeed, and ontap-sweep will proceed to delete them.

In writing this tool, I could not find a way to wait on a large number of flock(2) calls that didn't involve either polling with non-blocking flock, or creating one thread per flock. I went for the latter, since I really don't like polling, and I don't expect to have anywhere near enough interfaces that the number of threads would be a problem.

Despite the difficulties, my approach worked; the ontap-sweep program was able to detect and quickly clean up unused ipvtap interfaces. It worked too well; the detection was so quick that it would delete the interfaces when ontapd created them, but before it called flock on them! Rather than moving ontap-sweep's functionality into ontapd, I decided to allow ontapd to pause ontap-sweep's activity by opening a unix socket with it, which is closed automatically when the ontapd process for a client's request exits.

Future work

I've completed enough that I'm able to start using this project to work on other things. My plan now is to use it for a long time, fixing any problems that come up, but not really to try and change anything too drastically. This project was meant to be a tool to enable quick, secure, and optionally ephemeral sandboxes for other projects I want to work on and services I want to run.

As the result of an exploratory, iterative process, the code base of this project is a mess. It can be drastically simplified and refactored. Specifically:

The Netlink module could be split out into a standalone library with a more consistent and complete API. I'd like to wait a few kernel releases to get a feel for how often it changes, though. The Linux kernel has a pretty good track record of not breaking userspace programs, so I'm optimistic that the maintenance burden won't be too high.
The tests should not require any elevated privileges. Right now they need CAP_MKNOD.
Experiment with ways to work around clients not setting client IDs, or DHCP servers not using them to assign clients a unique IP. One idea is to modify the chaddr field of the DHCP requests, while acting as a DHCP relay in order to receive the DHCP server's reply.

Other fun stuff

Crashing on errors

I make extensive use of the result type defined in the OCaml standard library. A result can be either a value or an error:

match some_operation_that_could_fail args with
| Error e -> handle_error e
| Ok v -> do_thing v

In many cases, what I want to do with an error is exit the function and return the error:

| Error _ as err -> err
| Ok v -> (* keep going *)

Because this is a common idiom, I define a binding operator like so:

let ( let* ) = Result.bind

Then, returning an error or continuing with the rest of the function looks like this:

let* v = some_operation_that_could_fail args in
(* the rest of the function *)

I prefer the result type over exceptions because I can forget to handle an exception, but I can't really forget to handle an error. However, when I'm debugging, I really like to have stack traces showing me where an error was generated. So I changed my binding operator like this:

let ( let* ) ret f =
  match ret with
  | Error str -> failwith str
  | Ok v -> f v

This only works with string errors, which is fine by me.
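
To see the two definitions side by side, here is a small self-contained sketch. The module names and parse_port are made up for illustration; only the two operator definitions come from the text above.

```ocaml
(* Two interchangeable definitions of the binding operator. *)
module Propagate = struct
  let ( let* ) = Result.bind
end

module Crash = struct
  let ( let* ) ret f =
    match ret with
    | Error str -> failwith str
    | Ok v -> f v
end

(* A hypothetical fallible operation, used only for illustration. *)
let parse_port s =
  match int_of_string_opt s with
  | Some n when n > 0 && n < 65536 -> Ok n
  | _ -> Error ("bad port: " ^ s)

(* Returns the first Error to the caller. *)
let sum_ports_propagate a b =
  let open Propagate in
  let* x = parse_port a in
  let* y = parse_port b in
  Ok (x + y)

(* Same body, but the first Error raises Failure at the point of the bind. *)
let sum_ports_crash a b =
  let open Crash in
  let* x = parse_port a in
  let* y = parse_port b in
  Ok (x + y)
```

The function bodies are identical; swapping the module that is opened is enough to switch between returning errors and crashing with a backtrace.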

C constants and enums as GADTs

I chose to represent C constants like RTM_NEWLINK as variants rather than integers. The initial reason for this was very superficial; variant constructors can be capitalized, but ordinary values like integers must start with a lowercase letter. In addition, variant constructors can be used in pattern matching, but ordinary values cannot. So this:

enum {
	RTM_NEWLINK = 16,
	RTM_DELLINK,
	RTM_GETLINK,
};

became

type nlmsg_type =
  | RTM_NEWLINK
  | RTM_DELLINK
  | RTM_GETLINK
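
Going between the two representations takes a pair of conversion functions. A minimal sketch, with the numbers written out by hand so that it stands alone (the function names are hypothetical, and in a real binding the values would more likely be read from the kernel headers):

```ocaml
type nlmsg_type =
  | RTM_NEWLINK
  | RTM_DELLINK
  | RTM_GETLINK

(* Values follow the C enum: RTM_NEWLINK = 16, then counting up. *)
let int_of_nlmsg_type = function
  | RTM_NEWLINK -> 16
  | RTM_DELLINK -> 17
  | RTM_GETLINK -> 18

(* Integers coming back from C may not correspond to any constructor,
   so the reverse direction returns a result. *)
let nlmsg_type_of_int = function
  | 16 -> Ok RTM_NEWLINK
  | 17 -> Ok RTM_DELLINK
  | 18 -> Ok RTM_GETLINK
  | n -> Error (Printf.sprintf "unknown nlmsg_type %d" n)
```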

This required extra plumbing between the C and OCaml functions to convert between the two representations, but I think it was worth it. The netlink APIs have a few areas where you get or set a property, passing a type (key) and payload (value). Here are a few examples:

setsockopt/getsockopt to set and get options on the netlink socket.
mnl_attr_put and mnl_attr_get functions set and get attributes at the end of a message.

These are notable because the value has a different shape depending on the key used. For example, getsockopt(NETLINK_CAP_ACK, ...) retrieves an int, but getsockopt(NETLINK_LIST_MEMBERSHIPS, ...) retrieves an array of uint32_t values. In C, you just read the docs and cast your void* buffer to whatever type the docs tell you to, and hope you don't make a mistake. In OCaml we can use the type checker to prevent ourselves from making an incorrect cast or passing a wrong-size buffer.

Initially you could attempt to use variants for this:

type netlink_sockopt =
  | NETLINK_CAP_ACK of bool
  | NETLINK_EXT_ACK of bool
  | NETLINK_LIST_MEMBERSHIPS of int list

Then, our setsockopt function could look like this:

let mnl_socket_setsockopt sock = function
  | NETLINK_CAP_ACK tf -> setsockopt_bool sock 0x0a tf
  | NETLINK_EXT_ACK tf -> setsockopt_bool sock 0x0b tf
  | NETLINK_LIST_MEMBERSHIPS _ -> invalid_arg "NETLINK_LIST_MEMBERSHIPS is read-only"

and call it like this:

mnl_socket_setsockopt sock (NETLINK_EXT_ACK true)

but how would we call the getsockopt function? An expression like this

mnl_socket_getsockopt sock NETLINK_EXT_ACK

is invalid; the NETLINK_EXT_ACK constructor requires one argument. You can use a generalized algebraic data type (GADT) here. They are often described as a way to evaluate trees of expressions in your type system, but they actually have a lot more uses, as a general way to separate a type from its representation.

type _ netlink_sockopt =
  | NETLINK_CAP_ACK : bool netlink_sockopt
  | NETLINK_EXT_ACK : bool netlink_sockopt
  | NETLINK_LIST_MEMBERSHIPS : int list netlink_sockopt

let mnl_socket_setsockopt (type t) nl (k : t netlink_sockopt) (v : t) =
  match k with
  | NETLINK_CAP_ACK -> setsockopt_bool nl 0x0a v
  | NETLINK_EXT_ACK -> setsockopt_bool nl 0x0b v
  | NETLINK_LIST_MEMBERSHIPS -> invalid_arg "NETLINK_LIST_MEMBERSHIPS is read-only"

To call it, you drop the parentheses:

mnl_socket_setsockopt sock NETLINK_EXT_ACK true

Getting works as you'd expect:

mnl_socket_getsockopt sock NETLINK_EXT_ACK

And the implementation looks like this:

let mnl_socket_getsockopt : type t. mnl_socket -> t netlink_sockopt -> t =
  fun nl opt ->
  match opt with
  | NETLINK_CAP_ACK -> get_bool_sockopt nl 0x0a
  | NETLINK_EXT_ACK -> get_bool_sockopt nl 0x0b
  | NETLINK_LIST_MEMBERSHIPS -> get_list_sockopt nl 0x09

As you can see, there are a few more type annotations required in order to appease the type checker, but in most cases, they are tucked away in the library and calling code doesn't have to think about them. The Unix module from the OCaml standard library takes the approach of splitting up the options into separate groups based on their underlying types, and having one function for getting and setting each type.
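
If you want to play with the idea without a netlink socket, here is a self-contained sketch that swaps the system calls for a hash table. The fake_socket type and the shortened constructor names are stand-ins invented for this example; the option numbers are the ones used above.

```ocaml
(* A stand-in for a socket: option numbers mapped to raw integer lists.
   Everything here is hypothetical; no system calls are made. *)
type fake_socket = (int, int list) Hashtbl.t

type _ sockopt =
  | CAP_ACK : bool sockopt
  | EXT_ACK : bool sockopt
  | LIST_MEMBERSHIPS : int list sockopt

(* The type of the value is decided by the key that is passed in. *)
let set : type t. fake_socket -> t sockopt -> t -> unit =
  fun sock opt v ->
  match opt with
  | CAP_ACK -> Hashtbl.replace sock 0x0a [ Bool.to_int v ]
  | EXT_ACK -> Hashtbl.replace sock 0x0b [ Bool.to_int v ]
  | LIST_MEMBERSHIPS -> invalid_arg "LIST_MEMBERSHIPS is read-only"

(* Each branch returns a different OCaml type; unset options read as
   false or the empty list. *)
let get : type t. fake_socket -> t sockopt -> t =
  fun sock opt ->
  let raw n = Option.value (Hashtbl.find_opt sock n) ~default:[] in
  match opt with
  | CAP_ACK -> raw 0x0a = [ 1 ]
  | EXT_ACK -> raw 0x0b = [ 1 ]
  | LIST_MEMBERSHIPS -> raw 0x09
```

With these definitions, a call such as set sock EXT_ACK 1 is rejected by the type checker, because EXT_ACK fixes the value type to bool.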

The sockopt case is trivial, but I think there will be other opportunities to use GADTs for performance; creating a variant introduces a level of indirection and creates more work for the garbage collector (see Memory Representation of Values).

Retrospective

If my goal was to get a working service as quickly as possible, with minimum frustration and head-scratching, I utterly failed that goal. At just about every possible step I chose the hard road; faced with a system protocol and library that fully exercises the freedom of C, that has accumulated decades of quirks, that is difficult to debug because most of it runs in the kernel, I decided to try using it from a language I did not fully understand, with its own special flavor of freedom.

The state of the codebase when I was able to use the service "for real" is atrocious. Thousands of lines of code. Inconsistencies everywhere. Shifting patterns in how types are declared and used; you can see how my thoughts and patterns changed over time.

A lot of time was spent simply translating C declarations into OCaml ones. I think ppx_cstubs could have majorly cut down on that aspect. Similarly, ppx_deriving could save me the effort of defining printers for many types. Many of the time sinks were of my own making; I deliberately chose to use features of OCaml that I didn't fully understand as a way to force myself to learn about them. As an example, my decision to use GADTs for message attributes really hit me in the face when I went to implement a function that would iterate over all attributes. Behold the glory of mnl_ifla_walk!

let mnl_ifla_walk ~offset fn (init: 'a) nlh =
  let children nla = mnl_attr_get_payload nla |> from_voidp T.Nlattr.s in
  let continue f k nla len fn init =
    let len' = len - mnl_attr_get_len nla in
    let nla' = mnl_attr_next nla in
    if not (mnl_attr_ok nla' len') then k init
    else f k nla' len' fn init
  in
  let kind = ref "" in
  let rec toplevel k (nla:T.Nlattr.t) (len:int) (({f}: 'a ifla_fn) as fn) (init: 'a) =
    let next = continue toplevel k nla len fn in
    match mnl_attr_get_type nla with
    | 0x01 -> next @@ f init IFLA_ADDRESS (mnl_attr_get_bytes nla)
    | 0x03 -> next @@ f init IFLA_IFNAME  (mnl_attr_get_str nla)
    | 0x04 -> next @@ f init IFLA_MTU     (mnl_attr_get_u32 nla)
    | 0x05 -> next @@ f init IFLA_LINK    (mnl_attr_get_u32 nla)
    | 0x0d -> next @@ f init IFLA_TXQLEN  (mnl_attr_get_u32 nla)

    | 0x13 -> next @@ f init IFLA_NET_NS_PID (mnl_attr_get_u32 nla)
    | 0x1c -> next @@ f init IFLA_NET_NS_FD  (mnl_attr_get_u32 nla)

    | 0x1f -> next @@ f init IFLA_NUM_TX_QUEUES (mnl_attr_get_u32 nla)
    | 0x20 -> next @@ f init IFLA_NUM_RX_QUEUES (mnl_attr_get_u32 nla)

    | 0x12 ->
      kind := "";
      linkinfo next (children nla) (mnl_attr_get_payload_len nla) fn init
    | x -> next @@ f init IFLA_UNKNOWN x

  and linkinfo return nla len (({f}: 'a ifla_fn) as fn) (init: 'a) =
    let next = continue linkinfo return nla len fn in
    match mnl_attr_get_type nla with
    | 1 ->
      kind := mnl_attr_get_str nla;
      next @@ f init IFLA_INFO_KIND !kind
    | 2 when !kind = "ipvlan" || !kind = "ipvtap" ->
      ipvlan next (children nla) (mnl_attr_get_payload_len nla) fn init
    | 2 when !kind = "macvlan" || !kind = "macvtap" ->
      macvlan next (children nla) (mnl_attr_get_payload_len nla) fn init
    | x -> next @@ f init IFLA_UNKNOWN x

  and macvlan return nla len ({f}: 'a ifla_fn) (init: 'a) =
    let next = continue macvlan return nla len fn in
    match mnl_attr_get_type nla with
    | x -> next @@ f init IFLA_UNKNOWN x

  and ipvlan return nla len ({f}: 'a ifla_fn) (init: 'a) =
    let next = continue ipvlan return nla len fn in
    match mnl_attr_get_type nla with
    | 0x01 ->
      next @@ f init IFLA_IPVLAN_MODE (mnl_attr_get_u16 nla |> ipvlan_mode_of_int)
    | 0x02 ->
      next @@ f init IFLA_IPVLAN_FLAGS (mnl_attr_get_u16 nla |> ipvlan_flag_of_int)
    | x -> next @@ f init IFLA_UNKNOWN x
  in

  let len = mnl_nlmsg_get_payload_len nlh - offset in
  let nla = mnl_nlmsg_get_payload_offset nlh offset |> from_voidp T.Nlattr.s in
  if not (mnl_attr_ok nla len) then init
  else toplevel Fun.id nla len fn init

It's beautiful ☺. I give it two weeks before I no longer understand it. I wrote it in continuation-passing style because hey, why not? I especially like the naked integer constants. I was literally learning the structure of the data as I went along.

I am confident I could implement this function in a much better way, having done it once. However, one thing I tried to stick to on this project is, as much as I could help it, not to go back and "fix" things to make them pleasant to look at. I tried only to address the errors that were coming out of my test runs. I was worried that if I kept going back to fiddle and adjust and rename every little thing, I would never finish.

General, hand-wavy observations

These are just general thoughts I had that I didn't care to develop further, but don't want to lose, so I'm writing them down.

OCaml's type system is powerful enough to replace a lot of things I would use unit tests for in other languages. Rather than having tests that check if a function behaves as expected when you pass it bad input, you use the type system to make bad input impossible. Then all your little tests are obviated by the question "does my program compile?"
A caveat to that is that using C and, by extension Ctypes, subverts most of the benefits that you get from OCaml's type system. It is just as easy to segfault with Ctypes as it would be with C. You have to be fairly disciplined not to make the same kinds of mistakes.
Dune still feels a lot like magic, and not in a good way. I don't know what it's doing most of the time, so when it doesn't work I'm befuddled. I would like to spend some time trying to build an OCaml project with Make. There are some example Makefiles floating around the internet, but they are all too complicated because they try to be as flexible as possible and automatically resolve everything for you. I have no qualms with listing and maintaining inter-module dependencies manually.
I think that there is a feedback loop between the build tools used for a language and the complexity of building software in that language. The more language-specific help a build tool gives a language, the more license the users of that language have to pull in dependencies that they may not have otherwise, introduce convoluted conditional compilation rules, multi-stage code generation, and in general build software in a way that may seem convenient for the developer in the moment, but could make things harder for packagers, distributors, and new contributors to the software in the long run. A simpler way of putting it; if your build tool makes building complex software easy, developers will build more complex software. If the tool doesn't make building complex software easy, they will build simpler software. The key is finding the right balance between irritating developers and enabling them to commit atrocities.
The thing that excites me the most about GADTs is that they separate the type of a value from its representation. If you're familiar with how OCaml represents values, they are boxed and look like this:

Then there's a little trick OCaml does to store 63-bit (or 31-bit, depending on your platform) integers directly in the word that would otherwise hold a pointer, for performance, because integers are so commonly used. The representation is reasonably compact, and the uniformity allows OCaml to implement generic functions in a really elegant way. However, a tradeoff of this approach is that when you construct a value, for example
```
  let xid = ... in
  let yiaddr = ... in
  let address_time = ... in
  DHCPOFFER { xid; yiaddr; address_time }
```
the constituent elements of the value, which you may have already allocated in other parts of the program, must be copied into a box. This is fine if the values are small strings and integers like the example above. But when the values are packet payloads which can be up to 64KiB in size, copying these values creates work for the garbage collector, which can introduce unexpected latency.

What a GADT lets you do is effectively cleave the header off of the value and manipulate it separately:

Not only is this really appealing for performance reasons, it's also appealing because you can use different headers to interpret the same data. The same buffer can be a different type of packet based on the GADT you pair it with.
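
One way to picture "different headers over the same data" is a GADT whose constructors each name a way of reading a buffer and fix the type that reading produces. This is only a sketch of the idea using the standard library's Bytes; the view type and its constructors are invented for illustration.

```ocaml
(* Each constructor names one way of reading a buffer and fixes the
   OCaml type that reading produces. All names are hypothetical. *)
type _ view =
  | Ethertype : int view   (* bytes 12-13 of an Ethernet frame *)
  | Dst_mac : string view  (* bytes 0-5, formatted as text *)
  | Length : int view      (* total size of the buffer *)

(* The buffer is never copied or wrapped; the view is passed separately. *)
let read : type a. a view -> Bytes.t -> a =
  fun view buf ->
  match view with
  | Ethertype -> Bytes.get_uint16_be buf 12
  | Length -> Bytes.length buf
  | Dst_mac ->
    List.init 6 (fun i -> Printf.sprintf "%02x" (Bytes.get_uint8 buf i))
    |> String.concat ":"
```

The constructors carry no payload, so choosing a view allocates nothing, and the same Bytes.t can be handed to read with any of them.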

I've had an idea floating around in my head that you could come up with a GADT that describes the layout of a piece of binary data, and have a set of functions that produce optimized versions of that GADT for encoding or decoding, or even decoding specific fields. Similar to what printf and scanf does for strings, or even what the ppx_cstruct preprocessor does for binary data, but without the preprocessing part. It would be really cool if you could build those optimized versions at compile time, but I don't know if that's possible.
Binding to libmnl was mostly a waste of time. If I could go back, I would have only used ctypes to access the various constants used by the netlink protocols, and to fill in the parts of the socket(7) api that are missing from the Unix module, and implemented the rest of the functionality in pure OCaml. It is fun and easy to decode and encode binary data in OCaml, especially with modules like Cstruct. And writing C stubs "by hand" is not really that tedious.
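
To give a flavor of what the pure OCaml route could look like, here is a minimal sketch that encodes and decodes the 16-byte netlink message header (struct nlmsghdr) by hand. It uses the standard library's Bytes rather than Cstruct so that it has no dependencies; the record and function names are my own for this example.

```ocaml
(* struct nlmsghdr from linux/netlink.h: 16 bytes, host byte order.
     __u32 nlmsg_len; __u16 nlmsg_type; __u16 nlmsg_flags;
     __u32 nlmsg_seq; __u32 nlmsg_pid; *)
type nlmsghdr = {
  len : int32;
  typ : int;
  flags : int;
  seq : int32;
  pid : int32;
}

let nlmsghdr_size = 16

(* Write each field at its offset, in native endianness. *)
let encode_nlmsghdr h =
  let b = Bytes.create nlmsghdr_size in
  Bytes.set_int32_ne b 0 h.len;
  Bytes.set_uint16_ne b 4 h.typ;
  Bytes.set_uint16_ne b 6 h.flags;
  Bytes.set_int32_ne b 8 h.seq;
  Bytes.set_int32_ne b 12 h.pid;
  b

(* Reject buffers that are too short instead of raising. *)
let decode_nlmsghdr b =
  if Bytes.length b < nlmsghdr_size then Error "short netlink header"
  else
    Ok
      { len = Bytes.get_int32_ne b 0;
        typ = Bytes.get_uint16_ne b 4;
        flags = Bytes.get_uint16_ne b 6;
        seq = Bytes.get_int32_ne b 8;
        pid = Bytes.get_int32_ne b 12 }
```

The bounds check up front is the main thing C would not have given me for free; everything after it is plain field access.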