Editor's note; this was a project that I started several years ago and left unfinished, so there is a large time skip in the middle of the article, and the project has changed considerably. Some parts in the beginning may contradict parts later on. Sorry. If you just want to see the code, it's here.
Building a network device manager in OCaml
I'm building a suite of services for running virtual machines in my home network. I am turning my home PC, my laptop, and a few machines I plan to rent from cloud providers into a cohesive, low cost, uniform system for building, deploying and maintaining software. I will write more about it in the future. For this post, I just wanted to walk through the process of using OCaml to call C functions, with the purpose of writing local services and utilities to run on a Unix system.
This will be more of a "build log", where I've tried to include any dead ends, errors, misconceptions, or other problems I encountered. My intention is to show others how they can dig themselves out if any holes they fall into. I've included most of the error messages I've encountered to help anyone doing a web search for error strings.
I'm also using this as a learning opportunity to get more comfortable with OCaml, so some of the abstractions I make may seem unnecessary; they probably are. But I'm deliberately trying to bump up against the areas of OCaml and its ecosystem that I don't understand very well yet, in the hope that in the future, if I do need to go there, I'll be able to fall back on this experience.
I'll be building two tools;
tapalloc
- a privileged service that provisions network devices and an unprivileged client that retrieves file descriptors for those devices over a unix socket.tapautoconf
- a service that uses netlink(7) to snoop on DHCP/ICMPv6 traffic coming from a tap device to configure IPv4/IPv6 addresses on it so they can receive traffic, allowing ipvlan-based devices to configure themselves.
The goal of these tools is to enable me to create VMs and containers with their own "real" IP addresses, on-demand, in a way that I find palatable.
Allocating network devices
The tapallocd programs allocate different kinds of TAP devices on behalf of a client process, and pass file descriptor(s) for the device to the client over a unix socket. These file descriptors can then be used by a hypervisor like qemu or firecracker as a VM's network interface. They are intended as an alternative to helper programs such as qemu-bridge-helper, and do not require a setuid-root binary. Ultimately I plan to re-implement the following pattern for acquiring other kinds of system resources:
- Client A connects to unix socket for allocator service B
- Allocator service B checks some policy to decide if A is
allowed access to the resource. It can query the
SO_PEERCRED
option to learn the uid, gid, and pid of the calling process. - If the check is successful, B allocates the resource, opens the resource files, and passes the open file descriptor to A over the unix socket.
- B closes its copy of the file descriptor, so that A retains the sole reference to the resource. Once A closes the file descriptor, explicitly or by exiting, the resource is collected.
This gives me the flexibility to perform permissions checks
of arbitrary complexity, and to really restrict the actions an
unprivileged process is allowed to take. As a comparison to the
capabilities(7)
system in Linux, giving a process the CAP_NET_ADMIN
capability required
to provision and configure network interfaces grants permission to do
far more, like add or modify entries in the system's routing table.
On Linux, network devices are created and configured using the netlink(7)
interface. The Linux kernel supports an additional type of address family
for sockets, AF_NETLINK
. A process that wants to create a network
interface opens a socket of this type and sends and receives messages over
the file descriptor. I will be interfacing with the libmnl
companion
library, which will perform most of the nitty-gritty work of marshalling
and unmarshalling these messages.
Traditionally, when you wanted to call a C function from OCaml, you would
manually write a stub, a small
C library with intermediate functions that convert parameters from their
representation in the OCaml runtime to the required representation for the
system call (such as a char*
or a struct stat
and convert the return
value in the reverse direction. It is not unlike the process of extending
python with C.
Here
is an example of a stub I wrote to expose the vectored
IO system calls.
There is an OCaml package called ctypes that partially automates the generation of these "stub" files for you. Rather than writing the stub files directly, you write a smaller description of the C functions you would like to call, as an OCaml module, and the stubs are then generated from your description.
Project setup
I use GNU guix to provision my development environment. In the new repository I write the file guix.scm with a package definition, following this example:
(package
(name "tapalloc")
(version "dev")
(source (local-file "." #f #:recursive? #t))
(synopsis "") (description "") (home-page "")
(build-system dune-build-system)
(inputs
(list
ocaml-ctypes
ocaml-fmt
linux-libre-headers
libmnl
gcc-toolchain
pkg-config))
(license isc))
This describes a package built from sources in the current directory
using the dune build
system. The inputs
list describes the dependencies needed to build
this package. With such a file in the project's root directory, I can
run the command
$ guix shell -D
which drops me into a new shell with the listed input packages available. With OCaml, most new projects manage their dependencies and packaging with opam. I use guix instead of opam because guix it is agnostic to any one programming language, it is easy to extend, it is easy to modify, patch, or replace dependencies, and it is easier to use it to get reproducible builds; unless I really mess up, someone else should be able to follow the same process I did and produce the exact same binary. Later on, I want to use guix as part of an automated build and test system. That said, the use of guix is still optional, and the README file contains instructions for building the package without guix.
With my dependencies loaded, I create the directory structure I want:
- The
libmnl/
directory will contain source for theLibmnl
OCaml module, which is a thin wrapper around libmnl. - The
tapalloc/
directory will have the source for the executable programstapalloc-macvtap
andtapallocd-ipvtap
. - The
contrib/
directory will contain an example script to run these programs unders6-ipcserver
.
My usual process when starting a new project is to start with the smallest meaningful goal at first; a "hello, world" of sorts. This lets me validate the build process without getting bogged down trying to write the whole program from the start.
To start, then, I write a simple program, tapalloc/hello.ml
,
whose only purpose is to open a netlink socket using the
mnl_socket_open
function, and exit.
(* hello.ml *)
let () =
match Libmnl.(socket_open NETLINK_ROUTE) with
| exception _ -> Fmt.epr "failure%!"
| sock -> Fmt.epr "success%!"
I can then put build instructions in tapalloc/dune
:
(test
(name hello)
(modules hello)
(libraries libmnl fmt))
And start a test loop with the command
dune test -w
This test will re-run itself every time it detects a change in its dependencies. I keep this running in a visible window while I work on the project. In general, when I am working on a project in any language, I always have a visible window that is automatically updated with test or build results, so I can get constant feedback.
Unsurprisingly, the test fails with the error:
Entering directory '/src/dev.arroyo.cc/tapalloc'
File "tapalloc/dune", line 4, characters 12-18:
4 | (libraries libmnl fmt))
^^^^^^
Error: Library "libmnl" not found.
-> required by _build/default/tapalloc/hello.exe
-> required by alias tapalloc/runtest in tapalloc/dune:2
Had errors, waiting for filesystem changes...
As it says, the libmnl
library does not exist. We can write libmnl/dune
:
(library
(name libmnl))
Now the error changes to
Error: Unbound value socket_open
It's time to stop beating around the bush and define this library. With ctypes, there are 2 ways to call functions in a C library:
-
Loading the C library at runtime using
dlopen()
and jumping into it. The function must first be described with theforeign
function before it can be used. Assuming your description is accurate, the OCaml compiler will prevent you from calling the function with bad input. However, if your description is incorrect, you may have a run-time error, or worse. It is also not possible to use this method in a statically-linked binary. -
Using the Cstubs sub-package which will generate a
.c
file that you must compile and link into your binary, along with the external library. The benefit is that if your description is incorrect, the C compiler is likely to throw an error, because the generated.c
file will have an error. This method is also amenable to static linking. The drawback is increased complexity in your build process.
I prefer to statically link my binaries when feasible, and I want to catch
as many errors as I can at build time, so I am using the second option of
generating .c
files. Luckily, the dune
build tool makes this quite easy.
With the dune
3.0 and above, it even has a dedicated (ctypes ...)
stanza
that will build the stub files for you. It requires you to define two modules:
- A module named
Functions
describing functions in the external C library - A module named
Types
describing constants
I put these modules in a single file, libmnl/c_stub.ml
:
open Ctypes
module Functions (F : Ctypes.FOREIGN) = struct
open F
type mnl_socket
let mnl_socket : mnl_socket structure typ = structure "mnl_socket"
let mnl_socket_open = int @-> returning (ptr mnl_socket)
end
module Types (F : Ctypes.TYPE) = struct
open F
let netlink_route = constant "NETLINK_ROUTE" int
let netlink_generic = constant "NETLINK_GENERIC" int
end
The libmnl
library defines the mnl_socket
struct in its header like so:
struct mnl_socket;
That is, external users of the library do not have access to the fields of the struct; it is a so-called "opaque struct". Such a struct maps well to OCaml as an abstract type (a type with a hidden implementation):
type mnl_socket
To describe a C function for the ctypes module, we make use of the
@->
operator and some helper functions. "@->
" is meant to look
like the ->
syntax that is used in ordinary OCaml function signatures.
If you squint, the description
let mnl_socket_open = int @-> returning (ptr mnl_socket)
Looks almost like an OCaml function signature, such as
val mnl_socket_open : int -> mnl_socket ptr
It is slightly different, though; in the ctypes description, the words
int
, returning
, ptr
and mnl_socket
are values rather than
types, as is the case in the OCaml function signature. The line
let mnl_socket : mnl_socket structure typ = structure "mnl_socket"
Unites the value mnl_socket with the type mnl_socket.
The two modules have the names Functions
and Types
. The rules
implementing dune's ctypes
stanza are hard-coded to look for modules
with these names.
The Types
module is currently empty, but it won't be for long; I will
populate it with constants as needed.
In libmnl/dune
I add the (ctypes ...)
stanza:
(library
(name libmnl)
(ctypes
(external_library_name libmnl)
(headers (include "libmnl/libmnl.h" "linux/netlink.h"))
(type_description
(instance Const) ;; symbols accessed under C.Const
(functor c_stub)) ;; look for c_stub.Types
(function_description
(instance Libmnl) ;; symbols accessed under C.Libmnl
(functor c_stub)) ;; look for c_stub.Functions
(generated_entry_point C))) ;; nest modules under module name "C"
At this point I got a nasty surprise in my test loop:
********** NEW BUILD (libmnl/c_stub.ml changed) **********
File "libmnl/_unknown_", line 1, characters 0-0:
In file included from /gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes/ctypes_cstubs_internals.h:13,
from /src/dev.arroyo.cc/tapalloc/_build/default/libmnl/libmnl__c_cout_generated_types.c:4:
/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes/ctypes_primitives.h:16:10: fatal error: ocaml_integers.h: No such file or directory
16 | #include "ocaml_integers.h"
| ^~~~~~~~~~~~~~~~~~
compilation terminated.
Had errors, waiting for filesystem changes...
What is "ocaml_integers.h" and what did I ever do to it? Since the header name is fairly unique, I can search my system for it:
$ find /gnu/store -name 'ocaml_integers.h'
/gnu/store/g48m9g91754zssif06mn66gmjswq619i-ocaml-integers-0.7.0/lib/ocaml/site-lib/integers/ocaml_integers.h
So it is part of the ocaml-integers
package. It is present in my dev shell:
$ ocamlfind query integers
/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/integers
So what's wrong? Looking at the dune build log, at $project_root/_build/log
, here is the
command that dune attempts to run:
gcc \
-I/gnu/store/dj7kw3mqasw0rxdbm1gkajgsznhw8b4h-libmnl-1.0.4/include \
-I/gnu/store/sjkj3bi7nx8bd02i8z0d43b85xm76rj5-ocaml-4.14.0/lib/ocaml \
-I/gnu/store/63qdzf5h8s815z4gqydqfm5wsc87qs8n-profile/lib/ocaml/site-lib/ctypes \
-o /src/dev.arroyo.cc/tapalloc/_build/default/libmnl/libmnl__c_cout_generated_types.exe \
libmnl/libmnl__c_cout_generated_types.c
The problem is that the .../lib/ocaml/site-lib/integers
directory is not
added to gcc's header search path. This does not happen when I am using
opam instead of guix. After a bit of trial and error I identified the
problem; the ctypes
package, during its installation process, copies
this header file from the integers
package into its
own installation directory, and the guix package was overriding
that behavior. I sent a patch
to Guix that removes this override, and for the time being, did a local
override like so:
(use-module
(ice-9 match)
(guix transformations))
(define ocaml-ctypes-fixbuild
(package
(inherit ocaml-ctypes)
(arguments
(match (package-arguments ocaml-ctypes)
((#:make-flags flags . rest) rest)))))
((package-input-rewriting `((,ocaml-ctypes . ,ocaml-ctypes-fixbuild)))
(package
(name "tapalloc")
(version "dev")
(source (local-file "." #f #:recursive? #t))
(synopsis "Allocates ephemeral TAP devices on demand.")
(description "")
(home-page "")
(build-system dune-build-system)
;; gcc is needed for stdarg.h
(inputs (list ocaml-ctypes ocaml-fmt linux-libre-headers libmnl gcc-toolchain))
(native-inputs (list pkg-config))
(license expat)))
With that digression out of the way, we get a new error message:
File "libmnl/libmnl__c_generated_types.ml", line 25, characters 24-31:
25 | let enum (type a) name ?typedef ?unexpected (alist : (a * int64) list) =
^^^^^^^
Error (warning 27 [unused-var-strict]): unused variable typedef.
The dune manual warns us about these errors in its example:
; ctypes backward compatibility shims warn sometimes; suppress them
(flags (:standard -w -9-27))
I hope that in the future, dune can automatically make these warnings non-fatal, or that ctypes can stop triggering them. Because the current solution encourages developers to copy build directives they don't understand, and it is unlikely they will know when suppression is no longer necessary. With the warnings suppressed, we finally get a real error:
File "tapalloc/hello.ml", line 2, characters 16-27:
2 | match Libmnl.(socket_open NETLINK_ROUTE) with
^^^^^^^^^^^
Error: Unbound value socket_open
The build of the c stub files completes, and we're now back to calling functions that do not exist yet. Just to confirm things are working, I change the test to call the generated C function directly:
(* tapalloc/hello.ml *)
let () =
match Libmnl.C.Libmnl.mnl_socket_open 0 with
| _ -> Fmt.pr "success!"
The next error is
ld: libmnl/liblibmnl_stubs.a(libmnl__c_cout_generated_functions__c_stub__libmnl.o): in function `libmnl_stubs_1_mnl_socket_open':
_build/default/libmnl/libmnl__c_cout_generated_functions__c_stub__libmnl.c:6: undefined reference to `mnl_socket_open'
collect2: error: ld returned 1 exit status
File "caml_startup", line 1:
Error: Error during linking (exit code 1)
The error complains about an undefined reference to mnl_socket_open
, a
C function in libmnl
. This shouldn't happen, as the library (.cmxa
)
file generated for the libmnl
module should contain the necessary
directives for the C linker:
$ ocamlobjinfo _build/default/libmnl/libmnl.cmxa | head
File _build/default/libmnl/libmnl.cmxa
Extra C object files: -lmnl -llibmnl_stubs -L/gnu/store/dj7kw3mqasw0rxdbm1gkajgsznhw8b4h-libmnl-1.0.4/lib -lmnl
And the libmnl.so
library itself does indeed have a public symbol
mnl_socket_open
:
$ readelf -sW $(guix build libmnl)/lib/libmnl.so| grep mnl_socket_open
33: 0000000000002340 93 FUNC GLOBAL DEFAULT 13 mnl_socket_open@@LIBMNL_1.0
73: 00000000000023a0 85 FUNC GLOBAL DEFAULT 13 mnl_socket_open2@@LIBMNL_1.2
So what gives?
I find the best thing to do in times like these is to find an example
build that does work, and iteratively change that example, one piece
at a time, until it looks like the build that does not work. At some
point, the build will break, and that will bring you closer to the root
cause. So, starting with the example in dune's documentation, I changed
from building a libmnl
library, to building a libmnl
executable
;; libmnl/dune
(executable
(name libmnl)
(libraries fmt)
(flags (:standard -w -9-27 -verbose))
(ctypes
(external_library_name libmnl)
(headers (include "libmnl/libmnl.h"))
(type_description
(instance Const) ;; symbols accessed under C.Const
(functor c_stub)) ;; look for c_stub.Types
(function_description
(instance Libmnl) ;; symbols accessed under C.Libmnl
(functor c_stub)) ;; look for c_stub.Functions
(generated_entry_point C))) ;; nest modules under module name "C"
and embedding tapalloc/hello.ml
into this executable:
(* libmnl/libmnl.ml *)
open C
type bus =
| NETLINK_ROUTE
| NETLINK_GENERIC
let bus_to_int = function
| NETLINK_ROUTE -> Const.netlink_route
| NETLINK_GENERIC -> Const.netlink_generic
let socket_open bus = Libmnl.mnl_socket_open (bus_to_int bus)
let () =
match socket_open NETLINK_ROUTE with
| _ -> Fmt.pr "opened a netlink socket!\n"
This works!
$ dune exec libmnl/libmnl.exe
opened a netlink socket!
So it can build the executable when the c stubs are defined in the
same stanza in dune, but triggers a linker error when the executable
and library are separate. What's going on? Adding (flags (:standard -verbose))
to tapalloc/dune
tells the OCaml compiler to print the
external commands it runs to link the program. Here's the command,
with a bit of formatting added and unimportant flags omitted:
1 gcc -o 'tapalloc/hello.exe' \
2 '-L$GUIX_ENVIRONMENT/lib/ocaml' \
3 '-L$GUIX_ENVIRONMENT/lib/ocaml/site-lib/integers' \
4 '-L$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes' \
5 '-Llibmnl' \
6 '/tmp/build_1e9f8a_dune/camlstartup085117.o' \
7 '$GUIX_ENVIRONMENT/lib/ocaml/std_exit.o' \
8 'tapalloc/.hello.eobjs/native/dune__exe__Hello.o' \
9 'libmnl/libmnl.a' \
10 '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes/cstubs.a' \
11 '$GUIX_ENVIRONMENT/lib/ocaml/str.a' \
12 '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/ctypes/ctypes.a' \
13 '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/integers/integers.a' \
14 '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/bigarray-compat/bigarray_compat.a' \
15 '$GUIX_ENVIRONMENT/lib/ocaml/site-lib/fmt/fmt.a' \
16 '$GUIX_ENVIRONMENT/lib/ocaml/stdlib.a' \
17 '-llibmnl_stubs' \
18 '-L$GUIX_ENVIRONMENT/lib' \
19 '-lmnl' \
20 '-Wl,--no-as-needed' \
21 '-lcamlstr' \
22 '-lctypes_stubs' \
23 '-lintegers_stubs' \
24 '-Wl,--no-as-needed' \
25 '-lintegers_stubs' \
26 '$GUIX_ENVIRONMENT/lib/ocaml/libasmrun.a' \
27 -lm \
28 -ldl
Do you see the problem? If you don't immediately see the problem,
don't feel bad; it was only after I had dug into this problem
for the better part of a day, almost given up, and taken a break
(and a drink) before I realized what was wrong. When I build the
libmnl
module as a library instead of an executable, dune builds the
archive _build/default/libmnl/libmnl.a
. Then, when it comes time to
build hello.exe
, the archive's parent directory is added to gcc's
library search path with the -Llibmnl
directive (line 5), before
the directive that adds the parent directory of the "real" library,
-L$GUIX_ENVIRONMENT/lib/libmnl.so
(line 18). As a result, the -lmnl
directive tells gcc to link _build/default/libmnl/libmnl.a
, not
$GUIX_ENVIRONMENT/lib/libmnl.so
. The local library, which shadows
but depends on the external one, obviously doesn't have a symbol for
mnl_socket_open
.
Changing the library name from libmnl
to
mnl
does the trick. We've reached the first milestone, and our major build
troubles should be behind us.
The programs in the tapalloc
project are pretty simple; the client reads a
file descriptor from a unix socket. The servers provision a single interface,
and send an open file descriptor referencing it over said unix socket. So, in
addition to the libmnl
library, I will need functions to read and write
file descriptors over a unix socket.
Error handling
In C, like in OCaml, functions can only return a single value. This
raises the question of how to return an error from a C function. In OCaml,
you can raise an exception, or use a sum type like result
:
# type ('ok, 'err) result =
Ok of 'ok
| Error of 'err
# let safe_incr num =
if x = Int.max_int then
Error "overflow"
else
Ok (x + 1)
;;
val safe_incr : int -> (int, string) result
While you could use a C union
to mimic a result
type, it would
require unsafe practices or the declaration of a union type for each
unique return type, and may impose additional memory management on the
caller of the function. Instead, C functions typically return a single
value like an integer or a pointer, and use a return value outside of
the function's range to signify an error. The caller can then check the
(sorta) global variable errno to learn more about the nature of the
error. For instance here is the signature of the read(2) function:
ssize_t read(int fd, void *buf, size_t count);
If an I/O error occurs, read
should return -1 and set errno to
the value EIO
. -1 is outside of the range of read(2) because it is
impossible to read a negative number of bytes. Functions that return a
pointer typically use NULL
to signify an error. Functions that return
a floating point number may return HUGE_VAL
or NaN
(see the glibc
manual).
By default, the code generated by ctypes does not
check for errors, and just returns the return value of
the C function call. You can specify an errno_policy
of return_errno
to signal to the Cstubs code generator that functions
should sample the value of errno
after calling a C function and return
it. So, to add error handling for the mnl_socket_open
function, I can
change its wrapper to this:
(* mnl/mnl.ml *)
let socket_open bus =
match Libmnl.mnl_socket_open bus with
| (p, errno) when Ctypes.is_null p -> Error errno
| (sock, _) -> Ok sock
To make the code a little nicer, I can change the signature of mnl_socket_open
to:
- foreign "mnl_socket_open" (int @-> returning (ptr mnl_socket))
+ foreign "mnl_socket_open" (int @-> returning (ptr_opt mnl_socket))
This way, NULL
is returned as None
, and a non-NULL becomes Some p
. Then
my wrapper becomes:
let socket_open bus =
match Libmnl.mnl_socket_open bus with
| (None, errno) -> Error errno
| (Some sock, _) -> Ok sock
There is another roadblock for us; ctypes provides the errno_policy
parameter for its code generator, but dune
does not expose it in its
(ctypes ...)
stanza. I wrote a
patch
and added it to my local dune
by modifying guix.scm
like
so.
Now I can augment the dune
file like so:
(function_description
+ (errno_policy return_errno)
(instance Libmnl)
(functor c_stub))
Unfortunately, I'm met with a new error 🙃
ld: mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o: warning: relocation \
against `Caml_state' in read-only section `.text'
ld: mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o: relocation R_X86_64_PC32 \
against undefined symbol `Caml_state' can not be used when making a shared object; \
recompile with -fPIC
ld: final link failed: bad value
collect2: error: ld returned 1 exit status
Here is a partial diff of the .c file that builds the .o file referenced in
the error, with ignore_errno
vs return_errno
:
value libmnl_stubs_1_mnl_socket_open(value x1)
{
int x2 = Long_val(x1);
+ errno = 0;
struct mnl_socket* x5 = mnl_socket_open(x2);
- return CTYPES_FROM_PTR(x5);
+ value x6 = CTYPES_FROM_PTR(x5);
+ return ctypes_pair_with_errno(x6);
}
With the return_errno
policy, the C function
now has to call ctypes_pair_with_errno
to build
a (value, errno) tuple. This function, in turn, calls
caml_alloc_tuple
,
which calls
caml_alloc
,
which calls Alloc_small
, which calls
Alloc_small_with_profinfo
which references the Caml_state
symbol, which appears to be a
block of memory
holding all of the state for the current runtime domain. I can see
that the object file contains the following relocation records:
$ objdump -r _build/default/mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000015 R_X86_64_PC32 Caml_state-0x0000000000000004
0000000000000027 R_X86_64_PC32 Caml_state-0x0000000000000004
0000000000000039 R_X86_64_PC32 Caml_state-0x0000000000000004
0000000000000078 R_X86_64_PC32 Caml_state-0x0000000000000004
000000000000008d R_X86_64_PC32 Caml_state-0x0000000000000004
and these are not present when using the ignore_errno
policy.
The error message suggests recompiling with the -fPIC
option.
I can add it to the dune files:
(library
(name mnl)
(ocamlopt_flags (:standard -fPIC)))
However, this doesn't seem to make any difference, I still get the same
error. To investigate further, I can re-run the build with strace to snoop
on the ld
command that gets run:
$ strace --trace=execve \
--string-limit=1000 \
--follow-forks \
--successful-only \
dune build
First, I can see that the specific gcc command that fails is this one:
gcc -shared -g -o mnl/dllmnl_stubs.so mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
And if I look further in the strace output, I can confirm that, indeed,
mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
was not compiled
with -fPIC
. If I rebuild this object file by hand with -fPIC
, the failing
command now works:
$ gcc -I$GUIX_ENVIRONMENT/{include,lib/ocaml,lib/ocaml/site-lib/bigarray-compat,lib/ocaml/site-lib/ctypes,lib/ocaml/site-lib/integers,ocaml/site-lib/stdlib-shims} -fPIC -g -o mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o -c mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.c
$ gcc -shared -g -o mnl/dllmnl_stubs.so mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
$ file mnl/dllmnl_stubs.so
mnl/dllmnl_stubs.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped
And you can see the relocation type in the object file changed from
R_X86_64_PC32
to R_X86_64_REX_GOTPCRELX
:
$ objdump -r mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000015 R_X86_64_REX_GOTPCRELX Caml_state-0x0000000000000004
000000000000002a R_X86_64_REX_GOTPCRELX Caml_state-0x0000000000000004
000000000000003f R_X86_64_REX_GOTPCRELX Caml_state-0x0000000000000004
...
So why was this only a problem with the return_errno
policy? Honestly,
I am not sure; it seems like dumb luck, because with the ignore_errno
policy, the -fPIC
flag is not used, but there are no relocations for which
gcc decides to use the R_X86_64_PC32
relocation type:
$ objdump -r mnl/libmnl__c_cout_generated_functions__c_stub__libmnl.o
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
000000000000001c R_X86_64_PLT32 mnl_socket_open-0x0000000000000004
000000000000002c R_X86_64_PLT32 caml_copy_nativeint-0x0000000000000004
I came up with a simple reproduction and submitted a bug report for dune. In the meantime, I can manually specify the flags for building the object files:
;; mnl/dune
(library
(name mnl)
(libraries integers)
(flags (:standard -w -9-27))
(ctypes
(external_library_name libmnl)
(headers (include "libmnl/libmnl.h"))
;; https://github.com/ocaml/dune/issues/5809
(build_flags_resolver
(vendored
(c_flags (:standard -fPIC -I%{env:GUIX_ENVIRONMENT=/usr}/include))
(c_library_flags (:standard -lmnl -L%{env:GUIX_ENVIRONMENT=/usr}/lib))))
I don't like having to put guix-related environment variables in the dune
file, as I want to keep the build files agnostic of the package manager.
But it will have to do for now. With this, dune build
succeeds! I've tried
to submit fixes to the speed bumps I encountered, so eventually no one else
will have to hit them. As of 3 Jun 2022:
- Guix patch 55715, fixing the missing
ocaml_integers.h
error has been merged upstream. - A fix for the
-fPIC
issue #5809 by @mbacarella has been merged, and is available in the 3.3.0 release. - My PR #5827 adding
errno_policy
support to dune's(ctypes)
stanza, has been merged and will be available in the 3.4.0 release.
With build stuff out of the way, let's write the damn program ☺.
libmnl
Exposing a subset of Since I don't plan on making my libmnl
bindings a standalone library, but
rather an implementation detail of the tapalloc
programs, I will only expose
those functions which I need to implement my services.
I will also expose a few identifiers from
netlink.h
from the Mnl
module, even though they are not technically part of libmnl
.
For that, I need to add "linux/netlink.h" and "linux/if_link.h" to the
includes
set in mnl/dune
.
First, it's nice to define some helper functions to reduce some of the repetition for error handling:
(* mnl/mnl.ml *)
open struct
type errno = Signed.SInt.t
let string_of_errno n
let i = Signed.SInt.to_int n in
Unix.(error_message (EUNKNOWNERR i))
let check_neg1 : (int * errno) -> ('a, string) result =
function
| -1, e -> Error (string_of_errno e)
| v, _ -> Ok v
let check_none : ('a option * errno) -> ('a, string) result =
function
| None, e -> Error (string_of_errno e)
| v, _ -> Ok v
end
The open struct ... end
idiom is a scoping trick; the symbols defined
within the anonymous block are available in the rest of the file, as if
they were defined at the top level, but will not be present in the public
interface of the module, and so cannot be used from other modules, and
will not show up in automatically-generated documentation. It is
a way for me to hide helper functions without maintaining an interface
file. With these
helpers in place, the socket_open
and socket_close
functions now become:
let socket_open bus = Libmnl.mnl_socket_open (bus_to_int bus) |> check_none
let socket_close sock = Libmnl.mnl_socket_close sock |> check_neg1
Now, regarding the creation of interfaces, the netlink API is sufficiently complex enough that, instead of reading it from top to bottom, I want to scope my search with some examples. Recently, the ever-useful strace has learned how to parse netlink messages, so I can simply run
doas strace ip link add ipv0 link wlan0 type ipvtap
And get back a readable sequence of messages that I can mimic in my program.
The ip
tool sends the following netlink message to create the link, given the
command above:
sendmsg(3,
{
msg_name={
sa_family=AF_NETLINK,
nl_pid=0,
nl_groups=00000000
},
msg_namelen=12,
msg_iov=[{
iov_base=[
{
nlmsg_len=68,
nlmsg_type=RTM_NEWLINK,
nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK|NLM_F_EXCL|NLM_F_CREATE,
nlmsg_seq=1654360314,
nlmsg_pid=0
},
{
ifi_family=AF_UNSPEC,
ifi_type=ARPHRD_NETROM,
ifi_index=0,
ifi_flags=0,
ifi_change=0
},
[
[{nla_len=8, nla_type=IFLA_LINK}, 4],
[{nla_len=9, nla_type=IFLA_IFNAME}, "ipv0"],
[{nla_len=16, nla_type=IFLA_LINKINFO}, [
{nla_len=10, nla_type=IFLA_INFO_KIND}, "ipvtap"...]]
]
],
iov_len=68
}],
msg_iovlen=1,
msg_controllen=0,
msg_flags=0
},
0)
The message consists of a struct nlmsghdr
, indicating an RTM_NEWLINK
request, followed by a struct ifinfomsg
which is blank (both AF_UNSPEC
and ARPHRD_NETROM are defined as 0), followed by 3 struct nlattr
items
describing the type of interface. The netlink(7) protocol, like other
network protocols, is layered, and message payloads can carry messages in
another sub-protocol. In this case, the sub-protocol is rtnetlink(7). We need
to surface some of the structs, enums, and constants found in the
<linux/rtnetlink.h>
header, after which we can compose a message with
the right combination of flags and attributes seen in the trace.
Laying out messages in memory
Netlink is a datagram-based protocol. A message is composed of a standard header:
The standard header is then followed by a payload, which can have varying sizes and fields. Several of these messages can be laid out next to each other in a single datagram. You could model a protocol like this in OCaml like so:
type 'a nlmsg = {
nlmsg_len : int32;
nlmsg_flag : int;
nlmsg_type : int;
nlmsg_seq : int32;
nlmsg_pid : int32;
payload : 'a;
}
However, while superficially it looks like a netlink message, the
memory layout is completely different; the int32
fields are actually
pointers to blocks containing an integer value, just as payload
will be for any non-int
payload. As I was browsing the iproute2 source code,
I came across this pattern which was used a lot:
struct {
struct nlmsghdr n;
struct ifinfomsg i;
char buf[1024];
} req = {
.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
.n.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
.n.nlmsg_type = RTM_NEWLINK,
.i.ifi_family = AF_UNSPEC,
};
While here it's mostly a syntactical construct, what struck me was the textual separation of the layout information and the actual message data. It made me think, "can I separate the layout information from the actual data?" As it turns out, you can, using Generalized algebraic data types (GADT).
While netlink is a datagram-oriented protocol, a single datagram can contain multiple messages of varying types. While netlink sockets, like other sockets, support the vectored sendmsg(2) and recvmsg(2) system calls that would allow me to spread parts of a message over multiple memory buffers, the libmnl library lays out messages in a single buffer.
The nlmsghdr
structure in the netlink(7) API is, as the hdr
suffix implies,
a header. That is, it shows up at the beginning of some piece of memory
that is potentially longer than just the header itself, and the header implies
some layout on the memory that follows it. The mnl library provides a
function, mnl_put_extra_header
that, in C, takes a size_t
and returns a
void pointer which you cast into the appropriate structure type.
The rtnl-link-set.c
shows how the mnl_*_put functions place sections of
a message adjacent to each other by allocating from a shared buffer.
4 char buf[MNL_SOCKET_BUFFER_SIZE];
...
26 nlh = mnl_nlmsg_put_header(buf);
...
30 ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
...
35 mnl_attr_put_str(nlh, IFLA_IFNAME, argv[1]);
I adapted this function to take a GADT, which determined the return type, instead:
type _ extra_header =
| Ifinfomsg : Ifinfomsg.t ptr extra_header
| Ifaddrmsg : Ifaddrmsg.t ptr extra_header
val mnl_nlmsg_put_extra_header : t. Nlmsghdr.t ptr -> t extra_header -> t
I mirrored this in my adaptation of the example:
...
let nlh = mnl_nlmsg_put_header buf in
let ifi = mnl_nlmsg_put_header nlh Ifinfomsg in
mnl_attr_put_str nlh IFLA_IFNAME Sys.argv.(1);
...
The mnl library takes care of updating the nlmsg_len
field when you add
an extra header.
Unexpected (non-)constants
While building this example, I hit another error:
% dune test
File "mnl/_unknown_", line 1, characters 0-0:
libmnl__c_cout_generated_types.c: In function ‘main’:
libmnl__c_cout_generated_types.c:84:13: error: enumerator value for ‘check_MNL_SOCKET_BUFFER_SIZE_const’ is not an integer constant
84 | enum { check_MNL_SOCKET_BUFFER_SIZE_const = (int)MNL_SOCKET_BUFFER_SIZE };
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(exit status 1)
Let's see what that constant is:
$ grep -h -r MNL_SOCKET_BUFFER_SIZE $(guix build libmnl)
#define MNL_SOCKET_BUFFER_SIZE (sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L)
Reading
_build/default/mnl/libmnl__c_cout_generated_types.c
the code generates an expression for each constant, like this:
enum { check_RTM_SETLINK_const = (int)RTM_SETLINK };
but the expression
sysconf(_SC_PAGESIZE) < 8192L ? sysconf(_SC_PAGESIZE) : 8192L
is not a constant; sysconf(3) is a function. I tried many ways to fix this, but ended up re-implementing the expression in OCaml, and caching its result at program startup:
let mnl_socket_buffer_size =
match F.sysconf T.sc_pagesize with
| n, _ when n > 8192 -> n
| _ -> 8192
Refactoring
At this point I have learned enough and used the interface I've defined enough that I want to stop forward progress, go back and refactor things before they get out of hand. Here's a summary of the changes, in broad strokes:
-
I changed the
Mnl
module name toNetlink
, and put the library name prefixes back in the function names. ThusMnl.(socket_open NETLINK_ROUTE)
became
Netlink.(mnl_socket_open NETLINK_ROUTE)
or rather, the intended:
open Netlink mnl_socket_open NETLINK_ROUTE
I did this because the module ended up including identifiers from multiple header files, not just libmnl.h, and since, in C-fashion, those identifiers' names had prefixes hinting where they came from, the module could be opened without the reader getting confused about where some name came from.
-
I made the API more OCaml-friendly; using variants instead of integers for enums, using
int
for all integers where 63 bits, and so on. Sometimes this mean re-implementing some of the things the mnl library was doing.
I also added another module, Sockfd
, exposing the ability to
pass file descriptors over a unix socket. Unlike the libmnl bindings,
I wrote this one as a conventional C binding, with a hand-written
C stub.
Creating an interface
Lets try to create an interface. One of the first things I noticed on the libmnl docs was the text
You can access several example files under examples/ in the libmnl source code tree.
To the examples! I can download the libmnl source with guix like so:
$ guix build -S libmnl
/gnu/store/ihwm7imzz5cpwwpxlvnq4a59f2vs29xy-libmnl-1.0.5.tar.bz2
$ tar xf $(guix build -S libmnl)
$ cd libmnl-1.0.5
The file example/rtnl/rtnl-link-set.c contains an example of setting a link up or down. It shouldn't be too much of a stretch to replicate this example, and then iteratively change it until it does what I actually want.
I need to craft an RTM_NEWLINK message, described in rtnetlink(7). It should have the following attributes:
IFLA_LINK=(index of parent interface)
IFLA_IFNAME="tap0"
IFLA_LINKINFO=[
IFLA_INFO_KIND="ipvtap"
IFLA_INFO_DATA=[
IFLA_IPVLAN_MODE=IPVLAN_MODE_L2
IFLA_IPVLAN_FLAGS=0
]
]
After a bunch of hacking, trial-and-error, I got to the point where I had
a test setup using the dummy
device driver like this:
An interesting property is that, although the dummy
device driver
is designed to drop all packets, the design of the ipvlan
and
macvlan
drivers is such that communication between child interfaces
is possible even when the parent interface is a dummy device! This has
interesting potential for local-only networks, potentially with enormous
(64k) MTUs, which I could explore in the future, you know, for !!fun!!
The test was pretty fun to write, as it involved lots of forking and exec-ing back into itself in order to set up the ephemeral namespaces using unshare(1). It requires user namespaces to be supported on the system. I think in the future I will setup a User-mode Linux integration test so it can run against a specific kernel without requiring the host machine to support user namespaces (some admins disable this feature out of caution).
I wrote a bunch of pretty-printing functions so I could see the messages I was creating. One funny error I ran into looked like this:
->{nlmsg_len: 44 nlmsg_type: RTM_GETLINK ...} IFLA_IFNAME=dummy0
sent RTM_GETLINK message for dummy0
<-{nlmsg_len: 1452 nlmsg_type: RTM_NEWLINK ...}
{ifi_family: AF_UNSPEC ifi_type: ARPHRD_ETHER ifi_flags: 0x100c3 ifi_change: 0x0000 ifi_index: 2}
IFLA_IFNAME=dummy0 ... IFLA_INFO_KIND=dummy
<- {nlmsg_len: 36 nlmsg_type: NLMSG_ERROR ...}{error: 0 msg: {nlmsg_len: 44 nlmsg_type: RTM_GETLINK ...}
parent device "dummy0" has index 44
->{nlmsg_len: 92 nlmsg_type: RTM_NEWLINK nlmsg_flags: NLM_F_REQUEST|NLM_F_CREATE|NLM_F_EXCL...}
IFLA_LINK=44 IFLA_IFNAME=ipvtap19962 IFLA_INFO_KIND=ipvtap
<-{nlmsg_len: 36 nlmsg_type: NLMSG_ERROR ...}{error: -19 (No such device) ...}
req#1163551169 failed: No such device
This happened because I re-used the buffer that held the response for my
RTM_GETLINK request to receive the acknowledgement (an NLMSG_ERROR message
with error=0). The offset of the ifi_index
field in the ifinfomsg
struct just happens to line up with the msg.nlmsg_len
field in the
nlmsgerror
struct. It was a silly error, and easily fixed by making a
copy. Eventually I will change the program so that it doesn't read into
sections of a buffer that are still in use.
After a bit more trial and error, my test was successfully creating an ipvtap device! 🎉
$ nsenter --target $(pgrep -f tapallocd_ipvtap) --net ip link
1: lo: mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 1a:4b:25:01:45:dd brd ff:ff:ff:ff:ff:ff
3: ipvtap21559@dummy0: mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 500
link/ether 1a:4b:25:01:45:dd brd ff:ff:ff:ff:ff:ff
But we're not done yet. We have to find the character device and yield file descriptors for it...
Getting the file descriptors
Typically, when you create a macvtap or ipvtap device, udev will
create files under /sys/class/net/$name
and a character device at
/dev/tap$index
. Depending on your configuration, it will probably be
owned by root. I installed the following udev rule:
SUBSYSTEMS=="ipvtap", OWNER="tapallocd", MODE="0600"
I could make the rule a little more targeted, perhaps matching some
prefix of the interface name, or the parent device. This cooperation
from udev also makes the testing story a little more complicated, as
an automated test will now depend on a device manager. There is also
now a race between the device node being created by udev and tapallocd
attempting to open it; tapallocd will have to wait for some notification
that /dev/tap* has been created. Another challenge is that the device
manager needs to be running in the same network namespace as tapallocd
;
you cannot rely on the default device manager.
I want this daemon to "just work" and be able to operate mostly on its own, in its own network namespace. So my approach is to listen to a NETLINK_KOBJECT_UEVENT socket for the device creation event to learn its major+minor device numbers and call mknod(2) within tapallocd to create it. This would also allow us to create the character device file somewhere else, so we can unlink the device without having write permissions to /dev.
This sounds like a lot of work, adding an entirely new protocol to the program. I had trouble finding documentation about the protocol format, so I decided to go ahead and just see what it looked like myself. I made a simple loop that ran in a background thread during the test and just dumped everything it read from a NETLINK_KOBJECT_UEVENT socket to stderr:
let hexdump : int -> buffer Fmt.t =
fun len dst buf ->
(Fmt.hex ()) dst (len, CArray.get buf)
let dump_nl =
let nl = mnl_socket_open NETLINK_KOBJECT_UEVENT in
let* _ = mnl_socket_bind nl NETLINK_KOBJECT_UEVENT 0 1 in
let buf = create_buffer 32768 in
let rec loop () =
match mnl_socket_recvfrom nl buf with
| Error e -> Error (Unix.error_message e)
| Ok len ->
Fmt.epr "got kobject msg@."
Fmt.epr "%a@." (hexdump len) buf;
loop ()
in
loop ()
Running this, the add events were printed in my test's output:
got kobject msg
00: 6164 6440 2f64 6576 6963 6573 2f76 6972 add@/devices/vir
10: 7475 616c 2f6e 6574 2f69 7076 7461 7035 tual/net/ipvtap5
20: 3735 332f 7175 6575 6573 2f74 782d 3000 753/queues/tx-0.
30: 4143 5449 4f4e 3d61 6464 0044 4556 5041 ACTION=add.DEVPA
40: 5448 3d2f 6465 7669 6365 732f 7669 7274 TH=/devices/virt
50: 7561 6c2f 6e65 742f 6970 7674 6170 3537 ual/net/ipvtap57
60: 3533 2f71 7565 7565 732f 7478 2d30 0053 53/queues/tx-0.S
70: 5542 5359 5354 454d 3d71 7565 7565 7300 UBSYSTEM=queues.
80: 5345 514e 554d 3d32 3638 3332 00 SEQNUM=26832.
got kobject msg
00: 6164 6440 2f64 6576 6963 6573 2f76 6972 add@/devices/vir
10: 7475 616c 2f6e 6574 2f69 7076 7461 7035 tual/net/ipvtap5
20: 3735 332f 6970 7674 6170 2f74 6170 3300 753/ipvtap/tap3.
30: 4143 5449 4f4e 3d61 6464 0044 4556 5041 ACTION=add.DEVPA
40: 5448 3d2f 6465 7669 6365 732f 7669 7274 TH=/devices/virt
50: 7561 6c2f 6e65 742f 6970 7674 6170 3537 ual/net/ipvtap57
60: 3533 2f69 7076 7461 702f 7461 7033 0053 53/ipvtap/tap3.S
70: 5542 5359 5354 454d 3d69 7076 7461 7000 UBSYSTEM=ipvtap.
80: 4d41 4a4f 523d 3233 3400 4d49 4e4f 523d MAJOR=234.MINOR=
90: 3100 4445 564e 414d 453d 7461 7033 0053 1.DEVNAME=tap3.S
a0: 4551 4e55 4d3d 3236 3833 3300 EQNUM=26833.
It's a simple text-based protocol, with one event per datagram, and
fields delimited by a NUL byte. You can see that the second event,
which is adding the tap3
character device, includes the MAJOR (234)
and MINOR(1) variables, and the device name. This should be all I need to
call mknod(2).
To create the device I need the CAP_MKNOD capability. I can give it to the
process using setpriv(1)
$ exec doas setpriv \
--ambient-caps +mknod \
--inh-caps +mknod \
--securebits +keep_caps_locked,+no_setuid_fixup_locked \
--reuid $(id -u) --regid $(id -g) --init-groups \
env -i $(env) guix shell -D
$ setpriv --dump | grep capabilities
Inheritable capabilities: mknod
Ambient capabilities: mknod
$ dune test -w --no-buffer
Unfortunately, this doesn't work:
mknod /tmp/ipvtap5023: Operation not permitted
I suspect that the test loses the CAP_MKNOD capability when it uses unshare(1) to transition from the default namespace into its own user namespace, since if I use mknod(1) to manually create the character device outside of the namespace, it is successful.
An unforseen leak
While I was pondering what to do, I noticed another problem. One of the
core assumptions of this whole system is that I would be able to unlink
the device node from the file system, and then use open file descriptors
as references that would prevent the device from being deleted until
they were closed. With the approach I took of manually creating the
device nodes, I am able to create the nodes in directories where I have
write access, and unlink them without requiring write access to the
/dev
directory.
Imagine my surprise when I found a /dev/tap3
file during my test! At
first I suspected udev was doing this somehow, but that seemed impossible;
udevd
listens for device events on a NETLINK_KOBJECT_UEVENT socket,
and those should not receive events from other network namespaces.
I wrote a quick tool to simply dump everything it could read from a netlink socket to stdout, similar to the testing in previous sections. As I expected, I did not see events for the ipvtap devices unless I ran the tool within the same network namespace. The device permissions were also wrong:
$ ls -l /dev/tap3
crw------- 1 root root 234, 1 Jan 5 16:15 /dev/tap3
My udev rules would have changed the node permissions to 0660 and owner to
tapalloc
. Using strace
on udevd
more or less proved its innoncence;
it wasn't calling mknod to create the file. After a bit more research
I realized it was devtmpfs
; back in the 2000s there was a stretch of
time where devfs
was removed from the kernel, and udev was responsible
for populating dev. I guess that state of affairs was too difficult,
because the concept was re-introduced in 2010 in the form of devtmpfs
;
the kernel will post device nodes there for any driver that asks for it,
and its mounted by default at /dev
on my system.
That throws a wrench in my plan. I'm not ready to configure my system
to run without devtmpfs
. I can think of a few options:
- Run a device manager in the same namespace that removes the nodes from devtmpfs.
- Give tapallocd the privilege to remove these nodes itself.
- Patch the kernel to introduce a device analagous to /dev/net/tun, but for ipvtap devices. It would have to be a per-parent device, and new ioctls would be needed to configure the ipvlan-specific flags.
- Find some other way to track when the interface is no longer in use.
None of these options are great, but the least bad option, I think, is to rely on a device manager. So setup instructions will include steps to add udev/mdev/mdevd rules that fix the permissions of tap devices as they come online.
Because I want to keep my tests self contained, I will move the mknod(2)
calls out of tapallocd and into the test, where it will run a child
process that emulates a device manager, including the re-broadcasting
behavior that both udev and mdevd support. I use this behavior in
tapallocd
as a synchronization tool, to wait until the device node
is ready. It runs as a child process of the test, forked before a new
user+network namespace is created, and looks like this:
let prefix = Bytes.of_string "add@/devices/virtual/net/" in
let rec loop () =
let n = Unix.recv uevent buf 0 len [] in
Fmt.(epr "got %a@." (hex ()) (n, Bytes.get buf));
if not (Bytes.starts_with ~prefix buf) then loop ()
else process (Bytes.sub_string buf 0 n)
and process s =
let props =
String.split_on_char '\000' s |> List.filter_map parse_kv
in
let name = List.assoc "DEVNAME" props in
let major = List.assoc "MAJOR" props |> int_of_string in
let minor = List.assoc "MINOR" props |> int_of_string in
let dev = makedev major minor in
let* () = mknodat devdir name 0o660 S_IFCHR dev in
Fmt.epr "udev: created device node %s@." name;
rebroadcast s
and rebroadcast s =
let _ = Unix.send_substring donefd s 0 (String.length s) [] in
loop ()
in
loop ()
This didn't quite work. It turns out the libmnl
library doesn't really allow you to send a multicast
message to a netlink socket, because the mnl_socket_sendto
routine does not pass an nl_groups
field in the sockaddr_nl
structure.
I had to use sendmsg
directly and construct my own:
and rebroadcast n =
(* mnl_socket_sendto blanks out the groups field, so we
have to bypass it with our own msghdr *)
let open Ctypes in
let sa = make Sockaddr_nl.t in
Sockaddr_nl.(setf sa nl_family (int_of_family AF_NETLINK));
Sockaddr_nl.(setf sa nl_pid 0);
Sockaddr_nl.(setf sa nl_groups nl_group_id);
let iov = make Iovec.t in
Iovec.(setf iov iov_base (CArray.start buf |> to_voidp));
Iovec.(setf iov iov_len n);
let msg = make Msghdr.t in
Msghdr.(setf msg msg_iov (addr iov));
Msghdr.(setf msg msg_iovlen 1);
Msghdr.(setf msg msg_name ((addr sa) |> to_voidp));
Msghdr.(setf msg msg_namelen (sizeof Sockaddr_nl.t));
Msghdr.(setf msg msg_controllen 0);
Msghdr.(setf msg msg_flags 0);
sendmsg fdw (addr msg)
As an aside, since I re-wrote the tests from scratch many times throughout the course of this project. I eventually forgot why I did this and spent awhile re-discovering the same solution. The lesson here is that I should make more frequent check-ins to source control with commit messages and comments describing the issues I encountered.
Another issue I ran into was with sending file descriptors over a
Unix socket; I did this in a C stub because of how arcane the API was,
relying on lots of pointer casting and macros. The C stubs implement
two OCaml functions in the Sockfd
module:
(** [send fds sock] sends the file descriptors in [fds] to the
process on the other end of the socket [sock]. *)
val send : Unix.file_descr array -> Unix.file_descr -> unit
(** [recv n sock] receives up to [n] file descriptors from [sock],
which must be a unix socket *)
val recv : int -> Unix.file_descr -> Unix.file_descr array
However, I was getting strange behavior. I would send two descriptors:
Sockfd.send [| fd1; fd2 |] sock;
But on the receiving side, I would only receive one:
let fds = Sockfd.recv 2 sock in
(* Array.length fds = 1 ??? *)
After some experiementation and research (strace was helpful here) the problem was that I was sending two control messages with one descriptor each, but on the receiving end, they were merged into one control message with two descriptors. I don't know if this is Linux specific behavior, so I rewrote the send function to put all of the fds into one message, and the recv function to support any arrangement of fds (up to SCM_MAX_FD) within any number of messages.
After lots more silly bugs, reading docs, reading the kernel code, I finally got to the end of the test:
tapallocd: char device tap3 for ipvtap21701 is ready
tapallocd: removing reference to device
tapallocd: opened 1 fds to tap3
tapallocd: sending fd 6
tapalloc: received fds [3]
tapalloc: moving fds [3] to [8]
exec /home/droyo/src/dev.arroyo.cc/tapalloc/_build/default/test/test_dummy.exe stage2
mac address is 12:b2:e1:ac:a9:db
got tap device with stats {dev: 66306, ino: 7262473, perm: 0640, uid: 65534,
gid: 65534, rdev: 59905}
ok
Unfortunately I still wasn't done; I found that the kernel was not destroying the network device after its device node and all references to it were gone. I thought about it some more and found that the most suitable option is the one taken by all the fashionable container managers out there; place the network interface in the client's network namespace. I did consider this initially and hesitated for a number of reasons:
-
I was concerned it would make it harder for me to implement my dhcp/slaac helper as it would need to use NETLINK_NETFILTER to snoop on packets from these interfaces, which are now in different namespaces. I think this can be addressed with the NETLINK_LISTEN_ALL_NSID option.
-
I didn't want to give the client, which probably had CAP_NET_ADMIN privileges in their namespace, control over the device if I didn't have to. Upon reflection, I was already giving them the ability to write raw ethernet frames into the device. Giving them the ability to bring the sub-interface up or down or assign addresses is not a big deal. Besides, I control the clients.
The work I did before to emulate a device manager is not wasted; obtaining file descriptors for the tap device will still be a privilege reserved for the server. It's also one less dependency for automated tests. Moving the device into the clients' namespaces is probably for the best, as it will establish some symmetry between VMs (which will use ipvtap devices) and containers (which will use ipvlan devices).
To start, I tried using the IFLA_NET_NS_PID
attribute in my
RTM_NEWLINK
request:
mnl_ifla_put nlh IFLA_NET_NS_PID peer_pid;
With this attribute, my test hangs here:
kernel accepted RTM_NEWLINK
waiting for device manager to create char device
If I peek into the client's namespace, the link exists:
$ doas nsenter --target $(pgrep -f tapalloc.exe) --net -- ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ipvtap8447: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 500
link/ether d2:09:23:6f:b7:bd brd ff:ff:ff:ff:ff:ff
So I suspect the "udev" emulator isn't getting notifications about this device because it's not in our namespace. I tried setting the NETLINK_LISTEN_ALL_NSID flag. From netlink(7):
"When set, this socket will receive netlink notifications from all network namespaces that have an nsid assigned into the network namespace where the socket has been opened. The nsid is sent to user space via an ancillary data."
Unfortunately, it did not work work; the device manager listening in tapallocd's namespace did not get any uevents. I could not tell if that was because the kernel did not consider the two namespaces to be linked (due to one ns having a sub-interface of a device in the other ns), or because the uevents are not properly tagged when being broadcast. Another downside of this approach is that in a real deployment, it would require the device manager to set the NETLINK_LISTEN_ALL_NSID socket option, which may require patching the software to support socket options. Perhaps I can fix that in the future. For now, I'll do the following:
- Bring the device up in tapallocd's namespace and obtain file descriptors
- Move the device to the client's namespace after it's ready
This two-step approach opens the possibility of leaking devices if something goes wrong between the device creation and its transfer to the client namespace. For example, the client cloud close the connection, or it could be in a namespace that the server doesn't have permission to create interfaces into. I am comfortable with this risk; I can write the server carefully to clean up the device if it can't be transferred, and have some periodic sweep for "dangling" interfaces. I can probably also work out some scheme to limit the total number of interfaces in the server's namespace at any given time.
Doing it this way, I finally get what I want; the network device is allocated, the client has open file descriptors to send and receive ethernet frames with, and the device's lifecycle is the same as the client's. I no longer need to perform any trickery with unlinking files. The proof of concept is complete!
In all honesty, this service could have been an execline script with the help of the programs in fdtools. This was really more of a warm-up; an easy, tractable problem to get me accustomed to the quirks and oddities of netlink and ctypes before getting to my real goal.
Along the way I finally understood how to use the Ctypes view
function.
It's basically registering handlers to convert to and from C and OCaml
values when you pass a value to or receive a value from a C function.
As an example, the nfnetlink structures contain several fields which are
documented as holding values in network byte order (e.g. big-endian). For
example:
struct nfqnl_msg_packet_hdr {
__be32 packet_id; /* unique ID of packet in queue */
__be16 hw_protocol; /* hw protocol (network order) */
__u8 hook; /* netfilter hook */
} __attribute__ ((packed));
I adapted this struct like so:
module Nfqnl_msg_packet_hdr = struct
type t = [`Nfqnl_msg_packet_hdr] structure
let t:t typ = structure "nfqnl_msg_packet_hdr"
let packet_id = field t "packet_id" be32
let hw_protocol = field t "hw_protocol" be16
let hook = field t "hook" int
let () = seal t
end
What are be32
and be16
above? They're views over int
:
let be16 = view int ~read:ntohs ~write:htons
let be32 = view int ~read:ntohl ~write:htonl
Meaning, whenever I read a field of type be32
(through getf
or the
!@
operator for pointer dereferencing), I get back (ntohl value)
,
and whenever I write such a field, (htonl value)
is written instead.
I had to write the ntoh* and hton* functions myself, although they
may be in one of the modules I'm already using. They were easy
enough:
let mirror width x =
let rec loop dst src = function
| 0 -> dst
| n ->
let lsb = src land 0xff in
let dst' = (dst lsl 8) lor lsb in
let src' = (src lsr 8) in
loop dst' src' (n-1)
in
loop 0 x width
let htons = if Sys.big_endian then Fun.id else mirror 2
let htonl = if Sys.big_endian then Fun.id else mirror 4
let ntohs = htons
let ntohl = htonl
After understanding view
better I started using it in other ways, for
example for some enum-type fields that convert variants to and from
integers, but weren't usable with the TYPE.enum
function for one
reason or another. For example, defining a family
view for the address
family constants looks like this:
let family = view int ~write:family_to_enum
~read:(fun x -> family_of_enum x |> Option.value ~default:AF_UNSPEC)
And using it as a field in a structure allows me to do things like
let v = mnl_nlmsg_put_extra_header nlh Nfgenmsg in
Nfgenmsg.(v |-> nfgen_family) <-@ AF_UNSPEC;
Helping VMs help themselves - tapautoconf
Ipvlan (and ipvtap) devices are sub-interfaces of some other device. Sub-interfaces behave as if they're attached to a switch, and are able to communicate efficiently without the participation of the underlying device. If a packet is destined to an address that does not belong to a sub-interface, it will be transmitted by the underlying device. Similarly, when the underlying device receives a packet, if it's destined for a sub-interface, that interface receives it, otherwise, the packet is dropped.
The core challenge with using ipvlan devices is that all sub-interfaces share the ethernet (MAC) address of the underlying device. Traffic is multiplexed among the sub-interfaces based on the layer 3 (IPv4/IPv6) addresses. That means, in order to receive unicast traffic, an ipvlan device must have an IP address assigned to it. Without an address, they can only receive broadcast and multicast traffic.
I don't want to manually assign IP addresses to my VMs, and I don't want to invent my own IP address management automation. I want my VMs to acquire their own addresses from the existing automation in whatever network I happen to be in. That means supporting DHCP for IPv4 and SLAAC for IPv6. Since I also plan to use these VMs on my laptop, I also have the stipulation that I may not have control of the network, so I cannot rely on special DHCP configuration.
My strategy will be to implement a service that will "learn" the IP address(es) that an interface wants to use by snooping for DHCP and ICMPv6 traffic, and assign those addresses to the interface on its behalf. Because I need to know what interface the packet came from, simply listening for broadcast & multicast packets is not enough. I can use a feature of netfilter that allows a filtered packet to be queued to a userspace program. The program can even modify and re-inject the packet, which I will use later for an extra feature. This allows me to only intercept the packets I actually need, and netfilter will include metadata about where the packet originated, allowing me to associate a packet with the interface it came from.
Test setup
I need a test. The test is my "north star", some marker that will show
me if I'm going in the right direction. This setup will be slightly more
complicated than the tapalloc
test from earlier.
Since I can't find a dhcp client that would work with tap devices
and isn't a full-fledged VM, I'll write a variant of tapallocd
that
allocates ipvlan
instead of ipvtap
devices. I was planning to do
so anyway.
Dnsmasq acts as a stand-in for the dhcp/radvd services that run on the router in my home network, and is pre-configured with an IPv4 address pool and an IPv6 prefix which it will advertise for stateless autoconfig (no DHCPv6).
The dhclient
process runs with the -n
flag, which tells dhclient
not to actually configure the interface after obtaining an address. The
namespace containing the master dummy0
interface will have netfilter
rules installed that intercept ICMPv6 and DHCP traffic and enqueue it
to a netlink queue, which tapautoconf
will listen on. The test will
be successful when tapautoconf
is able to configure the IPv4 and IPv6
addresses that dhclient negotiates with dnsmasq.
To start, I will construct this topology from an automated test, setup
some netfilter rules to intercept traffic, and write a stand-in for
tapautoconf
that just logs every packet it receives and what metadata
is available for it.
While setting up the test I hit an issue; I was attempting to create
ipvlan
devices directly in the client's namespace, since the dance
required to obtain file descriptors for ipvtap
devices was not necessary.
I did this by setting the IFLA_NET_NS_PID
attribute to the pid of
the client in the RTM_NEWLINK message:
let (peer_pid, gid, uid) = Sockfd.peercred Unix.stdin in
...
mnl_ifla_put nlh IFLA_NET_NS_PID peer_pid
The peer pid can be retrieved with the SO_PEERCRED socket option. Despite setting this attribute, the network device was created in tapallocd's network namespace, rather than the client's. I wracked my brain over this problem for a long time. I was especially befuddled because in my prior test, before changing ipvtap to a 2-stage allocation method, I was able to create devices directly in the peer namespace. I knew this was supposed to work. I pored through the kernel source. I tried all kinds of experiments. I added more debug statements. I mechanically looked for differences between the working and non-working program, going so far as to re-arrange attributes in some desparate terror that the order would matter.
Ultimately, the issue was quite simple. The tapallocd
program queries
the SO_PEERCRED socket option, which is a struct that looks like this:
struct ucred {
pid_t pid; /* process ID of the sending process */
uid_t uid; /* user ID of the sending process */
gid_t gid; /* group ID of the sending process */
};
It uses the peer's process ID to identify the network namespace to insert the network interface into. The problem was that peer_pid was not the pid of the process on the other end of the socket. How is that possible? Well.
The tapallocd
programs are intended to run under a super-server
such as inetd, socat(1), or s6-ipcserver(8). The server
listens on the unix socket, and for each accepted connection, forks an
instance of tapallocd
. A single tapallocd
process serves a single
request.
I wanted my test to require minimal external dependencies, and when possible, I did not want it to write files to a file system, or to bind to network addresses on the system. The solution I came up with looked something like this:
let well_known_port = 42
let start () =
let fd = bg Sys.argv.(0) [| Sys.argv.(0); "listen" |]
Unix.dup2 fd well_known_port
(* runs in a separate process *)
let listen () =
let buf = Bytes.create 1 in
while Unix.(read stdin buf 0 1) > 0 do
let srv = bg "tapallocd" [| "tapallocd"; "--verbosity=debug"; "dummy0" |] in
Sockfd.send [| srv |] stdout
done
let connect () =
ignore (Unix.write_substring well_known_port " " 0 1);
let fd = Sockfd.recv 1 well_known_port).(0) in
Unix.dup2 fd (fd_of_int 6)
Unix.dup2 fd (fd_of_int 7)
...
let test () =
let c = connect () in
Unix.execvp "tapalloc" [| "tapalloc"; ... |]
let main () =
match Sys.argv.(1) with
| "listen" -> listen ()
| "test" -> test ()
...
I've ellided some details, but essentially, the test program repeatedly
exec's itself, using its first argument to choose what function to
call. A socketpair connects the file descriptor 42 to the standard input
and output of a child process that starts instances of tapallocd
and
shares one end of another socketpair that is connected to tapallocd
's
standard input and output. The test can then execute into the tapalloc
client, which expects the tapallocd
service to be connected to file
descriptors 6 (for read) and 7 (for write). The numbers 6 and 7 come
from the UCSPI convention.
What I learned was, the SO_PEERCRED
option contains the pid
(and uid, and gid) of the process which opened the socket, not
the process that happens to be writing to it. In my ipvtap
tests, which
did not require a forking server since there was only one request, these
processes were one and the same. But with the "server" implementation
above, the process that creates the socket (in the implementation of the
bg
function) is the process running the listen function, not the
process that calls connect. I changed the implementation to look like
this:
(* runs in background process *)
let rec listen () =
match Sockfd.recv 1 Unix.stdin with
| [| c |] -> (
(* tell client we've received the fd *)
ignore (Unix.write_substring c "." 0 1);
match Unix.fork () with
| 0 -> Unix.close c; listen ()
| _ -> (* new child process *)
Unix.(dup2 c stdin);
Unix.(dup2 c stdout);
Unix.execvp "tapallocd" [ "tapallocd"; "--verbosity=debug"; "dummy0" ])
| _ -> exit 0
(* called by test *)
let connect () =
let client, server = Unix.(socketpair PF_UNIX SOCK_STREAM 0) in
Sockfd.sendfd [| server |] well_known_port;
(* read one byte as a sign that server has received the fd *)
ignore (Unix.read client (Bytes.create 1) 0 1);
Unix.dup2 client (fd_of_int 6);
Unix.dup2 client (fd_of_int 7)
So rather than listen
opening a socketpair on behalf of the process calling
connect
, connect
opens its own socketpair and shares one end with listen
.
In hindsight, it makes perfect sense that the credentials in SO_PEERCRED would
be those of the process that opened the socket. After all, the same socket could
be dup'd to any number of processes in any number of namespaces, and treating
SO_PEERCRED as "the currently speaking user" is inherently racy.
With that cleared up, the topology, at least, is in place. dclient -4
cannot obtain a lease, even though dnsmasq is offering it one:
DHCPDISCOVER on nic0 to 255.255.255.255 port 67 interval 4
Jan 10 20:11:11 dnsmasq-dhcp[27106]: DHCPDISCOVER(nic0) 0e:96:ef:9b:98:7b
Jan 10 20:11:11 dnsmasq-dhcp[27106]: DHCPOFFER(nic0) 192.168.0.119 0e:96:ef:9b:98:7b
No DHCPOFFERS received.
Unable to obtain a lease on first try. Exiting.
A packet capture makes things obvious:
62:9c:ad:b6:86:f2 > ff:ff:ff:ff:ff:ff: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 62:9c:ad:b6:86:f2
62:9c:ad:b6:86:f2 > 62:9c:ad:b6:86:f2: 192.168.0.1.67 > 192.168.0.121.68: BOOTP/DHCP, Reply, length 300
The DHCPDISCOVER request is broadcast, but the DHCPOFFER reply is unicast, addressed at the L2 layer to the mac address of the parent interface, and at the L3 layer to the IPv4 address being offered. Since the client does not have an L3 address configured, as an ipvlan device it cannot receive the reply. It's a chicken and egg problem. The strategy for making DHCP work, then, is to intercept these DHCPOFFER packets, configure the address they offer on the interface, then re-inject them. While netfilter would allow us to identify the interface that the DHCPDISCOVER offer originated from, it can't tell us what interface the DHCPOFFER was intended for. The tapautoconf program will have to match some information in the DHCPDISCOVER packet, such as the DUID, to information in the DHCPOFFER packet.
Let's insert tapautoconf
into the mix. It is inspired by examples
in the libmnl and libnetfilter_queue C libraries. At this point, it
simply binds to a netfilter queue, and prints out every message it
receives. This is the main loop, so to speak:
let logmsg nlh =
Logs.info (fun m -> m "%a" pp_nlmsg nlh);
in
let rec loop () =
match Event.(sync (receive firehose)) with
| Ok q -> Queue.iter logmsg q; loop ()
| Error _ as err -> err
in
I have 3 threads set up; one for reading messages from netlink,
one for writing messages to netlink, and one for processing. They
communicate via Event channels
from OCaml's threading library. I really enjoyed setting up the event
loop. I think in any language or scenario, event loops are incredibly
fun to write, as you have to think about how to write it in a way
that keeps the system responsive to new events, while also avoiding
unbounded resource usage. The Event
library is quite nice; it gives
you channels with send
and receive
operations, and a select
function that lets you choose between them:
val send : 'a channel -> 'a -> unit event
val receive : 'a channel -> 'a event
val select : 'a event list -> 'a
let event =
select [
receive channel1;
receive channel2;
receive channel3;
]
in
...
That may seem fairly limiting initially, as it only allows you to
select between channels that produce values of the same type. Luckily,
there is also the wrap
function that can convert an event into another
type without waiting for the event to occur.
val wrap : 'a event -> ('a -> 'b) -> 'b event
So, I have a type that describes the different kinds of events that can occur:
type request = (Nlmsghdr.t * nlmsg_reply Event.channel)
type ioevent =
| Request of request (** request from another thread *)
| Response of nlmsg_reply (** response from netlink *)
| Delivered of mailbox (** a thread received its reply *)
| Flush (** messages were flushed to netlink *)
And helper functions convert send
and receive
actions into
this iovent
type:
let ioevent_of_req r = Request r
let ioevent_of_rsp r = Response r
let ioevent_of_nop _ = Flush
let ioevent_of_ack m () = Delivered m
let recv_rsp ch = Event.(wrap (receive ch) ioevent_of_rsp)
let recv_req ch = Event.(wrap (receive ch) ioevent_of_req)
let flush_req wq ch = Event.(wrap (send ch (Queue.copy wq)) ioevent_of_nop)
let deliver m = Event.(wrap (send m.ch (Ok (Queue.copy m.q))) (ioevent_of_ack m))
So that my event loop looks like this:
let events =
[ recv_rsp from_netlink ]
|> (if not full then List.cons (recv_req request) else Fun.id)
|> (if pending then List.cons (flush_req wq to_netlink) else Fun.id)
|> Hashtbl.fold try_deliver inflight
in
match Event.select events with
| Flush -> Queue.clear wq; ioloop ()
| Request (nlh, ch) -> handle_req nlh ch; ioloop ()
| Response (Ok q) -> Queue.iter handle_rsp q; ioloop ()
| Delivered mb -> handle_ack mb; ioloop ()
| Response (Error s) -> handle_err s; ()
Requests come in as a message plus an event channel to receive replies
on. The event loop generates a sequence number for them and puts them
into a hash table. When a reply comes in, the sequence number is used
to lookup the sender of the request. The events
list is dynamically
rebuilt on each run based on the state of the system, but it always
checks for new replies from netlink, as neglecting that channel could
lead to dropped messages. The events that handle messages deliver a
queue of messages rather than a single message. This is to set up for
the use of sendmmsg
and recvmmsg
in the future if performance ever
becomes an issue.
While the event loop was fun, actually getting messages to flow was a trial verging on an existential crisis. "Why am I here? Why am I doing this?" It wasn't terribly difficult to get the right bits in the right places, but the utter lack of feedback from netlink made it difficult to understand why packets wouldn't flow, because the NFNETLINK socket was just ACK'ing my messages without telling me what was wrong with them.
I had to install the following nftables rules to select the traffic I wanted delivered to the queue:
table inet filter {
chain PREROUTING {
type filter hook prerouting priority 0
policy accept
# DHCPv4 traffic
counter meta protocol ip udp sport 67-68 udp dport 67-68 queue num 1
# DHCPv6 traffic
counter meta protocol ip6 udp sport 546-547 udp dport 546-547 queue num 1
# Router solicitations to learn guests' lladdr
counter icmpv6 type { nd-router-solicit, nd-router-advert } queue num 1
}
}
Eventually I would like the tapautoconf program to install this rule itself by default, so there is one less thing to set up. Doing so would not require increasing its permissions, since it already has CAP_NET_ADMIN required to read from the netfilter queue.
After that, and a lot of twiddling and logging and experimentation, I was finally rewarded with the message:
dnsmasq-dhcp[2787]: DHCPDISCOVER(nic0) f2:5a:9b:84:98:1e
dnsmasq-dhcp[2787]: DHCPOFFER(nic0) 192.168.0.146 f2:5a:9b:84:98:1e
TAPAUTOCONF: <-{nlmsg_len=408, nlmsg_type=NFQNL_MSG_PACKET, nlmsg_flags=0x0000, nlmsg_seq=0, nlmsg_pid=0}{nfgen_family=AF_UNSPEC, version=NFNETLINK_V0, res_id=0x0001} NFQA_PACKET_HDR={packet_id=5, hw_protocol=0x0800} NFQA_IFINDEX_INDEV=33554432 NFQA_HWADDR={hw_addrlen=6, hw_addr=f2:5a:9b:84:98:1e} NFQA_PAYLOAD=[45c0 0148 3b8d 0000 4011 bb74 c0a8 0001 c0a8 0092 0043 0044 0134 4292 0201 0600 6b02 0168 002d 0000 0000 0000 c0a8 0092 c0a8 0001 0000 0000 f25a 9b84 981e 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 6382 5363 3501 0236 04c0 a800 0133 0400 000e 103a 0400 0007 083b 0400 000c 4e01 04ff ffff 001c 04c0 a800 ff03 04c0 a800 0106 04c0 a800 01ff 0000 0000 0000 0000]
I was beginning to think that my entire idea was flawed and that I
could not intercept packets the way I wanted to. So it was a big relief
to see this message, even if I could not immediately understand it. I
have to go about parsing the packet. I would like to use the Cstruct package which
comes with a nice preprocessor that allows binary data to be parsed
easily. It's also a popular package, used extensively in the Mirage project, and, interestingly, in the
Charrua project, which is a DHCP
server, client, and protocol library. Cstruct's core data types are a
buffer
, which is just a Bigarray:
type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
And a view into that buffer:
type t = private {
buffer : buffer;
off : int;
len : int;
}
So I would like to treat the payload of NFQA_PAYLOAD attributes
as a Cstruct.t
, or at least a Bigarray
. It's currently constructed like
this, as part of the function that iterates over message attributes:
let data: unit ptr = mnl_attr_get_payload nla in
let len: int = mnl_attr_get_payload_len nla in
let p = Ctypes.(from_voidp char data) in
Ctypes.CArray.from_ptr p len
Ctypes does have some support for bigarrays, but the documentation is intimidating:
type 'a bigarray_class = 'a Ctypes_static.bigarray_class
(** The type of Bigarray classes. There are four instances, one for each of the Bigarray submodules. *)
val array1 : < element : 'a ; layout : 'l ; ba_repr : 'b ; bigarray : ('a, 'b, 'l) Bigarray_compat.Array1.t ; carray : 'a carray ; dims : int > bigarray_class
(** The class of Bigarray.Array1.t values *)
val bigarray_of_ptr : < element : 'a ; layout : Bigarray_compat.c_layout ; ba_repr : 'f ; bigarray : 'b ; carray : _ ; dims : 'i > bigarray_class -> 'i -> ('a, 'f) Bigarray_compat.kind -> 'a ptr -> 'b
(** bigarray_of_ptr c dims k p converts the C pointer p to a C-layout bigarray value. No copy is made; the bigarray references the memory pointed to by p. *)
What the heck do those brackets mean? What is Bigarray_compat
? Why is
the type definition so big? Not letting confusion get the better of me, I
tried squinting at the signature of bigarray_of_ptr
:
val bigarray_of_ptr : 'dims scary_thing -> 'dims -> 'a other_scary_thing -> 'a ptr -> 'b
Then I saw that scary_thing
is a bigarray_class
, and other_scary_thing
is Bigarray.kind. So our call will look something like:
bigarray_of_ptr array1 ??? Bigarray.Char myptr
The final piece, I based on the name, dims
, I guess should be the size of the array. So
Going back to the NFQA_PAYLOAD attribute, I can try to make a bigarray like so:
bigarray_of_ptr array1 len Bigarray.Char (from_voidp char data)
This gives me the error:
File "netlink/netlink.ml", line 908, characters 40-105:
908 | | 10 -> next @@ f init NFQA_PAYLOAD (bigarray_of_ptr array1 len Bigarray.Char (from_voidp char data))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: This expression has type
(char, Bigarray.int8_unsigned_elt, Bigarray.c_layout)
Bigarray.Array1.t
but an expression was expected of type buffer = char carray
Which is actually exactly what I wanted! I just need to change the type definition of this attribute from this:
type _ nlattr_type =
...
| NFQA_PAYLOAD : char carray ifla_type
to this:
type _ nlattr_type =
...
| NFQA_PAYLOAD : (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t ifla_type
but since I plan to use the bigarray through the Cstructs.t type, I may as well make it
type _ nlattr_type =
...
| NFQA_PAYLOAD : Cstruct.t ifla_type
I can then address the errors generated by the compiler, one by one, to get things working. Forget test-driven development, how about Compiler-driven development? Reading the payload becomes:
let cstruct =
from_voidp char data
|> bigarray_of_ptr array1 len Bigarray.Char
|> Cstruct.of_bigarray
and writing becomes:
let buf = Cstruct.to_bigarray value in
let len = Cstruct.length value in
mnl_attr_put nlh T.nfqa_payload len (bigarray_start array1 buf |> to_voidp)
The Cstruct library contains its own function to dump a buffer, so pretty-printing becomes:
Fmt.(pf ppf "NFQA_PAYLOAD=[%a]" (octets ()) (Cstruct.length v, Cstruct.get v)
And we're back at feature-parity with the previous CArray-based payloads, only now we have access to the packet-processing features of the Cstruct library, and we can interact with other libraries!
Cstruct comes with a nice pre-processor that allows you to describe a packet as an OCaml record type, with appropriate field types, and generate functions to get/set fields in that packet. However, we don't need to parse every field in a packet. We only care about the following information:
- The src/dst ethernet address, to filter packets that match the parent device.
- The source and destination IPv4/IPv6 address, to learn the address the user of the device wants to use (or that dhcp wants to give it).
- The DHCP message type (DHCPOFFER, DHCPDISCOVER, etc)
- The ICMPv6 message type (neighbor/router solicit, router advertisement)
So rather than defining each field, we'll use cstruct's parsing helpers
to jump through a packet and just lift the fields we need. There's a bunch
of functions for slicing and dicing Cstruct.t
values, and reading integers
from them in either byte order. I defined a record type that would contain
just the information I'm interested in:
type packet = {
ts: Ptime.t;
l2_src : bytes;
l2_dst : bytes;
l3_src : bytes;
l3_dst : bytes;
payload : pkt_payload;
}
and pkt_payload =
| DHCPDISCOVER of { xid: string; client_id: string }
| DHCPOFFER of { xid: string; yiaddr: bytes; address_time: int }
| DHCPREQUEST of { xid: string; address_request: bytes }
| DHCPDECLINE of { xid: string }
| DHCPACK of { xid: string }
| DHCPNAK of { xid: string }
| DHCPRELEASE of { xid: string; ciaddr: bytes }
| IGNORE
Then a sequence of functions parse the packet in layers, advancing the
start of the Cstruct.t
to the end of their respective headers before calling
the next parser:
let ethernet pkt cs =
...
let pkt = { pkt with l2_src; l2_dst } in
match ethertype with
| 0x0800 -> ipv4 pkt cs
| 0x0866 -> ipv6 pkt cs
| _ -> { pkt with payload = IGNORE }
and ipv6 pkt cs = { pkt with payload = IGNORE } (* TODO *)
and ipv4 pkt cs =
...
let pkt = { pkt with l3_src; l3_dst } in
match protocol with
| 17 -> udp pkt cs
| _ -> { pkt with payload = IGNORE }
and udp pkt cs =
...
match src_port, dst_port with
| 67, 68 | 68, 67 -> dhcp pkt cs
| _ -> { pkt with payload = IGNORE }
and dhcp pkt cs =
...
{ pkt with payload =
match message_type with
| 1 -> DHCPDISCOVER { xid; client_id }
| 2 -> DHCPOFFER { xid; yiaddr; address_time }
| 3 -> DHCPREQUEST { xid; address_request }
| 4 -> DHCPDECLINE { xid }
| 5 -> DHCPACK { xid }
| 6 -> DHCPNAK { xid }
| 7 -> DHCPRELEASE { xid; ciaddr }
}
in
match mnl_get_attr nlh NFQA_PAYLOAD with
| exception Not_found -> None
| cs ->
let ts = Ptime_clock.now () in
let pkt = make_packet ~ts () in
try
Some (ethernet pkt cs)
with Invalid_argument _ ->
Logs.debug (fun m -> m "out-of-bounds parsing %a" Cstruct.hexdump_pp cs);
None
The first issue I encountered was that NFQA_PAYLOAD only contained the payload from the l3 header onwards, so my parsing failed. I suspect the contents of the payload depend on the netfilter rules used to enqueue the packets. I have to pull the l2 addresses from the NFQA_HWADDR attributes, and the ethertype from the NFQA_PACKET_HDR attribute. After making those adjustments, and fixing various parsing errors, I was able to parse the DHCPOFFER message coming from the dnsmasq process.
parsed { ts = 2025-01-19 19:16:01 +00:00;
hwaddr = 2a:39:42:42:3c:f8; l3_src = 192.168.0.1;
l3_dst = 192.168.0.78;
payload =
DHCPOFFER {xid = 2ebc 897a; yiaddr = 192.168.0.78;
address_time = 3600}
}
From here, it's simply a matter of assigning the address in yiaddr
to the relevant interface. But I have a problem; tapautoconf never saw
the initial DHCPDISCOVER or DHCPREQUEST packet, so it doesn't know where
the request came from! If I change the test so that tapautoconf runs from
the dhcp server's network namespace, I can see the request, but not the reply:
parsed { ts = 2025-01-19 19:36:48 +00:00;
hwaddr = f2:bc:8c:0c:42:ef; l3_src = 0.0.0.0;
l3_dst = 255.255.255.255;
payload =
DHCPDISCOVER {xid = 159f 076c;
client_id = ff8c 0c42 ef00 0100 012f 2010 d0f2 bc8c
0c42 ef}
}
Moving tapautconf to the client's namespace, I get neither. Reading Netfilter hooks I think the problem is that I am only enqueing packets in the PREROUTING hook, which will only see packets coming into an interface. I need to collect packets at the OUTPUT hook to capture packets that were generated by applications in the network namespace. I suspect that, if I were actually testing with tap devices, I wouldn't have to do this, but it doesn't hurt.
table inet filter {
chain PREROUTING { ... }
chain OUTPUT {
type filter hook output priority 0
policy accept
counter meta protocol ip udp sport 67-68 udp dport 67-68 queue num 1
}
}
Running from the namespace with the dhcp server, I get both:
{ ts = 2025-01-19 19:56:08 +00:00; indev = 2; outdev = -1;
hwaddr = a6:d7:34:2e:7f:19; l3_src = 0.0.0.0; l3_dst = 255.255.255.255;
payload =
DHCPDISCOVER {xid = ef56 5b4d;
client_id = ff34 2e7f 1900 0100 012f 2015 58a6 d734 2e7f 19}
}
{ ts = 2025-01-19 19:56:08 +00:00; indev = -1; outdev = 2; hwaddr = ;
l3_src = 192.168.0.1; l3_dst = 192.168.0.95;
payload =
DHCPOFFER {xid = ef56 5b4d; yiaddr = 192.168.0.95; address_time = 3600} }
I added two fields, indev
and outdev
, corresponding to
the NFQA_IFINDEX_INDEV
and NFQA_IFINDEX_OUTDEV
attributes,
respectively. From the perspective of the dhcp server namespace, the
DHCPDISCOVER packet comes in on ifindex 2 (nic0
), and the DHCPOFFER
packet goes out on the same interface.
I cannot run tapautoconf from the dhcp server namespace because, in real usage, the dhcp server is on a separate machine outside of my control. If I run tapautoconf from the namespace with the parent device, I only see the DHCPOFFER packet. If I run tapautconf from the client namespace, I see neither. I believe this is because the ipvlan interfaces are in l2 mode. From IPVLAN Driver HOWTO:
"In this mode TX processing happens on the stack instance attached to the slave device and packets are switched and queued to the master device to send out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) as well."
I do want to support broadcast/multicast traffic, because they are
necessary for IP autoconfiguration on both IPv4 and IPv6. And I do want
to allow cross-talk between sub-interfaces. But the description makes
me think that I should be using the layer 2 bridge
family in nftables
in order to see these packets. Before going that route (heh), there is
a bigger problem; the interface index on the intercepted packets seems
to be the index of the interface in the network namespace tapautoconf
runs from. This makes total sense, as that's where the netfilter stack
is intercepting, and where the netlink socket is opened.
However, it means that in order for tapautoconf to associate a packet with the interface that it needs to add an address to, it needs to have the nftables rules installed in every namespace that it has allocated a device for, and needs netlink sockets open for each namespace. I don't want to do that, as it raises all sorts of other problems around opening lots of netlink sockets, dealing with nftables rules being modified by processes in the peer namespaces, and so on.
I am once again confronted with the decision over where to put the tap devices. Initially they were all in one namespace, and clients only got file descriptors for them. Then I waffled and moved the tap devices into the namespace of the client process, so that they would automatically be destroyed when the client namespace was destroyed. Now I am again considering putting them all in the same namespace so that I can intercept their DHCP/ICMPv6 traffic in a scalable way. It would mean that I need to come up with a reliable way to destroy these interfaces when they're no longer in use.
It would also mean that I would need to modify my test setup, because the
client namespace will no longer have an interface. I could create a veth
pair to carry traffic from the client namespace to tapalloc/ tapautoconf's
namespace, but would need to install routes to get traffic from one end of
the veth into the ipvlan interface. I'm opting instead to change the test
to use ipvtap
devices, since that is my intended use case anyway. It
means I need a DHCP client that receives and sends packets through a file
descriptor, which dhclient(8) cannot do. I could run it in a VM using
qemu
, but instead I decided to use Charrua's DHCP_client module,
and use it to implement a dhcp client for the test. Doing this will
save me work later when I extend the tapalloc
idea to support ipvlan
devices, given to containers, instead of just ipvtap
devices given to
VMs. I'll talk about that later ☺.
There is no guix package for charrua
, but guix can automatically
import a package definition from the OCaml package repository (OPAM),
including definitions for its dependencies:
$ guix import opam --recursive charrua-client
This didn't build right away. After some tedious work getting things to
build, I took a more mechanical approach; for each package definition
ocaml-NAME
, I added the field
(arguments (list #:package "NAME"))
The issue was that the opam
importer would generate dependencies
for a single opam package, but the dune-build-system
would try
to build all packages by default, so the generated package definitions
were insufficient, and the resulting packages had more than I needed.
After doing that, there were only two problems to correct; a missing
build-time (native-inputs
) dependency of charrua
on ocaml-menhir
,
and a missing definition for the ocaml-ipaddr
package in the generated
output.
After listing ocaml-charrua-client
as a dependency of my tapalloc
package, guix shell -D
makes the package available in my environment. I
can now list charrua-client
as a dependency for a module in my dune
files to use the Dhcp_client
module.
A break for some housekeeping
It seems like there is a lot left to do, but I can feel that I am nearing the end of this project. At least, the path forward seems clear. Since I am building a new test, I want to take a brief interlude to do some tidying up in the project, to make it closer to something I'd actually want to run. It's not a major refactor, just some reorganization.
-
tapallocd_ipvlan
andtapallocd_ipvtap
are merged into a single program, that can create either according to the request. -
tapalloc_ipvlan
andtapalloc
are merged into one program that takes a flag to decide which type of device to request. -
The
tapalloc*
programs are renamed toontap
. I want to evoke the imagery of a VM swaggering up to a bar in a saloon and the barkeep pouring him a fresh, coldbrewdevice queue. Another name candidate waspatchpanel
but to me that gave too much of an impression of 1:1 physical to virtual links rather than the reality of 1:many or small:many. It also gives the impression that the devices could be moved between physical devices, which isn't an explicit goal. -
The client is split into two components:
ontapc
, which is the client, andontap
which is a convenience wrapper that connects to the unix socket without the need to use s6-ipcclient(8) -
The client-server protocol is reworked to be a little more flexible and extendable; Requests are now a single line of the form
type=ipvtap name=nic0 queues=1\n
and the response is either
ok KEY1=VAL1 KEY2=VAL2 <end-of-file>
in the successful case or
error a detailed error message <end-of-file>
in the failure case.
-
tapautoconf
is renamed toontap-l3sync
-
The miscellaneous libc calls stuffed at the end of the
Netlink
module are split out into a separateSyscall
module. -
The
Sockfd
module is renamed toFdpass
. -
Chunks of the tests, like the stand-in for udev and s6-ipcserver, are factored out into separate modules under test/test_helpers.
-
The man pages are updated to reflect the changes.
Strange behavior with mount namespaces
I encountered a strange issue while building the test. Because this test is using tap devices this time, it needs to create character device nodes like my first test did. That test created device nodes in the current working directory. In general I don't like creating real files in a test because you have to think about cleaning them up, about having a writable file system, and so on. I wanted to improve upon the old test by creating the files in a tmpfs, mounted in a new mount namespace. This way, when the test processes exit, the mount namespace would be destroyed along with the file system.
But I had a problem; as the diagram shows above, the "device mgr"
runs in the default namespace, outside of the user and mount namespaces
where the tmpfs gets mounted. I can use the mknodat
system call, which
takes as its first argument a file descriptor to a directory, to create the
file from outside of the namespace:
I thought I was very clever putting this together, but I was surprised when I saw the error:
ontapd: [DEBUG] device manager notification for tap3 received
ontapd: [ERROR] could not allocate device: open(/tmp/tap3): Permission denied
I ticked off the obvious causes:
- The file was owned by the same uid:gid in both namespaces
- The file had read-write permissions (0666)
- The tmpfs did not have the "nodev" option set
And conducted a bunch of additional scenarios to explore the problem
- Creating a regular file worked (the file could be read within the namespace)
- Creating the node outside of the tmpfs, in a directory shared between both namespaces, worked.
- Creating and opening a node in the same namespace worked.
I started reading the tmpfs implementation in the Linux kernel, but felt like I was getting too sidetracked, so I wrote a minimal implementation in C and drafted an email to the linux-fsdevel list. However, as I did so, I found enough indicators that this was expected behavior that I sent the email to my personal tracker instead. I will look into it later.
In the meantime, I can create the device in the current working directory,
as dune test
seems to perform some cleanup on my behalf anyway.
A simple dhcp client
So I've added the charrua dhcp library to my project. The Dhcp_client
module does not make any assumptions about how to send or receive
packets; it lets you create a Dhcp_client.t, which is a state machine. You feed it a packet
via the input
function, and it will return its new state, and a suggestion
for what to do next:
[ `Response of t * Dhcp_wire.pkt (** send the provided packet *)
| `New_lease of t * Dhcp_wire.pkt (** you've got a lease, celebrate! *)
| `Noop (** don't do anything )
]
It can be driven by a simple loop:
open Cstruct
let rec wait (fd:Unix.file_descr) (cs:Cstruct.t) (client:Dhcp_client.t) =
let msg =
Unix.read_bigarray cs.buffer cs.off cs.len |> Cstruct.sub cs 0
in
match Dhcp_client.input client msg with
| `Response (client', pkt) -> send_response fd cs client' pkt
| `New_lease (_, pkt) -> Ok pkt.yiaddr
| `Noop -> wait client
and send_request fd cs client pkt =
let len = Dhcp_wire.pkt_into_buf pkt cs in
Unix.send_bigarray fd cs.buffer cs.off len |> ignore;
wait fd buf client
It needs a few modifications:
-
We need to set a unique xid and client id for our requests. I do this by seeding a random number generator with the current process id:
let xidgen = let prng = Random.State.make [| Unix.getpid () |] in fun () -> Random.State.int32 prng Int32.max_int
-
By default, it sets the BROADCAST flag in its requests, which instructs the dhcp server to send its replies to the broadcast address, rather than the unicast address. While this makes things easier for us, since ipvlan devices in l2 mode can receive broadcast packets without an address, I don't want to rely on that. Most dhcp clients will be configured not to ask for broadcast replies, and I don't want to have to change the defaults.
The modifications can go into the send_request
function:
and send_request fd cs client pkt =
+ let pkt = { pkt with flags = Dhcp_wire.Unicast; xid = xidgen () }
A real client would also be accounting for the passage of time, requesting a new lease before the current one expires, and it would have to react to changes in the network interface state, system suspend/resume, and so on. But getting a lease, one time, is sufficient for this test.
Throwing in some print statements, the current state of the test looks like this:
dhclient: ->{...XID 1976429091 options Client ID ID 0 343130383831313335, Message type DHCP DISCOVER...}
dnsmasq-dhcp[23960]: DHCPDISCOVER(ipv0) 8e:ec:6e:5f:d4:e1
dnsmasq-dhcp[23960]: DHCPOFFER(ipv0) 192.168.0.36 8e:ec:6e:5f:d4:e1
ontap_l3sync: [DEBUG] { ts = ...; indev = 2; outdev = -1; hwaddr = 8e:ec:6e:5f:d4:e1; l3_src = 192.168.0.1; l3_dst = 192.168.0.36; payload = DHCPOFFER {xid = 1976429091l; yiaddr = 192.168.0.36; address_time = 3600} }
So ontap_l3sync can see the unicast DHCPOFFER coming from the DHCP server, but it doesn't see the initial DHCPDISCOVER request comint out of the tap device. We have come back to the problem we faced before we switched the test from ipvlan to ipvtap interfaces.
I know that ipvlan (and macvlan) interfaces are "special", in that traffic bypasses most of the networking stack of the parent interface's network namespace. But in this case, the parent and sub-interfaces are in the same namespace! With some experimentation I find that if I run the helper in its own namespace with its own ipvlan device, the netfilter rules can intercept both packets, but then I have the same problem of not being able to tell what interface the packets originated from.
I'm again starting to worry that I cannot make this work in a sustainable way. I do have the option of going back to the previous model of putting ipvtap devices in clients' namespaces, then obtaining a netlink queue from the namespace of every sub-interface. That will certainly work, I just worry about holding too many netlink sockets open. There is also, of course, the option of binding a raw socket to each interface and filtering DHCP/ICMPv6 traffic that way. I was hoping not to do that, as I don't want to open one socket for every interface.
I found myself getting frustrated, so I sent an e-mail to the netfilter
mailing list asking for help, and took a side-quest to implement a ring
buffer to avoid the rampant copying of
packet buffers that was happening within the tools. While these programs
are in the path for protocols that normally will see only a few packets
per hour, and performance is not really a concern, I needed to work
on something without external dependencies and the result will be
useful in the future. It also had the benefit of simplifying the programs,
which were doing inter-thread communication using a hodge-podge
of Event
channels and queues of pointers.
After coming back to this after a short break, I read drivers/net/tap.c
and drivers/net/ipvlan/ipvlan_core.c
more closely. When an application like qemu wants to transmit a frame,
it calls sendmsg
on the file descriptor it has for the tap device. In the
kernel, tap_sendmsg
runs, which calls tap_get_user which copies the packet data from user space into an sk_buff structure and calls dev_queue_xmit
, which
calls skb->netdev_ops->ndo_start_xmit
, which, for an ipvlan/ipvtap
device, is set to ipvlan_start_xmit which, finally, gets to ipvlan_queue_xmit. From here there
are 3 major paths to transmit the packet:
- If the destination is another ipvlan device on the same parent device,
look it up (by its l3 address) and call its receive function (
ipvlan_rcv_frame
). - If the packet is a multicast or broadcast packet, and the ipvlan is in l2 mode,
pass it to
ipvlan_multicast_enqueue
. - If the packet has an external destination, call
dev_queue_xmit
on the parent device.
How does netfilter factor into this? The dev_queue_xmit
function passes
the sk_buff structure to the egress hook, so it will show up on its
way out of the ipvtap
device, and on its way from the ipvtap
device to its parent device in the egress hook. That's the only hook;
by accepting ethernet frames, tap devices essentially skip almost all
of the Linux kernel's network stack. In more conventional setups like a
tap+bridge combo, or using routes & IP forwarding, you have the option
of intercepting packets as they pass through some other element like
a bridge. But macvtap and ipvtap devices provide almost direct access
to physical device queues, which has good performance implications,
but makes the transmit path especially stealthy.
That explains why I was able to log the packet using an egress hook, but no other hooks. As far as I can see, that is the only hook the packet will pass through. If it is distributed to another ipvlan device, it will show up on the INPUT and PREROUTING hooks for that device. For a multicast packet, it is scrubbed before being enqueued for transmission, to avoid leaking information to other network namespaces. This foils any plans to learn the originating interface by adding an extra ipvlan device for snooping.
The netdev family of netfilter hooks do not support queuing packets to user space. So what does that mean for this project? Not all hope is lost! We've hit another obstacle, but we have some options. Two strategies come to mind:
- Tap (à-la tcpdump) every ipvtap device with a filter for dhcp/icmpv6 packets and feed them to my program.
- Since netfilter can still emit logs or traces of packets, we can correlate those logs with broadcast/multicast packets which we can receive normally (by creating a socket with the SO_BROADCAST option).
I had hesitated to pursue #1 because I did not want to maintain
a socket for every ipvtap device. However, reading packet(7), it looks like the
first concern is misplaced; the sockaddr_ll
structure includes the
interface index of the link the packet originated from. So I would only
need to maintain one socket per network namespace. In my setup I'm
planning to keep the ipvtap devices all in the same namespace, so it's
very doable.
While I need to add code to setup an AF_PACKET socket and install a filter for DHCP & ICMP packets, I can re-use the existing architecture of the helper, which was comprised of 3 parts:
- One thread that synchronized a local cache of existing links and the addresses assigned to them, fed by an rtnetlink socket.
- One thread that learned new addresses to assign to links, fed by an nfnetlink_queue socket, and queuing commands to an rtnetlink socket.
- One thread that handled I/O to and from the netlink sockets and application queue using select(2)
Most of the program can stay as-is; I just need to replace the nfnetlink_queue socket with an AF_PACKET socket and slightly change the parsing functions, since the input packets will now include the ethernet frame (nfnetlink only gave me the packet from the L3 header up).
In hindsight, I wish I was less stubborn about using netfilter queues, but I also feel like I'm better for having struggled to use them here, as I have a better mental model of how things work.
I wrote a quick proof-of-concept program, snoop.c,
that setup an AF_PACKET
socket, attached a bpf filter, generated by
tcpdump -dd udp portrange 68-67 or udp portrange 546-547 or icmp6
locks the filter so it cannot be changed or detached, and then dumps
everything received by the socket to standard output, while logging
the sockaddr_sll
structure populated by recvmsg(2). I'm able to
see the DHCP and SLAAC-related packets. When they are emitted by the
ipvtap device they have pkttype
PACKET_OUTGOING, and when they are
forwarded to any other ipvlan devices on the same parent, or forwarded
to the parent itself, they have packet type PACKET_BROADCAST.
Because opening an AF_PACKET socket requires a new privilege, CAP_NET_RAW, I will write a separate program that does the following:
- Opens an AF_PACKET socket
- Attaches a BPF filter for ip configuration packets and locks the filter
- Executes into another command line
I'm separating things like this so I can drop privileges between setting up the socket and executing the helper program.
After some more hacking and bug squashing, I can finally achieve my goal:
A Plan 9 VM, using the host wireless interface through an ipvtap device, acquiring an IPv4 address from the local network. Catclock can't bear to watch.
The helper program is able to observe the DHCP requests and replies. It configures a short-lived address (60 seconds) on observing a DHCPOFFER, then updates the lease time when the DHCPREQUEST/DHCPACK transaction is completed. For IPv6, it intercepts Router and Neighbor solicitations to learn and configure the link-local v6 addresses, then whenever it sees a router advertisement, it configures addresses for all prefixes with the proper bits set, masking the link-local address with the prefix.
While this is a success, there are a couple problems I've observed:
-
The client does not receive the first DHCPOFFER, because the address it's offering has not been configured on the interface yet. The client has to send a retry. This could be avoided by replaying the packet after the address is configured, or using netfilter queues, which allow you to hold the packet back until you know it can be delivered.
-
My home router, which I got from my ISP, ignores the Client-identifier option when allocating an IP address. This means my VM receives the same IPv4 address that my host would. It's unfortunate, but I did expect that some routers would do this. IPv6 will still work fine, since clients are able to choose their own address.
I don't mind going ipv6-only for now. In the future I can arrange for clients to share an IPv4 address for outbound requests using NAT. I planned to do so anyway to avoid consuming too many IPv4 addresses on busy public networks, which would be rude.
Now that I've proven the concept works, I can afford to do more extensive cleanup, deleting dead code and such. For this project I tried as hard as I could to defer this stage, so there was a lot of cruft built-up, that felt really good to clean out.
-
I started using the
Cstruct.t
type evyerwhere. It's a simple (buffer, offset, length) triplet that makes it harder to make mistakes when working with regions of bigarrays. -
I removed a ton of dead code. All of the netfilter-related code was removed, since that was a road I chose not to take.
I also added more features to ontap-l3sync
:
- A blacklist of addresses that we should never assign to an ipvtap device, including any addresses with the "default" eui-64 address that is derived from the (shared) MAC address.
- A periodic poll of ipvtap links and any addresses assigned to them,
to detect changes that occur while
ontap-l3sync
is not running or a notification was lost (Netlink can drop messages).
Revisiting the cleanup problem
I've waffled back and forth on what netns to put ipvtap devices. My current implementation requires all of the ipvtap devices managed by ontap-l3sync to live in the same network namespace, so I cannot use namespaces as a cleanup mechanism to destroy the ipvtap devices when the processes using them are gone. I have to think of something else.
What I came up with is to have ontapd use flock(2) to associate a shared lock with the open file descriptor(s) for the tap device before passing it to the client. Then, a new program, ontap-sweep, attempts to take exclusive locks on any ipvtap devices connected to the specified parent device. For devices which are still in use, this will block, and for devices which are no longer in use, this will succeed, and ontap-sweep will proceed to delete them.
In writing this tool, I could not find a way to wait on a large number
of flock(2) calls that didn't involve either polling with non-blocking
flock
, or creating one thread per flock
. I went for the latter,
since I really don't like polling, and I don't expect to have anywhere
near enough interfaces that the number of threads would be a problem.
Despite the difficulties, my approach worked; the ontap-sweep program was able to detect and quickly clean up unused ipvtap interfaces. It worked too well; the detection was so quick that it would delete the interfaces when ontapd created them, but before it called flock on them! Rather than moving ontap-sweep's functionality into ontapd, I decided to allow ontapd to pause ontap-sweep's activity by opening a unix socket with it, which is closed automatically when the ontapd process for a client's request exits.
Future work
I've completed enough that I'm able to start using this project to work on other things. My plan now is to use it for a long time, fixing any problems that come up, but not really to try and change anything too drastically. This project was meant to be a tool to enable quick, secure, and optionally ephemeral sandboxes for other projects I want to work on and services I want to run.
As the result of an exploratory, iterative process, the code base of this project is a mess. It can be drastically simplified and refactored. Specifically:
-
The
Netlink
module could be split out into a standalone library with a more consistent and complete API. I'd like to wait a few kernel releases to get a feel for how often it changes, though. The Linux kernel has a pretty good track record of not breaking userspace programs, so I'm optimistic that the maintenance burden won't be too high. -
The tests should not require any elevated privileges. Right now they need
CAP_MKNOD
. -
Experiment with ways to work around clients not setting client IDs, or DHCP servers not using them to assign clients a unique IP. One idea is to modify the
chaddr
field of the DHCP requests, while acting as a DHCP relay in order to receive the DHCP server's reply.
Other fun stuff
Crashing on errors
I make extensive use of the result
type defined in the OCaml standard
library. A result
can either a value or an error:
match some_operation_that_could_fail args with
| Error e -> handle_error e
| Ok v -> do_thing v
In many cases, what I want to do with an error is exit the function and return the error:
| Error _ as err -> err
| Ok v -> (* keep going *)
Because this is a common idiom, I define a binding operator like so:
let ( let* ) = Result.bind
Then, returning an error or continuing with the rest of the function looks like this:
let* v = some_operation_that_could_fail args in
(* the rest of the function *)
I prefer the result type over exceptions because I can forget to handle an exception, but I can't really forget to handle an error. However, when I'm debugging, I really like to have stack traces showing me where an error was generated. So I changed my binding operator like this:
let ( let* ) ret f =
match ret with
| Error str -> failwith str
| Ok v -> f v
This only work with string errors, which is fine by me.
C constants and enums as GADTs
I chose to represent C constants like RTM_NEWLINK
as variants
rather than integers. The initial reason for this was very superficial;
variant constructors can be capitalized, but ordinary values like integers
must start with a lowercase letter. In addition, variant constructors can be used
in pattern matching, but ordinary values cannot. So this:
enum {
RTM_NEWLINK = 16,
RTM_DELLINK
RTM_GETLINK
}
became
type nlmsg_type =
| RTM_NEWLINK
| RTM_DELLINK
| RTM_GETLINK
this required a bunch of extra plumbing between the C and OCaml functions to convert between the two representations, but I think it was worth it. The netlink APIs have a few areas where you get or set a property, passing a type (key) and payload (value). Here are a few examples:
- setsockopt/getsockopt to set and get options on the netlink socket.
- mnl_attr_put and mnl_attr_get functions set and get attributes at the end of a message.
These are notable because the value has a different shape depending on
the key used. For example, getsockopt(NETLINK_CAP_ACK, ...)
retrieves
an int
, but getsockopt(NETLINK_LIST_MEMBERSHIPS, ...)
retrieves an
array of uint32_t
values. In C, you just read the docs and pass cast
your void*
buffer to whatever type the docs tell you to, and hope you
don't make a mistake. In OCaml we can use the type checker to prevent
ourselves from making an incorrect cast or passing a wrong-size buffer.
Initially you could attempt to use variants for this:
type netlink_sockopt =
| NETLINK_CAP_ACK of bool
| NETLINK_EXT_ACK of bool
| NETLINK_LIST_MEMBERSHIPS of int list
Then, our setsockopt function could look like this:
let mnl_socket_setsockopt sock = function
| NETLINK_CAP_ACK tf -> setsockopt_bool sock 0x0a tf
| NETLINK_EXT_ACK tf -> setsockopt_bool sock 0x0b tf
| NETLINK_LIST_MEMBERSHIPS xs -> setsockopt_list sock 0x09 xs
and call it like this:
mnl_socket_setsockopt sock (NETLINK_EXT_ACK true)
but how would we call the getsockopt function? An expression like this
mnl_socket_getsockopt sock NETLINK_EXT_ACK
is invalid; the NETLINK_EXT_ACK constructore requires one argument. You can use a Generalized algebraic datatype(GADT) here. They are described as a way to write a parser in your type system, but they actually have a lot more uses, as a general way to separate the a value from its type.
type _ netlink_sockopt =
| NETLINK_CAP_ACK : bool netlink_sockopt
| NETLINK_EXT_ACK : bool netlink_sockopt
| NETLINK_LIST_MEMBERSHIPS : int list netlink_sockopt
let mnl_socket_setsockopt (type t) nl (k : t netlink_sockopt) (v : t) =
match k with
| NETLINK_CAP_ACK -> setsockopt_bool nl 0x0a v
| NETLINK_EXT_ACK -> setsockopt_bool nl 0x0b v
| NETLINK_LIST_MEMBERSHIPS -> setsockopt_list sock 0x09 v
To call it, you drop the parenthesis:
mnl_socket_setsockopt sock NETLINK_EXT_ACK true
Getting works as you'd expect:
mnl_socket_getsockopt sock NETLINK_EXT_ACK
And the implementation looks like this:
let mnl_socket_getsockopt : type t. mnl_socket -> t netlink_sockopt -> t =
fun nl opt ->
match opt with
| NETLINK_CAP_ACK -> get_bool_sockopt nl 0x0a
| NETLINK_EXT_ACK -> get_bool_sockopt nl 0x0b
| NETLINK_LIST_MEMBERSHIPS -> get_list_sockopt nl 0x09
As you can see, there are a few more type annotations required in order to appease the type checker, but in most cases, they are tucked away in the library and calling code doesn't have to think about them. The Unix module from the OCaml standard library takes the approach of splitting up the options into separate types based on their underlying types, and having one function for getting and setting each type.
The sockopt case is trivial, but I think there will be other opportunities to use GADTs for performance; creating a variant introduces a level of indirection and creates more work (see Memory Representation of Values for the garbage collector.
Retrospective
If my goal was to get a working service as quickly as possible, with minimum frustration and head-scratching, I utterly failed that goal. At just about every possible step I chose the hard road; faced with a system protocol and library that fully exercises the freedom of C, that has accumulated decades of quirks, that is difficult to debug because most of it runs in the kernel, I decided to try using it from a language I did not fully understand, with its own special flavor of freedom.
The state of the codebase when I was able to use the service "for real" is atrocious. Thousands of lines of code. Inconsistencies everywhere. Shifting patterns in how types are declared and used; you can see how my thoughts and patterns changed over time.
A lot of time was spent simply translating C declarations into Ocaml
ones. I think ppx-cstubs
could have majorly cut down on that aspect. Similarly, ppx-deriving could save me the effort of
defining printers for many types. Many of the time sinks were of my own
making; I deliberately chose to use features of OCaml that I didn't fuly
understand as a way to force myself to learn about them. As an example,
my decision to use GADTs for message attributes really hit me in the
face when I went to implement a function that would iterate over all
attributes. Behold the glory of mnl_ifla_walk
!
let mnl_ifla_walk ~offset fn (init: 'a) nlh =
let children nla = mnl_attr_get_payload nla |> from_voidp T.Nlattr.s in
let continue f k nla len fn init =
let len' = len - mnl_attr_get_len nla in
let nla' = mnl_attr_next nla in
if not (mnl_attr_ok nla' len') then k init
else f k nla' len' fn init
in
let kind = ref "" in
let rec toplevel k (nla:T.Nlattr.t) (len:int) (({f}: 'a ifla_fn) as fn) (init: 'a) =
let next = continue toplevel k nla len fn in
match mnl_attr_get_type nla with
| 0x01 -> next @@ f init IFLA_ADDRESS (mnl_attr_get_bytes nla)
| 0x03 -> next @@ f init IFLA_IFNAME (mnl_attr_get_str nla)
| 0x04 -> next @@ f init IFLA_MTU (mnl_attr_get_u32 nla)
| 0x05 -> next @@ f init IFLA_LINK (mnl_attr_get_u32 nla)
| 0x0d -> next @@ f init IFLA_TXQLEN (mnl_attr_get_u32 nla)
| 0x13 -> next @@ f init IFLA_NET_NS_PID (mnl_attr_get_u32 nla)
| 0x1c -> next @@ f init IFLA_NET_NS_FD (mnl_attr_get_u32 nla)
| 0x1f -> next @@ f init IFLA_NUM_TX_QUEUES (mnl_attr_get_u32 nla)
| 0x20 -> next @@ f init IFLA_NUM_RX_QUEUES (mnl_attr_get_u32 nla)
| 0x12 ->
kind := "";
linkinfo next (children nla) (mnl_attr_get_payload_len nla) fn init
| x -> next @@ f init IFLA_UNKNOWN x
and linkinfo return nla len (({f}: 'a ifla_fn) as fn) (init: 'a) =
let next = continue linkinfo return nla len fn in
match mnl_attr_get_type nla with
| 1 ->
kind := mnl_attr_get_str nla;
next @@ f init IFLA_INFO_KIND !kind
| 2 when !kind = "ipvlan" || !kind = "ipvtap" ->
ipvlan next (children nla) (mnl_attr_get_payload_len nla) fn init
| 2 when !kind = "macvlan" || !kind = "macvtap" ->
macvlan next (children nla) (mnl_attr_get_payload_len nla) fn init
| x -> next @@ f init IFLA_UNKNOWN x
and macvlan return nla len ({f}: 'a ifla_fn) (init: 'a) =
let next = continue macvlan return nla len fn in
match mnl_attr_get_type nla with
| x -> next @@ f init IFLA_UNKNOWN x
and ipvlan return nla len ({f}: 'a ifla_fn) (init: 'a) =
let next = continue ipvlan return nla len fn in
match mnl_attr_get_type nla with
| 0x01 ->
next @@ f init IFLA_IPVLAN_MODE (mnl_attr_get_u16 nla |> ipvlan_mode_of_int)
| 0x02 ->
next @@ f init IFLA_IPVLAN_FLAGS (mnl_attr_get_u16 nla |> ipvlan_flag_of_int)
| x -> next @@ f init IFLA_UNKNOWN x
in
let len = mnl_nlmsg_get_payload_len nlh - offset in
let nla = mnl_nlmsg_get_payload_offset nlh offset |> from_voidp T.Nlattr.s in
if not (mnl_attr_ok nla len) then init
else toplevel Fun.id nla len fn init
It's beautiful ☺. I give it two weeks before I no longer understand it. I wrote it in continuation-passing style because hey, why not? I especially like the naked integer constants. I was literally learning the structure of the data as I went along.
I am confident I could implement this function in a much better way, having done it once. However, one thing I tried to stick to on this project is, as much I could help it, not to go back and "fix" things to make them pleasant to look at. I tried only to address the errors that were coming out of my test runs. I was worried that if I kept going back to fiddle and adjust and rename every little thing, I would never finish.
General, hand-wavy observations
These are just general thoughts I had that I didn't care to develop further, but don't want to lose, so I'm writing them down.
-
OCaml's type system is powerful enough to replace a lot of things I would use unit tests for in other languages. Rather than having tests that check if a function behaves as expected when you pass it bad input, you use the type system to make bad input impossible. Then all your little tests are obviated by the question "does my program compile?"
-
A caveat to that is that using C and, by extension
Ctypes
, subverts most of the benefits that you get from OCaml's type system. It is just as easy to segfault with Ctypes as it would be with C. You have to be fairly disciplined not to make the same kinds of mistakes. -
Dune still feels a lot like magic, and not in a good way. I don't know what it's doing most of the time, so when it doesn't work I'm befuddled. I would like to spend some time trying to build an OCaml project with Make. There are some example Makefiles floating around the internet, but they are all too complicated because they try to be as flexible as possible and automatically resolve everything for you. I have no qualms with listing and maintaining inter-module dependencies manually.
-
I think that there is a feedback loop between the build tools used for a language and the complexity of building software in that language. The more language-specific help a build tool gives a language, the more license the users of that language have to pull in dependencies that they may not have otherwise, introduce convoluted conditional compilation rules, multi-stage code generation, and in general build software in a way that may seem convenient for the developer in the moment, but could make things harder for packagers, distributors, and new contributors to the software in the long run. A simpler way of putting it; if your build tool makes building complex software easy, developers will build more complex software. If the tool doesn't make building complex software easy, they will build simpler software. The key is finding the right balance between irritating developers and enabling them to commit atrocities.
-
The thing that excites me the most about GADTs is that they separate the type of a value from its representation. If you're familiar with how OCaml represents values, they are boxed and look like this:
Then there's a little trick OCaml does to pack 63 (or 31, depending on your platform)-bit integers into the header for performance, because integers are so commonly used. The representation is reasonably compact, and the uniformity allows OCaml to implement generic functions in a really elegant way. However, a tradeoff of this approach is that when you construct a value, for example
let xid = ... in let yiaddr = ... in let address_time = ... in DHCPOFFER { xid; yiaddr; address_time }
the constituent elements of the value, which you may have already allocated in other parts of the program, must be copied into a box. This is fine if the values are small strings and integers like the example above. But when the values are packet payloads which can be up to 64KiB in size, copying these values creates work for the garbage collector, which can introduce unexpected latency.
What a GADT lets you do is effectively cleave the header off of the value and manipulate it separately:
not only is this really appealing for performance reasons, It's also appealing because you can use different headers to interpret the same data. The same buffer can be a different type of packet based on the GADT you pair it with.
I've had an idea floating around in my head that you could come up with a GADT that describes the layout of a piece of binary data, and have a set of functions that produce optimized versions of that GADT for encoding or decoding, or even decoding specific fields. Similar to what
printf
andscanf
does for strings, or even what theppx_cstruct
preprocessor does for binary data, but without the preprocessing part. It would be really cool if you could build those optimized versions at compile time, but I don't know if that's possible. -
Binding to libmnl was mostly waste of time. If I could go back, I would have only used ctypes to access the various constants used by the netlink protocols, and to fill in the parts of the sockets(7) api that are missing from the
Unix
module, and implemented the rest of the functionality in pure OCaml. It is fun and easy to decode and encode binary data in OCaml, especially with modules likeCstruct
. And writing C stubs "by hand" is not really that tedious.
Further reading
I consulted all these documents and more during the course of this work. Upon reflection, my initial approach of jumping from example to blog post to example to tracing iproute2 to spelunking through the iproute2 and kernel source code was like the equivalent of "button mashing" in a fighting game; an approach that works well enough to make you think you're good, until you face a real challenge.
- See also
-
A message-oriented pipe
Feb 2025
Fun with ring buffers