Battle of the mesh VPNs

Lately, I have been looking for a good “mesh VPN” solution à la ZeroTier, LogMeIn Hamachi, or Tailscale. It was a hard decision, so I am writing a blog post in case I need to research this again in the future.

Rationale

Why was I looking for a new mesh VPN solution?

I currently maintain a small Active Directory deployment with 2 domain controllers to manage authentication and SSO for some of my services. For a while, I have been looking to expand this such that all of my services that support LDAP or SSO can authenticate using AD credentials to simplify credential management.

However, since it is not safe whatsoever to run Active Directory exposed to the public internet, I needed a VPN solution to allow all of my servers and devices to connect to my AD domain controllers on a private and secure internal network.

Up until now, I have been using ZeroTier’s SaaS version; however, a recent change that limits the free plan to a maximum of 50 devices means I would need to pay $50/month to connect all of my devices. Furthermore, I noticed the zerotier-one client has a bug where not all managed routes are consistently applied. My AD domain controllers are in the subnet 172.17.0.0/24, while all of my other servers are in the subnet 172.18.0.0/24. In many cases, zerotier-one only created a route for 172.18.0.0/24 over the ZeroTier interface, meaning the route for 172.17.0.0/24 had to be created at boot using some hacky method like rc.local.
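To make the routing problem concrete, here is a minimal Python sketch of the check a boot-time workaround would need: compare the managed routes the network is supposed to push against the routes that actually got installed, and build one `ip route replace` command per gap. The two subnets come from this post; the interface name `zt0` and the function names are made-up placeholders, and the sketch only prints the commands instead of running them.

```python
import ipaddress

# Managed routes from the post.
EXPECTED = ["172.17.0.0/24", "172.18.0.0/24"]


def missing_routes(expected, installed):
    """Return the expected networks that no installed route covers."""
    installed_nets = [ipaddress.ip_network(r) for r in installed]
    missing = []
    for route in expected:
        net = ipaddress.ip_network(route)
        if not any(net.subnet_of(have) for have in installed_nets):
            missing.append(str(net))
    return missing


def repair_commands(missing, iface):
    """Build (but do not run) one `ip route replace` command per missing route."""
    return [["ip", "route", "replace", net, "dev", iface] for net in missing]


if __name__ == "__main__":
    # Simulates the failure mode described above: only 172.18.0.0/24 was installed.
    # "zt0" is a hypothetical interface name, not the real ZeroTier device name.
    gaps = missing_routes(EXPECTED, ["172.18.0.0/24"])
    for cmd in repair_commands(gaps, "zt0"):
        print(" ".join(cmd))
```

In a real workaround the installed routes would have to be read from the system first; that part is left out here because it depends on the platform.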

Criteria

Note: When I say “node”, I am referring to a single member of the mesh network.

I had a set of (rather narrow) criteria when looking for a VPN solution:

Free (and ideally open-source) for unlimited devices (for scalability reasons – I’m already well over the 50-device limit of ZeroTier’s SaaS offering and I could see myself exceeding Tailscale’s 100-device limit.)
Easy management and low-maintenance (i.e., no need to manually reconfigure every node when adding a new node like a traditional VPN deployment might entail)
Automatically manages routes – all nodes should be able to access all other nodes
Reasonably fast
Traffic encrypted in-transit
Has clients for Linux, Windows, and Android (to connect my other servers, devices, and domain controllers)

You should know that I am by no means a security or networking expert, so this post focuses more on each solution’s features rather than their security or lower-level networking stack implementations.

Software I considered

After some research on Google and Reddit, I discovered five solutions for my internal mesh VPN that fulfill the above criteria:

slackhq/nebula
gravitl/netmaker
tonarino/innernet
ZeroTier on-premises through key-networks/ztncui
costela/wesher

slackhq/nebula

Nebula seems like a very popular alternative to ZeroTier and Tailscale. It’s based on the Noise Protocol Framework (like WireGuard) and is known for being extremely fast. The YAML-based configuration syntax is simple, and little configuration is required besides making each node aware of a central “lighthouse” that lets nodes route to all other nodes. Internally, all traffic is encrypted, and each node has its own firewall built into the Nebula client. Clients are available for almost all the OSes under the sun. But most importantly, Slack currently uses it in production (and Slack Engineering claims most production traffic runs over Nebula tunnels), so it is battle-tested and stable.

Problems: The main problem I noticed with Nebula was actually its method of configuration. Having to configure lighthouses and firewall rules independently on every single node could lead to scalability problems if, for example, I have hundreds of nodes and want to add a new lighthouse. (In contrast, solutions like ZeroTier allow firewall rules to be configured centrally.) If you use a configuration management tool like SaltStack or Ansible, this could be a non-issue; however, I don’t use either of these.
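A rough sketch of why this matters: if every node's config embeds the lighthouse list, then adding one lighthouse changes the config of every node. The Python below renders the lighthouse-related part of a per-node config from a small inventory and lists which nodes would need a new file. The key names (`pki`, `static_host_map`, `lighthouse.am_lighthouse`, `lighthouse.hosts`) follow Nebula's example config as far as I know; the node names, overlay IPs, endpoints, and file paths are all invented for illustration.

```python
import json

# Hypothetical inventory: overlay IP -> public endpoint of each lighthouse.
LIGHTHOUSES = {"192.168.100.1": "198.51.100.10:4242"}
NODES = ["web1", "db1", "dc1"]  # made-up node names


def render_node_config(node, lighthouses):
    """Build the lighthouse-related part of one non-lighthouse node's config."""
    return {
        "pki": {
            "ca": "/etc/nebula/ca.crt",
            "cert": f"/etc/nebula/{node}.crt",
            "key": f"/etc/nebula/{node}.key",
        },
        "static_host_map": {ip: [endpoint] for ip, endpoint in lighthouses.items()},
        "lighthouse": {"am_lighthouse": False, "hosts": sorted(lighthouses)},
    }


def nodes_to_update(nodes, old, new):
    """List nodes whose rendered config differs between two lighthouse sets."""
    return [n for n in nodes if render_node_config(n, old) != render_node_config(n, new)]


if __name__ == "__main__":
    # Add a second (hypothetical) lighthouse and see who needs a new config.
    added = dict(LIGHTHOUSES)
    added["192.168.100.2"] = "203.0.113.20:4242"
    print(nodes_to_update(NODES, LIGHTHOUSES, added))
    # Printed as JSON, which YAML parsers generally accept.
    print(json.dumps(render_node_config("web1", added), indent=2))
```

Every node shows up in the update list, which is exactly the chore a configuration management tool would normally absorb.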

Also, Nebula’s clients seem, at the time of writing, a little incomplete. Unlike ZeroTier’s clients (which are quite mature and well-supported), the Nebula clients lack basic features like a built-in system service that starts the client at system startup. It is easy to install a systemd service for Nebula, but this is more complicated on Windows systems.

gravitl/netmaker

I hadn’t heard of netmaker before, but it seems to be a new project that is quite popular on the /r/selfhosted subreddit. As the repo’s README puts it, netmaker is “like Tailscale, ZeroTier, or Nebula, but faster, easier, and more dynamic”. The architecture seems to be very similar to ZeroTier’s: one central server has a management interface, and all other nodes can be joined to the network with a single command. The main difference is that netmaker uses WireGuard as the underlying VPN protocol (rather than ZeroTier’s proprietary protocol). Netmaker also has built-in DNS services, so it’s possible to create internal domains that can only be resolved inside the private network.

Problems: The main problem for my use case at the time of writing is that Windows and Android clients are not supported (not even by exporting WireGuard configurations), but the maintainers claim a Windows client will exist “in future releases”.

Interestingly, it has a feature very similar to Tailscale’s “relay nodes” functionality:

Example: You create a network in netmaker called Homenet. It has several machines on your home server. You create another network called Cloudnet. It has several machines in AWS. You have one server (server X) which is added to both networks. On Cloudnet, you make Server X a gateway to Homenet. Now, the cloudnet machines have access to your homenet machines via Server X.

This could hypothetically be used to bypass the lack of a Windows or Android client by running a separate, Windows and Android compatible VPN server (like WireGuard, OpenVPN, etc.) on a gateway connected to the netmaker network.

tonarino/innernet

Innernet seems to be a more mature version of netmaker, sans management GUI and DNS features. Like netmaker, it also uses WireGuard and features a similar architecture. It does have a unique feature where the admin can create logically separate and isolated subnets for different purposes, similar to the concept of “networks” in ZeroTier.

Problems: I noticed two scalability issues: a) each new node needs to receive a file to join the network rather than using, say, an invitation code, and b) nodes cannot be deleted. I wouldn’t be too bothered by the first issue (hosting the invitation file in a private S3 bucket is easy enough), but the fact that nodes can’t be deleted is a problem for perfectionist me 😂. I would prefer not to run out of IP addresses in a few years, since I am still buying new/cancelling old machines quite frequently.
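As a back-of-the-envelope check on the address concern, the sketch below estimates how many months an address pool lasts when addresses are never reclaimed. The /24 pool, the current node count, and the churn rate are all hypothetical numbers chosen for illustration, not measurements from my setup, and the function name is made up.

```python
import ipaddress


def months_until_exhausted(cidr, current_nodes, new_nodes_per_month):
    """Months until a pool with no address reuse runs out of host addresses.

    Assumes an IPv4 prefix shorter than /31, so the network and broadcast
    addresses are subtracted from the usable total.
    """
    if new_nodes_per_month <= 0:
        raise ValueError("new_nodes_per_month must be positive")
    usable = ipaddress.ip_network(cidr).num_addresses - 2
    remaining = usable - current_nodes
    if remaining <= 0:
        return 0
    return remaining // new_nodes_per_month


if __name__ == "__main__":
    # Hypothetical: a /24 overlay, 60 nodes today, 4 machines replaced per month.
    print(months_until_exhausted("10.42.0.0/24", 60, 4))
```

With those made-up numbers the pool lasts about four years, so a larger prefix chosen up front would postpone the problem but not remove it.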

Also, like netmaker, innernet lacks Windows and Android clients.

ZeroTier on premises

ZeroTier has always had some sort of open-source on-premises version, but the folks at Key Networks have made a friendly monolithic solution called ztncui. This functions just like ZeroTier’s SaaS version (down to the 16-character network join codes and the easy zerotier-one client), but doesn’t have the device limitations. Compared to some of the others, this certainly seems like the most convenient option.

Problems: If I decided to go this route, I would still have to deal with the bugginess in the zerotier-one client (or renumber such that I only have 1 managed route). It also seems like the repository may no longer be maintained.

costela/wesher

This is another WireGuard-based solution. Of the WireGuard-based solutions, this seems to be the simplest by far: no firewalls, no network segmentation rules, no certificates to share. It simply relies on a pre-shared key (a “cluster key”) to authenticate new nodes attempting to join the network.

Problems: The biggest problem with wesher is that, at the time of writing, it is not possible to rotate the cluster key. So, if anyone gets access to my cluster key, I would probably end up having to rebuild my entire mesh. Not fun.

Additional notes

All of the solutions use some sort of centralized server that acts as a way for nodes to discover other nodes. It should be noted that this does represent a single point of failure; however, my understanding is that a short-term downtime of the central server should be no problem (since all of the solutions I evaluated only use it as a means to generate configurations/connections rather than actually acting as a router). So for my use case, I think it’s enough to run the central server on a reasonably reliable hosting provider and take frequent backups.

Final setup

Based on what I’ve read about these five solutions, I feel the simplest way to fulfill my needs is to use on premises ZeroTier and renumber to use only one subnet.
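For the renumbering step, one possible approach (a sketch, not a finished migration plan) is to move each host from the old subnet to the same host offset in a new range that sits next to the existing one, so that a single managed route covers everything. In the example, the old subnet 172.17.0.0/24 is from this post; the target range 172.18.1.0/24, the covering route 172.18.0.0/23, and the two host addresses are assumptions picked for illustration.

```python
import ipaddress


def renumber(addresses, old_cidr, new_cidr):
    """Map each address in old_cidr to the same host offset inside new_cidr."""
    old_net = ipaddress.ip_network(old_cidr)
    new_net = ipaddress.ip_network(new_cidr)
    if new_net.num_addresses < old_net.num_addresses:
        raise ValueError("target network is smaller than the source network")
    plan = {}
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        if ip not in old_net:
            raise ValueError(f"{addr} is not in {old_cidr}")
        offset = int(ip) - int(old_net.network_address)
        plan[addr] = str(new_net.network_address + offset)
    return plan


if __name__ == "__main__":
    # Hypothetical hosts .10 and .11 move from 172.17.0.0/24 to 172.18.1.0/24.
    plan = renumber(["172.17.0.10", "172.17.0.11"], "172.17.0.0/24", "172.18.1.0/24")
    for old, new in plan.items():
        print(f"{old} -> {new}")
    # Both the untouched 172.18.0.0/24 and the new range fit in one /23 route.
    single_route = ipaddress.ip_network("172.18.0.0/23")
    print(all(ipaddress.ip_address(new) in single_route for new in plan.values()))
```

Keeping the host offsets stable means the last octet of each machine stays the same, which makes the change easier to follow in DNS records and firewall rules.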

If Nebula had some sort of central configuration method and each node could easily be joined to the network with a one-liner (that could be executed through cloud-init user data), I would probably be using it. But right now, it seems like it would be too much work to scale it across all of my infrastructure.

Published on June 1, 2021