Keybard-Vagabond-Demo

This is the portion of the Keybard Vagabond source that I'm open to sharing, based off of the main private repository.

This is something I built using online guides such as https://datavirke.dk/posts/bare-metal-kubernetes-part-1-talos-on-hetzner/ along with Cursor for help. There are some things that aren't ideal but work, which I'll try to outline. Frankly, things here may be more complicated than necessary, so I'm not confident in saying that anyone should use this as a reference; it's more a record of work I've done. I ran into quite a few unexpected issues, which I'll document to the best of my memory, so I hope it may help someone.

Background

This is a 3-node ARM VPS cluster running bare-metal Kubernetes and hosting various fediverse applications. My provider is not Hetzner, so not everything in the guide applies here. If you do use the guide, do NOT change your local domain from cluster.local to local.your-domain. That change caused so many headaches that I eventually went back and restarted the process without it. It wound up causing me a lot of issues around OpenObserve, and there are a lot of things in there that are aliased incorrectly, but I now have dashboards working and don't want to change it. Don't use my OpenObserve setup as a reference for your project - it's a bit of a mess.

I chose the 10 vCPU / 16GB RAM nodes at around 11 euros each. I probably should have gone up to 15 euros for the 24GB RAM nodes, but for now the 16GB nodes are doing fine.

  • Authentik
    The cluster runs Authentik, but unfortunately I wasn't able to integrate it with as many applications as I wanted. It does have a custom flow so that users can use it to sign up for WriteFreely; this is done to prevent spam.

  • WriteFreely
    A minimalist blog. It uses a local SQLite database, so it only runs one instance. It was one of the first real apps I installed, before CloudNativePG was set up, and I still debate whether that was the right choice. At one point I almost lost the blogs in a disaster recovery incident (self-inflicted, of course) because I forgot to add the Longhorn attributes to the volume claim declaration, so I thought it was backed up to S3 when it wasn't (see the sketch below).
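
For context on what went wrong: a volume only gets shipped to S3 if it's tied to a Longhorn recurring backup job. This is a minimal sketch with hypothetical names, assuming your Longhorn version supports the recurring-job group label directly on the PVC (some versions want it on the Longhorn Volume object instead):

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"      # nightly backup window
  task: backup           # ship snapshots to the configured S3 backup target
  groups: ["default"]    # volumes in the default group pick this job up
  retain: 7
  concurrency: 1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: writefreely-data            # hypothetical claim name
  labels:
    # the attribute I forgot: without it, nothing goes to S3
    recurring-job-group.longhorn.io/default: enabled
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi
```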

  • BookWyrm, Pixelfed, PieFed
    These all have their own custom builds that pull source code and create separate images for the workers and the web apps. I don't mind the workers being more resource-constrained, since they catch up eventually and have horizontal scaling set at pretty high thresholds if they really need it, though that's rare (see the sketch below). I imagine the Docker builds could be cleaner and would always appreciate review. One of my concerns with the images was the final size, which is around 300-400MB per application.
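
As a rough illustration of the "high threshold" scaling, here's a minimal autoscaler sketch; the names are hypothetical and the real manifests differ per app:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bookwyrm-worker            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bookwyrm-worker
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # deliberately high so workers only scale out under sustained pressure
          averageUtilization: 85
```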

  • Infrastructure - FluxCD
    FluxCD is used for continuous delivery and maintaining state. I use it instead of ArgoCD because that's what the guide used. The same goes for OpenObserve, though it also has a smaller resource footprint than Grafana, which was important to me since I wanted to keep certain resource usage lower. SOPS is used for secret encryption, again because the guide used it, but I've committed enough unencrypted secrets to source that I eventually want to self-host a secret manager. That's in the back of my mind as a nice-to-have. The SOPS wiring on the Flux side is sketched below.
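
A minimal sketch of the decryption hookup, assuming an age private key stored in a `sops-age` secret (names are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                 # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops           # Flux decrypts SOPS-encrypted manifests on the fly
    secretRef:
      name: sops-age         # assumption: the age key lives in this secret
```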

  • Infrastructure - Harbor Registry
    I'm running my own registry based on the guide I used, and it's been a mixed bag. On one hand it's nice to have a private registry for my own custom builds; on the other, Harbor gave me many issues for a long time. Another thing to bear in mind is that I'm using Cloudflare Tunnels for secure access, and the free and base tiers have a 100MB upload limit, which matters when pushing large image layers. For a long time I debated whether it was worth hosting, but now that I haven't had any issues in a while, I don't mind it. It does unfortunately still use the Bitnami charts, which are deprecated for non-paying customers, so that portion of my code shouldn't be used as a reference and another solution should be found. I don't know where or what that is, though. The tunnel side looks roughly like the sketch below.
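
A sketch of the tunnel configuration; the hostname and service address here are stand-ins, not my real values:

```yaml
# cloudflared config.yaml (sketch)
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/credentials.json
ingress:
  - hostname: registry.example.com            # stand-in hostname
    service: https://harbor-core.harbor.svc.cluster.local:443
    originRequest:
      noTLSVerify: true                       # if Harbor uses a self-signed cert in-cluster
  - service: http_status:404                  # catch-all rule cloudflared requires
```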

  • Infrastructure - Longhorn
    The storage portion of the services was interesting. The guide originally used Rook Ceph, which I went with, but each of my nodes has 512GB of SSD storage that I didn't want to give up. After a lot of troubleshooting, I realized that Rook only works with whole drives while Longhorn allows partitioning, so I partitioned each SSD into a portion for Talos and the rest for Longhorn. I had to get a custom build of Talos with the proper storage drivers, but once that was up, everything worked fairly well (see the machine-config sketch below).
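
The relevant machine-config pieces look roughly like this. Treat it as a sketch, not my exact config: the disks stanza varies by Talos version, and the installer image is assumed to be a custom Image Factory build that bakes in the iscsi-tools extension Longhorn depends on:

```yaml
machine:
  install:
    disk: /dev/sda                 # Talos takes its own partition on the shared SSD
    # custom factory build with the iscsi-tools extension for Longhorn
    image: factory.talos.dev/installer/<schematic-id>:v1.8.0
  disks:
    - device: /dev/sda4            # hypothetical spare partition reserved for Longhorn
      partitions:
        - mountpoint: /var/lib/longhorn
  kubelet:
    extraMounts:
      # expose the Longhorn data path to the kubelet so volumes can attach
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options: [bind, rshared, rw]
```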

There was a problem, though. At the time of writing there's still a bug and a GitHub issue (documented in the readme) where Longhorn makes millions of S3 ListObjects requests. That's a billed endpoint, so I was paying less than $5 for storage and over $25 for these calls. The workaround I use now, taken from the GitHub issue, is a set of cron jobs that create and remove network policies so that Longhorn can only reach S3 during the backup window (sketched below). The team does have it on the radar, so hopefully it will be resolved.
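
A sketch of that workaround, with hypothetical names: the policy denies Longhorn's egress to the internet, and a paired CronJob deletes it just before the backup window (a second one re-applies it afterwards):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-longhorn-s3          # hypothetical name
  namespace: longhorn-system
spec:
  podSelector:
    matchLabels:
      app: longhorn-manager        # assumption: the manager pods make the S3 list calls
  policyTypes: [Egress]
  egress:
    # allow only in-cluster traffic; anything bound for S3 on the internet is dropped
    - to:
        - namespaceSelector: {}
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: lift-s3-block              # hypothetical name
  namespace: longhorn-system
spec:
  schedule: "55 1 * * *"           # just before the nightly backup window
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: netpol-editor   # hypothetical SA with delete rights on networkpolicies
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.30
              args: ["delete", "networkpolicy", "block-longhorn-s3",
                     "-n", "longhorn-system", "--ignore-not-found"]
```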

  • Infrastructure - CDN
    My S3 provider has a deal with Cloudflare for unlimited egress when using their CDN, so assets are routed and cached through Cloudflare. I also use the CDN for various static assets and federation endpoints to take load off the servers.

Standard performance

In this configuration, with me currently as the only user (feel free to sign up on any of the fediverse sites via the home page!), CPU typically sits in the low 20% range and Kubernetes reports memory around 75%. However, the dashboards show a bit less, with the main control plane around 12GB of 16GB and the other nodes around 9GB of 16GB. Requests and federation do quite well, and federation backlogs have been handled well by the Redis queues. At one point a fediverse bad actor creating spam took down another server, which slowed federation requests; the queues backed up to over 175k messages, but they were eventually processed over the next few hours.

One thing to note is that PieFed has performance optimizations for CDN caching of various fediverse endpoints, which helps a lot.

Database

The database is a specific image of PostgreSQL with the PostGIS extension. What's odd here is that the default Postgres image does not include the PostGIS extension, and the main PostGIS image repository doesn't officially support the ARM architecture. I managed to find one on version 16 and am using that for now. I'm doing my own build based off of it and have it in the back of my mind to eventually upgrade to a newer major version. Bear this in mind if you go ARM.

CloudNativePG is what I use for the database. There is one primary (write) instance and two read replicas, with pod anti-affinity so that there's only one instance per node (see the sketch below). They're currently allowed up to 4GB of RAM but typically use 1.5-1.7GB, and metrics report a buffer cache hit rate of nearly 100%. Once more users show up I'll re-evaluate the resource allocations or see if I need to add a larger node. Some of the apps, like Mastodon, are pretty good about supporting read-replica connection strings, which helps spread the load and favors horizontal over vertical scaling.
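
A minimal sketch of that shape, with a hypothetical cluster name and a stand-in for the custom ARM PostGIS image:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: main-db                    # hypothetical name
spec:
  instances: 3                     # one primary + two read replicas
  imageName: ghcr.io/example/postgis:16   # stand-in for the custom ARM PostGIS build
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname    # at most one instance per node
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 4Gi                  # the 4GB ceiling mentioned above
  storage:
    size: 50Gi                     # stand-in size
    storageClass: longhorn
```

CloudNativePG exposes separate `<name>-rw` and `<name>-ro` services, so apps that accept a read-replica connection string can be pointed at the `-ro` endpoint.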

Strange Things - Python app configmaps

The apps that run on Python tend to use .env files for settings management. I was trying to reconcile the stateless nature of Kubernetes with the stateful nature of .env files, and settled on encrypting the ConfigMap (secrets and all) and having a script copy it to the filesystem if no .env is there already. The benefit is that I have a baseline copy of the config that can be managed automatically; the downside is that it's a reference that needs to be maintained and can make things a bit weird. I'm not sure if this is the best approach or not. But that's why you'll find some ConfigMaps that contain secrets and are encrypted in their entirety. The copy step amounts to an init container along the lines sketched below.
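
A sketch of that init container; the names are hypothetical and the real scripts differ per app:

```yaml
# excerpt from a Deployment pod spec
initContainers:
  - name: seed-env
    image: busybox:1.36
    command:
      - sh
      - -c
      # only seed the baseline .env if the app doesn't already have one on its volume
      - '[ -f /data/.env ] || cp /config/.env /data/.env'
    volumeMounts:
      - name: app-data             # hypothetical PVC holding the app's state
        mountPath: /data
      - name: env-config           # hypothetical ConfigMap carrying the baseline .env
        mountPath: /config
```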

Strange Things - Open Observe

OpenObserve became very bloated in its configuration. It was one of the first things I installed, and I believe some of what I set up was already out of date; combined with the cluster.local issue, getting it to work became a mess. I have metrics, logs, and dashboards working, so I'm not going to change anything, but I'd use something else as a reference.

Documentation

There are a lot of documentation files in the source. Many of these are just as much for the AI agents as they are for humans. The .cursor directory is mainly for the AI, to preserve some context about the project and provide examples of how things are done. Typically, each application has its own README or other documentation based on some issue I ran into. Most of it is reference for me rather than for a person trying to do an implementation, so take it for what it is.

AI Usage

AI was used extensively in the process and has been quite good at doing templatey things once I got a general pattern set up. Indexing documentation sites (why can't we download the docs??) and downloading source code was very helpful for the agents. However, I'm also aware that some things are probably too complicated or not quite optimized in the builds, and that a more experienced person could probably do better. It's still an open question in my mind whether the AI tools saved time overall. On one hand, they were very fast at debugging issues and executing kubectl commands, which alone would have saved me a ton of time. On the other, without them I may have wound up with something simpler. I think it's a mixture of both, because there were certainly things the agent found quickly that would have taken me far longer.

I'm still using the various agents provided by Cursor (I can't use the highest-end ones all the time because I'm on the $20/month plan). I learned a lot about using Cursor rules, indexing documentation, etc., to help the agent out rather than relying on its implicit knowledge.

Overall, it's been an interesting use case and I'm sure someone who's better in certain areas than I am will point out some problems. And please do! I did this project to learn and this sort of infrastructure is a big beast.