AI Archives - Ian's Wild Ramblings
https://www.ianswildramblings.com/tag/ai/
I think I hate the box in my pocket

Towards Personal ML Ops with Headless Training
https://www.ianswildramblings.com/personal-headless-training-node/
Fri, 13 Feb 2026
For years now, I’ve trained neural networks for my research using what I thought was the easy, standard way to do things. You get access to a server, log into it, pull some code off of GitHub, make some config changes, then run it. To monitor the training progress, you fire up TensorBoard and point it at the log directory. It didn’t matter if you were using PyTorch instead of TensorFlow; TensorBoard works for that, too. It’s only recently that I wondered how *real* AI people do this kind of thing. Turns out they use headless training nodes and object stores.

If you want to skip the recipe-blog-esque exposé, jump straight to The Actual Solution below.

How we got here

I recently had a power supply fan bearing go bad in my lab workstation. This meant that if I trained for an extended period of time on the 5090 inside, the machine’s power supply would start making a deafening screaming sound from the bearing holding on for dear life. This was disruptive not only to me, but undoubtedly to the others in my lab.

I can only guess how distracting it was, given that I kept finding my computer mysteriously turned off in the morning after trying to train all night: likely a colleague trying to stop the noise. The initial problem was that the power supply fan would only kick on under sufficient load and temperature, a condition only met after the machine had been training for a while, which often only happened after I had left for the day or weekend. This meant the only people with knowledge of the problem were the nocturnal grad students, who were likely intensely annoyed by it.

When I finally heard the sound and diagnosed it as the PSU fan, I brought it up to my advisor, and we got the new part ordered. There will be another post here soon when that comes in documenting the replacement process, mostly for life insurance purposes in case the fist-sized capacitor within decides my heart is an acceptable path to ground.

No Space, Mo Problems

Since my lab machine was out of commission, I was directed to ask another student for access to the machine they trained on: one shared with another department whose work involved massive datasets. Due to the size of these datasets and, likely, a lack of systems knowledge about where to properly store them, the server frequently ran out of disk space.

This posed a major challenge. Training runs would start just fine, but once the disk filled up, they could no longer write checkpoints; they would then either crash or simply stop recording them. Crashing, I deemed, was the better failure mode, because what’s the point of burning electricity just to not save any results? This left me with a machine that had large amounts of compute and RAM, usually unused, but almost no disk space: just enough to store my code and the dataset. I was lucky to get my 6GB dataset moved in right as someone freed up some space for it.

This left me wondering: surely the *real* AI people don’t just treat every compute node as a storage node too. What tools do they use to stream the logs and checkpoints somewhere else?

Headless Training Tech Stack

In order for my findings to actually be useful to anyone, I feel it’s important to say what my software stack looks like. This is also so that SEO can hopefully bring some people here if they’re trying to solve the same problems. It’s no secret how important a role dev blogs play in helping other people solve problems.

The project is a point cloud network, so I’m using PyTorch, PyTorch3D, and PyTorch Lightning on CUDA machines for training. For SDN connectivity, I use Tailscale, so every machine can talk to the others and network setup stays simple. Some of the components run in Docker as well.

  • GPU workstation with the bad fan; it runs Docker and serves as the artifact store
  • GPU server on the same LAN as the workstation, with almost no free space
  • A laptop for monitoring, assumed to be reached over WAN

My ultimate goal would be to treat the GPU server as purely a compute node. It would need only space for the Python libraries and the training code. Another machine would stream everything else.

At this point, though, the dataset still lives locally, but I stream all the training checkpoints and logging files to the GPU workstation. Here’s how:

The Actual Solution

PyTorch Lightning will happily stream checkpoints to and from S3 to keep them off your disk. You just need one extra package installed, plus some environment variables set to keep the credentials out of your repo. The package you want is called s3fs:

uv add s3fs
# or
pip install s3fs
# or whatever env and package manager you use

With s3fs installed, you can substitute an S3 path for a local file path in any Lightning argument that expects one. So:

"~/project/checkpoints/experiment-02/last.ckpt"

becomes

"s3://training-artifacts/experiment-02/last.ckpt"
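To be concrete about the substitution, here’s a tiny hypothetical helper (the name `to_s3_path` is my own, not a Lightning API) that rewrites a local checkpoint path onto the bucket:

```python
from pathlib import PurePosixPath

def to_s3_path(local_path: str, bucket: str) -> str:
    """Map a local checkpoint path onto an S3 bucket, keeping the
    experiment subdirectory and filename."""
    p = PurePosixPath(local_path)
    # Keep the last two components, e.g. "experiment-02/last.ckpt"
    return f"s3://{bucket}/{'/'.join(p.parts[-2:])}"

print(to_s3_path("~/project/checkpoints/experiment-02/last.ckpt",
                 "training-artifacts"))
# s3://training-artifacts/experiment-02/last.ckpt
```

Anywhere Lightning accepts the local string, the rewritten string works too.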

This will work automagically as long as you have the proper environment variables set in something like a .env file loaded with the dotenv package:

#.env
AWS_ACCESS_KEY_ID=AccountName
AWS_SECRET_ACCESS_KEY=SuperSecretPassword
AWS_ENDPOINT_URL=http://Some.S3.Compatible.Service
AWS_REGION=us-east-1

#train.py
import os
from dotenv import load_dotenv
from pytorch_lightning.callbacks import ModelCheckpoint
# ... other imports ...

load_dotenv()
if not os.getenv("AWS_ENDPOINT_URL"):
    raise ValueError("Missing AWS_ENDPOINT_URL. Check your .env file.")

S3_BUCKET = "s3://training-artifacts"

...

checkpoint = ModelCheckpoint(
        monitor='val_loss',
        dirpath=f"{S3_BUCKET}/checkpoints/",
        # other args ...
    )

And just like that, PyTorch Lightning will know to snag your initial checkpoint from the bucket and store anything that the checkpoint callback makes back into the bucket.
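If you’d rather fail fast on all missing credentials at once instead of one variable at a time, a small stdlib-only helper does the trick (the names here are my own invention, not part of any library):

```python
import os

# The variables s3fs/botocore read for a custom S3-compatible endpoint.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL",
    "AWS_REGION",
)

def missing_env(required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not os.getenv(name)]

def check_env():
    """Raise one error listing every missing credential variable."""
    missing = missing_env()
    if missing:
        raise ValueError(
            f"Missing env vars: {', '.join(missing)}. Check your .env file."
        )
```

Call `check_env()` right after `load_dotenv()` and you get the whole list of problems in one shot.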

Self-Hosted S3

I chose to use MinIO to provide S3 capability. I ran it using Docker Compose on the workstation so that I could utilize the fast LAN connection between it and the training server. The compose file would look something like this:

#docker-compose.yaml
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    restart: always
    volumes:
      - ./minio_data:/data # Make sure to map in a directory to store everything
    ports:
      - "9000:9000" # S3 API
      - "9001:9001" # Web console; must match --console-address below
    environment:
      MINIO_ROOT_USER: "admin"
      MINIO_ROOT_PASSWORD: "StrongPassword"
    command: server /data --console-address ":9001"

And then, with a quick docker compose up -d, you can have a MinIO S3 server running for PyTorch Lightning to use as its datastore.

But wait. What if my checkpoints are really big? You don’t want your GPU sitting idle while you save a file across the network, obviously. Luckily, we live in a modern world with async callbacks, and you can easily configure Lightning to do the saving in a separate thread by passing a plugin to the trainer:

import pytorch_lightning as pl
from pytorch_lightning.plugins import AsyncCheckpointIO

trainer = pl.Trainer(
        max_steps=MAX_STEPS,
        accelerator='gpu',
        plugins=[AsyncCheckpointIO()],
        # other args ...
    )

And just like that, you won’t be writing a single byte of data to disk on your training machine during the run.
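For intuition, the trick behind this plugin is just handing each blocking save off to a background thread so the training loop never waits on storage. This is my own stdlib sketch of the idea, not Lightning’s actual implementation:

```python
import threading

class AsyncSaver:
    """Run blocking save calls on a background thread so the caller
    (think: the training loop) never waits on storage."""

    def __init__(self, save_fn):
        self._save_fn = save_fn
        self._pending = []

    def save(self, payload, path):
        # Fire off the save and return to the caller immediately.
        t = threading.Thread(target=self._save_fn, args=(payload, path))
        t.start()
        self._pending.append(t)

    def teardown(self):
        # Block until every in-flight save has finished.
        for t in self._pending:
            t.join()

# Usage: the "slow" write happens off the main thread.
results = {}
saver = AsyncSaver(lambda payload, path: results.update({path: payload}))
saver.save({"step": 100}, "s3://training-artifacts/demo.ckpt")
saver.teardown()
```

The real plugin wraps whatever checkpoint IO you already have, so the S3 upload from the previous section rides along for free.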

Running Aim to replace TensorBoard

The next problem I wanted to solve was finding something more performant and powerful than TensorBoard to monitor my training. The project I ultimately settled on is called Aim, and it runs just fine in a Docker container:

services:
  aim-ui:
    image: aimstack/aim:latest
    container_name: aim_ui
    restart: unless-stopped
    ports:
      - "43800:43800"
    volumes:
      - ./aim_data:/opt/aim
    command: up --host 0.0.0.0 --port 43800 --repo /opt/aim

  aim-server:
    image: aimstack/aim:latest
    container_name: aim_server
    restart: unless-stopped
    ports:
      - "53800:53800"
    volumes:
      - ./aim_data:/opt/aim
    command: server --host 0.0.0.0 --port 53800 --repo /opt/aim

Apparently, Aim runs better when you spin up separate server and UI instances, though both can be served from the same container.

Once you have Aim up and running, you just need to tell PyTorch Lightning to use Aim for logging instead of TensorBoard. This is accomplished like so:

from aim.pytorch_lightning import AimLogger

aim_logger = AimLogger(
        experiment="MyAIThingy",
        repo="aim://workstation:53800",
    )
trainer = pl.Trainer(
        logger=aim_logger,
        plugins=[AsyncCheckpointIO()],
        # other args ...
    )

Then, monitoring your training progress is as simple as pointing a browser to http://workstation:43800/ and exploring your training runs.
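One wrinkle with the split deployment: the logger talks to the server port (53800) while your browser talks to the UI port (43800). A trivial helper of my own (not an Aim API) keeps the two in sync:

```python
def aim_ui_url(repo: str, ui_port: int = 43800) -> str:
    """Turn an aim:// tracking-server URL into the matching browser URL.
    Assumes the UI container runs on the same host as the server."""
    host = repo.removeprefix("aim://").rsplit(":", 1)[0]
    return f"http://{host}:{ui_port}/"

print(aim_ui_url("aim://workstation:53800"))
# http://workstation:43800/
```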

You Weights and Biases fans can likewise pass your server’s URL and use their logger. Really, anything other than TensorBoard is up to the job. Weights and Biases also has a self-hostable version that you should run for fun!

Conclusion

There you have it. With Aim, MinIO, and PyTorch Lightning’s S3 integration, you can decouple the storage of training artifacts from the GPU compute server. It’s also completely possible to stream your dataset the same way, but that requires a bit more legwork with custom dataloaders mounting the S3 bucket, and I haven’t implemented it yet.
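That said, the dataset-streaming idea can be sketched without committing to a library: write the reader against any filesystem object exposing `ls` and `open` (s3fs’s `S3FileSystem` has both), so the same code reads locally or from the bucket. All names here are my own, and I haven’t run this against s3fs itself:

```python
import io

class StreamingSampleReader:
    """Read samples through an injected filesystem object.

    `fs` needs only `ls(path)` and `open(path, mode)`; in production
    you'd pass s3fs.S3FileSystem() and an "s3://bucket/..." root.
    """

    def __init__(self, fs, root):
        self.fs = fs
        self.paths = sorted(fs.ls(root))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with self.fs.open(self.paths[idx], "rb") as f:
            return f.read()  # parse your point-cloud format here

# A local stand-in filesystem, handy for testing the reader
# without touching the network.
class InMemoryFS:
    def __init__(self, files):
        self.files = files

    def ls(self, root):
        return [p for p in self.files if p.startswith(root)]

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])

reader = StreamingSampleReader(
    InMemoryFS({"bucket/a.bin": b"A", "bucket/b.bin": b"B"}), "bucket"
)
print(len(reader), reader[0])  # 2 b'A'
```

Because the reader only indexes and returns bytes, wrapping it in a torch `Dataset` later is a one-liner.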

Also, you really should just be using Tailscale on all your machines. If someone else already installed it on a shared machine, ask them to share that particular node to your tailnet from the online management interface.

Welcome Back
https://www.ianswildramblings.com/welcome-back/
Thu, 29 Jan 2026
It’s been a long time since I’ve had a blog or contributed really any content to the internet. I used to have a version of this site in high school that was a whole bunch of nothing, and I found myself rarely making the time to write anything on it. Now, after a nearly decade-long hiatus, I think I’ve discovered that the problem could have been that it was about nothing at all. That’s why I want this blog to actually be about something. About things I care about. About things I think other people should care about. So welcome back, and here’s my current idea of what I want to put on this blog:

Personal Dev Blog

I’m a developer and computer graphics researcher. I gather an immense amount of experience and knowledge from other people’s personal development blogs. I’ve never really had the guts to post on Stack Overflow, for all the reasons we joke about, but I’ve also just never really had problems that didn’t have an existing answer out there. That started to change when my educational career moved from writing code that already exists, in undergrad, to writing code that doesn’t, in graduate school. This, of course, is ignoring that Stack Overflow is on life support and basically a glorified dataset in 2026.

It’s much more common for me now to have to hack together solutions for library versioning and linking to get disparate parts of a cobbled-together codebase to play nice. There have been countless times that I was only able to figure out a solution to something from the dev blog of someone with a similarly odd tech stack. So I’d like to give back a bit and hope that people can find some of my experience useful.

Lamentations on AI

We took the internet of my youth for granted, and now it’s gone. What used to be “surfing the web” is now simply going to the same 3 websites over and over again all day with an anxious compulsion. I fear that inorganic content created to manipulate, deceive, or sell dominates the state of online social interactions. But it doesn’t need to be that way. There is no reason the old way can’t still work. Nothing stops anyone from taking a junk old computer from recycling and throwing a personal website on it. Now, doing so securely is another story (one I hope to poorly document on this site), but that world still exists.

I’m not going to pretend that I’m the only person feeling this or the only blog on the internet. But it’s pretty easy to see that ads, SEO garbage, listicle nonsense, or, more recently, AI slop dominate the majority of small websites. I hope for this to be something different. No ads, absolutely no slop. Just mistakes made by humans and money being incinerated by humans to host it. No em dashes, no overuse of “if X then Y” statements. Just garbage written by a guy who probably has nothing useful to say, yet says it anyway.

Digital Well-being

The modern internet and technology landscape has been described as “constant PvP”. That is, seemingly every interaction is between you and someone who wants something from you. Often they want your attention or money, and you want a distraction or content. But increasingly it feels as though this relationship contains a power imbalance that is simply too much to bear. I’ve begun to question the absurdity of digital practices that we just accept as commonplace. Take notifications, for example. Unless you carefully curate who and what is allowed to notify you, you probably get email and app notifications on your desktop, your phone, and your wrist. Please, for a moment, just consider the absolute absurdity of this common scenario:

You’re eating dinner with a loved one, and you get a tap on your wrist from your smartwatch. The tap is to alert you to an email notification from Amazon about some upcoming discount. Just think about that for a second. You’re carefully attending to someone who actually matters, when your attention is suddenly wrenched away, just for a split second, because a marketing executive knows it improves retention. How completely absurd. How insulting. Get out of my fucking head.

And yes, I know that those notifications can be turned off. But they’re on by default, and unless you religiously uncheck boxes, you’re bound to experience something like this. Maybe it’s a notification from a food delivery app enticing you with an offer. Of course you have notifications for Uber Eats on, how else would you know when your food is on the way? Why do the services we pay for also demand attention from us as additional payment? Why is this something we allow, are okay with, and accept is just part of participating in modern life?

Attention is a precious resource being plundered from us at an unprecedented scale. I want to share my journey to take some of mine back.

Just to Write

I’m a graphics researcher, and a researcher who can’t effectively communicate their ideas is of no use to science, because those ideas will never leave their head. This is my rationalization that writing and blogging as a hobby isn’t a waste of time or a distraction from my research, but instead something that helps me sharpen a vital skill.

Once I was done taking English classes, my writing education essentially stopped. Students in my field see writing assignments as largely a waste of time. I had the occasional essay for an elective class, but not much else. So I want this blog to also function as a way for me to hone my scientific voice and persuasive style, in order to become a more effective communicator.

Hobbies

I do things for fun. Sometimes cool things. Many things pertaining to owning an old house. Hopefully even interesting things too. I’d like someplace to share them that doesn’t enrich the richest men on the planet. So maybe you’ll find some stuff I’m proud of here eventually.
