Towards Personal ML Ops with Headless Training

For years now, I’ve trained neural networks for my research using what I thought was the easy, standard way to do things. You get access to a server, log into it, pull some code off of GitHub, make some config changes, then run it. To monitor the training progress, you fire up TensorBoard and point it at the log directory. It didn’t matter if you were using PyTorch instead of TensorFlow; TensorBoard works for that, too. It’s only recently that I wondered how *real* AI people do this kind of thing. Turns out they use headless training nodes and object stores.

If you want to skip the recipe-blog-esque exposé, jump straight to The Actual Solution section below.

How we got here

I recently had a power supply fan bearing go bad in my lab workstation. This meant that if I trained for an extended period of time on the 5090 inside, the machine’s power supply would start making a deafening screaming sound as the bearing held on for dear life. This was disruptive not only to me, but undoubtedly to the others in my lab.

I can only guess how distracting it was, given that I kept finding my computer turned off in the morning after trying to train overnight, likely a colleague trying to stop the noise. The root of the problem was that the power supply fan would only kick on under sufficient load and temperature, a condition that was only met after the machine had been training for a while, which usually meant after I had left for the day or the weekend. So the only people with first-hand knowledge of the problem were the nocturnal grad students, who were likely intensely annoyed by it.

When I finally heard the sound and diagnosed it as the PSU fan, I brought it up to my advisor, and we got the new part ordered. There will be another post here soon when that comes in documenting the replacement process, mostly for life insurance purposes if the fist-sized capacitor within decides my heart is an acceptable path to ground.

No Space, Mo Problems

Since my lab machine was out of commission, I was directed to ask another student for access to the machine they trained on: one shared with another department whose work involved massive datasets. Due to the size of these datasets and, likely, a lack of systems knowledge about where to properly store them, the server frequently ran out of disk space.

This posed a major challenge. Training runs would start just fine, but once the disk filled up, they would fail to write any checkpoints and either crash or simply stop recording them. Crashing, I deemed, was the better outcome, because what’s the point of burning electricity just to not save any results? This left me with a machine that had large amounts of compute and RAM, usually unused, but almost no disk space, just enough for my code and the dataset. I was lucky to get my 6GB dataset moved in right as someone freed up some space for it.

This left me wondering: surely the *real* AI people don’t just treat every compute node as a storage node too. What tools do they use to actually stream the logs and checkpoints elsewhere?

Headless Training Tech Stack

In order for my findings to actually be useful to anyone, I feel it’s important to say what my software stack looks like. This is also so that SEO can hopefully bring some people here if they’re trying to solve the same problems. It’s no secret how important a role dev blogs play in helping other people solve problems.

The project is a point cloud network, so I’m using PyTorch, PyTorch3D, and PyTorch Lightning on CUDA machines for training. For SDN connectivity, I use Tailscale, so every machine can talk to the others and network setup stays simple. Some of the components run in Docker as well. The machines involved are:

  • GPU workstation with the bad fan; it runs Docker and serves as the artifact store
  • GPU server on the same LAN as the workstation, with almost no free disk space
  • A laptop for monitoring, assumed to be reachable only over WAN

My ultimate goal would be to treat the GPU server as purely a compute node. It would need only space for the Python libraries and the training code. Another machine would stream everything else.

At this point, though, the dataset still lives locally, but I stream all the training checkpoints and logging data to the GPU workstation. Here’s how:

The Actual Solution

PyTorch Lightning will happily stream checkpoints to and from S3 to keep them off your disk. You just need to make sure you have one extra package installed and set a few environment variables to keep the credentials out of your repo. The package you want is called s3fs:

uv add s3fs
or...
pip install s3fs
or... whatever env and package manager you use

With s3fs installed, you can substitute an S3 path for any argument in Lightning that expects a file path. So this:

"~/project/checkpoints/experiment-02/last.ckpt"

becomes this:

"s3://training-artifacts/experiment-02/last.ckpt"

This will work automagically as long as you have the proper environment variables set in something like a .env file loaded with the dotenv package:

#.env
AWS_ACCESS_KEY_ID=AccountName
AWS_SECRET_ACCESS_KEY=SuperSecretPassword
AWS_ENDPOINT_URL=http://Some.S3.Compatible.Service
AWS_REGION=us-east-1

#train.py
import os
# other imports ...
from dotenv import load_dotenv
from pytorch_lightning.callbacks import ModelCheckpoint

# Load the S3 credentials into the environment before Lightning needs them
load_dotenv()
if not os.getenv("AWS_ENDPOINT_URL"):
    raise ValueError("Missing AWS_ENDPOINT_URL. Check your .env file.")

...

# The bucket living on the MinIO server described below
S3_BUCKET = "s3://training-artifacts"

# Checkpoints get written straight to the bucket instead of the local disk
checkpoint = ModelCheckpoint(
        monitor='val_loss',
        dirpath=f"{S3_BUCKET}/checkpoints/",
        # other args ...
    )

And just like that, PyTorch Lightning will know to snag your initial checkpoint from the bucket and store anything that the checkpoint callback makes back into the bucket.
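The same S3-path trick covers resuming a run. Here’s a minimal sketch of what that looks like; it reuses the checkpoint callback and S3_BUCKET from the snippet above, PointCloudModel and PointCloudData are hypothetical stand-ins for your own LightningModule and DataModule, and the last.ckpt name assumes you set save_last=True on the callback:

#resume sketch
import pytorch_lightning as pl

model = PointCloudModel()        # hypothetical LightningModule
datamodule = PointCloudData()    # hypothetical LightningDataModule

trainer = pl.Trainer(
        accelerator='gpu',
        callbacks=[checkpoint],  # the ModelCheckpoint defined above
        # other args ...
    )

# ckpt_path accepts the same S3-style URI as dirpath; s3fs fetches it over the network
trainer.fit(
        model,
        datamodule=datamodule,
        ckpt_path=f"{S3_BUCKET}/checkpoints/last.ckpt",
    )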

Self-Hosted S3

I chose MinIO to provide the S3 capability. I ran it with Docker Compose on the workstation so that I could take advantage of the fast LAN connection between it and the training server. The compose file looks something like this:

#docker-compose.yaml
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    restart: always
    volumes:
      - ./minio_data:/data # Make sure to map in a directory to store everything
    ports:
      - "9000:9000"
      - "9001:9000"
    environment:
      MINIO_ROOT_USER: "admin"
      MINIO_ROOT_PASSWORD: "StrongPassword"
    command: server /data --console-address ":9001"

And then, with a quick docker compose up -d, you can have a MinIO S3 server running for PyTorch Lightning to use as its datastore.
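One thing the container won’t do for you is create the bucket itself. You can click one together in the MinIO console on port 9001, or script it with the same s3fs package and the credentials from the .env file. A quick sketch, using the training-artifacts bucket name from the examples above:

#make_bucket.py
import os

import s3fs
from dotenv import load_dotenv

load_dotenv()

# s3fs can pick the AWS_* variables up from the environment on its own,
# but being explicit makes it obvious which MinIO instance we're talking to
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    client_kwargs={"endpoint_url": os.environ["AWS_ENDPOINT_URL"]},
)

# At the top level of S3, "directories" are buckets, so mkdir creates one
if not fs.exists("training-artifacts"):
    fs.mkdir("training-artifacts")

print(fs.ls(""))  # list the buckets to confirm it exists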

But wait. What if my checkpoints are really big? You don’t want to have your GPU sitting idle while you save a file across the network, obviously. Luckily, we live in a modern world with async callbacks, and you can easily configure Lightning to do the saving in a separate thread by passing a plugin to the trainer:

import pytorch_lightning as pl
from pytorch_lightning.plugins import AsyncCheckpointIO

trainer = pl.Trainer(
        max_steps=MAX_STEPS,
        accelerator='gpu',
        plugins=[AsyncCheckpointIO()],
        # other args ...
    )

And just like that, the GPU won’t stall waiting on uploads, and you still won’t be writing a single byte of checkpoint data to disk on your training machine during the run.

Running Aim to replace TensorBoard

The next problem I wanted to solve was to use something more performant and powerful than TensorBoard to monitor my training. The project I ultimately settled on is called Aim and runs just fine in a Docker container:

services:
  aim-ui:
    image: aimstack/aim:latest
    container_name: aim_ui
    restart: unless-stopped
    ports:
      - "43800:43800"
    volumes:
      - ./aim_data:/opt/aim
    command: up --host 0.0.0.0 --port 43800 --repo /opt/aim

  aim-server:
    image: aimstack/aim:latest
    container_name: aim_server
    restart: unless-stopped
    ports:
      - "53800:53800"
    volumes:
      - ./aim_data:/opt/aim
    command: server --host 0.0.0.0 --port 53800 --repo /opt/aim

Apparently, Aim runs better when you spin up separate server and UI instances, though both could be served from a single container.

Once you have Aim up and running, you just need to tell PyTorch Lightning to use Aim for logging instead of TensorBoard. This is accomplished like so:

from aim.pytorch_lightning import AimLogger

aim_logger = AimLogger(
        experiment="MyAIThingy",
        repo="aim://workstation:53800",
    )
trainer = pl.Trainer(
        logger=aim_logger,
        plugins=[AsyncCheckpointIO()],
        # other args ...
    )

Then, monitoring your training progress is as simple as pointing a browser to http://workstation:43800/ and exploring your training runs.

You Weights and Biases fans can swap in their logger just as easily and point it at whatever URL your instance lives at. Really, anything other than TensorBoard is up for the job. Weights and Biases also has a self-hostable version that you should run for fun!
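If that’s your flavor, the swap is a small sketch like this; the project name is just a placeholder, and if you self-host, wandb reads your instance URL from the WANDB_BASE_URL environment variable:

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.plugins import AsyncCheckpointIO

# For a self-hosted instance, set WANDB_BASE_URL before this runs
wandb_logger = WandbLogger(project="MyAIThingy")

trainer = pl.Trainer(
        logger=wandb_logger,
        plugins=[AsyncCheckpointIO()],
        # other args ...
    )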

Conclusion

There you have it. With Aim, MinIO, and PyTorch Lightning’s S3 integration, you can decouple the storage of training artifacts and logs from the GPU compute server. It’s also completely possible to do a similar process with streaming your dataset, but that requires a bit more legwork with custom dataloaders mounting the S3 bucket, and I haven’t implemented it yet.

Also, you really should just be using Tailscale on all your machines. If someone else has already installed it on a shared machine, ask them to share that particular node to your tailnet from the online management interface.

