This content originally appeared on DEV Community and was authored by Loudbook
This article is the technical breakdown of the AI model templates published on the Railway platform. If you haven’t already, read this post that talks about the motivation. You can find the template links at the bottom of this post.
Initial Research
Going into this, we were aware of the potential performance issues of running these models without a GPU, something Railway does not currently offer. Our resources on the “pro” plan would be limited to 32 vCPU cores and 32 GB of RAM.
Preparation
Railway centers around services, which are essentially just Docker containers. Finding a pre-built Docker image with these models turned out to be extremely easy, thanks to Ollama. Ollama provides a portable Docker image for running its server, and from that server you’re able to pull whatever model you’d like. This seemed to be the ultimate, expandable solution for us.
Building the Template
Creating a new service in Railway is easy, and with Ollama being hosted on Docker Hub, we were able to source the service directly from that image.
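For reference, the image in question is ollama/ollama on Docker Hub. If you want to try the same setup locally before deploying, the rough equivalent (not part of the template itself) looks like this, with a named volume pointing at the directory where the official image stores its models:
# Run the official Ollama image locally, caching models in a named volume
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
On Railway, the service definition plus its attached volume take the place of that command.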
Now, to start this container, Ollama needs a bit of extra setup in the form of a custom start command. Here’s ours; let’s break it down.
/bin/sh -c "ollama serve & pid=$!; sleep 5; ollama pull gpt-oss; wait $pid"
/bin/sh -c
This segment runs everything that follows as a shell command. The shell wrapper is what lets us use shell variables like $pid later in the line.
ollama serve
Running this command starts the Ollama server inside our container; the trailing & in the full command runs it in the background. The server has to be running before any models can be pulled.
pid=$!
The shell variable pid is assigned the process ID of the ollama serve process we just started in the background ($! expands to the PID of the most recent background command). This lets us monitor the server: if it crashes, Railway sees that the process has exited and stops the container.
sleep 5
Magic numbers are always frowned upon, and this is no exception. Still, it’s reliable and simple: to make sure the server is ready before we pull, we just give it a few seconds.
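If you’d rather not rely on a magic number, a rough alternative (not what the published template does) is to poll the server until it responds. Ollama listens on port 11434 by default, so something like this could stand in for sleep 5:
# Keep checking the local Ollama server until it answers, instead of sleeping a fixed 5 seconds
until curl -sf http://127.0.0.1:11434/ > /dev/null; do sleep 1; done
This assumes curl is available inside the image; the fixed sleep avoids that dependency, which is part of why it’s so convenient.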
ollama pull gpt-oss
This is the important step. Running this command downloads the model from Ollama’s library, and once the download completes, the model is ready to be served.
But won’t this pull every time?
Nope! Once the model has been downloaded, it’s stored on the attached volume. This allows the model to be cached, with Ollama only checking for updates and redownloading as necessary.
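You can verify the caching yourself from a shell inside the running service: listing the models shows what’s already stored on the volume, and pulling again should finish almost immediately.
# List cached models, then confirm that a repeat pull only checks for updates
ollama list
ollama pull gpt-oss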
wait $pid
Remember how we stored the server’s PID in this variable? Here, we tell the shell to wait for that process to exit, which keeps the container alive for as long as ollama serve is running.
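With the full command running and the model pulled, a quick sanity check is to hit Ollama’s generate endpoint from inside the container (or from another service on Railway’s private network, adjusting the host accordingly):
# Ask the freshly pulled model for a completion via Ollama's HTTP API
curl http://127.0.0.1:11434/api/generate -d '{"model": "gpt-oss", "prompt": "Say hello"}'
By default the response streams back as a series of JSON objects.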
How was it secured?
Allowing anyone to access your instance can be harmful and wastes resources that you could be allocating elsewhere.
To secure the backend, we put a service running “Caddy” in front of it, with a custom configuration that requires an API key to be present as a header.
You can view this configuration here: https://github.com/Err0r430/railway-dockerfiles/blob/main/caddy/Caddyfile
Now, instead of connecting directly to the Ollama server, requests have to pass through an authentication proxy first.
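Concretely, that means every request must now carry the key. The exact header name and matching rules are defined in the Caddyfile linked above; assuming a plain Authorization header and the proxy’s public Railway domain (both placeholders here), a request looks something like:
# Call the model through the Caddy proxy, presenting the API key as a header (the header name depends on your Caddyfile)
curl https://your-proxy.up.railway.app/api/generate -H "Authorization: Bearer YOUR_API_KEY" -d '{"model": "gpt-oss", "prompt": "Say hello"}'
Requests without the expected header are rejected at the proxy and never reach Ollama.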
How does it perform?
Below are the raw tokens-per-second measurements. Please note that your results will vary.
- DeepSeek R1 (8b): 10.8 tokens/sec
- gpt-oss (20b): 13 tokens/sec
- Llama (1b): 20.5 tokens/sec
- Qwen3 (0.6b): 86 tokens/sec
These get progressively faster because of their decreasing parameter counts: the smaller the model, the faster it runs, with one obvious outlier here. gpt-oss by OpenAI appears to be a HEAVILY optimized model in comparison to DeepSeek, despite being much larger. The specifics behind this are beyond my knowledge and outside the scope of this post.
To put it simply, the larger the model, the more intelligence it is perceived to have. For complex tasks, gpt-oss or DeepSeek will be more helpful, while Qwen3 should be used for quick and less sophisticated questions.
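If you want to measure your own numbers, one straightforward way (not necessarily how the figures above were gathered) is the --verbose flag on ollama run, which prints timing statistics including the eval rate in tokens per second:
# Run a single prompt and print timing stats (eval rate = tokens per second)
ollama run gpt-oss --verbose "Why is the sky blue?"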
Conclusion
Creating these templates was fun and educational. The results blew away our expectations considering no GPU is present, and we hope this helps you host your own models, for whatever purpose that may be.
You can host these templates yourself with the links below: