This content originally appeared on DEV Community and was authored by Noah
In today’s world, AI is no longer just a buzzword. It’s part of our everyday lives. Whether it’s ChatGPT, Claude, or Gemini, most of us interact with these tools daily. But while the big names run on massive cloud infrastructure, what if you could host your own private AI?
The cost.
OpenAI pours hundreds of millions of dollars into training and serving its models. Google invests at a scale only tech giants can sustain. Anthropic follows the same path with enormous compute bills and specialized infrastructure. The average developer, company, or small team has no reason to spend like that. What they want is a simple way to run private AI without breaking the bank.
Custom solutions can be expensive and require complex setup. But what if they didn’t have to? What if your solution to having a completely private AI model was as simple as two clicks?
Enter Railway.
What’s Railway?
Railway is a cloud platform built for developers. It takes you from idea to deployed application, or even a private AI service, in just a few clicks.
Complexity does not always equal quality. Too often, we overengineer solutions that could be simple. Why spend weeks engineering overly complex infrastructure when the real goal is to ship something that works?
How does Railway come into play?
At first glance it sounds impossible. AI models are massive. Don't they need racks of GPUs and specialized hardware? The truth is, yes and no. Smaller open-weight models can run entirely on CPUs, and thanks to optimizations like quantization (storing weights at 4-bit instead of 16-bit precision, roughly quartering memory use), they can run surprisingly well. Sure, you can't run the full GPT-5 or Claude Opus 4.1, but models like GPT-OSS match the performance of o3-mini.
Railway provides the environment to deploy and manage these models without you having to worry about infrastructure plumbing.
So the question is no longer if you can run private AI on Railway, but how.
Thankfully, I’m here for that.
How did I deploy?
Well, @loudbook and I had this exact same question: how can we run and host an LLM on Railway with real performance?
Our first attempt involved a custom build of vLLM. It was powerful on paper, but in practice it was fragile, hard to build, and prone to compatibility issues.
Next, I suggested llama.cpp. It looked promising, but getting it hosted and configured to work well with Railway turned out to be a slow and painful process.
Then, the breakthrough. @loudbook suggested simply using Ollama. Deploying was now as simple as running `ollama pull <model>` and `ollama serve`. After a few small tweaks, it just worked. We had successfully deployed an open-source LLM on Railway. And because Ollama provides an OpenAI API–compatible endpoint, our private model can easily be hooked up to the existing OpenAI SDKs.
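To illustrate, here's a minimal sketch of talking to the deployed model through the official OpenAI Python SDK. The base URL is a placeholder for your own Railway service's domain, and the model name is whatever you pulled:

```python
# Minimal sketch: chat with a model served by Ollama via the OpenAI SDK.
# The base_url is a hypothetical placeholder -- substitute your Railway
# service's public domain. Ollama's OpenAI-compatible API lives under /v1.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ollama-service.up.railway.app/v1",  # hypothetical URL
    api_key="ollama",  # Ollama ignores the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.1",  # whichever model you pulled with `ollama pull`
    messages=[{"role": "user", "content": "Say hello from my private AI."}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire format, any existing tooling built on the OpenAI SDKs can be pointed at your private model by changing one URL.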
The numbers.
Cost.
Even without GPUs, running LLMs gets expensive.
Depending on how heavily you use it, you'll be looking at roughly $300–$900 per month.
That number might sound high at first, but let’s keep it in perspective. Compared to the per-token or per-request pricing of hosted APIs, the cost becomes competitive very quickly if you are doing moderate to heavy usage. For a small team, this could mean hundreds of thousands of tokens per day for a flat, predictable price.
More importantly, the cost is fixed and under your control. You are not metered per request, and you are not paying for someone else’s margin on inference. You are simply renting the resources you need, running the model you want, and deciding how available you want it to be.
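If you want to sanity-check that claim against your own workload, a back-of-envelope calculation helps. Every number below is an illustrative assumption (a mid-range bill, the Qwen3 throughput from the Speed section below, and a guessed utilization), not Railway pricing:

```python
# Back-of-envelope cost model. All values are assumptions for illustration --
# plug in your own bill, measured throughput, and expected utilization.
monthly_cost_usd = 600        # mid-range of the $300-$900 figure above
tokens_per_sec = 86           # Qwen3 throughput from the Speed section below
utilization = 0.5             # assumed fraction of time spent generating

seconds_per_month = 60 * 60 * 24 * 30
tokens_per_month = tokens_per_sec * utilization * seconds_per_month

print(f"{tokens_per_month / 1e6:.1f}M tokens/month")          # ~111.5M
print(f"${monthly_cost_usd / (tokens_per_month / 1e6):.2f} per million tokens")
```

The effective per-token cost swings heavily with utilization: the busier you keep the box, the cheaper each token gets, which is exactly why flat-rate hosting favors moderate to heavy usage.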
If you want a hard price ceiling, Railway has a tool for that: Usage Limits.
If you want per-service resource caps, Railway has a tool for that: Resource Limits.
Speed.
Cost is only half the story. The other question is speed, because no one wants a private model that crawls along.
On Railway, inference speed depends on two things: the model you choose and the resources you allocate. Bigger models tend to generate slower, smaller ones much faster. In our testing, we saw:
- DeepSeek — 10.8 tokens/sec
- GPT-OSS — 13 tokens/sec
- Llama — 20.5 tokens/sec
- Qwen3 — 86 tokens/sec
Now, 10 might sound slow compared to 86, but bear in mind that a typical English sentence is roughly 10–15 tokens. That’s nearly one sentence per second.
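To make those rates concrete, here's a tiny sketch that converts them into wall-clock time for a short reply (the ~150-token reply length is an assumption, roughly a short paragraph):

```python
# Turn measured throughput into felt latency. Rates are from our tests above;
# the reply length is an assumed figure for illustration.
rates = {"DeepSeek": 10.8, "GPT-OSS": 13, "Llama": 20.5, "Qwen3": 86}
reply_tokens = 150  # assumed length of a short-paragraph reply

for model, tps in rates.items():
    print(f"{model:>8}: {reply_tokens / tps:.1f}s per reply")
# DeepSeek: 13.9s, GPT-OSS: 11.5s, Llama: 7.3s, Qwen3: 1.7s
```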
For a complete tech breakdown, read this article by @loudbook.
Deploy a model on Railway.
Now that we have the numbers out of the way, here’s how to host a model on Railway.
Currently there are four models, each available in multiple flavors:
- DeepSeek
- GPT-OSS
- Llama
- Qwen3
Whichever you choose, you'll need the Railway Pro plan.
After picking a flavor, simply click Deploy now. It’s that easy.
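Once the service is up, a quick way to confirm it's actually serving is to list the available models through the OpenAI-compatible endpoint (the URL is again a placeholder for your own deployment):

```python
# Sanity check after deploying: list the models your Ollama service exposes.
# Replace the base_url with your Railway service's domain (hypothetical here).
from openai import OpenAI

client = OpenAI(
    base_url="https://your-ollama-service.up.railway.app/v1",  # hypothetical
    api_key="ollama",  # any non-empty string works; Ollama doesn't check it
)

for model in client.models.list():
    print(model.id)
```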
Conclusion.
Private AI is possible at an affordable price. In a few clicks, you can host your very own AI model.
The important part is not just that it works. It’s that it’s accessible. You don’t need to be a cloud architect or a Fortune 500 company to have secure, private AI infrastructure. Railway strips away the overhead, leaving you with the freedom to focus on what actually matters: building.