Building your own local AI app



This content originally appeared on Level Up Coding – Medium and was authored by Ben Witt

A deep dive into Microsoft Foundry and WPF


The era of artificial intelligence is in full swing. But the prevailing dependence on cloud-based services raises important questions: What about data protection? How do the running costs scale? And how high is the latency for real-time interactions? The answer to these challenges is increasingly local AI. In this technical article, we highlight how developers can use Microsoft Foundry to run powerful language models directly on their own computers and integrate them into classic desktop applications — in this case, using a C# WPF app as an example.

Why local AI?

The ability to run large language models (LLMs) locally marks a paradigm shift in software development. It gives developers full control over their data, enables offline capabilities, protects against external provider outages, and drastically reduces response times. Microsoft has recognized this need and created Foundry Local, a platform that radically simplifies getting started with on-device AI development.

Unlike the Azure cloud version, Foundry Local is specifically designed to run on Windows and macOS devices and does not require an Azure subscription or cloud connection. This not only lowers the barrier to entry, but also frees companies from the regulatory uncertainties of international data processing.

What is Microsoft Foundry — and what does it do?

Microsoft Foundry Local is a toolchain for managing the entire lifecycle of local AI models. From model discovery and download to configuration and execution, Foundry provides a unified, CLI-based interface.

The most important features at a glance:

  • Easy installation: A single command via winget (Windows) or brew (macOS) is all it takes.
  • Hardware detection: Foundry automatically analyzes the local hardware (GPU, CPU, NPU) and downloads the optimized model variant.
  • OpenAI-compatible API: A local REST endpoint emulates the OpenAI API, allowing existing tools such as LangChain or AutoGen to remain usable without modification.
  • Model catalog: Developers can choose from an ever-growing selection of powerful open source models.

A typical request body corresponds exactly to the OpenAI standard:

{
  "model": "gpt-oss-20b-cuda-gpu",
  "messages": [
    { "role": "user", "content": "Hey, what's up?" }
  ],
  "temperature": 0.7,
  "stream": true
}
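
Because the endpoint mirrors the OpenAI API, an existing OpenAI client can simply be pointed at the local server. Here is a minimal sketch, assuming the official OpenAI .NET package and the default port used later in this article; the demo app itself talks to the endpoint with a raw HttpClient instead:

using System;
using System.ClientModel;
using OpenAI;
using OpenAI.Chat;

// Point the standard OpenAI client at the local Foundry endpoint.
// A key value is required by the client but ignored by the local server.
var client = new OpenAIClient(
    new ApiKeyCredential("not-needed-locally"),
    new OpenAIClientOptions { Endpoint = new Uri("http://localhost:5273/v1") });

ChatClient chat = client.GetChatClient("gpt-oss-20b-cuda-gpu");
ChatCompletion completion = chat.CompleteChat("Hey, what's up?");
Console.WriteLine(completion.Content[0].Text);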

1. Installing Foundry Local

Getting started is easy. For Windows, just use this command:

winget install Microsoft.FoundryLocal

After installing, you can check out the current model list with foundry model list.

2. Model selection and execution: gpt-oss-20b-cuda-gpu

For our example project, we selected the gpt-oss-20b-cuda-gpu model — a GPU-optimized variant of the open-weight model gpt-oss-20b. With 20 billion parameters, it is suitable for complex agent applications and logical reasoning.

Note: The CUDA version requires an NVIDIA GPU with at least 16 GB VRAM.

Model start:

foundry model run gpt-oss-20b

Foundry resolves this generic name to the hardware-optimized variant detected earlier (in this case the CUDA build) and automatically starts an inference server at:

http://localhost:5273/v1/chat/completions
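
The WPF code below refers to this address through a FoundryApiUrl constant. Its declaration is not shown in the article, but presumably looks something like this:

private const string FoundryApiUrl = "http://localhost:5273/v1/chat/completions";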

3. Integration into a WPF application

Application architecture overview

I made a conscious decision to keep the code as simple as possible, because the focus here is on functionality, not architecture.

The C# WPF app is based on a simple structure with one main component:

public partial class MainWindow : Window

The conversation history is stored as follows:

private readonly List<ChatMessage> conversationHistory;

The ChatMessage class corresponds to the OpenAI data model:

public class ChatMessage
{
    [JsonProperty("role")]
    public string Role { get; set; }

    [JsonProperty("content")]
    public string Content { get; set; }

    public string Rationale { get; set; }
}
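
The article does not show how the history is populated. A plausible sketch, assuming a seeded system prompt and per-turn appends (the prompt text and variable names are assumptions, not taken from the demo):

// Seed the history once so the model keeps the full conversational context.
conversationHistory = new List<ChatMessage>
{
    new ChatMessage { Role = "system", Content = "You are a helpful assistant." }
};

// Per turn: record the user's input, then, once streaming finishes, the reply.
conversationHistory.Add(new ChatMessage { Role = "user", Content = userInput });
conversationHistory.Add(new ChatMessage { Role = "assistant", Content = fullAssistantReply });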

API request and streaming: SubmitButton_Click

When the Send button is clicked, the following JSON request is generated:

var requestBody = new
{
    model = "gpt-oss-20b-cuda-gpu",
    messages = conversationHistory,
    temperature = 0.7,
    max_tokens = 2048,
    stream = true
};
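
The serialization step is not shown, but since the data structures carry Newtonsoft.Json attributes, jsonRequest is presumably produced along these lines:

// Presumed serialization of the anonymous request object with Newtonsoft.Json.
string jsonRequest = JsonConvert.SerializeObject(requestBody);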

Communication with the model takes place via HttpClient:

var request = new HttpRequestMessage(HttpMethod.Post, FoundryApiUrl)
{
    Content = new StringContent(jsonRequest, Encoding.UTF8, "application/json")
};

HttpResponseMessage response = await client.SendAsync(
    request, HttpCompletionOption.ResponseHeadersRead);

The ResponseHeadersRead flag allows access to the stream during the response transmission — a decisive advantage for reactive UI logic.
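
The line reader used in the next step is not shown either; it is presumably created from the response body roughly like this (variable names are assumptions):

// Expose the response body as a line-oriented reader for the SSE stream.
using Stream stream = await response.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream);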

Processing the streaming response: ProcessStreamedResponse

The streamed text is processed as follows:

while (!reader.EndOfStream)
{
    string line = await reader.ReadLineAsync();

    // SSE streams interleave blank lines and terminate with "data: [DONE]".
    if (string.IsNullOrWhiteSpace(line)) continue;
    if (line.StartsWith("data: ")) line = line.Substring(6);
    if (line == "[DONE]") break;

    var responseObject = JObject.Parse(line);
    var delta = responseObject["choices"]?[0]?["delta"]?["content"]?.ToString();
}

An artificial “live effect” is created by a delay:

await Task.Delay(CharacterStreamDelay);

The output is dynamically transferred to the UI:

await Dispatcher.InvokeAsync(() =>
{
    assistantTextBlock.Text += character;
    ChatScrollViewer.ScrollToEnd();
});

Criticism: Model-specific markers

A key disadvantage lies in the use of model-specific markers:

const string startMarker = "<|channel|>final<|message|>";
const string endMarker = "<|return|>";

These markers are not standardized. Their use makes the app dependent on the model used and complicates generic use with alternative LLMs.
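
To make the issue concrete: the markers force post-processing on the client roughly like the following (the helper is illustrative, not code from the demo):

// Illustrative only: strip the gpt-oss-specific wrapper around the final answer.
// Other models emit no such markers, so the helper falls back to the raw text.
static string ExtractFinalAnswer(string raw)
{
    const string startMarker = "<|channel|>final<|message|>";
    const string endMarker = "<|return|>";

    int start = raw.IndexOf(startMarker, StringComparison.Ordinal);
    if (start < 0) return raw;
    start += startMarker.Length;

    int end = raw.IndexOf(endMarker, start, StringComparison.Ordinal);
    return end < 0 ? raw.Substring(start) : raw.Substring(start, end - start);
}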

Visual representation in the UI

Messages are rendered as text blocks, differentiated by color:

case "user":
textBlock.HorizontalAlignment = HorizontalAlignment.Right;
textBlock.Background = new SolidColorBrush(Color. FromRgb(225, 245, 254)); // light blue
break;
case "assistant":
textBlock.HorizontalAlignment = HorizontalAlignment.Left;
textBlock.Background = new SolidColorBrush(Color.FromRgb(241, 241, 241)); // light gray
break;
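
The textBlock styled in this switch is presumably created per message roughly as follows; only the role-based styling above comes from the demo, the property values here are assumptions:

// Assumed construction of the chat-bubble TextBlock for each message.
var textBlock = new TextBlock
{
    Text = message.Content,            // the ChatMessage being rendered (assumed variable name)
    TextWrapping = TextWrapping.Wrap,
    Margin = new Thickness(4),
    Padding = new Thickness(8)
};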

Weaknesses and Areas for Improvement

  1. Model Dependency
    The application relies on specific control markers like <|channel|> and <|return|> to determine the start and end of a model’s response. However, these tokens are not part of any official API standard and are tied to a specific model implementation. This tight coupling makes the application less flexible and harder to adapt to other language models.
  2. Monolithic Architecture
    All logic is placed directly in the code-behind file of the main window (MainWindow.xaml.cs). There is no separation between the UI, business logic, and data structures. This lack of architectural layering makes the application harder to maintain, extend, or test—especially as the project grows.
  3. No Runtime Configuration
    The model name (gpt-oss-20b-cuda-gpu) is hard-coded into the application. Users cannot switch between different models at runtime, nor can they modify settings through a configuration file. Introducing such flexibility would make the application more versatile and user-friendly, especially in evaluation or production scenarios; a minimal configuration sketch follows this list.
  4. No Logging System
    Errors are only shown via message boxes, with no structured logging to a file or system log. This makes it difficult to trace issues after they occur, particularly in production use or longer sessions. A basic logging mechanism would significantly improve diagnostics and supportability.
  5. Lack of Contextual Features
    The app does not offer any context-related tools such as model switching, prompt templates, or token usage tracking. These features could help users better manage conversations and optimize the behavior of the language model, especially in more advanced use cases.
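
As a minimal sketch for point 3, the model name and endpoint could be read from a JSON settings file instead of being hard-coded (the file name, keys, and types are assumptions; the defaults mirror the values used in the demo):

using System.IO;
using System.Text.Json;

// Hypothetical settings type; defaults mirror the demo's hard-coded values.
public sealed class FoundrySettings
{
    public string Model { get; set; } = "gpt-oss-20b-cuda-gpu";
    public string Endpoint { get; set; } = "http://localhost:5273/v1/chat/completions";
}

public static class SettingsLoader
{
    public static FoundrySettings Load(string path = "appsettings.json")
    {
        if (!File.Exists(path))
            return new FoundrySettings(); // fall back to the built-in defaults

        string json = File.ReadAllText(path);
        return JsonSerializer.Deserialize<FoundrySettings>(json) ?? new FoundrySettings();
    }
}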

Conclusion: AI without the cloud? Reality, not vision

The application presented here demonstrates impressively that AI functionality on a local system is not only possible, but also ready for production. Microsoft Foundry significantly lowers the entry barrier, integration into WPF is achieved with manageable effort, and the user experience benefits from smooth streaming.

However, the technical reality demands more than mere functionality: configurability, reusability, and robust architecture must also be guaranteed. With a few structural interventions (e.g., MVVM, logging, model selection via UI), the application can be further developed into a scalable solution — while retaining all the advantages of local AI.

Demo GitHub: https://github.com/WittBen/LocalAI

Source: https://learn.microsoft.com/de-de/azure/ai-foundry/foundry-local/get-started

