Why I Built My Own Markup Language for AI-Powered Video Editing



This content originally appeared on DEV Community and was authored by Idrees

Building a Custom Video Editing Backend with LLMs and Declarative Composition

I’ve been exploring how large language models (LLMs) can be used to automate video editing.

At first, I thought the process would be relatively straightforward: describe the edit in natural language, have the LLM translate that into actions, and render the result.

My first attempt paired LLMs with FFmpeg to handle basic operations like trimming and adding transitions. It didn't work well: the generated commands were brittle, and FFmpeg's filter syntax is hard for LLMs to produce reliably, even with carefully refined prompts.

To solve this, I created a custom markup system called Swimlane Markup Language (SWML) — a JSON-style declarative format that describes the video structure. It allows LLMs to express compositions in a format they’re good at generating, while keeping the logic consistent and error-free.
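To give a feel for the format, here is a rough sketch of a composition. The exact schema lives in the repository, so treat the field names below as illustrative rather than canonical:

```python
import json

# Illustrative SWML-style composition (field names are hypothetical, not the
# canonical schema): two clips on a main track, a title overlay on another,
# with explicit timing so the renderer never has to guess.
composition = {
    "version": "0.1",
    "resolution": {"width": 1920, "height": 1080},
    "fps": 30,
    "tracks": [
        {
            "name": "main",
            "clips": [
                {"source": "intro.mp4", "start": 0.0, "duration": 5.0},
                {"source": "demo.mp4", "start": 5.0, "duration": 12.0,
                 "transition_in": {"type": "crossfade", "duration": 0.5}},
            ],
        },
        {
            "name": "overlays",
            "clips": [
                {"source": "title.png", "start": 0.5, "duration": 4.0},
            ],
        },
    ],
}

# The LLM emits this structure as JSON; the backend parses and validates it
# before handing it to the renderer.
print(json.dumps(composition, indent=2))
```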

Architecture Overview

The system is built as a modular backend with:

  • FastAPI for the API server
  • SWML for declarative video composition
  • Blender’s Video Sequence Editor (VSE) for rendering video under the hood
  • A planning module that uses Gemini or ChatGPT to generate and revise the SWML structure
  • A plugin system for generating additional media assets (e.g. animations)

SWML makes it easier for the system to remain reliable across different kinds of inputs and edit prompts. It handles sequencing, layering, and timing in a structured way, so that a human-readable video plan can be parsed and rendered without error-prone scripting.
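To show how the pieces could fit together, here is a minimal sketch of the request flow. The endpoint and helper names are placeholders for illustration, not the project's actual API:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class EditRequest(BaseModel):
    prompt: str             # natural-language edit instruction
    assets: list[str] = []  # media files the edit may reference

# plan_swml, validate_swml and render_swml stand in for the planning module,
# the SWML validator and the Blender VSE renderer; the real functions in the
# repository will differ.
def plan_swml(prompt: str, assets: list[str]) -> dict:
    raise NotImplementedError("call Gemini/ChatGPT and parse the JSON plan here")

def validate_swml(swml: dict) -> list[str]:
    return []  # return a list of human-readable validation errors

def render_swml(swml: dict) -> str:
    raise NotImplementedError("drive Blender's VSE here, return the output path")

@app.post("/edit")
def edit(req: EditRequest):
    swml = plan_swml(req.prompt, req.assets)  # LLM -> declarative plan
    errors = validate_swml(swml)              # catch bad plans before rendering
    if errors:
        raise HTTPException(status_code=422, detail=errors)
    output_path = render_swml(swml)           # Blender VSE render
    return {"swml": swml, "output": output_path}
```

Keeping the planner, validator, and renderer as separate functions is what lets a bad plan be rejected, or sent back to the LLM for revision, before any rendering happens.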

Animation Support with Manim

For generating simple animations and text overlays, I integrated Manim as a plugin. It works, but the pipeline is still somewhat brittle and under active development.

Eventually, I want to move more of the animation logic directly into SWML. However, designing a full animation system inside a markup language is a large project on its own, so for now Manim handles specific animation use cases while SWML manages the main composition logic.
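For context, a Manim-based plugin produces ordinary video files that the composition can reference like any other clip. The scene below is a generic Manim Community example, not the plugin's actual code:

```python
from manim import Scene, Text, Write, FadeOut

class TitleOverlay(Scene):
    """A simple animated title card rendered to a video file,
    which the SWML composition can then place on a track."""

    def construct(self):
        title = Text("AI-Powered Editing")
        self.play(Write(title), run_time=1.5)
        self.wait(1)
        self.play(FadeOut(title), run_time=0.5)
```

Rendering it with `manim -qm title_overlay.py TitleOverlay` produces an .mp4 that a clip entry in the composition can point at.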

Extensibility

The project uses a plugin architecture, so other tools and generation methods can be added without changing the core system (a rough sketch of such a plugin contract follows the list below). I’m planning to expand this with:

  • Image and audio generation plugins
  • Subtitle and transcript tools
  • Template-based video creation
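The contract a plugin needs to satisfy can stay very small. The class and method names below are hypothetical; the real interface is defined in the repository:

```python
from abc import ABC, abstractmethod

class AssetPlugin(ABC):
    """Hypothetical plugin contract: given a natural-language request,
    produce a media file the SWML composition can reference."""

    name: str  # e.g. "manim", "tts", "image-gen"

    @abstractmethod
    def can_handle(self, request: str) -> bool:
        """Return True if this plugin should generate the requested asset."""

    @abstractmethod
    def generate(self, request: str, workdir: str) -> str:
        """Generate the asset and return the path to the produced file."""
```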

Current Status

Right now, the system can:

  • Accept natural language edit prompts
  • Use an LLM to plan the required edit
  • Generate SWML to represent the edit sequence
  • Render the final output through Blender’s VSE
  • Optionally generate animations through Manim
  • Provide structured execution logs and editable history

Development is focused on improving reliability, expanding plugin support, and gradually enhancing the SWML specification so it can eventually support full-featured video editing directly.

Why This Approach

LLMs are good at structured generation. Rather than having them generate executable code directly (which often fails), I let them produce intermediate representations in a format they can handle reliably. This also makes debugging and iteration easier.

By separating planning, composition, and rendering into distinct steps, each component can be validated independently. That makes the whole system more robust and easier to maintain.
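As a concrete example of that separation, the composition step can be checked on its own before anything reaches the renderer, filling in the kind of validation step sketched earlier. These checks are illustrative, not the project's actual validator:

```python
def validate_swml(swml: dict) -> list[str]:
    """Illustrative stand-alone checks on a planned composition,
    run before anything is sent to the renderer."""
    errors = []
    if "tracks" not in swml:
        errors.append("composition has no tracks")
        return errors
    for track in swml["tracks"]:
        for clip in track.get("clips", []):
            if clip.get("duration", 0) <= 0:
                errors.append(f"clip {clip.get('source')} has a non-positive duration")
            if clip.get("start", 0) < 0:
                errors.append(f"clip {clip.get('source')} starts before time 0")
    return errors

# If the planner produced a bad plan, these errors can be fed straight back
# to the LLM as a revision prompt instead of failing at render time.
```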

Repository

The code is available here:

🔗 https://github.com/idreesaziz/GPT_Editor_MVP

Conclusion

This is still a work in progress, but the goal is to build a practical backend system for AI-assisted video editing — one where LLMs don’t replace editors, but help them work faster by automating tedious tasks and generating initial compositions that can be refined further.

If you’re working on something similar, or want to explore SWML-based composition, feel free to reach out or open an issue on GitHub.

