My confession after 5 years of developing in Python: I still love drag-and-drop tools. And no, it’s not because I’m not “technical enough.”
Full disclosure: I’m the creator of Flowfile, the open-source tool I discuss in this article. My goal is to share the journey and the thinking behind its development.
Experienced developers tend to forget about drag-and-drop tools. There’s this unspoken rule in tech that once you can code, you should leave visual tools behind. Like they’re training wheels you’re supposed to outgrow. But what if this is just dogma that’s costing us development time? I’ve always believed visual tools weren’t just training wheels — they are power tools, even for the most experienced developers.
Let me be clear upfront: I’m not here to argue that code is inherently inferior to visual tools, or vice versa. Tools are just tools. The real goal is solving problems efficiently.

The Right Tool for the Job
My appreciation for visual tools started with Alteryx at my first data job. The ability to see data flow through transformations and click any step to inspect the data right there — it just made sense and helped me solve problems quickly.
As my coding skills grew, I did what everyone does — I went all-in on code. Python, Polars, SQL. And while I loved the power, extensibility, and flexibility, I kept noticing the same limitations.
The Limitation of Coding Blind
Take this real example using the Superstore Sales dataset:
import polars as pl
from polars import col

# Load the data lazily, letting Polars infer the schema from the first 1,000 rows
df = pl.scan_csv("data/superstore_sales.csv", infer_schema_length=1000)

# Step 1: Clean up column names and parse the date columns
df_clean = (
    df.rename(
        {
            "Order Date": "order_date",
            "Ship Date": "ship_date",
            "Customer ID": "customer_id",
            "Order ID": "order_id",
            "Product ID": "product_id",
        }
    )
    .with_columns(
        [
            col("order_date").str.strptime(pl.Date, "%d/%m/%Y").alias("order_date"),
            col("ship_date").str.strptime(pl.Date, "%d/%m/%Y").alias("ship_date"),
        ]
    )
)

# Step 2: Add a few calculated fields
df_enriched = df_clean.with_columns([
    col("order_date").dt.year().alias("year"),
    col("order_date").dt.month().alias("month"),
])

# Step 3: Aggregate - where did my columns go?
df_summary = df_enriched.group_by(["Region", "Category", "month"]).agg([
    col("Sales").sum().alias("total_sales"),
    col("order_id").n_unique().alias("order_count"),
])

# Step 4: Pivot for analysis
df_pivot = df_summary.collect().pivot(
    values="total_sales",
    index=["Region", "month"],
    on="Category",
)
We’ve all been there — after a few transformations, you’re left asking:
Did the pivot create ‘Technology’ or ‘technology’? Is it ‘Office Supplies’ or ‘Office_Supplies’? Do my dates have actual date types?
What do you do? You end up adding df.schema, df.columns, df.collect() at random places in your code. This constant interruption of checking columns, verifying types, and confirming transformations adds mental load. Worse, that stray debug code tends to end up in your production environment.
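For illustration, the debug sprinkle usually looks something like this (collect_schema() is the current Polars spelling for lazy frames; older versions expose .schema directly):
# Throwaway checks wedged between steps, just to see where things stand
print(df_clean.collect_schema())            # did strptime actually produce Date columns?
print(df_summary.collect_schema().names())  # is it "Region" or "region" now?
print(df_pivot.columns)                     # which names did the pivot generate?
print(df_pivot.head())                      # eyeball a few rows to be sure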
Why Visual Tools Actually Matter
This is where the engineering value of a visual layer becomes clear. It’s not about the drag-and-drop; it’s about stateful, real-time schema validation. It’s the ability to inspect the state of your DataFrame at every step of the transformation pipeline without littering your code with print statements or breakpoints.
Stop thinking of visual tools as training wheels. Think of them as schema debuggers that happen to have a GUI.
Visual tools had the clarity I wanted, but they always felt almost there. There are definitely things most visual tools are missing: proper error handling, extensibility, and modularity. And vendor lock-in can be a real bottleneck. I wanted a visual tool that could do what code could do, but I couldn’t find one.
So I thought: “Let me see if I can build one.”
That’s how I ended up building Flowfile. Not to sell you on another tool, but because I was curious whether I could create something that captured what made visual tools so powerful while keeping the flexibility of code. (And yes, those try-except nodes are still on my TODO list.)
What I learned from building a visual ETL tool is that the future isn’t choosing between visual or code — it’s having both. Here’s what my workflow looks like now:
- Prototype Visually: When exploring new data or validating logic, I use the visual interface. I can see schemas instantly, test transformations, and iterate quickly.
- Export to Code: Once I know what I want, I export to clean Python/Polars code. No vendor lock-in, no proprietary formats — just code.
- Analyze Visually: When I’m in a meeting and need an answer now, I fall back to the visual tool for maximum speed.
The magic is being able to switch between modes based on what makes sense. Exploring data? Visual. Writing complex logic? Code. Debugging a pipeline? Visual. Deploying to production? Code.
The Schema Guessing Game
Remember that pivot operation that left us guessing? Let’s say we want to calculate what percentage of sales comes from Technology products. In code, you’d have to check the schema first, then write your calculation and hope you got the column names right.
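To make that concrete, here’s what the calculation looks like in plain Polars, continuing the snippet from earlier. The category column names below (‘Furniture’, ‘Office Supplies’, ‘Technology’) are my guesses at what the pivot produced; a wrong guess only surfaces when the code runs:
# Guessing that the pivot created these exact columns - a wrong guess
# fails with a ColumnNotFoundError only at runtime
category_cols = ["Furniture", "Office Supplies", "Technology"]

df_pct = df_pivot.with_columns(
    (col("Technology") / pl.sum_horizontal(category_cols) * 100)
    .alias("technology_pct")
)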
Here’s the same pipeline in Flowfile:
import flowfile as ff
# ... same transformations as above, but with Flowfile's API
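# For example, the load step might look like this, assuming Flowfile mirrors
# Polars' IO functions (ff.scan_csv is my assumption, not verified):
#   df = ff.scan_csv("data/superstore_sales.csv")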
# Open the visual editor to see what we're working with
ff.open_graph_in_editor(df_pivot.flow_graph)
This opens the pipeline in your browser. Now watch what happens when I add that percentage calculation:

Look at that — the formula builder shows me exactly what columns exist. I can see ‘Office Supplies’ has that space. Auto-complete helps me get the names right. And I can validate that the calculation works before running anything.
This isn’t about avoiding code. It’s about not wasting time on problems that shouldn’t exist. The “what columns do I have now?” question isn’t a challenging technical problem — it’s a tooling gap.
The Bottom Line
After 5 years of coding and building my own visual tool, here’s what I know:
- Treat the visual layer as a power tool — The ability to instantly inspect your data’s state isn’t a shortcut for beginners; it’s a productivity multiplier.
- The future is “both-code” — seamlessly switching between visual and code based on what makes sense in the moment. And critically, your logic should never be held hostage. If you’re using Alteryx, Flowfile, or any visual tool, that’s great — just make sure you can convert your work to code too; you never know when you’ll need a feature the framework doesn’t support. Whether you stay with a tool or leave should be your choice, not something forced by vendor lock-in.
- Demand more from your data tooling — Think about it: we have type hints, schema inference, lazy evaluation, LSPs that can autocomplete everything… yet after a simple operation, our IDEs have no idea what columns exist in our DataFrame (see the short sketch after this list). We’ve built incredible developer tools for everything except understanding our actual data structures. We can do better!
- Seeing is understanding — We’ve gotten so good at abstracting complexity that we forget this simple truth.
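A tiny sketch of that blind spot, reusing the pivot from earlier:
# Your editor happily autocompletes the .select method itself...
df_pivot.select("Technology")
# ...but whether a "Technology" column exists after the pivot is invisible
# to it; a wrong name only surfaces as a ColumnNotFoundError at runtime.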
The schema guessing game isn’t a rite of passage — it’s a time sink. Whether you’re debugging a complex pipeline, exploring new data, or explaining logic to stakeholders, sometimes the best code is the code you didn’t have to write.
If your IDE can autocomplete every class in your codebase but can’t tell you what columns exist after a join, maybe it’s time for better data tooling.
What’s your take? Is the ‘schema guessing game’ a rite of passage, or a tooling gap we should have solved by now? I’d love to hear your thoughts in the comments.
And if you’re interested in the ideas behind Flowfile, check out the project on GitHub. We’re just getting started, and we’d love for you to be a part of the conversation.