Comparative Study of eBay’s Distillation Pipeline vs. Google’s Training-Free STAR Method for Cost-Efficient LLM-Based Recommendation Systems

Introduction
Running large language models in production always looks good on paper, right up until the bill shows up. If you’ve been through it, it’s the same story every time. You drop in an LLM, it crushes the task, the team’s impressed, and for a moment, you feel like you’ve solved the problem. Then the invoices start rolling in, and the conversation shifts. Suddenly, it’s less about accuracy and more about cost, latency, and whether this setup has any chance of surviving outside a demo environment.
At eBay, the challenge came up in advertising. Sellers bid on keyphrases, and recommending the right ones makes a huge difference in whether ads actually convert. The problem was that the standard training signal was noisy and biased. Large models could cut through that mess and make solid judgments, but running them live wasn’t an option. The workaround was to let the big model act as a teacher. It labeled data, passed that judgment down to a cross-encoder, and eventually into a small bi-encoder that could run cheaply at scale. That pipeline, which they called LLMDistill4Ads[1], turned the idea from an expensive experiment into a production system that moved real business metrics.
Google’s team looked at the same landscape and went in almost the opposite direction. Instead of teaching smaller models, they asked what happens if you don’t train anything at all. Their STAR[2] method showed that you can get impressively strong recommendations by combining two pieces: semantic embeddings from an LLM and collaborative patterns from user history. The system then handed candidates back to the LLM for a pairwise ranking step. No retraining, no fine-tuning, just leaning on the knowledge the model already carries. It turned out that, for many datasets, this training-free setup could rival the performance of models that had gone through months of supervised training.
When it comes to building recommendation systems with LLMs, you really hit a fork in the road. One way is putting in the work upfront to distill the knowledge into smaller models that can run fast and cheaply at scale. The other is skipping training altogether and leaning on the raw reasoning power of the LLM as it stands. Neither path is universally right. The real question is what fits your situation, whether you’re trying to spin up quick prototypes and show value fast, or you’re designing something that has to handle billions of requests without blowing through the budget.
What I’ll cover here are both of these approaches, side by side, with some hard-won lessons about where each one tends to shine. The goal is to make it easier to decide when going training-free makes sense and when it’s worth investing in distillation so the system stays cost-effective at scale.
eBay’s Distilled Approach to Cleaner Recommendations
Anyone who has worked on recommender systems knows that click data isn’t clean. It’s easy to assume a missed click means the item was irrelevant, but in practice, clicks depend on where the item was shown, how the auction played out, and a dozen other factors unrelated to the item itself. At eBay, this bias, especially the “middleman bias” introduced by eBay Search filtering keyphrases before they even reach buyers, was killing the quality of advertiser keyphrase recommendations. Sellers would get suggested phrases that looked mathematically sound but didn’t align with what real buyers wanted. That mismatch meant wasted ad spend and unhappy sellers.
The team’s answer wasn’t to throw away click data but to correct it. They brought in a large language model as a kind of “proxy judge.” Unlike raw click data, which suffers from sample selection bias because only keyphrases passing the Search relevance filter ever appear in the logs, the LLM evaluated item–keyphrase pairs directly. Instead of treating every skipped click as a negative signal, the LLM issued a binary yes/no relevance judgment for each pair; a cross-encoder trained on those labels then produced calibrated, continuous relevance scores that were less noisy than the raw logs. These judgments weren’t perfect, but they represented seller and buyer intent more faithfully because they sidestepped the biases baked into click data.
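To make the “proxy judge” idea concrete, here is a minimal sketch of what such a labeling call could look like. The prompt wording, the judge_relevance helper, and the call_llm hook are illustrative assumptions, not eBay’s actual implementation.

```python
# Hedged sketch of the "proxy judge" step. The prompt wording, judge_relevance
# helper, and call_llm hook are hypothetical, not eBay's actual implementation.

def build_judge_prompt(item_title: str, keyphrase: str) -> str:
    return (
        "You are judging advertiser keyphrases for an e-commerce listing.\n"
        f"Listing title: {item_title}\n"
        f"Candidate keyphrase: {keyphrase}\n"
        "Would a buyer searching this keyphrase plausibly want this listing? "
        "Answer with exactly one word: yes or no."
    )

def judge_relevance(item_title: str, keyphrase: str, call_llm) -> int:
    """Return 1 if the LLM judges the pair relevant, else 0."""
    answer = call_llm(build_judge_prompt(item_title, keyphrase))
    return 1 if answer.strip().lower().startswith("yes") else 0

# Usage with any text-in/text-out LLM client passed in as `call_llm`:
# label = judge_relevance("Apple iPhone 13 128GB Blue", "iphone 13 unlocked", call_llm)
```

Labels like these are generated offline in batch, which is what makes it affordable to run a large model over millions of pairs.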

Of course, running a large LLM at query time wasn’t feasible at scale. So the solution became a three-step teacher-assistant-student pipeline. The big model served as the teacher, generating relevance labels for millions of item-keyphrase pairs. Then, a cross-encoder assistant model was fine-tuned on these LLM labels to better capture complex item-keyphrase interactions and produce soft output scores. Finally, a lightweight bi-encoder student distilled this knowledge, compressing it into fast, compact embeddings suitable for real-time retrieval across billions of items. By the time requests hit production, the heavy lifting was already done offline.
To keep the student model grounded and robust, it was trained with a multi-task hybrid approach that combined multiple signals: traditional CTR data (where positive clicks are reliable but negatives are noisy), internal Search Relevance scores (which provide a less biased supervision aligned with auction dynamics), the LLM’s judgments, and the cross-encoder’s distilled outputs. This combination helped the bi-encoder understand what truly mattered without overfitting to noise in any single signal. They also found that typical pointwise losses like mean squared error (MSE) didn’t suffice for distillation. Instead, a correlation-based ranking loss (Pearson correlation loss) better captured the relative ordering of relevance scores, significantly improving ranking quality.
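To illustrate the loss choice, here is a minimal sketch (PyTorch, with my own naming) of a Pearson-correlation-based distillation loss. It captures the idea of matching the teacher’s relative ordering rather than its absolute values; treat it as a sketch under that assumption, not the paper’s exact implementation.

```python
import torch

def pearson_correlation_loss(pred: torch.Tensor, target: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """1 - Pearson correlation between student predictions and teacher scores.

    Minimizing this pushes the student to reproduce the *relative ordering*
    of the teacher's scores within a batch, rather than their absolute
    values the way MSE would.
    """
    pred_c = pred - pred.mean()
    target_c = target - target.mean()
    corr = (pred_c * target_c).sum() / (pred_c.norm() * target_c.norm() + eps)
    return 1.0 - corr

# Example: cross-encoder (teacher) scores vs. bi-encoder (student) scores for one batch.
teacher_scores = torch.tensor([0.9, 0.2, 0.7, 0.4])
student_scores = torch.tensor([0.8, 0.1, 0.6, 0.5], requires_grad=True)
loss = pearson_correlation_loss(student_scores, teacher_scores)
loss.backward()
```

In a multi-task setup, a loss like this would be one term among several, alongside the CTR and Search Relevance signals described above.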
Additionally, to reduce latency and computational cost in the nearest neighbor search, they applied Matryoshka embeddings[3] — a hierarchical embedding strategy that allows using smaller sub-vectors without sacrificing accuracy. This technique proved effective in balancing retrieval speed and precision.
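The practical appeal of Matryoshka embeddings[3] is that a prefix of the full vector is already a usable embedding, so a cheap low-dimensional pass can shortlist candidates before the full vector re-scores them. A hedged sketch of that two-stage pattern, with dimensions and data entirely illustrative:

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of Matryoshka-style embeddings and
    re-normalize so cosine similarity still behaves sensibly."""
    sub = embeddings[:, :dims]
    norms = np.linalg.norm(sub, axis=1, keepdims=True) + 1e-12
    return sub / norms

# Illustrative data: 10k items with 768-d embeddings (random placeholders).
item_embeddings = np.random.randn(10_000, 768).astype(np.float32)
query_embedding = np.random.randn(1, 768).astype(np.float32)

# Coarse pass: shortlist with a cheap 128-d prefix of the same embeddings...
coarse_items = truncate_and_renormalize(item_embeddings, 128)
coarse_query = truncate_and_renormalize(query_embedding, 128)
shortlist = np.argsort(-(coarse_items @ coarse_query.T).ravel())[:100]

# ...then re-score only the shortlist with the full 768-d vectors.
full_items = truncate_and_renormalize(item_embeddings[shortlist], 768)
full_query = truncate_and_renormalize(query_embedding, 768)
reranked = shortlist[np.argsort(-(full_items @ full_query.T).ravel())]
```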
The production system architecture reflects these design choices, comprising offline batch inference over ~2.3 billion items and near real-time inference for new or updated listings. The bi-encoder embeddings are precomputed and indexed for fast approximate nearest neighbor retrieval, enabling efficient suggestions at scale.
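In code terms, that architecture boils down to two paths sharing one precomputed index: a bulk offline path that embeds the catalog, and a near real-time path that embeds new or updated listings on arrival and queries the same index. The sketch below is a toy stand-in, using exact dot-product search instead of a production ANN index, and it indexes keyphrases and queries with item embeddings purely for concreteness; the class and function names are my own.

```python
import numpy as np

class KeyphraseIndex:
    """Toy stand-in for a production ANN index: exact dot-product search
    over keyphrase embeddings computed offline by the bi-encoder."""

    def __init__(self, keyphrase_embeddings: np.ndarray, keyphrases: list[str]):
        self.emb = keyphrase_embeddings   # assumed row-normalized
        self.keyphrases = keyphrases

    def top_k(self, item_embedding: np.ndarray, k: int = 10) -> list[str]:
        scores = self.emb @ item_embedding
        return [self.keyphrases[i] for i in np.argsort(-scores)[:k]]

def recommend_for_listing(title: str, encode, index: KeyphraseIndex, k: int = 10) -> list[str]:
    """Near real-time path: embed a new or updated listing (`encode` is a
    placeholder for the distilled bi-encoder) and query the prebuilt index."""
    return index.top_k(encode(title), k)

# Offline path (run in batch): embeddings for the full catalog and keyphrase
# corpus are precomputed with the same bi-encoder and loaded into the index.
```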
In live A/B experiments, the new model replaced the CTR-only baseline. While it showed only a directional, non-significant uplift in raw clicks, it delivered a statistically significant 51% increase in Gross Merchandise Volume Bought (GMB), meaning the same number of clicks converted into far more sales. The return on ad spend (ROAS) also improved by nearly 39%, demonstrating that better keyphrase relevance directly benefits sellers’ bottom lines.
In summary, this research[1] highlights the pitfalls of relying solely on click data due to inherent biases like middleman/sample selection bias and exposure bias. By leveraging LLMs as proxy judges to provide cleaner relevance signals, combined with a teacher-assistant-student distillation pipeline and multi-task training on diverse labels, the team substantially enhanced advertiser keyphrase recommendations at eBay. The study further underscores the importance of careful pre-training of base models and the selection of appropriate ranking-based loss functions for effective knowledge distillation. Ultimately, this approach delivers practical business impact by improving both seller satisfaction and advertising efficiency.
Skipping Training Altogether: Google’s STAR Approach
Training recommenders is usually a grind. You wrestle with biased data, tweak loss functions, and watch weeks of GPU time disappear just to squeeze out a tiny gain over the baseline. The Google team took a step back and asked a blunt question: What if you just skip all of that? Their Simple Training-free Approach for Recommendation (STAR)[2] was their attempt at answering it. The surprising part was that it actually worked. They showed you could get solid recommendations from an LLM without ever touching a training loop.

Broadly, there are four ways to put an LLM to work on recommendations:
- Prompting: just ask the model. Quick, but it can’t tap into collaborative signals.
- Fine-tuning: train the model on user–item data. Powerful, but expensive and data-hungry.
- Feature encoder: use the LLM for embeddings, then train a smaller model on top. Lighter, but it still needs training data.
- STAR: skip training altogether by blending collaborative knowledge with semantic signals.

The idea was simple but effective. Instead of training, they leaned on what the model already knew. Each item was turned into an embedding using a pre-trained LLM, with metadata like the title, category, and description folded into the prompt. That gave the system a semantic picture of every product. On the other side, they still borrowed from the old playbook, looking at items that tended to show up together in user histories. STAR blended those two signals and put extra weight on recent activity, since yesterday’s clicks usually say a lot more about what someone wants than the ones from two years ago.
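Under some assumptions about how the two signals are stored (pairwise similarity matrices, history ordered most-recent-first), the blending step might look roughly like the sketch below. The parameter names and default values are illustrative, not the paper’s tuned settings.

```python
import numpy as np

def star_retrieval_score(
    candidate: int,
    history: list[int],          # user's items, most recent first
    semantic_sim: np.ndarray,    # [n_items, n_items] cosine sim of LLM embeddings
    collab_sim: np.ndarray,      # [n_items, n_items] normalized co-occurrence counts
    weight_semantic: float = 0.5,
    recency_decay: float = 0.7,
) -> float:
    """Blend semantic and collaborative similarity, discounting older history items.

    This mirrors the kind of weighted, recency-decayed combination STAR
    describes; the defaults here are placeholders, not the paper's values.
    """
    score = 0.0
    for t, hist_item in enumerate(history):
        blended = (weight_semantic * semantic_sim[candidate, hist_item]
                   + (1 - weight_semantic) * collab_sim[candidate, hist_item])
        score += (recency_decay ** t) * blended
    return score / max(len(history), 1)
```

Scoring every candidate this way produces the shortlist that gets handed to the LLM for the ranking stage described next.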
That blend produced a candidate list, but the real gain came from the ranking stage. Instead of trusting raw scores, they passed the candidates back to the LLM and asked it to compare items pair by pair. This pairwise reasoning turned out to be the sweet spot. It pushed accuracy well past what simple embedding similarity could offer, without requiring any retraining.
The whole thing sounds almost too simple, but the results held up. On standard benchmarks, STAR beat out many supervised models that had been trained from scratch. And it did so with almost no setup, just embeddings, co-occurrence counts, and a clever way of asking the LLM to make judgments.
Ranking Prompt Example
The example below shows how the windowed, list-wise variant of the ranking pipeline works in practice (pairwise ranking is the same mechanism with a window of two). We’re using a window size of 4 and a stride of 2 (w = 4, d = 2), and the user has 3 items in their purchase history (l = 3).
The prompt is structured in three parts:
- History items: what the user has bought before. This helps the model understand preferences and context.
- Candidate items: new items that we want to rank. Each one comes with metadata such as category, price, and sales rank, plus co-occurrence counts that capture how often it was bought alongside the history items.
- Ranking instruction: a final request asking the model to order the candidates by relevance, returning only the ranked list in JSON format.
The process also uses a sliding window. With w = 4 and d = 2, the window takes in 4 candidate items at a time and then moves forward 2 items to create the next pass. This overlapping window lets the system refine rankings in stages rather than trying to handle a very large candidate set all at once.
System: You are an intelligent assistant that can rank items based on the user’s preference.
User: User 8921 has purchased the following items in this order:
{
"Item ID": 501,
"title": "The Pragmatic Programmer: Your Journey to Mastery, 20th Anniversary Edition",
"salesRank_Books": 342,
"categories": [
["Books", "Computers & Technology", "Programming"]
],
"price": 39.99,
"author": "Andrew Hunt"
},
{
"Item ID": 742,
"title": "Clean Code: A Handbook of Agile Software Craftsmanship",
"salesRank_Books": 278,
"categories": [
["Books", "Computers & Technology", "Programming Languages", "Software Engineering"]
],
"price": 32.95,
"author": "Robert C. Martin"
},
{
"Item ID": 823,
"title": "Design Patterns: Elements of Reusable Object-Oriented Software",
"salesRank_Books": 654,
"categories": [
["Books", "Computers & Technology", "Programming", "Software Design"]
],
"price": 44.99,
"author": "Erich Gamma"
}
I will provide you with 4 items, each indicated by a number identifier [].
Analyze the user’s purchase history to identify preferences and purchase
patterns. Then, rank the candidate items based on their alignment with
the user’s preferences and other contextual factors.
Assistant: Okay, please provide the items.
User: [1]
{
"title": "Refactoring: Improving the Design of Existing Code",
"salesRank_Books": 513,
"categories": [
["Books", "Computers & Technology", "Software Engineering"]
],
"price": 47.99,
"author": "Martin Fowler",
"Number of users who bought both this item and Item ID 501": 36,
"Number of users who bought both this item and Item ID 742": 29,
"Number of users who bought both this item and Item ID 823": 33
}
Assistant: Received item [1].
User: [2]
{
"title": "Introduction to Algorithms, Fourth Edition",
"salesRank_Books": 128,
"categories": [
["Books", "Computers & Technology", "Algorithms"]
],
"price": 99.95,
"author": "Thomas H. Cormen",
"Number of users who bought both this item and Item ID 501": 22,
"Number of users who bought both this item and Item ID 742": 18,
"Number of users who bought both this item and Item ID 823": 25
}
Assistant: Received item [2].
User: [3]
{
"title": "Working Effectively with Legacy Code",
"salesRank_Books": 846,
"categories": [
["Books", "Computers & Technology", "Programming"]
],
"price": 54.99,
"author": "Michael Feathers",
"Number of users who bought both this item and Item ID 501": 31,
"Number of users who bought both this item and Item ID 742": 27,
"Number of users who bought both this item and Item ID 823": 29
}
Assistant: Received item [3].
User: [4]
{
"title": "Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software",
"salesRank_Books": 295,
"categories": [
["Books", "Computers & Technology", "Programming", "Software Design"]
],
"price": 49.99,
"author": "Eric Freeman",
"Number of users who bought both this item and Item ID 501": 40,
"Number of users who bought both this item and Item ID 742": 35,
"Number of users who bought both this item and Item ID 823": 38
}
Assistant: Received item [4].
User: Analyze the user’s purchase history to identify user preferences and
purchase patterns. Then, rank the 4 items above based on their alignment
with the user’s preferences and other contextual factors. All the items should
be included and listed using identifiers, in descending order of the
user’s preference. The most preferred recommendation item should be
listed first. The output format should be [] > [], where each [] is an
identifier, e.g., [1] > [2]. Only respond with the ranking results.
Do not say a word or explain. Output in the following JSON format:
{
"rank": "[] > [] .. > []"
}
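To tie the window parameters back to the prompt above: each LLM call re-ranks w = 4 neighboring candidates, and the window then advances by d = 2 so consecutive passes overlap. A hedged sketch of that loop, where rank_window stands in for sending a prompt like the one above and parsing the returned “[i] > [j] > …” ranking:

```python
def sliding_window_rerank(candidates: list[dict], rank_window, w: int = 4, d: int = 2) -> list[dict]:
    """Re-rank a candidate list with overlapping LLM calls.

    `rank_window` is any callable that takes up to `w` candidate dicts and
    returns the same dicts in preference order. Here the passes start at the
    bottom of the list and advance by `d` each time (direction is a design
    choice), so strong items can bubble toward the top across overlapping windows.
    """
    ranked = list(candidates)
    start = max(len(ranked) - w, 0)
    while True:
        window = ranked[start:start + w]
        ranked[start:start + w] = rank_window(window)   # one LLM call per window
        if start == 0:
            break
        start = max(start - d, 0)
    return ranked

# With w = 2 and a stride of 1 this degenerates into the pairwise comparisons
# mentioned earlier; larger windows trade more context per call for fewer calls.
```

Note that every window still costs one LLM call, which is exactly where the economics of this approach start to bite.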
The downside of this approach is cost. Having an LLM in the loop at ranking time means you’re paying for every comparison, and that adds up quickly. It’s not something you’d drop straight into a production ad system with billions of requests. But for prototyping, cold-start problems, or domains where you don’t have enough data to train a reliable model, STAR[2] proved that you can get surprisingly far without spinning up a single training job.
Two Roads, Same Goal
Looking at both systems side by side, you can see they’re solving the same pain point: how do you get the judgment of a large model into a recommender without going broke? They just attack it from opposite angles.
eBay leaned on distillation, pulling the knowledge from the big model into smaller ones that could run cheaply at massive scale. Google skipped the training grind entirely and showed that if you’re clever with embeddings and prompts, you can get a lot of mileage straight from the model as it is.
Neither path is “better” across the board. One trades upfront training cost for cheap inference. The other avoids training altogether but carries heavier runtime costs. Which one makes sense depends on whether you’re running a billion ads a day or just trying to get a new recommender off the ground.
This is where the contrast gets interesting, because it isn’t just about cost; it’s about bias, scale, and how much control you want over the final system. The trade-offs line up roughly like this:
- Training effort: LLMDistill4Ads requires an offline labeling and distillation pipeline; STAR requires none.
- Serving cost: the distilled bi-encoder is cheap enough to cover billions of items; STAR pays for LLM calls at ranking time.
- Bias handling: eBay explicitly corrects the biases in click logs with LLM judgments; STAR avoids training data, and its biases, altogether.
- Best fit: distillation for high-volume production systems; the training-free route for prototypes, cold-start scenarios, and data-poor domains.
Conclusion
So, bottom line? There’s no magic formula for building a rock-solid recommendation system with LLMs; it all comes down to what fits your setup. Sometimes you don’t have a mountain of data, or the time (or, honestly, the patience) to train from scratch. That’s where something like STAR comes in handy: no training, decent results out of the box. On the flip side, if you’re ready to invest, you have distillation pipelines like LLMDistill4Ads. You take a huge model, squeeze out the judgment you need, correct the biases in your logs, and end up with something lean that can handle billions of items without torching your budget. It isn’t one-size-fits-all; you just have to pick the trade-off that matches your scale.
References
[1] Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li (2025). LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay.
[2] Dong-Ho Lee, Adam Kraft, Long Jin, Nikhil Mehta, Taibai Xu, Lichan Hong, Ed H. Chi, Xinyang Yi (2024). STAR: A Simple Training-free Approach for Recommendations using Large Language Models. https://arxiv.org/abs/2410.16458
[3] Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, Ali Farhadi (2022). Matryoshka Representation Learning. https://arxiv.org/abs/2205.13147