Recommendation systems and search have historically drawn inspiration from language modeling. For example, they adopted Word2vec to learn item embeddings (for embedding-based retrieval) and used GRUs, Transformers, and BERT to predict the next best item (for ranking). The current paradigm of large language models is no different.
Here, we’ll discuss how industrial search and recommendation systems have evolved over the past year or so and cover model architectures, data generation, training paradigms, and unified frameworks:
- LLM/multimodality-augmented model architecture
- LLM-assisted data generation and analysis
- Scaling Laws, transfer learning, distillation, LoRAs, etc.
- Unified architectures for search and recommendations
LLM/multimodality-augmented model architecture
Recommendation models are increasingly adopting language models and multimodal content to overcome the limitations of traditional ID-based approaches. These hybrid architectures combine content understanding with the strengths of behavioral modeling, addressing the common challenges of cold-start and long-tail item recommendations.
Semantic IDs (YouTube) explores content-derived features as substitutes for traditional hash-based IDs. This approach targets difficulties in predicting user preferences for new and infrequently interacted items. Their solution involves a two-stage framework.
In the first stage, a transformer-based video encoder (similar to Video-BERT) generates dense content embeddings. These embeddings are then compressed into discrete Semantic IDs through a Residual Quantization Variational AutoEncoder (RQ-VAE). Representing user histories with these compact semantic IDs—a few integers rather than high-dimensional embeddings—significantly improves efficiency. Once trained, the RQ-VAE is frozen and used to generate Semantic IDs for the second stage to train a production-scale ranking model.
The RQ-VAE itself is a single-layer encoder-decoder structure with a 256-dimensional latent space. It has eight quantization levels with a codebook of 2048 entries per level. The encoder maps content embeddings to a latent vector, while a residual quantizer discretizes this vector, and the decoder reconstructs the original embedding. The initial embeddings originate from a transformer with a VideoBERT backbone, producing detailed, 2048-dimensional representations that capture the topical content of each video.
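To make the quantization step concrete, here is a minimal sketch of residual quantization with the dimensions above (8 levels, 2048 codes per level, 256-dimensional latents). The codebooks are random stand-ins rather than learned RQ-VAE parameters, and the function name is illustrative.

```python
import numpy as np

NUM_LEVELS, CODEBOOK_SIZE, LATENT_DIM = 8, 2048, 256
rng = np.random.default_rng(0)
# In the paper, codebooks are learned jointly with the RQ-VAE encoder-decoder;
# random codebooks stand in here so the snippet runs on its own.
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, LATENT_DIM))

def to_semantic_id(latent: np.ndarray) -> list[int]:
    """Greedily quantize a 256-d latent into one code index per level."""
    residual = latent.copy()
    semantic_id = []
    for level in range(NUM_LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)  # distance to every code
        code = int(dists.argmin())                                   # nearest codebook entry
        semantic_id.append(code)
        residual = residual - codebooks[level][code]                 # quantize what's left over
    return semantic_id

latent = rng.normal(size=LATENT_DIM)  # in practice, the RQ-VAE encoder's output for one video
print(to_semantic_id(latent))         # eight small integers instead of a 2048-d float vector
```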
To integrate Semantic IDs into ranking models, the authors propose two techniques: an N-gram-based approach that groups Semantic ID codes into fixed-length sequences, and a SentencePiece Model (SPM)-based method that adaptively learns variable-length subwords. The ranking model is a multi-task production ranking model that recommends the next video to watch given the current video and user history.
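As a rough illustration (not the paper's exact implementation), the N-gram approach can be thought of as hashing adjacent Semantic ID codes into their own embedding tables; the SPM variant would instead tokenize the code sequence with a learned SentencePiece vocabulary of variable-length pieces.

```python
K = 2048                                               # codebook size per level
semantic_id = [1021, 77, 1930, 5, 874, 640, 12, 300]   # the 8 codes for one video

# Bigram variant: pair adjacent codes, each pair indexes an embedding table of size K^2.
bigrams = [(semantic_id[i], semantic_id[i + 1]) for i in range(0, len(semantic_id), 2)]
bigram_rows = [a * K + b for a, b in bigrams]          # 4 lookups instead of 8 unigram lookups
print(bigram_rows)
```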
Results: Directly using the dense content embeddings performed worse than using random hash IDs. The authors hypothesize that ranking models heavily rely on memorization from the ID-based embedding tables—replacing these with fixed dense content embeddings led to poorer CTR. However, both N-gram and SPM methods did better than random hashing, especially in cold-start scenarios. Ablation tests revealed that while N-gram approaches had a slight advantage when embedding table sizes were limited (e.g., $8 \times K$ or $4 \times K^2$), SPM methods offered superior generalization and efficiency with larger embedding tables.
Dense content embeddings (dashed lines) perform worse than random hashing (solid orange).
Similarly, M3CSR (Kuaishou) introduces multimodal content embeddings (visual, textual, audio) clustered via K-means into trainable category IDs. This transforms static content embeddings into adaptable, behavior-aligned representations.
The M3CSR framework has a dual-tower architecture, splitting user-side and item-side towers to optimize for online inference efficiency where user and item embeddings can be pre-computed and indexed via approximate nearest neighbor indices. Item embeddings are derived from multimodal pretrained models—ResNet for visual, Sentence-BERT for text, and VGGish for audio—and concatenated into a single embedding vector. These vectors are then clustered using K-means (with approximately 1,000 clusters from over 10 million videos).
Next, cluster IDs and content embeddings pass through a Modal Encoder, which uses a dense network to learn the mapping from content space to behavior space and a cluster ID lookup to assign each cluster a trainable embedding.
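A simplified sketch of the item-side pipeline, with assumed embedding dimensions for the ResNet/Sentence-BERT/VGGish outputs, random vectors standing in for real content, and the two Modal Encoder outputs combined by addition for brevity:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

N_ITEMS, N_CLUSTERS = 2_000, 1_000   # the paper clusters 10M+ videos into ~1,000 clusters
visual, text, audio = (np.random.randn(N_ITEMS, d) for d in (2048, 768, 128))
content = np.concatenate([visual, text, audio], axis=1).astype(np.float32)

cluster_ids = KMeans(n_clusters=N_CLUSTERS, n_init=1).fit_predict(content)  # offline step

class ModalEncoder(nn.Module):
    def __init__(self, content_dim: int, behavior_dim: int = 64):
        super().__init__()
        # dense network: maps frozen content embeddings into the behavior space
        self.mlp = nn.Sequential(nn.Linear(content_dim, 256), nn.ReLU(), nn.Linear(256, behavior_dim))
        # trainable cluster ID embedding, updated from behavioral signals
        self.cluster_emb = nn.Embedding(N_CLUSTERS, behavior_dim)

    def forward(self, content_vec, cluster_id):
        return self.mlp(content_vec) + self.cluster_emb(cluster_id)  # combined by addition here

encoder = ModalEncoder(content.shape[1])
item_vec = encoder(torch.from_numpy(content[:4]),
                   torch.as_tensor(cluster_ids[:4], dtype=torch.long))
print(item_vec.shape)  # torch.Size([4, 64])
```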
On the user side, M3CSR trains sequential models on user behavior sequences to capture user preferences. In addition, to model user modality preferences, the framework concatenates general behavioral interests with modality-specific interests. The latter are derived by converting item IDs back into their multimodal embeddings using the same Modal Encoder.
Results: M3CSR outperformed several multimodal baselines such as VBPR, MMGCN, and LATTICE. Ablation studies highlighted the importance of modeling modality-specific user interests and showed that multimodal features consistently beat single-modal features across datasets (Amazon, TikTok, Allrecipes). In A/B tests, clicks increased by 3.4%, likes by 3.0%, and follows by 3.1%. M3CSR also improved cold-start performance, with a 1.2% boost in cold-start velocity and a 3.6% increase in cold-start video coverage.
FLIP (Huawei) shows how to align ID-based recommendation models with LLMs by jointly learning from masked tabular and language data. The core idea is to reconstruct masked features in one modality (user and item IDs) using information from the other modality (text tokens), ensuring tight cross-modal alignment.
FLIP operates in three stages: modality transformation, modality alignment pretraining, and adaptive finetuning. First, tabular data is translated into text using structured prompt templates. Then, joint masked language/tabular modeling is conducted to achieve fine-grained alignment between modalities. During pretraining, textual data undergoes field-level masking (replacing entire fields with [MASK] tokens), while corresponding tabular features are masked by substituting feature IDs with [MASK].
FLIP trains two parallel models with three objectives: (i) Masked Language Modeling (MLM) predicts masked text tokens using complete tabular context; (ii) Masked Tabular Modeling (MTM) predicts masked feature IDs leveraging textual data; and (iii) Instance-level Contrastive Learning (ICL) aligns global representations across modalities.
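A hedged sketch of the first two stages as I read them: a tabular sample is rendered as text with a template (the template and field names below are illustrative, not FLIP's exact prompts), then a whole field is masked on the text side for MLM and a feature ID is masked on the tabular side for MTM.

```python
import random

row = {"user_id": 1042, "item_id": 587, "genre": "comedy", "hour": 21}
template = "user {user_id} clicked item {item_id} with genre {genre} at hour {hour}"

def mask_field(sample: dict, field: str) -> dict:
    return {k: ("[MASK]" if k == field else v) for k, v in sample.items()}

# MLM view: an entire field is masked in the text; the tabular features stay complete.
text_view = template.format(**mask_field(row, random.choice(list(row))))
# MTM view: a feature ID is masked in the tabular data; the text stays complete.
tabular_view = mask_field(row, random.choice(list(row)))

print(text_view)     # e.g., "user 1042 clicked item [MASK] with genre comedy at hour 21"
print(tabular_view)  # e.g., {'user_id': 1042, 'item_id': '[MASK]', ...}
```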
Finally, the aligned models—TinyBERT as the LLM and DCNv2 as the ID-based model—are finetuned on the downstream click-through rate (CTR) prediction task. To do this, FLIP adds randomly initialized output layers on both models to estimate click probabilities. The final prediction is a weighted sum of both models’ outputs, where the weights are learned adaptively during training.
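A minimal sketch of the adaptive fusion at finetuning time (my reading of the setup, not the paper's code); the hidden sizes are assumptions for the TinyBERT and DCNv2 representations.

```python
import torch
import torch.nn as nn

class AdaptiveFusionCTR(nn.Module):
    def __init__(self, llm_dim: int, id_dim: int):
        super().__init__()
        self.llm_head = nn.Linear(llm_dim, 1)          # randomly initialized output layers
        self.id_head = nn.Linear(id_dim, 1)
        self.alpha = nn.Parameter(torch.tensor(0.0))   # learnable mixing weight

    def forward(self, llm_repr, id_repr):
        w = torch.sigmoid(self.alpha)                  # keep the weight in (0, 1)
        logit = w * self.llm_head(llm_repr) + (1 - w) * self.id_head(id_repr)
        return torch.sigmoid(logit)                    # predicted click probability

model = AdaptiveFusionCTR(llm_dim=312, id_dim=64)      # assumed representation sizes
p_click = model(torch.randn(2, 312), torch.randn(2, 64))
```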
Results: FLIP outperforms the baselines of ID-only, LLM-only, and ID+LLM models. Ablation studies show that (i) both MLM and MTM objectives improve performance, (ii) field-level masking is more effective than random token masking, and (iii) joint reconstruction between modalities is key.
Similarly, beeFormer demonstrates how to train language-only Transformers on user-item interaction data enriched with textual information. The goal is to bridge the gap between semantic similarity (from textual data) and interaction-based similarity (from user behavior).
beeFormer combines a sentence Transformer encoder for item embeddings with an ELSA (scalablE Linear Shallow Autoencoder)-based decoder that captures patterns from user-item interactions. First, item embeddings are generated through a Transformer trained on textual data. These embeddings are then used to compute user recommendations via ELSA’s low-rank approximation of the item-to-item weight matrix. The key here is to backpropagate the gradients from the recommendation loss through the Transformer model. As a result, weight updates capture interaction patterns rather than just semantic similarity.
To make training computationally feasible on large catalogs, beeFormer applies gradient checkpointing to manage memory usage, gradient accumulation for larger effective batch sizes, and negative sampling to focus training efficiently on relevant items.
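A rough sketch of this training loop (simplified, not the released beeFormer code): item embeddings A would come from the sentence Transformer, predictions follow ELSA's low-rank item-item model, and the recommendation loss backpropagates into the encoder. Random tensors stand in for encoder outputs and interactions so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

def elsa_scores(x, A):
    """x: (batch, n_items) interaction vectors; A: (n_items, d) item embeddings."""
    A = F.normalize(A, dim=-1)          # rows normalized, so diag(A @ A.T) = 1
    return (x @ A) @ A.T - x            # x @ (A A^T - I): remove self-similarity

n_items, d, batch = 1_000, 64, 8
A = torch.randn(n_items, d, requires_grad=True)    # stand-in for sentence Transformer outputs
x = (torch.rand(batch, n_items) < 0.01).float()    # sparse user-item interactions

loss = F.mse_loss(elsa_scores(x, A), x)            # simplified reconstruction objective
loss.backward()                                    # in beeFormer, gradients reach the Transformer here
```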
Results: Offline evaluations show that beeFormer surpasses baseline models like mpnet-base-v2 and bge-m3. However, the comparison is limited (and IMHO unfair) since the baselines weren’t finetuned on the training dataset. Interestingly, models trained across multiple domains (movies + books) performed better than domain-specific ones, suggesting that there was transfer learning across domains.
CALRec (Google) introduces a two-stage framework that finetunes a pretrained LLM (PaLM-2 XXS) for sequential recommendations. Both user interactions and model predictions are represented entirely through text.
First, all input (e.g., user-item interactions) is converted into text sequences by concatenating meaningful attributes (title, category, brand, price) into structured textual prompts. Attributes are formatted in the style of “Attribute name: Attribute description” and concatenated. At the end of the user history sequence, they append the item prefix, thus prompting the LLM to predict the user’s next purchase as a sentence completion task.
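A hedged illustration of this text-only formulation; the attribute names and separators are my guesses at the style, not CALRec's exact template.

```python
history = [
    {"Title": "Trail running shoes", "Category": "Sports", "Brand": "Acme", "Price": "$89"},
    {"Title": "Running socks (3-pack)", "Category": "Sports", "Brand": "Acme", "Price": "$15"},
]

def item_to_text(item: dict) -> str:
    return " ".join(f"{name}: {value}" for name, value in item.items())

# User history as concatenated attribute text, ending with an item prefix that
# the LLM completes as its prediction of the next purchase.
prompt = "\n".join(item_to_text(item) for item in history) + "\nTitle:"
print(prompt)
```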
CALRec has a two-stage finetuning approach. The first stage involves multi-category training to adapt the model to sequential recommendation patterns in a category-agnostic way. The second stage refines the model within specific item categories. The training objective combines next-item generation tasks (predicting textual descriptions of items) with auxiliary contrastive alignment. The former aims to generate the text description of the target item given the user’s history; the latter applies contrastive loss on the output of the separate user and item towers to align user history to target item representations.
During inference, the model is prompted to generate multiple candidates via temperature sampling. They remove duplicates, sort by the output’s log probabilities in descending order, and keep the top k candidates. Then, these textual predictions are matched to catalog items via BM25 and sorted by the matching scores.
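A small sketch of that inference path: dedupe the sampled generations, keep the top k by log-probability, then ground each text in the catalog with BM25 (the rank_bm25 package is used here for illustration; the catalog and scores are made up).

```python
from rank_bm25 import BM25Okapi

catalog = ["acme trail running shoes", "acme running socks 3 pack", "hydration belt"]
bm25 = BM25Okapi([doc.split() for doc in catalog])

generations = {"acme trail shoes": -1.2, "hydration belt for runners": -2.3}  # deduped text -> log prob
top_k = sorted(generations, key=generations.get, reverse=True)[:2]            # highest log prob first

for text in top_k:
    scores = bm25.get_scores(text.split())
    print(text, "->", catalog[int(scores.argmax())])   # matched catalog item
```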
Results: On the Amazon Review Dataset 2018, CALRec outperforms ID-based and text-based baselines (e.g., SASRec, BERT4Rec, FDSA, UniSRec). While the evaluation dataset is limited, CALRec beating the baselines is promising. Ablations demonstrate the necessity of both training stages, highlighting transfer learning benefits from multi-category training and incremental gains (0.8–1.7%) from contrastive alignment.
EmbSum (Meta) presents a content-based recommendation approach using precomputed textual summaries of user interests and candidate items to capture interactions within the user engagement history.
EmbSum uses T5-small (61M parameters) to encode user interactions and candidate content, managing long user histories by partitioning them into sessions for encoding. Then, Mixtral-8x22B-Instruct generates the interpretable user interest summaries from user histories. These summaries are then fed into the T5’s encoder to derive final embeddings.
Key to this architecture are User Poly-Embeddings (UPE) and Content Poly-Embeddings (CPE). To get a global representation for UPE, they take the last token of the decoder output ([EOS]) and concatenate it with the representation vectors from the session encoder. This combined representation passes through a poly-attention layer which distills nuanced user interests into multiple embeddings. EmbSum training combines noisy contrastive estimation loss and summarization loss, ensuring high-quality user embeddings.
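A condensed sketch of a poly-attention layer in this spirit (the dimensions and number of context codes are assumptions): a small set of learnable codes attends over the concatenated [EOS]-plus-session representations to yield multiple user embeddings.

```python
import torch
import torch.nn as nn

class PolyAttention(nn.Module):
    def __init__(self, dim: int, num_codes: int = 8):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))   # learnable context codes

    def forward(self, reps):                                     # reps: (batch, seq, dim)
        attn = torch.softmax(self.codes @ reps.transpose(1, 2), dim=-1)  # (batch, codes, seq)
        return attn @ reps                                       # (batch, codes, dim)

reps = torch.randn(2, 12, 512)            # [EOS] + session vectors; 512 matches T5-small's hidden size
user_poly_emb = PolyAttention(dim=512)(reps)
print(user_poly_emb.shape)                # torch.Size([2, 8, 512]): multiple user embeddings
```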
Results: EmbSum beats several state-of-the-art content-based recommenders. Nonetheless, direct comparisons with behavioral recommenders were glaringly absent. Ablation studies show that CPE contributes most to performance, followed by session-based grouping and encoding, user poly-embeddings, and summarization losses. Additionally, GPT-4 evaluations indicate strong interpretability and quality of generated user interest summaries.
• • •
LLM-assisted data generation and analysis
Another common theme is using LLMs to enrich data. Several papers describe using LLMs to tackle data scarcity and improve the quality of search and recommendations. Examples include generating webpage metadata at Bing, creating synthetic training data to identify poor job matches at Indeed, adding semantic labels for query understanding at Yelp, crafting exploratory search queries at Spotify, and enriching music playlist metadata at Amazon.
Recommendation Quality Improvement (Bing) shares how Bing improved webpage recommendations by using LLMs to generate high-quality metadata and training an LLM to predict clicks and quality.
Previously, Bing’s webpage representations relied on extractive summaries, which often caused query classification failures. To address this, they used GPT-4 to generate high-quality titles and snippets from full webpage content for two million pages. Then, for efficient large-scale deployment, they finetuned a Mistral-7B model using this GPT-4-generated data.
To improve webpage-to-webpage recommendation rankings, they finetuned a multitask MiniLM-based cross-encoder on both pairwise click predictions and quality classification tasks. The resulting quality scores were then linearly combined with click predictions from an existing LightGBM ranker.
The MiniLM (right) is ensembled with the LightGBM ranker (left).
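The fusion itself is a simple linear blend; here is a tiny sketch with a made-up weight (tuned offline/online in practice).

```python
def final_score(click_score: float, quality_score: float, w: float = 0.3) -> float:
    """Linearly combine the LightGBM click prediction with the cross-encoder quality score."""
    return (1 - w) * click_score + w * quality_score

candidates = [("page_a", 0.62, 0.20), ("page_b", 0.55, 0.90)]   # (id, click, quality), illustrative
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]), reverse=True)
print([c[0] for c in ranked])
```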
To better understand user preferences, they defined 16 distinct recommendation scenarios reflecting common user patterns. Using high-precision prompts that incorporated the enhanced titles and snippets from Mistral-7B, they classified each webpage-to-webpage recommendation into these scenarios. Then, by monitoring distribution shifts across scenarios, they quantified improvements in webpage recommendation quality.
Results: The enhanced system reduced clickbait by 31%, low-authority content by 35%, and duplicate content by 76%. At the same time, higher authority content increased by 18%, cross-medium recommendations rose by 48%, and recommendations with greater specificity improved by 20%. This is despite lower-quality content (e.g., clickbait) historically showing higher CTR, demonstrating the effectiveness of the quality-focused cross-encoder.
(👉 Recommended read) Expected Bad Match (Indeed) shares how they used LLM-generated labels to filter poor job matches. Specifically, they finetuned LLMs to evaluate recommendation quality and generate labels for a post-processing classifier.
They started by building an evaluation set: cross-reviewing 250 matches and narrowing them down to 147 high-confidence labeled examples. Then, they prompted various LLMs, such as Llama2 and Mistral-7B, with expert recruitment guidelines to evaluate match quality across dimensions like job descriptions, resumes, and user interactions. However, these models struggled with detailed prompts, producing generalized assessments that didn’t consider the specifics of each job and job seeker. On the other hand, GPT-4 performed better but was prohibitively expensive.
To balance cost and effectiveness, the team finetuned GPT-3.5 on a curated dataset of over 200 human-reviewed GPT-4 responses. This finetuned GPT-3.5 matched GPT-4’s performance at just a quarter of the cost and latency. But despite the improvements, its inference latency of 6.7 seconds remained too high for online use. Thus, they trained a lightweight classifier, eBadMatch, using LLM-generated labels and categorical features from job descriptions, resumes, and user activity. In production, a daily pipeline samples job matches, engineers features, anonymizes data, generates LLM labels, and retrains the model. This classifier acts as a post-processing filter to remove low-quality matches.
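A hedged sketch of the eBadMatch pattern: a lightweight classifier fit on LLM-generated labels plus categorical features, then applied as a post-filter at a chosen threshold. The feature names and model choice below are illustrative, not Indeed's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Categorical features from job/resume/activity, with LLM-generated labels (1 = expected bad match).
X_cat = np.array([["remote", "senior"], ["onsite", "entry"], ["remote", "entry"], ["hybrid", "senior"]])
y = np.array([0, 1, 0, 1])

enc = OneHotEncoder(handle_unknown="ignore")
clf = LogisticRegression().fit(enc.fit_transform(X_cat), y)

def keep_match(features, threshold=0.20):          # e.g., the 20% threshold used for email filtering
    p_bad = clf.predict_proba(enc.transform([features]))[0, 1]
    return p_bad < threshold                       # drop matches the classifier flags as likely bad

print(keep_match(["remote", "senior"]))
```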
Results: The eBadMatch classifier achieved an AUC-ROC of 0.86 against LLM labels, with latency suitable for real-time filtering. Online experiments demonstrated that applying a 20% threshold filter on invitation-to-apply emails reduced batch matches by 17.68%, lowered unsubscribe rates by 4.97%, and increased application rates by 4.13%. Similar improvements were observed in homepage recommendation feeds.
(👉 Recommended read) Query Understanding (Yelp) shows how they integrated LLMs into their query understanding pipeline to improve query segmentation and review highlights.
Query segmentation identifies meaningful parts of user queries—such as topic, name, time, location, and question—and tags them accordingly. Along the way, they learned that spelling correction and segmentation could be done together and thus added a meta tag to mark spell-corrected sections and combined both tasks into a single prompt. Retrieval-augmented generation (RAG) further improved segmentation accuracy by incorporating business names and categories as context that disambiguated user intent. For evaluation, they compared LLM-identified segments against human-labeled datasets of name match and location intent.
Review highlights selects key snippets from reviews to feature in search results. They used LLMs to generate synonymous phrases suitable for highlights, prompting with curated examples so the models replicate human reasoning in phrase expansion. RAG further enhanced relevance by augmenting the input with relevant business categories to guide phrase generation. Offline evaluation was done via human annotators before online A/B testing of the new highlight phrases. To scale efficiently and cover 95% of traffic, Yelp pre-computed snippet expansions using batch calls to OpenAI and stored them in key-value stores to reduce latency.
The team shared their approach—from initial formulation and proof of concept (POC) to scaling up. Initially, they assessed LLM suitability and defined the project’s scope. During POC, they leveraged the power-law distribution of queries, caching pre-computed LLM responses for common queries covering most traffic. To scale, they created golden datasets using GPT-4 outputs and finetuned smaller, cost-effective models like GPT-4o-mini. Additionally, real-time models like BERT and T5 addressed less frequent, long-tail queries.
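A small sketch of that serving pattern: pre-computed LLM expansions for head queries sit in a key-value store, and a lighter real-time model covers the long tail on a cache miss. All names and data below are illustrative.

```python
# Batch LLM outputs for head queries, keyed by normalized query text.
precomputed = {"best ramen": ["top ramen spots", "ramen near me"]}

def realtime_model(query: str) -> list[str]:
    return [query + " nearby"]                      # stand-in for a BERT/T5-sized online model

def expand_query(query: str) -> list[str]:
    return precomputed.get(query) or realtime_model(query)

print(expand_query("best ramen"))                   # served from the key-value store
print(expand_query("vegan tapas with live jazz"))   # long-tail query -> real-time model
```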
Results: Yelp’s query segmentation significantly improved location intent detection, while enhanced review highlights increased both session and search click-through rates (CTR), especially benefiting long-tail queries.
Query Recommendations (Spotify) details how they built a hybrid query recommendation system to suggest exploratory search queries alongside direct results. This was necessary to support Spotify’s expansion beyond music into podcasts, audiobooks, and other content types, helping users explore this new content.
Spotify generated query suggestions by (i)