Projects

A body of work in production-grade recommender systems — built from scratch in PyTorch, all deployed live on Streamlit. This page is the tour: what each project set out to prove and what came out of it. The full specs, experiment tables, and code live in each repo's README.

Note: These demos run on Streamlit’s free tier and sleep after a few hours of inactivity. If you land on a “Wake up” screen, just click the button — the app will be ready in about 15–30 seconds.

Retrieval · Two-Tower

Two towers, one for the user and one for the item, embed into a shared space where a dot product retrieves the top recommendations. Users are represented by the features of what they’ve enjoyed, not by a user ID, so anyone gets recommendations from just a few examples. Both builds in this section, Movie and Book, represent the user by pooling their interaction history through several behavior-partitioned pools, and both train with a full softmax over the entire catalog, with a logit adjustment from Menon et al. (2020) added just before the loss to counter popularity bias (a study of its own).
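To make the retrieval step concrete, here is a minimal sketch in plain Python rather than PyTorch, so it runs without dependencies. It is not the code from either repo: the item vectors are hand-written toy values, the names `mean_pool`, `recommend`, and `ITEM_VECS` are hypothetical, and a single mean pool stands in for the several behavior-partitioned pools the real builds use.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def recommend(history: Sequence[str], item_vecs: Dict[str, Vector], k: int = 2) -> List[str]:
    """Embed the user from their history, then rank unseen items by dot product."""
    user_vec = mean_pool([item_vecs[i] for i in history])
    seen = set(history)
    scored = [(dot(user_vec, v), item) for item, v in item_vecs.items() if item not in seen]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


# Toy 2-d item embeddings (illustrative values only; a trained item tower
# would produce these from item features).
ITEM_VECS = {
    "space_opera": [1.0, 0.0],
    "hard_scifi": [0.9, 0.1],
    "rom_com": [0.0, 1.0],
    "period_drama": [0.1, 0.9],
}

print(recommend(["space_opera"], ITEM_VECS, k=1))
```

Because the user vector is built only from item vectors, a brand-new user with one or two liked titles can be scored the same way as a long-time user, which is the point of dropping the user ID.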

Movie — The flagship two-tower build, trained on MovieLens 32M. Once it was mature, it became the launchpad for a standalone research project on LLM-generated content features (below).

Book — The first of the series, and the one with the longest experiment ladder: every architecture change — loss function, projection MLPs, partitioned history pooling, popularity debiasing — had to beat the incumbent both on offline metrics and in side-by-side comparisons of real recommendation lists before being promoted to production. The biggest single lesson: training with in-batch negatives quietly punishes canonical popular books (they show up as negatives in nearly every batch), and switching to full softmax over the catalog fixed it.
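A back-of-the-envelope sketch of why in-batch negatives hit popular books so hard. It assumes each training example's positive item is drawn independently in proportion to the item's share of all interactions, which is a simplification; the shares and batch size below are made-up illustrative numbers, not figures from the Goodreads data.

```python
def in_batch_negative_rate(share: float, batch_size: int) -> float:
    """Probability that an item appears at least once among the other
    batch_size - 1 examples, i.e. gets used as a negative for a given row."""
    return 1.0 - (1.0 - share) ** (batch_size - 1)


# Hypothetical shares of total interactions for a bestseller and a niche title.
for label, share in [("bestseller", 0.005), ("niche title", 0.00001)]:
    rate = in_batch_negative_rate(share, batch_size=1024)
    print(f"{label}: negative in {rate:.1%} of batches")
```

Under these assumptions the bestseller is pushed down in almost every batch while the niche title almost never is. A full softmax over the catalog removes that imbalance because every item is a negative for every example.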

Movie Recommender

Trained on MovieLens 32M · 9,375 movies · ~50k users

↑ 8.7× MRR over the MSE baseline (0.115 vs 0.013)

GitHub ↗

Book Recommender

Trained on Goodreads · 14,753 books · 229k users

↑ 3.4× Hit@10 over the MSE baseline (16.0% vs 4.7%)

GitHub ↗
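For readers unfamiliar with the two headline metrics on the cards above, a minimal sketch of how MRR and Hit@10 are commonly computed. It assumes one held-out item per user and 1-based ranks, with `None` meaning the item was not retrieved at all; the exact evaluation protocol in each repo may differ.

```python
from typing import List, Optional


def hit_at_k(ranks: List[Optional[int]], k: int = 10) -> float:
    """Fraction of users whose held-out item lands in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks: List[Optional[int]]) -> float:
    """Mean reciprocal rank; a missing item contributes zero."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks of the held-out item for four users.
example_ranks = [1, 4, 20, None]
print(hit_at_k(example_ranks, k=10), mrr(example_ranks))
```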

Research · LLM-Generated Content Features

A standalone research question that rides on top of a mature two-tower model — I used the Movie build, but Book or Steam would have served equally well: can content features generated by an LLM compete with human-curated ones? MovieLens ships 1,128 hand-tagged “genome” attributes per movie — the kind of labeled asset most companies never get. I built a competing fingerprint in about half a day (scrape TMDB and Wikipedia, then have an LLM score every movie against 132 attributes I designed) and ran the two head-to-head as swappable item towers. The self-built features matched the curated tags — a measured answer to “what do you do before you can afford a tagging pipeline?”

June 10, 2026 Projects

$200 vs $200k: Generating Item Features With an LLM Instead of Hand Tagging

Full write-up — attribute design, datasets, and the head-to-head metrics.

Research · Taming Popularity Bias

Another study on top of the Movie build, this one about a failure mode every collaborative recommender drifts into: left alone, it collapses into a Most-Popular shelf, handing every user the same blockbusters and burying the long tail. The fix is a single term in the loss from Menon et al. (2020): during training, add each movie's log-popularity, scaled by a strength α, to its score so popular titles sit among the softmax negatives with a head start, forcing the model to learn relevance net of popularity. It's the same logit adjustment baked into all four two-tower builds on this page, isolated here and turned into a knob. At α=0.5 it halved the median recommendation's popularity, lifted catalog coverage from 48% to 84% (+3,414 films that previously had zero chance of being shown), and made the long tail 5–12× more findable, at a cost of ~4% head accuracy and zero added latency, since serving is unchanged.
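The loss term can be sketched in a few lines of plain Python. This is an illustration of the logit-adjusted softmax cross-entropy, not the repo's PyTorch code: the function name, the three-item catalog, and the popularity shares are assumptions, and popularity is taken to be each item's share of all interactions.

```python
import math
from typing import List


def logit_adjusted_loss(scores: List[float], target: int,
                        popularity: List[float], alpha: float = 0.5) -> float:
    """Softmax cross-entropy over the full catalog, with alpha * log(popularity)
    added to every item's score before the loss (training only)."""
    adjusted = [s + alpha * math.log(p) for s, p in zip(scores, popularity)]
    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_z - adjusted[target]


# Toy catalog: one blockbuster and two niche films, all with equal raw scores.
scores = [0.0, 0.0, 0.0]
popularity = [0.8, 0.1, 0.1]

plain = logit_adjusted_loss(scores, target=1, popularity=popularity, alpha=0.0)
adjusted = logit_adjusted_loss(scores, target=1, popularity=popularity, alpha=1.0)
print(plain, adjusted)
```

With equal raw scores, the adjusted loss on the niche target is higher than the plain loss, so the model has to raise the niche film's raw score above the blockbuster's to compensate. At serving time the items are ranked by raw scores alone, without the adjustment, which is why latency does not change.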

June 14, 2026 Projects

Taming popularity bias in recommender systems with logit adjustment

Full write-up — before/after recommendation walls, the held-out metrics, and the accuracy trade.

Two-Stage · Retrieve → Rank

Same two-tower retriever as above for candidate generation — but paired with an entirely new model that is not two-tower: a Wide & Deep ranker. The ranker rescores the retriever’s top candidates, unlocking the power of user×item cross-features that a dot product can’t capture (at more than 10× the training cost). Retrieve-then-rank is the two-stage design used in industry-scale recommenders.
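The two-stage flow can be sketched as follows. The ranker here is a stand-in callable with hand-set scores, not a Wide & Deep model, and every name and value is hypothetical; the sketch only shows how the stages hand off to each other.

```python
from typing import Callable, Dict, List

Vector = List[float]


def retrieve(user_vec: Vector, item_vecs: Dict[str, Vector], n: int) -> List[str]:
    """Stage 1: cheap dot-product scoring over the whole catalog, keep the top n."""
    def score(item: str) -> float:
        return sum(u * v for u, v in zip(user_vec, item_vecs[item]))
    return sorted(item_vecs, key=score, reverse=True)[:n]


def rerank(candidates: List[str], ranker: Callable[[str], float], k: int) -> List[str]:
    """Stage 2: a costlier model rescoring only the shortlisted candidates."""
    return sorted(candidates, key=ranker, reverse=True)[:k]


# Toy catalog and a stand-in ranker (illustrative numbers only).
item_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [0.0, 1.0]}
ranker_scores = {"a": 0.2, "b": 0.9, "c": 1.0}

shortlist = retrieve([1.0, 0.0], item_vecs, n=2)
final = rerank(shortlist, ranker_scores.get, k=1)
print(shortlist, final)
```

Item "c" has the highest ranker score but never reaches the ranker because retrieval dropped it. That is the reason a retriever in this design is tuned for recall: the ranker can only reorder what it is given.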

Steam — Also the messiest dataset of the four, which is half the point. Steam has no star ratings, so log-compressed playtime stands in as the preference signal; no timestamps, so history is deliberately shuffled to keep the model from cheating off release order; and a few Valve titles so ubiquitous they polluted every user’s recommendations until they were cut from the corpus. The two-stage design also reframes the retriever’s job: with a ranker on top, retrieval is tuned purely for recall and leaves precision to the ranker.
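A small sketch of the two data decisions described above. The exact transform is an assumption: `math.log1p` on playtime in minutes is one common way to log-compress, and the function names are hypothetical rather than taken from the repo.

```python
import math
import random
from typing import List


def playtime_to_signal(minutes: float) -> float:
    """Log-compress playtime so a 1,000-hour game does not drown out everything else."""
    return math.log1p(minutes)


def shuffled_history(game_ids: List[str], seed: int) -> List[str]:
    """Return the history in random order so position carries no information
    (the dataset has no timestamps, so any stored order is an artifact)."""
    rng = random.Random(seed)
    shuffled = list(game_ids)
    rng.shuffle(shuffled)
    return shuffled


print(playtime_to_signal(60), playtime_to_signal(6000))
print(shuffled_history(["g1", "g2", "g3", "g4"], seed=0))
```

In this sketch, a hundredfold increase in playtime raises the signal by only a little over a factor of two, which keeps a handful of heavily played games from dominating the preference signal.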

Steam Game Recommender

Trained on Steam Games Dataset · 5,437 games · 66k users

↑ Wide & Deep ranker: +16% NDCG@10 over retrieval-only

GitHub ↗

Sequential · Transformer

An ablation study isolating sequence modeling itself. A three-stage progression — bag-of-items baseline → causal attention without positional embeddings → full SASRec — measures what each architectural piece contributes, on a deliberately minimal feature set (item-ID embeddings only, no item metadata) so the gains come from the sequence model rather than side features. The user tower is purely the ordered sequence of item IDs encoded by a causal Transformer, and final sampled NDCG@10 (0.519) lands within 3.2% — relative — of the published SASRec baseline (0.536).
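The middle stage, causal attention without positional embeddings, can be sketched in plain Python as single-head scaled dot-product attention with a causal mask. This is a simplified illustration, not the project's model: it omits the learned query/key/value projections, residual connections, layer normalization, and feed-forward layers that a SASRec block has, and the input vectors are toy values.

```python
import math
from typing import List

Matrix = List[List[float]]


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Position i attends only to positions 0..i, never to later items."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against the current and earlier positions only (the causal mask).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out


# Toy item embeddings for a three-item history.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(causal_attention(x, x, x))
```

Even without positional embeddings the mask alone gives the model order information, since each position sees a different prefix of the history.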

Amazon Games — Less a product than a reproduction study: SASRec rebuilt from first principles, adding exactly one component per stage with the data, training loop, and evaluation held fixed, so every metric delta is attributable to a single architectural choice. The finding worth remembering: causal self-attention is the entire unlock — a +76% jump in sampled NDCG@10 over the bag-of-items baseline — while the positional embeddings that “complete” the Transformer added just +1.6%, consistent with the original paper’s own ablation.
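For reference, a minimal sketch of sampled NDCG@10 for one user with a single held-out item. The original SASRec paper ranks the held-out item against 100 randomly sampled negatives; whether this project uses the same count is an assumption, and the function name and tie handling (ties resolved in the held-out item's favor) are illustrative choices.

```python
import math
from typing import List


def sampled_ndcg_at_k(pos_score: float, neg_scores: List[float], k: int = 10) -> float:
    """NDCG@k with one relevant item: 1 / log2(rank + 2) if it ranks in the top k."""
    rank = sum(1 for s in neg_scores if s > pos_score)  # 0-based rank among the sample
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


# Held-out item scored 0.5; one sampled negative outranks it.
print(sampled_ndcg_at_k(0.5, [0.9, 0.1, 0.3]))
```

The reported metric would be this value averaged over all evaluated users.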

Amazon Games Recommender

Trained on Amazon Video Games · 16,882 games · 50,626 users

↑ Sampled NDCG@10 0.519 — within 3.2% (relative) of published SASRec’s 0.536

GitHub ↗