Reference & Glossary
Technical terminology, code references, and verification guide for the interactive documentation
Contents
Glossary: Building Blocks of the Algorithm
- Heavy Ranker
- Light Ranker
- TwHIN
- SimClusters
- UTEG
- GraphJet
- Earlybird
- Real Graph
- Tweet Mixer
- Navi
- Product Mixer
- MaskNet
- FSBoundedParam
- TweepCred
- FRS
- User Signal Service
Reference Sections
- Code Evolution Timeline - March 2023 vs September 2025 releases
- How to Verify Our Claims - Step-by-step verification guide
- File Index - Where to find specific implementations
- Further Reading - Official sources and academic background
Glossary: Building Blocks of the Algorithm
The Twitter algorithm is built from many interconnected systems. Here's what each piece does, explained intuitively rather than technically.
Reading this glossary: Each entry explains what the system does and why it exists. Think of these as tools in a toolbox - each serves a specific purpose in the larger recommendation pipeline.
Heavy Ranker
What it is: The main machine learning model that scores tweets.
How to think about it: Imagine a judge at a competition who can predict 15 different ways the audience might react to each performance. The Heavy Ranker looks at a tweet and predicts: "There's a 5% chance you'll like this, 2% chance you'll reply, 0.1% chance you'll click 'not interested'," and so on. Each prediction gets a weight (a reply is weighted 13.5 versus 0.5 for a like, making replies 27x more valuable), and the weighted sum becomes the tweet's final score.
Why it exists: Scoring thousands of tweets per user is computationally expensive. The Heavy Ranker is "heavy" because it's thorough - it uses a neural network with ~48 million parameters to make highly accurate predictions. But you can only afford to run something this expensive on a pre-filtered set of candidates.
Architecture: Uses MaskNet (see below) - a special neural network design that predicts all 15 engagement types simultaneously while sharing knowledge between predictions.
Code: External repo recap
Weights: HomeGlobalParams.scala:786-1028
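The weighted-sum step can be sketched in a few lines. The favorite (0.5) and reply (13.5) weights are the published defaults; the "report" weight comes from external reporting on the open-source release, so treat it as approximate, and the real model predicts 15 engagement types, not 3.

```python
# Illustrative sketch of the Heavy Ranker's final combination step:
# a weighted sum over predicted engagement probabilities.
ENGAGEMENT_WEIGHTS = {
    "favorite": 0.5,
    "reply": 13.5,
    "report": -369.0,  # negative engagements pull the score down
}

def combine_score(probabilities: dict) -> float:
    """Weighted sum of predicted engagement probabilities."""
    return sum(ENGAGEMENT_WEIGHTS[k] * p for k, p in probabilities.items())

combine_score({"favorite": 0.10, "reply": 0.02, "report": 0.0})
# 0.5 * 0.10 + 13.5 * 0.02 = 0.05 + 0.27 = 0.32
```

Note how a tiny probability of a heavily weighted engagement (a report at -369) can outweigh a large probability of a lightly weighted one.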
Light Ranker
What it is: A faster, simpler scoring model embedded in the search index.
How to think about it: If Heavy Ranker is a detailed film critic analyzing every aspect of a movie, Light Ranker is a quick star rating. It's a basic logistic regression model that runs inside the search index (Earlybird) to quickly score millions of tweets and pick the top few thousand worth sending to Heavy Ranker.
Why it exists: You can't run Heavy Ranker on a billion tweets - it would take too long and cost too much. Light Ranker is the bouncer that gets the candidate pool down from millions to thousands in milliseconds.
Trade-off: Fast but less accurate. Uses only ~20 features vs Heavy Ranker's ~6,000 features.
Code: earlybird
TwHIN (Twitter Heterogeneous Information Network)
What it is: A giant knowledge graph that represents everything on Twitter (users, tweets, topics, communities) as connected points in mathematical space.
How to think about it: Imagine a 3D map where every user is a point, every tweet is a point, and every topic is a point. Similar things are close together. If you like sci-fi movies and engage with certain accounts, you'll be positioned near other sci-fi fans. TwHIN can then say "show this person tweets from that nearby cluster they haven't seen yet."
Why it exists: Finding relevant content from people you don't follow is hard. TwHIN solves this by representing similarity mathematically - it can find "users similar to you" or "tweets similar to what you engage with" by measuring geometric distance in this abstract space.
Heterogeneous means: The graph includes different types of things (users, tweets, topics, hashtags) all in one unified mathematical representation.
Code: recos and related embeddings
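The "nearby in mathematical space" idea reduces to vector similarity. A sketch with toy three-dimensional embeddings (real TwHIN embeddings have far more dimensions, and the names here are invented):

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means "same direction", 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

user = [0.9, 0.1, 0.0]  # hypothetical user embedding
tweets = {"scifi_tweet": [0.8, 0.2, 0.1], "cooking_tweet": [0.0, 0.1, 0.9]}

max(tweets, key=lambda t: cosine(user, tweets[t]))
# "scifi_tweet": the candidate closest to the user in embedding space
```

At production scale this nearest-neighbor lookup is done with approximate search rather than comparing against every tweet.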
SimClusters
What it is: A system that divides X into ~145,000 interest-based communities and represents both users and tweets as membership in these communities.
How to think about it: Instead of saying "Alice follows Bob and Carol," SimClusters says "Alice is 60% in the AI cluster, 30% in the cooking cluster, and 10% in the gardening cluster." Tweets are described the same way. Then matching is simple: show people tweets from clusters they belong to.
Why it exists: Communities are more stable than individual follow relationships, and they're much more efficient to compute with. Rather than comparing you to millions of individual users, the algorithm can compare your cluster membership to tweet cluster scores.
The gravitational pull effect: Because scoring uses multiplication (your_cluster_score × tweet_cluster_score), your strongest cluster keeps getting stronger. If you're 60% AI and 40% cooking today, engaging slightly more with AI content makes you 65% AI, which makes AI content score even higher, which makes you engage more with AI... and six months later you're 76% AI.
How clusters are created: X analyzes the follow graph using community detection algorithms to discover ~145,000 natural communities. Your interests (InterestedIn) are calculated from your engagement history with a 100-day half-life, updated weekly. See the Cluster Explorer interactive to understand how you're categorized.
Code: simclusters_v2
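The "Alice is 60% in the AI cluster" description is literally a sparse vector, and matching is a dot product over shared clusters. A sketch with invented cluster names and scores:

```python
# User interests and tweet cluster scores as sparse vectors over communities.
user_interests = {"ai": 0.6, "cooking": 0.3, "gardening": 0.1}
tweet_clusters = {"ai": 0.8, "finance": 0.5}

def simclusters_score(user: dict, tweet: dict) -> float:
    # Multiplicative scoring: only clusters present on BOTH sides contribute,
    # which is the mechanism behind the gravitational pull effect.
    return sum(user[c] * tweet[c] for c in user.keys() & tweet.keys())

simclusters_score(user_interests, tweet_clusters)
# 0.48: only the shared "ai" cluster matters (0.6 * 0.8)
```

Because the score is multiplicative, your strongest cluster dominates: a tweet in your 60% cluster always beats an equally scored tweet in your 10% cluster.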
UTEG (User-Tweet-Entity-Graph)
What it is: An in-memory graph database that tracks recent engagement patterns to make real-time recommendations.
How to think about it: UTEG is like a short-term memory system. It remembers "in the last 24 hours, people similar to you engaged with these tweets." It's built using GraphJet (see below), which keeps a live graph in RAM that can answer queries in milliseconds.
Why it exists: Some recommendation systems (like SimClusters) are based on long-term patterns and update slowly. UTEG captures what's happening right now - trending topics, breaking news, viral content. It provides the "fresh" recommendations that complement the more stable systems.
Graph traversal: To find recommendations, UTEG does graph walks: "You liked tweet A → Other people who liked A also liked B → Show you tweet B."
Code: user_tweet_entity_graph
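The graph walk above can be sketched as a two-hop traversal over recent engagements. The engagement data is invented, and the real UTEG weights edges by engagement type and recency:

```python
from collections import Counter

likes = {  # user -> tweets they engaged with recently
    "you":   {"A"},
    "alice": {"A", "B"},
    "bob":   {"A", "B", "C"},
}

def recommend(user: str):
    seen = likes[user]
    scores = Counter()
    for tweet in seen:                        # tweets you engaged with
        for other, theirs in likes.items():   # users who co-engaged
            if other != user and tweet in theirs:
                for candidate in theirs - seen:
                    scores[candidate] += 1    # their other tweets
    return scores.most_common()

recommend("you")  # [("B", 2), ("C", 1)]: B was co-liked by both neighbors
```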
GraphJet
What it is: An in-memory graph database optimized for real-time recommendations.
How to think about it: A traditional database stores data on disk and reads it when needed (slow). GraphJet keeps the entire graph in RAM (fast) and is optimized for the specific types of queries Twitter needs: "given this user, find related tweets" or "given this tweet, find similar users."
Why it exists: Speed. When you refresh your timeline, Twitter has ~200 milliseconds to gather candidates, score them, and serve the results. GraphJet can traverse millions of graph edges in memory in just a few milliseconds.
Trade-off: RAM is expensive and limited, so GraphJet only stores recent data (typically last 24-48 hours of engagement).
Code: Open-sourced separately at GraphJet
Earlybird
What it is: Twitter's real-time search index - a specialized database optimized for finding tweets by keywords, authors, or engagement patterns.
How to think about it: When you search for "machine learning" on Twitter, Earlybird finds matching tweets in milliseconds even though there are billions of tweets. For the recommendation algorithm, Earlybird serves as the main source of in-network candidates (tweets from people you follow).
Why it exists: Traditional databases aren't fast enough for Twitter's scale. Earlybird is custom-built for one purpose: extremely fast tweet retrieval with ranking. It includes the Light Ranker (see above) built directly into the index so it can return already-scored candidates.
Real-time means: New tweets are indexed within seconds, so Earlybird always has the latest content.
Code: search
Real Graph
What it is: A system that predicts the strength of relationships between users based on interaction patterns, not just follow relationships.
How to think about it: You might follow 500 people, but you only regularly interact with 20 of them. Real Graph identifies those 20 by tracking who you reply to, whose profiles you visit, whose tweets you engage with. It creates a weighted graph where edge strength = relationship strength.
Why it exists: Following someone is a weak signal. The algorithm needs to know who you actually care about. Real Graph provides this by analyzing behavior: "You follow both @alice and @bob, but you reply to Alice 10x more often, so Alice gets 10x more weight in your recommendations."
Used for: Prioritizing in-network content, finding follow recommendations, and scoring out-of-network candidates based on similarity to your real connections.
Code: interaction_graph
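A sketch of the edge-strength idea: weight each interaction type and sum per account. The interaction types and weights below are invented; the real system goes further and predicts the likelihood of future interaction with a trained model.

```python
# Toy relationship-strength score: weighted count of recent interactions.
INTERACTION_WEIGHTS = {"replies": 3.0, "profile_visits": 1.0, "likes": 0.5}

def edge_strength(interactions: dict) -> float:
    return sum(INTERACTION_WEIGHTS[k] * n for k, n in interactions.items())

edge_strength({"replies": 10, "profile_visits": 4, "likes": 20})
# 44.0: replies dominate because they carry the highest weight
```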
Tweet Mixer
What it is: A coordination service that gathers out-of-network tweet candidates from multiple sources and combines them.
How to think about it: Tweet Mixer is like a talent scout that asks multiple agencies (TwHIN, SimClusters, UTEG, FRS) for their best recommendations, then combines those lists into one unified candidate pool to send to the Heavy Ranker.
Why it exists: Each recommendation system has different strengths - UTEG finds trending content, SimClusters finds thematic matches, TwHIN finds geometric similarity. Tweet Mixer orchestrates these systems and ensures you get a diverse mix of out-of-network candidates rather than duplicates from the same source.
Does NOT score: Tweet Mixer just fetches and combines. The actual scoring happens later in the Heavy Ranker.
Code: tweet-mixer
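The fetch-and-combine role can be sketched as: query each source, merge, and drop duplicates, with no scoring. Source names and contents are invented:

```python
def from_uteg():        return ["t1", "t2"]
def from_simclusters(): return ["t2", "t3"]
def from_twhin():       return ["t3", "t4"]

def mix(sources):
    seen, pool = set(), []
    for source in sources:
        for tweet in source():
            if tweet not in seen:  # dedupe across sources, keep first sighting
                seen.add(tweet)
                pool.append(tweet)
    return pool  # unified, unscored candidate pool for the Heavy Ranker

mix([from_uteg, from_simclusters, from_twhin])  # ["t1", "t2", "t3", "t4"]
```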
Navi
What it is: A high-performance inference engine that runs machine learning models in production.
How to think about it: Training a neural network happens offline in Python/TensorFlow. But when it's time to actually score tweets for millions of users, you need something blazing fast. Navi is a Rust-based serving system optimized for running the Heavy Ranker model with minimal latency.
Why it exists: Python is too slow for production inference at Twitter's scale. Navi loads the trained model into a lean Rust serving runtime that can score thousands of tweets per second with single-digit millisecond latency.
Trade-off: More complex to deploy and maintain than standard TensorFlow Serving, but much faster.
Code: Open-sourced in the main repository under navi, invoked from NaviModelScorer.scala
Product Mixer
What it is: A framework for building content feeds - provides reusable components for fetching candidates, scoring, filtering, and mixing content.
How to think about it: Building a recommendation timeline involves many common steps: fetch candidates, hydrate features, run ML models, apply filters, insert ads, etc. Product Mixer provides these as Lego blocks so teams can assemble feeds without reimplementing everything from scratch.
Why it exists: Twitter has multiple feeds (For You, Following, Lists, Search, Notifications). Product Mixer lets them share code and ensure consistency while customizing each feed's specific logic.
Pipeline structure: Product Mixer uses a pipeline model where each stage's output feeds into the next stage, making the data flow explicit and testable.
Code: product-mixer
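The pipeline model can be sketched as plain function composition: each stage consumes the previous stage's output. The stages below are invented stand-ins for Product Mixer's candidate sources, scorers, and filters:

```python
def fetch(_):
    return [{"id": 1, "nsfw": False}, {"id": 2, "nsfw": True}]

def score(candidates):
    return [{**c, "score": c["id"] * 0.1} for c in candidates]

def drop_nsfw(candidates):
    return [c for c in candidates if not c["nsfw"]]

def run_pipeline(stages, seed=None):
    out = seed
    for stage in stages:  # each stage's output feeds the next
        out = stage(out)
    return out

run_pipeline([fetch, score, drop_nsfw])
# [{"id": 1, "nsfw": False, "score": 0.1}]
```

Because stages are interchangeable blocks, a new feed can reuse the fetch and score stages while swapping in its own filters, which is the framework's whole point.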
MaskNet
What it is: A neural network architecture designed for multi-task learning - predicting multiple related outcomes simultaneously.
How to think about it: Traditional models predict one thing ("will you like this?"). MaskNet predicts 15 things at once (like, reply, retweet, report, etc.) while sharing knowledge between tasks. The insight is that all these predictions are related - if someone is likely to reply, they're probably also likely to like - so the model can learn more efficiently by predicting them together.
Why it exists: Training 15 separate models would be inefficient and they'd miss shared patterns. MaskNet uses "shared towers" (neural network layers that all tasks use) and "task-specific towers" (layers unique to each prediction), getting the best of both worlds.
The mask part: MaskNet learns "instance-guided masks": input-dependent weights that are multiplied element-wise with the feature embeddings, letting the network emphasize different feature interactions for each individual example.
Code: External repo recap
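The shared-tower / task-specific-tower split can be sketched as one shared transform feeding several per-task heads. Weights, sizes, and task names are toy values, not the production architecture:

```python
import math

def shared_tower(x):
    # One shared hidden layer (ReLU); every task reads this representation.
    return [max(0.0, 0.5 * v + 0.1) for v in x]

TASK_HEADS = {"like": [0.9, -0.2], "reply": [0.3, 0.7]}  # task-specific weights

def predict(x):
    h = shared_tower(x)
    return {task: 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(ws, h))))
            for task, ws in TASK_HEADS.items()}

predict([1.0, -1.0])  # one probability per engagement type, from shared features
```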
FSBoundedParam (Feature Switch)
What it is: A configuration system that lets Twitter tune algorithm parameters without deploying new code.
How to think about it: Hardcoded values like val penalty = 0.75 require a code deployment to change. FSBoundedParam defines parameters like OutOfNetworkPenalty(default=0.75, min=0.0, max=1.0) that can be adjusted through a dashboard. Twitter can run A/B tests or tune values in real-time without touching code.
Why it exists: Algorithm optimization is experimental. Twitter needs to test "what if we change the out-of-network penalty from 0.75 to 0.80?" dozens of times per week. FSBoundedParam makes this safe (the bounds prevent catastrophically bad values) and fast (no deployment required).
Important implication: Most weights, penalties, and thresholds in the algorithm are FSBoundedParams. The March 2023 open-source code shows the structure and formulas, but Twitter can tune the parameters without us seeing the changes.
Code: Used throughout, defined in param
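A sketch of the bounded-parameter idea in Python (the real implementation is Scala and reads overrides from the feature-switch service; all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedParam:
    default: float
    lower: float
    upper: float

    def resolve(self, configured=None) -> float:
        # Fall back to the default, then clamp into the safe range so a bad
        # config push can never produce a catastrophic value.
        value = self.default if configured is None else configured
        return min(max(value, self.lower), self.upper)

out_of_network_penalty = BoundedParam(default=0.75, lower=0.0, upper=1.0)
out_of_network_penalty.resolve()     # 0.75: no override, use the default
out_of_network_penalty.resolve(1.5)  # 1.0: override clamped to the bound
```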
TweepCred
What it is: A reputation score for users based on their follower graph quality, using PageRank-like algorithms.
How to think about it: Not all followers are equal. A verified account with 1M engaged followers has higher TweepCred than a bot farm with 1M fake followers. TweepCred measures "how much does the Twitter network trust/value this user?" by looking at who follows them and the quality of those followers.
Why it exists: Follower count alone is easily gamed. TweepCred provides a more robust measure of influence by analyzing the graph structure. It's used to boost high-quality accounts and filter low-quality ones (the SLOP filter removes users with TweepCred below a threshold).
Verified accounts: Get a ~100x TweepCred multiplier, which partly explains why verified accounts dominate recommendations.
Code: tweepcred
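A toy power-iteration PageRank over a four-account follow graph shows why graph structure beats raw counts: rank flows along follow edges, so being followed by well-regarded accounts matters more than being followed by many. The graph and the 0.85 damping factor are illustrative, not TweepCred's actual parameters:

```python
follows = {  # follower -> accounts they follow
    "bot1": ["celeb"],
    "bot2": ["celeb"],
    "celeb": ["expert"],
    "expert": ["celeb"],
}
nodes = sorted(set(follows) | {v for vs in follows.values() for v in vs})
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(50):  # power iteration with damping factor 0.85
    new = {n: 0.15 / len(nodes) for n in nodes}
    for follower, followed in follows.items():
        for account in followed:
            new[account] += 0.85 * rank[follower] / len(followed)
    rank = new
# celeb and expert end up ranked far above the bots: no reputable account
# follows the bots, so almost no rank flows to them.
```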
FRS (Follow Recommendations Service)
What it is: A service that recommends users you might want to follow.
How to think about it: FRS analyzes your follow graph and engagement patterns to suggest accounts similar to those you already follow or engage with. But it has a dual purpose: it also feeds into timeline recommendations by showing you tweets from accounts it thinks you should follow before you actually follow them.
Why it exists: Growing your follow graph improves your timeline quality (more in-network candidates). But FRS also serves as a candidate source - "here are tweets from people you don't follow but should."
Cluster reinforcement: FRS recommends users from your strongest SimClusters, which accelerates the gravitational pull effect. If you're 60% AI cluster, FRS recommends more AI accounts, you follow them, which makes you even more AI-cluster-heavy.
Code: follow-recommendations-service
User Signal Service (USS)
What it is: A centralized platform for collecting, storing, and serving user behavior signals.
How to think about it: Every action you take on Twitter (like, reply, click, scroll, dwell time) generates a signal. Rather than having every recommendation system separately track these signals, USS centralizes them. When the algorithm needs to know "what has this user engaged with recently?", it queries USS.
Why it exists: Reduces duplication and ensures consistency. Multiple systems use the same signals (favorites, follows, etc.), so centralizing this in USS means one source of truth.
Real-time and batch: USS provides both real-time signals (recent clicks in the last hour) and batch signals (aggregated engagement over weeks/months).
Code: user-signal-service
Code Evolution Timeline
Twitter's algorithm was open-sourced in two major releases:
March 2023: Architecture Skeleton
~300 files showing the 5-stage pipeline structure, basic candidate sources, and core concepts. The HomeGlobalParams.scala file contained only 86 lines with basic configuration—no engagement weights, no ML integration configs.
September 2025: Complete Implementation
+762 new files adding 161 feature hydrators, 56 filters, 29 scorers, complete ML integration, and full parameter definitions. The HomeGlobalParams.scala file expanded to 1,479 lines with all engagement weight parameters defined.
What This Means
Our investigation analyzes a composite system:
- Architecture: March 2023 foundation (5-stage pipeline)
- Implementation: September 2025 details (161 hydrators, 56 filters)
- Values: External sources (ML repo, engineering blogs)
Important: Parameter definitions exist with default = 0.0, but actual production values come from Twitter's internal configuration system. The code shows structure and formulas; external documentation provides values.
Core findings remain valid: The fundamental mechanisms (multiplicative scoring, exponential decay, 0.75x out-of-network penalty, 140-day feedback fatigue) are unchanged. The September 2025 release added detail and confirmed the architecture we analyzed.
How to Verify Our Claims
Every finding in this investigation can be verified. Here's how:
1. Get the Code
git clone https://github.com/twitter/the-algorithm.git
cd the-algorithm
2. Navigate to Referenced Files
We provide file paths like:
HomeGlobalParams.scala:786-1028
To view this:
cd home-mixer/server/src/main/scala/com/twitter/home_mixer/param/
sed -n '786,1028p' HomeGlobalParams.scala
3. Check Implementation Date
To see when code was last modified:
git log -1 --format='%h %ad' -- path/to/file.scala
For per-line history, git blame path/to/file.scala annotates each line with its last commit.
4. Verify Calculations
We show calculations like:
Tweet score = 0.5 × P(favorite) + 13.5 × P(reply)
Example:
P(favorite) = 0.1 (10% chance)
P(reply) = 0.02 (2% chance)
Score = 0.5 × 0.1 + 13.5 × 0.02
= 0.05 + 0.27
= 0.32
You can verify these against the code references we provide.
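The arithmetic can also be checked in a couple of lines of Python:

```python
# Reproduce the worked example with the published favorite/reply weights.
p_favorite, p_reply = 0.10, 0.02
score = 0.5 * p_favorite + 13.5 * p_reply
score  # 0.32
```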
5. Cross-Reference Documentation
Twitter published some official explanations:
- Engineering blog post
- ML algorithm details
- README files throughout the codebase
Our analysis adds detail and implications beyond what Twitter officially documented.
File Index: Where to Find Things
Main Pipeline
Entry point: home-mixer/server/src/main/scala/com/twitter/home_mixer/product/for_you/ForYouProductPipelineConfig.scala
Scoring orchestration: home-mixer/server/src/main/scala/com/twitter/home_mixer/product/scored_tweets/ScoredTweetsProductPipelineConfig.scala
Engagement Weights
All 15 weight parameters: home-mixer/server/src/main/scala/com/twitter/home_mixer/param/HomeGlobalParams.scala:786-1028
Engagement type definitions: home-mixer/server/src/main/scala/com/twitter/home_mixer/model/PredictedScoreFeature.scala:62-336
Filters and Penalties
"Not interested" filtering: home-mixer/.../filter/FeedbackFatigueFilter.scala
140-day penalty calculation: home-mixer/.../scorer/FeedbackFatigueScorer.scala
Author diversity exponential decay: home-mixer/.../scorer/AuthorBasedListwiseRescoringProvider.scala:54
Out-of-network 0.75x multiplier: home-mixer/.../scorer/RescoringFactorProvider.scala:45-57
Candidate Sources
Earlybird search index: src/java/com/twitter/search/
UTEG: src/scala/com/twitter/recos/user_tweet_entity_graph/
Out-of-network coordination: tweet-mixer/
FRS: follow-recommendations-service/
SimClusters and Communities
Community detection and embeddings: src/scala/com/twitter/simclusters_v2/
Approximate nearest neighbor search: simclusters-ann/
User Signals
Complete list of 20+ tracked signals: RETREIVAL_SIGNALS.md
Signal collection and serving: user-signal-service/
Real-time action stream: unified_user_actions/
Further Reading
Official Sources
- Twitter's open-source algorithm repository
- Machine learning models repository
- Twitter's engineering blog post
Questions or corrections? This is a living document. If you find errors or have questions about our analysis, please open an issue or submit a pull request on GitHub.