Phoenix: The Behavioral Prediction System

Twitter built a sequence-based prediction system for user behavior. Instead of aggregating features, Phoenix models up to 522 of your recent actions (spanning hours to days of behavior) to predict what you'll do next—like, reply, click. The architecture suggests a fundamental shift from feature-based to sequence-based recommendation.

Important: This analysis is based on code structure and architecture patterns. While the infrastructure is verifiably complete, some aspects (like training objectives and behavioral modeling details) are inferred from architectural similarities to transformer-based systems. We clearly mark what's verified code vs. reasoned inference throughout.

Status: Phoenix infrastructure is complete and production-ready (September 2025 commit). It's currently feature-flagged (default = false), suggesting it may be in a testing phase. The architecture represents a shift from feature-based to sequence-based prediction.

From Averages to Sequences

The current recommendation system (NaviModelScorer) thinks about you in terms of averages and statistics: "Alice likes 30% tech content, 20% sports, follows 342 people, engages 10 times per day." Phoenix thinks about you in terms of what you're doing right now: "Alice just clicked 3 tech tweets in a row, expanded photos, watched a video—she's deep-diving into tech content."

The Core Difference

Current System: NaviModelScorer

Feature-Based Prediction

Your profile:

User features: {
  avg_likes_per_day: 10.5
  avg_replies_per_day: 2.3
  favorite_topics: [tech, sports]
  follower_count: 342
  engagement_rate: 0.15
  ... (many aggregated features)
}

Algorithm asks: "What does Alice usually like?"

Time horizon: Months of aggregated behavior

Updates: Daily batch recalculation

Phoenix System

Sequence-Based Prediction

Your recent actions:

Action sequence: [
  CLICK(tech_tweet_1)
  READ(tech_tweet_1)
  LIKE(tech_tweet_1)
  CLICK(tech_tweet_2)
  EXPAND_IMAGE(tech_tweet_2)
  CLICK(tech_tweet_3)
  ... (up to 522 aggregated actions)
]

Algorithm asks: "What will Alice do next given her recent behavioral pattern?"

Time horizon: Hours to days of behavioral history (522 aggregated actions)

Updates: Real-time action capture, aggregated into sessions
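
To make the contrast concrete, here is a minimal Scala sketch of the two input shapes. The case classes and field names are hypothetical illustrations, not types from the repository:

// Hypothetical illustration of the two input shapes (not types from the repository).

// Feature-based (Navi-style): one aggregated snapshot per user.
case class UserFeatures(
  avgLikesPerDay: Double,
  avgRepliesPerDay: Double,
  favoriteTopics: Seq[String],
  followerCount: Int,
  engagementRate: Double)

// Sequence-based (Phoenix-style): an ordered list of recent actions.
case class UserAction(
  actionType: String, // e.g. "CLICK", "LIKE", "REPLY"
  tweetId: Long,
  authorId: Long,
  timestampMs: Long)

case class UserActionSequence(actions: Seq[UserAction]) // up to 522 aggregated actions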

The LLM Analogy (Inferred from Architecture)

Hypothesis: Phoenix uses a transformer-based architecture similar to language models, but instead of predicting text, it predicts actions. This inference is based on the architectural parallels summarized below.

Comparison to language models:

Aspect             | ChatGPT / Claude                          | Phoenix
Architecture       | Transformer (attention-based)             | Transformer (attention-based)
Input              | Sequence of tokens (words)                | Sequence of actions (likes, clicks, replies)
Context Window     | 8K-200K tokens                            | 522 aggregated actions (hours to days of behavior)
Prediction Task    | "What word comes next?"                   | "What action comes next?"
Output             | Probability distribution over vocabulary  | Probability distribution over 13 action types
Training Objective | Predict next token from context           | Predict next action from behavioral context
What It Learns     | Language patterns, grammar, context       | Behavioral patterns, engagement momentum, intent

What Phoenix Could Capture (Inference from Sequence Modeling)

Hypothesis: By modeling behavior as a sequence, Phoenix could understand dynamics that aggregated features miss. These capabilities are inferred from how sequence models typically work, not explicitly verified in code:

1. Session-Level Interest

Scenario: User interested in both Tech and Sports (50/50 split)

Navi prediction: 50% tech, 50% sports (always the same)

Phoenix prediction:
  Monday morning: [TECH] [TECH] [TECH] [TECH] → 85% tech, 15% sports
  Monday evening: [SPORTS] [SPORTS] [SPORTS] → 10% tech, 90% sports

Same user, different behavioral context → different predictions

2. Behavioral Momentum

Engagement Streak:
[LIKE] [REPLY] [LIKE] [LIKE] [CLICK] [LIKE] → High engagement mode
Phoenix: Next tweet gets 75% engagement probability

Passive Browsing:
[SCROLL] [SCROLL] [CLICK] [SCROLL] → Low engagement mode
Phoenix: Next tweet gets 15% engagement probability

Same user, different momentum → different feed composition

3. Context Switches

Context Switch Detection:
[NEWS] [NEWS] [NEWS] → [MEME] [MEME] → Context switch!

Phoenix recognizes: User shifted from serious content to entertainment
Adapts feed: More memes, less news (for this session)

4. Intent Signals

Behavioral Pattern: Profile Click + Follow
[CLICK_TWEET] → [CLICK_PROFILE] → [FOLLOW] → Next tweet from that author

Phoenix learns: Profile click + follow = strong interest signal
Result: Boost similar authors immediately

Why This Could Change Everything

Hypothesis: Phoenix could represent Twitter's move toward "delete heuristics"—the vision of replacing manual tuning with learned patterns. This interpretation is based on architectural design patterns:

What Gets Deleted

  • Manual weights: Reply: 75.0, Favorite: 0.5, Report: -369.0 → Phoenix learns what matters from data
  • Hand-crafted aggregated features: avg_likes_per_day, favorite_topics, engagement_rate → Just action sequences
  • 15+ manual penalties: OON penalty, author diversity, feedback fatigue → Phoenix learns user preferences
  • Static predictions: "Alice likes 30% tech" → "Alice is deep-diving tech RIGHT NOW"

The result: An algorithm that understands your current intent from your behavioral patterns, not your historical preferences from aggregated statistics. This is closer to how humans actually browse—following threads of interest as they emerge, not mechanically consuming averaged content.


Experience Behavioral Prediction (Simulation)

This simulator demonstrates how sequence-based prediction could work based on Phoenix's architecture. The predictions shown are illustrative of what behavioral sequence modeling enables, not actual Phoenix output.

Behavioral Sequence Simulator


Try These Patterns:

  • Deep Dive: Click [Tech] [Tech] [Tech] [Tech] → Phoenix detects focused exploration
  • Engagement Streak: [Like Tech] [Reply Tech] [Like Tech] → High momentum mode
  • Context Switch: [Tech] [Tech] [Sports] [Sports] → Phoenix adapts to interest shift
  • Passive Browsing: [Scroll] [Scroll] [Scroll] → Low engagement mode

The Technical Architecture

Two-Stage Pipeline

Phoenix splits prediction and aggregation into two separate stages:

Stage 1: PhoenixScorer (Prediction via gRPC)
  Input: User action sequence (522 aggregated actions by default; see context-window details below) + candidate tweets
  Process: Transformer model predicts engagement probabilities
  Output: 13 predicted probabilities per tweet

Stage 2: PhoenixModelRerankingScorer (Aggregation)
  Input: 13 predicted probabilities from Stage 1
  Process: Per-head normalization + weighted aggregation
  Output: Final Phoenix score for ranking
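
A simplified Scala sketch of how the two stages compose. The object and method names below are illustrative placeholders, not the actual PhoenixScorer / PhoenixModelRerankingScorer APIs, and Stage 1 is stubbed rather than calling the real gRPC service:

object PhoenixPipelineSketch {
  type EngagementProbs = Map[String, Double] // 13 heads, e.g. "FAV" -> 0.42

  // Stage 1 (stand-in): the real call goes over gRPC to the user_history_transformer service.
  def predictEngagements(
      actionSequence: Seq[String],
      candidateIds: Seq[Long]): Map[Long, EngagementProbs] =
    candidateIds.map(id => id -> Map("FAV" -> 0.42, "REPLY" -> 0.08, "GOOD_CLICK" -> 0.28)).toMap

  // Stage 2 (simplified): weighted aggregation of the per-head predictions.
  // Per-head normalization is omitted here; it is shown under Stage 2 below.
  def rerank(
      predictions: Map[Long, EngagementProbs],
      weights: Map[String, Double]): Seq[(Long, Double)] =
    predictions.toSeq
      .map { case (id, probs) =>
        id -> probs.map { case (head, p) => p * weights.getOrElse(head, 0.0) }.sum
      }
      .sortBy { case (_, score) => -score }
}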

Stage 1: Behavioral Sequence Prediction

Code: PhoenixScorer.scala:30-85

Input: Action Sequence

User action sequence (522 aggregated actions spanning hours to days):
[
  Session 1: FAV(tweet_123, author_A) + CLICK(tweet_123, author_A),
  Session 2: CLICK(tweet_456, author_B),
  Session 3: REPLY(tweet_789, author_C) + FAV(tweet_790, author_C),
  Session 4: FAV(tweet_234, author_A),
  ...
  Session 522: CLICK(tweet_999, author_D)
]

(Actions grouped into sessions using 5-minute proximity windows)

Candidate tweets: [tweet_X, tweet_Y, tweet_Z]

Processing: Transformer Model (Inferred Architecture)

Verified: Phoenix calls an external gRPC service named user_history_transformer (dependency in BUILD.bazel:20, client interface RecsysPredictorGrpc in PhoenixUtils.scala:26, usage in PhoenixUtils.scala:110-135)
Note: The actual service implementation is not in the open-source repository.
Inferred: The internal architecture likely follows transformer patterns based on the service name and sequence-to-sequence design:

Inferred Transformer Architecture:
  1. Embed each action in the sequence (action type + tweet metadata)
  2. Apply self-attention to identify relevant behavioral patterns
  3. For each candidate tweet, compute relevance to behavioral context
  4. Output 13 engagement probabilities via softmax

Verified Output Format (log probabilities):
{
  "tweet_X": {
    "SERVER_TWEET_FAV": {"log_prob": -0.868, "prob": 0.42},
    "SERVER_TWEET_REPLY": {"log_prob": -2.526, "prob": 0.08},
    "SERVER_TWEET_RETWEET": {"log_prob": -2.996, "prob": 0.05},
    "CLIENT_TWEET_CLICK": {"log_prob": -1.273, "prob": 0.28},
    ... (9 more engagement types)
  },
  ...
}
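
Since the service returns log probabilities, recovering the probabilities is a single exponentiation. A minimal sketch, with the response represented as a plain map rather than the actual gRPC response type:

// Converting the verified log-prob output back to probabilities:
// math.exp(-0.868) ≈ 0.42, matching the SERVER_TWEET_FAV example above.
def toProbabilities(logProbs: Map[String, Double]): Map[String, Double] =
  logProbs.map { case (head, logP) => head -> math.exp(logP) }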

Why gRPC Service? (Verified: separate service; reasons inferred)

The likely reasons, inferred from standard ML-serving practice rather than stated in the repository: a separate service lets the transformer run on specialized inference hardware, scale independently of the JVM-based Home Mixer, and be retrained and redeployed without touching the ranking pipeline.

Stage 2: Per-Head Normalization and Aggregation

Code: PhoenixModelRerankingScorer.scala:23-81

Step 1: Per-Head Max Normalization

For each engagement type (each "head"), find the maximum prediction across all candidates:

3 candidates, 3 engagement types:
Candidate A: [FAV: 0.42, REPLY: 0.08, CLICK: 0.28]
Candidate B: [FAV: 0.15, REPLY: 0.35, CLICK: 0.20]
Candidate C: [FAV: 0.30, REPLY: 0.12, CLICK: 0.25]

Per-head max:
  Max FAV: 0.42
  Max REPLY: 0.35
  Max CLICK: 0.28

Attach max to each candidate for normalized comparison:
Candidate A: [(0.42, max:0.42), (0.08, max:0.35), (0.28, max:0.28)]
Candidate B: [(0.15, max:0.42), (0.35, max:0.35), (0.20, max:0.28)]
Candidate C: [(0.30, max:0.42), (0.12, max:0.35), (0.25, max:0.28)]

Why normalize per-head? Different engagement types have different prediction ranges. Normalization ensures fair aggregation.
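
A Scala sketch of this step under simplified data shapes (it is not the RerankerUtil code itself): find each head's maximum across all candidates, then pair every candidate's prediction with that maximum:

// predictions: candidateId -> (engagement head -> predicted probability)
def attachPerHeadMax(
    predictions: Map[Long, Map[String, Double]]): Map[Long, Map[String, (Double, Double)]] = {
  // Maximum prediction per head across all candidates.
  val perHeadMax: Map[String, Double] =
    predictions.values
      .flatMap(_.toSeq)
      .groupMapReduce { case (head, _) => head } { case (_, p) => p }(math.max)

  // Pair each candidate's prediction with that head's maximum for normalized comparison.
  predictions.map { case (candidateId, probs) =>
    candidateId -> probs.map { case (head, p) => head -> (p, perHeadMax(head)) }
  }
}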

Step 2: Weighted Aggregation

Phoenix uses the same weights as NaviModelScorer for fair A/B testing comparison:

Weight parameters: HomeGlobalParams.scala:786-1028
Actual values: the-algorithm-ml/projects/home/recap

Weights (configured in production):
  FAV: 0.5
  REPLY: 13.5
  REPLY_ENGAGED_BY_AUTHOR: 75.0
  RETWEET: 1.0
  GOOD_CLICK: 12.0
  ... (8 more positive weights)
  NEGATIVE_FEEDBACK: -74.0
  REPORT: -369.0

Final Score = Σ (prediction_i × weight_i)

Example for Candidate A:
  FAV:    0.42 × 0.5   = 0.21
  REPLY:  0.08 × 13.5  = 1.08
  CLICK:  0.28 × 12.0  = 3.36
  ... (sum all 13 engagement types)

  Phoenix Score = 0.21 + 1.08 + 3.36 + ... = 8.42
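
The aggregation itself reduces to a weighted sum over the heads. A minimal sketch using the example numbers above, with the weight map abridged:

// Weighted sum over the 13 heads (weight map abridged for the example).
def phoenixScore(predictions: Map[String, Double], weights: Map[String, Double]): Double =
  predictions.map { case (head, p) => p * weights.getOrElse(head, 0.0) }.sum

// Candidate A from the example:
// phoenixScore(Map("FAV" -> 0.42, "REPLY" -> 0.08, "GOOD_CLICK" -> 0.28),
//              Map("FAV" -> 0.5,  "REPLY" -> 13.5, "GOOD_CLICK" -> 12.0))
// = 0.21 + 1.08 + 3.36 = 4.65 (the remaining heads add the rest of the 8.42)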

The 13 Engagement Types

Code: PhoenixPredictedScoreFeature.scala:30-193

Phoenix predicts probabilities for 13 different action types:

Engagement Type         | Action                             | Weight
FAV                     | Like/favorite tweet                | 0.5
REPLY                   | Reply to tweet                     | 13.5
REPLY_ENGAGED_BY_AUTHOR | Reply + author engages back        | 75.0
RETWEET                 | Retweet or quote                   | 1.0
GOOD_CLICK              | Click + dwell (quality engagement) | 12.0
PROFILE_CLICK           | Click author profile               | 3.0
VIDEO_QUALITY_VIEW      | Watch video ≥10 seconds            | 8.0
... (6 more)            | Share, bookmark, open link, etc.   | 0.2 - 11.0
NEGATIVE_FEEDBACK       | Not interested, block, mute        | -74.0
REPORT                  | Report tweet                       | -369.0

Context Window: 522 Aggregated Actions (Hours to Days)

Action sequence hydration: UserActionsQueryFeatureHydrator.scala:56-149
Max count parameter: HomeGlobalParams.scala:1373-1379

CRITICAL: The "5-minute window" is for aggregation (grouping actions within proximity), not filtering (time limit on history).

Configuration:
// Aggregation window (for grouping, NOT filtering)
private val windowTimeMs = 5 * 60 * 1000  // Groups actions within 5-min proximity
private val maxLength = 1024                // Max AFTER aggregation

// Actual default used
object UserActionsMaxCount extends FSBoundedParam[Int](
  name = "home_mixer_user_actions_max_count",
  default = 522,    // ← Actual default
  min = 0,
  max = 10000       // ← Configurable up to 10K
)

Processing flow:
1. Fetch user's full action history from storage (days/weeks)
2. Decompress → 2000+ raw actions
3. Aggregate using 5-min proximity window (session detection)
   → Actions within 5-min windows grouped together
4. Cap at 522 actions (default)

Result: 522 aggregated actions spanning HOURS TO DAYS, not 5 minutes!
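
A sketch of this flow in Scala. The real logic lives in UserActionsQueryFeatureHydrator; this simplified version assumes one particular reading of the proximity window (a new session starts whenever the gap to the previous action exceeds 5 minutes) and caps the result at the default of 522:

object ActionAggregationSketch {
  case class RawAction(actionType: String, tweetId: Long, timestampMs: Long)

  private val windowTimeMs = 5 * 60 * 1000L // 5-minute proximity window
  private val maxCount = 522                // UserActionsMaxCount default

  // Walk the time-sorted actions and start a new session whenever the gap
  // to the previous action exceeds the 5-minute window (simplified reading).
  def aggregateIntoSessions(actions: Seq[RawAction]): Seq[Seq[RawAction]] =
    actions.sortBy(_.timestampMs).foldLeft(Vector.empty[Vector[RawAction]]) {
      case (sessions, action) =>
        sessions.lastOption match {
          case Some(current) if action.timestampMs - current.last.timestampMs <= windowTimeMs =>
            sessions.init :+ (current :+ action)
          case _ =>
            sessions :+ Vector(action)
        }
    }

  // Cap the aggregated sequence at the configured maximum, keeping the most recent entries.
  def capSequence(sessions: Seq[Seq[RawAction]]): Seq[Seq[RawAction]] =
    sessions.takeRight(maxCount)
}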

What "5-minute aggregation window" means:

Actual temporal span (522 actions):

Active user (~100 actions/hour):  ~5 hours of behavioral history
Normal user (~30 actions/hour):   ~17 hours of behavioral history
Light user (~10 actions/hour):    ~52 hours (2+ days) of behavioral history

Maximum (10,000 actions): Could span WEEKS for light users

Comparison to LLM context windows:

GPT-3:   2048 tokens (~1500 words, ~3-4 pages of text)
GPT-4:   8K-32K tokens (~6K-24K words)
Phoenix: 522 aggregated actions (~hours to days of behavior)

Phoenix vs Navi: Architecture Comparison

Aspect              | NaviModelScorer (Current)             | Phoenix (Future)
Input Data          | Many aggregated features              | Action sequence (522 aggregated actions, spanning hours to days)
Temporal Modeling   | ❌ Lost in aggregation                | ✅ Explicit via self-attention
Behavioral Context  | ⚠️ Via real-time aggregates           | ✅ Recent actions directly inform predictions
Session Awareness   | ❌ Same prediction all day            | ✅ Adapts to current browsing mode
Feature Engineering | ❌ Many hand-crafted features         | ✅ Minimal (actions + metadata)
Manual Tuning       | ❌ 15+ engagement weights, penalties  | ✅ Learned patterns (eventually)
Computational Cost  | ✅ O(n) feature lookup                | ⚠️ O(n²) transformer attention
Update Frequency    | Daily batch recalculation             | Real-time, every action

Current Status (Verified from Code)

Phoenix Infrastructure: All components verified in twitter/the-algorithm repository

Infrastructure Status (September 2025 commit):

What This Means:

Multi-Model Experimentation Infrastructure

Cluster configuration: HomeGlobalParams.scala:1441-1451
Connection management: PhoenixClientModule.scala:21-61
Cluster selection: PhoenixScorer.scala:52-53

Phoenix isn't a single model—it's 9 separate transformer deployments designed for parallel experimentation:

PhoenixCluster enumeration:
- Prod          // Production model
- Experiment1   // Test variant 1
- Experiment2   // Test variant 2
- Experiment3   // Test variant 3
- Experiment4   // Test variant 4
- Experiment5   // Test variant 5
- Experiment6   // Test variant 6
- Experiment7   // Test variant 7
- Experiment8   // Test variant 8

What This Enables

1. Parallel Model Testing

Twitter can test 8 different Phoenix variants simultaneously:

  • Different architectures: 6-layer vs 12-layer transformers, varying attention heads
  • Different context windows: 256 vs 522 vs 1024 vs 2048 actions
  • Different training data: Models trained on different time periods or user segments
  • Feature integration tests: Actions only vs. actions + embeddings vs. actions + temporal features

2. Per-Request Cluster Selection

Each user's request can be routed to a different cluster:

// From PhoenixScorer.scala:52-53
val phoenixCluster = query.params(PhoenixInferenceClusterParam)  // Select cluster
val channels = channelsMap(phoenixCluster)                        // Route request

// Default: PhoenixCluster.Prod
// But can be dynamically set per user via feature flags

A/B testing flow:

User Alice (bucket: control)      → PhoenixCluster.Prod
User Bob   (bucket: experiment_1) → PhoenixCluster.Experiment1
User Carol (bucket: experiment_2) → PhoenixCluster.Experiment2

3. Progressive Rollout Strategy

Safe, gradual deployment with instant rollback:

Week 1: Deploy new model to Experiment1
        Route 1% of users to Experiment1
        Other 99% stay on Prod
        ↓
Week 2: Compare metrics (engagement, dwell time, follows, etc.)
        If Experiment1 > Prod: increase to 5%
        If Experiment1 < Prod: rollback instantly
        ↓
Week 3: Gradually increase: 10% → 25% → 50%
        Monitor metrics at each step
        ↓
Week 4: If consistently better, promote Experiment1 → Prod
        Start testing next variant in Experiment2

Key advantage: Zero-downtime experimentation. New models can be tested without code deployment or service restart—just change the PhoenixInferenceClusterParam value via feature flag dashboard.

4. Parallel Evaluation (All Clusters Queried)

Multi-cluster logging: ScoredPhoenixCandidatesKafkaSideEffect.scala:85-104

For offline analysis, Twitter can query all 9 clusters simultaneously for the same candidates:

// getPredictionResponsesAllClusters queries ALL clusters in parallel
User request → Candidates [tweet_A, tweet_B, tweet_C]
             ↓
Query Prod:         tweet_A: {FAV: 0.40, REPLY: 0.10, CLICK: 0.60}
Query Experiment1:  tweet_A: {FAV: 0.45, REPLY: 0.12, CLICK: 0.58}
Query Experiment2:  tweet_A: {FAV: 0.38, REPLY: 0.15, CLICK: 0.62}
Query Experiment3:  tweet_A: {FAV: 0.42, REPLY: 0.11, CLICK: 0.65}
... (all 9 clusters)
             ↓
Log to Kafka: "phoenix.Prod.favorite", "phoenix.Experiment1.favorite", ...
             ↓
Offline analysis: Compare predicted vs actual engagement across all models

Purpose: Create comprehensive comparison dataset without affecting user experience. Only the selected cluster's predictions are used for ranking, but all predictions are logged for evaluation.
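
A sketch of what parallel all-cluster evaluation could look like, using Scala Futures. The cluster names match the enumeration above, but the scoring call is a stub and the object and method names are illustrative, not the actual getPredictionResponsesAllClusters implementation:

import scala.concurrent.{ExecutionContext, Future}

object AllClusterEvaluationSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  type EngagementProbs = Map[String, Double]

  val allClusters: Seq[String] = "Prod" +: (1 to 8).map(i => s"Experiment$i")

  // Stand-in for a per-cluster gRPC scoring call.
  def scoreOnCluster(cluster: String, candidateIds: Seq[Long]): Future[Map[Long, EngagementProbs]] =
    Future(candidateIds.map(id => id -> Map("SERVER_TWEET_FAV" -> 0.4)).toMap)

  // Query every cluster in parallel; results can then be logged per cluster,
  // e.g. under keys like "phoenix.Experiment1.favorite".
  def scoreAllClusters(candidateIds: Seq[Long]): Future[Map[String, Map[Long, EngagementProbs]]] =
    Future
      .traverse(allClusters)(cluster => scoreOnCluster(cluster, candidateIds).map(cluster -> _))
      .map(_.toMap)
}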

Hybrid Mode: Mixing Navi and Phoenix Predictions

Hybrid configuration: HomeGlobalParams.scala:1030-1108

Twitter can use Navi predictions for some action types and Phoenix predictions for others:

Hybrid Mode Configuration (per action type):
- EnableProdFavForPhoenixParam         = true   // Use Navi for favorites
- EnableProdReplyForPhoenixParam       = true   // Use Navi for replies
- EnableProdGoodClickV2ForPhoenixParam = false  // Use Phoenix for clicks
- EnableProdVQVForPhoenixParam         = false  // Use Phoenix for video views
- EnableProdNegForPhoenixParam         = true   // Use Navi for negative feedback
... (13 total flags, one per engagement type)
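
In code terms, hybrid mode reduces to a per-head choice between the two prediction sources. A simplified sketch; the head names and flag map are illustrative, patterned on the parameters above:

// For each engagement head, use the Navi (prod) prediction when its flag is on,
// otherwise use the Phoenix prediction. Head and flag names here are illustrative.
def mixPredictions(
    naviPreds: Map[String, Double],
    phoenixPreds: Map[String, Double],
    useProdForHead: Map[String, Boolean]): Map[String, Double] =
  phoenixPreds.map { case (head, phoenixP) =>
    val p =
      if (useProdForHead.getOrElse(head, false)) naviPreds.getOrElse(head, phoenixP)
      else phoenixP
    head -> p
  }

// Matching the configuration above: Navi for FAV and REPLY, Phoenix for GOOD_CLICK_V2.
// mixPredictions(navi, phoenix, Map("FAV" -> true, "REPLY" -> true, "GOOD_CLICK_V2" -> false))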

Incremental migration strategy:

Phase 1: Enable Phoenix, but use Navi for all predictions
         (Shadow mode - Phoenix predictions logged but not used)
         ↓
Phase 2: Use Phoenix for low-risk actions (photo expand, video view)
         Keep Navi for high-impact actions (favorite, reply, retweet)
         ↓
Phase 3: Gradually enable Phoenix for more action types
         Monitor metrics after each change
         ↓
Phase 4: Full Phoenix mode - all predictions from transformer
         Navi retired or kept as fallback

Why this matters: Reduces risk by preserving proven Navi predictions while testing Phoenix predictions incrementally. If Phoenix predictions for clicks are great but favorites are worse, Twitter can use Phoenix for clicks only.

What This Reveals

This isn't experimental infrastructure—it's production A/B testing at scale.

The sophistication of the cluster system suggests:

Phoenix's feature gate (default = false) doesn't mean "not deployed"—it means "controlled rollout." Twitter can activate Phoenix for specific user cohorts, test different model variants, and compare results, all without changing code.

Technical Details

Connection pooling: Each cluster maintains 10 gRPC channels for load balancing and fault tolerance (90 total connections across 9 clusters).

Request routing: Randomly selects one of 10 channels per request for even load distribution (PhoenixUtils.scala:107-117).

Retry policy: 2 attempts with different channels, 500ms default timeout (configurable to 10s max).

Graceful degradation: If a cluster fails to respond, the system continues with other clusters (for logging) or falls back to Navi (for production scoring).
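
A sketch of the channel-selection idea; the real logic lives in PhoenixUtils, and this version only illustrates picking one of the pooled channels at random and retrying once on a different channel:

import scala.util.{Random, Try}

// Pick one of the pooled channels at random; on failure, retry once on a different channel.
def callWithRetry[Channel, Resp](channels: IndexedSeq[Channel])(call: Channel => Resp): Try[Resp] = {
  val first = channels(Random.nextInt(channels.size))
  Try(call(first)).recoverWith { case _ =>
    val remaining = channels.filterNot(_ == first)
    Try(call(remaining(Random.nextInt(remaining.size))))
  }
}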

Code References

Phoenix Scorer (Stage 1 - Prediction): PhoenixScorer.scala:30-85

Phoenix Reranking Scorer (Stage 2 - Aggregation): PhoenixModelRerankingScorer.scala:23-81

User Action Sequence Hydrator: UserActionsQueryFeatureHydrator.scala:56-149

13 Engagement Predictions (Action Types): PhoenixPredictedScoreFeature.scala:30-193

gRPC Transformer Service Integration: PhoenixUtils.scala:26-159

Per-Head Max Normalization: RerankerUtil.scala:38-71

Weighted Aggregation Logic: RerankerUtil.scala:91-137

Model Weight Parameters: HomeGlobalParams.scala:786-1028

Actual Weight Values (ML Repo): the-algorithm-ml/projects/home/recap

Action Sequence Max Count: HomeGlobalParams.scala:1373-1379


The Bottom Line

What we know: Phoenix infrastructure is complete, feature-gated, and production-ready. The architecture represents a fundamental shift from feature-based to sequence-based prediction. More importantly, Phoenix has sophisticated A/B testing infrastructure that strongly suggests active deployment on real users.

Evidence of Active Deployment

The 9-cluster system isn't just placeholder infrastructure; it's production experimentation at scale.

Conclusion: This level of infrastructure sophistication indicates Phoenix is likely being tested on production traffic right now, not merely prepared for future deployment.

What This Means

A paradigm shift is in progress.

If/When Phoenix Becomes Default

The algorithm would understand you not as a static profile of historical preferences, but as a dynamic behavioral sequence revealing your current intent. Your feed would adapt as you browse, following threads of interest that emerge in your behavior—not mechanically serving averaged content from long-term statistics.

This mirrors how humans actually consume content: following curiosity as it arises, deep-diving into topics that capture attention, switching contexts when interest shifts. An algorithm that learns to follow your behavioral lead, not force you into a predetermined statistical box.

Current Reality

Verified from code:

What we don't know from open-source code:

Most likely scenario: Phoenix is in active A/B testing with controlled user cohorts. Twitter is iterating on multiple model variants (via Experiment1-8 clusters), comparing results, and gradually expanding deployment as metrics improve. The infrastructure is too sophisticated to be merely preparatory.