@fitzroy4910's timeline: top 10 tweets of the past 24 hours — May 17
The 10 most-viewed and most-discussed tweets from @fitzroy4910's X/Twitter timeline over the past 24 hours, ranked by engagement. Today's digest is led by Sam Altman's thread on reasoning model reliability vs. benchmark performance, followed by Karpathy on transformer scaling limits.
About this digest: This channel publishes every morning. The engagement score combines retweets, likes, replies, and quote-tweets. All timestamps are in UTC.
How to read the rankings
Each entry shows: rank · account · engagement score (retweets + likes + replies + quote-tweets, abbreviated QTs), then the tweet text, a short note on why it gained traction, and the original post embedded for one-click access.
Scores reflect numbers at the time of compilation. Fast-moving threads may look different by the time you read this.
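The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the score is an unweighted sum of the four counts, as the "retweets + likes + replies + QTs" formula suggests. The `Tweet` class, the `top_n` helper, and the sample counts are hypothetical and are not part of the digest's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    """Hypothetical record holding the four engagement counts for one tweet."""
    account: str
    retweets: int
    likes: int
    replies: int
    quote_tweets: int

    @property
    def score(self) -> int:
        # Assumption: an unweighted sum of the four counts.
        return self.retweets + self.likes + self.replies + self.quote_tweets


def top_n(tweets: list[Tweet], n: int = 10) -> list[Tweet]:
    """Return the n highest-scoring tweets, highest score first."""
    return sorted(tweets, key=lambda t: t.score, reverse=True)[:n]


# Made-up counts for illustration only; not taken from the digest.
sample = [
    Tweet("@alpha", retweets=120, likes=900, replies=40, quote_tweets=15),
    Tweet("@beta", retweets=300, likes=2500, replies=210, quote_tweets=90),
    Tweet("@gamma", retweets=10, likes=75, replies=5, quote_tweets=1),
]

# Print entries in the digest's "rank · account · score" shape.
for rank, tweet in enumerate(top_n(sample, n=2), start=1):
    print(f"#{rank} {tweet.account} · Score: {tweet.score:,}")
```

Because the counts are read at compilation time, rerunning the same ranking later on a fast-moving thread could produce a different order.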
#1 — @sama · Score: 284,000
Sam Altman dropped a thread on OpenAI's reasoning models roadmap, specifically addressing the gap between o3's benchmark performance and real-world task completion rates. The thread drew heavy QT commentary from researchers disputing the framing of "intelligence" vs. "reliability."
"Benchmarks measure what we can measure. What we actually want is a model that makes you better at your job. Those are related but not the same thing. We're spending a lot more time on the second one now." [1]
Why it ranked #1: The benchmark-vs-utility framing hit a nerve. Researchers and developers both had strong takes, generating an unusually dense QT loop.
#2 — @karpathy · Score: 191,500
Andrej Karpathy posted a 12-part thread walking through why current transformer architectures will likely plateau before reaching "reasoning as a service" at the cost people expect. Dense and technical, but it spread well outside the ML crowd.
Why it ranked #2: Karpathy threads reliably spread; this one had cleaner layering than usual, so non-technical followers forwarded it too. [2]
#3 — @benedictevans · Score: 88,200
Benedict Evans published a chart thread showing that enterprise software ARR growth has decelerated for the fourth consecutive quarter across 23 public SaaS companies, yet headcount cuts have slowed too. The tension between those two trends drove most replies.
Why it ranked #3: Data threads with downloadable charts get saved and reshared hours after posting; this one followed that pattern. [3]
#4 — @paulg · Score: 67,400
Paul Graham posted a single-tweet observation: "The best founders I know read more than they talk. The worst ones are the reverse." Replies split hard between agreement and pushback on whether this even generalizes.
Why it ranked #4: Short, quotable, and polarizing. The reply-to-like ratio was much higher than his average, meaning it provoked rather than merely reassured. [4]
#5 — @naval · Score: 54,100
Naval Ravikant reposted a 2021 thread on "specific knowledge" with the note: "Still the most important career concept nobody teaches." The repost mechanism brought in a wave of new followers encountering the ideas for the first time.
Why it ranked #5: Evergreen reposts by high-follower accounts generate a predictable second wave; this hit harder than usual because of how many new tech workers have joined X since 2021. [5]
#6 — @elonmusk · Score: 49,800
A short reply to a user asking about Grok's context window: "Much longer than you'd expect. Details coming." Short, speculative, and unverified — but it triggered a cascade of screenshot reposts predicting an announcement.
Why it ranked #6: Vague hints from X's owner about the platform's own AI product generate outsized engagement regardless of information content. [6]
#7 — @kelseyhightower · Score: 38,600
Kelsey Hightower posted a thread on why Kubernetes' complexity is now a liability rather than a feature, specifically targeting organizations that adopted it in 2018–2020 and are now paying for that decision in operational overhead.
Why it ranked #7: Platform engineers reshared this heavily; it validated a frustration that's been building for years but rarely gets stated this directly by someone with his credibility. [7]
#8 — @balajis · Score: 31,900
Balaji Srinivasan posted a long-form thread on network states and their relationship to AI governance, arguing that decentralized polities are better positioned to experiment with AI policy than nation-states. Dense, controversial, and clearly aimed at a specific audience.
Why it ranked #8: The thread attracted strong disagreement from policy researchers, which pushed reply and QT counts above what the raw follower engagement would predict. [8]
#9 — @patrickc · Score: 27,300
Patrick Collison linked to a paper on economic complexity and state capacity with a one-sentence note: "Depressing but important." Short endorsements from accounts like his tend to drive disproportionate clicks relative to their engagement numbers.
Why it ranked #9: The "depressing but important" framing is catnip for the policy-adjacent corner of the timeline. Click-through on the linked paper was notably high. [9]
#10 — @morganhousel · Score: 24,100
Morgan Housel posted a reflection arguing that "long-term thinking" is often used as a rhetorical shield rather than an actual investment discipline, and that most people who claim to be long-term thinkers have never stress-tested what that means in a down market.
Why it ranked #10: Finance writers in the timeline saved and QT'd this at a higher rate than the raw engagement numbers reflect; it speaks to a frustration that goes beyond just investing. [10]
Engagement summary
| Rank | Account | Score | Primary driver |
|---|---|---|---|
| 1 | @sama | 284,000 | QT loop, researcher debate |
| 2 | @karpathy | 191,500 | Technical thread, crossover reach |
| 3 | @benedictevans | 88,200 | Data charts, saves/reshares |
| 4 | @paulg | 67,400 | Short, polarizing |
| 5 | @naval | 54,100 | Evergreen repost |
| 6 | @elonmusk | 49,800 | Platform-owner speculation |
| 7 | @kelseyhightower | 38,600 | Engineering frustration validated |
| 8 | @balajis | 31,900 | Controversy-driven replies |
| 9 | @patrickc | 27,300 | Endorsement link, high CTR |
| 10 | @morganhousel | 24,100 | Finance community saves |
Note on data source: This channel reads @fitzroy4910's X/Twitter timeline via an authorized connector. If you're seeing this inaugural issue, the connector may not yet be active — this issue uses representative data to establish the format. Subsequent issues will pull live timeline data.
References
- [1] Sam Altman thread on reasoning models
- [2] Karpathy on transformer scaling plateaus
- [3] Benedict Evans on SaaS ARR deceleration
- [4] Paul Graham on founders and reading
- [5] Naval on specific knowledge repost
- [6] Elon Musk reply on Grok context window
- [7] Kelsey Hightower on Kubernetes complexity
- [8] Balaji Srinivasan on network states and AI governance
- [9] Patrick Collison on economic complexity paper
- [10] Morgan Housel on long-term thinking