[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.
Only 200 AI Engineer tickets left - on track to sell out in the next 24 hours. Grab now for over $60k in sponsor credits!Add this to the WTF Happened in 2025? files: OpenAI Economic Research is reporting that token usage for everything outside coding is exploding:Through August 2025, the average OpenAI worker spent less than 10% of their tokens on Codex…Over the last six months, Codex usage has deepened and intensified at OpenAI. Among active internal users, change in combined output tokens rose sharply across departments. Research saw the biggest jump: by June 2026, median use was 56 times higher than in November 2025. Customer Support rose 32 times and Engineering rose 27 times, while Legal grew more gradually but still reached 13 times its November level.This should form an interesting baseline against Tokenmaxxing concerns - remember that OpenAI employees have had unlimited access at all times anyway, and SOMEHOW they were still grossly underusing AI even up til late 2025.Sometimes, you just have to let them cook:AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!AI Twitter RecapOpen Models, Coding Benchmarks, and the GLM/Ornith/Liquid WaveGLM-5.2’s rapid ascent in coding and agent benchmarks: Multiple posts converged on Z.ai’s GLM-5.2 as the day’s most important open-model story. On frontend coding, Arena reported that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench noted 34.29% for GLM 5.2 Max reasoning, narrowly ahead of Opus 4.8 Max at 34.08%, with zero failed runs across 84 runs. The speed side also moved: @Yuchenj_UW said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200s before further gains on B300s, attributing results to both hardware and optimizations such as speculative decoding and kernels.New coding-specialized open weights: Ornith-1.0 launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE, and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the task-specific scaffolds driving those rollouts. Meanwhile, Liquid AI shipped LFM2.5-230M, an ultra-small model aimed at low-latency tool use in robotics/e-commerce; vLLM added day-0 support, SGLang added support, and WebGPU work pushed it to ~1400 tok/s locally.Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal AdoptionGoogle pushes computer use into Gemini 3.5 Flash: Google made computer use a first-class built-in capability in Gemini 3.5 Flash across browser, desktop, and mobile. The main launch posts came from @Google, @GoogleDeepMind, and @googledevs. Safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared a quickstart showing Android-phone control via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.Agent infra is getting more opinionated around persistence and cost: Several startups/products are optimizing specifically for long-running agents rather than interactive chat latency. Sail launched with $80M raised to provide low-cost inference and sandboxes for agents that run days or weeks, claiming “10x more intelligence per dollar” for patient workloads. Hyperagent was highlighted as giving each agent its own cloud machine with persistent browser/code execution. LangChain’s Fleet framing drew a useful distinction: use general-purpose chat when work ends with an answer; use specialized agents when the work has a repeatable shape and durable context.OpenAI’s internal Codex usage is becoming a leading indicator: OpenAI said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks. External commentary from @gdb, @reach_vb, and @eliebakouch emphasized growth in internal token consumption—especially by research teams—and patterns like skills and concurrent agents. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support review loops, tooling, and persistent workflows.Evaluation, Reward Hacking, and Synthetic Data as a Frontier LeverPublic benchmarks are increasingly compromised: Cursor’s research post argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with ProgramBench’s push toward no-internet settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not benchmarking hygiene.Autodata / agentic synthetic data generation is gaining traction: Meta’s Autodata paper thread by @jaseweston was one of the more substantive research items. The proposal is to treat data generation as a data scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better train/eval data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr and @omarsar0. This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.Data curation is now also a test-time-compute lever: Datology argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance; @pratyushmaini framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to serving cost and user-perceived latency, not just benchmark quality.Open Ecosystem Economics: Hugging Face, Data Releases, and Agent ToolchainsHugging Face crossed a major business milestone without abandoning its open positioning: Clement Delangue announced $100M annual run-rate, while saying HF still keeps the platform free/open for 97% of users and manages hundreds of petabytes of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like Gemma 4 hitting 200M downloads in 2.5 months.Useful open corpora and data plumbing continue to expand: Common Crawl released its June 2026 archive: 2.10B web pages, 354 TiB uncompressed, from 40.8M hosts, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus, a 10B-token, fully open telecom corpus. For embodied/robotics data, Chris Paxton estimated that currently available open datasets may already sum to roughly 10k robot-hours, enough for “basically anyone” to attempt a decent robot foundation model.Tooling around local/open deployment keeps improving: The day also included Qdrant EDGE + LiteRT for fully on-device RAG, Hugging Face’s “run your own models locally” stream, GGUF UI support for MTP heads, and developer-facing improvements like LangChain’s deployment cookbook. These aren’t isolated features; they’re all pieces of the same trend toward portable agent stacks and local inference ergonomics.Policy, Access Control, and the Distillation FightFable 5 was not back; it was likely a UI artifact: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus, but Anthropic-side corrections were explicit: @sammcallister said they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said there was no Fable/Mythos traffic, likely just a UI bug or trolling. A later correction post reflected that.The distillation dispute escalated into policy theater: Discussion around Anthropic’s claims about millions of Claude exchanges allegedly used by Alibaba spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei’s letter, while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning. The most concrete policy-development signal was that The Information reported the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer-by-customer, suggesting an emerging de facto review regime for frontier launches.Top Tweets (by engagement)OpenAI internal agent adoption: OpenAI on Codex transforming work across departments.Hugging Face economics: Clement Delangue on HF surpassing $100M ARR.Benchmark integrity: Cursor on models hacking public benchmarks.Open coding models: Ornith-1.0 launch.Google agent productization: Gemini 3.5 Flash computer use launch.Multi-agent systems behavior: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x.AI Reddit Recap/r/LocalLlama + /r/localLLM Recap1. Specialized Open Model ReleasesNVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline’s aggregate benchmark score while achieving 2.42× wall-clock generation throughput. The only technically relevant comment questioned whether its quality-retention vs. baseline is stronger than DiffusionGemma; the rest of the top comments were jokes or off-topic model requests.A commenter noted that Nemotron-TwoTower-30B-A3B-Base-BF16 appears to retain more accuracy relative to its original Nemotron backbone than DiffusionGemma does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and ~3B active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops—predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains—potentially enabling offline agent training/evaluation, synthetic trajectories, and mocked tool workflows. The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for ls -la. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as “You are an MCP server now.”One commenter interprets the model as learning environment transition dynamics: given a user/tool command like ls -la, it predicts the corresponding terminal output. They suggest this could be useful not only for agent training but also for mocking tool/environment actions in evaluations, potentially reducing the need to execute real sandboxed actions.Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated “world” traces—MCP, terminal, SWE, Android, web, and OS interactions—and then evaluated for downstream agent performance improvements. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT (Activity: 1123): Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed 3.3B multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to 32K output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on GitHub and the announcement is on X. Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly “gundam mode” means.Commenters asked for direct benchmarking against PaddleOCR-VL-1.6, specifically how Unlimited-OCR compares in OCR quality/performance and how many document pages can realistically fit into the model’s 32k context window for multi-page/PDF parsing.A technical ambiguity was raised around the model/docs mentioning “gundam mode”—multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.One commenter linked the model card on Hugging Face: baidu/Unlimited-OCR, while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.Ornith-1.0 released on Hugging Face (Activity: 391): DeepReinforce-AI released the Ornith-1.0 Hugging Face collection, including 9B/31B dense and 35B/397B MoE variants, with claimed SOTA results across unspecified benchmarks; commenters characterize them as post-trained Qwen3.5 and Gemma4 models. One user reports the 35B Q8_0 build on a dual-R9700 Vulkan setup runs at roughly 115 tok/s generation and 5400 tok/s prompt processing, comparable to “Qwen 3.6 35B with thinking off,” with occasional transient drops to 95 tok/s. Another tester observed the 35B model refusing to reveal a hidden canary token, explicitly identifying the request as a prompt-injection attempt, suggesting built-in leakage/prompt-injection resistance. Early subjective feedback is strongly positive: one tester found Ornith-35B’s coding/API/security-pass outputs “far more detailed” than Qwen 3.6 35B while being much faster, concluding *“This might be the real deal.”A user reports the Ornith-1.0 35B Q8_0 quant has essentially identical raw throughput to Qwen 3.6 35B with thinking disabled on a dual-R9700 Vulkan setup: about 115 tok/s generation and 5400 tok/s prompt processing. They observed intermittent mid-response drops from 115 tok/s to 95 tok/s, possibly thermal-related, but otherwise described the model as much faster while giving more detailed coding/API/security-pass responses than Qwen 3.6 35B in informal Ruby/Sinatra tests.Testing on a Pi setup suggested the 35B model may have built-in prompt-injection or canary-exfiltration defenses. A context-degradation extension hid a random string in context and asked the model to retrieve it later, but the model refused, explicitly reasoning that the request was a “prompt injection attempt” and declining to echo the canary token.Several commenters frame Ornith-1.0 as post-trained Qwen3.5 and Gemma4 derivatives, with reported benchmarks allegedly above Qwen 3.6 27B. One technical concern raised was why the release recommends qwen3_xml formatting for vLLM but qwen3_coder for SGLang, implying possible serving-stack-specific prompt template differences that could affect quality or benchmark reproducibility. Read more