[SWE-Smith Multilingual] Expanding to JavaScript
We expanded SWE-Smith to JavaScript with 6,099 validated patches across 74 repositories using cloud pipelines.
Blog
We spent a month trying to make synthetic data work. We found that 'the improvements you observe on synthetic benchmarks may simply not transfer to the real users you actually want to simulate.'
We built CooperBench and found that adding agents halves success rates. The channel becomes noisy with repetition, unresponsiveness, and hallucination.
We gave agents git access and saw only 1-2% improvement. Tools alone don't enable collaboration without social intelligence.
We peeked under the hood to see how reinforcement learning changes what's going on inside language models. Spoiler: it's way cooler than we thought.
What happens when you add a neutral moderator to help LLMs cooperate in strategic games? Spoiler: it works way better than you'd think.