TL;DR: Hand-written eval cases test the failures you already imagined, which are never the ones that...
We stopped writing eval cases by hand. Now every prod incident becomes one.
WebMCP lets a web page expose tools that AI agents can discover and execute inside the browser. That...
An agent opened a pull request on our service last week. Six hundred lines. It rewrote how we handle...
I never thought I'd say this as someone who has embraced AI from the beginning, but after relying on...