We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked...
We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.
We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked...
My AI conversations were scattered across three apps that couldn't remember each other. So I built a...
DProvenanceKit — regression testing and observability for the reasoning of AI agents (Python, zero...
This week's tooling moves cluster around a common theme: eliminating the overhead tax on developer...