We added synthetic data to our eval set. The pass rate rose, and so did our production incidents.

We needed a bigger eval set, so we generated one. A model wrote a few thousand test cases that looked...

Read Original

Related