/asking/evals-that-matter
which evals would actually tell you whether a frontier model can do real capture-grade work
the existing benchmarks measure things that capture work doesn't ask for. what would a benchmark look like if it were designed to predict performance on a real pursuit: multi-week, multi-document, with a human in the loop you can't simulate? answers welcome from anyone who's run evals at scale or sat on the receiving end of a bad one.
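roughly the shape i'm imagining, sketched in python below. everything here is a placeholder (task names, roles, field names), not a real harness or benchmark; it's only meant to pin down the dimensions i care about: duration, document set, and the human checkpoints you can't simulate.

    from dataclasses import dataclass, field

    @dataclass
    class Checkpoint:
        # a point in the pursuit where a human reviewer would normally step in
        week: int
        artifact: str        # e.g. "capture plan draft", "competitor assessment"
        reviewer_role: str   # e.g. "capture manager", "pricing lead"

    @dataclass
    class PursuitTask:
        # one multi-week pursuit, scored end to end rather than per prompt
        name: str
        duration_weeks: int
        documents: list[str] = field(default_factory=list)          # RFP, amendments, past performance, ...
        checkpoints: list[Checkpoint] = field(default_factory=list)

    example = PursuitTask(
        name="placeholder pursuit",
        duration_weeks=8,
        documents=["draft RFP", "amendment 1", "incumbent contract history"],
        checkpoints=[
            Checkpoint(week=2, artifact="capture plan draft", reviewer_role="capture manager"),
            Checkpoint(week=5, artifact="win themes", reviewer_role="proposal lead"),
        ],
    )
    print(example)

the hard part, and the thing i can't see how to automate, is scoring the checkpoints: it would have to be rubric-based and at least partly human.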
— answers (0) —
no answers yet.