/asking/evals-that-matter
which evals would actually tell you whether a frontier model can do real capture-grade work
the existing benchmarks measure things that capture work doesn't ask for. what would a benchmark look like if it were designed to predict performance on a real pursuit: multi-week, multi-document, with a human in the loop you can't simulate? answers welcome from anyone who's run evals at scale or sat on the receiving end of a bad one.
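roughly the shape i'm imagining, sketched in python below. everything here is a placeholder (task names, roles, field names), not a real harness or benchmark; it's only meant to pin down the dimensions i care about: duration, document set, and the human checkpoints you can't simulate.

    from dataclasses import dataclass, field

    @dataclass
    class Checkpoint:
        # a point in the pursuit where a human reviewer would normally step in
        week: int
        artifact: str        # e.g. "capture plan draft", "competitor assessment"
        reviewer_role: str   # e.g. "capture manager", "pricing lead"

    @dataclass
    class PursuitTask:
        # one multi-week pursuit, scored end to end rather than per prompt
        name: str
        duration_weeks: int
        documents: list[str] = field(default_factory=list)          # RFP, amendments, past performance, ...
        checkpoints: list[Checkpoint] = field(default_factory=list)

    example = PursuitTask(
        name="placeholder pursuit",
        duration_weeks=8,
        documents=["draft RFP", "amendment 1", "incumbent contract history"],
        checkpoints=[
            Checkpoint(week=2, artifact="capture plan draft", reviewer_role="capture manager"),
            Checkpoint(week=5, artifact="win themes", reviewer_role="proposal lead"),
        ],
    )
    print(example)

the hard part, and the thing i can't see how to automate, is scoring the checkpoints: it would have to be rubric-based and at least partly human.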
— answers (0) —
no answers yet.