Multi-Task Multi-Domain Evals
Create a senior-grade eval battery covering reasoning (math and code), instruction-following, safety, multilingual QA, and tool use. Include uncertainty intervals on reported scores and a power analysis for A/B comparisons.
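As a rough illustration of the statistics requested here, the sketch below computes an uncertainty interval for a pass rate and a per-arm sample size for an A/B comparison. The description does not say which methods to use, so the Wilson score interval and the normal-approximation formula for a two-sided two-proportion test are assumptions, and the function names `wilson_interval` and `samples_per_arm` are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a pass rate (assumed method, not from the source)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def samples_per_arm(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Items needed per arm to detect p_a vs p_b with a two-sided z-test.

    Uses the unpooled normal approximation and assumes independent,
    unpaired items in each arm (an assumption for this sketch).
    """
    if p_a == p_b:
        raise ValueError("p_a and p_b must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


if __name__ == "__main__":
    low, high = wilson_interval(successes=80, n=100)
    print(f"pass rate 0.80, 95% interval: [{low:.3f}, {high:.3f}]")
    print("items per arm for 0.70 vs 0.75:", samples_per_arm(0.70, 0.75))
```

If both models are scored on the same items, a paired analysis (for example a bootstrap over per-item score differences) typically needs fewer items than this unpaired estimate suggests.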