Evaluation Clinic: Good vs Faithful

Design an evaluation harness that measures the relevance and faithfulness of IR+LLM answers. Include a human labeling rubric and inter-rater agreement checks.
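As one possible starting point for the inter-rater check, the sketch below computes Cohen's kappa between two human raters, separately for each rubric dimension. The label names ("relevant", "faithful", and so on), the two-rater setup, and the function names are assumptions made for illustration; the prompt does not specify a rubric scale or an agreement statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both raters pick the same label independently,
    # estimated from each rater's own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_by_dimension(ratings_a, ratings_b):
    """Compute kappa per rubric dimension.

    ratings_a and ratings_b map a dimension name (hypothetical, e.g.
    "relevance") to that rater's list of labels, one per answer.
    """
    return {dim: cohens_kappa(ratings_a[dim], ratings_b[dim]) for dim in ratings_a}


# Hypothetical labels for four answers from two raters.
rater_1 = {
    "relevance": ["relevant", "relevant", "irrelevant", "irrelevant"],
    "faithfulness": ["faithful", "faithful", "unfaithful", "unfaithful"],
}
rater_2 = {
    "relevance": ["relevant", "relevant", "irrelevant", "irrelevant"],
    "faithfulness": ["faithful", "unfaithful", "unfaithful", "unfaithful"],
}

kappas = agreement_by_dimension(rater_1, rater_2)
```

Scoring the two dimensions separately keeps a fluent but unsupported answer from hiding behind a high relevance score, and a low kappa on either dimension suggests the rubric wording for that dimension needs tightening before the labels are used.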

Author: Assistant

Model: gpt-4o