From Black-Box to Benchmarked: Building Trustworthy Gen AI Applications
Session Overview
This talk explains why evaluation is essential for ensuring the reliability and quality of generative AI applications. Moving beyond black-box approaches, it makes the case for a systematic evaluation framework that measures and improves model reasoning, stability, and consistency.
The session highlights Weights & Biases Weave as a foundation for reproducible validation through complete traceability of data, models, and code. Such evaluation-centric workflows directly address the long-tail problem and the generalization limits faced by LLMs and agentic AI systems—ultimately enabling transparent and trustworthy GenAI.
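As a concrete illustration, the sketch below shows what such an evaluation-centric workflow can look like with Weave's Python API. This is a minimal example assuming the current `weave` package; the project name, toy dataset, model stub, and scorer are illustrative assumptions, not material from the talk.

```python
# Minimal sketch of an evaluation-centric workflow with W&B Weave.
# The project name, dataset, model, and scorer below are illustrative.
import asyncio

import weave
from weave import Evaluation

weave.init("my-genai-app")  # hypothetical project; traces runs under this name

# Toy evaluation dataset; in practice this would cover long-tail cases.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

@weave.op()  # traced: inputs, outputs, and code version are logged
def model(question: str) -> str:
    # Placeholder for a real LLM call.
    return "Paris" if "France" in question else "Shakespeare"

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Simple scorer; parameter names match the dataset columns plus
    # the model output. Real evaluations often combine several scorers.
    return {"correct": expected.lower() in output.lower()}

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(model))  # evaluate() is async
```

Because both the model call and the scorer are traced as ops, every evaluation run records the data, code version, and outputs together, which is what makes results reproducible and comparable across model iterations.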
Speaker
 
Oh Hyun-woo leads initiatives across APAC to help organizations build scalable, efficient AI development workflows, with a particular focus on LLMs and GenAI. He specializes in assessing enterprise AI environments and enabling teams to adopt W&B solutions tailored to their unique workflows. Before joining W&B, he worked at NAVER and VUNO, applying AI to large-scale search systems and medical image analysis.

