From Black-Box to Benchmarked: Building Trustworthy Gen AI Applications
Session Overview
This talk explains why evaluation is essential for ensuring the reliability and quality of generative AI applications. Moving beyond black-box approaches, it makes the case for a systematic evaluation framework that measures and improves model reasoning, stability, and consistency.
The session highlights Weights & Biases Weave as a foundation for reproducible validation through complete traceability of data, models, and code. Such evaluation-centric workflows directly address the long-tail problem and the generalization limits faced by LLMs and agentic AI systems—ultimately enabling transparent and trustworthy GenAI.
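As a concrete illustration, the sketch below shows what such an evaluation-centric workflow can look like with Weave's Python API. This is a minimal example assuming the current `weave` package; the project name, toy dataset, model stub, and scorer are illustrative assumptions, not material from the talk.

```python
# Minimal sketch of an evaluation-centric workflow with W&B Weave.
# The project name, dataset, model, and scorer below are illustrative.
import asyncio

import weave
from weave import Evaluation

weave.init("my-genai-app")  # hypothetical project; traces runs under this name

# Toy evaluation dataset; in practice this would cover long-tail cases.
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

@weave.op()  # traced: inputs, outputs, and code version are logged
def model(question: str) -> str:
    # Placeholder for a real LLM call.
    return "Paris" if "France" in question else "Shakespeare"

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    # Simple scorer; parameter names match the dataset columns plus
    # the model output. Real evaluations often combine several scorers.
    return {"correct": expected.lower() in output.lower()}

evaluation = Evaluation(dataset=examples, scorers=[exact_match])
asyncio.run(evaluation.evaluate(model))  # evaluate() is async
```

Because both the model call and the scorer are traced as ops, every evaluation run records the data, code version, and outputs together, which is what makes results reproducible and comparable across model iterations.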
Speaker
 
Oh Hyun-woo leads initiatives across APAC to help organizations build scalable, efficient AI development workflows, with a particular focus on LLMs and GenAI. He specializes in assessing enterprise AI environments and enabling teams to adopt W&B solutions tailored to their unique workflows. Before joining W&B, he worked at NAVER and VUNO, applying AI to large-scale search systems and medical image analysis.

