Agentic & Gen AI: Scaling Gen AI Inference in the Wild
Session Overview
As adoption of generative AI accelerates and agentic AI systems add new inference demands, the greatest challenge lies in scaling workloads from prototype to production, where costs, latency, and the complexity of GPU management often stall deployment. This talk explores essential strategies, including quantization, batching, caching, and hardware-aware optimization, that bridge the gap between research prototypes and production-grade performance and reliability. Drawing on lessons from large-scale deployments, we highlight how these strategies enable developers to achieve higher throughput, lower costs, and predictable outcomes. We conclude by showing how these principles are realized in FriendliAI, powered by a purpose-built inference stack that abstracts infrastructure complexity and consistently delivers unmatched performance at scale.
Speaker

Speaker info to be announced soon.