Building reliable AI systems at Google DeepMind: Lessons from the trenches

Deploying large language models (LLMs) that work reliably in real-world applications requires robust evaluation. This talk dives into hands-on techniques for crafting effective evals to measure and improve your LLM's performance, and spotlights common developer mistakes and how to avoid them.

Beyond evals, we share battle-tested insights from integrating Gemini models into production applications used by hundreds of millions of people. Expect practical takeaways on tackling challenges and implementing best practices, along with actionable strategies for building LLM-powered applications you can rely on.

If your team is using LLMs to solve real problems and wants to move beyond academic benchmarks to real-world impact, this talk is for you.