Rosenverse

Building impactful AI products for design and product leaders, Part 2: Evals are your moat

Wednesday, July 23, 2025 • Rosenfeld Community

This video is featured in the AI and UX playlist.

Speakers: Peter Van Dijck

Summary

The secret ingredient for impactful AI products is “evals”: an architecture for ongoing evaluation of quality. Without evals, you don’t know if your output is good, and you don’t know when you’re done. Because outputs are non-deterministic, it’s very hard to tell whether you are creating real value for your users, and when something goes wrong, it’s hard to figure out why. Simply Put’s Peter Van Dijck will demystify evals and share a simple framework for planning and building useful evals, from qualitative user research to automated evals using an LLM as a judge.

Key Insights

  • AI product development involves three layers: model capabilities, context management, and user experience, with evals central to experience quality assurance.

  • Automated evals help scale testing of AI with inherently open-ended inputs and outputs, enabling faster iteration cycles with confidence.

  • LLMs can serve as judges (evaluators) of other LLM outputs, which works because classification is cognitively easier than generation.

  • Defining what 'good' means for an AI system is a detailed, evolving process informed by research, domain expertise, and observed risks.

  • A three-option evaluation (e.g., yes/no/maybe) works better than fine-grained scales for consistent automated scoring by LLMs.

  • Synthetic data, generated by LLMs based on manually created examples, efficiently expands dataset breadth and usefulness.

  • Domain experts are essential for tagging data and establishing quality criteria, especially for high-stakes areas like healthcare or legal.

  • Building effective evals requires substantial effort—expect 20-40% of project resources devoted to this work.

  • Cultural differences impact subjective evals like politeness, requiring localization and careful domain definition.

  • AI product quality management is a strategic ongoing commitment, extending beyond initial development into production monitoring and iteration.
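
The LLM-as-judge and three-option points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's implementation: the prompt wording, the function names (parse_verdict, run_eval), the sample cases, and the stand-in fake_judge are all hypothetical. In practice, call_judge would wrap a call to whichever model you use.

```python
from collections import Counter
from typing import Callable, Iterable

# The three allowed verdicts, following the yes/no/maybe idea from the talk.
VERDICTS = ("yes", "no", "maybe")

# Hypothetical judge prompt; the wording is an assumption for illustration.
JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Criterion: {criterion}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Does the answer meet the criterion? "
    "Reply with exactly one word: yes, no, or maybe."
)


def parse_verdict(raw: str) -> str:
    """Normalize a raw judge reply; anything unrecognized counts as 'maybe'."""
    word = raw.strip().lower().rstrip(".!")
    return word if word in VERDICTS else "maybe"


def run_eval(
    cases: Iterable[dict],
    criterion: str,
    call_judge: Callable[[str], str],
) -> Counter:
    """Ask the judge about every case and tally the three-way verdicts."""
    counts: Counter = Counter()
    for case in cases:
        prompt = JUDGE_PROMPT.format(
            criterion=criterion,
            question=case["question"],
            answer=case["answer"],
        )
        counts[parse_verdict(call_judge(prompt))] += 1
    return counts


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline (assumption)."""
    text = prompt.lower()
    if "30 days" in text:
        return "Yes."
    if "not sure" in text:
        return "It depends"
    return "no"


# Made-up cases, used only to exercise the harness.
cases = [
    {"question": "What is the return window?",
     "answer": "You can return items within 30 days."},
    {"question": "What is the return window?",
     "answer": "I'm not sure about that."},
    {"question": "What is the return window?",
     "answer": "We sell shoes."},
]

counts = run_eval(cases, "The answer states the return window.", fake_judge)
print(dict(counts))
```

Mapping every unrecognized reply to "maybe" is one possible design choice: it keeps the tally to three buckets and surfaces unclear judge output for human review instead of silently counting it as a pass or a fail.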

Notable Quotes

"AI products almost always have both open-ended inputs and outputs, which makes testing really hard."

"You have to build a detailed definition of what is good for my system to do meaningful automated evals."

"It’s much easier to classify an answer than to generate an answer, and that’s why LLM as a judge works."

"You don’t want to give too many options like rating from one to ten because consistency gets lost between different LLM calls."

"Synthetic data is useful because it’s easier to generate more examples of something you already have than to create entirely new data."

"If you launch in the US and politeness is an issue, first try to fix it with prompts; only if that fails should you build an eval."

"Evals are really your intellectual property—they define what good looks like in your domain."

"Domain experts are crucial for tagging data because users might say ‘that’s great,’ but experts can tell it’s totally wrong."

"You should plan 20 to 40 percent of your project budget on evals—it’s a lot more work than most people expect."

"This is where UX and product strategy bring huge value—defining what good means rather than leaving it to engineers alone."
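
One way to check the consistency concern raised in the quotes above is to send the same judge prompt several times and measure how often the most common verdict comes back. The sketch below assumes a hypothetical agreement_rate helper. The canned replies are invented purely to exercise the function; they are not measurements and say nothing about how real models behave.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def agreement_rate(
    call_judge: Callable[[str], str],
    prompt: str,
    runs: int = 5,
) -> float:
    """Share of repeated judge calls that return the most common verdict."""
    verdicts = [call_judge(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / runs


# Canned replies standing in for real LLM calls (invented, not measured).
_three_way_replies = cycle(["yes", "yes", "yes", "yes", "maybe"])
_ten_point_replies = cycle(["7", "8", "6", "7", "9"])

rate_three = agreement_rate(lambda p: next(_three_way_replies), "same prompt")
rate_ten = agreement_rate(lambda p: next(_ten_point_replies), "same prompt")

print(f"three-option agreement: {rate_three:.2f}")
print(f"ten-point agreement:    {rate_ten:.2f}")
```

With a real judge, a low agreement rate on a criterion would suggest that the rubric or the answer scale needs tightening before the eval is trusted.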
