Rosenverse

Hands-on AI #1: Let’s write your first AI eval

Wednesday, October 8, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Speakers: Peter Van Dijck

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. And a great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company will walk you through writing your first eval. You will learn the basic concepts and the tools, and write an eval together. This talk is hands-on; you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, around how the best AI products today are built, and how that can help you use them more effectively yourself.

Key Insights

  • Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.

  • Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.

  • UX and product teams can and should learn evals as a practical, non-technical skill.

  • Creating your own golden dataset is essential and cannot be outsourced or fully automated.

  • Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.

  • Evaluations measure task performance, not the underlying model itself, allowing comparison across models.

  • Asking a model to output a confidence score is unreliable: the model has no memory across calls, and it does not interpret the scale consistently.

  • Biases are baked into models during training via evals used in post-training refinement.

  • LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.

  • Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
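
The three building blocks named in the first insight (a task, a golden dataset, and an evaluator) can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the sentiment task, the dataset rows, the two prompts, and the `fake_model` function are all hypothetical, and `fake_model` is a keyword-based stand-in for a real LLM call so the example runs without any API.

```python
from typing import Callable

# Golden dataset: inputs paired with known-correct outputs.
# These rows are made up for illustration; in practice you write them yourself.
GOLDEN_DATASET = [
    {"input": "I love this app, it saves me hours.", "expected": "positive"},
    {"input": "The checkout keeps crashing.", "expected": "negative"},
    {"input": "Great support, fast answers!", "expected": "positive"},
    {"input": "Terrible update, nothing works.", "expected": "negative"},
]

# Two prompt variants for the same task (both hypothetical).
PROMPT_V1 = "Classify the sentiment of this review."
PROMPT_V2 = "Classify the sentiment of this review. Answer with one word: positive or negative."


def fake_model(prompt: str, text: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your own client here)."""
    negative_words = ("crash", "terrible", "broken")
    label = "negative" if any(w in text.lower() for w in negative_words) else "positive"
    # Simulate a model that only gives a bare label when the prompt asks for one.
    if "one word" in prompt:
        return label
    return f"The sentiment is {label}."


def exact_match(output: str, expected: str) -> bool:
    """Evaluator: the output is correct only if it equals the expected label."""
    return output.strip().lower() == expected


def run_eval(
    prompt: str,
    model: Callable[[str, str], str],
    dataset: list[dict],
    evaluator: Callable[[str, str], bool],
) -> float:
    """Run the task over the golden dataset and return the fraction judged correct."""
    passed = sum(evaluator(model(prompt, row["input"]), row["expected"]) for row in dataset)
    return passed / len(dataset)


for name, prompt in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]:
    score = run_eval(prompt, fake_model, GOLDEN_DATASET, exact_match)
    print(f"prompt {name}: {score:.0%} correct")
```

In this sketch the model never changes between runs; only the prompt does. The score moves because the second prompt constrains the output format, which mirrors the point that improvements come from refining prompts and context rather than retraining. For tasks without a single right answer, the `exact_match` evaluator could be replaced by a function that asks another LLM to judge the output.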

Notable Quotes

"Evals are like a way to define what good looks like."

"The model was baked and once it’s baked, it does not learn again until they bake a new one."

"You need to be looking at the data. Nobody wants to, but that’s core work."

"Without a golden dataset, you have to build the golden dataset yourself."

"We’re not teaching the model anything; we’re improving our prompts and context."

"Confidence scores from the model are not a good idea because the model has no memory."

"Biases are baked in through the evals used during model training and post-training."

"LLMs judging other LLMs might sound crazy, but if you do it right, it works."

"Evals are a product and UX skill; learning them lets you make these systems do what you want."

"There is a large and growing capability overhang in these models we haven’t discovered yet."
