Hands-on AI #1: Let’s write your first AI eval
This video is featured in the Evals + Claude playlist.
Summary
If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company walks you through writing your first eval. You will learn the basic concepts and tools, and write an eval together. This talk is hands-on: you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, about how the best AI products today are built, and how that can help you use them more effectively yourself.
Key Insights
• Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.
• Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.
• UX and product teams can and should learn evals as a practical, non-technical skill.
• Creating your own golden dataset is essential and cannot be outsourced or fully automated.
• Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.
• Evaluations measure task performance, not the underlying model itself, allowing comparison across models.
• Outputting a confidence score from models is unreliable due to lack of internal memory and inconsistent scale interpretation.
• Biases are baked into models during training via evals used in post-training refinement.
• LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.
• Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
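The three building blocks named above — a task, a golden dataset with known correct outputs, and an evaluator — can be sketched in a few lines of Python. This is a minimal illustration, not the talk's actual code: `ask_model` is a hypothetical stand-in for a real LLM API call, replaced here with a trivial rule so the sketch runs on its own.

```python
def ask_model(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM call.
    # A trivial rule-based stand-in so the sketch is self-contained.
    return "positive" if "love" in prompt.lower() else "negative"

# The task: a prompt template describing what the model should do.
TASK_PROMPT = "Classify the sentiment of this review as 'positive' or 'negative': {review}"

# The golden dataset: inputs paired with known-correct outputs.
golden_dataset = [
    {"review": "I love this product, it works perfectly.", "expected": "positive"},
    {"review": "Broke after two days. Waste of money.", "expected": "negative"},
]

def run_eval(dataset) -> float:
    """The evaluator: score each model output against the golden answer."""
    correct = 0
    for example in dataset:
        output = ask_model(TASK_PROMPT.format(review=example["review"])).strip().lower()
        if output == example["expected"]:
            correct += 1
    return correct / len(dataset)

print(f"Accuracy: {run_eval(golden_dataset):.0%}")
```

For tasks without a single right answer, the exact-match check inside `run_eval` is where an LLM-as-judge would go: instead of comparing strings, you would ask a second model whether the output meets your definition of good.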
Notable Quotes
"Evals are like a way to define what good looks like."
"The model was baked and once it’s baked, it does not learn again until they bake a new one."
"You need to be looking at the data. Nobody wants to, but that’s core work."
"Without a golden dataset, you have to build the golden dataset yourself."
"We’re not teaching the model anything; we’re improving our prompts and context."
"Confidence scores from the model are not a good idea because the model has no memory."
"Biases are baked in through the evals used during model training and post-training."
"LLMs judging other LLMs might sound crazy, but if you do it right, it works."
"Evals are a product and UX skill; learning them lets you make these systems do what you want."
"There is a large and growing capability overhang in these models we haven’t discovered yet."