Hands-on AI #1: Let’s write your first AI eval
This video is featured in the Evals + Claude playlist.
Summary
If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals. A great way to learn is with a hands-on example. In this talk, Peter Van Dijck of the helpful intelligence company walks you through writing your first eval. You will learn the basic concepts and tools, and write an eval together. This talk is hands-on: you can follow along, and there will be plenty of time for questions. You will go away with an understanding of the basic building blocks of AI evals, and with the confidence that you know how to write one. More importantly, you’ll build some intuition, some product sense, about how the best AI products today are built, and how that can help you use them more effectively yourself.
Key Insights
• Evals consist of a task, a golden dataset with known correct outputs, and an evaluator that measures correctness.
• Manual AI prompt testing is slow and inconsistent; automated evals accelerate and scale evaluation.
• UX and product teams can and should learn evals as a practical, non-technical skill.
• Creating your own golden dataset is essential and cannot be outsourced or fully automated.
• Models are fixed once trained; improvements happen by refining prompts and context design, not retraining the model.
• Evaluations measure task performance, not the underlying model itself, allowing comparison across models.
• Outputting a confidence score from models is unreliable due to lack of internal memory and inconsistent scale interpretation.
• Biases are baked into models during training via evals used in post-training refinement.
• LLMs can be used to judge other LLM outputs to evaluate tasks with non-binary answers.
• Effective eval work requires collaboration across data analysts, engineers, subject matter experts, and UX/product teams.
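The three building blocks named above — a task, a golden dataset with known correct outputs, and an evaluator — can be sketched in a few lines of Python. This is a minimal illustration, not the talk's actual code: `ask_model` is a hypothetical stand-in for a real LLM API call, replaced here with a trivial rule so the sketch runs on its own.

```python
def ask_model(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM call.
    # A trivial rule-based stand-in so the sketch is self-contained.
    return "positive" if "love" in prompt.lower() else "negative"

# The task: a prompt template describing what the model should do.
TASK_PROMPT = "Classify the sentiment of this review as 'positive' or 'negative': {review}"

# The golden dataset: inputs paired with known-correct outputs.
golden_dataset = [
    {"review": "I love this product, it works perfectly.", "expected": "positive"},
    {"review": "Broke after two days. Waste of money.", "expected": "negative"},
]

def run_eval(dataset) -> float:
    """The evaluator: score each model output against the golden answer."""
    correct = 0
    for example in dataset:
        output = ask_model(TASK_PROMPT.format(review=example["review"])).strip().lower()
        if output == example["expected"]:
            correct += 1
    return correct / len(dataset)

print(f"Accuracy: {run_eval(golden_dataset):.0%}")
```

For tasks without a single right answer, the exact-match check inside `run_eval` is where an LLM-as-judge would go: instead of comparing strings, you would ask a second model whether the output meets your definition of good.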
Notable Quotes
"Evals are like a way to define what good looks like."
"The model was baked and once it’s baked, it does not learn again until they bake a new one."
"You need to be looking at the data. Nobody wants to, but that’s core work."
"Without a golden dataset, you have to build the golden dataset yourself."
"We’re not teaching the model anything; we’re improving our prompts and context."
"Confidence scores from the model are not a good idea because the model has no memory."
"Biases are baked in through the evals used during model training and post-training."
"LLMs judging other LLMs might sound crazy, but if you do it right, it works."
"Evals are a product and UX skill; learning them lets you make these systems do what you want."
"There is a large and growing capability overhang in these models we haven’t discovered yet."