Hands-on AI #2: Understanding evals: LLM as a Judge

Wednesday, October 15, 2025 • Rosenfeld Community

This video is featured in the Evals + Claude playlist.

Peter Van Dijck

UX and AI builder, CEO Sputnik Legal

Summary

If you’re a product manager, UX researcher, or any kind of designer involved in creating an AI product or feature, you need to understand evals, and a great way to learn is with a hands-on example. In this second talk in the series, Peter Van Dijck of the helpful intelligence company will show you how to create an eval for an AI product using an LLM as a judge, that is, using one Large Language Model to evaluate the output of another. We’ll look at how that works, and also dig into why it works at all. Are we creating problems for ourselves when we let an LLM judge itself? This talk is hands-on, and there will be plenty of time for questions. You will come away understanding when and how to use an LLM as a judge, with better product sense for how the best AI products are built today and how that knowledge can help you use them more effectively yourself.

Key Insights

• Evals are a foundational feedback loop defining what 'good' means for AI products, helping to measure and improve systems continuously.
• Evaluating fuzzy, subjective AI outputs requires innovative approaches such as using LLMs as judges to score results.
• Binary (yes/no) scoring is more reliable than rating scales with ranges because LLMs lack internal memory and consistency.
• Starting evals early (week one of a project) drastically improves AI product outcomes, but many teams delay due to perceived complexity.
• High-risk or important tasks should be prioritized for evals instead of attempting broad coverage.
• Assigning a dedicated owner or 'benevolent dictator' for evals who works closely with domain experts accelerates feedback and quality.
• Creating a written constitution of principles helps concretize AI behavior goals and guides prompt and model training.
• Most current eval tooling is too technical, slowing iteration cycles and making expert involvement inefficient.
• Custom feedback interfaces tailored to expert users significantly speed up evaluating AI outputs in domains like healthcare and law.
• Diverse perspectives from UX, product, strategy, and domain experts are critical in defining and refining what 'good' means in AI systems.
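
To make the binary-scoring insight concrete, here is a minimal Python sketch of an LLM-as-judge eval. It is an illustration under stated assumptions, not the method shown in the talk: the criteria, the prompt wording, and every function name are hypothetical, and `fake_judge` is a stub standing in for a call to a real model so that the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical rubric: every criterion is phrased as a yes/no question,
# so the judge never has to pick a point on a 1-5 scale.
CRITERIA = [
    "Does the answer directly address the user's question?",
    "Does the answer avoid giving advice it was told not to give?",
]


def build_judge_prompt(question: str, answer: str, criterion: str) -> str:
    """Assemble the text sent to the judge model for one criterion."""
    return (
        "You are grading the output of an AI assistant.\n"
        f"User question: {question}\n"
        f"Assistant answer: {answer}\n"
        f"Criterion: {criterion}\n"
        "Reply with exactly one word: YES or NO."
    )


def parse_verdict(raw: str) -> bool:
    """Turn the judge's reply into a boolean; reject anything that is not YES or NO."""
    word = raw.strip().upper().rstrip(".")
    if word not in ("YES", "NO"):
        raise ValueError(f"Judge reply was not YES or NO: {raw!r}")
    return word == "YES"


def run_eval(question: str, answer: str, judge: Callable[[str], str]) -> Dict[str, bool]:
    """Ask the judge one yes/no question per criterion and collect the verdicts."""
    return {
        criterion: parse_verdict(judge(build_judge_prompt(question, answer, criterion)))
        for criterion in CRITERIA
    }


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of criteria that the answer passed."""
    return sum(results.values()) / len(results)


def fake_judge(prompt: str) -> str:
    """Stub judge (assumption): a real eval would send the prompt to an LLM here."""
    return "YES" if "directly address" in prompt else "NO"


results = run_eval(
    question="Can I break my lease early?",
    answer="You should review the termination clause in your lease.",
    judge=fake_judge,
)
print(results, pass_rate(results))
```

Because each verdict is a plain yes or no, results from separate runs can be compared and counted directly. Swapping `fake_judge` for a function that calls an actual model is the only change needed to try this against real outputs.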

Notable Quotes

"Evals are everywhere, right? Everybody's talking about evals. It is like one of the key things in developing useful AI products."

"You want to ask an LLM to evaluate the fuzzy stuff because there’s no black and white output."

"LLMs don’t have memory, so rating on a scale from one to five is pretty random. Better to have yes or no answers."

"One of the biggest problems in AI building is evolving your prompts and having a fast feedback loop."

"By starting to categorize risk in detail, you naturally lead to better prompts and better evals."

"A constitution is a very good exercise: write down your system’s principles and values to help guide its behavior."

"Use custom systems for experts to quickly review and rate outputs, making feedback cycles much faster."

"Evals define a shared definition of good with tests to measure it, and that is the secret sauce for building great AI products."

"Model companies are students in a classroom wanting good points—they’re happy to run external expert evals to improve."

"The more I work with evals, the more I think UX and product people need to be involved because of the need for diverse perspectives."
