DSTC12: Dialogue System Technology Challenge 12

Track 1: Dialog System Evaluation: Dimensionality, Language, Culture and Safety

(02/05) Official Leaderboard released!

(19/04) Task1 Codabench website and test data now online. Click here for access!

(19/04) Task2 Codabench website and test data now online. Click here for access!

(19/03) Task Schedule has been updated to reflect data release delays. Submission deadline is now Apr 28 (23:59 Anywhere on Earth (AoE), UTC-12).

(10/03) Task1 development data and baseline results are now available.

(03/01) Task2 development data and baseline results are now available.

(19/12) Click here to register for DSTC12.T1.

Track Overview

For this track, we propose two evaluation tasks for open-domain dialogue systems:

Dialogue-level and Multi-dimensional Automatic Evaluation Metrics. Participants in this task are expected to design automatic metrics that evaluate conversations at the dialogue level (not solely turn-level) and over multiple evaluation dimensions.
Multilingual and Multicultural Safety Detection. Participants in this task are expected to develop safety classifiers that detect whether a response is unsafe.

For both tasks, participants are constrained to use open-access LMs and LLMs of size lower than 13B paramters.

Task 1: Dialogue-level and Multi-dimensional Automatic Evaluation Metrics

Overview

Previous challenges and works focus more on turn-level dialogue evaluation (Zhang et al., 2022; Rodríguez-Cantelar et al., 2023; Yeh et al., 2021) and lack further investigation of dialogue-level evaluation through automatic metrics. As LLMs advance, aspects of conversations beyond coherence, fluency, etc. should also be studied. Addiitonally, these aspects should provide a more fine-grained analysis of the levels of quality for the whole conversation.

Goals

Participants will develop automatic evaluation metrics for open-domain dialogue. The system should be able to evaluate up to 10 different dimensions including previous common ones (i.e. coherence, engageness, or naturalness), together with new ones like empathy and error handling in this challenge.

Evaluation

Based on the dialogue-level evaluation scores generated by the proposed evaluation metric for 10 selected dimensions on the testing dataset, we will calculate Spearman’s correlation between human annotations (from MTurk workers or lab members) and metric-generated scores among dimensions. An averaged correlation coefficient is calculated to rank the automatic evaluation metrics submitted in the end.

Provided Datasets

Logo_EC Task 1 development data is available for download here. Note that the datasets are only available for registered participants.

Baseline

As a baseline, Llama-3.1-8B-Instruct is used. This model is a post-trained version of LLama-3.1-8B using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. We prompt the model with a dialogue-evaluation instruction.

We report the Pearson score for the Llama-Guard-3-1B model on the development subset made available for this task. The results are as follows:

Dimension	Pearson
Empathy	0.0829
Trust	0.2546
Skill	0.2689
Talent	0.144
Capability	0.2467
Relevance	-0.1319
Non-Repetition	-0.1883
Proactivity	-0.5387
Curiosity	-0.3892
Overall	0.4234
Average	0.01724
Average (absolute)	0.26686

A minimalistic reproducible code for the baseline based on VLLM is available here.

Leaderboard

Submission Rank	Team	Codabench ID	Average (absolute)	Average	Empathy	Trust	Skill	Talent	Capability	Relevance	Non-repetition	Proactivity	Curiosity	Overall
1	baseline		0.1681	0.1217	0.0647	-0.1117	-0.0955	0.0962	0.0677	0.2337	0.3851	-0.0248	0.2253	0.3766
2	ORALIS	278269	0.1503	0.0605	-0.0788	0.0067	-0.2246	0.0529	0.1276	0.0808	0.1071	-0.1457	0.3687	0.3099
3	ORALIS	278268	0.1471	0.1265	0.1720	0.2004	0.0660	0.2409	0.2439	-0.1032	0.1432	0.0821	0.0913	0.1283
4	ORALIS	727820	0.1408	0.0595	-0.0788	0.0067	-0.2246	0.2409	0.1276	-0.1032	0.1432	0.0821	0.0913	0.3099
5	ORALIS	278271	0.1374	0.0105	-0.1658	0.1254	-0.0183	0.2240	0.1157	-0.2831	-0.0018	0.1962	0.0784	-0.1652
6	qqprun	278272	0.1360	0.0904	-0.1183	0.2407	-0.1094	0.0873	0.1717	0.0738	0.2220	0.0185	0.0064	0.3117
7	qqprun	278273	0.0730	0.0649	-0.0035	0.0594	0.0046	0.0948	0.1101	0.1439	-0.0371	0.0758	0.0221	0.1786

Task 2: Multilingual and Multicultural Safety Detection

Overview

Users are increasingly challenging current LLMs to generate harmful and/or unsafe answers. In addition, even without adversarial probing, generated responses may contain unhelpful and/or harmful content. Therefore, the automatic detection of this content is important in the deployment of these systems. Unfortunately, safety evaluation frameworks frequently narrow the notion of safety to strict definitions of bias and toxicity (Shuster et al., 2022; Ouyang et al., 2022), discarding other safety aspects. This task expands on earlier safety detection tasks, introducing other risk aspects such as unqualified and harmful advice, manipulation, and illegal activities.

Safety considerations in prior and current generation chatbots are limited to North American notions of safety and harm. In Task 2, we will expand safety datasets to a diverse set of languages and cultures, with human annotations performed by representatives of said cultures. Beyond facilitating the study of safety across cultures, it also allows for the evaluation of the robustness of safety classifiers in terms of culture and language. In this Task, we will target at least 4 different languages in the test set: English, Chinese, Portuguese, and Spanish. Additional languages may be added (within the language set provided for development).

Goals

Participants will develop automatic safety classifiers of responses generated by LLMs across different languages and cultures. The safety detectors should be able to generalize across different languages and cultures.

Evaluation

For the overall submission rankings, we will use the ROC-AUC score to evaluate the performance of the safety detectors developed by the participants on a multilingual hidden test set. The ROC-AUC score is computed at the response level. Additionally, for a more fine-grained analysis of the participants’ performance, we report their language/culture-wise ROC-AUC score.

Provided Datasets

For Task 2, we have identified and processed three datasets for development:

Bot Adversarial Dialogue (Xu et al., NAACL 2021)

Dialogue Safety (Dinan et al., EMNLP-IJCNLP 2019)

ProsocialDialog (Kim et al., EMNLP 2022)

These datasets were selected for their conversational nature and existing safety annotations. However, participants are free to use any other open-access resources during the development of their systems.

All datasets contain train/val/test splits. We automatically translate these datasets in to several target languages (Arabic, German, Spanish, French, Japanese, Portuguese and Chinese) using gpt-4o-mini, for which we report an average translation quality score of 0.7153 (using wmt23-cometkiwi-da-xl).

Logo_EC

Datasets are avilable to download here. Note that the datasets are only available for registered participants.

Baseline

As a baseline, Llama-Guard-3-1B will be used. This model is a fine-tuned Llama-3.2-1B pretrained model for content safety classification. We report baseline results below.

We report the ROC-AUC score for the Llama-Guard-3-1B model on the development subsets made available for this task. The results are as follows:

Language	ROC-AUC
Arabic (ar)	0.6432
German (de)	0.6616
English (en)	0.7137
Spanish (es)	0.6557
French (fr)	0.6642
Japanese (ja)	0.6537
Portuguese (pt)	0.6687
Chinese (zh)	0.6765
Average (all)	0.6672

Per dataset results are available on each dataset page.

A minimalistic reproducible code for the baseline based on VLLM is available here.

Leaderboard

Submission Rank	Team	Codabench ID	Average	Cultural	Multilingual	AR	DE	EN	ES	FR	JA	PT	ZH
1	VAI-CORE	273546	0.9046	0.4227	0.9648	0.9623	0.9700	0.9753	0.9695	0.9697	0.9607	0.9455	0.9656
2	VAI-CORE	277851	0.8885	0.4831	0.9392	0.9599	0.9723	0.9736	0.9685	0.9677	0.9531	0.7499	0.9684
3	qqpprun	273763	0.8078	0.4571	0.8517	0.8492	0.8714	0.8781	0.8688	0.8563	0.8147	0.8691	0.8057
4	qqpprun	273499	0.7780	0.3476	0.8318	0.8234	0.8505	0.8534	0.8364	0.8246	0.8023	0.8328	0.8311
5	baseline		0.7767	0.5126	0.8097	0.8056	0.8067	0.8179	0.8034	0.8178	0.7957	0.8162	0.8142
6	qqpprun	273497	0.7468	0.4658	0.7819	0.7782	0.8015	0.7892	0.7941	0.7773	0.7557	0.7877	0.7716
7	qqpprun	273824	0.6667	0.4830	0.6896	0.6917	0.7000	0.6854	0.6979	0.6888	0.6804	0.6843	0.6884

Citation

If you are using Task data please use the following citation:

@inproceedings{mendonca2025dstc12t1,
                author = "John Mendonça and Lining Zhang and Rahul Mallidi and Luis Fernando D'Haro and João Sedoc",
                title = "Overview of Dialog System Evaluation Track: Dimensionality, Language, Culture and Safety at DSTC 12",
                booktitle = "DSTC12: The Twelfth Dialog System Technology Challenge",
                series = "26th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)",
                year = 2025,
                month = "September",
                address = "Avignon, France"
            }

Registration Details

To become an official DSTC12 Track 1 participant, you must be registered using this Form. Once registered, you will be able to download the datasets and readme documents as well as submit your results.

There must be only one team per laboratory or research group. The members of the same team must be under a single registration, that is, the team leader must register his entire team by giving their e-mail addresses in addition to his own.

Any updates and information about the tracks will be posted on the DSTC12 official website, or check the DSTC Mailing List.

Schedule

Test data release: Apr 18
Entry submission deadline: Apr 28 (23:59 Anywhere on Earth (AoE), UTC-12)
Final result announcement: May 2
Paper submission: Jun 1
Workshop (SIGDIAL) : Aug 28

Organizers

John Mendonça (INESC-ID/IST, Portugal) - john.mendonca@inesc-id.pt
Lining Zhang (New York University, USA)
Alon Lavie (Carnegie Mellon University, USA)
Isabel Trancoso (INESC-ID/IST, Portugal)
João Sedoc (New York University, USA)
Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)

Contact

For queries related to the challenge contact the organizers via the DSTC Mailing List.

Acknowledgement

This research was supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Responsible.AI), by Portuguese national funds through Fundação para a Ciência e Tecnologia (FCT) with references PRT/BD/152198/2021 and DOI:10.54499/UIDB/50021/2020.

This work is supported by the European Commission through Project ASTOUND (101071191 — HORIZON EIC-2021- PATHFINDERCHALLENGES-01), and by project BEWORD (PID2021-126061OB-C43) funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union”.

We also want to give thanks to MS Azure services (especially to Irving Kwong) for their sponsorship to continue processing new datasets that could be interesting for the dialogue community.

This research project is supported by the NYU ChatEval Team led by João Sedoc.

Logo_EC Logo_PRR

FAQ

How much does participate in this Track cost?

This Track is currently free for everyone.

References

Chen Zhang, João Sedoc, Luis Fernando D'Haro, Rafael Banchs, and Alexander Rudnicky. "Automatic evaluation and moderation of open-domain dialogue systems." arXiv preprint arXiv:2111.02110 (2021).
Mario Rodríguez-Cantelar, Chen Zhang, Chengguang Tang, Ke Shi, Sarik Ghazarian, João Sedoc, Luis Fernando D'Haro, and Alexander Rudnicky. "Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4." arXiv preprint arXiv:2306.12794 (2023).
Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D'Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, et al. "Overview of the ninth dialog system technology challenge: Dstc9." arXiv preprint arXiv:2011.06486 (2020).
Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, and Dilek Hakkani-Tur. 2022. What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4194–4204, Dublin, Ireland. Association for Computational Linguistics.
Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. 2021. A Comprehensive Assessment of Dialog Evaluation Metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 15–33, Online. Association for Computational Linguistics.
Jing Xu, Da Ju, Joshua Lane, Mojtaba Komeili, Eric Michael Smith, Megan Ung, Morteza Behrooz, et al. "Improving Open Language Models by Learning from Organic Interactions." arXiv preprint arXiv:2306.04707 (2023).
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, W.K.F. Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur and Jason Weston. “BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage.” ArXiv abs/2208.03188 (2022): n. Pag.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Silvio Savarese, and Caiming Xiong. "DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI." arXiv preprint arXiv:2307.10172 (2023).
Ekaterina Svikhnushina, Anastasiia Filippova, and Pearl Pu. 2022. iEval: Interactive Evaluation Framework for Open-Domain Empathetic Chatbots. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 419–431, Edinburgh, UK. Association for Computational Linguistics.
Sarah E. Finch and Jinho D. Choi. 2020. Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 236–245, 1st virtual meeting. Association for Computational Linguistics.
Damilola Omitaomu, Shabnam Tafreshi, Tingting Liu, Sven Buechel, Chris Callison-Burch, Johannes C. Eichstaedt, Lyle Ungar and João Sedoc. “Empathic Conversations: A Multi-level Dataset of Contextualized Conversations.” ArXiv abs/2205.12698 (2022): n. pag.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, & Ryan Lowe. “Training language models to follow instructions with human feedback.”arXiv preprint arXiv:2203.02155 (2022).
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, & Jack Clark.. “Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned.” arXiv preprint arXiv:2204.05862 (2022).
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, & Jared Kaplan. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv preprint arXiv:2204.05862 (2022).
Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4537–4546, Hong Kong, China. Association for Computational Linguistics.
Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. Bot-Adversarial Dialogue for Safe Conversational Agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.
Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, and Maarten Sap. 2022. ProsocialDialog: A Prosocial Backbone for Conversational Agents. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4005–4029, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.