DSTC12: Dialogue System Technology Challenge 12

Track 1: Dialog System Evaluation: Dimensionality, Language, Culture and Safety


Registration for DSTC12 Track 1 is now available: click here to register.


Track Overview


For this track, we propose two evaluation tasks for open-domain dialogue systems:

  1. Dialogue-level and Multi-dimensional Automatic Evaluation Metrics. Participants in this task are expected to design automatic metrics that evaluate conversations at the dialogue level (not solely turn-level) and over multiple evaluation dimensions.
  2. Multilingual and Multicultural Safety Detection. Participants in this task are expected to develop safety classifiers that detect whether a response is unsafe.
For both tasks, participants are restricted to open-access LMs and LLMs with fewer than 13B parameters.


Task 1: Dialogue-level and Multi-dimensional Automatic Evaluation Metrics



  • Overview

Previous challenges and works have focused primarily on turn-level dialogue evaluation (Zhang et al., 2022; Rodríguez-Cantelar et al., 2023; Yeh et al., 2021), leaving dialogue-level evaluation with automatic metrics underexplored. As LLMs advance, aspects of conversations beyond coherence, fluency, etc. should also be studied. Additionally, these aspects should provide a more fine-grained analysis of the quality of the whole conversation.

  • Goals

Participants will develop automatic evaluation metrics for open-domain dialogue. The system should be able to evaluate up to 10 different dimensions, including previously common ones (e.g., coherence, engagingness, and naturalness) together with new ones introduced in this challenge, such as empathy and error handling.

  • Evaluation

Based on the dialogue-level scores generated by the proposed evaluation metric for the 10 selected dimensions on the test set, we will compute Spearman’s correlation between human annotations (from MTurk workers or lab members) and the metric-generated scores for each dimension. The correlation coefficients are then averaged across dimensions to produce the final ranking of the submitted automatic evaluation metrics.
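
For concreteness, a minimal sketch of this ranking computation is given below. It assumes aligned lists of dialogue-level human and metric scores per dimension; the dimension names and data layout are illustrative assumptions, not the official scoring script.

    # Illustrative sketch of the Task 1 ranking computation (not the official scoring script).
    import numpy as np
    from scipy.stats import spearmanr

    # Subset of dimension names for illustration; the challenge uses 10 dimensions.
    DIMENSIONS = ["coherence", "engagingness", "naturalness", "empathy", "error_handling"]

    def rank_score(human: dict, metric: dict) -> float:
        """Average Spearman correlation across dimensions.

        human[dim] and metric[dim] are lists of dialogue-level scores for the
        same dialogues, in the same order (assumed layout).
        """
        correlations = []
        for dim in DIMENSIONS:
            rho, _ = spearmanr(human[dim], metric[dim])
            correlations.append(rho)
        return float(np.mean(correlations))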

  • Baseline

Additional information soon.

  • Provided Datasets

Additional information soon.


Task 2: Multilingual and Multicultural Safety Detection



  • Overview

Users are increasingly challenging current LLMs to generate harmful and/or unsafe answers. In addition, even without adversarial probing, generated responses may contain unhelpful and/or harmful content. Therefore, the automatic detection of this content is important in the deployment of these systems. Unfortunately, safety evaluation frameworks frequently narrow the notion of safety to strict definitions of bias and toxicity (Shuster et al., 2022; Ouyang et al., 2022), discarding other safety aspects. This task expands on earlier safety detection tasks, introducing other risk aspects such as unqualified and harmful advice, manipulation, and illegal activities.

Safety considerations in prior and current-generation chatbots are limited to North American notions of safety and harm. In Task 2, we will expand safety datasets to a diverse set of languages and cultures, with human annotations performed by representatives of those cultures. Beyond facilitating the study of safety across cultures, this also allows for evaluating the robustness of safety classifiers across languages and cultures. We will target at least 4 different languages: English, Chinese, Portuguese, and Spanish.

  • Goals

Participants will develop automatic safety classifiers of responses generated by LLMs across different languages and cultures. The safety detectors should be able to generalize across different languages and cultures.

  • Evaluation

For the overall submission ranking, we will use the ROC-AUC score to evaluate the performance of the safety detectors developed by the participants on a multilingual hidden test set. The ROC-AUC score is computed at the response level. Additionally, for a more fine-grained analysis of the participants’ performance, we will also report per-language/culture and per-safety-category ROC-AUC scores.
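
As an illustration of how this evaluation could be computed, the sketch below uses scikit-learn’s roc_auc_score; the per-example field names and the per-language breakdown are assumptions about the data layout, not the official evaluation script.

    # Illustrative sketch of the Task 2 evaluation (not the official scoring script).
    from collections import defaultdict
    from sklearn.metrics import roc_auc_score

    def evaluate(examples):
        """examples: list of dicts with keys 'label' (1 = unsafe), 'score' (classifier
        confidence that the response is unsafe), and 'language' (assumed layout)."""
        overall = roc_auc_score(
            [ex["label"] for ex in examples],
            [ex["score"] for ex in examples],
        )  # response-level score used for the overall ranking

        # Fine-grained, per-language ROC-AUC
        by_language = defaultdict(list)
        for ex in examples:
            by_language[ex["language"]].append(ex)
        per_language = {
            lang: roc_auc_score([e["label"] for e in exs], [e["score"] for e in exs])
            for lang, exs in by_language.items()
        }
        return overall, per_language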

  • Baseline

As a baseline, Llama-Guard-3-1B will be used. This model is a Llama-3.2-1B model fine-tuned for content safety classification.
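
A minimal sketch of querying this baseline with the Hugging Face transformers library is shown below; the checkpoint name, prompt format, and output parsing are assumptions that should be checked against the official model card.

    # Minimal sketch of running the Llama-Guard-3-1B baseline (assumed usage;
    # check the model card for the exact chat-template format).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "meta-llama/Llama-Guard-3-1B"  # assumed Hugging Face checkpoint name

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # Conversation whose last (assistant) response is to be classified.
    conversation = [
        {"role": "user", "content": "My friend keeps ignoring my messages. What should I do?"},
        {"role": "assistant", "content": "You could try talking to them in person and calmly explaining how you feel."},
    ]
    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=20, do_sample=False)

    # The model generates a verdict such as "safe" or "unsafe" plus a hazard category.
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(verdict.strip())

Note that the ROC-AUC ranking requires a continuous score rather than a hard label, so a submission building on this baseline would also need to extract, for example, the probability assigned to the "unsafe" token.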

  • Provided Datasets

Additional information soon.


Registration Details


To become an official DSTC12 Track 1 participant, you must register using this Form. Once registered, you will be able to download the datasets and readme documents, as well as submit your results at https://chateval.org/dstc12.

Only one team per laboratory or research group is allowed. All members of a team must be covered by a single registration; that is, the team leader must register the entire team by providing their e-mail addresses in addition to their own.

Updates and information about the tracks will be posted on the DSTC12 official website and announced on the DSTC Mailing List.


Schedule


  • Training/Validation data release: Jan 3, 2025
  • Test data release: Mar 21, 2025
  • Entry submission deadline: Mar 28, 2025 (23:59 Anywhere on Earth (AoE), UTC-12)
  • Final result announcement: Apr 7, 2025
  • Paper submission: June 2025
  • Workshop: September 2025

Organizers


  • John Mendonça (INESC-ID/IST, Portugal) - john.mendonca@inesc-id.pt
  • Lining Zhang (New York University, USA)
  • Alon Lavie (Carnegie Mellon University, USA)
  • Isabel Trancoso (INESC-ID/IST, Portugal)
  • João Sedoc (New York University, USA)
  • Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)

Contact


For queries related to the challenge, contact the organizers via the DSTC Mailing List.


Acknowledgement


This research was supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Responsible.AI), and by Portuguese national funds through Fundação para a Ciência e Tecnologia (FCT) with references PRT/BD/152198/2021 and DOI:10.54499/UIDB/50021/2020.

This research project is supported by the Comunidad de Madrid through the call Research Grants for Young Investigators from Universidad Politécnica de Madrid (GENIUS:APOYO-JOVENES-21-TAXTYC-32-K61X37).

This work is supported by the European Commission through Project ASTOUND (101071191, HORIZON-EIC-2021-PATHFINDERCHALLENGES-01), and by project BEWORD (PID2021-126061OB-C43) funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union”.

We also want to give thanks to MS Azure services (especially to Irving Kwong) for their sponsorship to continue processing new datasets that could be interesting for the dialogue community.

This research project is supported by the NYU ChatEval Team led by João Sedoc.



FAQ


How much does it cost to participate in this Track?

This Track is currently free for everyone.


References


  • Chen Zhang, João Sedoc, Luis Fernando D'Haro, Rafael Banchs, and Alexander Rudnicky. "Automatic evaluation and moderation of open-domain dialogue systems." arXiv preprint arXiv:2111.02110 (2021).
  • Mario Rodríguez-Cantelar, Chen Zhang, Chengguang Tang, Ke Shi, Sarik Ghazarian, João Sedoc, Luis Fernando D'Haro, and Alexander Rudnicky. "Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4." arXiv preprint arXiv:2306.12794 (2023).
  • Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D'Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, et al. "Overview of the ninth dialog system technology challenge: Dstc9." arXiv preprint arXiv:2011.06486 (2020).
  • Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, and Dilek Hakkani-Tur. 2022. What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4194–4204, Dublin, Ireland. Association for Computational Linguistics.
  • Yi-Ting Yeh, Maxine Eskenazi, and Shikib Mehri. 2021. A Comprehensive Assessment of Dialog Evaluation Metrics. In The First Workshop on Evaluations and Assessments of Neural Conversation Systems, pages 15–33, Online. Association for Computational Linguistics.
  • Jing Xu, Da Ju, Joshua Lane, Mojtaba Komeili, Eric Michael Smith, Megan Ung, Morteza Behrooz, et al. "Improving Open Language Models by Learning from Organic Interactions." arXiv preprint arXiv:2306.04707 (2023).
  • Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, W.K.F. Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston. "BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage." arXiv preprint arXiv:2208.03188 (2022).
  • Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." arXiv preprint arXiv:2306.05685 (2023).
  • Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Silvio Savarese, and Caiming Xiong. "DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI." arXiv preprint arXiv:2307.10172 (2023).
  • Ekaterina Svikhnushina, Anastasiia Filippova, and Pearl Pu. 2022. iEval: Interactive Evaluation Framework for Open-Domain Empathetic Chatbots. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 419–431, Edinburgh, UK. Association for Computational Linguistics.
  • Sarah E. Finch and Jinho D. Choi. 2020. Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 236–245, 1st virtual meeting. Association for Computational Linguistics.
  • Damilola Omitaomu, Shabnam Tafreshi, Tingting Liu, Sven Buechel, Chris Callison-Burch, Johannes C. Eichstaedt, Lyle Ungar, and João Sedoc. "Empathic Conversations: A Multi-level Dataset of Contextualized Conversations." arXiv preprint arXiv:2205.12698 (2022).
  • Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
  • Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, and Jack Clark. "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned." arXiv preprint arXiv:2209.07858 (2022).
  • Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2204.05862 (2022).