DSTC10: Dialogue System Technology Challenge 10

Automatic Evaluation and Moderation of Open-domain Dialogue Systems

Click here to download DSTC10 data.

Click here to submit.

Task Overview

  • Subtask 1: Automatic Open-domain Dialogue Evaluation

Effective automatic dialogue evaluation metrics possess the following two important properties, as indicated in (Deriu et al., 2020):

  • Correlated to human judgements - the metrics should produce evaluation scores that correlate well with human judgements (scores) across multiple dialogue evaluation aspects.
  • Explainable - the metrics should provide constructive and explicit feedback to the generative models in terms of the quality of their generated responses. For instance, if a generative model is contradicting itself, the evaluation metrics should signal such behavior to the generative models.
In this task, our goal is to seek effective automatic dialogue evaluation metrics that exhibit the above properties. These metrics can serve as a proxy to human evaluation for fast prototyping of open-domain chatbots. We have identified the following datasets to test the effectiveness of the proposed evaluation metrics:

  • DSTC6 human evaluation data (Hori et al., 2017)
  • DSTC7 human evaluation data (Galley et al., 2019)
  • Persona-Chatlog dataset (See et al., 2019)
  • USR dataset (Mehri & Eskenazi, 2020)
  • FED dataset (Mehri & Eskenazi, 2020)
During the development phase, participants need to propose different evaluation metrics and can submit their metric scores via ChatEval. The Pearson and Spearman correlations between the submitted scores and the corresponding human scores will be computed per evaluation category per dataset. The correlation results will be reported in the leaderboard on a per-category basis. Submissions will be ranked by the average correlation score across all categories of all datasets.
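The per-category correlation computation can be sketched as follows, assuming `scipy` is available; the metric and human scores below are purely illustrative placeholders, not data from any of the listed datasets:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for one evaluation category of one dataset:
# each position is one response, scored by the submitted metric
# and by human annotators (illustrative values only).
metric_scores = [0.82, 0.45, 0.67, 0.91, 0.30]
human_scores = [4.5, 2.0, 3.5, 5.0, 1.5]

# Pearson measures linear correlation; Spearman measures rank
# (monotonic) correlation. Both return (statistic, p-value).
pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

A submission's leaderboard position would then be the average of such correlation scores over every category of every dataset.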

During the final evaluation phase, we will release a hidden evaluation set and all the submitted metrics will be evaluated with the hidden evaluation set. The final ranking will be based on the performance on both the development set and the hidden test set.

Note: The above datasets may only be used for testing the proposed metrics, not for training the evaluation systems. Performance on the hidden test set carries more weight in the final ranking of the submissions.

  • Subtask 2: Moderation of Open-domain Dialogue Systems

In this task, our goal is to evaluate the capability of generative dialogue systems to go beyond detecting toxicity: the system should moderate the conversation by producing appropriate and correct answers that allow the dialogue to continue. For this task, a dataset of 100K message pairs (training and validation sets) with the following characteristics will be provided for development:

  • A toxic user sends a tweet using one or several of the most common swear words found on the Internet. The tweet is directed at one of the customer service channels.
  • A toxic user writes a tweet using one or several swear words, and the message is replied to by another user.
  • A toxic user posts a message on Reddit using one or several swear words, and the message is replied to by another user.

During the development phase, participants need to develop systems capable of generating polite, specific, and semantically appropriate responses in such scenarios.

During the evaluation phase, a hidden test set will be provided to the participants for them to generate system responses, which will be evaluated based on the objective similarity between the generated response and the original response (e.g., sentence-embedding similarity, Deep AM-FM (Zhang et al., 2021), BLEU, ROUGE). For the top 3 systems in the objective evaluation, a set of 100 responses will be manually evaluated for politeness, specificity, semantic appropriateness, and fluency.
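As a minimal stand-in for the objective similarity scoring, the sketch below computes a unigram-overlap F1 between an original reference response and a generated one. Whitespace tokenisation, lowercasing, and the F1 formulation are simplifying assumptions for illustration; the track itself uses the metrics named above (embedding similarity, Deep AM-FM, BLEU, ROUGE):

```python
from collections import Counter


def overlap_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference response and a
    generated candidate. A deliberately simple proxy for the
    track's objective metrics, not the official scorer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical moderation responses (illustrative only).
score = overlap_f1(
    "please contact our support team for help",
    "please reach out to our support team",
)
print(f"overlap F1: {score:.3f}")
```

Any sentence-level similarity function with this signature (reference, candidate) → score could be slotted in the same way, which is how the several listed metrics can share one evaluation loop.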

Schedule (Coming Soon)

Automatic Evaluation Leaderboard (Coming Soon)

Open-domain Dialogue Evaluation

The leaderboard shows the names of submissions and their corresponding Pearson & Spearman correlations for each evaluation dataset. (Explanation of abbreviations: D6 - DSTC6, D7 - DSTC7, PC - Persona-Chatlog, UP - USR-Persona, UT - USR-Topical, FT - FED-Turn, FC - FED-Conversation, AVG - Average, ρ - Pearson score, π - Spearman score)

System | D6 (ρ) | D6 (π) | D7 (ρ) | D7 (π) | PC (ρ) | PC (π) | UP (ρ) | UP (π) | UT (ρ) | UT (π) | FT (ρ) | FT (π) | FC (ρ) | FC (π) | AVG (ρ) | AVG (π)

Registration Details

You can register at https://my.chateval.org/accounts/login/. Once registered, you will be able to download the datasets and readme documents, as well as submit your results at https://chateval.org/dstc10.

Information about the tracks

Any updates will be posted at the official website:



If you have further questions regarding the data, please contact us at the following email address: dstc10-track-5@googlegroups.com


Organizers

  • Chen Zhang (National University of Singapore, Singapore)
  • Haizhou Li (National University of Singapore, Singapore)
  • João Sedoc (New York University, USA)
  • Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)
  • Rafael Banchs (Intapp Inc., USA)
  • Alexander Rudnicky (Carnegie Mellon University, USA)


[1] Deriu, J., Rodrigo, A., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., & Cieliebak, M. (2020). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 1-56.

[2] Hori, C., & Hori, T. (2017). End-to-end conversation modeling track in DSTC6. arXiv preprint arXiv:1706.07440.

[3] Galley, M., Brockett, C., Gao, X., Gao, J., & Dolan, B. (2019). Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenges Workshop.

[4] See, A., Roller, S., Kiela, D., & Weston, J. (2019, June). What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1702-1723).

[5] Mehri, S., & Eskenazi, M. (2020). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456.

[6] Mehri, S., & Eskenazi, M. (2020, July). Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 225-235).

[7] Zhang C., D’Haro L.F., Banchs R.E., Friedrichs T., Li H. (2021) Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore.