DSTC10: Dialogue System Technology Challenge 10

Automatic Evaluation and Moderation of Open-domain Dialogue Systems



Task Overview


  • Subtask 1: Automatic Open-domain Dialogue Evaluation

Effective automatic dialogue evaluation metrics possess the following two important properties, as indicated in (Deriu et al., 2020):

  • Correlated with human judgements - the metrics should produce evaluation scores that correlate well with human judgements (scores) across multiple dialogue evaluation aspects.
  • Explainable - the metrics should provide constructive and explicit feedback to the generative models on the quality of their generated responses. For instance, if a generative model contradicts itself, the evaluation metric should signal this behavior to the model.
In this task, we seek effective automatic dialogue evaluation metrics that exhibit the above properties. These metrics can serve as a proxy for human evaluation, enabling fast prototyping of open-domain chatbots. We have identified the following datasets to test the effectiveness of the proposed evaluation metrics:

  • DSTC6-Eval (D6) (Hori & Hori, 2017)
  • DSTC7-Eval (D7) (Galley et al., 2019)
  • Persona-Chatlog (PC) (See et al., 2019)
  • PersonaChat-USR (UP) (Mehri & Eskenazi, 2020a)
  • TopicalChat-USR (TP) (Mehri & Eskenazi, 2020a)
  • FED-Turn (FT) (Mehri & Eskenazi, 2020b)
  • FED-Conversation (FC) (Mehri & Eskenazi, 2020b)
  • DailyDialog-Eval (GD) (Gupta et al., 2019)
  • DailyDialog-Eval (ZD) (Zhao et al., 2020)
  • PersonaChat-Eval (ZP) (Zhao et al., 2020)
  • DailyDialog-Eval (ED) (Huang et al., 2020)
  • Empathetic-Eval (EE) (Huang et al., 2020)
  • ConvAI2-Eval (EC) (Huang et al., 2020)
  • HUMOD (HU) (Merdivan et al., 2020)
During the development phase, participants propose evaluation metrics and can submit their metric scores via ChatEval. The Spearman correlation (ρ) between the submitted scores and the corresponding human scores will be computed per evaluation category per dataset. The correlation results will be reported on the leaderboard on a per-category basis. Submissions will be ranked by the average correlation score across all categories of all datasets.
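
The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' scoring script: the function names, the dictionary layout keyed by (dataset, category), the category labels, and the toy scores are all assumptions made for the example. Tied scores receive average ranks, which is a common convention for Spearman correlation.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


def leaderboard_average(metric, human):
    """Both arguments map (dataset, category) -> list of scores (assumed layout)."""
    per_category = {key: spearman(metric[key], human[key]) for key in human}
    return per_category, mean(list(per_category.values()))


# Toy data: dataset codes follow the list above, category names are hypothetical.
human = {
    ("D6", "overall"): [1.0, 2.0, 3.0, 4.0, 5.0],
    ("FT", "interesting"): [3.0, 1.0, 2.0, 5.0, 4.0],
}
metric = {
    ("D6", "overall"): [0.1, 0.2, 0.3, 0.4, 0.5],      # same ordering as human
    ("FT", "interesting"): [0.9, 0.7, 0.5, 0.3, 0.1],  # mostly reversed ordering
}

per_category, average = leaderboard_average(metric, human)
for key, rho in per_category.items():
    print(key, round(rho, 3))
print("average:", round(average, 3))
```

In this toy case the first category correlates perfectly and the second negatively, so the average lands well below the best per-category score. Averaging over every category of every dataset rewards metrics that generalize rather than ones tuned to a single benchmark.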

During the final evaluation phase, we will release a hidden test set, and all submitted metrics will be evaluated on it. The final ranking will be based on performance on both the development set and the hidden test set.

Note: The above datasets may be used only for validating the proposed metrics, not for training the evaluation systems. Performance on the hidden test set carries more weight in the final ranking of the submissions.


  • Subtask 2: Moderation of Open-domain Dialogue Systems

In this task, our goal is to evaluate the capability of generative dialogue systems to go beyond detecting toxicity and moderate the conversation by producing appropriate and correct answers that allow the system to continue the dialogue. For this task, a dataset of pairs of 100K messages (training and validation sets) with the following characteristics will be provided for development:

  • A toxic user sends a Tweet using one or more of the most common swear words found on the Internet. The Tweet must be directed to one of the customer service channels.
  • A toxic user writes a Tweet using one or more swear words, and another user replies to it.
  • A toxic user posts a message on Reddit using one or more swear words, and another user replies to it.

During the development phase, participants need to build systems capable of generating polite, specific, and semantically appropriate responses in such scenarios.

During the evaluation phase, a hidden test set will be provided to the participants, who will generate system responses for it. The responses will be evaluated on the objective similarity between the generated response and the original response (e.g., sentence embedding similarity, Deep AM-FM (Zhang et al., 2021), BLEU, ROUGE, etc.). For the top-3 systems in the objective evaluation, a set of 100 responses will be manually evaluated for politeness, specificity, semantic appropriateness, and fluency.
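
As a rough illustration of reference-based similarity scoring, the sketch below computes a simplified ROUGE-L score between a generated response and the original response. It assumes lowercased whitespace tokenization and a balanced F-measure; the official evaluation scripts may tokenize, stem, and weight precision and recall differently. The function names and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated, reference):
    """Simplified ROUGE-L: balanced F-measure over LCS precision and recall."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(gen_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical customer-service reply and a system-generated candidate.
reference = "we are sorry to hear that please send us a dm"
generated = "sorry to hear that please dm us"
print(round(rouge_l_f1(generated, reference), 3))  # about 0.667
```

Word-overlap scores like this one reward surface similarity to the original reply, which is one reason the task also lists embedding-based measures and a manual evaluation for the top systems.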


Schedule


  • Validation data released: Jun 14, 2021
  • Test data released: Sep 13, 2021
  • Entry submission deadline: Sep 21, 2021
  • Final result announcement: Oct 1, 2021 - Oct 8, 2021

Baselines and Data Description


Subtask 1: Automatic Open-domain Dialogue Evaluation

Subtask 2: Moderation of Open-domain Dialogue Systems


Automatic Evaluation Leaderboard (Coming Soon)


Open-domain Dialogue Evaluation

The leaderboard shows the names of submissions and their corresponding Spearman correlation coefficients for each evaluation dataset.

| System | D6 (ρ) | D7 (ρ) | PC (ρ) | UP (ρ) | UT (ρ) | FT (ρ) | FC (ρ) | ZD (ρ) | ZP (ρ) | GD (ρ) | ED (ρ) | EC (ρ) | EE (ρ) | HU (ρ) | AVG (ρ) |
| Deep AM-FM (baseline) | 0.105 | -0.033 | 0.082 | 0.131 | 0.269 | 0.046 | 0.121 | 0.198 | 0.236 | -0.046 | 0.164 | 0.094 | -0.027 | 0.011 | 0.097 |

Moderation of Open-domain Dialogue Systems

The leaderboard shows the names of submissions and their corresponding automatic evaluation scores.

| System | BLEU | ROUGE-L | Deep AM-FM | BERT-score | BLEURT | AVG |
| DialoGPT-base (baseline) | | | | | | |

Registration Details

You can register at https://my.chateval.org/accounts/login/. Once registered, you will be able to download the datasets and readme documents and submit your results at https://chateval.org/dstc10


Information about the tracks

Any updates will be posted at the official website:

https://sites.google.com/dstc.community/dstc10/


Contact

If you have further questions regarding the data, please contact us at dstc10-track-5@googlegroups.com


Organizers:

  • Chen Zhang (National University of Singapore, Singapore)
  • Haizhou Li (National University of Singapore, Singapore)
  • João Sedoc (New York University, USA)
  • Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)
  • Rafael Banchs (Intapp Inc., USA)
  • Alexander Rudnicky (Carnegie Mellon University, USA)

References

  • Deriu, J., Rodrigo, A., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., & Cieliebak, M. (2020). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 1-56.

  • Hori, C., & Hori, T. (2017). End-to-end conversation modeling track in DSTC6. arXiv preprint arXiv:1706.07440.

  • Galley, M., Brockett, C., Gao, X., Gao, J., & Dolan, B. (2019). Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenges Workshop.

  • See, A., Roller, S., Kiela, D., & Weston, J. (2019, June). What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1702-1723).

  • Mehri, S., & Eskenazi, M. (2020a). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456.

  • Mehri, S., & Eskenazi, M. (2020b, July). Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 225-235).

  • Zhang, C., D'Haro, L. F., Banchs, R. E., Friedrichs, T., & Li, H. (2021). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol. 704. Springer, Singapore.

  • Zhao, T., Lala, D., & Kawahara, T. (2020, July). Designing Precise and Robust Dialogue Response Evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 26-33).

  • Gupta, P., Mehri, S., Zhao, T., Pavel, A., Eskenazi, M., & Bigham, J. P. (2019, September). Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue (pp. 379-391).

  • Huang, L., Ye, Z., Qin, J., Lin, L., & Liang, X. (2020, November). GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9230-9240).

  • Merdivan, E., Singh, D., Hanke, S., Kropf, J., Holzinger, A., & Geist, M. (2020). Human annotated dialogues dataset for natural conversational agents. Applied Sciences, 10(3), 762.