DSTC10: Dialogue System Technology Challenge 10

Automatic Evaluation and Moderation of Open-domain Dialogue Systems



Task Overview


  • Subtask 1: Automatic Open-domain Dialogue Evaluation

Effective automatic dialogue evaluation metrics should possess the following two important properties, as indicated in Deriu et al. (2020):

  • Correlated with human judgements - the metrics should produce evaluation scores that correlate well with human judgements (scores) across multiple dialogue evaluation aspects.
  • Explainable - the metrics should provide constructive and explicit feedback to the generative models regarding the quality of their generated responses. For instance, if a generative model is contradicting itself, the evaluation metrics should signal that behavior to the model.
In this task, our goal is to seek effective automatic dialogue evaluation metrics that exhibit the above properties. These metrics can serve as a proxy for human evaluation, enabling fast prototyping of open-domain chatbots. We have identified the following datasets for testing the effectiveness of the proposed evaluation metrics:

  • DSTC6-Eval (D6) (Hori et al., 2017)
  • DSTC7-Eval (D7) (Galley et al., 2019)
  • Persona-Chatlog (PC) (See et al., 2019)
  • PersonaChat-USR (UP) (Mehri & Eskenazi, 2020a)
  • TopicalChat-USR (TP) (Mehri & Eskenazi, 2020a)
  • FED-Turn (FT) (Mehri & Eskenazi, 2020b)
  • FED-Conversation (FC) (Mehri & Eskenazi, 2020b)
  • DailyDialog-Eval (GD) (Gupta et al., 2019)
  • DailyDialog-Eval (ZD) (Zhao et al., 2020)
  • PersonaChat-Eval (ZP) (Zhao et al., 2020)
  • DailyDialog-Eval (ED) (Huang et al., 2020)
  • Empathetic-Eval (EE) (Huang et al., 2020)
  • ConvAI2-Eval (EC) (Huang et al., 2020)
  • HUMOD (HU) (Merdivan et al., 2020)
During the development phase, the participants need to propose different evaluation metrics and can submit their metric scores via ChatEval. The Spearman correlation (ρ) between the submitted scores and the corresponding human scores will be computed per evaluation category per dataset. The correlation results will be reported on the leaderboard by evaluation category, and the submissions will be ranked by the average correlation score across all categories of all datasets.
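
As a concrete illustration, the sketch below computes the Spearman correlation between a metric's scores and the human scores for each development dataset and then averages the results. The file layout and column names (`metric_score`, `human_score`) are hypothetical, and per-category correlations are omitted for brevity; the actual submission format is defined on ChatEval.

```python
# Minimal sketch of the ranking computation (assumed file layout and column
# names; the official submission format is defined on ChatEval).
import pandas as pd
from scipy.stats import spearmanr

datasets = ["D6", "D7", "PC", "UP", "TP", "FT", "FC",
            "ZD", "ZP", "GD", "ED", "EC", "EE", "HU"]

correlations = {}
for name in datasets:
    # One row per evaluated response, holding the metric score and the human score.
    df = pd.read_csv(f"{name}_scores.csv")
    rho, _ = spearmanr(df["metric_score"], df["human_score"])
    correlations[name] = rho

# Submissions are ranked by the average correlation across datasets.
average = sum(correlations.values()) / len(correlations)
print(f"Average Spearman correlation: {average:.3f}")
```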

During the final evaluation phase, we will release a hidden evaluation set, and all submitted metrics will be evaluated on it. The final ranking will be based on performance on both the development set and the hidden test set.

Note: The above datasets may only be used to validate the proposed metrics, not to train the evaluation systems. Performance on the hidden test set carries more weight in the final ranking of the submissions.


  • Subtask 2: Moderation of Open-domain Dialogue Systems

In this task, our goal is to evaluate the capability of generative dialogue systems to go beyond detecting toxicity and to moderate the conversation by producing appropriate and correct answers that allow the dialogue to continue. For this task, a dataset of 100K message-response pairs (training and validation sets) with the following characteristics will be provided for development:

  • A toxic user sends a Tweet message using one or several of the most common swear words found on the Internet. The Tweet message is directed at one of the customer service channels.
  • A toxic user writes a Tweet message using one or several swear words, and the message is replied to by another user.
  • A toxic user posts a message on Reddit using one or several swear words, and the message is replied to by another user.

During the development phase, participants need to develop systems capable of generating polite, specific and semantically appropriate responses in such scenarios.

During the evaluation phase, a hidden test set will be provided to the participants for generating system responses, which will be evaluated based on the objective similarity between the generated response and the original response (e.g., sentence-embedding similarity, Deep AM-FM (Zhang et al., 2021), BLEU, ROUGE). For the top-3 submitted systems in the objective evaluation, a set of 100 responses will be manually evaluated for politeness, specificity, semantic appropriateness and fluency.
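
As a rough illustration of such objective scoring, the sketch below compares a generated response against the reference response using BLEU, ROUGE-L and sentence-embedding cosine similarity. The example sentences and the choice of embedding model are assumptions; the official evaluation scripts and metric weighting are provided by the organizers.

```python
# Illustrative sketch only: reference-based similarity scores for one response.
# The official evaluation scripts and metric weighting may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Sorry to hear that. Could you share your order number so we can help?"
generated = "I'm sorry about the trouble. Please send us your order number."

# BLEU on whitespace tokens, smoothed so short sentences do not score zero.
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure

# Sentence-embedding cosine similarity (model choice is an assumption).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([reference, generated], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  cosine={cosine:.3f}")
```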


Schedule


  • Validation data released: Jun 14, 2021
  • Test data released: Sep 13, 2021
  • Entry submission deadline: Sep 21, 2021
  • Final result announcement: Oct 1, 2021 - Oct 8, 2021

Baselines and Data Description


Subtask 1: Automatic Open-domain Dialogue Evaluation

Subtask 2: Moderation of Open-domain Dialogue Systems


Automatic Evaluation Leaderboard


Open-domain Dialogue Evaluation (Development)

The leaderboard shows names of submissions and their corresponding Spearman Correlation Coefficients for each development dataset.

| System | D6 | D7 | PC | UP | TP | FT | FC | ZD | ZP | GD | ED | EC | EE | HU | AVG | Rank |
|--------|----|----|----|----|----|----|----|----|----|----|----|----|----|----|-----|------|
| AM | 0.112 | 0.016 | 0.090 | 0.054 | 0.070 | 0.080 | 0.165 | 0.054 | 0.246 | 0.150 | 0.015 | 0.080 | 0.100 | 0.100 | 0.095 | 21 |
| FM | 0.062 | 0.032 | 0.091 | 0.151 | 0.188 | 0.080 | 0.092 | 0.226 | 0.446 | 0.136 | 0.170 | 0.072 | 0.050 | 0.097 | 0.135 | 16 |
| AM-FM | 0.100 | 0.027 | 0.081 | 0.144 | 0.141 | 0.051 | 0.112 | 0.223 | 0.468 | 0.177 | 0.155 | 0.094 | 0.025 | 0.117 | 0.137 | 15 |
| T1S1 | 0.249 | 0.076 | 0.050 | 0.083 | 0.046 | 0.194 | 0.358 | 0.146 | 0.123 | 0.040 | 0.085 | 0.159 | 0.184 | 0.060 | 0.132 | 17 |
| T1S2 | 0.203 | 0.099 | 0.091 | 0.118 | 0.043 | 0.123 | 0.151 | 0.187 | 0.385 | 0.158 | 0.355 | 0.366 | 0.328 | 0.124 | 0.195 | 14 |
| T1S3 | 0.222 | 0.073 | 0.082 | 0.144 | 0.025 | 0.186 | 0.187 | 0.219 | 0.414 | 0.183 | 0.361 | 0.362 | 0.360 | 0.133 | 0.211 | 13 |
| T1S4 | 0.245 | 0.340 | 0.057 | 0.273 | 0.218 | 0.239 | 0.269 | 0.369 | 0.552 | 0.568 | 0.363 | 0.504 | 0.395 | 0.329 | 0.337 | 5 |
| T1S5 | 0.279 | 0.349 | 0.032 | 0.307 | 0.196 | 0.220 | 0.321 | 0.349 | 0.512 | 0.504 | 0.236 | 0.524 | 0.384 | 0.305 | 0.323 | 6 |
| T2S1 | 0.009 | 0.215 | 0.051 | 0.156 | 0.277 | 0.099 | 0.269 | 0.205 | 0.217 | 0.021 | 0.026 | 0.060 | 0.075 | 0.008 | 0.120 | 20 |
| T2S2 | 0.081 | 0.198 | 0.035 | 0.122 | 0.296 | 0.095 | 0.252 | 0.210 | 0.242 | 0.028 | 0.035 | 0.072 | 0.064 | 0.019 | 0.125 | 19 |
| T3S1/T3S3 | 0.481 | 0.244 | 0.068 | 0.252 | 0.224 | 0.147 | 0.042 | 0.335 | 0.518 | 0.343 | 0.074 | 0.332 | 0.175 | 0.292 | 0.252 | 11 |
| T3S2/T3S5 | 0.502 | 0.260 | 0.062 | 0.251 | 0.304 | 0.143 | 0.045 | 0.317 | 0.500 | 0.355 | 0.035 | 0.372 | 0.182 | 0.300 | 0.259 | 9 |
| T3S4 | 0.480 | 0.258 | 0.070 | 0.216 | 0.171 | 0.120 | 0.057 | 0.331 | 0.502 | 0.382 | 0.112 | 0.410 | 0.226 | 0.311 | 0.260 | 8 |
| T4S1 | 0.004 | 0.007 | 0.033 | 0.053 | 0.096 | 0.126 | 0.273 | 0.056 | 0.049 | 0.047 | 0.021 | 0.074 | 0.059 | 0.010 | 0.065 | 23 |
| T4S2 | 0.068 | 0.017 | 0.082 | 0.104 | 0.213 | 0.092 | 0.067 | 0.113 | 0.188 | 0.068 | 0.146 | 0.043 | 0.056 | 0.024 | 0.092 | 22 |
| T4S3 | 0.043 | 0.004 | 0.017 | 0.042 | 0.088 | 0.085 | 0.109 | 0.044 | 0.004 | 0.019 | 0.068 | 0.052 | 0.100 | 0.005 | 0.049 | 24 |
| T4S5 | 0.043 | 0.121 | 0.037 | 0.267 | 0.278 | 0.193 | 0.059 | 0.199 | 0.052 | 0.114 | 0.199 | 0.173 | 0.047 | 0.042 | 0.130 | 18 |
| T5S1 | 0.179 | 0.325 | 0.088 | 0.404 | 0.391 | 0.304 | 0.469 | 0.480 | 0.613 | 0.633 | 0.334 | 0.584 | 0.306 | 0.332 | 0.389 | 4 |
| T6S1 | 0.184 | 0.342 | 0.129 | 0.355 | 0.387 | 0.330 | 0.493 | 0.530 | 0.642 | 0.614 | 0.300 | 0.604 | 0.246 | 0.338 | 0.392 | 3 |
| T7S1 | 0.616 | 0.313 | 0.275 | 0.479 | 0.455 | 0.352 | 0.774 | 0.545 | 0.764 | 0.789 | 0.644 | 0.570 | 0.501 | 0.225 | 0.522 | 1 |
| T8S1 | 0.183 | 0.341 | 0.129 | 0.362 | 0.402 | 0.329 | 0.493 | 0.528 | 0.646 | 0.608 | 0.301 | 0.604 | 0.247 | 0.338 | 0.394 | 2 |
| T9S1 | 0.185 | 0.332 | 0.063 | 0.226 | 0.137 | 0.199 | 0.403 | 0.287 | 0.557 | 0.467 | 0.419 | 0.531 | 0.365 | 0.223 | 0.314 | 7 |

Open-domain Dialogue Evaluation (Test)

The leaderboard shows names of submissions and their corresponding Spearman Correlation Coefficients for each hidden test dataset.

| System | JSALT | ESL | NCM | DSTC10-Topical | DSTC10-Persona | AVG | Rank |
|--------|-------|-----|-----|----------------|----------------|-----|------|
| AM | 0.011 | 0.032 | 0.037 | 0.085 | 0.076 | 0.066 | 33 |
| FM | 0.046 | 0.343 | 0.162 | 0.171 | 0.186 | 0.180 | 22 |
| AM-FM | 0.051 | 0.323 | 0.165 | 0.175 | 0.196 | 0.184 | 21 |
| T1S1 | 0.041 | 0.130 | 0.049 | 0.049 | 0.086 | 0.069 | 31 |
| T1S2 | 0.057 | 0.041 | 0.053 | 0.041 | 0.020 | 0.036 | 34 |
| T1S3 | 0.049 | 0.281 | 0.025 | 0.067 | 0.151 | 0.112 | 26 |
| T1S4 | 0.277 | 0.420 | 0.299 | 0.213 | 0.303 | 0.278 | 7 |
| T1S5 | 0.164 | 0.432 | 0.262 | 0.192 | 0.307 | 0.259 | 18 |
| T2S1 | 0.031 | 0.119 | 0.014 | 0.100 | 0.078 | 0.080 | 29 |
| T2S2 | 0.031 | 0.199 | 0.020 | 0.109 | 0.075 | 0.090 | 28 |
| T3S1 | 0.042 | 0.008 | 0.016 | 0.018 | 0.017 | 0.019 | 38 |
| T3S2 | 0.105 | 0.253 | 0.183 | 0.120 | 0.224 | 0.174 | 23 |
| T3S3 | 0.099 | 0.288 | 0.221 | 0.146 | 0.258 | 0.202 | 20 |
| T3S4 | 0.043 | 0.200 | 0.174 | 0.133 | 0.231 | 0.170 | 24 |
| T4S1 | 0.026 | 0.088 | 0.036 | 0.010 | 0.016 | 0.023 | 36 |
| T4S2 | 0.050 | 0.062 | 0.028 | 0.075 | 0.093 | 0.074 | 30 |
| T4S3 | 0.023 | 0.047 | 0.065 | 0.083 | 0.166 | 0.103 | 27 |
| T4S4 | 0.020 | 0.093 | 0.035 | 0.022 | 0.013 | 0.026 | 35 |
| T4S5 | 0.006 | 0.030 | 0.082 | 0.041 | 0.119 | 0.069 | 32 |
| T5S1 | 0.098 | 0.348 | 0.269 | 0.225 | 0.373 | 0.283 | 3 |
| T5S2 | 0.117 | 0.400 | 0.296 | 0.237 | 0.375 | 0.296 | 1 |
| T5S3 | 0.095 | 0.354 | 0.271 | 0.227 | 0.371 | 0.283 | 2 |
| T5S4 | 0.091 | 0.348 | 0.269 | 0.220 | 0.369 | 0.278 | 6 |
| T5S5 | 0.094 | 0.350 | 0.270 | 0.225 | 0.369 | 0.281 | 5 |
| T6S1 | 0.127 | 0.328 | 0.265 | 0.193 | 0.355 | 0.265 | 14 |
| T6S2 | 0.127 | 0.328 | 0.265 | 0.200 | 0.357 | 0.268 | 12 |
| T6S3 | 0.127 | 0.301 | 0.251 | 0.200 | 0.357 | 0.264 | 15 |
| T6S4 | 0.125 | 0.301 | 0.251 | 0.189 | 0.356 | 0.260 | 17 |
| T6S5 | 0.127 | 0.329 | 0.266 | 0.200 | 0.358 | 0.269 | 11 |
| T7S1 | 0.041 | 0.034 | 0.020 | 0.014 | 0.025 | 0.023 | 37 |
| T8S1 | 0.066 | 0.321 | 0.256 | 0.204 | 0.360 | 0.264 | 16 |
| T8S2 | 0.088 | 0.323 | 0.256 | 0.224 | 0.368 | 0.276 | 8 |
| T8S3 | 0.065 | 0.332 | 0.249 | 0.205 | 0.363 | 0.265 | 13 |
| T8S4 | 0.078 | 0.330 | 0.249 | 0.222 | 0.371 | 0.275 | 9 |
| T8S5 | 0.085 | 0.361 | 0.255 | 0.228 | 0.372 | 0.282 | 4 |
| T9S1 | 0.056 | 0.162 | 0.117 | 0.132 | 0.156 | 0.135 | 25 |
| T9S2 | 0.262 | 0.456 | 0.191 | 0.174 | 0.338 | 0.269 | 10 |
| T9S3 | 0.264 | 0.419 | 0.138 | 0.155 | 0.301 | 0.241 | 19 |

More results for Subtask 1 can be found here.


Moderation of Open-domain Dialogue Systems

The leaderboard shows names of submissions and their corresponding automatic evaluation scores.

| System | BLEU | ROUGE-L | BERT-score | BLEURT | Win Ratio |
|--------|------|---------|------------|--------|-----------|
| DialoGPT-base (baseline) | 0.008 | 0.072 | 0.832 | -1.180 | 0.179 |
| BlenderBot 2.0 | 0.009 | 0.097 | 0.836 | -1.183 | 0.443 |
| GPT-3 | 0.008 | 0.065 | 0.831 | -1.201 | 0.273 |

Registration Details

You can register at https://my.chateval.org/accounts/login/. Once registered, you will be able to download the datasets and readme documents, as well as submit your results, at https://chateval.org/dstc10


Information about the tracks

Any updates will be posted at the official website:

https://sites.google.com/dstc.community/dstc10/


Contact

If you have further questions regarding the data, please contact us at the following email address: dstc10-track-5@googlegroups.com


Organizers:

  • Chen Zhang (National University of Singapore, Singapore)
  • Haizhou Li (National University of Singapore, Singapore)
  • João Sedoc (New York University, USA)
  • Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)
  • Rafael Banchs (Intapp Inc., USA)
  • Alexander Rudnicky (Carnegie Mellon University, USA)

References

  • Deriu, J., Rodrigo, A., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., & Cieliebak, M. (2020). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 1-56.

  • Hori, C., & Hori, T. (2017). End-to-end conversation modeling track in DSTC6. arXiv preprint arXiv:1706.07440.

  • Galley, M., Brockett, C., Gao, X., Gao, J., & Dolan, B. (2019). Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenges Workshop.

  • See, A., Roller, S., Kiela, D., & Weston, J. (2019, June). What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1702-1723).

  • Mehri, S., & Eskenazi, M. (2020a). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456.

  • Mehri, S., & Eskenazi, M. (2020b, July). Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 225-235).

  • Zhang, C., D'Haro, L. F., Banchs, R. E., Friedrichs, T., & Li, H. (2021). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In Conversational Dialogue Systems for the Next Decade, Lecture Notes in Electrical Engineering, vol. 704. Springer, Singapore.

  • Zhao, T., Lala, D., & Kawahara, T. (2020, July). Designing Precise and Robust Dialogue Response Evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 26-33).

  • Gupta, P., Mehri, S., Zhao, T., Pavel, A., Eskenazi, M., & Bigham, J. P. (2019, September). Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue (pp. 379-391).

  • Huang, L., Ye, Z., Qin, J., Lin, L., & Liang, X. (2020, November). GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9230-9240).

  • Merdivan, E., Singh, D., Hanke, S., Kropf, J., Holzinger, A., & Geist, M. (2020). Human annotated dialogues dataset for natural conversational agents. Applied Sciences, 10(3), 762.