DSTC10: Dialogue System Technology Challenge 10

Automatic Evaluation and Moderation of Open-domain Dialogue Systems

Click here to download DSTC10 data.

Click here to submit.

Task Overview

  • Subtask 1: Automatic Open-domain Dialogue Evaluation

Effective automatic dialogue evaluation metrics possess the following two important properties, as indicated in (Deriu et al., 2020):

  • Correlated to human judgements - the metrics should produce evaluation scores that correlate well with human judgements (scores) across multiple dialogue evaluation aspects.
  • Explainable - the metrics should provide constructive and explicit feedback to the generative models in terms of the quality of their generated responses. For instance, if a generative model is contradicting itself, the evaluation metrics should signal such behavior to the generative models.
In this task, our goal is to seek effective automatic dialogue evaluation metrics that exhibit the above properties. These metrics can serve as a proxy to human evaluation for fast prototyping of open-domain chatbots. We have identified the following datasets to test the effectiveness of the proposed evaluation metrics:

  • DSTC6 human evaluation data (Hori et al., 2017)
  • DSTC7 human evaluation data (Galley et al., 2019)
  • Persona-Chatlog dataset (See et al., 2019)
  • USR dataset (Mehri & Eskenazi, 2020)
  • FED dataset (Mehri & Eskenazi, 2020)
During the development phase, participants need to propose different evaluation metrics and can submit their metric scores via ChatEval. The Pearson and Spearman correlations between the submitted scores and the corresponding human scores will be computed per evaluation category per dataset. The correlation results will be reported in the leaderboard on a per-category basis. Submissions will be ranked by the average correlation score across all categories of all datasets.
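The per-category correlation computation can be sketched as follows, assuming `scipy` is available; the metric and human scores below are purely illustrative placeholders, not data from any of the listed datasets:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for one evaluation category of one dataset:
# each position is one response, scored by the submitted metric
# and by human annotators (illustrative values only).
metric_scores = [0.82, 0.45, 0.67, 0.91, 0.30]
human_scores = [4.5, 2.0, 3.5, 5.0, 1.5]

# Pearson measures linear correlation; Spearman measures rank
# (monotonic) correlation. Both return (statistic, p-value).
pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

A submission's leaderboard position would then be the average of such correlation scores over every category of every dataset.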

During the final evaluation phase, we will release a hidden evaluation set and all the submitted metrics will be evaluated with the hidden evaluation set. The final ranking will be based on the performance on both the development set and the hidden test set.

Note: The above datasets may only be used for testing the proposed metrics, not for training the evaluation systems. Performance on the hidden test set carries more weight in the final ranking of the submissions.

  • Subtask 2: Moderation of Open-domain Dialogue Systems

In this task, our goal is to evaluate the capability of generative dialogue systems to go beyond detecting toxicity: the system should moderate the conversation by producing appropriate and correct answers that allow the dialogue to continue. For this task, a dataset of 100K message pairs (training and validation sets) with the following characteristics will be provided for development:

  • A toxic user sends a tweet using one or several of the most common swear words found on the Internet. The tweet is directed at one of the customer service channels.
  • A toxic user writes a tweet using one or several swear words, and the message is replied to by another user.
  • A toxic user posts a message on Reddit using one or several swear words, and the message is replied to by another user.

During the development phase, participants need to develop systems capable of generating polite, specific, and semantically appropriate responses in such scenarios.

During the evaluation phase, a hidden test set will be provided to the participants for them to generate system responses, which will be evaluated based on the objective similarity between the generated response and the original response (e.g., sentence-embedding similarity, Deep AM-FM (Zhang et al., 2021), BLEU, ROUGE). For the top 3 systems in the objective evaluation, a set of 100 responses will be manually evaluated for politeness, specificity, semantic appropriateness, and fluency.
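As a minimal stand-in for the objective similarity scoring, the sketch below computes a unigram-overlap F1 between an original reference response and a generated one. Whitespace tokenisation, lowercasing, and the F1 formulation are simplifying assumptions for illustration; the track itself uses the metrics named above (embedding similarity, Deep AM-FM, BLEU, ROUGE):

```python
from collections import Counter


def overlap_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference response and a
    generated candidate. A deliberately simple proxy for the
    track's objective metrics, not the official scorer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical moderation responses (illustrative only).
score = overlap_f1(
    "please contact our support team for help",
    "please reach out to our support team",
)
print(f"overlap F1: {score:.3f}")
```

Any sentence-level similarity function with this signature (reference, candidate) → score could be slotted in the same way, which is how the several listed metrics can share one evaluation loop.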

Schedule (Coming Soon)

Automatic Evaluation Leaderboard (Coming Soon)

Open-domain Dialogue Evaluation

The leaderboard shows the names of submissions and their corresponding Pearson & Spearman correlations for each evaluation dataset. (Explanation of abbreviations: D6 - DSTC6, D7 - DSTC7, PC - Persona-Chatlog, UP - USR-Persona, UT - USR-Topical, FT - FED-Turn, FC - FED-Conversation, AVG - Average, ρ - Pearson score, π - Spearman score)

System | D6 (ρ) | D6 (π) | D7 (ρ) | D7 (π) | PC (ρ) | PC (π) | UP (ρ) | UP (π) | UT (ρ) | UT (π) | FT (ρ) | FT (π) | FC (ρ) | FC (π) | AVG (ρ) | AVG (π)

Registration Details

You can register at https://my.chateval.org/accounts/login/. Once registered, you will be able to download the datasets and readme documents, as well as submit your results at https://chateval.org/dstc10.

Information about the tracks

Any updates will be posted at the official website:



If you have further questions regarding the data, please contact us at the following email address: dstc10-track-5@googlegroups.com


Organizers

  • Chen Zhang (National University of Singapore, Singapore)
  • Haizhou Li (National University of Singapore, Singapore)
  • João Sedoc (New York University, USA)
  • Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)
  • Rafael Banchs (Intapp Inc., USA)
  • Alexander Rudnicky (Carnegie Mellon University, USA)


[1] Deriu, J., Rodrigo, A., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., & Cieliebak, M. (2020). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 1-56.

[2] Hori, C., & Hori, T. (2017). End-to-end conversation modeling track in DSTC6. arXiv preprint arXiv:1706.07440.

[3] Galley, M., Brockett, C., Gao, X., Gao, J., & Dolan, B. (2019). Grounded response generation task at DSTC7. In AAAI Dialog System Technology Challenges Workshop.

[4] See, A., Roller, S., Kiela, D., & Weston, J. (2019, June). What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1702-1723).

[5] Mehri, S., & Eskenazi, M. (2020). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456.

[6] Mehri, S., & Eskenazi, M. (2020, July). Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 225-235).

[7] Zhang C., D’Haro L.F., Banchs R.E., Friedrichs T., Li H. (2021) Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore.