DSTC11: Dialogue System Technology Challenge 11

Track 4: Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems

Click here to register for DSTC11.T4. (now available)

Click here to download DSTC11.T4 data. (now available)

Click here to submit your results. (now available)

Click here to use the baseline model. (now available)

Track Overview

Track Details

This track consists of two tasks which are explained in more detail below:

Participants will develop effective automatic open-ended and multilingual dialogue evaluation metrics that perform similarly when evaluated over a new language.
Participants will develop effective automatic open-ended dialogue evaluation metrics that perform robustly when evaluated over paraphrased/back-translated sentences in English.

For both tasks, proposed metrics are expected to show the following two important properties, as indicated in (Deriu et al., 2020):

Correlated to human judgments - the metrics should produce evaluation scores that correlate well with human judgments (scores) across multiple languages or alternative responses (i.e., back-translated or paraphrased).
Explainable - the metrics should provide constructive and explicit feedback to the generative models in terms of the quality of their generated responses. For instance, if a generative model is contradicting itself, the evaluation metrics should signal such behavior to the generative models.

Participants can propose their own metric or optionally improve one of the two baseline evaluation metrics: MDD-Eval (Zhang et al., 2021) or deep AM-FM (Zhang et al., 2020). A leaderboard on the ChatEval platform will be provided, allowing participants to check their progress.

For each evaluation task, Spearman correlation will be computed to compare the proposed evaluation metrics against human judgments. A final average score will be calculated to rank the submitted evaluation metrics.

For more information check the Track Proposal.

See the Track GitHub for more details.

Task 1: Multilingual Automatic Evaluation Metrics

In this task, the goal for participants is to propose effective automatic dialogue evaluation metrics that exhibit the previously mentioned properties (Track Details section) and perform well in a multilingual setup (English, Spanish and Chinese). Concretely, participants will propose a single multilingual model that obtains high correlations with human annotations when evaluated on multilingual dialogues (development set in the Provided Datasets section) and performs well on the hidden multilingual test set. Participants are expected to use pre-trained multilingual models, train them to predict multidimensional quality metrics using self-supervised techniques, and optionally fine-tune their system on a subset of the development data.

Finally, participants will evaluate their models on the development and test sets, and are expected to show similar performance, in terms of correlations with human annotations, on the English, Spanish and Chinese utterances. (Note: only dev and test sets will have human annotations, and only test sets will be manually translated or paraphrased/back-translated to guarantee the correlations with the original human annotations on the English data.)

Task 2: Robust Automatic Evaluation Metrics

In this task, the goal for participants is to propose robust metrics for automatic evaluation of English-only dialogues that exhibit the previously mentioned properties (Track Details section) while remaining robust when dealing with paraphrased/back-translated English sentences. The expected performance must be on par with the correlations with human annotations obtained on the original sentences. As a robustness criterion, paraphrased/back-translated sentences should have the same semantic meaning as the original sentence, but different wording.

Additionally, participants will have the opportunity to test robustness on alternative machine translations that the organizers will provide. Finally, the influence on the metric will also be evaluated when the paraphrased/back-translated current-turn sentences are provided instead of the original ones, always along with their respective paraphrased/back-translated context.

During the test phase, hidden and manually curated back-translated test data will be provided to participants to evaluate their proposed metrics.
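To make the robustness requirement of Task 2 more concrete, the following minimal sketch measures how much a metric's score moves when a turn is replaced by a paraphrase with the same meaning. Everything in it is an illustrative assumption rather than part of the track: `lexical_overlap` is a deliberately naive toy metric, `mean_score_shift` is a hypothetical helper, and the example sentences are invented. The official evaluation compares correlations with human annotations, not raw score shifts.

```python
def lexical_overlap(context, response):
    """Toy metric (hypothetical): share of response tokens that also occur in the context."""
    context_tokens = set(context.lower().split())
    response_tokens = response.lower().split()
    if not response_tokens:
        return 0.0
    hits = sum(token in context_tokens for token in response_tokens)
    return hits / len(response_tokens)


def mean_score_shift(metric, pairs):
    """Mean absolute change of the metric between original and paraphrased turns.

    Each item in `pairs` is ((context, response), (para_context, para_response)).
    """
    shifts = [
        abs(metric(context, response) - metric(para_context, para_response))
        for (context, response), (para_context, para_response) in pairs
    ]
    return sum(shifts) / len(shifts)


# Invented example: same meaning, different wording.
pairs = [
    (
        ("what is your favorite food", "my favorite food is pizza"),
        ("which dish do you like most", "pizza is the meal i enjoy the most"),
    ),
]

shift = mean_score_shift(lexical_overlap, pairs)
print(f"mean score shift under paraphrasing: {shift:.3f}")
```

A metric that depends on surface wording, like this toy one, shifts noticeably under paraphrasing, which is the behavior Task 2 asks participants to avoid.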

Provided Datasets

After the organizers' participation in the CHANEL@JSALT2020 workshop (Rudnicky et al., 2020) at Johns Hopkins University, they have automatically translated back-and-forth (using the same MS Azure translation service) a total of 18 well-known human-human dialogue datasets. These data sets will be used as training data. The total amount of dialogues is 393k (approx. 3M turns).

DBDC (Higashinaka et al., 2016)
CMU_DoG (Zhou et al., 2018)
Cornell Movie-Dialogs (Danescu-Niculescu-Mizil & Lee, 2011)
DailyDialog (Li et al., 2017)
DECODE (Nie et al., 2020)
EmotionLines (Chen et al., 2018)
EmpathicDialogues (Rashkin et al., 2018)
Holl-E (Moghe et al., 2018)
MEENA (Adiwardana et al., 2020)
MELD (Poria et al., 2018)
MetalWOz (Lee et al., 2019)
Movie-DiC (Banchs, 2012)
PersonaChat (Zhang et al., 2018)
SentimentLIAR (Upadhayay & Behzadan, 2020)
Switchboard Coherence (Cervone & Riccardi, 2020)
Topical-Chat (Gopalakrishnan et al., 2019)
Wizard of Wikipedia (Dinan et al., 2018)
Wochat (D'Haro et al., 2016)

As the development set, the organizers will provide the following datasets (details in the GitHub section "Annex: Existing Datasets for Benchmarking"), identified during DSTC10 Track 5 (Zhang et al., 2021). Together they contain more than 35k turn-level human annotations, which have been automatically translated to Spanish and Chinese and back-translated to English using MS Azure services.

CONVAI2-GRADE (CG) (Huang et al., 2020)
DAILYDIALOG-GRADE (DH) (Huang et al., 2020)
DAILYDIALOG-GUPTA (DG) (Gupta et al., 2019)
DAILYDIALOG-ZHAO (DZ) (Zhao et al., 2020)
DSTC7 (D7) (Galley et al., 2019)
EMPATHETIC-GRADE (EG) (Huang et al., 2020)
FED-DIAL (FD) (Mehri & Eskenazi, 2020a)
FED-TURN (FT) (Mehri & Eskenazi, 2020a)
HUMOD (HM) (Merdivan et al., 2020)
PERSONA-SEE (PS) (See et al., 2019)
PERSONA-USR (PU) (Mehri & Eskenazi, 2020b)
PERSONA-ZHAO (PZ) (Zhao et al., 2020)
TOPICAL-USR (TU) (Mehri & Eskenazi, 2020b)
JSALT (JS) (Rudnicky et al., 2020)
CHATEVAL (CS) (Sedoc et al., 2019)
DSTC10 (D10) (Zhang et al., 2021)

This development data can help participants to check the multilingualism or robustness capabilities of their trained models in terms of correlations with human-annotations. Additional databases, not mentioned here, will be added when available to increase the size of the benchmarking.

Moreover, the datasets provided by the THU-COAI group (Conversational AI groups from Tsinghua University) will be used; this set of data is named CDIAL. The datasets contain open-domain human-human dialogs. They are originally in Chinese and consist of 3,470 dialogs (approx. 130k turns).

ECM (Zhou et al., 2018)
KdConv (Zhou et al., 2020)
LCCC (Wang et al., 2020)

In addition, we will provide the same datasets translated into English using the SotA Tencent MT system. These datasets will be provided to participants together with automatic meta-data (machine translation Quality Estimation (QE), toxicity, and sentiment analysis) for filtering and dialogue curation purposes. This gives participants a better reference for the dataset quality and helps them decide whether or not to use these paraphrases/translations when training their evaluation models. Participants may also optionally fine-tune multilingual pre-trained models to obtain better performance on the proposed dialogue-oriented tasks.

Since the quality of the back-translated sentences can play an important role in estimating the metric scores, QE metric scores will be given to the participants using our QE system and other existing models (e.g., COMET (Rei et al., 2020)). Participants can optionally use this information to discard dialogues or turns that do not show high quality when training their metrics. Participants are welcome to use data and ideas from the MT field to propose QE metrics that can, optionally, be included to provide final scores. Finally, the organizers may provide new translated dialogue datasets to allow participants to create more robust and better-trained systems.
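As an illustration of the optional QE-based filtering described above, here is a minimal sketch that drops turns whose translation quality score falls below a threshold. The field names (`dialogue_id`, `text`, `qe_score`), the sample values and the threshold are assumptions made for this example; the actual field names and score ranges are defined by the released data and its Data Formats documentation.

```python
def filter_by_qe(turns, threshold):
    """Keep only the turns whose QE score is at least `threshold`.

    `turns` is a list of dicts; the "qe_score" key is a hypothetical field name.
    """
    return [turn for turn in turns if turn["qe_score"] >= threshold]


# Invented sample data, only to show the shape of the operation.
turns = [
    {"dialogue_id": "d1", "text": "Hola, ¿cómo estás?", "qe_score": 0.82},
    {"dialogue_id": "d1", "text": "Bien gracias tú y", "qe_score": 0.31},
    {"dialogue_id": "d2", "text": "Me gusta el cine.", "qe_score": 0.77},
]

kept = filter_by_qe(turns, threshold=0.5)
print(f"kept {len(kept)} of {len(turns)} turns")
```

Filtering at turn level can break the dialogue context, so an alternative is to discard a whole dialogue when any of its turns falls below the threshold. Which option works better is left to each participant.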

Regarding the paraphrases, all the original English sentences of each dataset will have multiple paraphrases, as well as annotations so that each participant can evaluate the quality of each paraphrase. The paraphrasing model used will be PARROT (Damodaran, 2021).

Additionally, ~3k random H-H turns (~1k dialogues) of CDIAL in Chinese were manually annotated by Tencent AI. Also, ~5k new H-C Chinese turns (~500 dialogues) were generated with three different SotA chatbots (Tencent's model, Microsoft's Xiaoice (Zhou et al., 2020) and Baidu's Plato (Bao et al., 2019)). Both the turn-level and the dialogue-level annotations were produced manually by Tencent AI.

During the test phase, a new manually curated multilingual corpus (Spanish and Chinese) of 2k turns (~700 dialogues), along with its turn-level and dialogue-level human evaluation annotations, will be provided to participants to test models for both tasks. This corpus will be manually checked to guarantee its quality and high correlation with the original dialogues.

Furthermore, in order to check the generalization capabilities of the metrics proposed by the participants, the test data will include a new dataset of human-chatbot interactions with ~2k turns (~60 dialogues).

Datasets Summary

Datasets Name	CHANEL	DSTC10	CDIAL
#Datasets	18	7	3
Language	English, Spanish/Chinese, and English back-translation	English, Spanish/Chinese, and English back-translation	Chinese, English, and Chinese back-translation
Dialogues Type	Human-Human Open-Domain	Human-Chatbot Open-Domain	Human-Human Open-Domain
# Dialogues / Utterances	390,000+ / 3,000,000+	18,000+ / 55,000+	3,470+ / 130,000+
Annotations	Sentiment analysis and Toxicity	Sentiment analysis and Toxicity; Turn/dialogue level human scores	Turn/dialogue level human scores
Task 1 Set	Public: Train	Public: Dev, Test; Hidden: Automatic Translations	Public: Train/Dev/Test
Task 2 Set	Public: Train	Public: Dev, Test; Hidden: Manually paraphrased/back-translation	N/A

Datasets Statistics

Data sets that make up the train set.

Name	#Turns	#Dialogues	Average Turn/Dial	Average Words/Turn	Annotation Granularity	Original Language	Translation
DBDC	8,509	415	20.50	7.31	Turn	En	Zh/Es
CMU_DoG	95,305	4,221	22.58	17.93	Turn	En	Zh/Es
Cornell Movie-Dialogs	304,713	83,097	3.67	13.72	Turn	En	Zh/Es
DailyDialog	102,960	13,116	7.85	13.96	Turn	En	Zh/Es
DECODE	296,105	35,426	8.36	15.05	Turn	En	Zh/Es
EmotionLines	14,503	1,000	14.50	10.53	Turn	En	Zh/Es
EmpathicDialogues	107,220	24,850	4.31	15.88	Turn	En	Zh/Es
Holl-E	91,452	9,071	10.08	17.74	Turn	En	Zh/Es
MEENA	3,675	193	19.04	9.14	Turn	En	Zh/Es
MELD	23,197	1,592	14.57	10.98	Turn	En	Zh/Es
MetalWOz	432,036	37,884	11.40	8.47	Turn	En	Zh/Es
Movie-DiC	512,582	65,215	7.86	13.82	Turn	En	Zh/Es
PersonaChat	162,064	10,907	14.86	11.72	Turn	En	Zh/Es
SentimentLIAR	12,781	12,781	1.00	20.16	Turn	En	Zh/Es
Switchboard Coherence	12,059	1,000	12.06	20.55	Turn	En	Zh/Es
Topical-Chat	235,281	10,784	21.82	23.23	Turn	En	Zh/Es
Wizard of Wikipedia	201,999	22,311	9.05	18.83	Turn	En	Zh/Es
Wochat	19,881	607	32.75	6.75	Turn	En	Zh/Es
Total	2,636,322	334,470	236.26	255.77
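Note that the last two values in the Total row equal the sums of the per-dataset averages (for example, the eighteen "Average Turn/Dial" values add up to 236.26), so they should not be read as corpus-level averages. The short sketch below, which uses only figures from the table above, contrasts that sum with the corpus-level average obtained by dividing total turns by total dialogues. The variable names are chosen for this example.

```python
# Totals taken from the "Total" row of the train set table.
total_turns = 2_636_322
total_dialogues = 334_470

# Per-dataset "Average Turn/Dial" values, in table order (DBDC ... Wochat).
avg_turns_per_dialogue = [
    20.50, 22.58, 3.67, 7.85, 8.36, 14.50, 4.31, 10.08, 19.04,
    14.57, 11.40, 7.86, 14.86, 1.00, 12.06, 21.82, 9.05, 32.75,
]

# What the Total row reports: the plain sum of the per-dataset averages.
sum_of_averages = sum(avg_turns_per_dialogue)

# Unweighted mean over datasets (each dataset counts equally).
mean_of_averages = sum_of_averages / len(avg_turns_per_dialogue)

# Corpus-level average (each dialogue counts equally).
corpus_average = total_turns / total_dialogues

print(f"sum of per-dataset averages:  {sum_of_averages:.2f}")
print(f"mean of per-dataset averages: {mean_of_averages:.2f}")
print(f"corpus-level turns/dialogue:  {corpus_average:.2f}")
```

The corpus-level figure is much lower than the unweighted mean because the largest datasets (for example Cornell Movie-Dialogs and EmpathicDialogues) contain many short dialogues.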

Data sets that make up the development set.

Name	#Turns	#Dialogues	Average Turn/Dial	Average Words/Turn	Annotation Granularity	Original Language	Translation
ConvAI2-GRADE	1,800	600	3.00	12.07	Turn	En	Zh/Es
DailyDialog-GRADE	900	300	3.00	12.60	Turn	En	Zh/Es
DailyDialog-GUPTA	2,460	500	4.92	12.37	Turn	En	Zh/Es
DailyDialog-ZHAO	4,248	900	4.72	12.41	Turn	En	Zh/Es
DSTC7	34,650	9,990	3.47	15.39	Turn	En	Zh/Es
Empathetic-GRADE	900	300	3.00	16.65	Turn	En	Zh/Es
FED-Dial	1,715	125	13.72	11.10	Dial	En	Zh/Es
FED-Turn	3,888	375	10.37	10.78	Turn	En	Zh/Es
HUMOD	37,468	9,499	3.94	7.97	Turn	En	Zh/Es
Persona-SEE	39,792	3,316	12.00	9.00	Dial	En	Zh/Es
PersonaChat-USR	2,790	300	9.30	12.08	Turn	En	Zh/Es
PersonaChat-ZHAO	4,614	900	5.13	12.06	Turn	En	Zh/Es
TOPICAL-USR	4,032	360	11.20	23.16	Turn	En	Zh/Es
ECM-Eval	3,004	1,502	2.00	13.13	Turn	Zh	En
KdConv-Eval	3,499	354	9.88	21.11	Turn	Zh	En
LCCC-Eval	3,009	589	5.11	11.72	Turn	Zh	En
Total	148,769	29,910	104.76	212.64

Data sets that make up the test set.

Name	#Turns	#Dialogues	Average Turn/Dial	Average Words/Turn	Annotation Granularity	Original Language	Translation
BlenderBot3	679	21	32.33	16.96	Turn/Dial	En	Zh/Es
ChatGPT	462	21	22.00	91.07	Turn/Dial	En	Zh/Es
GPT-3.5	560	17	32.94	23.73	Turn/Dial	En	Zh/Es
HCChinese	2,017	187	10.79	8.08	Turn/Dial	Zh	En
ChatEval	400	200	2.00	8.13	Turn	En	Zh/Es
DSTC10	112	28	4.00	14.00	Turn	En	Zh/Es
JSALT	46	13	3.54	17.26	Turn	En	Zh/Es
Total	4,276	487	107.60	179.23

Datasets Information

The CHANEL datasets are oriented to Task 1 and Task 2. The source language is English.

CHANEL	Spanish Translation	Chinese Translation	English Back-translation	Paraphrases	Sentiment Analysis	Content Moderate	Annotation Granularity
DBDC	✔		✔	✔	✔	✔	Turn-level
CMU_DoG	✔		✔	✔	✔	✔	Turn-level
Cornell Movie-Dialogs	✔		✔	✔	✔	✔	Turn-level
DailyDialog	✔	✔	✔	✔	✔	✔	Turn-level
DECODE	✔		✔	✔	✔	✔	Turn-level
EmotionLines	✔		✔	✔	✔	✔	Turn-level
EmpathicDialogues	✔	✔	✔	✔	✔	✔	Turn-level
Holl-E	✔		✔	✔	✔	✔	Turn-level
MEENA	✔		✔	✔	✔	✔	Turn-level
MELD	✔		✔	✔	✔	✔	Turn-level
MetalWOz	✔		✔	✔	✔	✔	Turn-level
Movie-DiC	✔		✔	✔	✔	✔	Turn-level
PersonaChat	✔	✔	✔	✔	✔	✔	Turn-level
SentimentLIAR	✔		✔	✔	✔	✔	Turn-level
Switchboard Coherence	✔		✔	✔	✔	✔	Turn-level
Topical-Chat	✔	✔	✔	✔	✔	✔	Turn-level
Wizard of Wikipedia	✔	✔	✔	✔	✔	✔	Turn-level
WOCHAT	✔		✔	✔	✔	✔	Turn-level

The DSTC10 datasets are oriented to Task 1 and Task 2. The source language is English.

DSTC10	Spanish Translation	Chinese Translation	English Back-translation	Paraphrases	Sentiment Analysis	Content Moderate	Human Annotations	Annotation Granularity
CONVAI2-GRADE (CG)	✔	✔	✔	✔	✔	✔	✔	Turn-level
DAILYDIALOG-GRADE (DH)	✔	✔	✔	✔	✔	✔	✔	Turn-level
DAILYDIALOG-GUPTA (DG)	✔	✔	✔	✔	✔	✔	✔	Turn-level
DAILYDIALOG-ZHAO (DZ)	✔	✔	✔	✔	✔	✔	✔	Turn-level
DSTC7 (D7)	✔	✔	✔	✔	✔	✔	✔	Turn-level
EMPATHETIC-GRADE (EG)	✔	✔	✔	✔	✔	✔	✔	Turn-level
FED-DIAL (FD)	✔	✔	✔	✔	✔	✔	✔	Dialogue-level
FED-TURN (FT)	✔	✔	✔	✔	✔	✔	✔	Turn-level
HUMOD (HM)	✔	✔	✔	✔	✔	✔	✔	Turn-level
PERSONA-SEE (PS)	✔	✔	✔	✔	✔	✔	✔	Dialogue-level
PERSONA-USR (PU)	✔	✔	✔	✔	✔	✔	✔	Turn-level
PERSONA-ZHAO (PZ)	✔	✔	✔	✔	✔	✔	✔	Turn-level
TOPICAL-USR (TU)	✔	✔	✔	✔	✔	✔	✔	Turn-level

The CDIAL datasets are oriented to Task 1. The source language is Chinese.

CDIAL	English Translation	Human Annotations
ECM	✔	✔
KDCONV	✔	✔
LCCC	✔	✔

Data Format

All provided data follow the Data Formats specification, which gives guidelines on how to store, maintain and handle dialogue corpora.

Baseline Model

The default choice is Deep AM-FM (Zhang et al., 2020) (used for DSTC-10 and previously). This model has been adapted to be able to evaluate multilingual datasets, as well as to work with paraphrased and back-translated sentences.

This project has also investigated more recent approaches based on fine-tuned large language models (LLMs). Zhang et al. note that their approach may be limited by domain specificity, whereas LLMs are trained on large corpora that are, in principle, less domain-dependent. Whether this leads to better evaluation is an empirical question.

All information related to the baseline model, such as code and data, can be found in this GitHub repository.

Dimensions Evaluation

Considering the annotations available in the development data, the test data will have the following dimensions (annotations) to evaluate in both Task 1 (English, Chinese and Spanish) and Task 2:

Turn-level: Appropriateness, Content Richness, Grammatical Correctness and Relevance.
Dialogue-level: Coherence, Engageness/Likeability, Informativeness and Overall.

The annotations will be evaluated and indicated individually, discriminating by dataset and language. In addition, a global score will be estimated by grouping all dimensions. This global value will be calculated separately at turn-level and dialogue-level for each task.

A brief description of each dimension (Mehri et al., 2022) is shown below.

Turn-level:

Appropriateness - The response is appropriate given the preceding dialogue.
Content Richness - The response is informative, with long sentences including multiple entities and conceptual or emotional words.
Grammatical Correctness - Responses are free of grammatical and semantic errors.
Relevance - Responses are on-topic with the immediate dialogue history.

Dialogue-level:

Coherence - Throughout the dialogue, the system maintains a good conversation flow.
Engageness/Likeability - Throughout the dialogue, the system displays a likeable personality.
Informativeness - Throughout the dialogue, the system provides unique and non-generic information.
Overall - The overall quality of and satisfaction with the dialogue.

How will we rank all submitted metrics in the test leaderboard?

Each participant's submission will be evaluated separately. For each submission, Spearman's correlations will be calculated at dimension level, separately for each task. The resulting Spearman correlation scores will then be averaged, and submissions will be ranked by this average.

On the hidden test data, the Spearman correlation will be computed between the scores provided by the participant and the hidden human judgment annotations. In addition to high correlation with human judgments, we also encourage explainability of the metrics.

Without influencing the ranking, results obtained using Pearson's correlation will also be published. Moreover, Spearman and Pearson correlations at language and dataset level will be published.
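The scoring procedure described in this section can be sketched as follows: a Spearman correlation is computed per dimension between the metric scores and the human scores, and the per-dimension values are then averaged. This is a minimal standard-library sketch, not the organizers' evaluation script; the function names and the toy scores are assumptions made for illustration. Spearman's correlation is implemented here as Pearson's correlation on average ranks, which handles ties.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def average_spearman(per_dimension):
    """`per_dimension` maps a dimension name to (metric_scores, human_scores)."""
    return mean(spearman(m, h) for m, h in per_dimension.values())


# Invented toy scores for two of the turn-level dimensions.
toy_scores = {
    "Appropriateness": ([0.10, 0.40, 0.35, 0.80], [1, 3, 2, 5]),
    "Relevance": ([0.90, 0.20, 0.50, 0.70], [1, 2, 3, 4]),
}

for name, (metric_scores, human_scores) in toy_scores.items():
    print(name, round(spearman(metric_scores, human_scores), 4))
print("average", round(average_spearman(toy_scores), 4))
```

The same `pearson` helper applied directly to the raw scores gives the Pearson correlation that is published alongside the ranking.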

Hidden test data will be published after the presentation of the results. Additionally, a script will be shared to allow participants to evaluate their own models at different granularity-levels.

Automatic Evaluation Leaderboard

The leaderboard shows the names of the submissions and their corresponding Spearman correlation coefficients for each development dataset. Each column name is the abbreviation of the corresponding development dataset.

The results obtained by the baseline model are broadly similar across languages and input variants, suggesting that the metric is adequate for multilingual use and reasonably robust when working with paraphrases or back-translations.

Task 1: Multilingual Metrics (development)

System	CG	DH	DG	DZ	D7	EG	FD	FT	HM	PS	PU	PZ	TU	AVG
AM-FM EN	0.3373	0.0916	0.2811	0.1433	0.2469	0.2548	0.1269	0.0264	0.1258	0.0262	0.0823	0.4489	0.1149	0.1774
AM-FM ES	0.3094	0.1053	0.2146	0.1170	0.2317	0.2001	0.1172	-0.0120	0.1019	0.0236	0.0634	0.4118	0.1086	0.1551
AM-FM ZH	0.2989	0.0873	0.2382	0.1391	0.2206	0.2115	0.0819	-0.0254	0.0990	0.0198	0.0849	0.3821	0.0849	0.1518

Task 2: Robust Metrics (development)

System	CG	DH	DG	DZ	D7	EG	FD	FT	HM	PS	PU	PZ	TU	AVG
AM-FM PAR	0.2842	0.0512	0.2879	0.1356	0.0374	0.2452	0.1243	-0.0039	0.1080	0.0192	0.0730	0.4241	0.0872	0.1447

All the results shown in the test tables are the averages of the 4 dimensions evaluated, in each type of task.

Task 1: Multilingual Metrics Turn-Level (test)

Team	EN	ZH	ES	Multilingual AVG	Submission Rank	Team Rank
baseline_t1t	0.2940	0.0753	0.1826	0.1840	11	4
team2_t1t_s1	0.1469	0.1054	0.0808	0.1110	12	5
team4_t1t_s1	0.4818	0.3936	0.5890	0.4881	1	1
team4_t1t_s2	0.2625	0.3096	0.5056	0.3592	6
team4_t1t_s3	0.4795	0.3656	0.5409	0.4620	2
team4_t1t_s4	0.4586	0.3618	0.5412	0.4539	3
team5_t1t_s1	0.3702	0.0701	0.1983	0.2129	9	3
team5_t1t_s2	0.2690	0.1375	0.2281	0.2116	10
team7_t1t_s1	0.1275	0.2557	0.4753	0.2862	7
team7_t1t_s2	0.2314	0.3163	0.5478	0.3652	5
team7_t1t_s3	0.1083	0.2480	0.4799	0.2787	8
team7_t1t_s4	0.2214	0.3112	0.5644	0.3657	4	2
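The "Multilingual AVG" column in the table above is consistent with the plain mean of the EN, ZH and ES columns, and the two rank columns are consistent with sorting submissions by that mean and ranking each team by its best submission. The sketch below reproduces these values from the table; it is an illustration under that assumption, not the organizers' ranking script, and the function name is hypothetical.

```python
from statistics import mean

# (EN, ZH, ES) scores copied from the Task 1 turn-level test table.
scores = {
    "baseline_t1t": (0.2940, 0.0753, 0.1826),
    "team2_t1t_s1": (0.1469, 0.1054, 0.0808),
    "team4_t1t_s1": (0.4818, 0.3936, 0.5890),
    "team4_t1t_s2": (0.2625, 0.3096, 0.5056),
    "team4_t1t_s3": (0.4795, 0.3656, 0.5409),
    "team4_t1t_s4": (0.4586, 0.3618, 0.5412),
    "team5_t1t_s1": (0.3702, 0.0701, 0.1983),
    "team5_t1t_s2": (0.2690, 0.1375, 0.2281),
    "team7_t1t_s1": (0.1275, 0.2557, 0.4753),
    "team7_t1t_s2": (0.2314, 0.3163, 0.5478),
    "team7_t1t_s3": (0.1083, 0.2480, 0.4799),
    "team7_t1t_s4": (0.2214, 0.3112, 0.5644),
}


def rank_submissions(scores):
    """Return (averages, submission_rank, team_rank) for a score table."""
    averages = {name: mean(values) for name, values in scores.items()}
    ordered = sorted(averages, key=averages.get, reverse=True)
    submission_rank = {name: pos for pos, name in enumerate(ordered, start=1)}
    # A team is ranked by its best submission, i.e. the first one met in order.
    team_rank = {}
    for name in ordered:
        team = name.split("_")[0]
        if team not in team_rank:
            team_rank[team] = len(team_rank) + 1
    return averages, submission_rank, team_rank


averages, submission_rank, team_rank = rank_submissions(scores)
for name in sorted(submission_rank, key=submission_rank.get):
    print(f"{submission_rank[name]:>2}  {name:<14} {averages[name]:.4f}")
print(team_rank)
```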

Task 1: Multilingual Metrics Dialogue-Level (test)

Team	EN	ZH	ES	Multilingual AVG	Submission Rank	Team Rank
baseline_t1d	0.2414	0.4648	0.8080	0.5047	4	2
team4_t1d_s1	0.5342	0.7133	0.8080	0.6852	1	1
team4_t1d_s2	0.3295	0.7030	0.2500	0.4275	5
team4_t1d_s3	0.5251	0.6701	0.8080	0.6677	2
team4_t1d_s4	0.5039	0.5859	0.5915	0.5604	3
team5_t1d_s1	0.1865	0.1356	0.6830	0.3350	6	3

Task 2: Robust Metrics Turn-Level (test)

Team	Robust AVG	Submission Rank	Team Rank
baseline_t2t	0.3387	7	4
team1_t2t_s1	0.1537	11	6
team3_t2t_s1	0.1306	13
team3_t2t_s2	0.1277	14
team3_t2t_s3	0.1469	12
team3_t2t_s4	0.2697	9	5
team4_t2t_s1	0.4890	1	1
team4_t2t_s2	0.3320	8
team4_t2t_s3	0.4756	2
team4_t2t_s4	0.4427	3
team6_t2t_s1	0.4190	4	2
team6_t2t_s2	0.1742	10
team6_t2t_s3	0.0807	15
team7_t2t_s1	0.3833	5	3
team7_t2t_s2	0.3643	6

Task 2: Robust Metrics Dialogue-Level (test)

Team	Robust AVG	Submission Rank	Team Rank
baseline_t2d	0.4800	1	1
team1_t2d_s1	0.1111	8	4
team3_t2d_s1	0.2196	6	3
team3_t2d_s2	0.1453	7
team4_t2d_s1	0.3031	2	2
team4_t2d_s2	0.2335	5
team4_t2d_s3	0.2979	3
team4_t2d_s4	0.2492	4

More results for Task 1 and Task 2 can be found here.

Registration Details

To become an official DSTC11 Track 4 participant, you must be registered at this Microsoft Form. Once registered, you will be able to download the datasets and readme documents as well as submit your results at https://chateval.org/dstc11.

There must be only one team per laboratory or research group. The members of the same team must be under a single registration; that is, the team leader must register the entire team by giving the members' e-mail addresses in addition to their own.

Any updates and information about the tracks will be posted on the DSTC11 official website, or check the DSTC Mailing List.

Submission Details

Before submitting your results, do not forget to Sign Up on the ChatEval website. Only the team leader must register on ChatEval, with the same name and email address entered in the Microsoft Form. Once you have signed up, you can Log In and Submit your evaluations.

There are four different evaluations to test the models, namely:

Task 1 - Turn-Level
Task 1 - Dialogue-Level
Task 2 - Turn-Level
Task 2 - Dialogue-Level

Each task has annotations at turn-level and dialogue-level, so the models will be evaluated at turn-level and at dialogue-level independently for each task; the two levels will not be combined at any point. That is, for Task 1 the turn-level and dialogue-level models will be evaluated separately, and likewise for Task 2.

You can participate in as many evaluations as you want. Whether you participate in one, several or all of them, the scores obtained in each evaluation are independent of the others and will not be combined into a final score. There will be a table with the scores obtained for each of the 4 evaluations.

You can submit as many score files as you want for each evaluation, but only the last 5 files submitted for each type of evaluation in ChatEval will be valid and will count in the ranking to participate in the competition. Moreover, only the evaluations submitted by the team leader registered in the Microsoft form will be considered and count towards the competition.

Test data evaluations must be named appropriately before submission. The annotated test files should be named as follows:

<team_name>_task1_turn_v<x>.csv
<team_name>_task1_dial_v<x>.csv
<team_name>_task2_turn_v<x>.csv
<team_name>_task2_dial_v<x>.csv

Please specify clearly in the submission name which evaluation it is intended for, the team name in <team_name> and the submission version <x> to identify the submission.
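The naming scheme above can be checked before uploading with a small helper such as the following sketch. It assumes that the submission version `<x>` is a non-negative integer and that `<team_name>` may be any non-empty string; neither constraint is stated on this page, and the function names are hypothetical.

```python
import re

# Pattern for <team_name>_task<1|2>_<turn|dial>_v<x>.csv
# Assumption: <x> is a non-negative integer.
SUBMISSION_PATTERN = re.compile(
    r"^(?P<team>.+)_task(?P<task>[12])_(?P<level>turn|dial)_v(?P<version>\d+)\.csv$"
)


def submission_filename(team_name, task, level, version):
    """Build a file name for one of the four evaluations."""
    if task not in (1, 2):
        raise ValueError("task must be 1 or 2")
    if level not in ("turn", "dial"):
        raise ValueError("level must be 'turn' or 'dial'")
    return f"{team_name}_task{task}_{level}_v{version}.csv"


def parse_submission_filename(filename):
    """Return the parts of a valid file name, or None if it does not match."""
    match = SUBMISSION_PATTERN.match(filename)
    if match is None:
        return None
    return {
        "team": match.group("team"),
        "task": int(match.group("task")),
        "level": match.group("level"),
        "version": int(match.group("version")),
    }


name = submission_filename("myteam", task=1, level="turn", version=2)
print(name)
print(parse_submission_filename(name))
print(parse_submission_filename("myteam_task3_turn_v1.csv"))
```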

Schedule

Training/Validation data release: Dec 14, 2022
Test data release: Mar 29, 2023
Entry submission deadline: Apr 3, 2023 (23:59 Anywhere on Earth (AoE), UTC-12)
Final result announcement: Apr 14, 2023
Paper submission: June 2nd, 2023
Workshop: September 11 or 12, at SIGDIAL x INLG 2023 in Prague, Czech Republic

Organizers

Mario Rodríguez-Cantelar (Universidad Politécnica de Madrid, Spain)
Chen Zhang (National University of Singapore, Singapore)
Chengguang Tang (Tencent AI Lab, China)
Ke Shi (Tencent AI Lab, China)
Sarik Ghazarian (University of Southern California, USA)
João Sedoc (New York University, USA)
Luis F. D'Haro (Universidad Politécnica de Madrid, Spain)
Alexander Rudnicky (Carnegie Mellon University, USA)

Contact

If you have further questions regarding the data, please contact us at dstc11-robust-multilingual-automatic-evaluation@googlegroups.com.

Citation

Please cite the paper, code or data from DSTC 11 Track 4:

@inproceedings{rodriguezcantelar2023dstc11t4,
author = "Mario Rodríguez-Cantelar and Chen Zhang and Chengguang Tang and Ke Shi and Sarik Ghazarian and João Sedoc and Luis Fernando D'Haro and Alexander Rudnicky",
title = "Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4",
booktitle = "DSTC11: The Eleventh Dialog System Technology Challenge",
series = "24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)",
year = 2023,
month = "September",
address = "Prague, Czechia"
}

Acknowledgement

This research project is supported by the Comunidad de Madrid through the call Research Grants for Young Investigators from Universidad Politécnica de Madrid (GENIUS:APOYO-JOVENES-21-TAXTYC-32-K61X37).

This work is supported by project BEWORD (PID2021-126061OB-C43) funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union”, and by Programa Propio - Proyectos Semilla: Universidad Politécnica de Madrid (VSEMILLA22LFHE).

We gratefully acknowledge the valuable efforts of Tencent AI Lab, which supported the Chinese translation and annotation of datasets through funding and infrastructure.

Thanks to THU-CoAI (Conversational AI groups from Tsinghua University) for providing their Chinese datasets as part of the challenge data.

Thanks to Unbabel for providing the COMET MTQE score annotations as part of the challenge data. This contribution was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references PRT/BD/152198/2021 and UIDB/50021/2020, and by the P2020 program MAIA led by Unbabel (LISBOA-01-0247-FEDER-045909).

We also want to give thanks to MS Azure services (especially to Irving Kwong) for their sponsorship to continue processing new datasets that could be interesting for the dialogue community.

This research project is supported by the NYU ChatEval Team led by João Sedoc.

This research project is supported in part by a grant from Amazon to Alexander Rudnicky, Carnegie Mellon University.

Thanks to Karthik Ganesan, Sarik Ghazarian, James Hagerty, Zhang Chen and Alex Rudnicky for developing the baseline model as part of the challenge tasks.

This work is supported by the European Commission through Project ASTOUND (101071191 - HORIZON-EIC-2021-PATHFINDERCHALLENGES-01).

FAQ

How much does it cost to participate in this Track?

This Track is currently free for everyone.

References

Deriu, J., Rodrigo, A., Otegi, A., Echegoyen, G., Rosset, S., Agirre, E., & Cieliebak, M. (2020). Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 1-56.
Zhang, C., D'Haro, L. F., Friedrichs, T., & Li, H. (2021). MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation. arXiv preprint arXiv:2112.07194.
Zhang, C., D'Haro, L. F., Banchs, R. E., Friedrichs, T., & Li, H. (2020). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In Conversational Dialogue Systems for the Next Decade (pp. 53-69). Springer, Singapore.
Zhang, C., Sedoc, J., D'Haro, L. F., Banchs, R., & Rudnicky, A. (2021). Automatic Evaluation and Moderation of Open-domain Dialogue Systems. arXiv preprint arXiv:2111.02110.
Hori, C., & Hori, T. (2017). End-to-end conversation modeling track in DSTC6. arXiv preprint arXiv:1706.07440.
Huang, L., Ye, Z., Qin, J., Lin, L., & Liang, X. (2020, November). GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 9230-9240).
Gupta, P., Mehri, S., Zhao, T., Pavel, A., Eskenazi, M., & Bigham, J. P. (2019, September). Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue (pp. 379-391).
Zhao, T., Lala, D., & Kawahara, T. (2020, July). Designing Precise and Robust Dialogue Response Evaluators. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 26-33).
Galley, M., Brockett, C., Gao, X., Gao, J., & Dolan, B. (2019). Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop.
Mehri, S., & Eskenazi, M. (2020, July). Unsupervised Evaluation of Interactive Dialog with DialoGPT. In Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 225-235).
Merdivan, E., Singh, D., Hanke, S., Kropf, J., Holzinger, A., & Geist, M. (2020). Human annotated dialogues dataset for natural conversational agents. Applied Sciences, 10(3), 762.
See, A., Roller, S., Kiela, D., & Weston, J. (2019, June). What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 1702-1723).
Mehri, S., & Eskenazi, M. (2020). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. arXiv preprint arXiv:2005.00456.
Rudnicky, A., Banchs, R., D'Haro, L. F., Sedoc, J., Chen, Z., Rodríguez-Cantelar, M., Koh, A., & others. (2020). CHANEL-Metrics: Chat/Dialogue Modeling and Evaluation report. In 2020 Seventh Frederick Jelinek Memorial Summer Workshop.
Sedoc, J., Ippolito, D., Kirubarajan, A., Thirani, J., Ungar, L., & Callison-Burch, C. (2019, June). Chateval: A tool for chatbot evaluation. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations) (pp. 60-65).
Vinyals, O., & Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.
Lee, S., Lim, H., & Sedoc, J. (2020). An evaluation protocol for generative conversational systems. arXiv preprint arXiv:2010.12741.
Higashinaka, R., Funakoshi, K., Kobayashi, Y., & Inaba, M. (2016, May). The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 3146-3150).
Zhou, K., Prabhumoye, S., & Black, A. W. (2018). A dataset for document grounded conversations. arXiv preprint arXiv:1809.07358.
Danescu-Niculescu-Mizil, C., & Lee, L. (2011). Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. arXiv preprint arXiv:1106.3077.
Li, Y., Su, H., Shen, X., Li, W., Cao, Z., & Niu, S. (2017). Dailydialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
Nie, Y., Williamson, M., Bansal, M., Kiela, D., & Weston, J. (2020). I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling. arXiv preprint arXiv:2012.13391.
Chen, S. Y., Hsu, C. C., Kuo, C. C., & Ku, L. W. (2018). Emotionlines: An emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379.
Rashkin, H., Smith, E. M., Li, M., & Boureau, Y. L. (2018). Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207.
Moghe, N., Arora, S., Banerjee, S., & Khapra, M. M. (2018). Towards exploiting background knowledge for building conversation systems. arXiv preprint arXiv:1809.08205.
Adiwardana, D., Luong, M. T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., ... & Le, Q. V. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2018). Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.
Lee, S., Schulz, H., Atkinson, A., Gao, J., Suleman, K., El Asri, L., ... & Li, X. (2019). Multi-domain task-completion dialog challenge. Dialog system technology challenges, 8(9).
Banchs, R. E. (2012, July). Movie-DiC: a movie dialogue corpus for research and development. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 203-207).
Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., & Weston, J. (2018). Personalizing dialogue agents: I have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243.
Upadhayay, B., & Behzadan, V. (2020, November). Sentimental LIAR: Extended Corpus and Deep Learning Models for Fake Claim Classification. In 2020 IEEE International Conference on Intelligence and Security Informatics (ISI) (pp. 1-6). IEEE.
Cervone, A., & Riccardi, G. (2020). Is this dialogue coherent? learning from dialogue acts and entities. arXiv preprint arXiv:2006.10157.
Gopalakrishnan, K., Hedayatnia, B., Chen, Q., Gottardi, A., Kwatra, S., Venkatesh, A., ... & AI, A. A. (2019, January). Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In INTERSPEECH (pp. 1891-1895).
Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., & Weston, J. (2018). Wizard of wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.
D'Haro, L. F., Shawar, B. A., & Yu, Z. (2016). REWOCHAT 2016–Shared task description report. In Proceedings of the workshop on collecting and generating resources for chatbots and conversational agents-development and evaluation (RE-WOCHAT) (p. 39).
Zhou, H., Huang, M., Zhang, T., Zhu, X., & Liu, B. (2018, April). Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
Zhou, H., Zheng, C., Huang, K., Huang, M., & Zhu, X. (2020). Kdconv: A chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation. arXiv preprint arXiv:2004.04100.
Wang, Y., Ke, P., Zheng, Y., Huang, K., Jiang, Y., Zhu, X., & Huang, M. (2020, October). A large-scale chinese short-text conversation dataset. In CCF International Conference on Natural Language Processing and Chinese Computing (pp. 91-103). Springer, Cham.
Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. arXiv preprint arXiv:2009.09025.
Damodaran, P. (2021). Parrot: Paraphrase generation for NLU.
Zhou, L., Gao, J., Li, D., & Shum, H. Y. (2020). The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1), 53-93.
Bao, S., He, H., Wang, F., Wu, H., & Wang, H. (2019). Plato: Pre-trained dialogue generation model with discrete latent variable. arXiv preprint arXiv:1910.07931.
Mehri, S., Choi, J., D'Haro, L. F., Deriu, J., Eskenazi, M., Gasic, M., ... & Zhang, C. (2022). Report from the nsf future directions workshop on automatic evaluation of dialog: Research directions and challenges. arXiv preprint arXiv:2203.10012.