ChatEval

ChatEval is a scientific framework for evaluating open-domain chatbots. Researchers can submit their trained models to effortlessly receive comparisons with baselines and prior work. Because all evaluation code is open source, evaluation is performed in a standardized and transparent way. Additionally, open-source baseline models and an ever-growing collection of public evaluation sets are available for public use.

Upload your system

FAQ

How much does ChatEval cost?

ChatEval is currently free for academic researchers. It is actively developed by NLP researchers at New York University.

Is there an online demo video?

You can find a video tutorial for ChatEval here.

How was ChatEval built?

The ChatEval webapp is built with Django (back-end) and React (front-end) and uses the Magnitude word embedding format for evaluation. Our source code is available on GitHub.

What should I cite?
@InProceedings{N19-4011,
  author    = "Sedoc, Jo{\~a}o and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris",
  title     = "ChatEval: A Tool for Chatbot Evaluation",
  booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  pages     = "60--65",
  location  = "Minneapolis, Minnesota",
  url       = "http://aclweb.org/anthology/N19-4011"
}

About Evaluation

Model responses are generated on an evaluation dataset of prompts and then uploaded to ChatEval. The responses are scored with a suite of automatic evaluation metrics and compared against selected baseline/ground-truth models (e.g., humans).

Evaluation Datasets

ChatEval offers evaluation datasets consisting of prompts that uploaded chatbots are to respond to. Evaluation datasets are available to download for free and have corresponding baseline models.

Neural Conversational Model
Supported ChatEval Dataset

In (Vinyals and Le 2015), human evaluation is conducted on a set of 200 hand-picked prompts. We make this dataset available for comparison.

Download Dataset

Dialogue Breakdown Detection Challenge
Supported ChatEval Dataset

The DBDC dataset consists of a series of text-based conversations between a human and a chatbot where the human was aware they were chatting with a computer (Higashinaka et al. 2016).

Download Dataset

Open Subtitles
Supported ChatEval Dataset

This evaluation dataset contains a random subset of 200 prompts from the English OpenSubtitles 2009 dataset (Tiedemann 2009).

Download Dataset

Twitter
Supported ChatEval Dataset

The random Twitter test set is a random subset of 200 prompts from the ParlAI Twitter-derived test set.

Download Dataset

Cornell Movie Dialogue Corpus
Supported ChatEval Dataset

The Cornell Movie Dialogue Corpus (Danescu-Niculescu-Mizil and Lee, 2011) contains accurate speaker annotations. We use 1000 prompts selected by Baheti et al., 2018.

Download Dataset

Dialogue Breakdown Detection Challenge multi-turn dataset
Supported ChatEval Dataset

This dataset is for the Next Utterance Recovery task, a shared task at WOCHAT+DBDC 2020. It is derived from the Third Dialogue Breakdown Detection Challenge: we selected the most difficult turns in that dataset and use them to evaluate next-utterance generation.

Download Dataset

ESL three turn snippets
Supported ChatEval Dataset

This dataset consists of English-as-a-second-language learning conversations from ESL, Inc. (https://eslfast.com). The ChatEval team selected three-turn snippets from 200 conversations.

Download Dataset

Persona Chat - USR Evaluations
Supported ChatEval Dataset

This evaluation dataset consists of 300 model responses: five different models each respond to 60 prompts sourced from PersonaChat (Zhang et al., 2018). Three human annotators rated each model response on six scoring categories (Mehri & Eskenazi, 2020).

Download Dataset

Topical Chat - USR Evaluations
Supported ChatEval Dataset

This evaluation dataset consists of 360 model responses: six different models each respond to 60 prompts sourced from TopicalChat (Gopalakrishnan et al., 2019). Three human annotators rated each model response on six scoring categories (Mehri & Eskenazi, 2020).

Download Dataset

Daily Dialog - Gupta
Supported ChatEval Dataset

This evaluation dataset provides model responses and multiple references for the DailyDialog dataset (Li et al., 2017), open-sourced by Gupta et al. (2019).

Download Dataset

Daily Dialog - Zhao
Supported ChatEval Dataset

This evaluation dataset provides model responses and multiple references for the DailyDialog dataset (Li et al., 2017), open-sourced by Zhao et al. (2020).

Download Dataset

Daily Dialog - Huang (GRADE)
Supported ChatEval Dataset

This evaluation dataset provides model responses and human annotations for the DailyDialog dataset (Li et al., 2017), open-sourced by Huang et al. (2020).

Download Dataset

ConvAI2 - Huang (GRADE)
Supported ChatEval Dataset

This evaluation dataset provides model responses and human annotations for the ConvAI2 dataset (Dinan et al., 2019), provided by Huang et al. (2020).

Download Dataset

Empathetic - Huang (GRADE)
Supported ChatEval Dataset

This evaluation dataset provides model responses and human annotations for the EmpatheticDialogues dataset (Rashkin et al., 2019), provided by Huang et al. (2020).

Download Dataset

DSTC6
Supported ChatEval Dataset

This evaluation dataset provides model responses and human annotations for the DSTC6 dataset, provided by Hori et al.

Download Dataset

DSTC7
Supported ChatEval Dataset

This evaluation dataset consists of conversational data from Reddit, along with contextual "facts" taken from the websites that started each Reddit conversation. The dataset is provided by Galley et al. (2019).

Download Dataset

Persona Chat - Zhao
Supported ChatEval Dataset

This evaluation dataset provides model responses and multiple references for the PersonaChat dataset (Zhang et al., 2018); the responses and annotations were open-sourced by Zhao et al. (2020).

Download Dataset

ChatEval Baselines

ChatEval offers "ground-truth" baselines to compare uploaded models with. Baseline models range from human responders to established chatbot models. Baselines are handpicked and uploaded by the ChatEval Team.

NCM Human 1 Baseline
ChatEval team collected responses to NCM prompts

View Model

NCM Human 2 Baseline
ChatEval team collected responses to NCM prompts

View Model

DBDC Human Baseline 1
ChatEval team collected responses to DBDC prompts

View Model

OS Baseline 1
Actual responses to prompts from Open Subtitles

View Model

Twitter Baseline
Actual responses to prompts from the ParlAI Twitter dataset

View Model

Cornell Movie DC Baseline
Actual responses from the Cornell Movie Dialogue Corpus to the prompts

View Model

ESL 3 Human Baseline
The actual written continuations of the ESL conversation snippets

View Model

Automated Evaluation Systems

The ChatEval Platform handles certain automated evaluations of chatbot responses. These metrics are documented here. Systems can be ranked according to a specific metric and viewed as a leaderboard.

Average Sentence Length
Metric

The average number of tokens in the model's responses.

View Source
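As a minimal sketch of how this metric can be computed (assuming simple whitespace tokenization; ChatEval's linked implementation may tokenize differently):

```python
def avg_sentence_length(responses):
    """Mean number of tokens per model response, using whitespace tokenization."""
    tokenized = [response.split() for response in responses]
    return sum(len(tokens) for tokens in tokenized) / len(tokenized)
```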

Distinct 1
Metric

The number of unique unigrams in the model's responses divided by the total number of generated tokens.

View Source

Distinct 2
Metric

The number of unique bigrams in the model's responses divided by the total number of generated tokens.

View Source
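Distinct-1 and Distinct-2 differ only in the n-gram order, so one hedged sketch covers both (whitespace tokenization assumed; ChatEval's own implementation is linked above):

```python
def distinct_n(responses, n):
    """Number of unique n-grams across all responses divided by the
    total number of generated tokens."""
    ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_tokens
```

`distinct_n(responses, 1)` gives Distinct-1 and `distinct_n(responses, 2)` gives Distinct-2; higher values indicate less repetitive output.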

Embedding Greedy Match Score
Metric

Greedy matching between word embeddings of target utterance and model utterance (Rus et al., 2012).

View Source
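A minimal sketch of greedy matching, using a tiny hand-made embedding table (`EMB`, purely illustrative) in place of the Magnitude vectors ChatEval actually loads:

```python
import math

# Toy word vectors standing in for real pretrained embeddings.
EMB = {"hi": [1.0, 0.0], "there": [0.0, 1.0], "hello": [0.9, 0.1]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_half(ref, hyp):
    # Each reference token greedily takes its best cosine match in the hypothesis.
    return sum(max(cosine(EMB[r], EMB[h]) for h in hyp) for r in ref) / len(ref)

def greedy_match(ref, hyp):
    # Averaged over both directions so the score is symmetric.
    return 0.5 * (greedy_half(ref, hyp) + greedy_half(hyp, ref))
```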

Embedding Extrema Score
Metric

Vector extrema of a model's response token vectors (Forgues et al., 2014).

View Source
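A hedged sketch of the extrema approach: per embedding dimension, keep the most extreme value across the utterance's tokens, then compare the resulting sentence vectors by cosine similarity (the toy `EMB` table is illustrative only):

```python
import math

# Toy word vectors standing in for real pretrained embeddings.
EMB = {"hi": [1.0, 0.2], "there": [-0.3, 1.0]}

def extrema_vector(tokens):
    # Per dimension, keep the value of largest magnitude across the tokens.
    dims = len(next(iter(EMB.values())))
    vec = []
    for d in range(dims):
        values = [EMB[t][d] for t in tokens]
        hi, lo = max(values), min(values)
        vec.append(hi if abs(hi) >= abs(lo) else lo)
    return vec

def extrema_score(ref, hyp):
    u, v = extrema_vector(ref), extrema_vector(hyp)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```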

Average Embedding Score
Metric

Average of a model's response token vectors (Liu et al., 2016).

View Source
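This metric can be sketched as the cosine similarity between the mean token vectors of the two utterances (again with a toy, purely illustrative `EMB` table):

```python
import math

# Toy word vectors standing in for real pretrained embeddings.
EMB = {"hi": [1.0, 0.0], "there": [0.0, 1.0]}

def avg_vector(tokens):
    dims = len(next(iter(EMB.values())))
    return [sum(EMB[t][d] for t in tokens) / len(tokens) for d in range(dims)]

def average_embedding_score(ref, hyp):
    # Cosine similarity between the mean vectors of the two utterances.
    u, v = avg_vector(ref), avg_vector(hyp)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```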

BLEU Score
Metric

The Bilingual Evaluation Understudy (BLEU) score is a metric for evaluating a generated sentence against a reference sentence (Papineni et al., 2002).

View Source
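A simplified single-reference sentence-level BLEU can be sketched as the geometric mean of modified n-gram precisions times a brevity penalty; production implementations (e.g. NLTK's) add smoothing and multi-reference support that this sketch omits:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified single-reference BLEU: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
        if overlap == 0:
            return 0.0  # No smoothing: any zero precision zeroes the score.
        log_precisions.append(math.log(overlap / sum(hyp_ngrams.values())))
    # Brevity penalty discourages hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```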

References

Higashinaka, Ryuichiro, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. "The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics." In LREC. 2016.

Liu, Chia-Wei, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation." In EMNLP, pp. 2122–2132. Association for Computational Linguistics, 2016.

Forgues, Gabriel, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. "Bootstrapping dialog systems with word embeddings." In NIPS, modern machine learning and natural language processing workshop, vol. 2. 2014.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Rus, Vasile, and Mihai Lintean. "A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 157-162. Association for Computational Linguistics, 2012.

Tiedemann, Jörg. "News from OPUS-A collection of multilingual parallel corpora with tools and interfaces." In Recent advances in natural language processing, vol. 5, pp. 237-248. 2009.

Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).