ChatEval

ChatEval is a scientific framework for evaluating open-domain chatbots. Researchers can submit their trained models to effortlessly receive comparisons with baselines and prior work. Since all evaluation code is open source, we ensure evaluation is performed in a standardized and transparent way. Additionally, open-source baseline models and an ever-growing collection of public evaluation sets are available for public use.

Upload your method

FAQ

How much does ChatEval cost?

ChatEval is currently free for academic researchers. It is actively developed by the NLP Group of the University of Pennsylvania.

Is there an online demo video?

You can find a video tutorial for ChatEval here.

How was ChatEval built?

The ChatEval web application is built with Django and a React front-end, and uses the Magnitude word embedding format for evaluation. Our source code is available on GitHub.
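
As a concrete illustration, embeddings stored in the Magnitude format can be loaded and queried with the pymagnitude package. This is a minimal sketch; the file name below is a placeholder, not necessarily the embedding file ChatEval itself ships with.

# Minimal sketch: loading and querying word vectors stored in the Magnitude
# format via pymagnitude. The file path is a placeholder.
from pymagnitude import Magnitude

vectors = Magnitude("glove.6B.300d.magnitude")  # placeholder path
print(vectors.dim)                              # dimensionality of the vectors
print(vectors.query("chatbot"))                 # embedding for a single word
print(vectors.similarity("hi", "hello"))        # similarity between two words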

What should I cite?
@InProceedings{W18-6709,
  author    = "Sedoc, Jo{\~a}o and Ippolito, Daphne and Kirubarajan, Arun and Thirani, Jai and Ungar, Lyle and Callison-Burch, Chris",
  title     = "ChatEval: A Tool for the Systematic Evaluation of Chatbots",
  booktitle = "Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS{\&}NLG)",
  year      = "2018",
  publisher = "Association for Computational Linguistics",
  pages     = "42--44",
  location  = "Tilburg, the Netherlands",
  url       = "http://aclweb.org/anthology/W18-6709"
}

About Evaluation

Model responses are generated on an evaluation dataset of prompts and then uploaded to ChatEval. The responses are scored with a series of automatic evaluation metrics and compared against selected baseline/ground-truth models (e.g., human responses).
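
The upload file format is not spelled out here, so the sketch below only illustrates the workflow in Python: read an evaluation dataset of prompts, generate one response per prompt with your model, and write the responses out for upload. The respond callable is a hypothetical stand-in for your chatbot's inference code, and the one-response-per-line output is an assumed format rather than the official ChatEval specification.

# Workflow sketch: prompts in, one model response per prompt out.
# `respond` is a placeholder for your chatbot's inference call, and the
# one-response-per-line file format is an assumption, not ChatEval's spec.
def generate_responses(prompt_path, output_path, respond):
    with open(prompt_path, encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]
    with open(output_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # Keep one response per line so responses stay aligned with prompts.
            f.write(respond(prompt).replace("\n", " ") + "\n")

# Example with a trivial placeholder model:
generate_responses("ncm_prompts.txt", "my_model_responses.txt",
                   respond=lambda prompt: "I see. Tell me more.")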

Evaluation Datasets

ChatEval offers evaluation datasets consisting of prompts to which uploaded chatbots respond. Evaluation datasets are free to download and have corresponding baseline models.

Neural Conversational Model
Supported ChatEval Dataset

In (Vinyals and Le 2015), human evaluation is conducted on a set of 200 hand-picked prompts. We make this dataset available for comparison.

Download Dataset

Dialogue Breakdown Detection Challenge
Supported ChatEval Dataset

The DBDC dataset consists of a series of text-based conversations between a human and a chatbot where the human was aware they were chatting with a computer (Higashinaka et al. 2016).

Download Dataset

Open Subtitles
Supported ChatEval Dataset

This evaluation dataset contains a random subset of 200 prompts from the English OpenSubtitles 2009 dataset (Tiedemann 2009).

Download Dataset

Twitter
Supported ChatEval Dataset

The random Twitter test set is a random subset of 200 prompts from the ParlAI Twitter-derived test set.

Download Dataset

Cornell Movie Dialogue Corpus
Supported ChatEval Dataset

The Cornell Movie Dialogue Corpus (Danescu-Niculescu-Mizil and Lee, 2011) contains accurate speaker annotations. We use 1000 prompts selected by Baheti et al., 2018.

Download Dataset

ChatEval Baselines

ChatEval offers "ground-truth" baselines against which uploaded models are compared. Baseline models range from human responders to established chatbot models. Baselines are hand-picked and uploaded by the ChatEval team.

NCM Human 1 Baseline
ChatEval team collected responses to NCM prompts

View Model

NCM Human 2 Baseline
ChatEval team collected responses to NCM prompts

View Model

DBDC Human Baseline 1
ChatEval team collected responses to DBDC prompts

View Model

OS Baseline 1
Actual responses to prompts from Open Subtitles

View Model

Twitter Baseline
Actual responses to prompts from the ParlAI Twitter dataset

View Model

Cornell Movie DC Baseline
Actual responses to the prompts from the Cornell Movie Dialogue Corpus

View Model

Automated Evaluation Methods

The ChatEval Platform handles certain automated evaluations of chatbot responses. These metrics are documented here. Models can be ranked according to a specific metric and viewed as a leaderboard.

Average Sentence Length
Metric

The average number of tokens in the model's responses.

View Source

Distinct 1
Metric

The number of unique unigrams in the model's responses divided by the total number of generated tokens.

View Source

Distinct 2
Metric

The number of unique bigrams in the model's responses divided by the total number of generated tokens.

View Source
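
The three token-level statistics above (average sentence length, Distinct-1, and Distinct-2) can be sketched in a few lines of Python. Whitespace tokenization is an assumption here; the implementation linked from View Source may tokenize and normalize responses differently.

# Sketch of the token-level metrics: average sentence length and Distinct-n.
# Whitespace tokenization is an assumption, not ChatEval's exact preprocessing.
def token_stats(responses):
    tokenized = [r.split() for r in responses]
    total_tokens = sum(len(toks) for toks in tokenized)
    avg_len = total_tokens / len(tokenized)

    def distinct_n(n):
        # Unique n-grams across all responses divided by total generated tokens.
        ngrams = set()
        for toks in tokenized:
            ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        return len(ngrams) / total_tokens

    return {"avg_sentence_length": avg_len,
            "distinct_1": distinct_n(1),
            "distinct_2": distinct_n(2)}

print(token_stats(["i do not know", "i do not know what you mean"]))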

Embedding Greedy Match Score
Metric

Greedy matching between the word embeddings of the target utterance and the model utterance (Rus and Lintean, 2012).

View Source
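
A sketch of greedy matching in its usual formulation: each token in one utterance is matched to its most similar token in the other by cosine similarity, the maxima are averaged, and the score is symmetrized by swapping the two utterances. The embed argument is a placeholder word-embedding lookup (for example, a Magnitude query), and the ChatEval implementation may differ in details.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def greedy_match(reference_tokens, response_tokens, embed):
    # embed(token) -> embedding vector; placeholder lookup, e.g. Magnitude's query.
    def one_direction(src, tgt):
        # For each source token, take its best match in the target, then average.
        return sum(max(cosine(embed(s), embed(t)) for t in tgt) for s in src) / len(src)
    return 0.5 * (one_direction(reference_tokens, response_tokens)
                  + one_direction(response_tokens, reference_tokens))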

Embedding Extrema Score
Metric

Vector extrema of a model's response token vectors (Forgues et al., 2014).

View Source
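
A sketch of the vector-extrema comparison as commonly described: for each embedding dimension, keep the value with the largest absolute magnitude across the utterance's token vectors, then compare the resulting sentence vectors by cosine similarity. Again, embed is a placeholder embedding lookup and the released ChatEval code may differ.

import numpy as np

def extrema_vector(tokens, embed):
    # Per dimension, keep the most extreme value (largest absolute magnitude).
    mat = np.stack([embed(t) for t in tokens])
    max_vals, min_vals = mat.max(axis=0), mat.min(axis=0)
    return np.where(np.abs(max_vals) >= np.abs(min_vals), max_vals, min_vals)

def extrema_score(reference_tokens, response_tokens, embed):
    a = extrema_vector(reference_tokens, embed)
    b = extrema_vector(response_tokens, embed)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))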

Average Embedding Score
Metric

Average of the token vectors in a model's response.

View Source
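
A sketch of the embedding-average comparison: the token vectors of each utterance are averaged into a single sentence vector, and the two sentence vectors are compared by cosine similarity (cf. Liu et al., 2016). embed is the same placeholder lookup as in the sketches above.

import numpy as np

def embedding_average(tokens, embed):
    # Mean of the utterance's token vectors.
    return np.mean(np.stack([embed(t) for t in tokens]), axis=0)

def average_score(reference_tokens, response_tokens, embed):
    a = embedding_average(reference_tokens, embed)
    b = embedding_average(response_tokens, embed)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))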

BLEU Score
Metric

The Bilingual Evaluation Understudy (BLEU) score is a metric for evaluating a generated sentence against a reference sentence (Papineni et al., 2002).

View Source
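
For illustration, sentence-level BLEU can be computed with NLTK's implementation. Smoothing is applied here because short chatbot responses rarely contain matching higher-order n-grams; this is a generic sketch, not necessarily the exact configuration ChatEval uses.

# Generic sentence-level BLEU sketch using NLTK (not ChatEval's exact setup).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i do not know what you mean".split()
hypothesis = "i do not know".split()

# Smoothing avoids zero scores when 3-gram/4-gram matches are absent,
# which is common for short dialogue responses.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 4))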

References

Higashinaka, Ryuichiro, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. "The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics." In LREC. 2016.

Liu, Chia-Wei, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation." In EMNLP, pp. 2122-2132. Association for Computational Linguistics, 2016.

Forgues, Gabriel, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. "Bootstrapping dialog systems with word embeddings." In NIPS, modern machine learning and natural language processing workshop, vol. 2. 2014.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Rus, Vasile, and Mihai Lintean. "A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 157-162. Association for Computational Linguistics, 2012.

Tiedemann, Jörg. "News from OPUS - A collection of multilingual parallel corpora with tools and interfaces." In Recent Advances in Natural Language Processing, vol. 5, pp. 237-248. 2009.

Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).