About Evaluation

Model responses are generated on an evaluation dataset of prompts and then uploaded to ChatEval. The responses are then scored with a series of automatic evaluation metrics and compared against selected baseline/ground-truth models (e.g. human responders).
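
As a rough illustration of that workflow, the sketch below reads a dataset's prompts, generates one response per prompt, and writes the responses to a file for upload. The generate_response placeholder and the one-response-per-line file layout are assumptions for illustration, not ChatEval's documented upload format.

    # Minimal sketch of the response-generation step (assumed file layout).
    def generate_response(prompt: str) -> str:
        # Placeholder: call your own chatbot model here.
        return "hello!"

    with open("evaluation_prompts.txt", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]

    with open("my_model_responses.txt", "w", encoding="utf-8") as out:
        for prompt in prompts:
            out.write(generate_response(prompt) + "\n")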

Evaluation Datasets

ChatEval offers evaluation datasets, each consisting of prompts for uploaded chatbots to respond to. Evaluation datasets are free to download and have corresponding baseline models.

Neural Conversational Model
Supported ChatEval Dataset

Vinyals and Le (2015) conduct human evaluation on a set of 200 hand-picked prompts. We make this dataset available for comparison.

Dialogue Breakdown Detection Challenge
Supported ChatEval Dataset

The DBDC dataset consists of a series of text-based conversations between a human and a chatbot where the human was aware they were chatting with a computer (Higashinaka et al. 2016).

Open Subtitles
Supported ChatEval Dataset

This evaluation dataset contains a random subset of 200 prompts from the English OpenSubtitles 2009 dataset (Tiedemann 2009).

Twitter
Supported ChatEval Dataset

The Twitter evaluation set is a random subset of 200 prompts from the ParlAI Twitter-derived test set.

ChatEval Baselines

ChatEval offers "ground-truth" baselines to compare uploaded models with. Baseline models range from human responders to established chatbot models. Baselines are handpicked and uploaded by the ChatEval Team.

NCM Human 1 Baseline
Human 1 responding to NCM

NCM Human 2 Baseline
Human 2 responding to NCM

DBDC Human Baseline 1
Human baseline for DBDC

OS Baseline 1
Open Subtitles Baseline

Twitter Baseline
Human baseline for ParlAI Twitter dataset

Automated Evaluation Methods

The ChatEval Platform computes a set of automated evaluation metrics over chatbot responses; these metrics are documented here. Models can be ranked according to a specific metric and viewed as a leaderboard.

Average Sentence Length
Metric

The average number of tokens in the model's responses.

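A minimal sketch of the computation, assuming simple whitespace tokenization (ChatEval's own tokenizer may differ):

    def average_sentence_length(responses):
        # Mean number of tokens per generated response (whitespace tokenization).
        token_counts = [len(r.split()) for r in responses]
        return sum(token_counts) / len(token_counts) if token_counts else 0.0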

Distinct 1
Metric

The number of unique unigrams in the model's responses divided by the total number of generated tokens.

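A minimal sketch, again assuming whitespace tokenization; the helper is parameterized by n, so the same code also covers Distinct 2 below:

    def distinct_n(responses, n=1):
        # Unique n-grams across all responses divided by the total number
        # of generated tokens.
        ngrams = set()
        total_tokens = 0
        for response in responses:
            tokens = response.split()
            total_tokens += len(tokens)
            ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(ngrams) / total_tokens if total_tokens else 0.0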

Distinct 2
Metric

The number of unique bigrams in the model's responses divided by the total number of generated tokens.

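Distinct 2 follows the same sketch given for Distinct 1, simply counting bigrams instead of unigrams:

    # Using the distinct_n sketch above on a model's generated responses.
    responses = ["i am fine", "i am fine thanks"]
    distinct_2 = distinct_n(responses, n=2)  # unique bigrams / total tokens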

Embedding Greedy Match Score
Metric

Greedy matching between the word embeddings of the target utterance and the model's utterance (Rus and Lintean, 2012).

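A minimal sketch of greedy matching as formulated in Liu et al. (2016): each word in one utterance is matched to its most similar word in the other, the maxima are averaged, and the score is averaged over both directions. The word_vectors lookup (a word-to-vector mapping such as pre-trained embeddings) is an assumption for illustration.

    import numpy as np

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    def greedy_match_score(reference, hypothesis, word_vectors):
        # Average of the two directed greedy-match scores.
        def directed(src, tgt):
            sims = []
            for w in src:
                if w not in word_vectors:
                    continue
                best = max((cosine(word_vectors[w], word_vectors[v])
                            for v in tgt if v in word_vectors), default=0.0)
                sims.append(best)
            return sum(sims) / len(sims) if sims else 0.0

        ref, hyp = reference.split(), hypothesis.split()
        return (directed(ref, hyp) + directed(hyp, ref)) / 2.0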

Embedding Extrema Score
Metric

Vector extrema of a model's response token vectors (Forgues et al., 2014).

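A minimal sketch of the extrema sentence representation (word_vectors is again an assumed word-to-vector mapping); the extrema vectors of the model's response and the target utterance are then compared, typically with cosine similarity:

    import numpy as np

    def extrema_vector(tokens, word_vectors):
        # For each embedding dimension, keep the most extreme value (largest
        # absolute magnitude) across the utterance's in-vocabulary words.
        vectors = np.array([word_vectors[w] for w in tokens if w in word_vectors])
        if vectors.size == 0:
            return None
        maxima, minima = vectors.max(axis=0), vectors.min(axis=0)
        return np.where(np.abs(maxima) >= np.abs(minima), maxima, minima)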

Average Embedding Score
Metric

The average of a model's response token vectors.

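A minimal sketch (same assumed word_vectors mapping); the averaged vectors of the response and the target utterance are then compared, typically with cosine similarity:

    import numpy as np

    def average_embedding(tokens, word_vectors):
        # Mean of the word vectors of all in-vocabulary tokens in the utterance.
        vectors = [word_vectors[w] for w in tokens if w in word_vectors]
        return np.mean(vectors, axis=0) if vectors else None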

BLEU Score
Metric

The Bilingual Evaluation Understudy (BLEU) score is a metric for evaluating a generated sentence against a reference sentence (Papineni et al., 2002).

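A minimal sketch using NLTK's sentence-level BLEU; the choice of NLTK and of smoothing method 1 are illustrative assumptions, not necessarily how ChatEval computes the score:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu(reference, hypothesis):
        # Sentence-level BLEU of a generated response against a single
        # reference, smoothed to avoid zero scores on short utterances.
        smooth = SmoothingFunction().method1
        return sentence_bleu([reference.split()], hypothesis.split(),
                             smoothing_function=smooth)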

References

Higashinaka, Ryuichiro, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. "The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics." In LREC. 2016.

Liu, Chia-Wei, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation." In EMNLP, pp. 2122-2132. Association for Computational Linguistics, 2016.

Forgues, Gabriel, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. "Bootstrapping dialog systems with word embeddings." In NIPS, modern machine learning and natural language processing workshop, vol. 2. 2014.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Rus, Vasile, and Mihai Lintean. "A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics." In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pp. 157-162. Association for Computational Linguistics, 2012.

Tiedemann, Jörg. "News from OPUS - A collection of multilingual parallel corpora with tools and interfaces." In Recent advances in natural language processing, vol. 5, pp. 237-248. 2009.

Vinyals, Oriol, and Quoc Le. "A neural conversational model." arXiv preprint arXiv:1506.05869 (2015).