Seyyed Hashemi, Kyle Williams, Ahmed Kholy, I. Zitouni, Paul Crook (2018)
Measuring User Satisfaction on Smart Speaker Intelligent Assistants Using Intent Sensitive Query Embeddings. Proceedings of the 27th ACM International Conference on Information and Knowledge Management
Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, W. Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley (2017)
A Knowledge-Grounded Neural Conversation Model
Ben Carterette (2012)
Multiple testing in statistical analysis of systems-based information retrieval experiments. ACM Trans. Inf. Syst., 30
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (2019)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Chengxiang Zhai (2001)
Notes on the Lemur TFIDF model. Retrieved from http://lemurproject.org/lemur/tfidf.pdf
Djoerd Hiemstra (2001)
Using Language Models for Information Retrieval. Citeseer
Alan Ritter, Colin Cherry, W. Dolan (2010)
Unsupervised Modeling of Twitter Conversations
Chongyang Tao, Lili Mou, Dongyan Zhao, Rui Yan (2017)
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
Shumpei Sano, Nobuhiro Kaji, Manabu Sassano (2016)
Prediction of Prospective User Engagement with Intelligent Assistants
Avishek Anand, L. Cavedon, Hideo Joho, M. Sanderson, Benno Stein (2019)
Conversational Search (Dagstuhl Seminar 19461). Dagstuhl Reports, 9
Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, L. Lastras, Jonathan Kummerfeld, Walter Lasecki, Chiori Hori, A. Cherian, Tim Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta (2019)
The Eighth Dialog System Technology Challenge. ArXiv, abs/1911.06394
Basma Boussaha, Nicolas Hernandez, C. Jacquin, E. Morin (2019)
Deep Retrieval-Based Dialogue Systems: A Short Review. ArXiv, abs/1907.12878
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, Shuzi Niu (2017)
DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. ArXiv, abs/1710.03957
Tetsuya Sakai, Zhaohao Zeng (2019)
Which Diversity Evaluation Measures Are "Good"? Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
J. Olive, Caitlin Christianson, John McCary (2011)
Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation. Springer Science & Business Media
Jiepu Jiang, James Allan (2016)
Correlation Between System and User Metrics in a Session. Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval
Filip Radlinski, Nick Craswell (2017)
A Theoretical Framework for Conversational Search. Proceedings of the 2017 Conference on Human Information Interaction and Retrieval
G. Amati (2006)
Frequentist and Bayesian Approach to Information Retrieval
S. Banerjee, A. Lavie (2005)
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
T. Sakai (2007)
On the reliability of information retrieval metrics based on graded relevance. Inf. Process. Manag., 43
Lifeng Han, Derek Wong, Lidia Chao (2012)
LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors
Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem (2016)
Re-evaluating Automatic Metrics for Image Captioning
Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Awadallah, Aidan Crook, I. Zitouni, T. Anastasakos (2016)
Understanding User Satisfaction with Intelligent Assistants. Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval
P. Gulyaev, E. Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, M. Burtsev (2020)
Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker. ArXiv, abs/2002.02450
Jan Deriu, Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, S. Rosset, Eneko Agirre, Mark Cieliebak (2019)
Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54
Liu Yang, Minghui Qiu, Chen Qu, J. Guo, Yongfeng Zhang, W. Croft, Jun Huang, Haiqing Chen (2018)
Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
T. Sakai (2006)
Evaluating evaluation metrics based on the bootstrap. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
K. Zhou, M. Lalmas, T. Sakai, Ronan Cummins, J. Jose (2013)
On the reliability and intuitiveness of aggregated search metrics. Proceedings of the 22nd ACM international conference on Information & Knowledge Management
P. Foltz, W. Kintsch, T. Landauer (1998)
The Measurement of Textual Coherence with Latent Semantic Analysis. Discourse Processes, 25
M. Sanderson, Monica Paramita, Paul Clough, E. Kanoulas (2010)
Do user preferences and evaluation measures line up? Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, William Hamilton, Joelle Pineau (2020)
Learning an Unreferenced Metric for Online Dialogue Evaluation
Alistair Moffat, J. Zobel (2008)
Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst., 27
A. Turpin, Falk Scholer (2006)
User performance versus precision measures for simple search tasks. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
G. Amati, C. Rijsbergen (2002)
Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20
S. Harter (1975)
A probabilistic approach to automatic keyword indexing. Part I. On the Distribution of Specialty Words in a Technical Literature. J. Am. Soc. Inf. Sci., 26
Katharina Kann, S. Rothe, Katja Filippova (2018)
Sentence-Level Fluency Evaluation: References Help, But Can Be Spared! ArXiv, abs/1809.08731
Yu Wu, Wei Wu, Ming Zhou, Zhoujun Li (2016)
Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. ArXiv, abs/1612.01627
Yi Yang, Wen-tau Yih, Christopher Meek (2015)
WikiQA: A Challenge Dataset for Open-Domain Question Answering
Jekaterina Novikova, Ondrej Dusek, A. Curry, Verena Rieser (2017)
Why We Need New Evaluation Metrics for NLG
Gabriel Forgues, Joelle Pineau (2014)
Bootstrapping Dialog Systems with Word Embeddings
Ryan Lowe, Nissan Pow, Iulian Serban, Joelle Pineau (2015)
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, W. Dolan (2018)
Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization
E. Kanoulas, Ben Carterette, Paul Clough, M. Sanderson (2011)
Evaluating multi-query sessions. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
T. Sakai, N. Kando (2008)
On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval, 11
Shikib Mehri, M. Eskénazi (2020)
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, H. Jégou, Tomas Mikolov (2016)
FastText.zip: Compressing text classification models. ArXiv, abs/1612.03651
Julia Kiseleva, Kyle Williams, Ahmed Awadallah, Aidan Crook, I. Zitouni, T. Anastasakos (2016)
Predicting User Satisfaction with Intelligent Assistants. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, J. Weston (2018)
Wizard of Wikipedia: Knowledge-Powered Conversational Agents. ArXiv, abs/1811.01241
Jeff Mitchell, Mirella Lapata (2008)
Vector-based Models of Semantic Composition
T. Sakai (2005)
The Effect of Topic Sampling on Sensitivity Comparisons of Information Retrieval Metrics
T. Landauer, S. Dumais (1997)
A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, W. Dolan (2015)
A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. ArXiv, abs/1506.06714
Yi Luan, Chris Brockett, W. Dolan, Jianfeng Gao, Michel Galley (2017)
Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models
Alan Ritter, Colin Cherry, W. Dolan (2011)
Data-Driven Response Generation in Social Media
Chen Qu, Liu Yang, W. Croft, Johanne Trippas, Yongfeng Zhang, Minghui Qiu (2018)
Analyzing and Characterizing User Intent in Information-seeking Conversations. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio (2016)
A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. ArXiv, abs/1605.06069
Amanda Stent, M. Marge, Mohit Singhai (2005)
Evaluating Evaluation Methods for Generation in the Presence of Variation
Johanne Trippas, Damiano Spina, L. Cavedon, Hideo Joho, M. Sanderson (2018)
Informing the Design of Spoken Conversational Search: Perspective Paper. Proceedings of the 2018 Conference on Human Information Interaction & Retrieval
Tian Lan, Xian-Ling Mao, Wei Wei, Xiaoyan Gao, Heyan Huang (2020)
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems. ACM Transactions on Information Systems (TOIS), 39
Mengyang Liu, Yiqun Liu, Jiaxin Mao, Cheng Luo, Shaoping Ma (2018)
Towards Designing Better Session Search Evaluation Metrics. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
Xisen Jin, Wenqiang Lei, Z. Ren, Hongshen Chen, Shangsong Liang, Y. Zhao, Dawei Yin (2018)
Explicit State Tracking with Semi-Supervision for Neural Dialogue Generation. Proceedings of the 27th ACM International Conference on Information and Knowledge Management
Filip Radlinski, Nick Craswell (2010)
Comparing the sensitivity of information retrieval metrics. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, Dongyan Zhao (2017)
How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models
Ilya Sutskever, Oriol Vinyals, Quoc Le (2014)
Sequence to Sequence Learning with Neural Networks. ArXiv, abs/1409.3215
J. Choi, Ali Ahmadvand, Eugene Agichtein (2019)
Offline and Online Satisfaction Prediction in Open-Domain Conversational Systems. Proceedings of the 28th ACM International Conference on Information and Knowledge Management
K. Zhou, Ronan Cummins, M. Lalmas, J. Jose (2012)
Evaluating aggregated search pages
Chin-Yew Lin (2004)
ROUGE: A Package for Automatic Evaluation of Summaries
Louise Su (1992)
Evaluation Measures for Interactive Information Retrieval. Inf. Process. Manag., 28
Delphine Charlet, Géraldine Damnati (2017)
SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering
S. Clinchant, Éric Gaussier (2010)
Information-based models for ad hoc IR. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
T. Sakai (2013)
Metrics, Statistics, Tests
D. Hiemstra (2003)
Language models for information retrieval. Proceedings 19th International Conference on Data Engineering
Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau (2016)
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. ArXiv, abs/1603.08023
O. Chapelle, D. Metlzer, Ya Zhang, P. Grinspan (2009)
Expected reciprocal rank for graded relevance. Proceedings of the 18th ACM conference on Information and knowledge management
Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau (2016)
On the Evaluation of Dialogue Systems with Next Utterance Classification
S. Robertson (2004)
Understanding inverse document frequency: on theoretical arguments for IDF. J. Documentation, 60
Ye Chen, K. Zhou, Yiqun Liu, Min Zhang, Shaoping Ma (2017)
Meta-evaluation of Online and Offline Web Search Evaluation Metrics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
G. Amati, Giuseppe Amodeo, M. Bianchi, C. Gaibisso, G. Gambosi (2008)
FUB, IASI-CNR and University of Tor Vergata at TREC 2008 Blog Track
B. Wong, Chunyu Kit (2011)
Comparative Evaluation of Term Informativeness Measures in Machine Translation Evaluation Metrics
Aldo Lipani, Ben Carterette, Emine Yilmaz (2019)
From a User Model for Query Sessions to Session Rank Biased Precision (sRBP). Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval
K. Järvelin, Jaana Kekäläinen (2002)
Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Weinberger, Yoav Artzi (2019)
BERTScore: Evaluating Text Generation with BERT. ArXiv, abs/1904.09675
Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, W. Dolan (2015)
deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets. ArXiv, abs/1506.06863
Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I. Casanueva, Stefan Ultes, Osman Ramadan, Milica Gasic (2018)
MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling
G. Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, David Pinto (2014)
Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model. Computación y Sistemas, 18
Ryan Lowe, Michael Noseworthy, Iulian Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau (2017)
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
S. Clinchant, Éric Gaussier (2009)
Bridging Language Modeling and Divergence from Randomness Models: A Log-Logistic Model for IR
Daniel Cohen, Liu Yang, W. Croft (2018)
WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
V. Rus, Mihai Lintean (2012)
A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics
Wen Zheng, K. Zhou (2019)
Enhancing Conversational Dialogue Models with Grounded Knowledge. Proceedings of the 28th ACM International Conference on Information and Knowledge Management
Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, Ming Zhang (2016)
Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems. ArXiv, abs/1610.07149
Jiepu Jiang, Ahmed Awadallah, Xiaolin Shi, Ryen White (2015)
Understanding and Predicting Graded Search Satisfaction. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining
Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, Rui Yan (2016)
Multi-view Response Selection for Human-Computer Conversation
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang (2016)
SQuAD: 100,000+ Questions for Machine Comprehension of Text
D. Kelly (2009)
Methods for Evaluating Interactive Information Retrieval Systems with Users. Found. Trends Inf. Retr., 3
T. Sakai (2012)
Evaluation with informational and navigational intents. Proceedings of the 21st international conference on World Wide Web
Tian Lan, Xian-Ling Mao, Heyan Huang, Wei Wei (2019)
When to Talk: Chatbot Controls the Timing of Talking during Multi-turn Open-domain Dialogue Generation. ArXiv, abs/1912.09879
Gabriel Murray, S. Renals, J. Carletta, Johanna Moore (2005)
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
K. Papineni, Salim Roukos, T. Ward, Wei-Jing Zhu (2002)
Bleu: a Method for Automatic Evaluation of Machine Translation
Sarik Ghazarian, Johnny Wei, A. Galstyan, Nanyun Peng (2019)
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings. ArXiv, abs/1904.10635
Yiping Song, Cheng-te Li, Jian-Yun Nie, Ming Zhang, Dongyan Zhao, Rui Yan (2018)
An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
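The fidelity and concordance analyses described in the abstract come down to correlating per-response metric scores with human judgments. A minimal, self-contained sketch of that idea (the scores and ratings below are made up for illustration; the correlation functions are implemented from scratch rather than taken from the paper):

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation between automatic metric scores and human ratings
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    # Kendall's tau-a: (concordant - discordant) pairs over all pairs;
    # tied pairs count as neither
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

# Hypothetical data: one metric score per response, plus users' 1-5 ratings
metric_scores = [0.41, 0.12, 0.77, 0.30, 0.65, 0.52]
human_ratings = [3, 1, 5, 2, 4, 4]

print(round(pearson(metric_scores, human_ratings), 3))
print(round(kendall_tau(metric_scores, human_ratings), 3))
```

A meta-evaluation in the paper's sense repeats this for each candidate metric and prefers the one whose scores correlate best with user preference or satisfaction.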
ACM Transactions on Information Systems (TOIS) – Association for Computing Machinery
Published: Aug 31, 2021
Keywords: Conversational search