A Holistic Comparative Study of Large Language Models as Emotional Support Dialogue Systems

Academic Background

In recent years, with the rapid development of large language models (LLMs), their applications in natural language processing (NLP) have become increasingly widespread. LLMs such as ChatGPT and LLaMA have demonstrated strong capabilities in language generation and comprehension, and even show promise in emotional expression and empathy. Emotional Support Dialogue Systems (ESDS) aim to convey understanding, sympathy, care, and support through conversation, helping individuals cope with emotional distress, stress, or other challenges. However, despite the potential LLMs have shown in emotional dialogue, their ability to provide effective emotional support has not been fully evaluated.

This study aims to explore whether LLMs can serve as the core framework for emotional support dialogue systems and assess their performance in emotional support strategies and language usage. By comparing the performance of LLMs and humans in emotional support dialogues, the research reveals the limitations of LLMs in providing emotional support, particularly in terms of strategy preferences and language generation biases.

Source of the Paper

This paper is co-authored by Xin Bai, Guanyi Chen, Tingting He, Chenlian Zhou, and Cong Guo, who are affiliated with the Faculty of Artificial Intelligence in Education at Central China Normal University, the Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, and the National Language Resources Monitoring and Research Center. The paper was published in 2025 in the journal Cognitive Computation, titled A Holistic Comparative Study of Large Language Models as Emotional Support Dialogue Systems.

Research Process

1. Research Framework and Dataset

This study is based on the Emotional Support Conversation (ESC) framework proposed by Liu et al., which consists of three stages: Exploration, Comforting, and Action. Each stage includes a set of suggested dialogue strategies, such as asking questions, reflecting on emotions, and providing suggestions. The research uses the ESC benchmark dataset (ESConv), which contains approximately 1,000 dialogues and 13,000 utterances, with each utterance annotated with corresponding emotional support strategies.
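
To make the annotation format concrete, the following is a minimal, hypothetical sketch of a strategy-annotated exchange. The field names (situation, dialog, speaker, strategy, content) are illustrative assumptions and are not claimed to match the exact ESConv schema.

```python
# Illustrative sketch of a strategy-annotated emotional support exchange.
# Field names are hypothetical; they do not reproduce the exact ESConv format.
example_dialogue = {
    "situation": "I have been feeling overwhelmed at work lately.",
    "dialog": [
        {"speaker": "seeker",
         "content": "I can't keep up with my deadlines and it's stressing me out."},
        {"speaker": "supporter", "strategy": "Question",
         "content": "What part of the workload feels most overwhelming right now?"},
        {"speaker": "seeker",
         "content": "Mostly the constant switching between projects."},
        {"speaker": "supporter", "strategy": "Reflection of feelings",
         "content": "It sounds exhausting to be pulled in so many directions at once."},
    ],
}
```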

2. Models and Experimental Design

The study selected two mainstream LLMs, ChatGPT and LLaMA, and designed various prompt engineering techniques to construct LLM-based emotional support dialogue systems. The experiments were conducted under the following settings (a prompt-construction sketch follows the list):

  • Zero-shot and Few-shot Learning: Testing the ability of LLMs to generate emotional support dialogues without examples or with only a few examples.
  • Guided Models: Explicitly informing the model which strategy to use in the prompt, evaluating the model’s performance under known strategies.
  • Chain-of-Thought (CoT) Models: Using staged reasoning to first select a strategy and then generate the dialogue, assessing the model’s performance in complex tasks.
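
The sketch below illustrates, in broad strokes, how prompts for the three settings above could be constructed. The wording is a generic illustration rather than the exact prompts used in the paper; the strategy names follow the ESC framework.

```python
# Illustrative prompt builders for the zero-shot, guided, and CoT settings.
# The prompt wording is an assumption, not the paper's actual prompts.

STRATEGIES = [
    "Question", "Restatement or Paraphrasing", "Reflection of feelings",
    "Self-disclosure", "Affirmation and Reassurance",
    "Providing Suggestions", "Information", "Others",
]

def zero_shot_prompt(history: str) -> str:
    """Ask the model to reply as a supporter with no examples."""
    return (
        "You are an emotional support assistant. Continue the conversation "
        "with a single supportive response.\n\n"
        f"Conversation so far:\n{history}\nSupporter:"
    )

def guided_prompt(history: str, strategy: str) -> str:
    """Tell the model explicitly which support strategy to realise."""
    return (
        "You are an emotional support assistant. Respond using the "
        f"'{strategy}' strategy.\n\n"
        f"Conversation so far:\n{history}\nSupporter:"
    )

def cot_prompt(history: str) -> str:
    """Staged (chain-of-thought style) prompt: pick a strategy, then respond."""
    options = ", ".join(STRATEGIES)
    return (
        "You are an emotional support assistant.\n"
        f"Step 1: Choose the most suitable strategy from: {options}.\n"
        "Step 2: Write one supportive response that applies the chosen strategy.\n\n"
        f"Conversation so far:\n{history}\n"
        "Answer with 'Strategy: ...' followed by 'Response: ...'."
    )
```

A few-shot variant would prepend several annotated example exchanges (such as the sketch in the previous subsection) before the conversation history.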

3. Evaluation Metrics

The study evaluated the models along three dimensions:

  • Strategy Selection Accuracy: Assessing the model’s ability to choose the correct emotional support strategy.
  • Generation Quality: Using automatic metrics such as BLEU and ROUGE to assess the quality of generated dialogues.
  • Diversity: Evaluating the lexical diversity of generated dialogues with the DIST-N metric (a minimal DIST-N sketch follows this list).
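
DIST-N is commonly computed as the ratio of distinct n-grams to the total number of n-grams in the generated responses. Below is a minimal sketch of that computation; tokenisation and aggregation details may differ from the paper's exact implementation.

```python
from collections import Counter

def distinct_n(responses: list[str], n: int) -> float:
    """DIST-N: unique n-grams divided by total n-grams across all responses."""
    ngrams = Counter()
    total = 0
    for text in responses:
        tokens = text.split()  # simple whitespace tokenisation for illustration
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

# Example usage with two hypothetical generated responses.
generated = [
    "I understand how stressful that must feel for you",
    "That sounds really stressful and I understand how you feel",
]
print(f"DIST-1 = {distinct_n(generated, 1):.3f}")
print(f"DIST-2 = {distinct_n(generated, 2):.3f}")
```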

Main Findings

1. Strategy Selection Accuracy

The study found that LLMs performed poorly in strategy selection accuracy, especially in the absence of examples. For instance, LLaMA achieved a strategy selection accuracy of only 21.84% in the 5-shot setting, significantly lower than non-LLM models like TransESC, which achieved 34.71%. This indicates that LLMs still face significant gaps in understanding and using emotional support strategies.

2. Generation Quality and Diversity

Although LLM-generated responses were judged comparable to human responses in overall quality, they were often overly verbose, which lowered overlap-based scores such as BLEU. LLMs also scored well on lexical diversity, but in a professional support context, greater diversity is not always beneficial.

3. Strategy Usage Preferences

LLMs exhibited strong strategy preferences in emotional support dialogues, particularly in the comforting stage. For example, ChatGPT and LLaMA chose the “reflection of feelings” and “affirmation and reassurance” strategies in over 50% of cases, while rarely taking concrete actions such as providing suggestions or information. This preference bias limits the ability of LLMs to provide comprehensive emotional support.
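
One simple way to quantify such a preference bias is the share of each strategy among a model’s chosen strategies. The sketch below uses hypothetical predictions purely for illustration.

```python
from collections import Counter

def strategy_distribution(predicted: list[str]) -> dict[str, float]:
    """Share of each support strategy among a model's predicted strategies."""
    counts = Counter(predicted)
    total = sum(counts.values())
    return {strategy: count / total for strategy, count in counts.most_common()}

# Hypothetical strategy predictions for a handful of supporter turns.
preds = [
    "Affirmation and Reassurance", "Reflection of feelings",
    "Affirmation and Reassurance", "Question", "Reflection of feelings",
]
print(strategy_distribution(preds))
```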

Conclusions and Significance

This study demonstrates that, despite the strong empathetic capabilities of LLMs in emotional dialogues, they still exhibit significant limitations in providing effective emotional support. LLMs tend to overuse certain strategies, and their generated content often deviates from actual dialogues by human experts. This finding provides important insights for future improvements in the application of LLMs in emotional support dialogues.

Research Highlights

  1. Comprehensive Comparison: This study is the first to comprehensively compare the performance of LLMs in emotional support dialogues, revealing their biases in strategy selection and language generation.
  2. Novel Methods: The research employs various prompt engineering techniques, such as chain-of-thought models, offering new approaches for applying LLMs in complex tasks.
  3. Practical Significance: The findings have significant implications for developing more effective emotional support dialogue systems, particularly in reducing strategy preferences and over-generation.

Future Outlook

Future research could explore ways to reduce the strategy preferences of LLMs in emotional support dialogues, encouraging them to take more concrete actions, such as providing suggestions or information. In addition, controlling the over-generation (verbosity) of LLMs is an important direction for future work.

Through this study, we have gained a deeper understanding of the performance of LLMs in emotional support dialogues and provided valuable insights for future technological advancements in this field.