Three-Way Decision Approach Based on Utility and Dynamic Localization Transformational Procedures within a Circular Q-Rung Orthopair Fuzzy Set for Ranking and Grading Large Language Models

Academic Background

With the rapid development of artificial intelligence (AI) and natural language processing (NLP), large language models (LLMs) have made significant progress in both academia and industry. However, despite the outstanding performance of LLMs across many NLP tasks, no single model has been able to meet all task requirements simultaneously. The diversity of task demands and the complexity of evaluation criteria make the assessment of LLMs a multi-criteria decision-making (MCDM) problem. Traditional MCDM methods can produce rankings, but they have limitations in handling uncertainty, task prioritization, and data variability; in particular, they struggle with binary (0/1) evaluation data, which makes effective grading difficult.

To address this issue, this paper proposes a three-way decision (3WD) method based on utility and dynamic localization transformation, combined with circular q-rung orthopair fuzzy sets (C-Q-ROFS), to rank and grade LLMs. This method not only handles uncertainty but also effectively processes binary data through dynamic transformation procedures, providing a more robust mechanism for LLM evaluation.

Source of the Paper

This paper is co-authored by Sarah Qahtan, Nahia Mourad, H. A. Alsattar, A. A. Zaidan, B. B. Zaidan, Dragan Pamucar, Vladimir Simic, Weiping Ding, and Khaironi Yatim, who are affiliated with several research institutions, including the University of Baghdad and the University of Belgrade. The paper was published in 2025 in the journal Cognitive Computation (volume 17, article number 77).

Research Process

1. Research Objectives and Method Overview

The main objective of this paper is to develop a new three-way decision method combined with C-Q-ROFS for ranking and grading LLMs. The method comprises three parts:

  1. Reconstructing the fuzzy weighted zero-inconsistency interrelationship process (FWZICBIP) method using C-Q-ROFS to prioritize tasks and address weight uncertainty.
  2. Constructing a decision matrix from the intersection of LLMs and NLP tasks and applying utility and dynamic localization transformation procedures to handle binary data.
  3. Reconstructing the conditional probabilities by opinion scores (CPOS) method within the C-Q-ROFS framework to determine decision thresholds for each LLM.

2. Detailed Research Process

2.1 Determining NLP Task Weights

First, the authors use the C-Q-ROFS-FWZICBIP method to determine the weights of the NLP tasks. The method is implemented through the following steps:

  1. q-Rung Orthopair Fuzzification: Convert expert evaluations, given on a five-point Likert scale, into q-rung orthopair fuzzy elements.
  2. Circular Fuzzy Element Construction: Transform the q-rung orthopair fuzzy elements of each task into circular q-rung orthopair fuzzy elements.
  3. Scoring: Apply a score function to each task and map the result to the [0, 1] range.
  4. Weight Calculation: Determine each task's final weight by comparing its mean significance score with its initial weight.
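To make these steps concrete, the sketch below (in Python) shows how Likert-scale expert ratings could be fuzzified, aggregated into circular elements, scored, and normalized into task weights. It is a minimal illustration rather than the authors' implementation: the Likert-to-membership table, the score function S = (1 + mu^q - nu^q)/2, and the radius construction (maximum distance to the group centroid) are common q-ROFS conventions assumed here, and the paper's exact formulas may differ.

    # Minimal sketch (not the authors' exact formulas): fuzzify Likert ratings
    # into q-rung orthopair elements, aggregate them into circular elements,
    # score the centres, and normalize the scores into task weights.
    import math

    Q = 3  # rung parameter; the mapping below satisfies mu**Q + nu**Q <= 1

    # Hypothetical five-point Likert -> (membership, non-membership) mapping.
    LIKERT_TO_QROF = {1: (0.10, 0.90), 2: (0.30, 0.70), 3: (0.50, 0.50),
                      4: (0.70, 0.30), 5: (0.90, 0.10)}

    def score(mu, nu, q=Q):
        """q-ROF score function normalized to the [0, 1] range."""
        return (1.0 + mu**q - nu**q) / 2.0

    def circular_element(ratings):
        """Aggregate one task's expert ratings into a circular q-ROF element:
        (mean membership, mean non-membership, radius = max distance to centre)."""
        pts = [LIKERT_TO_QROF[r] for r in ratings]
        mu_c = sum(p[0] for p in pts) / len(pts)
        nu_c = sum(p[1] for p in pts) / len(pts)
        radius = max(math.hypot(p[0] - mu_c, p[1] - nu_c) for p in pts)
        return mu_c, nu_c, radius

    def task_weights(expert_ratings_per_task):
        """Score each task's circular element and normalize into weights."""
        scores = []
        for ratings in expert_ratings_per_task:
            mu_c, nu_c, _ = circular_element(ratings)
            scores.append(score(mu_c, nu_c))
        total = sum(scores)
        return [s / total for s in scores]

    # Example: three NLP tasks, each rated by four experts.
    print(task_weights([[5, 4, 5, 4], [3, 4, 3, 3], [2, 3, 2, 2]]))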

2.2 Constructing the LLMs Decision Matrix

Next, the authors construct the LLM decision matrix through the utility and dynamic localization transformation procedures:

  1. Utility Procedure: Decision-makers convert the binary 0/1 entries of the decision matrix into percentage values based on their personal experience.
  2. Dynamic Localization: Transform the percentage decision matrix into a five-point Likert-scale decision matrix.
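A minimal sketch of these two transformations follows. The pass/fail utility percentages and the equal-width mapping from percentages to the five-point Likert scale are illustrative assumptions; the paper attributes these judgments to the decision-makers rather than prescribing fixed values.

    # Minimal sketch of the utility and dynamic localization procedures.
    # The per-task percentages and the equal-width Likert intervals are
    # illustrative placeholders, not values taken from the paper.

    def utility_procedure(binary_matrix, pct_if_pass, pct_if_fail):
        """Replace each 0/1 entry with an expert-assigned percentage.
        pct_if_pass / pct_if_fail give, per task (column), the percentage a
        decision-maker attaches to passing or failing that task."""
        return [[pct_if_pass[j] if cell == 1 else pct_if_fail[j]
                 for j, cell in enumerate(row)]
                for row in binary_matrix]

    def dynamic_localization(pct_matrix):
        """Map percentages onto a five-point Likert scale using equal-width
        intervals (an assumed localization scheme)."""
        def to_likert(p):
            return min(5, int(p // 20) + 1)  # 0-19 -> 1, ..., 80-100 -> 5
        return [[to_likert(p) for p in row] for row in pct_matrix]

    # Example: 3 LLMs x 2 NLP tasks with binary pass/fail results.
    binary = [[1, 0], [1, 1], [0, 0]]
    pct = utility_procedure(binary, pct_if_pass=[90, 85], pct_if_fail=[30, 20])
    print(dynamic_localization(pct))  # [[5, 2], [5, 5], [2, 2]]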

2.3 Ranking and Grading of LLMs

Finally, the authors use the C-Q-ROFS-CPOS method together with Bayesian decision theory to rank and grade the LLMs:

  1. Fuzzification: Replace the values in the decision matrix with q-rung orthopair fuzzy elements.
  2. Circular q-Rung Orthopair Fuzzy Element Construction: Aggregate the fuzzy elements from multiple decision-makers into circular q-rung orthopair fuzzy elements.
  3. Scoring: Calculate the weighted score of each LLM.
  4. Conditional Probability Calculation: Compute the conditional probability of each LLM and rank the models accordingly.
  5. Threshold Generation: Generate thresholds from Bayesian decision rules and classify each LLM into the positive (POS), boundary (BND), or negative (NEG) region.
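The grading step can be illustrated with the classical Bayesian three-way-decision thresholds. The sketch below uses the standard loss-matrix formulas for the thresholds alpha and beta as a stand-in for the paper's threshold generation (which also involves the risk avoidance coefficient sigma); the loss values and the middle example are invented, and only the conditional probabilities 0.6528 (LLM14) and 0.0000 (LLM22) come from the reported results.

    # Minimal sketch of threshold-based grading, assuming the classical
    # three-way-decision thresholds derived from a Bayesian loss matrix
    # (l_xy = loss of taking action x when the true state is y). The loss
    # values are illustrative, not the paper's.

    def thresholds(l_pp, l_bp, l_np, l_nn, l_bn, l_pn):
        """alpha and beta from the standard Bayesian 3WD risk comparison."""
        alpha = (l_pn - l_bn) / ((l_pn - l_bn) + (l_bp - l_pp))
        beta = (l_bn - l_nn) / ((l_bn - l_nn) + (l_np - l_bp))
        return alpha, beta

    def grade(cond_prob, alpha, beta):
        """Assign an LLM to the POS, BND, or NEG region."""
        if cond_prob >= alpha:
            return "POS"
        if cond_prob <= beta:
            return "NEG"
        return "BND"

    # Example: illustrative losses giving alpha = 0.6 and beta = 0.25.
    a, b = thresholds(l_pp=0.0, l_bp=2.0, l_np=5.0, l_nn=0.0, l_bn=1.0, l_pn=4.0)
    # LLM14 and LLM22 probabilities are from the paper; the middle case is hypothetical.
    for name, p in [("LLM14", 0.6528), ("LLM_mid", 0.40), ("LLM22", 0.0000)]:
        print(name, grade(p, a, b))  # POS, BND, NEG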

3. Research Results

3.1 NLP Task Weight Results

Using the C-Q-ROFS-FWZICBIP method, the authors determined the weights of NLP tasks. The results show that sentiment analysis (SA) is the most important subtask, with a weight of 0.2324, followed by reasoning (REAS) with a weight of 0.1611. Summarization (SUMM) is the most important task in natural language generation (NLG), with a weight of 0.1178.

3.2 LLMs Decision Matrix Results

Through the utility and dynamic localization transformation procedures, the authors constructed the decision matrix for LLMs. The results show that LLM14 performed the best across multiple NLP tasks, while LLM22 performed the worst.

3.3 LLMs Ranking and Grading Results

Using the C-Q-ROFS-CPOS method, the authors ranked and graded 40 LLMs. The results show that LLM14 had the highest conditional probability (0.6528), ranking first, while LLM22 had the lowest conditional probability (0.0000), ranking last. Using Bayesian decision rules, the authors classified LLMs into POS, BND, and NEG regions. The results show that LLM14 was in the POS region for most σ values, demonstrating excellent performance.

4. Sensitivity Analysis and Comparative Analysis

4.1 Sensitivity Analysis

The authors analyzed the impact of changing the risk avoidance coefficient (σ), the q value of q-rung orthopair fuzzy sets, and the weight coefficients of NLP tasks on the ranking and grading results of LLMs. The results show that changes in σ mainly affect the grading results, while changes in the q value affect both ranking and grading results. Adjustments to the weight coefficients significantly impact both ranking and grading results.
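As a toy illustration of the sensitivity to q, the sketch below rescores two hypothetical membership/non-membership profiles for several q values and shows that their order can flip, consistent with the observation that q affects both ranking and grading. The profiles and the q grid are invented for illustration.

    # Toy q-sensitivity check: rescore two hypothetical (mu, nu) profiles for
    # several rung values q and watch whether the ranking order changes.
    profiles = {"LLM_a": (0.80, 0.35), "LLM_b": (0.70, 0.20)}

    for q in (1, 2, 3, 5, 10):
        scored = {name: (1 + mu**q - nu**q) / 2 for name, (mu, nu) in profiles.items()}
        ranking = sorted(scored, key=scored.get, reverse=True)
        print(f"q={q}: " + ", ".join(f"{n}={scored[n]:.3f}" for n in ranking))
    # At q = 1 LLM_b scores higher; from q = 2 onward LLM_a overtakes it.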

4.2 Comparative Analysis

The authors compared the proposed method with two benchmark studies. The results show that the proposed method has a clear advantage in handling binary data and uncertainty, enabling more precise ranking and grading of LLMs.

Conclusion and Value

This paper proposes a three-way decision method based on C-Q-ROFS, successfully addressing the multi-criteria decision-making problem in LLM evaluation. This method not only effectively handles uncertainty but also processes binary data through dynamic transformation procedures, providing a robust mechanism for ranking and grading LLMs. The research results show that LLM14 performed the best across multiple NLP tasks, while LLM22 performed the worst. Sensitivity analysis further verifies the robustness and stability of the method.

Research Highlights

  1. Novel Method: This paper is the first to combine C-Q-ROFS with the three-way decision method, proposing a new framework for LLM evaluation.
  2. Handling Binary Data: By using utility and dynamic localization transformation procedures, the method successfully processes binary data, improving evaluation accuracy.
  3. Sensitivity Analysis: By changing multiple parameters, the robustness and stability of the method are verified.
  4. Practical Application Value: This method provides a scientific basis for the evaluation and selection of LLMs, offering significant practical application value.

Summary

In summary, the proposed C-Q-ROFS-based three-way decision method handles both uncertainty and binary evaluation data through its dynamic transformation procedures, yielding a robust ranking and grading of the 40 evaluated LLMs, with LLM14 first and LLM22 last, whose stability is confirmed by the sensitivity analysis. The work thus provides a scientific basis for the evaluation and selection of LLMs, with significant practical application value.