Attention-Enabled Multi-Layer Subword Joint Learning for Chinese Word Embedding
Academic Background
In recent years, Chinese word embeddings have attracted significant attention in Natural Language Processing (NLP). Unlike English, the complex and diverse structure of Chinese characters poses unique challenges for semantic representation. Traditional word vector models such as Word2Vec often fail to capture the fine-grained semantic information inside Chinese words, in particular the varying contributions of subword information at different levels. For example, a Chinese word carries information at several levels, such as its characters, components (radicals), strokes, and pinyin, and these levels contribute to meaning to different degrees depending on context. However, existing models typically treat these levels uniformly and fail to distinguish their relative weights.
To address this issue, this paper proposes a weight-based Chinese word vector model that divides the internal structure of Chinese words into six levels of subword information: word, character, component, pinyin, stroke, and structure. By introducing an attention mechanism, the model dynamically adjusts the weights of each subword level, thereby more comprehensively extracting semantic information. This research not only improves the quality of Chinese word embeddings but also provides new insights into handling the complex semantic structures of Chinese text.
Source of the Paper
This paper is co-authored by Pengpeng Xue, Jing Xiong, Liang Tan, Zhongzhu Liu, and Kanglong Liu. The authors are affiliated with the School of Computer Science at Sichuan Normal University, Chongqing College of Mobile Communication, the Institute of Computing Technology at the Chinese Academy of Sciences, the School of Mathematics and Statistics at Huizhou University, and the Department of Chinese and Bilingual Studies at The Hong Kong Polytechnic University, respectively. The paper was accepted on February 16, 2025, and published in the journal Cognitive Computation with the DOI 10.1007/s12559-025-10431-3.
Research Process
1. Model Design
The proposed model is named the “Attention-enabled Multi-layer Subword Joint Learning Chinese Word Embedding” (ASWE). The core idea is to decompose the semantic representation of Chinese words into six levels of subword information and dynamically adjust the weights of each level through an attention mechanism. The specific process is as follows:
- Input Layer: The model first extracts the target word and its context words from a large-scale Chinese corpus. The context words are further decomposed into multiple subword levels, including word, character, component, pinyin, stroke, and structure.
- Embedding Layer: Each subword level is converted into vector representations through embedding matrices. These matrices are randomly initialized and continuously optimized during training.
- Intra-Subword Attention Layer: Within each subword level, the model uses a self-attention mechanism to calculate the weights of each subword. For example, for word-level context words, the model learns the weights of context words through self-attention and generates a temporary target vector. For other subword levels, the model calculates the similarity between subword vectors and the temporary target vector through dot products to obtain subword weights.
- Inter-Level Attention Layer: Building on the intra-subword attention layer, the model applies an inter-level attention mechanism to calculate the contribution of each subword level to the semantic representation of the target word. Finally, the model generates the semantic vector of the target word through weighted summation. A minimal code sketch of this two-stage attention follows this list.
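To make the two attention stages concrete, here is a minimal, self-contained PyTorch sketch of the forward pass for a single target word. It is not the authors' implementation: the vocabulary sizes, the mean-query form of the word-level self-attention, the scaling by the square root of the dimension, and the decision to include the temporary target vector among the level vectors are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASWESketch(nn.Module):
    def __init__(self, vocab_sizes, dim=200):
        """vocab_sizes: dict mapping each subword level (e.g. 'word', 'char',
        'component', 'pinyin', 'stroke', 'structure') to its vocabulary size."""
        super().__init__()
        self.levels = list(vocab_sizes)
        # One embedding matrix per subword level, randomly initialized.
        self.emb = nn.ModuleDict(
            {lvl: nn.Embedding(n, dim) for lvl, n in vocab_sizes.items()}
        )
        self.dim = dim

    def forward(self, context_ids):
        """context_ids: dict mapping level name -> LongTensor of subword ids
        drawn from the context window of one target word."""
        # 1) Intra-subword attention at the word level: self-attention over
        #    the context-word vectors yields a temporary target vector.
        w = self.emb['word'](context_ids['word'])             # (n_ctx, dim)
        scores = w @ w.mean(dim=0)                            # (n_ctx,)
        alpha = F.softmax(scores / self.dim ** 0.5, dim=0)
        tmp_target = alpha @ w                                 # (dim,)

        # 2) Intra-subword attention at the other levels: dot products between
        #    each subword vector and the temporary target vector give weights.
        level_vecs = [tmp_target]
        for lvl in self.levels:
            if lvl == 'word':
                continue
            s = self.emb[lvl](context_ids[lvl])                # (n_sub, dim)
            beta = F.softmax(s @ tmp_target / self.dim ** 0.5, dim=0)
            level_vecs.append(beta @ s)                        # (dim,)

        # 3) Inter-level attention: weight each level's vector by its
        #    similarity to the temporary target, then sum.
        L = torch.stack(level_vecs)                            # (n_levels, dim)
        gamma = F.softmax(L @ tmp_target / self.dim ** 0.5, dim=0)
        return gamma @ L                                       # target word vector
```

In the full model, the resulting vector would be scored against the true target word and against negative samples (the experiments use 10 negative samples), in the style of skip-gram training with negative sampling.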
2. Experimental Design
To validate the effectiveness of the ASWE model, a series of experiments were designed, including word similarity, word analogy, text classification, and case studies. The corpus used for the experiments was Chinese Wikipedia, which, after preprocessing, resulted in 233,666,330 lexical tokens and 2,036,032 unique words. The experimental parameters were set as follows: context window size of 5, word vector dimension of 200, 100 iterations, 10 negative samples, and an initial learning rate of 0.025.
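For reference, the reported hyperparameters can be collected into a small configuration object; the dataclass form and the field names are my own, only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class ASWETrainConfig:
    window_size: int = 5           # context window size
    embedding_dim: int = 200       # word vector dimension
    iterations: int = 100          # training iterations
    negative_samples: int = 10     # negative samples per target-context pair
    initial_lr: float = 0.025      # initial learning rate

config = ASWETrainConfig()
```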
- Word Similarity Experiment: The WordSim-240 and WordSim-297 datasets were used to evaluate the model’s performance on word similarity. The results showed that the ASWE model outperformed most baseline models on both datasets and achieved the best result on WordSim-297 (an illustrative evaluation sketch follows this list).
- Word Analogy Experiment: A dataset containing 1,124 sets of Chinese analogy questions was used to evaluate the model’s word analogy capabilities. The results showed that the ASWE model performed better than other models in the themes of capitals, cities, and families, particularly excelling in the family theme.
- Text Classification Experiment: The Fudan Chinese text dataset was used to evaluate the model’s performance in text classification. The results showed that the ASWE model achieved classification accuracy above 98% in all five themes (environment, agriculture, economy, politics, and sports), the best among the compared models.
- Case Study: Through specific case studies, the paper further validated the advantages of the ASWE model in capturing semantic associations of Chinese words. For example, when processing words such as “强壮” (strong) and “朝代” (dynasty), the semantically related words generated by the ASWE model were more accurate and closely associated with the target words.
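The word-similarity and word-analogy evaluations above follow standard practice: cosine similarity between learned vectors is correlated with human ratings via Spearman's coefficient, and analogy questions are answered with the 3CosAdd rule. The sketch below assumes the embeddings are stored in a plain dict of NumPy arrays; the dataset format and the 3CosAdd choice are assumptions, not details quoted from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_spearman(embeddings, pairs):
    """pairs: iterable of (word1, word2, human_score), e.g. WordSim-297 rows.
    Returns the Spearman correlation between model and human scores."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

def analogy_3cosadd(embeddings, a, b, c, topn=1):
    """Answer 'a is to b as c is to ?' by ranking words near b - a + c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = ((w, cosine(query, v)) for w, v in embeddings.items()
                  if w not in (a, b, c))
    return sorted(candidates, key=lambda x: -x[1])[:topn]
```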
Main Results
The experimental results demonstrate that the ASWE model performed excellently in multiple tasks, particularly achieving significant improvements in word similarity and word analogy tasks. The specific results are as follows:
- Word Similarity: The ASWE model achieved Spearman correlation coefficients of 0.5434 and 0.6254 on the WordSim-240 and WordSim-297 datasets, respectively, outperforming most baseline models and achieving the best result on WordSim-297.
- Word Analogy: The ASWE model achieved accuracy rates of 92.91%, 92%, and 56.99% in the themes of capitals, cities, and families, respectively, performing the best.
- Text Classification: The ASWE model achieved classification accuracy above 98% in all five themes, outperforming the other models (an illustrative downstream pipeline is sketched below).
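As an illustration of how such embeddings feed the text-classification experiment, a common baseline pipeline averages the word vectors of each document and trains a linear classifier on top. The averaging strategy and the use of scikit-learn's LogisticRegression are my own choices here, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def doc_vector(tokens, embeddings, dim=200):
    """Represent a document by the mean of its in-vocabulary word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_text_classifier(docs, labels, embeddings):
    """docs: list of token lists; labels: theme names such as 'economy'."""
    X = np.stack([doc_vector(d, embeddings) for d in docs])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```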
Conclusion and Significance
The ASWE model proposed in this paper significantly enhances the semantic representation ability of Chinese word embeddings by introducing multi-level subword information and an attention mechanism. The model not only more accurately captures the complex semantic structures of Chinese words but also provides new solutions for NLP tasks involving Chinese text. Specifically, the ASWE model holds significant value in the following aspects:
- Scientific Value: The ASWE model provides new insights for research on Chinese word embeddings, particularly excelling in handling polysemy, fixed collocations, and complex linguistic phenomena.
- Application Value: The model can be widely applied to tasks such as Chinese text classification, sentiment analysis, and machine translation, especially demonstrating significant advantages in processing short texts and complex semantic scenarios.
Research Highlights
The highlights of this research include the following:
- Multi-Level Subword Information: The ASWE model is the first to decompose the internal structure of Chinese words into six levels of subword information and dynamically adjust the weights of each level through an attention mechanism, thereby more comprehensively extracting semantic information.
- Application of Attention Mechanism: The model effectively captures the complex semantic structures of Chinese words through self-attention and inter-level attention mechanisms, enhancing the representation ability of word vectors.
- Extensive Experimental Validation: The paper comprehensively validates the effectiveness of the ASWE model through various experiments, including word similarity, word analogy, text classification, and case studies.
Other Valuable Information
Although the ASWE model performs excellently in multiple tasks, its computational complexity and training time are relatively high. Future research could further optimize the model’s time performance, especially when processing large-scale corpora. Additionally, the concept of the ASWE model could be extended to dynamic word embeddings and large-scale pre-trained models, thereby further enhancing its application value.