Comparative Analysis of Hybrid and Ensemble Machine Learning Approaches in Predicting Football Player Transfer Values
Academic Background
In modern football economics, a player’s transfer market value is determined not only by on-field performance but also by factors such as popularity and social media presence. As the football industry globalizes, clubs increasingly rely on data-driven analysis for transfer market decisions. Traditional player evaluation methods, however, focus primarily on performance metrics such as goals and assists and overlook emerging factors such as social media activity and media coverage. Accurately predicting a player’s transfer value with machine learning and data science has therefore become an important research topic.
The study by Wenjing Zhang and Dan Cao addresses this issue. By combining traditional performance metrics with emerging social media data, they developed a hybrid machine learning model aimed at providing clubs with more accurate predictions of player market values, thereby assisting clubs in making more informed decisions in the transfer market.
Source of the Paper
This paper was co-authored by Wenjing Zhang and Dan Cao, affiliated with the Institute of Physical Education at Liaoning Finance and Trade College and the Physical Education Department at Shenyang Medical College, respectively. The paper was published in 2025 in the journal Cognitive Computation under the title Comparative Analysis of Hybrid and Ensemble Machine Learning Approaches in Predicting Football Player Transfer Values. The research utilized the FIFA 19 dataset, combined with real-world statistical data, covering 54 features for 491 players.
Research Process
1. Data Collection and Preprocessing
The first step of the research was data collection. The researchers extracted the FIFA 19 dataset from the Sofifa.com platform, which provides player attributes, performance metrics, social media activity, and transfer market values. The dataset included 54 features for 491 players. During the preprocessing phase, the researchers first removed seven samples with incomplete feature values. Subsequently, based on the players’ club and position information, two new feature columns were added: “League Name” and “Position.” To ensure the stability of the target value, 27 low-value players from non-mainstream leagues were also excluded. The final dataset included 47 features for 457 players.
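The filtering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ code: the field names (`club`, `value_eur`), the club-to-league lookup, the set of mainstream leagues, and the value threshold are all hypothetical, since the summary does not specify them. The "Position" feature would be derived in the same way and is omitted for brevity.

```python
def preprocess(players, club_to_league, mainstream_leagues, min_value_eur):
    """Drop incomplete rows, add a league feature, and filter low-value outliers."""
    cleaned = []
    for player in players:
        # Step 1: skip samples with any missing feature value.
        if any(value is None for value in player.values()):
            continue
        row = dict(player)
        # Step 2: derive a new "league_name" feature from the club (assumed lookup).
        row["league_name"] = club_to_league.get(row["club"], "Other")
        # Step 3: exclude low-value players from non-mainstream leagues.
        is_low_value = row["value_eur"] < min_value_eur
        if is_low_value and row["league_name"] not in mainstream_leagues:
            continue
        cleaned.append(row)
    return cleaned


# Toy records, invented purely for illustration.
players = [
    {"name": "Player A", "club": "Club X", "value_eur": 40_000_000},
    {"name": "Player B", "club": "Club Y", "value_eur": 300_000},
    {"name": "Player C", "club": "Club X", "value_eur": 500_000},
    {"name": "Player D", "club": "Club Y", "value_eur": None},
]
club_to_league = {"Club X": "League 1", "Club Y": "League 2"}
kept = preprocess(players, club_to_league, {"League 1"}, min_value_eur=1_000_000)
kept_names = [row["name"] for row in kept]
```

In this toy run, Player D is dropped for a missing value and Player B for being low-value in a non-mainstream league, while Player C is kept because the value filter applies only outside the mainstream leagues.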
2. Feature Selection
To reduce data dimensionality and improve model accuracy, the researchers employed two filter-based feature selection methods: Variance Inflation Factor (VIF) and Mutual Information. VIF was used primarily to detect multicollinearity among the features, while Mutual Information measured the statistical dependence between each feature and the target variable. Finally, the researchers applied the TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) method, combining Pearson correlation coefficients and Mutual Information scores, to rank the candidates and select the 20 features most relevant to predicting player market values.
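The TOPSIS ranking step can be illustrated with a short sketch. It assumes each feature has already been scored on two benefit criteria (absolute Pearson correlation and Mutual Information with the target) and that the criteria are weighted equally; the weights, feature names, and scores below are invented for illustration and are not taken from the paper.

```python
import math


def topsis_closeness(scores, weights):
    """Return the TOPSIS closeness coefficient for each row of `scores`.

    scores  : list of rows, one per feature, one column per benefit criterion
    weights : one non-negative weight per criterion (normalized internally)
    """
    n_criteria = len(weights)
    total_weight = sum(weights)
    w = [x / total_weight for x in weights]
    # Vector-normalize each criterion column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in scores)) or 1.0
             for j in range(n_criteria)]
    weighted = [[w[j] * row[j] / norms[j] for j in range(n_criteria)]
                for row in scores]
    # Ideal best / worst points (all criteria treated as "higher is better").
    best = [max(row[j] for row in weighted) for j in range(n_criteria)]
    worst = [min(row[j] for row in weighted) for j in range(n_criteria)]
    closeness = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        closeness.append(d_worst / total if total > 0 else 0.0)
    return closeness


# Hypothetical features with made-up [|Pearson r|, mutual information] scores.
feature_names = ["feature_a", "feature_b", "feature_c"]
scores = [[0.80, 0.60], [0.50, 0.40], [0.10, 0.05]]
closeness = topsis_closeness(scores, weights=[0.5, 0.5])
ranking = sorted(range(len(feature_names)), key=lambda i: closeness[i], reverse=True)
top_k = [feature_names[i] for i in ranking[:2]]  # the study keeps the top 20
```

A feature that is best on every criterion gets a closeness of 1, one that is worst on every criterion gets 0, and the selection keeps the highest-ranked features.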
3. Machine Learning Model Development
The researchers adopted two mainstream machine learning models, Extreme Gradient Boosting (XGBoost, XGB) and Adaptive Boosting (AdaBoost, Ada), and then developed hybrid versions of each by pairing the base model with a metaheuristic optimizer. Four metaheuristic optimization algorithms were used: Ali Baba and Forty Thieves Algorithm (AFT), Crystal Structure Algorithm (CSA), Henry Gas Solubility Optimization (HGSO), and Mayfly Optimization Algorithm (MOA). These algorithms improved the models’ predictive performance by fine-tuning their hyperparameters.
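To show the general shape of metaheuristic hyperparameter tuning, the sketch below runs a simple population-based search over a bounded hyperparameter space. It is deliberately generic and is not an implementation of AFT, CSA, HGSO, or MOA. The objective is a toy stand-in; in a real pipeline it would train the model with the candidate hyperparameters and return a cross-validated error. The hyperparameter names, bounds, and the optimum of the toy objective are assumptions made for illustration.

```python
import random


def population_search(objective, bounds, pop_size=20, iterations=50, seed=0):
    """Minimize `objective` over box `bounds` with a greedy population search."""
    rng = random.Random(seed)
    names = list(bounds)
    population = [{n: rng.uniform(*bounds[n]) for n in names}
                  for _ in range(pop_size)]
    scores = [objective(candidate) for candidate in population]
    best_index = min(range(pop_size), key=lambda i: scores[i])
    best, best_score = dict(population[best_index]), scores[best_index]

    for iteration in range(iterations):
        step = 1.0 - iteration / iterations  # shrink random moves over time
        for i, candidate in enumerate(population):
            trial = {}
            for n in names:
                low, high = bounds[n]
                pull = rng.random() * (best[n] - candidate[n])   # move toward best
                noise = rng.gauss(0.0, 0.1 * step * (high - low))  # exploration
                trial[n] = min(max(candidate[n] + pull + noise, low), high)
            trial_score = objective(trial)
            if trial_score < scores[i]:  # greedy: keep only improvements
                population[i], scores[i] = trial, trial_score
                if trial_score < best_score:
                    best, best_score = dict(trial), trial_score
    return best, best_score


def toy_objective(params):
    # Stand-in for a cross-validated RMSE; the optimum (0.1, 6) is arbitrary.
    # Integer hyperparameters such as max_depth would be rounded before training.
    return (params["learning_rate"] - 0.1) ** 2 + ((params["max_depth"] - 6) / 10) ** 2


bounds = {"learning_rate": (0.01, 0.5), "max_depth": (2.0, 12.0)}
best, best_score = population_search(toy_objective, bounds)
```

The named algorithms differ mainly in how they generate the trial candidates; the surrounding loop of proposing hyperparameters, scoring them, and keeping the best is the part this sketch is meant to convey.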
4. Model Evaluation and Results
The researchers evaluated the models’ performance using five-fold cross-validation and employed multiple statistical metrics, including the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). The results showed that the XGBoost model optimized with AFT (XGAF) performed the best, achieving an R² value of 0.9905 and an RMSE of €1.9 million, indicating that the model could predict player market values with high precision. Additionally, the researchers conducted a sensitivity analysis using Shapley Additive Explanations (SHAP), revealing that a player’s reaction ability, ball control, and dribbling skills were the most critical factors influencing market value predictions.
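The three reported metrics and the five-fold split can be written out directly. The sketch below uses the standard definitions of R², RMSE, and MAE and a shuffled five-fold partition over the 457 retained players; the random seed and the small example vectors are arbitrary choices for illustration, and the paper’s exact fold assignment is not known.

```python
import math
import random


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, and MAE from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Return (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


# Small worked example: one prediction is off by 1, the rest are exact.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
# 457 is the number of players retained after preprocessing.
splits = kfold_indices(457)
```

In a full evaluation, the model would be fitted on each training split and `regression_metrics` applied to its predictions on the matching test split, with the scores then averaged across the five folds.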
Conclusions and Significance
This study predicted football players’ transfer market values by combining traditional performance metrics with emerging social media data in a hybrid machine learning model. The AFT-optimized XGBoost model performed best, with a reported error rate of less than 10%. This outcome not only gives clubs a more accurate tool for player valuation but also offers new insights for data-driven decision-making in football economics.
Furthermore, the study highlighted the significant impact of players’ social media activity and popularity on their market value, opening new directions for future research. By introducing metaheuristic optimization algorithms, the researchers further enhanced the models’ predictive accuracy, showcasing the powerful potential of machine learning in complex data analysis and optimization problems.
Research Highlights
- Multidimensional Data Integration: The study considered not only traditional performance metrics but also incorporated social media data, providing a comprehensive evaluation of player market values.
- Hybrid Model Optimization: By introducing metaheuristic optimization algorithms, the researchers successfully improved the predictive accuracy of machine learning models.
- Sensitivity Analysis: Using the SHAP method, the researchers identified key factors influencing player market values, offering deeper insights for club decision-making.
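The idea behind the SHAP analysis mentioned in the highlights can be illustrated by computing exact Shapley values for a tiny model. This is a sketch under stated assumptions, not the SHAP library’s algorithm: it enumerates every feature subset, which is only feasible for a handful of features, and it treats a feature as "absent" by replacing it with a baseline value. The three-feature toy model and its inputs are invented.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) relative to predict(baseline)."""
    n = len(instance)

    def value(subset):
        # Features in `subset` take the instance's values; the rest use the baseline.
        mixed = [instance[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    attributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        attributions.append(total)
    return attributions


def toy_model(x):
    # Hypothetical model with one interaction term between features 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The attributions always sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term’s contribution is split equally between the two features involved. That additive decomposition is what allows features to be ranked by their influence on a prediction.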
Other Valuable Information
The researchers also noted that future studies could extend to other football leagues to validate the model’s generalizability. Additionally, as social media data continues to grow, future research could explore the impact of more emerging factors on player market values.
Through this study, Wenjing Zhang and Dan Cao have made valuable contributions to the intersection of football economics and machine learning, demonstrating the immense potential of data-driven decision-making in modern sports management.