Towards Transparent Deep Image Aesthetics Assessment with Tag-based Content Descriptors
Academic Background
With the proliferation of social media platforms such as Instagram and Flickr, there is an increasing demand for Image Aesthetics Assessment (IAA) models. These models can help social network service providers optimize image ranking and recommendation results, and can assist ordinary users in managing albums, choosing their best photos, and even getting guidance during shooting and editing. However, building a robust IAA model has always been challenging because image aesthetics depend on many intertwined factors, such as the objects depicted and the photographic techniques used.
Research Motivation
While existing deep learning methods perform well on IAA, their internal mechanisms remain opaque. Most studies predict image aesthetics by implicitly learning semantic features, without explicitly explaining what those features represent. The core goal of this paper is a more transparent IAA framework: it introduces interpretable semantic features in the form of human-readable tag descriptions of image content, and builds an IAA model on top of these explicit descriptions.
Research Origin
This paper is co-authored by Jingwen Hou (Nanyang Technological University), Weisi Lin (Nanyang Technological University), Yuming Fang (Jiangxi University of Finance and Economics), Haoning Wu (Nanyang Technological University S-Lab), Chaofeng Chen (Nanyang Technological University S-Lab), Liang Liao (Nanyang Technological University S-Lab), and Weide Liu (Agency for Science, Technology and Research, Singapore). It has been accepted for publication in IEEE Transactions on Image Processing.
Research Process
Explicit Matching Process
The authors first propose an explicit matching process that generates Tag-based Content Descriptors (TCD) from predefined tags. The specific steps are:
- Tag Selection and Definition: Select and define two sets of predefined tags: object-related tags and photography-technique-related tags.
- Feature Generation: Use CLIP's visual encoder and text encoder to encode images and text tags into visual features and text features, respectively.
- Similarity Calculation: Compute the similarity between the visual features and each tag's text features; the resulting per-tag similarity scores form the Tag-based Content Descriptor (TCD), as sketched in the code after this list.
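The following is a minimal sketch of this explicit matching step, assuming a CLIP backbone loaded from Hugging Face transformers. The checkpoint name, tag lists, and prompt wording are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch of explicit matching: encode an image and a set of predefined tags
# with CLIP, then take per-tag cosine similarities as the TCD.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical predefined tag sets: objects and photography techniques.
object_tags = ["a photo of a person", "a photo of a landscape", "a photo of an animal"]
technique_tags = ["a photo with shallow depth of field", "a photo using the rule of thirds"]
tags = object_tags + technique_tags

image = Image.open("example.jpg")
inputs = processor(text=tags, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feat = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize, then take cosine similarity: one score per tag, concatenated
# into the tag-based content descriptor (TCD).
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
tcd = image_feat @ text_feat.T   # shape: (1, num_tags)
```

Because each entry of `tcd` is tied to a named tag, the descriptor stays human-readable: a high value on "a photo with shallow depth of field" directly states what the model detected.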
To verify the performance of the TCD generator, the research team annotated 5,101 images with photography-related tags, creating a dedicated validation dataset.
Implicit Matching Process
Since predefined tags cannot cover all possible image content, the authors further propose an implicit matching process to describe content that the predefined tags miss. The specific steps are:
- Implicit Tag Definition: Assume the existence of implicit tag sets to describe high-level and low-level content, namely High-level Implicit Tags (HIT) and Low-level Implicit Tags (LIT).
- Optimization Process: Directly obtain the text features of implicit tags through an optimization process based on the IAA objective.
- Consistency Constraint: Introduce a consistency constraint so that implicit and explicit tags describe different semantic patterns, encouraging independence among the tag features (see the sketch after this list).
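Below is a hedged sketch of this idea: implicit tag features are free parameters optimized under the IAA objective, with a decorrelation penalty against the fixed explicit tag features. The exact loss form and dimensions in the paper may differ; this shows one plausible variant.

```python
# Sketch of implicit matching: learnable "text" features for implicit tags
# (HIT/LIT) that need no actual text, plus a consistency penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitTags(nn.Module):
    def __init__(self, num_tags: int, dim: int = 512):
        super().__init__()
        # Learnable implicit tag features, optimized end-to-end with the IAA loss.
        self.tag_feat = nn.Parameter(torch.randn(num_tags, dim) * 0.02)

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between image features and implicit tag features.
        img = F.normalize(image_feat, dim=-1)
        tag = F.normalize(self.tag_feat, dim=-1)
        return img @ tag.T               # (batch, num_implicit_tags)

def consistency_loss(implicit_feat: torch.Tensor, explicit_feat: torch.Tensor) -> torch.Tensor:
    # Penalize overlap between implicit and explicit tag features, pushing
    # implicit tags toward semantic patterns the predefined tags miss.
    imp = F.normalize(implicit_feat, dim=-1)
    exp_ = F.normalize(explicit_feat, dim=-1)
    return (imp @ exp_.T).pow(2).mean()
```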
Algorithm and Model
The TCD features produced by the two matching processes are used to train a simple multilayer perceptron (MLP) for IAA. The optimization objective combines the regression error between predicted and ground-truth aesthetic labels with the consistency constraint among tag features; a sketch follows.
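A minimal sketch of this downstream regressor, assuming a single scalar aesthetic score per image. The hidden size, learning rate, and TCD dimensionality are illustrative, not the paper's exact configuration.

```python
# Sketch of the IAA head: a small MLP over the concatenated
# explicit + implicit TCD, trained with a regression objective.
import torch
import torch.nn as nn

class TCDRegressor(nn.Module):
    def __init__(self, tcd_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(tcd_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),        # scalar aesthetic score
        )

    def forward(self, tcd: torch.Tensor) -> torch.Tensor:
        return self.mlp(tcd).squeeze(-1)

# Illustrative training step on a dummy batch: MSE to ground-truth scores.
# When implicit tags are trained jointly, the consistency term from the
# previous sketch would be added to this loss.
model = TCDRegressor(tcd_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
tcd, labels = torch.randn(8, 128), torch.rand(8)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(tcd), labels)
loss.backward()
optimizer.step()
```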
Main Experimental Results
- Explicit Matching Only: Using only the TCD generated from predefined tags achieves a Spearman rank-order correlation coefficient (SRCC) of 0.767, comparable to most current state-of-the-art methods.
- Explicit + Implicit Matching: Augmenting the TCD with the highly relevant components produced by the implicit matching process raises the SRCC to 0.817, surpassing existing methods.
Research Conclusions and Value
The research demonstrates that introducing human-readable Tag-based Content Descriptors (TCD) can significantly improve both the transparency and the performance of image aesthetics assessment. Specifically, the study makes the following main contributions:
- Transparent Interpretation: The study is the first in IAA to describe image content with human-readable text features derived from explicitly defined tags, improving the model's transparency.
- Performance Improvement: The implicit matching process further enhances the expressiveness of the TCD, significantly improving the IAA model's performance.
- Data Contribution: The study provides the first dataset annotated with photography-related tags, facilitating further research on tag-based content descriptors.
Research Highlights
- Transparent Deep Learning Framework: The study builds a transparent IAA framework that assesses image aesthetics from explicit descriptions of image content, making the semantic interpretation of features more intuitive.
- Comprehensive Performance Enhancement: By combining the explicit and implicit matching processes, the IAA model improves performance while maintaining high interpretability, aiding future research.
- Innovative Dataset: The photography-related tag dataset supports further validation of the TCD generator and broader application of the approach.
Application Prospects and Significance
By providing a transparent and effective image aesthetics assessment method, this study offers practical solutions for social media management, image search, and recommendation system optimization. It not only raises the intelligence level of automated image processing but also gives ordinary users well-grounded guidance for managing and editing their images. Overall, the work advances both the transparency and the performance of image aesthetics assessment, opening new directions for future research and applications.