Modeling Visual Attention Based on Gestalt Theory
Background Introduction
In the field of computer vision, research on visual attention models aims to simulate how the human visual system selects regions of interest from images or natural scenes. The human brain’s ability to quickly and accurately identify salient regions in visual scenes is of great significance in tasks such as image processing, object recognition, and image segmentation. However, effectively detecting multiple salient objects in images remains a challenging problem.
Gestalt Theory, a cornerstone of modern cognitive learning theory, emphasizes that “the whole is greater than the sum of its parts,” with similarity and proximity among its key grouping principles. Although Gestalt Theory provides important theoretical support for visual perception research, applying it to multi-salient-object detection still presents technical challenges. This study proposes a saliency model based on Gestalt Theory, the Color Similarity and Spatial Proximity (CSSP) model, which combines color similarity and spatial proximity to detect multiple salient objects in images more effectively.
Source of the Paper
The paper is co-authored by Guang-Hai Liu and Jing-Yu Yang, from the School of Computer Science and Engineering at Guangxi Normal University and the School of Computer Science and Technology at Nanjing University of Science and Technology, respectively. The paper was published in 2025 in the journal Cognitive Computation, titled “Modeling Visual Attention Based on Gestalt Theory.” It details the design, implementation, and experimental results of the CSSP model on multiple public datasets.
Research Process and Experimental Design
1. Model Design
The core idea of the CSSP model is to detect salient objects by combining color similarity and spatial proximity. The specific process includes the following steps:
1.1 Image Segmentation
First, the input image is segmented into multiple regions (superpixels) using the Simple Linear Iterative Clustering (SLIC) algorithm. The number of superpixels is set to 30 to ensure each region is of moderate size for subsequent processing.
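To make the segmentation step concrete, here is a minimal sketch of SLIC-style clustering: seeds are placed on a uniform grid, and pixels are assigned by k-means over a joint (color, position) feature. This is a toy stand-in, not the paper's implementation; real SLIC (e.g. `skimage.segmentation.slic`) also restricts the search to a local window around each seed and enforces region connectivity, and the `compactness` value here is an illustrative choice.

```python
import numpy as np

def simple_slic(image, n_segments=30, compactness=10.0, n_iters=5):
    """Simplified SLIC-like superpixel clustering (illustrative sketch).

    k-means over concatenated (color, scaled position) features; the
    compactness/step factor trades color fidelity against spatial
    regularity, as in SLIC.
    """
    h, w, _ = image.shape
    # Place cluster seeds on a roughly uniform grid.
    step = int(np.sqrt(h * w / n_segments))
    ys = np.arange(step // 2, h, step)
    xs = np.arange(step // 2, w, step)
    seeds = np.array([[y, x] for y in ys for x in xs])
    # Per-pixel feature: color channels plus position scaled by compactness.
    yy, xx = np.mgrid[0:h, 0:w]
    feats = np.concatenate(
        [image.reshape(h * w, -1).astype(float),
         (compactness / step) * np.stack([yy.ravel(), xx.ravel()], axis=1)],
        axis=1)
    centers = feats[seeds[:, 0] * w + seeds[:, 1]].copy()
    for _ in range(n_iters):
        # Assign each pixel to its nearest cluster center.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update centers as the mean of their assigned pixels.
        for k in range(len(centers)):
            mask = labels == k
            if mask.any():
                centers[k] = feats[mask].mean(axis=0)
    return labels.reshape(h, w)
```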
1.2 Region Retrieval
The color similarity between regions is measured with the Color Difference Histogram (CDH) method, which quantifies how alike two regions are by computing the color difference between them. In addition, a spatial proximity weight (w_d) is introduced to modulate inter-region distances, so that neighboring regions are more likely to be grouped as a whole.
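The following sketch illustrates this region-comparison step in a simplified form. The paper uses CDH descriptors; here each region is reduced to its mean color, and the proximity weight w_d is modeled as a Gaussian of centroid distance. Both simplifications, and the `sigma` value, are assumptions for illustration only.

```python
import numpy as np

def region_affinity(labels, image, sigma=0.25):
    """Pairwise region color difference and spatial-proximity weight.

    Simplified stand-in for the paper's CDH-based comparison: regions
    are summarized by mean color, and w_d is a Gaussian of normalized
    centroid distance (both are simplifying assumptions).
    """
    h, w = labels.shape
    ids = np.unique(labels)
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized centroid and mean color of each region.
    cents, colors = [], []
    for k in ids:
        m = labels == k
        cents.append([yy[m].mean() / h, xx[m].mean() / w])
        colors.append(image[m].mean(axis=0))
    cents, colors = np.array(cents), np.array(colors)
    # Pairwise color difference and spatial distance matrices.
    cdiff = np.linalg.norm(colors[:, None] - colors[None, :], axis=2)
    sdist = np.linalg.norm(cents[:, None] - cents[None, :], axis=2)
    wd = np.exp(-(sdist ** 2) / (2 * sigma ** 2))  # proximity weight w_d
    return cdiff, wd
```

Nearby regions get w_d close to 1 and distant regions close to 0, which is what lets adjacent regions be treated as a whole.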
1.3 Saliency Score Calculation
The CSSP model proposes two saliency score calculation methods: Uncontrolled Saliency Score (USS) and Controlled Saliency Score (CSS). USS is calculated based solely on color similarity and spatial proximity, while CSS further incorporates the logarithmic characteristic of color differences to better reflect the perceptual characteristics of the human visual system.
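A hedged sketch of the two scores, given a pairwise region color-difference matrix and proximity weights: USS sums proximity-weighted color differences (a region that differs strongly from nearby regions scores high), while CSS passes the differences through log(1 + x) to mimic the compressive response of human vision. The exact CSSP formulas are not reproduced here; this only captures the stated structure of the two scores.

```python
import numpy as np

def saliency_scores(cdiff, wd):
    """Uncontrolled (USS) and Controlled (CSS) saliency scores, sketched.

    cdiff: (n, n) region color-difference matrix.
    wd:    (n, n) spatial-proximity weights.
    CSS applies log1p to color differences to reflect the logarithmic
    perceptual characteristic described in the paper.
    """
    uss = (wd * cdiff).sum(axis=1)
    css = (wd * np.log1p(cdiff)).sum(axis=1)

    def norm(v):
        # Normalize each score vector to [0, 1] for later fusion.
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

    return norm(uss), norm(css)
```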
1.4 Saliency Map Fusion
The final saliency map is generated by fusing the USS and CSS scores. During the fusion process, a Sigmoid function is used to activate the saliency scores, reducing impurities around salient objects and highlighting their interior regions.
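The fusion step can be sketched as follows, taking the two per-region score vectors as input. Averaging the scores and the sigmoid `gain`/`center` values are illustrative assumptions; the paper's exact fusion rule may differ. The sigmoid's effect is the one described above: low scores (background impurities) are pushed toward 0 and high scores (object interiors) toward 1.

```python
import numpy as np

def fuse_saliency(uss, css, gain=10.0, center=0.5):
    """Fuse USS and CSS and sharpen with a sigmoid activation.

    Averaging and the (gain, center) parameters are illustrative
    choices, not the paper's exact fusion rule.
    """
    fused = 0.5 * (uss + css)
    return 1.0 / (1.0 + np.exp(-gain * (fused - center)))

def to_saliency_map(labels, scores):
    """Paint each region's fused score back onto the pixel grid."""
    return scores[labels]
```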
2. Experiments and Results
The study evaluated the CSSP model on three public datasets (ECSSD, MSRA10K, and DUT-OMRON) and compared it with various existing saliency detection methods.
2.1 Datasets
- ECSSD Dataset: Contains images with complex backgrounds, presenting a high challenge for saliency detection.
- MSRA10K Dataset: Contains 10,000 images with simple background structures, typically featuring a single salient object.
- DUT-OMRON Dataset: Contains 5,168 high-quality images with complex backgrounds, often featuring multiple salient objects.
2.2 Evaluation Metrics
Precision, Recall, F-measure, and Mean Absolute Error (MAE) were used as evaluation metrics.
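These four metrics are standard in the saliency literature and straightforward to compute from a saliency map and a ground-truth mask. The sketch below uses a fixed binarization threshold and the common beta^2 = 0.3 weighting for the F-measure; the paper's exact thresholding scheme (fixed vs. adaptive) is not specified here.

```python
import numpy as np

def evaluate(sal_map, gt_mask, thresh=0.5, beta2=0.3):
    """Precision, Recall, F-measure, and MAE for one saliency map.

    beta2 = 0.3 emphasizes precision, as is conventional in saliency
    evaluation; the fixed threshold is an illustrative choice.
    """
    pred = sal_map >= thresh
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    f_measure = (1 + beta2) * precision * recall / denom if denom else 0.0
    # MAE compares the continuous map against the binary ground truth.
    mae = np.abs(sal_map.astype(float) - gt.astype(float)).mean()
    return precision, recall, f_measure, mae
```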
2.3 Experimental Results
- ECSSD Dataset: The CSSP model achieved high precision and F-measure, significantly outperforming the comparison methods.
- MSRA10K Dataset: The CSSP model’s precision was slightly lower than that of the GBR and HS methods, but it performed well in terms of recall and F-measure.
- DUT-OMRON Dataset: The CSSP model outperformed all comparison methods across all metrics, particularly excelling in handling multiple salient objects.
3. Visual Comparison of Saliency Detection
Through visual comparison experiments, the CSSP model demonstrated outstanding performance in handling salient objects that touch image boundaries, significantly reducing gray patches inside salient objects and impurities around them. For example, when processing images with multiple salient objects, the CSSP model was able to detect all salient objects more accurately, while other methods suffered from missed or incorrect detections.
Conclusion and Significance
By combining the Gestalt principles of color similarity and spatial proximity, the CSSP model offers a simple yet efficient approach to saliency detection. Experimental results show that the model performs exceptionally well on complex backgrounds and images with multiple salient objects, significantly outperforming many existing methods. The CSSP model not only detects salient objects effectively but also handles objects that touch image boundaries, a capability that matters in many practical applications.
Research Highlights
- Innovation: The CSSP model is the first to combine color similarity and spatial proximity from Gestalt Theory, proposing a novel saliency detection method.
- Robustness: By introducing spatial proximity weights and exploiting the logarithmic characteristic of color differences, the CSSP model achieves more robust salient object detection.
- Application Value: The excellent performance of the CSSP model on multiple public datasets indicates its broad application prospects in practical applications such as image processing and object recognition.
Future Research Directions
Although the CSSP model has achieved significant results in saliency detection, it still has some limitations. For example, it may miss some salient regions when processing a group of salient objects. Future research plans to further optimize the model’s performance by incorporating deep learning techniques and exploring its potential in more practical applications.
Through this study, we have not only validated the feasibility of visual attention modeling based on Gestalt Theory but also provided a new research direction and method for the field of saliency detection.