GMConv: Modulating Effective Receptive Fields for Convolutional Kernels
Introduction
Convolutional Neural Networks (CNNs) have achieved remarkable success in computer vision tasks such as image classification and object detection. In recent years, however, Vision Transformers (ViTs) have drawn attention for their strong performance in visual recognition, sometimes surpassing CNNs. Even so, efforts to improve CNNs have continued, with much research devoted to designing new CNN architectures; in particular, CNNs with large kernels have achieved accuracy comparable to state-of-the-art ViTs.
This paper focuses on the Effective Receptive Field (ERF) of CNNs, which measures how much each input pixel contributes to an output activation. Prior research has found that ERFs typically follow a Gaussian distribution. Based on this observation, the authors propose the Gaussian Mask Convolutional Kernel (GMConv), which retains the standard convolutional kernel structure but adjusts the kernel's receptive field by applying a concentric, symmetric mask generated from a Gaussian function.
Source
This paper was authored by Qi Chen, Chao Li, Jia Ning, Stephen Lin, and Kun He (corresponding author), from Huazhong University of Science and Technology and Microsoft Research Asia. It was published in IEEE Transactions on Neural Networks and Learning Systems.
Background
Despite the strong performance of existing Convolutional Neural Networks (CNNs) in computer vision tasks, the receptive field (RF) of standard square convolutional kernels has inherent limitations: existing research indicates that the ERF is usually Gaussian-distributed rather than a uniform square. Against this backdrop, research has shifted toward adjusting the ERF more effectively to improve CNN performance. This motivates the authors' proposal of GMConv, which adjusts the receptive field of convolutional kernels with a Gaussian mask to improve performance in image classification and object detection.
Research Methodology
Research Process
Proposal of GMConv: GMConv comes in a static version (S-GMConv) and a dynamic version (D-GMConv). S-GMConv requires only one extra parameter (σ) to generate a concentric circular mask, while D-GMConv uses more parameters to control the mask distribution and includes a dynamic Sigma module that predicts the σ parameters from the input.
Implementation of GMConv: GMConv generates a mask from the Gaussian function and applies it element-wise to the standard convolutional kernel, thereby adjusting the kernel's receptive field. Using a Gaussian distribution for the mask avoids extreme values while preserving an effective RF.
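The mask generation and application described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' exact formulation; the normalization and the single-σ parameterization here are assumptions.

```python
import numpy as np

def gaussian_mask(kernel_size: int, sigma: float) -> np.ndarray:
    """Concentric, symmetric mask from an isotropic Gaussian.

    Each mask value depends only on the squared distance of the kernel
    cell from the kernel center, so the mask is radially symmetric.
    """
    c = (kernel_size - 1) / 2.0
    ys, xs = np.mgrid[0:kernel_size, 0:kernel_size]
    dist2 = (ys - c) ** 2 + (xs - c) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))  # center value is exp(0) = 1

def apply_gmconv_mask(kernel: np.ndarray, sigma: float) -> np.ndarray:
    """Element-wise product of the mask with a standard conv kernel.

    kernel has shape (out_channels, in_channels, kH, kW); the (kH, kW)
    mask broadcasts over the channel dimensions.
    """
    return kernel * gaussian_mask(kernel.shape[-1], sigma)

kernel = np.random.randn(8, 3, 5, 5)
masked = apply_gmconv_mask(kernel, sigma=1.0)
```

Because the mask decays smoothly away from the center, weights near the border are attenuated rather than cut off abruptly, which is how the Gaussian shape avoids extreme values at the RF boundary.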
Application of GMConv in CNNs: GMConv can be seamlessly integrated into existing CNN architectures; replacing standard convolutional kernels with GMConv kernels improves model performance across multiple benchmark datasets.
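One way such a drop-in replacement could look in PyTorch is sketched below: a `Conv2d` subclass that multiplies its weight by a Gaussian mask with a single learnable σ (the static version), plus a helper that swaps existing convolutions in place. The class and helper names are hypothetical, and the details (which layers to replace, how σ is learned) are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class SGMConv2d(nn.Conv2d):
    """Conv2d whose weight is modulated by a concentric Gaussian mask (sketch)."""

    def __init__(self, *args, init_sigma: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        # One extra learnable scalar, as in the static version (S-GMConv).
        self.sigma = nn.Parameter(torch.tensor(float(init_sigma)))
        k = self.kernel_size[0]
        c = (k - 1) / 2.0
        ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        self.register_buffer("dist2", ((ys - c) ** 2 + (xs - c) ** 2).float())

    def forward(self, x):
        mask = torch.exp(-self.dist2 / (2.0 * self.sigma ** 2))
        return self._conv_forward(x, self.weight * mask, self.bias)

def replace_convs(module: nn.Module) -> None:
    """Swap every >=3x3 nn.Conv2d in-place for the masked variant."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size[0] >= 3:
            new = SGMConv2d(child.in_channels, child.out_channels,
                            child.kernel_size, stride=child.stride,
                            padding=child.padding, bias=child.bias is not None)
            new.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                new.bias.data.copy_(child.bias.data)
            setattr(module, name, new)
        else:
            replace_convs(child)

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, padding=1))
replace_convs(model)
out = model(torch.randn(2, 3, 8, 8))
```

Because only the weight is rescaled before the convolution, the layer keeps the same input/output shapes and parameter layout as the original, which is what makes the replacement seamless.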
Experimental Design
Experiments were conducted across multiple standard datasets, including CIFAR-10 and CIFAR-100 for image classification, ImageNet for large-scale image classification, and COCO 2017 for object detection. Based on these benchmark datasets, the authors made a comprehensive comparison of GMConv’s performance in different network architectures and conducted ablation studies to analyze various aspects of GMConv.
Main Results
Results on CIFAR datasets: Experiments with ResNet-20, ResNet-56, and ResNet-18 showed that GMConv consistently improved accuracy while keeping parameter counts and computational cost close to those of the standard models.
Results on ImageNet: Models with GMConv achieved higher Top-1 accuracy. For example, on AlexNet, a network with large kernels, Top-1 accuracy improved by 0.98%.
Results for COCO object detection: With Faster R-CNN and Cascade R-CNN, GMConv noticeably improved detection performance, especially for small and medium-sized objects.
Ablation Studies
Effect of static GMConv: The static version (S-GMConv) performed well in most benchmark models, with exceptions such as MobileNetV2, where the small kernel sizes limit the possible gains.
Impact of the initial σ value: Comparing different initial σ values showed that a suitable initial receptive field (e.g., σ = 5) consistently improved model performance, whereas an overly large initial σ could reduce it.
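A quick numeric check makes the intuition behind this ablation concrete: with a small σ the mask collapses onto the kernel center (a very tight RF), while with a very large σ the mask is nearly uniform, so the layer degenerates toward a plain square kernel and the modulation contributes little. This is an illustrative sketch, not the paper's experiment.

```python
import numpy as np

def gaussian_mask(k: int, sigma: float) -> np.ndarray:
    """Concentric Gaussian mask over a k x k kernel grid."""
    c = (k - 1) / 2.0
    ys, xs = np.mgrid[0:k, 0:k]
    return np.exp(-((ys - c) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))

small = gaussian_mask(7, 0.5)  # tight: corner weights vanish, RF shrinks
large = gaussian_mask(7, 5.0)  # broad: corners stay large, ~standard conv
```

Comparing `small[0, 0]` with `large[0, 0]` shows the corner weight going from essentially zero to well above half the center weight as σ grows.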
Design of dynamic GMConv: Using the dynamic Sigma module to predict the parameters σ1 and σ for mask generation significantly improved model performance.
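A dynamic Sigma module of this kind can be sketched as a tiny squeeze-style head: global average pooling over the feature map followed by a small MLP whose output is squashed to a positive value per sample. The layer sizes, the softplus squashing, and the single-output head below are all hypothetical; the paper's head layout and its multi-sigma parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x: np.ndarray) -> np.ndarray:
    """Smooth positive squashing: log(1 + exp(x))."""
    return np.log1p(np.exp(x))

class DynamicSigma:
    """Predicts a positive sigma per input sample (illustrative sketch)."""

    def __init__(self, channels: int, hidden: int = 8):
        self.w1 = rng.standard_normal((channels, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, 1)) * 0.1

    def __call__(self, x: np.ndarray) -> np.ndarray:  # x: (N, C, H, W)
        pooled = x.mean(axis=(2, 3))                  # global average pool -> (N, C)
        h = np.maximum(pooled @ self.w1, 0.0)         # ReLU
        return softplus(h @ self.w2) + 1e-3           # (N, 1), strictly positive

head = DynamicSigma(channels=16)
sigma = head(rng.standard_normal((4, 16, 8, 8)))
```

Because σ is predicted from the input, each sample gets its own mask width, which is what distinguishes D-GMConv from the static version's single shared σ.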
Visualization Analysis
Visualization of the receptive field mask: The visualizations show that GMConv mainly reshapes the receptive fields of shallow CNN layers. Compared with standard convolution, GMConv yields smaller receptive fields in shallow layers and larger ones in deep layers, which is more conducive to model performance.
Visualization of the effective receptive field: GMConv produced more compact ERFs in object detection, which is particularly helpful for locating small objects. Combined with deformable convolution, it alleviated ERF dispersion, yielding more precise and concentrated ERFs.
Conclusion
The authors proposed GMConv, which adjusts the receptive field of convolutional kernels with Gaussian masks, significantly improving neural network performance on image classification and object detection. The static and dynamic versions are intended for different layers of the network to balance performance and complexity. Experiments show that GMConv improves model performance while keeping existing CNN architectures intact, and that a smaller receptive field in shallow layers is especially beneficial. Future neural network designs can draw on this finding to build more efficient architectures.