Enhancing Aerial Object Detection with Selective Frequency Interaction Network

Selective Frequency Interaction Network for Improved Aerial Object Detection

Background and Problem Statement

With the advancements in computer vision, aerial object detection has become a critical research focus in remote sensing. This task aims to identify targets such as vehicles or buildings from aerial images captured at varying angles and altitudes. It has broad applications in domains such as environmental monitoring, disaster management, and security surveillance. However, aerial object detection faces several challenges due to variations in scale, orientation, and the complexities of background environments, such as densely packed targets, variations in illumination, and perspective changes.

Current convolutional neural network (CNN)-based approaches primarily emphasize spatial and channel interactions while undervaluing the importance of frequency domain information. Frequency domain features are essential for capturing specific attributes like textures and edges. However, existing weight assignment mechanisms, such as channel attention, struggle to fully utilize frequency domain information, often leading to information loss. This limitation underscores the need for deeper exploration into the extraction and integration of frequency domain features.

To tackle this problem, the paper introduces the Selective Frequency Interaction Network (SFI Network). The network incorporates two key modules, the Selective Frequency-domain Feature Extraction (SFFE) module and the Selective Frequency-domain Features Interaction (SFFI) module, designed to improve detection performance by letting frequency-domain features interact and fuse with the spatial-domain feature maps (referred to below as time-domain features).

Paper Origin and Authors

This work is a collaboration among researchers from multiple Chinese institutions, including Weijie Weng from the Xiamen University of Technology, Mengwan Wei from the Jiangsu Earthquake Administration, Junchi Ren from China Telecom Corporation Limited, and Fei Shen from Nanjing University of Science and Technology and Tencent AI Lab (corresponding author). It was published in the December 2024 issue of IEEE Transactions on Artificial Intelligence (Vol. 5, No. 12).

Proposed Method

The paper proposes the SFI network to enhance aerial object detection accuracy through frequency domain analysis and cross-channel interactions. The design of the SFI framework, along with its feature extraction and interaction mechanisms, is elaborated as follows:

1. Overall Framework Design

The SFI network operates through two core modules (SFFE and SFFI):

  • SFFE Module: Employs 2D Discrete Cosine Transform (2D-DCT) to extract frequency domain information from input feature maps, capturing both high- and low-frequency components. This frequency-domain analysis extracts critical features such as edges and textures.
  • SFFI Module: Facilitates efficient cross-channel interaction and fusion of frequency domain features without the information loss associated with traditional channel attention methods. The module uses 1D convolutional operations with multiple kernels to extract and assign feature weights, which are then integrated with time-domain feature maps.

This framework can seamlessly integrate with existing backbone networks (e.g., ResNet, FPN) to enhance feature extraction for aerial object detection tasks.

2. Module Details

(1) SFFE Module

The frequency-domain feature extraction module applies 2D-DCT to obtain Direct Current (DC) and Alternating Current (AC) coefficients. The DC coefficient captures low-frequency content (e.g., flat regions), while the AC coefficients capture higher-frequency details (e.g., textures and edges). The SFFE module divides the feature maps into multiple segments along the channel dimension and assigns each segment a specific frequency component. Each channel is multiplied point-to-point with the DCT basis of its assigned frequency and summed over all spatial positions, and the results are aggregated into a single frequency-domain feature vector, enabling hierarchical feature extraction.
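The channel-split DCT step described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (dct_basis, dct_coefficient, sffe_vector), the unnormalized cosine basis, and the example frequency list are all assumptions made here for clarity.

```python
import math

def dct_basis(u, v, height, width):
    """2D-DCT cosine basis for frequency index (u, v) on a height x width grid."""
    return [
        [
            math.cos(math.pi * u * (i + 0.5) / height)
            * math.cos(math.pi * v * (j + 0.5) / width)
            for j in range(width)
        ]
        for i in range(height)
    ]

def dct_coefficient(channel, u, v):
    """Unnormalized 2D-DCT coefficient of one H x W channel at frequency (u, v)."""
    height, width = len(channel), len(channel[0])
    basis = dct_basis(u, v, height, width)
    # Point-to-point product with the basis, summed over all positions.
    return sum(
        channel[i][j] * basis[i][j]
        for i in range(height)
        for j in range(width)
    )

def sffe_vector(feature_map, frequencies):
    """Assign one frequency to each channel group and return one value per channel.

    feature_map: list of C channels, each an H x W nested list.
    frequencies: list of (u, v) pairs (hypothetical choice); C must be
                 divisible by len(frequencies).
    """
    num_channels = len(feature_map)
    if num_channels % len(frequencies) != 0:
        raise ValueError("channel count must be divisible by the number of frequencies")
    group_size = num_channels // len(frequencies)
    return [
        dct_coefficient(channel, *frequencies[index // group_size])
        for index, channel in enumerate(feature_map)
    ]

# Two toy channels: a flat region and a vertical edge pattern.
flat = [[1.0, 1.0], [1.0, 1.0]]
edge = [[1.0, -1.0], [1.0, -1.0]]
freq_vector = sffe_vector([flat, edge], [(0, 0), (0, 1)])
```

In this toy case the (0, 0) frequency (the DC term) responds to the flat channel, while the (0, 1) frequency (an AC term) responds to the edge channel and gives roughly zero on the flat one.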

(2) SFFI Module

This module uses 1D convolution to enable interactions between frequency-domain features. Experiments show that using two convolution kernels of different sizes (e.g., 3 and 15) yields the best results, balancing computational efficiency and interaction efficacy. After convolution, the resulting frequency-domain feature weights are fused with the time-domain feature maps to create more robust feature representations for object detection.
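A rough sketch of this interaction step follows. The source does not spell out how the two convolution branches are combined or how the weights are applied, so the sum of branches, the sigmoid gate, the zero padding, and the averaging kernels (stand-ins for learned weights) are assumptions, as are all function names.

```python
import math

def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    output = []
    for center in range(len(signal)):
        total = 0.0
        for offset, weight in enumerate(kernel):
            index = center + offset - half
            if 0 <= index < len(signal):
                total += weight * signal[index]
        output.append(total)
    return output

def sigmoid(value):
    """Numerically stable logistic function."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)

def sffi_weights(freq_vector, kernels):
    """Per-channel weights from several 1D convolutions along the channel axis.

    Assumption: branch outputs are summed, then squashed with a sigmoid.
    """
    branches = [conv1d_same(freq_vector, kernel) for kernel in kernels]
    return [sigmoid(sum(values)) for values in zip(*branches)]

def apply_channel_weights(feature_map, weights):
    """Scale every value of channel c by weights[c]."""
    return [
        [[weight * value for value in row] for row in channel]
        for channel, weight in zip(feature_map, weights)
    ]

# Placeholder kernels of sizes 3 and 15; in a trained network these are learned.
kernels = [[1.0 / 3] * 3, [1.0 / 15] * 15]
```

The small kernel mixes each channel with its immediate neighbours, while the large kernel lets distant channels influence the weight. That is one plausible reading of why two kernel sizes are used.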

3. Loss Function and Framework Integration

The SFI network is trained with a combination of cross-entropy loss (Fcls) for classification and Smooth L1 loss (Freg) for box regression. Because the design is modular, the SFI network can be integrated into existing detection frameworks, for example at the upsampling stage of Feature Pyramid Networks (FPNs).
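For readers unfamiliar with these two terms, here is a minimal per-sample sketch of how such a combined loss is commonly computed. The threshold beta, the balancing factor reg_weight, and the mean reduction over box coordinates are assumptions chosen for illustration; the source does not state the values used in the paper.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample (log-sum-exp form for stability)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]

def smooth_l1(predicted, target, beta=1.0):
    """Smooth L1 loss averaged over the box coordinates.

    Quadratic for small errors (|diff| < beta), linear otherwise.
    """
    total = 0.0
    for p, t in zip(predicted, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(predicted)

def detection_loss(logits, target_index, box_pred, box_target, reg_weight=1.0):
    """Classification term plus weighted regression term (weight is hypothetical)."""
    cls_term = cross_entropy(logits, target_index)
    reg_term = smooth_l1(box_pred, box_target)
    return cls_term + reg_weight * reg_term
```

The quadratic region keeps gradients small near the target, while the linear region limits the influence of large regression errors.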

Experimental Results

The authors conducted extensive experiments and comparisons on three public datasets—DOTA v1.0, DOTA v1.5, and HRSC2016—showcasing the effectiveness and superior performance of the SFI network.

1. DOTA Dataset Results

The DOTA dataset is a comprehensive dataset for large-scale aerial object detection. Experimental results show that the SFI network significantly outperforms state-of-the-art (SOTA) baselines in Horizontal Bounding Box (HBB) and Oriented Bounding Box (OBB) detection tasks:

  • On DOTA v1.0, the OBB detection mAP (mean Average Precision) reaches 81.32%, an improvement of more than 5% over traditional methods.
  • On DOTA v1.5, the SFI network excels in detecting small objects (e.g., small vehicles, helicopters) and deformable targets (e.g., bridges), demonstrating superior performance.
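To make the HBB/OBB distinction concrete, the sketch below converts an oriented box into its four corners and then into the smallest axis-aligned box that encloses it. The (cx, cy, width, height, angle) parameterization with the angle in radians is an assumption used here for illustration; detectors and datasets differ in how they encode oriented boxes.

```python
import math

def obb_corners(cx, cy, width, height, angle):
    """Corner points of an oriented box rotated by `angle` radians about its centre."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    corners = []
    for dx, dy in ((-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)):
        x, y = dx * width, dy * height
        # Standard 2D rotation followed by translation to the box centre.
        corners.append((cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a))
    return corners

def enclosing_hbb(corners):
    """Smallest axis-aligned box (xmin, ymin, xmax, ymax) containing all corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# A 4 x 2 box rotated by 90 degrees occupies a 2 x 4 axis-aligned footprint.
rotated_hbb = enclosing_hbb(obb_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2))
```

For elongated, rotated targets the enclosing HBB covers much more background than the OBB, which is why oriented boxes are often preferred for aerial imagery.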

2. HRSC2016 Dataset Results

HRSC2016 is a specialized dataset for detecting ships at arbitrary orientations. The SFI network achieved 90.7% mAP under the VOC2007 metric and 98.47% under the VOC2012 metric, surpassing the compared algorithms and highlighting its ability to handle multi-angle ship recognition.
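The two figures differ because the metrics compute average precision differently: the VOC2007 protocol averages interpolated precision at 11 recall levels, whereas the later VOC protocol integrates the area under the precision-recall envelope. The sketch below illustrates both on a single class. The function names and the toy precision-recall points are illustrative, and matching detections to ground truth is assumed to have been done already.

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP: mean of max precision at recall >= 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for step in range(11):
        threshold = step / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11

def voc12_ap(recalls, precisions):
    """Area under the monotonically non-increasing precision envelope.

    recalls must be sorted in ascending order.
    """
    rec = [0.0] + list(recalls) + [1.0]
    prec = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    return sum((rec[i] - rec[i - 1]) * prec[i] for i in range(1, len(rec)))

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
toy_recalls = [0.5, 1.0]
toy_precisions = [1.0, 0.5]
ap07 = voc07_ap(toy_recalls, toy_precisions)
ap12 = voc12_ap(toy_recalls, toy_precisions)
```

Even on the same precision-recall curve the two protocols yield different numbers, so results reported under one metric should not be compared with results reported under the other.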

3. Ablation Studies

Ablation studies were conducted to evaluate the independent and combined contributions of the SFFE and SFFI modules. Using only the SFFE module improved detection accuracy by 0.6%, while adding the SFFI module further improved it by over 2%. These results strongly validate the effectiveness of both modules.

4. Visualization Analysis

Visualization results demonstrate that the SFI network consistently outperforms baseline models in challenging scenarios such as occlusion, illumination changes, and dense target distributions. The network effectively locates targets while minimizing false negatives and false positives.

Significance and Future Outlook

The SFI network makes significant contributions in both methodology and practical applications:

  1. Methodological Innovation: Introduces frequency domain interaction to aerial object detection, addressing fundamental limitations in feature extraction and information fusion.
  2. Broad Applicability: Provides a robust approach for diverse applications in environmental monitoring, military surveillance, and disaster management.
  3. Modular Design: Ensures seamless integration into existing CNN architectures, offering new opportunities for future research in object detection.

Future Work

The authors plan to extend the SFI network to a variety of deep learning architectures, including Transformers and other frameworks, to evaluate its performance in more complex scenarios.