A Study on End-to-End Visual Semantic Localization Using Multi-View Images

Background and Research Significance

With the rapid development of intelligent driving technology, precise localization of autonomous vehicles has become a major focus of both research and industry. Accurate vehicle localization is not only a core module of autonomous driving but also an important component of advanced driver assistance systems (ADAS). Traditional visual localization methods often rely on geometric models and complex parameter tuning, which limits their robustness and scalability in challenging scenarios. In addition, traditional feature extraction methods (e.g., SIFT, SURF, ORB) perform poorly in dynamic environments and under varying weather and lighting conditions. Recently, high-definition (HD) maps with rich semantic information have been shown to enhance localization robustness. However, efficiently achieving cross-modal matching between multi-view images and semantic maps, while avoiding complex geometric optimization and multi-stage parameter tuning, remains a major challenge in the field.

To address these challenges, the research proposes a novel end-to-end visual semantic localization framework named “BEV-Locator.” This method integrates multi-view images and semantic maps and employs a cross-modal Transformer module for information interaction and vehicle pose decoding, aiming to significantly improve localization accuracy and applicability in autonomous driving scenarios.

Paper Source

This research was collaboratively conducted by teams from several institutions, including the University of International Business and Economics, Tsinghua University, Queen Mary University of London, and Qcraft Inc. The research findings were published in the February 2025 issue of Science China Information Sciences (Volume 68, Issue 2), under the title “BEV-Locator: An End-to-End Visual Semantic Localization Network Using Multi-View Images.” The paper was authored by Zhihuang Zhang, Meng Xu (corresponding author), Wenqiang Zhou, Tao Peng, Liang Li, and Stefan Poslad.

Research Process

Research Objectives and Problem Definition

The objective is to solve the visual semantic localization problem: given multi-view camera images, a semantic HD map, and the vehicle's initial pose, predict the vehicle's precise pose. The framework takes as input the multi-view images and the semantic map elements projected around the initial pose, and outputs the vehicle's pose offset, i.e., the deviation (∆x, ∆y, ∆ψ) from that initial pose.
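
To make the output representation concrete, the following minimal sketch (not from the paper) shows how a predicted offset (∆x, ∆y, ∆ψ) could be composed with an initial SE(2) pose to obtain the corrected pose. The function name and the convention that the offset is expressed in the initial pose's local frame are assumptions made for illustration.

```python
import numpy as np

def apply_pose_offset(x, y, psi, dx, dy, dpsi):
    """Compose an initial SE(2) pose (x, y, psi) with a predicted offset.

    Assumes (dx, dy) is expressed in the initial pose's local frame
    (dx forward/longitudinal, dy left/lateral) and dpsi is the yaw
    correction in radians.
    """
    # Rotate the local offset into the global frame, then translate.
    x_new = x + dx * np.cos(psi) - dy * np.sin(psi)
    y_new = y + dx * np.sin(psi) + dy * np.cos(psi)
    # Add the yaw correction and wrap the result back into (-pi, pi].
    psi_new = psi + dpsi
    psi_new = np.arctan2(np.sin(psi_new), np.cos(psi_new))
    return x_new, y_new, psi_new

# Example: a small correction applied to an initial pose.
print(apply_pose_offset(10.0, 5.0, np.pi / 4, 0.08, -0.05, np.deg2rad(0.5)))
```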

Research Framework

The study proposes a novel end-to-end framework comprising four main modules: the visual BEV (Bird's-Eye-View) encoder, the semantic map encoder, the cross-modal Transformer module, and the pose decoder.

1. Visual BEV Encoder

The visual BEV encoder extracts features from multi-view images and projects them into BEV space through the following steps:

  • Image Feature Extractor: EfficientNet (pre-trained on ImageNet) is used to extract features from the multiple camera images, compressing each image into multi-channel feature maps.
  • View Transformation Module: Using an MLP and the camera's extrinsic parameters, the features in the camera coordinate system are transformed into BEV space.
  • Feature Dimension Reduction Module: A ResNet is employed to reduce the high-dimensional BEV features into lower-resolution multi-channel BEV feature maps.

The 2D BEV features are then flattened into a 1D sequence and supplemented with positional embedding to provide spatial order information for subsequent Transformer inputs.
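
The following PyTorch-style sketch illustrates the overall shape of this pipeline: a per-camera backbone, a simplified view transform into the BEV grid, a convolutional reduction step, and the flatten-plus-positional-embedding stage. The module choices, tensor dimensions, and the simplified view transform (a learned projection rather than the paper's extrinsics-aware transformation) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualBEVEncoder(nn.Module):
    """Sketch: multi-view images -> flattened BEV token sequence."""

    def __init__(self, num_cams=6, feat_dim=256, bev_h=16, bev_w=16):
        super().__init__()
        # Stand-in image feature extractor; the paper uses an
        # ImageNet-pretrained EfficientNet here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=8, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((bev_h, bev_w)),
        )
        # Simplified view transformation: a learned projection mixing the
        # per-camera features (the actual module also uses camera extrinsics).
        self.view_transform = nn.Linear(num_cams * feat_dim, feat_dim)
        # Feature dimension reduction; the paper uses a ResNet here.
        self.reduce = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)
        # Learnable positional embedding for the flattened BEV sequence.
        self.pos_embed = nn.Parameter(torch.zeros(bev_h * bev_w, feat_dim))

    def forward(self, images):                  # images: (B, N, 3, H, W)
        b, n = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1))            # (B*N, D, bh, bw)
        feats = feats.view(b, n, *feats.shape[1:])              # (B, N, D, bh, bw)
        feats = feats.permute(0, 3, 4, 1, 2).flatten(3)         # (B, bh, bw, N*D)
        bev = self.view_transform(feats).permute(0, 3, 1, 2)    # (B, D, bh, bw)
        bev = self.reduce(bev)
        tokens = bev.flatten(2).transpose(1, 2)                 # (B, bh*bw, D)
        return tokens + self.pos_embed                          # add spatial order info

# Example: six 224x352 camera images per sample -> 256 BEV tokens of dim 256.
enc = VisualBEVEncoder()
print(enc(torch.randn(2, 6, 3, 224, 352)).shape)  # torch.Size([2, 256, 256])
```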

2. Semantic Map Encoder

Semantic maps include various elements (e.g., lane dividers, signs, crosswalks) that are discretely represented as points, lines, or polygons. The study employs a method inspired by VectorNet to encode these elements into structured vectors through the following steps:

  • Each semantic element is first encoded into a high-dimensional node vector using a shared MLP.
  • A max pooling layer aggregates the node information into a global vector representation (referred to as map queries).
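
A minimal sketch of this VectorNet-style element encoding is given below, assuming each map element is supplied as a fixed-length list of points with a few per-point attributes (e.g., coordinates and a semantic type); the feature layout and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticMapEncoder(nn.Module):
    """Sketch: polyline points -> one map query vector per element."""

    def __init__(self, point_dim=4, hidden_dim=128, query_dim=256):
        super().__init__()
        # Shared point-wise MLP applied to every node of every element.
        self.node_mlp = nn.Sequential(
            nn.Linear(point_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, query_dim),
        )

    def forward(self, elements):
        # elements: (B, num_elements, num_points, point_dim), e.g. (x, y, type, order)
        nodes = self.node_mlp(elements)          # per-node embeddings
        # Max pooling over the points of each element -> one map query per element.
        queries, _ = nodes.max(dim=2)            # (B, num_elements, query_dim)
        return queries

# Example: 30 map elements, each sampled as 20 points with 4 attributes.
enc = SemanticMapEncoder()
print(enc(torch.randn(2, 30, 20, 4)).shape)      # torch.Size([2, 30, 256])
```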

3. Cross-Modal Transformer Module

This module adopts an encoder-decoder architecture to model the mapping relationships between the BEV features and the semantic map:

  • Encoder: Performs self-attention over the BEV feature sequence to extract global information.
  • Decoder: Uses cross-attention, driven by the map queries, to extract the spatial constraints between the vehicle and the map elements.

Notably, a custom-designed positional embedding mechanism is applied to the value term in the cross-attention operation, improving the matching capability between semantic map elements and BEV features.
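
The sketch below outlines this encoder-decoder interaction with standard Transformer layers, feeding the flattened BEV tokens to the encoder and letting the map queries drive the decoder's cross-attention. It is an approximation: the custom positional embedding applied to the value term is not reproduced, and the layer count and dimensions are assumed.

```python
import torch
import torch.nn as nn

class CrossModalTransformer(nn.Module):
    """Sketch: BEV tokens (encoder) interact with map queries (decoder)."""

    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # Encoder: self-attention over the flattened BEV feature sequence.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder: map queries cross-attend to the encoded BEV memory.
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, bev_tokens, map_queries):
        # bev_tokens: (B, num_bev_tokens, d_model); map_queries: (B, num_elements, d_model)
        memory = self.encoder(bev_tokens)
        # The decoder output is a set of "semantic queries" encoding the spatial
        # constraints between the ego vehicle and each map element.
        return self.decoder(map_queries, memory)

# Example: 256 BEV tokens and 30 map queries.
model = CrossModalTransformer()
out = model(torch.randn(2, 256, 256), torch.randn(2, 30, 256))
print(out.shape)  # torch.Size([2, 30, 256])
```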

4. Pose Decoder

This module aggregates the global information of the semantic queries through a max pooling layer and predicts the vehicle's pose offset (∆x, ∆y, ∆ψ) with an MLP.
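
A corresponding sketch of this decoding step follows; the hidden size is an assumption, and the three outputs are read directly as (∆x, ∆y, ∆ψ).

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Sketch: semantic queries -> pose offset (dx, dy, dpsi)."""

    def __init__(self, d_model=256, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),        # (dx, dy, dpsi)
        )

    def forward(self, semantic_queries):     # (B, num_elements, d_model)
        # Max pooling aggregates the per-element queries into one global vector.
        global_feat, _ = semantic_queries.max(dim=1)
        return self.mlp(global_feat)         # (B, 3)

# Example: decode a pose offset from 30 semantic queries.
dec = PoseDecoder()
print(dec(torch.randn(2, 30, 256)).shape)    # torch.Size([2, 3])
```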

Datasets and Experimental Process

The study validates the framework on two large-scale autonomous driving datasets:

  • NuScenes Dataset: Covers 242 kilometers of driving across 1,000 scenes and provides multimodal sensor data (6 cameras, LiDAR, radar, etc.) together with semantic HD maps containing 11 layers.
  • Qcraft Dataset: Spans 400 kilometers and includes trajectories generated using 7 cameras and high-precision RTK, as well as precise semantic maps.

Experimental Design

  • BEV-Locator was trained to recover the ground-truth pose from initial poses perturbed by random offsets (lateral ±1 meter, longitudinal ±2 meters, yaw ±2°); see the sketch after this list.
  • The impact of different BEV grid sizes (0.15m, 0.25m, 0.5m) on model accuracy was analyzed.
  • Ablation studies were conducted to evaluate the contributions of the Transformer encoder, self-attention mechanisms, and positional embeddings.
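
As a concrete illustration of the training-sample generation referenced above, the snippet below draws a random perturbation within the stated ranges and applies it to a ground-truth pose to obtain a simulated initial pose; the regression target is the inverse of that perturbation. The sampling convention (uniform noise, offsets applied in the vehicle frame) is an assumption made for illustration.

```python
import numpy as np

# Perturbation ranges from the experimental setup.
LAT_RANGE = 1.0                  # lateral offset, metres
LON_RANGE = 2.0                  # longitudinal offset, metres
YAW_RANGE = np.deg2rad(2.0)      # yaw offset, radians

def perturb_pose(x, y, psi, rng=np.random):
    """Create a simulated initial pose from a ground-truth pose.

    Returns the perturbed pose and the pose offset the network should
    learn to predict (the exact SE(2) inverse of the perturbation,
    expressed in the perturbed pose's frame).
    """
    d_lon = rng.uniform(-LON_RANGE, LON_RANGE)   # along the heading
    d_lat = rng.uniform(-LAT_RANGE, LAT_RANGE)   # perpendicular to the heading
    d_psi = rng.uniform(-YAW_RANGE, YAW_RANGE)
    # Apply the offset in the vehicle frame of the ground-truth pose.
    x_init = x + d_lon * np.cos(psi) - d_lat * np.sin(psi)
    y_init = y + d_lon * np.sin(psi) + d_lat * np.cos(psi)
    psi_init = psi + d_psi
    # Regression target: the correction mapping the initial pose back to the truth.
    c, s = np.cos(d_psi), np.sin(d_psi)
    target = (-(c * d_lon + s * d_lat), s * d_lon - c * d_lat, -d_psi)
    return (x_init, y_init, psi_init), target

print(perturb_pose(10.0, 5.0, np.pi / 4))
```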

Experimental Results and Findings

Accuracy Performance

On the NuScenes dataset, the model achieved high localization accuracy, with a lateral error of 0.076 meters, a longitudinal error of 0.178 meters, and a yaw error of 0.510°. On the Qcraft dataset, whose road structure is clearer, accuracy improved further to 0.052 meters (lateral), 0.135 meters (longitudinal), and 0.251° (yaw).

Visualization Results

The experiments validated localization accuracy by projecting semantic maps onto multi-view images. In most scenarios, BEV-Locator accurately predicted the vehicle’s pose, aligning semantic map elements with real-world objects in images.

Ablation Study Results

  • The Transformer encoder significantly enhanced global feature interaction, reducing longitudinal and lateral errors.
  • The dynamic positional embedding strategy was crucial for improving semantic query performance, especially in the longitudinal direction.

Research Implications and Practical Value

The BEV-Locator framework innovatively formulates the visual semantic localization problem as an end-to-end learning task, avoiding the complexity of traditional multi-stage processes. As a precise and deployable algorithm, it holds broad applicability in autonomous driving. Its accuracy and robustness enhance vehicle localization capabilities and demonstrate the feasibility of integrating semantic map matching into BEV-based perception systems, providing new technical support for future route planning and control in intelligent driving.

The research is notable for its methodological innovation and high-precision experimental results, offering new directions for visual semantic localization studies. In the future, the team plans to integrate BEV-Locator with other BEV-based perception tasks to provide a unified solution for autonomous driving systems.