Efficient CORDIC-based Activation Function Implementations for RNN Acceleration on FPGAs


Background and Research Significance

In recent years, with the rapid advancement of deep learning technologies, Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have demonstrated powerful capabilities in time-series tasks. These include applications in Natural Language Processing (NLP), speech recognition, and medical diagnosis, among others. However, compared to Convolutional Neural Networks (CNNs), RNN models face significantly higher computational costs due to their complexity and extensive use of nonlinear activation functions. This issue becomes a bottleneck, especially when deploying RNN models on resource-constrained edge devices.

As a critical component of deep neural networks, activation functions introduce the nonlinearity that gives models their expressive power. However, the extensive use of functions such as tanh, sigmoid, and arctan significantly increases computational overhead and latency. Traditional approximation methods, such as polynomial fitting, lookup tables (LUTs), and piecewise linear approximation, reduce this complexity but fail to strike an ideal balance among accuracy, resource consumption, and latency.
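As a concrete example of one such traditional baseline, the well-known PLAN scheme from the hardware literature approximates the sigmoid with a few shift-friendly linear segments. The sketch below uses the standard PLAN coefficients and is illustrative of the kind of approximation the paper improves upon; it is not taken from the paper itself:

```python
def pwl_sigmoid(x):
    """Piecewise-linear sigmoid (classic PLAN approximation).
    All slopes are powers of two, so in hardware each segment
    costs only a shift and an add. Max error is roughly 0.019."""
    ax = abs(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625
    else:
        y = 0.25 * ax + 0.5
    # Odd symmetry around (0, 0.5) handles negative inputs
    return y if x >= 0 else 1.0 - y
```

The coarse segments keep resource usage low, but as the surrounding text notes, tightening the accuracy of such schemes quickly inflates the table or segment count.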

To address these challenges, the authors of this study adopted a hardware acceleration perspective. By leveraging the advantages of the Coordinate Rotation Digital Computer (CORDIC) algorithm, they proposed an efficient and unified hardware architecture to optimize activation function computations in RNN models. This significantly enhances the performance of RNN deployment on edge devices.

Research Origin and Publication Information

Titled “Efficient CORDIC-Based Activation Functions for RNN Acceleration on FPGAs,” this paper was authored by Wan Shen, Junye Jiang, Minghan Li, and Shuanglong Liu. The authors are affiliated with the School of Physics and Electronics, Hunan Normal University, and the Key Laboratory of Low-Dimensional Quantum Structures and Quantum Control of the Ministry of Education, China. The paper was published in Volume 6, Issue 1, of the IEEE Transactions on Artificial Intelligence in January 2025.

Research Process and Methodology

This study introduces an improved CORDIC algorithm and develops a unified hardware architecture capable of supporting multiple activation functions tailored to FPGA platforms with limited resources. The research process is outlined as follows:

1. Theoretical Foundation and Problem Analysis

The study first analyzes the working mechanism of RNNs, the current state of activation function implementations, as well as the strengths and weaknesses of the CORDIC algorithm. In traditional RNN applications, tanh and sigmoid are commonly used activation functions that are computationally intensive and time-consuming. On the other hand, the CORDIC algorithm, which only requires shift and addition operations, demonstrates high efficiency in computing various nonlinear functions. Its potential is especially noteworthy in resource-constrained settings. However, existing CORDIC implementations suffer from slow convergence, low resource efficiency, and limited support for unified implementation of multiple activation functions.
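To make the shift-and-add character concrete, here is a minimal software model of the conventional hyperbolic CORDIC in rotation mode computing tanh. The repeated iterations (4, 13, 40, ...) are the standard convergence requirement for the hyperbolic variant, and the CORDIC gain cancels in the final y/x division; this models the textbook algorithm, not the paper's improved version:

```python
import math

def cordic_tanh(z, n_iter=16):
    """Conventional hyperbolic CORDIC (rotation mode) computing tanh(z).
    Valid for |z| <= ~1.1; each step is a shift, an add, and a table lookup."""
    # Shift sequence 1,2,3,4,4,5,...: iterations 4, 13, 40, ... are repeated,
    # the standard requirement for hyperbolic CORDIC convergence.
    seq, i, rep = [], 1, 4
    while len(seq) < n_iter:
        seq.append(i)
        if i == rep:
            seq.append(i)
            rep = 3 * rep + 1
        i += 1
    x, y = 1.0, 0.0
    for i in seq[:n_iter]:
        d = 1.0 if z >= 0 else -1.0          # rotate toward the target angle
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)       # angle table entry atanh(2^-i)
    return y / x  # x -> K*cosh(z0), y -> K*sinh(z0); the gain K cancels
```

The sequential shift schedule is exactly the slow-convergence issue the paper targets: reaching small residual angles requires walking through every intermediate iteration.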

2. Improved CORDIC Algorithm

The study proposes an improved CORDIC algorithm by introducing a greedy strategy. Key improvements include:

- Employing a greedy angle-selection strategy that chooses the optimal rotation angle in each iteration, reducing the number of iterations required.
- Designing a unified angle mapping and selection mechanism to implement the tanh, sigmoid, and arctan functions, enabling compatibility with both the circular and hyperbolic coordinate systems.
- Expanding the effective input range of the activation functions through angle mapping, thereby addressing the limited convergence range of the conventional CORDIC algorithm.
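The paper's exact selection rule is not reproduced here, but a minimal sketch of the greedy idea, picking at each step the tabulated elementary angle closest to the remaining residual rather than stepping through the table sequentially, might look like this (the table depth and tie-breaking are illustrative assumptions):

```python
import math

# Elementary hyperbolic angles atanh(2^-i); table depth is an illustrative choice
ANGLES = [(i, math.atanh(2.0 ** -i)) for i in range(1, 12)]

def greedy_tanh(z, n_iter=4):
    """Greedy CORDIC sketch: each iteration selects the elementary angle
    closest to the remaining residual (a priority-encoder decision in
    hardware) instead of using the fixed sequential shift schedule."""
    x, y = 1.0, 0.0
    for _ in range(n_iter):
        d = 1.0 if z >= 0 else -1.0
        i, a = min(ANGLES, key=lambda entry: abs(abs(z) - entry[1]))
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * a
    return y / x  # the per-step gain factors affect x and y equally and cancel
```

Because each step can jump straight to the most useful shift amount, the residual angle shrinks far faster than with the sequential schedule, which is consistent with the paper's claim of matching eight conventional iterations in four.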

3. Unified Hardware Architecture Design

A pipelined hardware architecture was developed consisting of the following key modules:

  1. Preprocessing Module:

    • Maps input angles to suitable ranges [0, 1) or [0, π/4] to fit within the CORDIC convergence range.
    • Includes negative-to-positive conversion and simple arithmetic operations.
  2. Iteration Module:

    • Composed of an angle selection module and a rotation function module.
    • The angle selection module dynamically computes the index of the optimal rotation angle using priority encoders, significantly reducing computational latency.
  3. Postprocessing Module:

    • Produces the tanh and sigmoid outputs from the CORDIC core's results through combinations of addition, shifting, and division operations.
    • The arctan function is computed using logic gates and multiplexers.
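For the tanh/sigmoid postprocessing, the core's hyperbolic outputs give tanh directly as the division sinh/cosh, and a standard identity then lets the sigmoid reuse the same datapath: sigmoid(x) = (1 + tanh(x/2)) / 2, which costs only shifts and an add. A sketch, using `math.tanh` as a stand-in for the CORDIC core:

```python
import math

def sigmoid_via_tanh(x, tanh_core=math.tanh):
    """Postprocessing sketch: sigmoid shares the tanh datapath via the
    identity sigmoid(x) = (1 + tanh(x/2)) / 2.
    In hardware: shift input right by 1, run the core, add 1, shift right."""
    return 0.5 * (1.0 + tanh_core(0.5 * x))
```

The identity is exact, so this sharing costs no accuracy; only the core's own approximation error propagates to the sigmoid output.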

This hardware architecture, implemented on an FPGA, supports flexible configuration for the various activation functions, significantly reducing the hardware resources that separate, per-function modules would otherwise require.
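For the arctan path, the classic circular-coordinate counterpart is CORDIC in vectoring mode, which drives the y component of the vector (1, t) to zero while accumulating arctan(t) in the angle register. A minimal model of that textbook mode (again, not the paper's optimized variant):

```python
import math

def cordic_arctan(t, n_iter=16):
    """Circular CORDIC, vectoring mode: rotate the vector (1, t) onto the
    x-axis; the accumulated rotation z converges to arctan(t).
    Each step needs only shifts, adds, and a small angle table."""
    x, y, z = 1.0, t, 0.0
    for i in range(n_iter):
        d = -1.0 if y >= 0 else 1.0          # rotate toward the x-axis
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)        # angle table entry atan(2^-i)
    return z
```

Since the result is read from the angle accumulator rather than from x or y, the CORDIC gain never needs to be compensated, which matches the architecture's claim that arctan requires only lightweight postprocessing logic.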

Research Results and Analysis

Precision of Activation Functions

Compared with the traditional CORDIC algorithm on the numerical approximation of the tanh, sigmoid, and arctan activation functions, the proposed greedy angle-selection strategy significantly improves the convergence rate: with just four iterations, the proposed method achieves approximation precision comparable to the traditional CORDIC method with eight iterations. The relative errors (RE) for tanh, sigmoid, and arctan are maintained at 0.0019, 0.046, and 0.0224, respectively.

RNN Model Accuracy

Applying the proposed CORDIC-based activation functions to three typical RNN models (LSTM, Bidirectional LSTM, and GRU) showed that the inference accuracy loss remained below 2%, outperforming traditional methods, which require more iterations to achieve comparable accuracy.

Hardware Resource Utilization and Speed

In terms of resource consumption, the proposed architecture reuses its iteration modules. While consuming logic resources similar to the traditional CORDIC implementation, the design reduces activation-function computation latency by 50% and roughly doubles full-model inference speed.

Research Conclusion and Significance

The proposed CORDIC activation function implementation method not only maintains RNN inference accuracy but also significantly enhances inference speed while reducing hardware resource consumption. By providing a unified and configurable hardware architecture, this approach addresses the challenges of supporting multiple activation functions simultaneously, making it especially suitable for edge devices with strict real-time and resource constraints.

Highlights of the Study

  1. Innovative Improvements: The integration of the greedy algorithm with the CORDIC method introduces significant advancements in approximation precision and convergence speed.
  2. Unified Design: The unified and multi-functional hardware architecture enables efficient computation of tanh, sigmoid, and arctan functions.
  3. Hardware Applicability: Demonstrates higher resource efficiency and computational performance compared to traditional lookup-table and polynomial fitting methods during FPGA implementation.

Potential Application Value

This technological breakthrough not only enhances the applicability of RNNs in real-time scenarios but also provides new insights into deploying deep learning models on edge computing platforms. In the future, the research team plans to implement this design on more advanced hardware platforms, such as AMD Versal SoCs, aiming to further enhance hardware security and computational performance.