A Neural Speech Decoding Framework Leveraging Deep Learning and Speech Synthesis
Major Breakthrough in Neuroscience Research: Deep Learning Decodes Natural Speech from Brain Signals
A cross-disciplinary research team at New York University recently achieved a major breakthrough in the fields of neuroscience and artificial intelligence: a novel deep learning-based framework that decodes neural signals from the brain and synthesizes natural-sounding human speech directly from them. This achievement holds promise for a new generation of speech-based brain-computer interfaces to assist patients with aphasia and other speech disorders.
Research Motivation
Speech disorders severely impact patients' social interactions and quality of life. For decades, researchers have worked to develop neural prosthetics that decode and synthesize speech from brain activity, helping these patients regain the ability to communicate. However, building high-performance speech decoding systems has remained an enormous challenge, owing to the scarcity of paired brain and speech data for training and to the complexity and high dimensionality of the speech production process.
Research Essence
The team proposed an innovative deep learning-based speech decoding framework comprising two core modules: (1) a "brain signal decoder" that transforms brain signals recorded by electrocorticography (ECoG) electrode arrays placed on the cortical surface into interpretable speech parameters; (2) a novel "disentangled speech synthesizer" that converts these speech parameters into spectrograms, from which waveforms are synthesized using the Griffin-Lim algorithm.
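To make the final synthesis step concrete, here is a minimal sketch of spectrogram-to-waveform conversion with the Griffin-Lim algorithm, using librosa's implementation. The STFT settings and the random placeholder spectrogram are illustrative assumptions, not the parameters used in the study.

```python
import numpy as np
import librosa

# Illustrative STFT settings (assumptions, not the study's actual values).
N_FFT = 1024
HOP_LENGTH = 256

def spectrogram_to_waveform(mag_spec: np.ndarray, n_iter: int = 32) -> np.ndarray:
    """Recover a time-domain waveform from a linear-frequency magnitude
    spectrogram. Griffin-Lim iteratively estimates the phase information
    that a spectrogram alone does not carry.
    mag_spec shape: (1 + N_FFT // 2, num_frames).
    """
    return librosa.griffinlim(mag_spec, n_iter=n_iter,
                              hop_length=HOP_LENGTH, n_fft=N_FFT)

# A random placeholder standing in for the synthesizer's spectrogram output.
fake_spec = np.abs(np.random.randn(1 + N_FFT // 2, 200)).astype(np.float32)
waveform = spectrogram_to_waveform(fake_spec)
print(waveform.shape)  # roughly 200 * HOP_LENGTH samples
```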
The researchers also introduced a speech autoencoder: speech signals alone are used to pre-train the "speech synthesizer" and to generate reference speech parameters that guide the training of the "brain signal decoder". The framework generates highly natural speech and achieved highly reproducible decoding performance across 48 participants.
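The two-stage training scheme can be sketched roughly as follows. All module choices (simple GRUs and a linear layer), sizes, losses, and the synthetic data are hypothetical placeholders, not the paper's implementation; the point is the structure: pre-train on speech alone, then supervise the brain decoder with the frozen encoder's reference parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the paper's components.
speech_encoder = nn.GRU(80, 32, batch_first=True)   # spectrogram -> speech parameters
synthesizer    = nn.Linear(32, 80)                  # speech parameters -> spectrogram
brain_decoder  = nn.GRU(64, 32, batch_first=True)   # ECoG features -> speech parameters
mse = nn.MSELoss()

# Synthetic data: (batch, time, features) spectrograms and time-aligned ECoG.
speech_batches = [torch.randn(8, 100, 80) for _ in range(4)]
paired_batches = [(torch.randn(8, 100, 64), torch.randn(8, 100, 80)) for _ in range(4)]

# Stage 1: pre-train the speech autoencoder on speech alone.
opt1 = torch.optim.Adam(list(speech_encoder.parameters()) +
                        list(synthesizer.parameters()), lr=1e-3)
for spec in speech_batches:
    params, _ = speech_encoder(spec)
    recon = synthesizer(params)
    loss = mse(recon, spec)
    opt1.zero_grad(); loss.backward(); opt1.step()

# Stage 2: train the brain decoder to match the frozen encoder's
# reference speech parameters.
for p in speech_encoder.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(brain_decoder.parameters(), lr=1e-3)
for ecog, spec in paired_batches:
    with torch.no_grad():
        ref_params, _ = speech_encoder(spec)
    pred_params, _ = brain_decoder(ecog)
    loss = mse(pred_params, ref_params)
    opt2.zero_grad(); loss.backward(); opt2.step()
```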
A major innovation of this framework is its treatment of the causality of the brain signal decoder. Previously, most studies reported results only for non-causal decoders, which rely not just on current and past brain signals but also on future ones, and therefore implicitly on auditory feedback from the produced speech; this is infeasible for real-time speech generation. The researchers developed a decoder that can operate in both causal and non-causal modes: the causal mode uses only current and past signals, making it suitable for real-time applications. Experiments demonstrated that in causal mode, advanced architectures such as convolutional networks (ResNet) and transformer networks (Swin Transformer) achieved decoding accuracy approaching that of the non-causal mode.
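The difference between the two modes comes down to what each output frame is allowed to see. A minimal PyTorch sketch (illustrative only, not the paper's ResNet or Swin architecture): a causal 1-D convolution left-pads its input so the output at time t depends only on samples at times <= t, while standard symmetric padding lets it see future samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t sees only inputs at times <= t."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1      # pad only on the past side
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

x = torch.randn(1, 8, 50)                   # one 8-channel signal, 50 frames
causal    = CausalConv1d(8, 16, kernel_size=5)
noncausal = nn.Conv1d(8, 16, kernel_size=5, padding=2)  # symmetric: sees 2 future frames
print(causal(x).shape, noncausal(x).shape)  # both torch.Size([1, 16, 50])
```

A system built from causal layers like this can run frame by frame as neural signals arrive, which is what a real-time speech prosthesis requires.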
Additionally, the framework demonstrated that speech can be successfully decoded from the right hemisphere of the cortex, opening a new treatment avenue for patients with aphasia caused by severe damage to the left hemisphere. The study also found that the framework achieves high decoding performance whether high-density or clinically common low-density electrode arrays are used, greatly expanding its application prospects.
Innovative Significance
The speech decoding framework is innovative in several respects:
By adopting an interpretable intermediate speech parameter representation and pairing it with a novel disentangled speech synthesizer, it can generate natural speech while preserving individual speaker characteristics.
It is the first work to systematically investigate the causality of speech decoders, providing a viable solution for real-time speech generation applications.
It demonstrated that speech can be decoded from the right hemisphere of the brain, offering new hope for the treatment of aphasia patients.
It exhibited outstanding decoding performance on both high-density and low-density electrode arrays, significantly enhancing its clinical applicability.
The research team released an open-source decoding framework, which will help accelerate research in speech science and the development of speech prosthetics.
This breakthrough research has opened new doors for the fields of neuroscience and artificial intelligence. Looking ahead, speech-based brain-computer interfaces hold the promise of restoring the ability to speak to those who have lost it.