From audio recordings, AI can identify emotions such as fear, joy, anger, and sadness.

Accurately understanding and identifying human emotional states is crucial for mental health professionals. Is it possible for artificial intelligence and machine learning to mimic human cognitive empathy? A recent peer-reviewed study demonstrates how AI can recognize emotions from audio recordings in as little as 1.5 seconds, with performance comparable to that of humans.

“The human voice serves as a powerful channel for expressing emotional states, as it provides universally understandable cues about the sender’s situation and can transmit them over long distances,” wrote the study’s first author, Hannes Diemerling, of the Max Planck Institute for Human Development’s Center for Lifespan Psychology, in collaboration with Germany-based psychology researchers Leonie Stresemann, Tina Braun, and Timo von Oertzen.

The quantity and quality of training data are essential to a deep learning algorithm’s performance and accuracy. This study used over 1,500 distinct audio clips from open-source English and German emotion databases. The German recordings came from the Berlin Database of Emotional Speech (Emo-DB), while the English recordings were taken from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).

“Emotional recognition from audio recordings is a rapidly advancing field, with significant implications for artificial intelligence and human-computer interaction,” the researchers wrote.

As reported here, the researchers reduced the range of emotional states to six categories for their study: joy, fear, neutral, anger, sadness, and disgust. The audio files were split into 1.5-second segments, and a large set of features was extracted from each one: pitch tracking, pitch magnitudes, spectral bandwidth, magnitude, phase, Mel-frequency cepstral coefficients (MFCCs), chroma, Tonnetz, spectral contrast, spectral rolloff, fundamental frequency, spectral centroid, zero crossing rate, root mean square (RMS), harmonic-percussive source separation (HPSS), and spectral flatness, alongside the unaltered audio signal.
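The paper’s extraction code is not reproduced in this article, but a minimal sketch of such a step, assuming the open-source librosa library, might look like the following; the file path, sampling rate, and per-segment averaging are illustrative assumptions, not the study’s own pipeline.

```python
import librosa
import numpy as np

SEGMENT_SECONDS = 1.5    # segment length used in the study
SAMPLE_RATE = 22050      # assumed sampling rate (librosa's default)

def extract_features(path: str) -> list[np.ndarray]:
    """Split an audio file into 1.5-second segments and compute one
    feature vector (means of several spectral descriptors) per segment."""
    y, sr = librosa.load(path, sr=SAMPLE_RATE)
    hop = int(SEGMENT_SECONDS * sr)
    vectors = []
    for start in range(0, len(y) - hop + 1, hop):
        seg = y[start:start + hop]
        feats = np.concatenate([
            librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1),
            librosa.feature.chroma_stft(y=seg, sr=sr).mean(axis=1),
            librosa.feature.tonnetz(y=seg, sr=sr).mean(axis=1),
            librosa.feature.spectral_contrast(y=seg, sr=sr).mean(axis=1),
            librosa.feature.spectral_centroid(y=seg, sr=sr).mean(axis=1),
            librosa.feature.spectral_bandwidth(y=seg, sr=sr).mean(axis=1),
            librosa.feature.spectral_rolloff(y=seg, sr=sr).mean(axis=1),
            librosa.feature.spectral_flatness(y=seg).mean(axis=1),
            librosa.feature.zero_crossing_rate(seg).mean(axis=1),
            librosa.feature.rms(y=seg).mean(axis=1),
        ])
        vectors.append(feats)
    return vectors
```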

Psychoacoustics is the psychology of sound: the science of how humans perceive it. Two physical properties, amplitude (perceived as volume) and frequency (perceived as pitch), have a significant influence on that perception. Pitch is the psychoacoustic correlate of frequency, which is measured in hertz (Hz) and kilohertz (kHz); the higher the frequency, the higher the pitch. Amplitude is described in decibels (dB), a measure of sound intensity; the greater the amplitude, the louder the sound.
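As a quick illustration of the decibel scale (not taken from the study), the snippet below converts a linear amplitude into decibels relative to a reference; the reference amplitude of 1.0 is an assumption.

```python
import numpy as np

def amplitude_to_db(amplitude: np.ndarray, reference: float = 1.0) -> np.ndarray:
    """Convert linear amplitude to decibels: dB = 20 * log10(a / a_ref).
    Doubling the amplitude adds roughly 6 dB."""
    return 20.0 * np.log10(np.maximum(amplitude, 1e-10) / reference)

print(amplitude_to_db(np.array([0.5, 1.0, 2.0])))  # ≈ [-6.02, 0.0, 6.02]
```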

Spectral bandwidth, or spectral spread, is the span between a signal’s upper and lower frequencies; it is computed around the spectral centroid, the “center of mass” of the spectrum, and both are used to characterize the spectrum of an audio signal. Spectral flatness measures how evenly energy is distributed across frequencies relative to a reference signal. Spectral rolloff identifies the frequency below which most of the signal’s energy is concentrated.
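These definitions are simple enough to write out directly. The sketch below computes a spectral centroid and bandwidth from a magnitude spectrum with NumPy; it illustrates the standard textbook definitions rather than the study’s own code, and the synthetic spectrum is purely for demonstration.

```python
import numpy as np

def centroid_and_bandwidth(magnitude: np.ndarray, freqs: np.ndarray) -> tuple[float, float]:
    """Spectral centroid: magnitude-weighted mean frequency ("center of mass").
    Spectral bandwidth: magnitude-weighted spread of frequencies around the centroid."""
    weights = magnitude / magnitude.sum()
    centroid = float(np.sum(freqs * weights))
    bandwidth = float(np.sqrt(np.sum(weights * (freqs - centroid) ** 2)))
    return centroid, bandwidth

# Example: a spectrum with energy concentrated around 440 Hz
freqs = np.linspace(0, 8000, 1024)
magnitude = np.exp(-0.5 * ((freqs - 440) / 50) ** 2)
print(centroid_and_bandwidth(magnitude, freqs))  # centroid ≈ 440 Hz, bandwidth ≈ 50 Hz
```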

Mel-frequency cepstral coefficients (MFCCs) are features commonly used in speech processing; they summarize the spectral envelope on the mel scale, which approximates human pitch perception. Chroma features, or pitch class profiles, describe how a signal’s energy is distributed across the twelve semitone pitch classes of the octave and are often used to analyze a composition’s key.
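A tiny sketch, again assuming librosa, makes the “twelve pitch classes” point concrete: the chroma matrix has exactly twelve rows, and a pure A4 tone (an illustrative input, not study data) activates the row corresponding to A.

```python
import librosa
import numpy as np

sr = 22050
t = np.linspace(0, 1.5, int(sr * 1.5), endpoint=False)
tone = np.sin(2 * np.pi * 440.0 * t)        # a pure A4 (440 Hz) tone

chroma = librosa.feature.chroma_stft(y=tone, sr=sr)
print(chroma.shape)                  # (12, n_frames): one row per pitch class
print(chroma.mean(axis=1).argmax())  # index of the most active pitch class (A)
```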

Tonnetz, German for “tone network,” is a term used in music theory to describe a visual representation of chord relationships in Neo-Riemannian theory, which bears the name of German musicologist Hugo Riemann (1849–1919), one of the pioneers of contemporary musicology.

A common acoustic feature for audio analysis is the zero crossing rate (ZCR). For a given frame of an audio signal, the ZCR counts how many times the signal’s amplitude changes sign, i.e., crosses the zero axis.
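The zero crossing rate is simple enough to compute by hand; this NumPy sketch (illustrative, not from the paper) counts sign changes within a frame and normalizes by the number of sample pairs.

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(frame)
    crossings = np.sum(signs[:-1] * signs[1:] < 0)
    return crossings / (len(frame) - 1)

# A 440 Hz sine sampled at 22050 Hz crosses zero about 880 times per second.
sr = 22050
t = np.arange(sr) / sr
print(zero_crossing_rate(np.sin(2 * np.pi * 440 * t)) * sr)  # ≈ 880
```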

Root mean square (RMS) is used in audio production to calculate the average power or loudness of a sound waveform over time. An audio signal can be divided into harmonic and percussive components using a technique called harmonic-percussive source separation, or HPSS.
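Both quantities can be sketched briefly, assuming librosa for the HPSS step (the study’s own implementation is not specified here); the file name is a placeholder.

```python
import librosa
import numpy as np

def rms(signal: np.ndarray) -> float:
    """Root mean square: square the samples, average them, take the square root."""
    return float(np.sqrt(np.mean(signal ** 2)))

# Load any audio file and split it into harmonic and percussive components.
y, sr = librosa.load("example.wav", sr=None)   # "example.wav" is a placeholder path
harmonic, percussive = librosa.effects.hpss(y)

print(rms(y), rms(harmonic), rms(percussive))
```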

Using a combination of Python, TensorFlow, and Bayesian optimization, the scientists built three distinct deep learning models for categorizing emotions from short audio samples and compared the outcomes to human performance. The three models were a deep neural network (DNN) operating on the extracted features, a convolutional neural network (CNN) operating on spectrograms, and a hybrid model that combines a CNN for spectrogram analysis with a DNN for feature processing. The aim was to find the best-performing model.
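The study’s exact architectures and hyperparameters are not reproduced in this article, but a plausible shape for the three model types, assuming TensorFlow/Keras, is sketched below; the layer sizes, feature-vector length, and spectrogram shape are assumptions, and in practice Bayesian optimization would be used to tune such choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6            # joy, fear, neutral, anger, sadness, disgust
NUM_FEATURES = 80          # length of the extracted feature vector (assumed)
SPEC_SHAPE = (128, 65, 1)  # spectrogram bins x frames x channels (assumed)

def build_dnn() -> tf.keras.Model:
    """Dense network over the hand-crafted feature vector."""
    return models.Sequential([
        layers.Input(shape=(NUM_FEATURES,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_cnn() -> tf.keras.Model:
    """Convolutional network over the spectrogram."""
    return models.Sequential([
        layers.Input(shape=SPEC_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_hybrid() -> tf.keras.Model:
    """Two branches: a CNN for the spectrogram and a DNN for the feature vector."""
    spec_in = layers.Input(shape=SPEC_SHAPE)
    x = layers.Conv2D(32, 3, activation="relu")(spec_in)
    x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)

    feat_in = layers.Input(shape=(NUM_FEATURES,))
    y = layers.Dense(128, activation="relu")(feat_in)

    merged = layers.concatenate([x, y])
    out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)
    return models.Model(inputs=[spec_in, feat_in], outputs=out)

model = build_hybrid()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The hybrid builder takes two inputs, a spectrogram and a feature vector, mirroring the description of a CNN for spectrogram analysis combined with a DNN for feature processing.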

The researchers found that the AI models’ overall accuracy in classifying emotions was better than chance and comparable to human performance. Among the three models, the deep neural network and the hybrid model outperformed the convolutional neural network.

The integration of data science and artificial intelligence with psychology and psychoacoustics shows how computers may eventually perform speech-based cognitive empathy tasks on par with human performance.

“This interdisciplinary research, bridging psychology and computer science, highlights the potential for advancements in automatic emotion recognition and the broad range of applications,” concluded the researchers.

The ability of AI to understand human emotions could be a breakthrough in making psychological assistance simpler and more accessible for everyone. Such help could even benefit society at large, since the psychological strain of an increasingly frantic, unempathetic, and individualistic world is leaving people ever more lonely and isolated.

However, the same abilities could also be used to probe the human mind in order to deceive people and persuade them to do things they would not otherwise choose to do, sometimes without their even realizing it. We therefore always have to be careful and aware of the potential of these tools.