A combined CNN architecture for speech emotion recognition

Begazo Huamani, Rolinson Jhiampier

dc.contributor.advisor	Dongo Escalante, Irvin Franco Benito
dc.contributor.author	Begazo Huamani, Rolinson Jhiampier
dc.date.accessioned	2024-12-06T15:54:36Z
dc.date.available	2024-12-06T15:54:36Z
dc.date.issued	2024
dc.description.abstract	Emotion recognition through speech is a technique employed in various scenarios of Human–Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, most notably in the quantity and diversity of data available when deep learning techniques are used. The lack of a standard for feature selection leads to continuous development and experimentation, and choosing and designing an appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques. It proposes a comprehensive approach that develops preprocessing and feature selection stages and constructs a dataset, called EmoDSc, by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images it reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each input type when used in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, achieves an accuracy of 96%.
dc.description.uri	Trabajo académico
dc.format	application/html
dc.identifier.doi	https://doi.org/10.3390/s24175797
dc.identifier.uri	https://hdl.handle.net/20.500.12590/18477
dc.language.iso	eng
dc.publisher	Universidad Católica San Pablo
dc.publisher.country	PE
dc.relation.uri	https://www.mdpi.com/1424-8220/24/17/5797
dc.rights	info:eu-repo/semantics/restrictedAccess
dc.subject	Speech emotion recognition
dc.subject	Deep learning
dc.subject	Spectral features
dc.subject	Spectrogram imaging
dc.subject	Feature fusion
dc.subject	Convolutional neural network
dc.subject.ocde	https://purl.org/pe-repo/ocde/ford#2.00.00
dc.title	A combined CNN architecture for speech emotion recognition
dc.type	info:eu-repo/semantics/article
dc.type.version	info:eu-repo/semantics/publishedVersion
renati.advisor.dni	46703945
renati.advisor.orcid	https://orcid.org/0000-0003-4859-0428
renati.author.dni	76774488
renati.discipline	712096
renati.juror	Sotomayor Polar, Manuel Gustavo
renati.juror	Ludeña Choez, Jimmy Diestin
renati.level	https://purl.org/pe-repo/renati/level#tituloProfesional
renati.type	https://purl.org/pe-repo/renati/type#trabajoAcademico
thesis.degree.discipline	Ingeniería Electrónica y de Telecomunicaciones
thesis.degree.grantor	Universidad Católica San Pablo. Departamento de Ingeniería Electrónica y de Telecomunicaciones
thesis.degree.level	Título Profesional
thesis.degree.name	Ingeniero Electrónico y de Telecomunicaciones
thesis.degree.program	Escuela Profesional Ingeniería Electrónica y de Telecomunicaciones