Deep Audio Classifier: An Artificial Neural Network Approach
DOI: https://doi.org/10.22105/scfa.v1i2.35

Keywords: Deep learning, Support vector machine, Random forest, Artificial neural network, Convolutional neural network, Mel-frequency cepstral coefficients, Librosa

Abstract
This research centers on developing a deep audio classifier by examining several machine learning and deep learning algorithms: Support Vector Machines (SVMs), Random Forests (RFs), Artificial Neural Networks (ANNs), and Convolutional Neural Networks (CNNs). The models were trained and evaluated on the UrbanSound8K dataset, with the objective of building robust models that can effectively classify complex urban sound environments. The audio samples underwent comprehensive preprocessing, including noise reduction, normalization, and trimming to maintain a consistent sample duration. Features were extracted as Mel-Frequency Cepstral Coefficients (MFCCs). The ANN model, which consists of dense layers tailored for feature learning and uses softmax activation for multi-class classification, achieved a classification accuracy of 80.20%. The SVM and RF models achieved accuracies of 82.34% and 84.90%, respectively, using linear and ensemble methodologies. The CNN model surpassed the others with an accuracy of 88.45%, demonstrating its ability to capture spatial hierarchies and localized patterns in audio data. Performance varied by class, with high precision on distinctive sounds such as car horns and gunshots.
The study concludes with recommendations for future work, including sophisticated data augmentation methods, hybrid models, and more extensive hyperparameter tuning to improve classification accuracy and generalization in real-world urban settings.