Detecting voice pathology through speech and cry analysis has drawn increasing interest. Previous studies typically rely on interpretable acoustic features such as prosody, pitch, and rhythm. In this study, we present a novel deep learning approach that analyzes children's cries by extracting non-interpretable features, which capture complex vocal patterns missed by traditional methods. Specifically, we leverage YAMNet and VGGish to extract cry features, followed by a Parallel one-dimensional Convolutional Neural Network (1D CNN), an enhanced design relative to standard CNN architectures, for binary classification distinguishing children with voice pathology from typically developing (TD) children. Experimental results show that the VGGish-based model outperforms the others, achieving 81% accuracy. Our findings highlight the potential of deep learning models and non-interpretable features for cry-based voice pathology detection, offering a new direction for early diagnosis.
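To make the described pipeline concrete, the following is a minimal sketch of one plausible realization: VGGish embeddings extracted via TensorFlow Hub feeding a parallel 1D CNN with multi-kernel branches. The kernel sizes, filter counts, dropout rate, and clip length are illustrative assumptions, not the configuration reported in the paper.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load VGGish from TF Hub; it maps a 16 kHz mono float32 waveform to a
# sequence of 128-dimensional embeddings (one per 0.96 s audio frame).
vggish = hub.load('https://tfhub.dev/google/vggish/1')

def extract_embeddings(waveform):
    # waveform: 1-D float32 tensor of 16 kHz mono samples in [-1, 1]
    return vggish(waveform)  # shape: (num_frames, 128)

def build_parallel_1d_cnn(num_frames, embed_dim=128):
    # Parallel Conv1D branches with different kernel sizes capture vocal
    # patterns at several temporal scales; branch outputs are concatenated.
    # Kernel sizes (3, 5, 7) and filter counts are assumptions for this sketch.
    inputs = tf.keras.Input(shape=(num_frames, embed_dim))
    branches = []
    for k in (3, 5, 7):
        x = tf.keras.layers.Conv1D(64, k, padding='same', activation='relu')(inputs)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        branches.append(x)
    merged = tf.keras.layers.Concatenate()(branches)
    x = tf.keras.layers.Dense(64, activation='relu')(merged)
    x = tf.keras.layers.Dropout(0.3)(x)
    # Single sigmoid unit for the binary decision: voice pathology vs. TD.
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    return tf.keras.Model(inputs, outputs)

model = build_parallel_1d_cnn(num_frames=10)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

Under this sketch, each cry clip is embedded once by the frozen VGGish front end, and only the lightweight parallel 1D CNN head is trained, which suits small clinical datasets; swapping in YAMNet would follow the same pattern with 1024-dimensional embeddings.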