Sharpness-Aware Minimization Boosts Audio Spectrogram Transformer Performance on Respiratory Sound Dataset
A new framework that integrates Sharpness-Aware Minimization (SAM) with the Audio Spectrogram Transformer (AST) has been presented to improve respiratory sound classification on the ICBHI 2017 benchmark. The approach, developed by a research team and posted on arXiv in December 2025, combines loss‑surface geometry optimization with a weighted sampling scheme to address limited data volume, high noise levels, and pronounced class imbalance.
Challenges in Existing Datasets
The ICBHI 2017 dataset, frequently used for evaluating automated lung sound analysis, contains a relatively small number of recordings, substantial background noise, and a skewed distribution of pathological classes. These factors have historically constrained the generalization ability of machine‑learning models, leading to overfitting and unreliable clinical screening outcomes.
Transformer Models and Overfitting Risks
Transformer‑based architectures, such as the AST, are recognized for their powerful feature‑extraction capabilities in audio domains. However, when trained on constrained medical datasets, they tend to converge toward sharp minima in the loss landscape, which can limit robustness on unseen patient data.
Incorporating Sharpness‑Aware Minimization
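SAM replaces the standard training objective with a min‑max problem: rather than minimizing the loss at the current parameters alone, it minimizes the worst‑case loss within a small neighborhood of those parameters. In practice each update takes two gradient evaluations, one to find the perturbation that most increases the loss and one at the perturbed point to update the original weights. By steering the optimizer toward flatter regions of the loss surface, the method aims to produce models that generalize better to unseen data.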
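The two‑step update behind SAM can be sketched on a toy problem. The snippet below is a minimal illustration and not the paper's training code: the quadratic `loss`, the helper names, and the values of `lr` and `rho` are all assumptions chosen for readability, and a real AST model would use a deep‑learning framework's automatic differentiation instead of an analytic gradient.

```python
import math

def loss(w):
    """Toy quadratic loss standing in for a network's training loss (assumption)."""
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    """Analytic gradient of the toy loss."""
    return [10.0 * w[0], 0.1 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb toward higher loss, then descend from the original point."""
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: nothing to perturb or update
    # First-order approximation of the worst-case perturbation in an L2 ball of radius rho.
    eps = [rho * gi / norm for gi in g]
    w_adv = [wi + ei for wi, ei in zip(w, eps)]
    # The gradient is evaluated at the perturbed point but applied to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Illustrative run with hypothetical starting weights.
w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
final_loss = loss(w)
```

Setting `rho=0.0` reduces `sam_step` to plain gradient descent, which makes the extra cost of SAM explicit: one additional gradient evaluation per update.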
Weighted Sampling for Class Imbalance
To mitigate the impact of uneven class frequencies, the framework employs a weighted sampling strategy that increases the likelihood of selecting under‑represented classes during each training epoch. This technique helps balance the gradient contributions from each class, fostering more equitable learning.
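One common way to realize such a scheme is to weight each sample by the inverse of its class frequency and draw training indices with replacement. The sketch below assumes that weighting rule, which the abstract does not specify; the function names, class labels, and class counts are hypothetical and are not ICBHI statistics.

```python
import random
from collections import Counter

def class_balanced_weights(labels):
    """Per-sample weights inversely proportional to class frequency (assumed rule)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def sample_epoch(labels, num_samples, seed=0):
    """Draw sample indices with replacement using the inverse-frequency weights."""
    rng = random.Random(seed)
    weights = class_balanced_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)

# Hypothetical imbalanced label list; the counts are illustrative only.
labels = ["normal"] * 900 + ["crackle"] * 70 + ["wheeze"] * 20 + ["both"] * 10
indices = sample_epoch(labels, num_samples=4000)
drawn = Counter(labels[i] for i in indices)  # roughly equal counts per class
```

Because every class receives the same total weight, each one is drawn about equally often in expectation, at the cost of repeating minority‑class recordings many times within an epoch. In a PyTorch pipeline the same weights could be passed to `torch.utils.data.WeightedRandomSampler`.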
Empirical Results
When evaluated on the ICBHI 2017 test split, the SAM‑enhanced AST achieved an overall score of 68.10%, surpassing previously reported convolutional‑neural‑network and hybrid baselines. The model also recorded a sensitivity of 68.31%, a metric of particular relevance for clinical screening because it reflects how many abnormal recordings are correctly identified.
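The abstract does not define the "overall score". On the ICBHI benchmark the score is conventionally reported as the mean of sensitivity and specificity, and the sketch below assumes that convention. The function name and the counts in the example are hypothetical and are not the paper's results.

```python
def icbhi_score(normal_correct, normal_total, abnormal_correct, abnormal_total):
    """Return (sensitivity, specificity, score) under the assumed ICBHI convention."""
    # Specificity: fraction of normal cycles classified as normal.
    specificity = normal_correct / normal_total
    # Sensitivity: fraction of abnormal cycles assigned their correct abnormal class.
    sensitivity = abnormal_correct / abnormal_total
    # Score: arithmetic mean of the two rates.
    score = (sensitivity + specificity) / 2
    return sensitivity, specificity, score

# Hypothetical counts, for illustration only.
sensitivity, specificity, score = icbhi_score(
    normal_correct=80, normal_total=100,
    abnormal_correct=60, abnormal_total=100,
)
```

Under this definition a model cannot obtain a high score by predicting only the majority class, since that would drive sensitivity toward zero.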
Interpretability Analyses
Post‑training analyses using t‑Distributed Stochastic Neighbor Embedding (t‑SNE) visualizations and attention‑map inspections indicated that the model captured discriminative acoustic patterns rather than memorizing ambient noise, supporting the claim of improved feature robustness.
Implications and Future Directions
The findings suggest that loss‑surface smoothing techniques like SAM, combined with targeted sampling methods, can enhance the reliability of AI‑driven respiratory diagnostics. Ongoing work aims to validate the approach on larger, multi‑center datasets and to explore integration with real‑time clinical decision support systems.
This report is based on the abstract of a research paper posted on arXiv as an open‑access academic preprint. The full text is available via arXiv.