Boosting Audio Spectrogram Transformer Performance with Sharpness-Aware Minimization

Sharpness-Aware Minimization Boosts Audio Spectrogram Transformer Performance on Respiratory Sound Dataset

A new framework that integrates Sharpness-Aware Minimization (SAM) with the Audio Spectrogram Transformer (AST) has been presented to improve respiratory sound classification on the ICBHI 2017 benchmark. The approach, developed by a research team and posted on arXiv in December 2025, combines loss‑surface geometry optimization with a weighted sampling scheme to address limited data volume, high noise levels, and pronounced class imbalance.

Challenges in Existing Datasets

The ICBHI 2017 dataset, frequently used for evaluating automated lung sound analysis, contains a relatively small number of recordings, substantial background noise, and a skewed distribution of pathological classes. These factors have historically constrained the generalization ability of machine‑learning models, leading to overfitting and unreliable clinical screening outcomes.

Transformer Models and Overfitting Risks

Transformer‑based architectures, such as the AST, are recognized for their powerful feature‑extraction capabilities in audio domains. However, when trained on constrained medical datasets, they tend to converge toward sharp minima in the loss landscape, which can limit robustness on unseen patient data.

Incorporating Sharpness‑Aware Minimization

SAM modifies the standard training objective: instead of minimizing the loss at the current parameters alone, it minimizes the worst‑case loss within a small neighborhood of those parameters. This penalizes sensitivity to parameter perturbations and steers the optimizer toward flatter regions of the loss surface, with the aim of producing models that maintain performance across variations in input data.
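
The two-step update usually associated with SAM can be sketched as follows. This is a minimal illustration rather than the authors' training code: the function names, the toy quadratic loss, and the values of `lr` and `rho` are all assumptions chosen for the example, and a real AST would be trained with a deep-learning framework instead of plain Python lists.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update on a parameter list w (illustrative sketch).

    grad_fn, lr and rho are hypothetical names and values for this example.
    """
    # Step 1: gradient at the current parameters.
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Already at a stationary point; no perturbation direction exists.
        return list(w)
    # Step 2: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 3: take the gradient at the perturbed point...
    g_adv = grad_fn(w_adv)
    # ...and apply it to the ORIGINAL parameters.
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy loss with one steep and one shallow direction (an assumption).
def quad_loss(w):
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def quad_grad(w):
    return [10.0 * w[0], w[1]]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, quad_grad)
```

Each step needs two gradient evaluations, so a SAM step costs roughly twice as much as a plain gradient step.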

Weighted Sampling for Class Imbalance

To mitigate the impact of uneven class frequencies, the framework employs a weighted sampling strategy that increases the likelihood of selecting under‑represented classes during each training epoch. This technique helps balance the gradient contributions from each class, fostering more equitable learning.
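
One common way to realize such a scheme is to draw training examples with probability inversely proportional to their class frequency. The sketch below assumes that weighting rule, which this report does not confirm as the authors' exact choice. The helper names and the toy label counts are hypothetical.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-example weight 1 / (size of the example's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

def sample_epoch(labels, num_samples, seed=0):
    """Draw example indices with replacement using the class-balancing weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)

# Hypothetical imbalanced toy set: 90 "normal" and 10 "wheeze" examples.
labels = ["normal"] * 90 + ["wheeze"] * 10
indices = sample_epoch(labels, num_samples=10000)
drawn = Counter(labels[i] for i in indices)
wheeze_fraction = drawn["wheeze"] / len(indices)
```

With these weights every class carries the same total sampling mass, so the minority class is drawn about as often as the majority class, at the cost of repeating its few examples many times within an epoch.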

Empirical Results

When evaluated on the ICBHI 2017 test split, the SAM‑enhanced AST achieved an overall score of 68.10%, surpassing previously reported convolutional‑neural‑network and hybrid baselines. The model also recorded a sensitivity of 68.31%, a metric of particular relevance for clinical screening because it reflects how many pathological cases are detected.
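
For readers unfamiliar with the benchmark's metric, the sketch below assumes the conventional ICBHI definition, in which the score is the mean of sensitivity (abnormal cycles assigned their exact class) and specificity (normal cycles classified as normal). The function name and the toy labels are hypothetical, and the toy numbers are unrelated to the results reported above.

```python
def icbhi_score(y_true, y_pred, normal="normal"):
    """Return (sensitivity, specificity, score) under the assumed ICBHI convention."""
    n_normal = sum(1 for t in y_true if t == normal)
    n_abnormal = len(y_true) - n_normal
    if n_normal == 0 or n_abnormal == 0:
        raise ValueError("need at least one normal and one abnormal example")
    # Specificity: normal cycles predicted as normal.
    correct_normal = sum(
        1 for t, p in zip(y_true, y_pred) if t == normal and p == normal
    )
    # Sensitivity: abnormal cycles predicted as their exact abnormal class.
    correct_abnormal = sum(
        1 for t, p in zip(y_true, y_pred) if t != normal and p == t
    )
    sensitivity = correct_abnormal / n_abnormal
    specificity = correct_normal / n_normal
    return sensitivity, specificity, (sensitivity + specificity) / 2.0

# Hypothetical toy predictions, for illustration only.
y_true = ["normal"] * 4 + ["crackle", "wheeze", "both", "crackle"]
y_pred = ["normal", "normal", "normal", "crackle",
          "crackle", "wheeze", "crackle", "normal"]
se, sp, score = icbhi_score(y_true, y_pred)
```

Because the score averages the two rates, a model can reach a high score through specificity alone, which is why the sensitivity figure is reported separately.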

Interpretability Analyses

Post‑training analyses using t‑Distributed Stochastic Neighbor Embedding (t‑SNE) visualizations and attention‑map inspections indicated that the model captured discriminative acoustic patterns rather than memorizing ambient noise, supporting the claim of improved feature robustness.

Implications and Future Directions

The findings suggest that loss‑surface smoothing techniques like SAM, combined with targeted sampling methods, can enhance the reliability of AI‑driven respiratory diagnostics. Ongoing work aims to validate the approach on larger, multi‑center datasets and to explore integration with real‑time clinical decision support systems.

This report is based on the abstract of a research paper posted on arXiv (academic preprint, open access). The full text is available via arXiv.
