Vision Transformer and CNNs in Kidney Stone Classification: A Comparative Study

Kidney stones affect up to 15% of individuals worldwide and can lead to severe pain, infection, and long-term renal damage if not detected promptly. Automated detection on CT scans can accelerate diagnosis, reduce radiologist workload, and standardize interpretation across centers. In this work, we present the first comprehensive benchmark comparing six leading deep learning architectures on a heterogeneous, multi-center axial CT dataset of 3,364 expert-annotated slices and 35,457 augmented images from three hospitals. Our augmentation pipeline applies random rotations, horizontal flips, scaling, and intensity jitter to improve robustness against realistic variations. We evaluate three convolutional neural networks (ResNet50, EfficientNet-B0, InceptionV3) and three vision transformers (ViT-B/16, DeiT, Swin Transformer), all fine-tuned from ImageNet pre-training under identical preprocessing, stratified 70/15/15 train/validation/test splits, and a consistent training protocol (Adam optimizer with an initial learning rate of 1e-4, cosine annealing, batch size 32, and early stopping on validation loss). We report accuracy, precision, recall, and F1-score to capture both overall and class-specific performance. ResNet50 achieves the top accuracy of 99.43%, with near-perfect precision and recall, while Swin Transformer is the best-performing transformer at 98.78%. Confusion-matrix analysis highlights CNNs’ superior localization of renal calculi and transformers’ ability to model global context.
Authors:
Alif Akbar Hafiz, Derrick Vericho, Vincenzo Jason Carter, Dave Christian Thio, Mahmud Isnan, Bens Pardamean
2025 International Conference on Computer Science and Computational Intelligence (ICCSCI)
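
The evaluation setup named in the abstract (stratified 70/15/15 splits, with precision, recall, and F1-score reported per class) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `stratified_split` and `precision_recall_f1`, the fixed seed, and the example class labels are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample indices into train/val/test, preserving class proportions.

    Hypothetical helper: the 70/15/15 fractions come from the abstract,
    everything else (seed, rounding rule) is an assumption.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels; the abstract does not list the class names.
    labels = ["stone"] * 100 + ["normal"] * 100
    train, val, test = stratified_split(labels)
    print(len(train), len(val), len(test))
    print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1))
```

Splitting within each class before concatenating keeps the class balance the same across the three subsets, which is what makes per-class precision and recall on the test set comparable to those on the validation set.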