The generalization performance of deep learning models for medical image
analysis often degrades on images acquired with different devices, device
settings, or patient populations. A better understanding of how well models
generalize to new images is crucial for clinicians' trust in deep learning.
Although significant research efforts have
recently been directed toward establishing generalization bounds and complexity
measures, there is often still a significant discrepancy between the predicted
and actual generalization performance. In addition, related large-scale empirical
studies have been validated primarily on general-purpose image datasets.
This paper presents an empirical study that investigates the correlation
between 25 complexity measures and the generalization abilities of supervised
deep learning classifiers for breast ultrasound images. The results indicate
that PAC-Bayes flatness-based and path norm-based measures provide the most
consistent explanation for the combination of models and data. We also
investigate the use of a multi-task classification and segmentation approach for
breast images, and report that such a learning approach acts as an implicit
regularizer and is conducive to improved generalization.