## *3.1. Evaluation Datasets*

A well-curated multi-centric dataset, comprising both clinical and cadaver bones, was chosen for the evaluation of Nautilus. CT images acquired from various scanners with various acquisition parameters, and presenting heterogeneous resolutions, contrasts, and signal-to-noise ratios, were included for both training and evaluation (see Figures S7 and S8 in the Supplementary Material). Ground-truth annotations, comprising the *C*, *Ap* and *RW* landmarks, the cochlear structures and the electrode center points, were delineated by an expert radiologist using ITK-SNAP [64]. Because of the poor resolution and imaging conditions of clinical images, only the cochlea could be manually delineated in clinical scans. In contrast, ST and SV were successfully delineated in cadaver head CT scans, where better contrast and resolution could be achieved. The number of images used for training and evaluation of each process is reported in the corresponding section. Each part of the pipeline was evaluated independently, as detailed below. A summary of the results is presented in Table 2.

## *3.2. Accuracy*

## 3.2.1. Landmark Detection

The landmark detection pipeline, used both pre- and post-operatively, was evaluated on a dataset of 60 images. The images were passed through the landmark detector, and the distance between each predicted landmark and the corresponding ground-truth landmark was computed. Mean detection errors of 0.71 ± 1.0 mm, 0.75 ± 1.14 mm, and 1.30 ± 1.73 mm were observed for *C*, *Ap* and *RW*, respectively. All the individual errors were within a distance of two voxels, with the *RW* landmark yielding the worst performance.
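The error computation described above can be illustrated with a minimal sketch. It assumes that landmarks are stored as voxel indices and converted to millimetres through the voxel spacing, and that the reported spread is a sample standard deviation; neither detail is stated in the text. The function names `landmark_errors_mm` and `summarize` are hypothetical and are not part of Nautilus.

```python
import numpy as np


def landmark_errors_mm(pred_vox, gt_vox, spacing_mm):
    """Euclidean distance in mm between predicted and ground-truth landmarks.

    pred_vox, gt_vox: arrays of shape (n_images, 3) holding voxel indices
    (assumed storage convention). spacing_mm: voxel size along each axis,
    either a single (3,) vector or one (n_images, 3) row per image.
    """
    pred = np.asarray(pred_vox, dtype=float)
    gt = np.asarray(gt_vox, dtype=float)
    spacing = np.asarray(spacing_mm, dtype=float)
    # Scale the voxel offset by the physical voxel size before taking the norm,
    # so that anisotropic voxels are handled correctly.
    diff_mm = (pred - gt) * spacing
    return np.linalg.norm(diff_mm, axis=-1)


def summarize(errors):
    """Return (mean, sample standard deviation) of a 1-D array of errors."""
    errors = np.asarray(errors, dtype=float)
    # ddof=1 gives the sample standard deviation (an assumption of this sketch).
    return float(errors.mean()), float(errors.std(ddof=1))


if __name__ == "__main__":
    # Toy coordinates for illustration only; these are not data from the study.
    pred = [[10, 12, 8], [20, 22, 18], [5, 5, 5]]
    gt = [[10, 13, 8], [21, 22, 18], [5, 5, 7]]
    errors = landmark_errors_mm(pred, gt, spacing_mm=[0.3, 0.3, 0.3])
    mean, std = summarize(errors)
    print(f"{mean:.2f} ± {std:.2f} mm")
```

Running the same computation once per landmark (*C*, *Ap*, *RW*) would yield one mean ± standard deviation pair per landmark, in the form reported above.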


**Table 2.** Accuracy and robustness analysis for each pipeline process. ASSD: average symmetric surface distance, RAVD: relative absolute volume difference, HD95: 95% Hausdorff distance.
