*2.3. Segmentation of Cochlear Structures*

Nautilus is built with cochlear surgery planning, evaluation, and audiological fitting in mind. In the current version, we therefore focus on segmenting the two main cochlear ducts—ST and SV—and compute relevant measurements from these structures, as done in prior work [41]. At a later stage, the delineation of ST and SV serves to relate the electrode array placement within the cochlea and to infer information such as the characteristic frequency of each electrode contact [23]. An accurate and robust segmentation of ST and SV is therefore critical. Recent approaches based on convolutional neural networks have shown the most promise. Nikan et al. [9], for instance, segmented various temporal bone structures, including the labyrinth, ossicles, and facial nerve. Most cochlear segmentation approaches perform remarkably well on the cochlea and neighboring structures. They do not, however, separate the scalae [8,42], nor do they estimate the position of the BM, the delicate structure responsible for transducing mechanical waves within the cochlea into trains of electrical impulses and an essential structure to preserve in anticipation of restorative therapeutic advances. Separating the scalae on clinical CT is challenging because ST and SV are not discernible on clinical scans, mainly due to limited image resolution and contrast. To circumvent this issue, a shape model is often used to provide a priori information on the ST/SV distinction within the cochlear labyrinth. Recently, atlases [43] and a hybrid active shape model combined with deep learning [44] have been used successfully for the separation of the scalae.

We used the pre-operative image of the cochlea as the reference image for segmentation. Nautilus uses an approach similar to [44], merging deep learning for appearance modelling with a strong shape prior that constrains the final segmentation [45]. Instead of an active shape model, however, we build on a well-validated Bayesian joint appearance and shape inference model [20,46]. The parameters of this shape model were tuned and validated on μCT data; the model can then serve as a strong prior constraining the final output on lower-resolution clinical CT images. This approach provides a probabilistic separation of ST and SV even in images of poor resolution. We estimate the BM location from the intersection of the ST and SV probability maps. The original Bayesian framework of Demarcy et al. modelled the foreground and background appearance (i.e., intensity) as mixtures of Student's t-distributions. We observed that this appearance model is fairly sensitive to the scanner used for image acquisition and to image quality, despite using normative Hounsfield units. To achieve better generalization, we therefore replaced it with a trained convolutional neural network [36].
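The idea of locating the BM at the intersection of the ST and SV probability maps can be illustrated with a minimal sketch. The product combination rule, the function name, and the toy sigmoid profiles below are our own assumptions for illustration, not the exact Nautilus implementation.

```python
import numpy as np

def estimate_bm_probability(p_st: np.ndarray, p_sv: np.ndarray) -> np.ndarray:
    """Illustrative estimate of the basilar membrane (BM) location as the
    intersection of the ST and SV probability maps.

    p_st, p_sv: voxel-wise posterior probabilities of scala tympani and
    scala vestibuli (same shape, values in [0, 1]).

    Returns a map that is high only where both scalae are plausible,
    i.e., near their shared boundary, where the BM lies.
    """
    # The product of the two posteriors peaks at the ST/SV interface;
    # other combination rules (e.g., the voxel-wise minimum) behave similarly.
    return p_st * p_sv

# Toy 1D example: two soft, adjacent "scala" profiles along one axis.
x = np.linspace(0.0, 1.0, 101)
p_st = 1.0 / (1.0 + np.exp((x - 0.5) / 0.05))   # high for x < 0.5
p_sv = 1.0 / (1.0 + np.exp(-(x - 0.5) / 0.05))  # high for x > 0.5
bm = estimate_bm_probability(p_st, p_sv)
# The BM estimate peaks at the boundary between the two maps (x = 0.5).
print(int(np.argmax(bm)))  # → 50
```

In 3D, thresholding or taking the ridge of such a map yields a surface estimate of the BM between the two scalae.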

As with our landmark detection approach, we used a reference 3D U-Net implementation from MONAI [37], with 6 encoding blocks, 8 output channels after the first layer (see Figure S2), and PReLU as the activation function, and trained it on 130 images. We normalized the data by resampling the images to 0.125 mm spacing and rescaling the intensities such that the 5th and 95th percentiles of each image's intensity distribution were mapped to 0 and 1, respectively. In addition to the augmentations used for landmark detection, we applied random patch swapping [47] to increase robustness to artifacts and to force the network to learn a stronger shape prior. The model was trained on 128 × 128 × 128 patches with the AdamW optimizer [48], minimizing the Dice focal loss [37,49].
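The percentile-based intensity rescaling can be sketched in NumPy as follows. The function name is hypothetical; MONAI provides a comparable transform (`ScaleIntensityRangePercentiles`, with optional clipping), but we do not claim this is the exact configuration used here.

```python
import numpy as np

def rescale_percentiles(img: np.ndarray, lo_pct: float = 5.0,
                        hi_pct: float = 95.0) -> np.ndarray:
    """Map the lo_pct and hi_pct intensity percentiles of `img` to 0 and 1.

    Values outside the percentile range are extrapolated linearly rather
    than clipped in this sketch.
    """
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    return (img - lo) / (hi - lo)

# Toy example on a synthetic CT-like intensity distribution.
rng = np.random.default_rng(0)
img = rng.normal(loc=1000.0, scale=200.0, size=(32, 32, 32))
out = rescale_percentiles(img)
lo, hi = np.percentile(out, [5, 95])
# After rescaling, the 5th/95th percentiles sit at 0 and 1.
print(abs(lo) < 1e-6, abs(hi - 1.0) < 1e-6)  # → True True
```

Percentile-based rescaling is less sensitive to outlier intensities (e.g., metal artifacts from the implant) than min–max normalization, which is relevant given the scanner variability noted above.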

Many of the metrics we extract in both pre- and post-operative processing depend on a reliable estimation of the cochlear ducts' centerline. Because our segmentation of ST and SV is based on a parametric shape model [46], extracting an approximate centerline is straightforward. We then refine this curve and estimate the ST and SV centerlines from cross-sections of the segmentations along it. At each cross-section, we estimate the lateral wall (LW) landmark as the point on the ST furthest from the modiolar axis; the OC at 80% of the distance from the modiolar axis to the LW [13]; and the SG, as an approximation of Rosenthal's canal, by offsetting the modiolar wall landmark (i.e., the point on the ST closest to the modiolus) by −0.35 mm both radially and longitudinally.
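The per-cross-section landmark construction can be illustrated with a simplified 2D sketch. The function name, the circular toy contour, and the omission of the −0.35 mm SG offsets are our simplifications; the actual computation operates on cross-sections of the 3D segmentation.

```python
import numpy as np

def cross_section_landmarks(st_boundary: np.ndarray, modiolar_axis: np.ndarray):
    """Illustrative 2D landmark estimation within one cochlear cross-section.

    st_boundary: (N, 2) points sampled on the scala tympani (ST) contour.
    modiolar_axis: (2,) position of the modiolar axis in the same plane.
    Returns (lateral wall, modiolar wall, organ-of-Corti estimate).
    """
    d = np.linalg.norm(st_boundary - modiolar_axis, axis=1)
    lw = st_boundary[np.argmax(d)]   # furthest ST point from the modiolar axis
    mw = st_boundary[np.argmin(d)]   # closest ST point (modiolar wall)
    # Organ of Corti placed at 80% of the radial distance to the LW.
    oc = modiolar_axis + 0.8 * (lw - modiolar_axis)
    return lw, mw, oc

# Toy example: a circular ST contour of radius 1, centered 2 units from the axis.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
contour = np.stack([2.0 + np.cos(theta), np.sin(theta)], axis=1)
axis = np.array([0.0, 0.0])
lw, mw, oc = cross_section_landmarks(contour, axis)
print(lw, mw, oc)  # LW ≈ (3, 0), MW ≈ (1, 0), OC ≈ (2.4, 0)
```

Repeating this construction along the centerline yields the LW, OC, and SG trajectories from which frequency maps and insertion metrics can be derived.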
