*Background*

The development of an automated imaging pipeline enabling the exploration of cochlear anatomy in clinical populations represents a significant challenge. The cochlear structures relevant to CI therapy, specifically the ST and SV, and the CI electrode array cannot always be easily delineated from clinical CT or CBCT images due to low image contrast and poor resolution. This prevents the manual delineation of ST and SV, which would in any case be a time-consuming, error-prone, and inconsistent process. More practically, semi- and fully automatic frameworks have been proposed to segment the cochlear bony labyrinth from pre-operative CT images. Earlier works focused on traditional segmentation techniques, such as level-set and interactive contour algorithms [6,7]. However, these required user input, were computationally expensive, and often produced incomplete segmentations. Recent works have focused on designing fully automatic convolutional neural networks capable of handling the intricate anatomy of the bony labyrinth [8–11]. The bony labyrinth is generally well identifiable in clinical CT or CBCT images, but its robust segmentation remains a challenge when processing images acquired with different scanners and acquisition parameters, which manifest as wide ranges of image resolution, contrast, and noise. Given a delineation of the bony labyrinth, various techniques permit the estimation of metrics relevant to CI implantation, such as the cochlear duct length (CDL), which serves as an indicator of overall cochlear size and of the insertion depth that can reasonably be targeted for a specific cochlea. The CDL and other metrics also enable the computation of normalized tonotopic frequencies according to Greenwood [12], Stakhovskaya [13], or Helpard et al. [14].
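As an illustration, the Greenwood map [12] converts a normalized position along the cochlear duct into a characteristic frequency. The sketch below uses Greenwood's published human constants; the position convention (x = 0 at the apex, x = 1 at the base) is an assumption here, as the exact normalization used downstream is not detailed in this section.

```python
def greenwood_frequency(x: float, A: float = 165.4, a: float = 2.1, k: float = 0.88) -> float:
    """Greenwood characteristic frequency (Hz) at normalized cochlear position x.

    x is the fraction of cochlear duct length measured from the apex
    (x = 0 at the apex/helicotrema, x = 1 at the base).
    A, a, k are Greenwood's constants for the human cochlea [12].
    """
    return A * (10.0 ** (a * x) - k)
```

With these constants, the apical end maps to roughly 20 Hz and the basal end to roughly 20 kHz, spanning the range of human hearing.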

Beyond the information that can be gained from a segmentation of the bony labyrinth, many clinical questions call for the differentiation of ST from SV within the labyrinth. In this case, the automated image processing task becomes much more complex, since ST and SV are generally not visible in clinical CTs or CBCTs. Consequently, various atlases or shape models derived from temporal bone micro-CTs (μCTs) have been proposed to infer a ST/SV differentiation within the bony labyrinth when exploiting a clinical image [15–20]. The delineation of ST and SV is clinically relevant because CI implantation is preferentially performed within the ST: implantations into or translocations toward the SV have been associated with auditory pitch reversals and poorer speech intelligibility [21,22].

Post-operatively, CT imaging can provide information about the positioning of each electrode contact within or in the vicinity of the cochlea. However, the exploitation of post-operative CT/CBCT images is often compromised by metal artifacts emanating from the electrodes, which generally degrade the region of interest around the electrodes enough to prevent the delineation of the bony labyrinth. Therefore, the post-implantation reconstruction of the CI electrode within cochlear structures often requires harnessing both the pre-operative and post-operative scans. Vanderbilt University's group first proposed to independently segment intra-cochlear structures from pre-operative images using active shape models, followed by detection of the electrode array midline from post-operative imaging, before combining pre- and post-operative information through a rigid registration [23]. They also proposed to take advantage of the left/right symmetry of inner-ear anatomy by utilizing the pre-operative image of the normal contra-lateral ear for cochlear structure delineation in cases where pre-operative CT images were not available [24]. Given a successful reconstruction of electrode placement within cochlear structures, the characteristic frequency (CF) at each contact can be computed at the estimated corresponding place on the organ of Corti (OC) [12] or at the nearest spiral ganglion (SG) [13,14,25]. The accurate inference of the relative position between an electrode and the basilar membrane (BM) lining the ST can also enable the assessment of a potential translocation of the electrode into the SV, or predictions of how traumatic the insertion was, e.g., whether the electrode elevated or ripped through the BM and entered the SV.
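In its simplest landmark-based form, the rigid registration that fuses pre- and post-operative information can be computed in closed form with the Kabsch algorithm. The sketch below is a generic illustration of that technique under the assumption of known point correspondences, not the registration method actually used in [23] or in Nautilus.

```python
import numpy as np

def rigid_register(P: np.ndarray, Q: np.ndarray):
    """Least-squares rigid transform (R, t) mapping points P onto Q.

    P, Q: (N, 3) arrays of corresponding points (e.g., landmarks in the
    pre- and post-operative images). Kabsch algorithm, no scaling.
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation
    t = cQ - R @ cP                          # optimal translation
    return R, t
```

Applying the returned transform as `Q_est = P @ R.T + t` then brings pre-operative structures into the post-operative frame.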
Although state-of-the-art research on cochlear imaging has resulted in imaging pipelines whose accuracy can warrant their use in specific settings, these pipelines have generally not been subjected to a strict robustness evaluation, i.e., an assessment of their ability to handle images of heterogeneous quality, such as those encountered in datasets collected across different clinical centers. Seeking to facilitate the exploration of clinical questions related to the anatomical and geometrical considerations of CI therapy, Nautilus enables the automated, accurate, and robust segmentation of the cochlear bony labyrinth, ST, and SV from pre-operative CT/CBCTs, while transparently reporting uncertainty. Post-operatively, Nautilus enables the automated identification and reconstruction of the electrode arrays within the cochlear structures extracted from the pre-operative image. The tool computes a range of metrics relevant to both surgical and audiological research in CI, including the characteristic frequencies at each electrode contact. Nautilus' predictions have been evaluated against several datasets annotated by experts and demonstrate state-of-the-art accuracy. Importantly, Nautilus was designed and stress-tested against images spanning a range of resolution, contrast, and noise, making it robustly applicable, particularly for images meeting the input specifications that promote success, as discussed later. Finally, the tool transparently notifies users of possible processing failures or complications via a set of caution flags, allowing the rejection of data points that may otherwise bias analysis.

#### **2. Methods**

Nautilus aims to be a gateway to advanced cochlear analysis. To maximize its availability, it has been designed as a web application accessible via any modern web browser (e.g., Mozilla Firefox, Google Chrome, or Microsoft Edge), with no additional installation and no demanding hardware requirements. The data processing happens transparently on a cloud computing service. An overview of the processing pipeline can be seen in Figure 2, with Figure 3 illustrating the intermediary outputs of the process.

**Figure 2.** Nautilus pipeline overview. After the images are dropped onto a web browser window, the user moves a cross-hair roughly to the cochlea's center and selects the side (left/right) and whether it is a pre- or a post-operative scan. A crop (10 × 10 × 10 mm) centered on that landmark is then rid of personally identifiable information and uploaded for processing. First, relevant landmarks (the center, round window, and apex) are estimated and used for initial cochlear pose (reference coordinate system) computation. Segmentation of the cochlear bony labyrinth (CO) is obtained through a convolutional neural network, whereas subsequently, the scala tympani (ST) and scala vestibuli (SV) are obtained using Bayesian inference. From the post-operative image, electrode array contact coordinates and lead wire are extracted and fit to the Oticon Medical EVO electrode CAD model. An interactive visualization as well as pre- and post-operative metrics are available directly on the web browser. A number of additional outputs are generated by the pipeline and made available for data export for further processing and applications. The segmentations can be exported in STL format for 3D printing, for instance. An estimate of electrode trajectory is also provided from the pre-operative image to estimate the equivalent angular coverage for a given electrode insertion depth in millimeters.

**Figure 3.** Steps of the image analysis pipeline in Nautilus. Regions of interest (10 × 10 × 10 mm) around a manually placed center (blue sphere) are cropped from both pre-operative (**a**) and post-operative (**f**) images. Landmark heatmaps are estimated (**b**,**g**) for the center (green), round window (blue), and apex (red). Images are aligned with rigid registration (**c**,**h**) as shown in cochlear view. Segmentation of the cochlear bony labyrinth (CO) (**d**) is subsequently split into the scala tympani (ST) and scala vestibuli (SV) (**e**). From the post-operative image, electrode array contact coordinates and lead wire are extracted (**i**), and an Oticon Medical EVO electrode CAD model is fit (**j**).

#### *2.1. Data Upload and Pseudonymization via a Web-Based Frontend*

Each user can create their private collection of images and associate each image with a specific case/individual. For each case, a unique anonymous identifier is generated upon creation. Most standard medical imaging formats are admissible (e.g., DICOM, NIFTI, MHA), as they can be loaded by ITK [26]. Once the image is loaded in the local browser, its metadata (if any) are cleared of all personally identifiable information (PII). The user must then indicate the laterality of the cochlea (left or right), specify whether it is a pre- or post-operative scan, and roughly place a cross on the targeted cochlea so as to allow the cropping and upload of a region of interest (ROI) from the original (albeit anonymized) image. After the data are uploaded, a processing job is queued and handled by the backend as soon as the required computing resources become available.
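The metadata-cleaning step can be sketched as a simple filter over header fields. The field names below are common DICOM PII tags chosen for illustration only; the actual list of fields stripped by Nautilus is not specified in the text.

```python
# Illustrative set of header fields commonly treated as PII
# (hypothetical; not the tool's actual de-identification list).
PII_KEYS = {
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "ReferringPhysicianName", "InstitutionName", "AccessionNumber",
}

def strip_pii(metadata: dict) -> dict:
    """Return a copy of an image-header dict with PII fields removed."""
    return {k: v for k, v in metadata.items() if k not in PII_KEYS}
```

Only the cleaned header and the cropped ROI would then leave the user's browser.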

#### *2.2. Cochlear Landmarks and Canonical Pose Estimation*

Cochlear pose estimation is essential to determine an initial orientation of the cochlea within the image and serves for image visualization in the standardized views [27]. The estimation of cochlear pose is also used for inferring the characteristic equation of the modiolar axis of the cochlea, which, in turn, is used to derive a number of metrics. We estimate the cochlear pose from a set of three automatically estimated landmarks—the center of the basal turn of the cochlea (C), the round window (RW—defined at its center), and the apex (Ap—defined at the helicotrema), as prescribed in [16]. Ap and C form the modiolar axis, which coincides with the z-axis. The basal plane passes through the RW, which defines the direction of the x-axis. The origin of the canonical reference coordinate system is the intersection of the basal plane and the modiolar axis. Finally, the remaining axis is chosen such that the angle increases as we follow the cochlear duct starting from 0 deg at the RW. The canonical reference frame allows Nautilus' users to consistently compare cochleae of different sizes and treats left and right cochleae equally.
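The construction above can be sketched as follows. This is a schematic reading of the text: it assumes the basal plane is perpendicular to the modiolar axis, and the sign convention chosen for the y-axis (which, per side, makes angles increase along the duct) is an assumption.

```python
import numpy as np

def cochlear_frame(C, RW, Ap):
    """Canonical cochlear frame from center (C), round window (RW), apex (Ap).

    Returns the frame origin and a (3, 3) array whose rows are the
    x, y, z axes. Assumes the basal plane is perpendicular to the
    modiolar axis; the y-axis sign convention is illustrative.
    """
    C, RW, Ap = (np.asarray(p, dtype=float) for p in (C, RW, Ap))
    z = Ap - C
    z = z / np.linalg.norm(z)             # modiolar axis (z)
    origin = C + np.dot(RW - C, z) * z    # basal plane ∩ modiolar axis
    x = RW - origin
    x = x / np.linalg.norm(x)             # x points toward the RW (0 deg)
    y = np.cross(z, x)                    # completes a right-handed frame
    return origin, np.stack([x, y, z])
```

Expressing landmark or electrode coordinates in this frame makes angular insertion depth directly comparable across cochleae and sides.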

A number of approaches have been proposed to estimate the landmarks or the pose, including registration and one-shot learning [28] or using regression forests to vote for the location of the landmarks [29]. More recently, reinforcement learning methods [30–32] have also been used to efficiently locate landmarks or to generate clinically meaningful image views [33] and, relevantly for our domain of application, to locate cochlear nerve landmarks [34]. Heatmap-based approaches consistently demonstrate robustness, explainability, and computational efficiency and offer an elegant form of uncertainty modelling and failure detection [35]. They do, however, sometimes have difficulties locating landmarks near the image borders. We employ a conventional U-Net convolutional neural network architecture [36] as implemented in [37] with three output channels, one for each landmark. We modeled each landmark with a Gaussian heatmap and trained the network to map the input image to the three target heatmaps simultaneously. Our network architecture (detailed in the Supplementary Material) has 3 encoding blocks, 8 channels after the first layer, and 16 output channels for the final feature map before the final projection onto the 3 heatmap channels (see Figure S1).
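A Gaussian heatmap target of the kind described above can be generated as follows. The standard deviation is a placeholder, since the value used during training is not stated in the text; one such volume would be produced per landmark (C, RW, Ap).

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=2.0):
    """Isotropic Gaussian heatmap target (peak 1.0) for one landmark.

    shape: (D, H, W) of the volume in voxels.
    center: landmark position in voxel coordinates.
    sigma: Gaussian spread in voxels (illustrative value).
    """
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    sq_dist = sum((g - c) ** 2 for g, c in zip(grids, center))
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```

Stacking three such volumes yields the three-channel regression target that the U-Net is trained to reproduce.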

Our training set consists of an assortment of 279 pre- and post-operative clinical CT and CBCT images obtained from diverse sources. Our landmark detection block must be capable of handling (and was therefore trained on) both pre- and post-operative images. It is, however, significantly more difficult to accurately annotate C, RW, and Ap on the post-operative images due to the metal artifacts. As a workaround, the pre-operative images were registered with the post-operative images, and the landmarks from the pre-operative images were transported onto the post-operative images.

For training and inference, we resampled the input images to isotropic 0.3 mm spacing and normalized the intensities by linearly mapping the 5th–95th percentile range to 0–1, without clipping. To increase the variability of our training set, we randomly sampled from a combination of data augmentations, such as random noise, flipping in all three dimensions, Gaussian blurring, random anisotropy [38], rigid transformations, and small elastic deformations, as implemented by the TorchIO library [39]. Similar to [40], we observed that focal loss worked particularly well for sufficiently accurate landmark detection. During inference, we transformed the predictions with the sigmoid activation to normalize them between 0 and 1, and for each output, we picked the mode of the output distribution (the hottest voxel of the heatmap) as the corresponding landmark.
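The intensity normalization and landmark readout described above can be sketched as follows. This is a minimal NumPy illustration, assuming the volume has already been resampled and that `logits` is one channel of the network's raw output.

```python
import numpy as np

def normalize_intensities(img: np.ndarray) -> np.ndarray:
    """Linearly map the 5th-95th intensity percentiles to [0, 1], no clipping."""
    p5, p95 = np.percentile(img, [5, 95])
    return (img - p5) / (p95 - p5)

def extract_landmark(logits: np.ndarray) -> tuple:
    """Sigmoid-normalize one heatmap channel and return its hottest voxel."""
    heat = 1.0 / (1.0 + np.exp(-logits))          # sigmoid to [0, 1]
    return np.unravel_index(np.argmax(heat), heat.shape)
```

Because no clipping is applied, values outside the 5th-95th percentile band fall below 0 or above 1, which preserves extreme intensities such as metal artifacts.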
