1. Introduction
Single-cell measurements have revolutionized our understanding of cellular heterogeneity and diversity, allowing for the characterization of distinct cell types within complex tissues based on various molecular activities such as gene expression, chromatin accessibility, proteomics, and methylation. However, a significant constraint of current single-cell technologies is their capability to assess only one particular type of molecular activity per cell. For instance, a cell may undergo either single-cell RNA sequencing (scRNA-seq) or chromatin accessibility profiling (scATAC-seq), but not both. This restriction to a single molecular readout impedes our ability to comprehensively explore the interrelation of different genomic layers within individual cells [1] and understand the regulatory aspects.
Recent advancements in single-cell analysis have led to the emergence of multiomic single-cell methods, enabling the simultaneous profiling of multiple modalities within the same cell [2]. Unlike traditional approaches that focus solely on one omic data type in isolation, these multiomic methods facilitate integrated analysis across various molecular layers within individual cells. By adopting such holistic approaches, researchers can gain a deeper understanding of cellular behavior, elucidating how diverse omic layers, including gene expression, chromatin accessibility, DNA methylation, and protein expression, interact with and influence each other.
However, joint single-cell methods face several challenges. Technical limitations can introduce errors or biases that add noise to the resulting multiomic data [3]. Another significant obstacle is cost: the complexity and resource-intensive nature of joint single-cell assays make them more expensive than traditional single-cell methods that focus on a single omic modality [2]. Additionally, co-assays, in which multiple omic layers are simultaneously profiled from the same individual cells, are a recent advancement in single-cell technology, so co-assay data are less prevalent than single-assay data, and public repositories contain fewer co-assay datasets than single-assay datasets. Together, these technical challenges and resource constraints make it difficult to conduct joint profiling of multiple omic modalities within single cells.
Numerous methods have been developed to address challenges in single-cell data analysis. For scRNA-seq data, approaches such as SAUCIE [4], Deep Count Autoencoder [5], and scScope [6] have demonstrated efficacy in denoising data and capturing underlying biological variability. Similarly, for scATAC-seq data, models like cisTopic [7] and SCALE [8] have been successful in learning informative latent representations for clustering and regulatory region identification. Recent advancements in experimental techniques have facilitated the generation of paired single-cell data, enabling more efficient multimodal modeling approaches. For example, MultiVI [9] employs deep generative models to jointly analyze and integrate scRNA-seq and scATAC-seq data, leveraging variational autoencoders (VAEs) to embed both modalities into a shared latent space. Another notable model, BABEL [1], utilizes deep learning techniques to translate between gene expression and chromatin accessibility profiles at the single-cell level. However, there is still significant room for improvement in performance and accuracy. Additionally, the current models lack a user-friendly way to perform inference, which limits their accessibility and usability for a broader audience. Implementing pipelines, creating datasets, and transforming data to appropriately fit the model require users to be familiar with such processes and to invest significant time and effort. Furthermore, users need to access high-performance computing resources on Linux and learn how to run analyses in these environments. This can be a daunting task for those who are more accustomed to using less technical interfaces. By addressing these gaps, we can create a more efficient and user-centric solution.
In this paper, we propose a machine learning model, CrossMP, designed to computationally generate other multiomic modalities of a single cell from a single measured modality. The model is built on a deep neural network architecture that uses fully connected networks to learn the latent representation of each modality and predict the target modality. Our focus lies in bridging the gap between scRNA-seq and scATAC-seq profiles, enabling translation between the two. Given an scRNA-seq profile of a set of cells, the model outputs the corresponding scATAC-seq profile, and vice versa. We trained our model using cells collected from various human and mouse datasets. Moreover, we integrated our pretrained model into the backend of the CrossMP web portal, which allows researchers to predict scRNA-seq and scATAC-seq data through a user-friendly platform. The novelty of our approach lies in several aspects: achieving higher accuracy than existing methods, providing a user-friendly web interface for users to conduct their own predictions, and actively developing capabilities for users to train models with their own datasets. These contributions aim to enhance accessibility and applicability in diverse research settings for a broader audience.
2. Materials and Methods
2.1. Data Preprocessing
The model was trained on a curated selection of paired human and mouse single-cell ATAC-seq and RNA-seq datasets sourced from the 10x Genomics multiomics platform (Table 1).
For the human subset, we compiled five distinct datasets. These include the COLO320DMHSR dataset, encompassing colon adenocarcinoma cells and colorectal adenocarcinoma cells. The kidney cancer dataset comprises human kidney nuclei obtained from frozen tissue. The lymphoma dataset features flash-frozen intra-abdominal lymph node tumor samples from a patient diagnosed with diffuse small lymphocytic lymphomas. Lastly, we have the PBMC I and PBMC II datasets. The former consists of peripheral blood mononuclear cells (PBMCs) from healthy male donors aged 30–35, whereas the latter comprises cryopreserved PBMCs from a healthy female donor aged 25.
For the mouse subset, we curated several datasets. The cortex dataset comprises 5081 and 10,309 nuclei from neonatal and adult mouse brains, respectively. The mouse brain dataset includes nuclei obtained from frozen brain tissue, while the mouse kidney dataset comprises nuclei extracted from frozen mouse kidney tissue. Additionally, we have the brain Alzheimer dataset, which involves a multiomic integration study conducted on a mouse model of Alzheimer’s disease.
To prepare the scATAC-seq data, several preprocessing steps were undertaken, as shown in Figure 1a. Initially, peaks located on sex chromosomes were excluded from consideration. Next, to streamline subsequent computation, overlapping peaks were merged into a unified representation. The resulting cell-by-peak matrix was binarized, with all nonzero values converted to 1, denoting the presence of chromatin accessibility, whereas absent regions were represented by 0. To foster the model’s capacity to discern generalizable patterns and features representative of the overall chromatin accessibility landscape, additional refinement steps were employed. Peaks that occurred infrequently, appearing in fewer than five cells, were eliminated to prevent overfitting to rare occurrences that may lack broad applicability across the dataset. Similarly, overly common peaks, observed in more than 10% of cells, were removed to mitigate potential biases toward highly prevalent regions that may not significantly contribute to distinguishing cell types.
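The binarization and peak-frequency filters described above can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the function names `binarize` and `filter_peaks` are hypothetical, the matrix is a plain list of lists (cells by peaks), and the removal of sex-chromosome peaks and the merging of overlapping peaks are assumed to have been done beforehand.

```python
def binarize(matrix):
    """Convert every nonzero count to 1 (accessible) and keep 0 otherwise."""
    return [[1 if v != 0 else 0 for v in row] for row in matrix]


def filter_peaks(binary, min_cells=5, max_fraction=0.10):
    """Keep peaks present in at least `min_cells` cells and at most `max_fraction` of cells."""
    n_cells = len(binary)
    n_peaks = len(binary[0])
    # Number of cells in which each peak is accessible.
    cells_per_peak = [sum(row[j] for row in binary) for j in range(n_peaks)]
    keep = [
        j for j, c in enumerate(cells_per_peak)
        if c >= min_cells and c <= max_fraction * n_cells
    ]
    filtered = [[row[j] for j in keep] for row in binary]
    return filtered, keep


# Toy example: 100 cells and 3 peaks seen in 2, 8, and 50 cells respectively.
counts = [[0, 0, 0] for _ in range(100)]
for i in range(2):
    counts[i][0] = 3   # rare peak (fewer than five cells)
for i in range(8):
    counts[i][1] = 1   # retained peak
for i in range(50):
    counts[i][2] = 7   # overly common peak (more than 10% of cells)

filtered, kept = filter_peaks(binarize(counts))
print(kept)
```

Only the middle peak passes both thresholds in this toy input; real data would typically be held in a sparse matrix rather than nested lists.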
Following the preprocessing of the scATAC-seq data, we prepared the scRNA-seq data similarly, first removing genes located on sex chromosomes. Cells expressing fewer than 200 genes or more than 7000 genes were then removed to ensure data quality and consistency. Next, we normalized the data by scaling the counts in each cell so that they summed to the median count per cell, making counts comparable across cells. We then applied a log transformation followed by Z-score normalization, and clipped data points falling within the top and bottom 0.5% of the entire distribution. These normalization and filtering steps mitigated the influence of extreme outliers and improved the robustness and interpretability of the downstream analysis.
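As a rough illustration of these scRNA-seq steps, the sketch below uses only the Python standard library. Several details are assumptions rather than statements from the text: the function names are hypothetical, the log transformation is taken to be log(1 + x), Z-scores are computed per gene, and the 0.5% clipping bounds are found with a simple nearest-rank rule over all values.

```python
import math
import statistics


def filter_cells(counts, min_genes=200, max_genes=7000):
    """Drop cells expressing fewer than `min_genes` or more than `max_genes` genes."""
    return [row for row in counts
            if min_genes <= sum(1 for v in row if v > 0) <= max_genes]


def normalize_to_median(counts):
    """Rescale each cell so its total count equals the median total across cells."""
    totals = [sum(row) for row in counts]
    target = statistics.median(totals)
    return [[v * target / t for v in row] for row, t in zip(counts, totals)]


def log_zscore_clip(matrix, lower=0.005, upper=0.995):
    """Apply log(1 + x), per-gene Z-scoring (assumed), then clip at global quantiles."""
    logged = [[math.log1p(v) for v in row] for row in matrix]
    n_genes = len(logged[0])
    columns = [[row[j] for row in logged] for j in range(n_genes)]
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant genes to avoid division by zero.
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    z = [[(row[j] - means[j]) / sds[j] for j in range(n_genes)] for row in logged]
    flat = sorted(v for row in z for v in row)
    lo = flat[int(lower * (len(flat) - 1))]
    hi = flat[int(upper * (len(flat) - 1))]
    return [[min(max(v, lo), hi) for v in row] for row in z]


# Toy example with lowered thresholds so that a 3-gene matrix is usable.
toy = [[1, 0, 3], [2, 2, 0], [6, 0, 0], [0, 0, 0]]
kept = filter_cells(toy, min_genes=1, max_genes=2)  # drops the empty cell
scaled = normalize_to_median(kept)                  # every row now sums to 4
processed = log_zscore_clip(scaled)
```

In practice these operations would usually be run on an AnnData object with Scanpy, which the paper lists among its dependencies; the plain-list version here is only meant to make the order of operations explicit.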
Additionally, we derived gene activity scores from the scATAC-seq data using the regulatory potential (RP) model implemented within the MAESTRO suite [10]. This model assesses the presence of scATAC-seq peaks surrounding each gene, which indicate potential transcriptional regulator binding and its impact on gene expression. Peaks were weighted by exponential decay from the transcription start site (TSS), and the sum of all peaks within a given gene exon region was calculated as if they were located at the TSS. This sum was then normalized by the total exon length. By inputting our scATAC-seq data into the RP model, we obtained the corresponding gene activity scores using the enhanced model with a 10 kb decay distance.
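To make the distance weighting concrete, the following sketch computes a decay-weighted sum of peaks around one TSS. It is not MAESTRO's code or API: the functions `peak_weight` and `regulatory_potential` are hypothetical, the decay is assumed to take a half-life form in which the weight halves every 10 kb, and the special handling of exon peaks and exon-length normalization described above is deliberately left out.

```python
def peak_weight(distance_bp, decay_bp=10_000):
    """Weight that halves every `decay_bp` base pairs from the TSS (assumed form)."""
    return 2.0 ** (-abs(distance_bp) / decay_bp)


def regulatory_potential(tss, peaks, decay_bp=10_000):
    """Sum the accessibility of nearby peaks, each down-weighted by distance to the TSS.

    `peaks` is a list of (position, accessibility) pairs on the gene's chromosome.
    """
    return sum(acc * peak_weight(pos - tss, decay_bp) for pos, acc in peaks)


# Toy gene with a TSS at 100 kb and three accessible peaks.
tss = 100_000
peaks = [
    (100_000, 1),  # at the TSS: weight 1
    (110_000, 1),  # 10 kb downstream: weight 0.5
    (80_000, 1),   # 20 kb upstream: weight 0.25
]
score = regulatory_potential(tss, peaks)
print(score)  # 1.75
```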
Furthermore, we acquired the raw FASTQ sequences of the scRNA-seq data and subsequently processed these raw sequence files using 10x Genomics CellRanger 3.1.0 [11] to generate raw feature-barcode matrices, along with the intermediate BAM file. This BAM file was then used with Velocyto [12] to convert it to the LOOM format, facilitating downstream analysis. Finally, utilizing scVelo [13] with the LOOM file as input, we identified significant genes ranked by the velocity score.
2.2. Model Architecture
The model consists of four encoder networks and two decoder networks. Each encoder independently projects the scRNA-seq, scATAC-seq, gene activity scores, and significant gene expression into the latent space. At the bottleneck layer, we merged the latent representation from scRNA-seq and significant gene expression by using element-wise addition. Likewise, the latent representation derived from scATAC-seq and the gene activity scores were merged. The decoders were then utilized to infer the scRNA-seq and scATAC-seq outputs from the latent representation (Figure 1b).
As shown in Figure 1c, in the encoders for scRNA-seq and significant genes, we initially projected the gene–cell expression matrix into a 16-dimensional latent space through two fully connected (FC) layers, each followed by a batch normalization (BN) layer and a ReLU (rectified linear unit) activation. Subsequently, we performed an element-wise merge of the two resulting latent representations. In the decoder for scRNA-seq, the 16-dimensional latent space was first expanded to a 64-dimensional space. Then, this 64-dimensional space was further processed to produce two outputs of the same dimensionality as the input. Finally, an exponential activation was applied to one output to calculate the mean, and a softplus activation was applied to the other to calculate the dispersion parameters.
In the encoders for scATAC-seq and derived gene activity scores, rather than simply projecting the genome-wide peak information with a single fully connected layer, we split the whole-genome peaks by chromosome and assigned a fully connected network to process the peaks of each chromosome independently. This strategy aimed to shed light on the intrachromosomal interaction of DNA accessibility rather than focusing solely on interactions across different chromosomes. Every fully connected network contained two fully connected layers to project the input onto a 16-dimensional space, each followed by a PReLU (parametric ReLU) activation. Subsequently, we concatenated all the resulting latent representations to yield a 352-dimensional concatenated representation. This concatenated representation was then projected onto a 16-dimensional latent representation with the PReLU activation. Following this, we performed an element-wise merge of the 16-dimensional latent representations from scATAC-seq and gene activity scores. Moving to the decoder for scATAC-seq, we began by projecting the 16-dimensional latent representation onto a 352-dimensional space using the PReLU activation. This representation was then split into 22 blocks, each representing a chromosome, with each block containing a 16-dimensional space. We assigned a separate fully connected network to each latent representation, restoring the dimensions to their original sizes for each chromosome, followed by applying a sigmoid activation function.
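The chromosome-wise encoder described above can be sketched in PyTorch as follows. This is an illustrative reading of the text, not the authors' implementation: the class name `ChromosomeWiseEncoder` and the width of the first layer in each per-chromosome network (`hidden_dim`) are assumptions, since the text only states that two fully connected layers with PReLU activations project each chromosome onto 16 dimensions.

```python
import torch
from torch import nn


class ChromosomeWiseEncoder(nn.Module):
    """Encode peaks chromosome by chromosome, then project to a shared latent space."""

    def __init__(self, peaks_per_chrom, hidden_dim=32, chrom_dim=16, latent_dim=16):
        super().__init__()
        self.peaks_per_chrom = list(peaks_per_chrom)
        # One small two-layer network per chromosome (hidden_dim is an assumption).
        self.chrom_nets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(n_peaks, hidden_dim), nn.PReLU(),
                nn.Linear(hidden_dim, chrom_dim), nn.PReLU(),
            )
            for n_peaks in self.peaks_per_chrom
        ])
        # 22 chromosomes x 16 dimensions = 352 concatenated features -> 16.
        self.project = nn.Sequential(
            nn.Linear(chrom_dim * len(self.peaks_per_chrom), latent_dim),
            nn.PReLU(),
        )

    def forward(self, x):
        # Split the peak axis into one block per chromosome.
        blocks = torch.split(x, self.peaks_per_chrom, dim=1)
        encoded = [net(block) for net, block in zip(self.chrom_nets, blocks)]
        return self.project(torch.cat(encoded, dim=1))


# Toy input: 4 cells, 22 chromosomes with a different number of peaks each.
peaks_per_chrom = [50 + i for i in range(22)]
encoder = ChromosomeWiseEncoder(peaks_per_chrom)
x = torch.rand(4, sum(peaks_per_chrom))
z = encoder(x)  # tensor of shape (4, 16)
```

Keeping a separate network per chromosome means the number of first-layer weights grows with the peaks on each chromosome rather than with the whole genome, which is one plausible practical benefit of this design. The decoder would mirror this layout, expanding 16 dimensions to 352, splitting into 22 blocks, and ending each block with a sigmoid.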
The CrossMP model was implemented using Python version 3.8.16, PyTorch version 1.13.1, cpuonly version 2.0, Skorch version 0.11.0, Anndata version 0.8.0, Scanpy version 1.9.1, Matplotlib version 3.6.3, Pandas version 1.5.3, Scikit-learn version 1.2.0, R version 4.0.5, MAESTRO version 1.5.1, CellRanger version 3.1.0, Velocyto version 0.17.17, and scVelo version 0.2.5.
2.3. Model Training
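The ATAC encoder, gene activity score encoder, RNA encoder, and significant gene encoder are denoted as $E_A$, $E_G$, $E_R$, and $E_S$, respectively. These encoders construct the low-dimensional embeddings $z_A$, $z_G$, $z_R$, and $z_S$ from the input scATAC ($x_A$), gene activity scores ($x_G$), scRNA ($x_R$), and significant genes ($x_S$), as shown in Equation (1).
$$z_A = E_A(x_A), \quad z_G = E_G(x_G), \quad z_R = E_R(x_R), \quad z_S = E_S(x_S) \tag{1}$$
In the bottleneck layer, we merged the embeddings ($z_A$, $z_G$) and ($z_R$, $z_S$) by element-wise addition, as described in Section 2.2, giving $z_{ATAC} = z_A + z_G$ and $z_{RNA} = z_R + z_S$.
The ATAC decoder and RNA decoder are denoted as $D_A$ and $D_R$ for reconstructing the scATAC and scRNA, respectively. Here, $\hat{x}_{A \to A} = D_A(z_{ATAC})$ represents the reconstructed scATAC from scATAC, $\hat{x}_{A \to R} = D_R(z_{ATAC})$ represents the predicted scRNA from scATAC, $\hat{x}_{R \to A} = D_A(z_{RNA})$ represents the predicted scATAC from scRNA, and $\hat{x}_{R \to R} = D_R(z_{RNA})$ represents the reconstructed scRNA from scRNA.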
To assess the accuracy of the inferred scRNA-seq data, whether generated from scRNA-seq or scATAC-seq, we employed the negative binomial (NB) loss function, denoted as $L_{NB}$. This choice was informed by its efficacy, proven in previous studies, in terms of imputing and denoising single-cell expression data [5,14]. Similarly, to gauge the accuracy of the inferred scATAC-seq data, whether generated from scRNA-seq or scATAC-seq, we utilized the binary cross-entropy (BCE) loss function, denoted as $L_{BCE}$. This loss function is well-suited to evaluating binary predictions, making it a natural choice for deep learning models applied to scATAC-seq data. Additionally, we computed the KL (Kullback–Leibler) divergence loss, denoted as $L_{KL}$, between the two bottleneck latent representations to further evaluate the similarity between the two latent representations. Finally, the overall loss function combined $L_{NB}$, $L_{BCE}$, and $L_{KL}$.
We trained the model using the Adam optimizer with a learning rate of 0.01. Early stopping was set to 25 epochs. The batch size was 512 during training. We set the two weighting coefficients of the loss function to 1.33 and 1, respectively, for all the training datasets.
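For readers unfamiliar with the loss terms named in this section, the sketch below writes out a negative binomial negative log-likelihood and a binary cross-entropy for single values, then combines them with a KL term. The exact form of the combined loss is not recoverable from the text, so the combination is an assumption: the hypothetical function `total_loss` attaches the weight 1.33 to the BCE term and the weight 1 to the KL term, averages the per-entry terms, and takes the KL value as a precomputed number.

```python
import math


def nb_nll(x, mu, theta, eps=1e-10):
    """Negative log-likelihood of count x under NB with mean mu and dispersion theta."""
    log_p = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log((theta + eps) / (theta + mu + eps))
        + x * math.log((mu + eps) / (theta + mu + eps))
    )
    return -log_p


def bce(y, p, eps=1e-10):
    """Binary cross-entropy between a 0/1 label y and a predicted probability p."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def total_loss(nb_terms, bce_terms, kl_term, w_bce=1.33, w_kl=1.0):
    """Weighted combination of the three terms (weight assignment is an assumption)."""
    nb_mean = sum(nb_terms) / len(nb_terms)
    bce_mean = sum(bce_terms) / len(bce_terms)
    return nb_mean + w_bce * bce_mean + w_kl * kl_term


# Toy values: one expression entry and one accessibility entry.
rna_term = nb_nll(x=0, mu=1.0, theta=1.0)  # equals ln 2, since P(x = 0) = 0.5
atac_term = bce(y=1, p=0.5)                # also ln 2
loss = total_loss([rna_term], [atac_term], kl_term=0.0)
```

In a PyTorch training loop the same quantities would be computed on tensors, with the exponential and softplus decoder outputs supplying `mu` and `theta`.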
2.4. Web Server Implementation
For easier access to the developed models and results, a web-based portal, CrossMP, was developed with a lightweight development environment and hosted on Docker [15]. Designed to enhance user experience, the system offers clean and well-organized interface components, which help to minimize operational errors. By leveraging high-performance computing resources, it ensures efficient, sustainable, and reliable performance even under heavy workloads. CrossMP generates unique user identifiers to store all input files, models, and result files securely, maintaining privacy and confidentiality. The CrossMP architecture is structured into four distinct modules (Figure 2).
2.4.1. Web Interface Module
This module utilizes lightweight UI libraries like AngularJS [16] to ensure user-friendliness. Its responsive design ensures a consistent appearance across various screen sizes, whether on a computer or a tablet. Additionally, it is compatible with multiple cross-platform web browsers, including Google Chrome, Firefox, Microsoft Edge, and Safari.
2.4.2. Middleware Module
This module serves as an intermediary between the web interface and the database. It employs a RESTful API built with PHP, which leverages HTTP requests for data access and retrieval, job creation, and job information display. To ensure security, a token-based login system and token-based authentication validate each API request. The API interacts with AngularJS on the front end.
2.4.3. Core Module
The core module mainly consists of file download, file verification, and job picker components that can run synchronously from the main application. The file download component is called whenever a job is created; it uses the Google API to access the file from Google Drive and downloads it as a stream because the file is likely to be large. The job picker is called by a cron job that runs periodically; it checks the available cores and the running jobs to determine which jobs can be started so that the hardware resources are used fully without being overloaded. The file verification component uses Python data analysis libraries, such as Scanpy [17] and Pandas, to validate uploaded files. The core module is also responsible for notifying users about successful and failed jobs.
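The job picker logic can be sketched as a simple first-fit scheduler. This is only one way such a component might work and is not taken from the CrossMP code base: the `Job` class, its `cores` field, and the `pick_jobs` function are hypothetical, and the status strings follow the job states named in the database module.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    cores: int                   # cores the job needs (hypothetical field)
    status: str = "not started"  # "not started", "running", "failed", or "successful"


def pick_jobs(jobs, total_cores):
    """Return queued jobs that fit into the cores not used by running jobs."""
    used = sum(job.cores for job in jobs if job.status == "running")
    free = total_cores - used
    picked = []
    for job in jobs:  # jobs are assumed to be ordered by creation time
        if job.status != "not started":
            continue
        if job.cores <= free:
            picked.append(job)
            free -= job.cores
    return picked


# Toy queue on an 8-core machine where one job already occupies 4 cores.
queue = [
    Job("r1", cores=4, status="running"),
    Job("a", cores=6),  # too large for the 4 free cores, skipped this round
    Job("b", cores=2),
    Job("c", cores=2),
    Job("d", cores=1),  # no cores left after b and c
]
selected = [job.job_id for job in pick_jobs(queue, total_cores=8)]
print(selected)  # ['b', 'c']
```

A periodic cron invocation would call a function like this, mark the selected jobs as running, and leave the rest for the next round.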
2.4.4. Database Module
MySQL [18] databases are used in this module. As relational databases, they keep track of the user data and of the job statuses (not started, running, failed, and successful).
4. Conclusions
We introduced a machine learning model designed to effectively bridge the gap between scRNA-seq and scATAC-seq profiles using co-assay single-cell data. Through comprehensive evaluation, we demonstrated the robust performance of our model across diverse experimental contexts, including holdout test datasets and datasets generated using different experimental protocols. This underscores its versatility in accurately translating between modalities, thereby facilitating comprehensive analysis of single-cell omics data. We also examined the potential limitations of CrossMP. Notably, it achieves its best results with large datasets, whereas its performance diminishes on smaller datasets comprising fewer than 10,000 cells (Supplementary Materials, Table S1), which we plan to investigate further.
In addition to its performance, our model is accompanied by the user-friendly CrossMP web portal. This portal provides an intuitive and interactive interface that gives researchers direct access to the predictive capabilities of our model. By simply uploading their input modality data file in the h5ad format, researchers can predict scRNA-seq or scATAC-seq data. Moreover, the portal offers advanced functionalities and visualization tools to further streamline data analysis and interpretation, fostering collaboration and accelerating discoveries in the field of single-cell omics.
In our future endeavors, we aim to enhance the performance of our pretrained human model by augmenting our dataset with additional human co-assay data. By expanding our dataset, we can improve the model’s accuracy and generalizability, enabling more robust translation of single-cell omics data. Furthermore, we intend to enhance our model’s capabilities by expanding its translation abilities to encompass a variety of organisms, including plants such as soybean, maize, Arabidopsis, and other species. This expansion will broaden the applicability of our model and facilitate cross-species comparisons in single-cell omics research. Additionally, we aspire to extend our model to accommodate translation between other single-cell modalities, such as single-cell proteomics data, in the future.
In parallel, we seek to enhance the functionality of the CrossMP web portal to empower users to train their own models using their own datasets. This feature will enable researchers to tailor the model to their specific experimental setups and biological questions, fostering customization and flexibility in single-cell omics analysis.