Article
Peer-Review Record

Effects of Sample Size on Plant Single-Cell RNA Profiling

Curr. Issues Mol. Biol. 2021, 43(3), 1685-1697; https://doi.org/10.3390/cimb43030119
by Hongyu Chen 1,†, Yang Lv 2,3,†, Xinxin Yin 1, Xi Chen 1, Qinjie Chu 1, Qian-Hao Zhu 4, Longjiang Fan 1,5 and Longbiao Guo 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 1 September 2021 / Revised: 17 September 2021 / Accepted: 15 October 2021 / Published: 20 October 2021
(This article belongs to the Section Bioinformatics and Systems Biology)

Round 1

Reviewer 1 Report

In the manuscript from Chen et al., the authors have analyzed a collection of single-cell RNA-seq data sets of A. thaliana. The authors performed several down-sampling steps and investigated the reliability and robustness of identified cell types by re-analysing the respective root single-cell data sets. The results can help to design single-cell studies and give orientation when weighing cost against completeness in deciding on the number of cells to investigate. The authors evaluate the effect of sample size on the performance of tools and on subsequent scRNA-seq results. In a comprehensive study they underline the importance of high cell numbers for the reliability of cell cluster identification. The manuscript is well structured.

 

L40 - 42

Wrong introduction. STRT-seq is not from (Macosko et al., 2015) but from (Islam et al., 2011). The Drop-seq concept was published by (Macosko et al., 2015), and, contrary to what the authors of this manuscript state, (Klein et al., 2015) published the inDrop concept and not Drop-seq. To me this is low quality, and the integrity / expertise of the manuscript is in question?

L198: Provide a list of the known marker genes and corresponding cell type used for this study. This would improve the manuscript.

L223 – 229 Here, the authors want to provide statistical measures for cluster similarity using several indices. It is not well described what is compared against what. Is the cluster of a sub-sample compared against the cluster of the All data set? Also, the discussion is missing! This needs careful improvement!

Figure 1: The remarks of ‘Others’ should be written as ‘*’ instead of ‘**’ to match the corresponding information in Table S1! Also, the figure legend should explain what the numbers in the round boxes within the graphic are! I assume they are the cell numbers of the single-cell studies, meant to give orientation to the reader. The graphic would be more informative if a trend line were added for human, mouse & rat, plants and other organism types.

Figure 4A: It is not clear how the reader should interpret the UMAP plots. This needs a more comprehensive discussion of what the plots should tell the reader!

Figure 4B: Explain what the density distribution plot can tell the reader.

Figure S1: The inflection point should be integrated into the elbow plots

Figure S2: In addition to the cluster numbers the cell types should be added as well. Also, it is not clear to me if the colours of the clusters are identical throughout the

Figure S4: The dot matrix should be ordered by sample size and the green dots should not be connected, because this irritates and mimics true green dots.

For Table S2: Is there an explanation why the HowManyCell tool cannot estimate the missing ‘NA’ values? What is the reason for this? Sample size does not seem to be the reason, since it works with the 20,000 and 30,000 data sets, but not with the 10,000 data set.

Author Response

First, we thank each reviewer for their reasonable suggestions. The following are our point-by-point responses to the reviewers’ comments.

In the manuscript from Chen et al., the authors have analyzed a collection of single-cell RNA-seq data sets of A. thaliana. The authors performed several down-sampling steps and investigated the reliability and robustness of identified cell types by re-analysing the respective root single-cell data sets. The results can help to design single-cell studies and give orientation when weighing cost against completeness in deciding on the number of cells to investigate. The authors evaluate the effect of sample size on the performance of tools and on subsequent scRNA-seq results. In a comprehensive study they underline the importance of high cell numbers for the reliability of cell cluster identification. The manuscript is well structured.

L40 - 42

Wrong introduction. STRT-seq is not from (Macosko et al., 2015) but from (Islam et al., 2011). The Drop-seq concept was published by (Macosko et al., 2015), and, contrary to what the authors of this manuscript state, (Klein et al., 2015) published the inDrop concept and not Drop-seq. To me this is low quality, and the integrity / expertise of the manuscript is in question?

>>Response: Sorry for the mistake. Corresponding changes have been made in the manuscript.

L198: Provide a list of the known marker genes and corresponding cell type used for this study. This would improve the manuscript.

>>Response: Thanks for the suggestions. The marker genes in this manuscript are mainly integrated from five published Arabidopsis root single-cell articles. We now provide them as Table S1.

L223 – 229 Here, the authors want to provide statistical measures for cluster similarity using several indices. It is not well described what is compared against what. Is the cluster of a sub-sample compared against the cluster of the All data set? Also, the discussion is missing! This needs careful improvement!

>>Response: Sorry for the mistake. Corresponding changes have been made in the manuscript. The cluster information of each sub-sample was compared against the information of the All data set.
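The comparison described in this response, sub-sample cluster assignments against those of the All data set, can be sketched as follows. The record does not name the indices used, so the adjusted Rand index here is an assumption chosen for illustration, and the function names and the cell-ID-to-cluster dictionaries are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count cell pairs that are co-clustered in both, in A only, and in B only.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_both - expected) / (max_index - expected)


def compare_subsample_to_full(sub_clusters, full_clusters):
    """Score a sub-sample clustering against the full-data clustering.

    Both arguments map cell ID -> cluster label (hypothetical layout);
    only cells present in both are compared.
    """
    shared = sorted(set(sub_clusters) & set(full_clusters))
    return adjusted_rand_index(
        [sub_clusters[c] for c in shared],
        [full_clusters[c] for c in shared],
    )
```

Cluster labels need not match between the two runs; only the grouping of cells matters, so a sub-sample that reproduces the full-data partition scores 1.0 even if its cluster numbers differ.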

Figure 1: The remarks of ‘Others’ should be written as ‘*’ instead of ‘**’ to match the corresponding information in Table S1! Also, the figure legend should explain what the numbers in the round boxes within the graphic are! I assume they are the cell numbers of the single-cell studies, meant to give orientation to the reader. The graphic would be more informative if a trend line were added for human, mouse & rat, plants and other organism types.

>>Response: Thanks for the suggestions. Corresponding changes have been made in the manuscript. The numbers in the round boxes are the cell numbers in plant single-cell studies. To keep the focus on the number of cells in current plant single-cell studies, we did not add trend lines.

Figure 4A: It is not clear how the reader should interpret the UMAP plots. This needs a more comprehensive discussion of what the plots should tell the reader!

>>Response: We drew a map for each sub-sample using UMAP (Uniform Manifold Approximation and Projection), which showed that the meristem cells are in the center and show a clear trend towards multiple cell types. As the number of cells increases (for example, to 20,000 cells), this trend becomes more obvious.

Figure 4B: Explain what the density distribution plot can tell the reader.

>>Response: Although the pseudotimes shown in the density map cannot be compared directly, the distribution of meristem cells allows a reasonable judgment on the pseudotime of root hair cells. For meristem cells, a relatively accurate pseudotime could not be estimated when the sample size was 3,000 cells or fewer, but the estimated pseudotime, which was concentrated in the 0-5 interval, did not change in the sub-samples with 5,000 or more cells. For root hair cells, the pseudotime likewise could not be estimated in the sub-sample with only 500 cells, because the root hair cells in this sub-sample were concentrated in the area overlapping with the meristem cells, but a relatively distinct pseudotime could be estimated in the sub-samples with 1,000 and 3,000 cells. This analysis also revealed that the development of root hair cells lasted much longer than that of meristem cells.

Figure S1: The inflection point should be integrated into the elbow plots

>>Response: The elbow plot only indicates the approximate range of PCs around the inflection point. The choice of PCs also needs to be judged according to the significance of each PC in the JackStraw plots, so marking an exact inflection point is not necessary.
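For readers who nevertheless want a rough numerical anchor for the inflection point the reviewer asks about, a minimal sketch of one common heuristic follows: pick the PC whose point lies farthest from the straight line joining the first and last points of the elbow curve. This is an illustrative assumption, not the authors' procedure; the function name and the input list of per-PC standard deviations are hypothetical, and, as the response notes, the result would still need to be weighed against per-PC significance.

```python
def approximate_elbow(std_devs):
    """Return the 1-based PC index farthest from the first-to-last chord.

    std_devs: standard deviation explained by PC 1, PC 2, ... in order
    (hypothetical input; any decreasing curve works the same way).
    """
    n = len(std_devs)
    if n < 3:
        raise ValueError("need at least three PCs to locate an elbow")
    x1, y1 = 1, std_devs[0]
    x2, y2 = n, std_devs[-1]
    best_pc, best_dist = 1, -1.0
    for pc, y in enumerate(std_devs, start=1):
        # Numerator of the point-to-line distance; the denominator is the
        # same for every point, so it can be dropped when only ranking.
        dist = abs((y2 - y1) * pc - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_pc, best_dist = pc, dist
    return best_pc
```

The heuristic gives a single index, whereas the elbow plot is usually read as a range, so the returned value is best treated as a starting point rather than a cutoff.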

Figure S2: In addition to the cluster numbers the cell types should be added as well. Also, it is not clear to me if the colours of the clusters are identical throughout the

>>Response: Figure S2 is mainly intended to show whether different numbers of PCs affect the clustering analysis, and thus to determine the appropriate number of PCs. To confirm whether 30 PCs were enough to characterize the integrated data with 56,903 cells, we compared the cell clustering results with 30, 50 and 100 significant PCs and found that, at the same resolution, the number of cell clusters did not change significantly. With 30, 50 and 100 PCs, the number of identified cell clusters was 35, 37 and 37, respectively, suggesting that the top 50 significant PCs are sufficient to represent most of the variation in the sub-sample with 56,903 cells.

Figure S4: The dot matrix should be ordered by sample size and the green dots should not be connected, because this irritates and mimics true green dots.

>>Response: Figure S4 mainly shows the differential genes identified between clusters under different cell numbers. The red line represents the number of differential genes identified, and the entries are sorted by this number. The bar graph shows the number of genes identified in multiple sub-sample tests, which is why they are displayed as connected green dots.

For Table S2: Is there an explanation why the HowManyCell tool cannot estimate the missing ‘NA’ values? What is the reason for this? Sample size does not seem to be the reason, since it works with the 20,000 and 30,000 data sets, but not with the 10,000 data set.

>>Response: Sorry for the mistake. We replaced Table S2 with a new file.

Reviewer 2 Report

Dear editor and colleagues,

 

I have carefully read the entitled meta-analysis manuscript “Effects of sample size on plant single-cell RNA profiling” submitted to the Current Issues in Molecular Biology journal.

 

Without any doubt, this is a paper focusing on an up-to-date topic, and it aims to statistically quantify the minimum number of plant cells needed to obtain meaningful single-cell RNA-seq results.

 

Nonetheless, there are several caveats that, in my opinion, render this paper unacceptable in its current form.

 

The authors seek to detect the effects of sample size in plants. Still a large proportion of the paper focuses on animal/human models (1,244 available studies have used scRNA-seq; among them 30 analyzed plants). Moreover, the authors used data only from Arabidopsis root (across 5 studies). Hence, the conclusions of this study are based on a small dataset lacking the broad-spectrum diversity of records, that is an a priori prerequisite in meta-analyses.

Moreover, the authors used bulked data from these five studies that originate from different developmental stages, different treatments (and different sequencing depths). The authors report that ~3,000 to ~12,000 cells (Denyer et al., 2019; Jean-Baptiste et al., 2019; Ryu et al., 2019; Shulse et al., 2019; Zhang et al., 2019) were used (unequal sample size).

Hence, each dataset could represent biologically different DEG patterns. As a consequence, this fact makes it extremely difficult to predict the universal minimum number of cells across different treatments and developmental stages, since the influence of each biological variable remained unnoticed and could distort results.

A statistical comparison across treatments and ages would be worth investigating; still, more published data are needed across different species. A universal conclusion for plants cannot be based solely on Arabidopsis and only on root protoplasts (this contradicts the title and scope of this study as a universal guideline).

Another factor that the authors should take into account is the number of genes expressed in a tissue or a plant (plants present tremendous differences in genome size, ploidy and tissue complexity). How many cells will be needed in an allopolyploid plant that combines three different genomes (A, C, D) like wheat?

 

Based on the above reasons I must unfortunately recommend a rejection.

Author Response

First, thank you for your reasonable suggestions. The following are our point-by-point responses to the reviewer’s comments.

I have carefully read the entitled meta-analysis manuscript “Effects of sample size on plant single-cell RNA profiling” submitted to the Current Issues in Molecular Biology journal. Without any doubt, this is a paper focusing on an up-to-date topic, and it aims to statistically quantify the minimum number of plant cells needed to obtain meaningful single-cell RNA-seq results. Nonetheless, there are several caveats that, in my opinion, render this paper unacceptable in its current form.

 The authors seek to detect the effects of sample size in plants. Still a large proportion of the paper focuses on animal/human models (1,244 available studies have used scRNA-seq; among them 30 analyzed plants). Moreover, the authors used data only from Arabidopsis root (across 5 studies). Hence, the conclusions of this study are based on a small dataset lacking the broad-spectrum diversity of records, that is an a priori prerequisite in meta-analyses.

>> Response: Our manuscript explores whether different cell numbers in plant single-cell analysis affect subsequent analyses. The statistics on cell numbers in published single-cell articles are mainly meant to illustrate the development trend of single-cell research, in which the number of cells is increasing. Published plant single-cell studies show the same trend. We used several single-cell studies focusing on the roots of Arabidopsis thaliana. The remaining plant studies do not meet the high cell number requirement, and the article analyzing 110,000 cells is only a bioRxiv preprint. Through the analysis of the Arabidopsis data, we provide a good case illustrating that different cell numbers affect the subsequent analysis of plant single cells.

Moreover, the authors used bulked data from these five studies that originate from different developmental stages, different treatments (and different sequencing depths). The authors report that ~3,000 to ~12,000 cells (Denyer et al., 2019; Jean-Baptiste et al., 2019; Ryu et al., 2019; Shulse et al., 2019; Zhang et al., 2019) were used (unequal sample size). Hence, each dataset could represent biologically different DEG patterns. As a consequence, this fact makes it extremely difficult to predict the universal minimum number of cells across different treatments and developmental stages, since the influence of each biological variable remained unnoticed and could distort results.

>> Response: Indeed, there will be certain differences between data from different sources. But as described in the Data integration and sampling section, we used an integration algorithm designed specifically for single-cell data from different sources to integrate the data in the initial analysis, and removed batch effects between the data sets as far as possible.

A statistical comparison across treatments and ages would be worth investigating; still, more published data are needed across different species. A universal conclusion for plants cannot be based solely on Arabidopsis and only on root protoplasts (this contradicts the title and scope of this study as a universal guideline).

>> Response: Thanks for the suggestions. Indeed, data involving more plants or tissues would support broader conclusions. As answered above, we use the existing Arabidopsis data to provide a good case.

Another factor that the authors should take into account is the number of genes expressed in a tissue or a plant (plants present tremendous differences in genome size, ploidy and tissue complexity). How many cells will be needed in an allopolyploid plant that combines three different genomes (A, C, D) like wheat?

>> Response: Thanks for the suggestions. Research on plant single cells does need to develop further. When a large amount of plant single-cell data becomes available, especially for the mentioned species with complex genomes, researchers can refer to the analysis process for the Arabidopsis root data in our article to accurately determine the impact of cell number on the analysis of the specific species.

Reviewer 3 Report

The manuscript by Chen et al. provides an overview of the effect of sample size on plant scRNA-seq outcomes. The authors simulated and systematically compared the effects of sample coverage on downstream scRNA-seq analysis by sampling different numbers of cells from a pool of ~57,000 Arabidopsis thaliana root cells investigated in five previously published studies. The authors concluded that 20,000-30,000 cells are enough for profiling Arabidopsis root cells.
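The down-sampling design summarized above can be illustrated with a minimal sketch that draws sub-samples of several sizes from one pool of cell identifiers. The sub-sample sizes are those mentioned elsewhere in this record; the function name, the seed, and the choice to draw each sub-sample independently and without replacement are assumptions, since the record does not say how the authors drew theirs.

```python
import random


def draw_subsamples(cell_ids, sizes, seed=0):
    """Draw one random sub-sample of cell IDs per requested size.

    Each sub-sample is drawn independently, without replacement, from the
    full pool (an assumption; nested sampling would be an alternative).
    """
    pool = list(cell_ids)
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    subsamples = {}
    for size in sizes:
        if size > len(pool):
            raise ValueError(f"cannot draw {size} cells from {len(pool)}")
        subsamples[size] = rng.sample(pool, size)
    return subsamples


# Hypothetical cell IDs standing in for the integrated pool of 56,903 cells.
pool = [f"cell_{i}" for i in range(56903)]
sizes = [500, 1000, 3000, 5000, 10000, 20000, 30000]
subsamples = draw_subsamples(pool, sizes)
```

Each sub-sample can then be run through the same downstream pipeline and compared against the results from the full pool.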

 

The paper is quite interesting and useful as a general guide for optimizing sample size to be used in plant single cell RNA-seq studies.

 

However, some revisions are necessary before publication.

 

Key words: the authors should insert at least an additional key word such as Arabidopsis thaliana, since the study is mainly related to this plant species.

References inside the text are not reported according to the journal guidelines. Inside the text the references should be inserted as number 1, 2, etc. The authors should follow carefully the Instructions for authors or check some already published papers.

The discussion section must be expanded and there are no conclusions in the manuscript. The authors should also include a small paragraph reporting some concluding remarks and future perspectives

Page 2 Line The authors should report ~57,000 instead of >56,000 in order to avoid confusion for the reader.

Author Response

First, thank you for your reasonable suggestions. The following are our point-by-point responses to the reviewer’s comments.

The manuscript by Chen et al. provides an overview of the effect of sample size on plant scRNA-seq outcomes. The authors simulated and systematically compared the effects of sample coverage on downstream scRNA-seq analysis by sampling different numbers of cells from a pool of ~57,000 Arabidopsis thaliana root cells investigated in five previously published studies. The authors concluded that 20,000-30,000 cells are enough for profiling Arabidopsis root cells.

 The paper is quite interesting and useful as a general guide for optimizing sample size to be used in plant single cell RNA-seq studies.

 However, some revisions are necessary before publication.

 Key words: the authors should insert at least an additional key word such as Arabidopsis thaliana, since the study is mainly related to this plant species.

>>Response: Thanks for the suggestions. The corresponding changes have been made in the revised manuscript.

References inside the text are not reported according to the journal guidelines. Inside the text the references should be inserted as number 1, 2, etc. The authors should follow carefully the Instructions for authors or check some already published papers.

>>Response: Thanks. We replaced the corresponding literature citations in the manuscript.

The discussion section must be expanded and there are no conclusions in the manuscript. The authors should also include a small paragraph reporting some concluding remarks and future perspectives

>>Response: Thanks for the suggestions. The corresponding changes have been made in the revised manuscript.

Page 2 Line The authors should report ~57,000 instead of >56,000 in order to avoid confusion for the reader.

>>Response: We have modified it in the manuscript following the suggestion.
