Article
Peer-Review Record

SAMPLID: A New Supervised Approach for Meaningful Place Identification Using Call Detail Records as an Alternative to Classical Unsupervised Clustering Techniques

ISPRS Int. J. Geo-Inf. 2024, 13(8), 289; https://doi.org/10.3390/ijgi13080289
by Manuel Mendoza-Hurtado *, Juan A. Romero-del-Castillo and Domingo Ortiz-Boyer
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 28 May 2024 / Revised: 2 August 2024 / Accepted: 15 August 2024 / Published: 17 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

-Key findings and implications for future research should be better highlighted in the abstract. 

-A clearer definition of 'meaningful places' should be provided. How is place conceptualized in this research? How do you define 'meaningful'? When qualified as such, for whom is a place meaningful? Why?

-Figure 1 (b) should display a scale bar

-In the discussion, discuss the extent to which CDR can actually address this notion of 'meaningful place'. What is it missing? When? Why?

 

 

Comments on the Quality of English Language

Consider major revision for increasing the paper's impact in the field of human geography

Author Response

Minor editing of English language required.

Answer: The manuscript has been revised for grammatical and typographical errors, and we hope all of them are now fixed.

Comment 1: -Key findings and implications for future research should be better highlighted in the abstract.

Answer:

We have added the following to the abstract:

“For all types of CDRs, the best results are obtained with the 20x20 subgrid, indicating that the model performs better when supplied with information from neighboring cells with a close spatial relationship, establishing neighborhood relationships that allow the model to clearly learn to identify transitions between cells of different types. Considering that it is common for a place or cell to be labeled in multiple categories at once, this supervised approach opens the door to addressing the identification of meaningful places from a multilabel perspective, which is difficult to achieve using classical unsupervised methods.”

 

Comment 2: -A clearer definition of 'meaningful places' should be provided. How is place conceptualized in this research? How do you define 'meaningful'? When qualified as such, for whom is a place meaningful? Why?

Answer:

When we talk about "meaningful places" we are referring to locations regularly visited by individuals. Home and work are the most common "meaningful places" used in the bibliography and are the ones we have used to evaluate our proposal.

We have added this clarification in the abstract:

"Data supplied by mobile phones has become the basis for identifying meaningful places frequently visited by individuals."

 

Comment 3: -Figure 1 (b) should display a scale bar

Answer:

We have added a scale bar to the figure. The area represented is 23.5 km².

 

Comment 4: -In the discussion, discuss the extent to which CDR can actually address this notion of 'meaningful place'. What is it missing? When? Why?

Answer:

Although we reference the limitations of CDRs in various parts of the paper, we have included the following paragraph in the conclusions, which we hope will clarify the pros and cons of CDRs:

“The large amount of information stored by mobile phone network operators about the activity of mobile phones in CDRs makes them a useful tool for identifying meaningful places frequently visited by individuals. On the other hand, CDRs are limited in location accuracy because they record positions only at the granularity of a radio base station. However, the increase in the density of the RBS network is proportional to the population density it serves, which helps keep the bias within acceptable margins for identifying meaningful places such as home and work.”

Reviewer 2 Report

Comments and Suggestions for Authors

In the Introduction section, the authors claim “However, the groupings are based on data characteristics, which are not always related to the classification objectives.” Relevant citations or examples for the claim are missing. As this is one of the motivational factors for the presented research, the authors are suggested to provide appropriate examples/citations in support of this claim.

In section 3.2, the authors have considered 250 random subgrids in their dataset. What does random imply? Size or location? If location, then what is the size of the subgrids? If random implies size, then what do “20x20” and “250 random subgrids with data for one week” mean? The authors are suggested to elaborate upon the dataset that they have used.

The authors are suggested to explain the improvement or novelty of the proposed work with respect to the following work:

Isaacman, S., Becker, R., Cáceres, R., Kobourov, S., Martonosi, M., Rowland, J., & Varshavsky, A. (2011). Identifying important places in people’s lives from cellular network data. In Pervasive Computing: 9th International Conference, Pervasive 2011, San Francisco, USA, June 12-15, 2011. Proceedings 9 (pp. 133-151). Springer Berlin Heidelberg.

Author Response

Comment 1: In the Introduction section, the authors claim “However, the groupings are based on data characteristics, which are not always related to the classification objectives.” Relevant citations or examples for the claim are missing. As this is one of the motivational factors for the presented research, the authors are suggested to provide appropriate examples/citations in support of this claim.

Answer:

We have clarified the meaning of the paragraph and added citation 25.

“Clustering represents an easy way to analyze and categorize data because it is an unsupervised learning technique and, typically, no difficult data processing is required. However, the groupings are based on data characteristics, which are not always related to the classification objectives [25]. That is, clusters made based on a measure of similarity or proximity of the data do not always have to correspond well with the classes into which we want to classify the data.”

 

“25 Kononenko, I.; Kukar, M. Chapter 12 - Cluster Analysis. In Machine Learning and Data Mining; Kononenko, I.; Kukar, M., Eds.; Woodhead Publishing, 2007; pp. 321–358. https://doi.org/10.1533/9780857099440.321.”
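
The point of the revised paragraph can be illustrated with a small sketch. It is not taken from the paper: the cells, their (day, night) event counts, and the labels are invented for illustration. It compares the k-means objective (within-cluster sum of squares) for two ways of splitting the same cells, by label and by activity volume, to show that a similarity-based criterion can prefer a grouping that ignores the classes of interest.

```python
def wcss(groups):
    """Within-cluster sum of squared distances to each group's centroid.

    This is the quantity that k-means tries to minimise.
    """
    total = 0.0
    for points in groups:
        centroid = [sum(axis) / len(points) for axis in zip(*points)]
        total += sum(
            sum((value - centre) ** 2 for value, centre in zip(point, centroid))
            for point in points
        )
    return total


# Hypothetical cells described by (daytime events, night-time events).
# Home cells are busier at night, work cells are busier during the day,
# but both classes contain quiet and busy cells.
home = [(10, 20), (12, 25), (100, 200), (110, 210)]
work = [(20, 10), (25, 12), (200, 100), (210, 110)]

# Partition 1: the classes we actually want to recover.
by_label = [home, work]

# Partition 2: a split by overall activity volume, regardless of label.
all_cells = home + work
quiet = [cell for cell in all_cells if sum(cell) < 100]
busy = [cell for cell in all_cells if sum(cell) >= 100]
by_volume = [quiet, busy]

print("WCSS when grouping by label: ", wcss(by_label))
print("WCSS when grouping by volume:", wcss(by_volume))
```

In this toy data the volume-based split has the lower objective value, so a clustering driven purely by proximity would tend to separate quiet cells from busy ones rather than home from work.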

 

 

Comment 2: In section 3.2, the authors have considered 250 random subgrids in their dataset. What does random imply? Size or location? If location, then what is the size of the subgrids? If random implies size, then what do “20x20” and “250 random subgrids with data for one week” mean? The authors are suggested to elaborate upon the dataset that they have used.

Answer:

We have modified the description of the datasets in section 3.2 to clarify that we are referring to a subgrid formed by a set of 250 random cells, as shown in Figure 1b. The revised text would be as follows:

 

• the 20x20 subgrid with data for one week,

• the 20x20 subgrid with data for one working day,

• the subgrid formed by 250 random cells with data for one week,

• the subgrid formed by 250 random cells with data for one working day,

• the 20x20 subgrid and 250 random cells with data for one week,

• the 20x20 subgrid and 250 random cells with data for one working day.
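
To make the structure of these six datasets concrete, the following sketch builds the three spatial selections and pairs each with the two time windows. It is only an illustration under stated assumptions: the 100 x 100 grid size, the position of the subgrid, the random seed, and the choice to draw the random cells outside the subgrid are all hypothetical and not taken from the response.

```python
import random
from itertools import product

GRID_SIDE = 100        # assumption: the city is divided into a 100 x 100 grid
SUBGRID_SIDE = 20      # from the text: a 20x20 subgrid
N_RANDOM_CELLS = 250   # from the text: 250 random cells


def subgrid_cells(top, left, side=SUBGRID_SIDE):
    """Return the (row, col) cells of a contiguous side x side block."""
    return {(r, c) for r in range(top, top + side) for c in range(left, left + side)}


def random_cells(n, excluded, grid_side=GRID_SIDE, seed=0):
    """Draw n distinct cells uniformly at random, skipping the excluded ones."""
    rng = random.Random(seed)
    candidates = [
        cell for cell in product(range(grid_side), repeat=2) if cell not in excluded
    ]
    return set(rng.sample(candidates, n))


# Hypothetical placement of the subgrid; the real one is shown in Figure 1b.
subgrid = subgrid_cells(top=40, left=40)
# Assumption: the random cells are drawn outside the subgrid.
scattered = random_cells(N_RANDOM_CELLS, excluded=subgrid)

spatial_selections = {
    "20x20 subgrid": subgrid,
    "250 random cells": scattered,
    "20x20 subgrid + 250 random cells": subgrid | scattered,
}
time_windows = ["one week", "one working day"]

# Six datasets: every spatial selection paired with every time window.
datasets = {
    (name, window): cells
    for name, cells in spatial_selections.items()
    for window in time_windows
}

for (name, window), cells in datasets.items():
    print(f"{name}, {window}: {len(cells)} labeled cells")
```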

 

Comment 3: The authors are suggested to explain the improvement or novelty of the proposed work with respect to the following work:

Isaacman, S., Becker, R., Cáceres, R., Kobourov, S., Martonosi, M., Rowland, J., & Varshavsky, A. (2011). Identifying important places in people’s lives from cellular network data. In Pervasive Computing: 9th International Conference, Pervasive 2011, San Francisco, USA, June 12-15, 2011. Proceedings 9 (pp. 133-151). Springer Berlin Heidelberg.

Answer:

We have clarified the proposed method in the cited work. Although it uses the knowledge provided by 18 volunteers, this is only used to demonstrate that the variables calculated from the CDR data fulfill their intended purpose. As we explain in our work, the proposed model learns from all the data in the entire time series that makes up the training set, without being provided information about which time slots correspond to work hours and which to rest hours, or the number of events occurring between them. This is partly because this information could introduce errors depending on the regions, and in all cases, it would condition the classification of home and work, as is done in the cited work. The revised text reads as follows:

 

“Csáji et al. [15] used k-means, a clustering algorithm, to identify locations exhibiting similar weekly calling patterns and to identify which places correspond to work or home based on the calling patterns. In Isaacman et al. \cite{places_Isaacman}, clustering was used to associate several clusters with the places most frequently visited by users, and then those clusters were ranked and defined as home or work based on the hour during which they had the most events. For each cluster, they calculated five variables using the CDR data and a score using logistic regression, with coefficients calculated using the reported locations of the 18 volunteers. The resulting model for identifying home and work concludes, as expected, that the determining variables are only two: the number of events that occur during work hours (between 1pm and 5pm) and home hours (weekends or weekdays between 7pm and 7am). Although the model uses the knowledge provided by the 18 volunteers, this only serves to demonstrate that the variables calculated from the CDR data fulfill the purpose for which they were defined.”
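
The two determining variables described in this paragraph can be sketched as simple event counts per cluster. This is not the cited logistic-regression model, only a simplified illustration of the hour windows it relies on. The cluster names and timestamps are invented, and restricting work hours to weekdays is an assumption, since the paragraph gives only the 1pm to 5pm window.

```python
from datetime import datetime


def is_home_hour(ts):
    """Weekends, or weekdays between 7pm and 7am."""
    return ts.weekday() >= 5 or ts.hour >= 19 or ts.hour < 7


def is_work_hour(ts):
    """Between 1pm and 5pm (assumption: counted on weekdays only)."""
    return ts.weekday() < 5 and 13 <= ts.hour < 17


def home_work_counts(timestamps):
    """Return (home-hour events, work-hour events) for one cluster."""
    home = sum(1 for ts in timestamps if is_home_hour(ts))
    work = sum(1 for ts in timestamps if is_work_hour(ts))
    return home, work


# Hypothetical event timestamps for two clusters of one user.
clusters = {
    "cluster_a": [
        datetime(2024, 5, 27, 22, 0),   # Monday evening
        datetime(2024, 5, 28, 6, 30),   # Tuesday early morning
        datetime(2024, 6, 1, 15, 0),    # Saturday afternoon
    ],
    "cluster_b": [
        datetime(2024, 5, 27, 14, 0),   # Monday afternoon
        datetime(2024, 5, 28, 16, 15),  # Tuesday afternoon
        datetime(2024, 5, 29, 13, 5),   # Wednesday afternoon
    ],
}

counts = {name: home_work_counts(events) for name, events in clusters.items()}
home_cluster = max(counts, key=lambda name: counts[name][0])
work_cluster = max(counts, key=lambda name: counts[name][1])

print(counts)
print("home:", home_cluster, "| work:", work_cluster)
```

Because the hour windows are fixed in advance, a rule of this kind conditions the home and work classification on assumptions about daily schedules, which is the contrast the response draws with a model that learns from the whole time series.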

 

Reviewer 3 Report

Comments and Suggestions for Authors

This topic seems to be practical but the paper has several shortcomings such as:

1. The Introduction section is fairly well organized and presented, but the paper needs a revision of the related work section.

2. To make the contribution of the paper clearer, the authors should summarize the characteristics of their method in the Introduction section.

3. In the literature, some relevant references have investigated the applications of AI techniques like machine learning and fuzzy logic. I suggest you briefly discuss this. You can use the following articles:

• A fuzzy clustering technique for enhancing the convergence performance by using improved Fuzzy c-means and Particle Swarm Optimization algorithms.

• Extension of FCM by introducing new distance metric

4. A Notations section could be provided for better readability.

5. It is not clear to me which formulas were devised by the authors themselves and which ones are derived from other references.

6. The pseudocode of Algorithm 1 is very short. Write the steps of the algorithm clearly.

7. A flow chart of the proposed algorithm must be given.

8. A couple more practical cases must be added to the comparison. More discussion of the obtained outcomes must be added to the final sections.

9. The references are not formatted uniformly. Make the reference style uniform.

10. Can you address in more detail in the discussion part why this work is better than the others?

11. The authors should present the results and discussion more clearly.

12. It would be better to analyze the worst-case time complexity of the proposed method.

13. The diagrams are a bit off and could be improved.

14. There are many grammatical and typo errors throughout the document that should be corrected before this manuscript is published.

Comments on the Quality of English Language

There are many grammatical and typo errors throughout the document that should be corrected before this manuscript is published.

Author Response

This topic seems to be practical but the paper has several shortcomings such as:
Comment 1: The Introduction section is fairly well organized and presented, but the paper needs a revision of the related work section.
Answer:
We have expanded and clarified the review of related work. The following paragraph shows the modifications introduced:

“Clustering techniques [21] have been widely used in the literature for place identification ([19], [22], [23]). Cluster analysis, also known as clustering, is the unsupervised process of partitioning a set of unlabeled data points into distinct and mutually exclusive groups (clusters) based on their inherent similarities. This process aims to maximize the similarity of data points within each cluster while minimizing the similarity between data points belonging to different clusters. Csáji et al. [22] used k-means, a clustering algorithm, to identify locations exhibiting similar weekly calling patterns and to identify which places correspond to work or home based on the calling patterns. In Isaacman et al. [15], clustering was used to associate several clusters with the places most frequently visited by users, and then those clusters were ranked and defined as home or work based on the hour during which they had the most events. For each cluster, they calculated five variables using the CDR data and a score using logistic regression, with coefficients calculated using the reported locations of the 18 volunteers. The resulting model for identifying home and work concludes, as expected, that the determining variables are only two: the number of events that occur during work hours (between 1pm and 5pm) and home hours (weekends or weekdays between 7pm and 7am). Although the model uses the knowledge provided by the 18 volunteers, this only serves to demonstrate that the variables calculated from the CDR data fulfill the purpose for which they were defined. In [9], data traces are also grouped with clustering, and time-based clustering is used to categorize the identified places.”

Comment 2: To make the contribution of the paper clearer, the authors should summarize the characteristics of their method in the Introduction section.
Answer:
We have added the following summary to the introduction:


Overall, the contributions of our work are as follows.

• We propose to address the problem of meaningful place identification from a supervised perspective to improve the results obtained with unsupervised methods, such as clustering techniques, which are traditionally used when dealing with large amounts of unlabeled data like those provided by CDRs.
• To achieve this, we propose selecting a representative portion of the available data and correctly labeling it to use as a knowledge base for training a supervised model or classifier. This will enable us to accurately classify the entirety of the data for which we do not know the labels.
• We use mobility CDR data from the city of Milan to identify home and work places. To achieve this, we correctly label the data corresponding to a 20x20 subgrid and a series of random cells into which we divided the city of Milan.
• We compared the results obtained using k-means and k-medoids algorithms as unsupervised classifiers and SAMPLID-kNN using the kNN classifier as a supervised classifier.
• Finally, we not only assess the effectiveness of the proposed method compared to the alternatives considered, but we also draw additional conclusions such as: what information contained in the CDRs is most useful for identifying home and work places; whether better results are obtained using the knowledge provided by the 20x20 subgrid, the selected random cells, or their combination; the advantages and limitations of the proposed method; and the direction for future research.
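
The workflow summarized in these contributions (label a small representative portion of the cells, then classify the rest with kNN) can be sketched as follows. This is a minimal illustration and not the SAMPLID-kNN implementation: the four-slot activity profiles, the labels, the Euclidean distance, and k = 3 are all assumptions made for the example.

```python
from collections import Counter
from math import dist


def knn_predict(labeled, query, k=3):
    """Classify one feature vector by majority vote among its k nearest
    labeled neighbours (Euclidean distance)."""
    neighbours = sorted(labeled, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled cells: normalised activity in four time slots
# (night, morning, afternoon, evening) plus a home/work label.
labeled_cells = [
    ((0.9, 0.3, 0.2, 0.8), "home"),
    ((0.8, 0.4, 0.3, 0.9), "home"),
    ((0.7, 0.2, 0.2, 0.7), "home"),
    ((0.1, 0.8, 0.9, 0.2), "work"),
    ((0.2, 0.9, 0.8, 0.3), "work"),
    ((0.1, 0.7, 0.9, 0.1), "work"),
]

# Hypothetical cells for which no label is known.
unlabeled_cells = {
    "cell_x": (0.85, 0.30, 0.25, 0.80),
    "cell_y": (0.15, 0.85, 0.85, 0.20),
}

predictions = {
    name: knn_predict(labeled_cells, profile)
    for name, profile in unlabeled_cells.items()
}
print(predictions)
```

Unlike a clustering step, the classifier is told what "home" and "work" look like through the labeled examples, so no hour windows or cluster-to-class mapping need to be fixed by hand.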

Comment 3: In the literature, some relevant references have investigated the applications of AI techniques like machine learning and fuzzy logic. I suggest you briefly discuss this. You can use the following articles:
• A fuzzy clustering technique for enhancing the convergence performance by using improved Fuzzy c-means and Particle Swarm Optimization algorithms.
• Extension of FCM by introducing new distance metric
Answer:
We have added a paragraph acknowledging these newer methods; however, the focus of our paper is to study whether a supervised approach works better in this field of study.
“Recently, more advanced and refined clustering techniques have been introduced, such as the method described in [24], which combines Particle Swarm Optimization (PSO) with Fuzzy c-means (FCM) clustering. This approach aims to overcome the limitations of traditional FCM, including its sensitivity to noise and reliance on initial centroid selection.”

[24] Kumar, N.; Kumar, H. A fuzzy clustering technique for enhancing the convergence performance by using improved Fuzzy c-means and Particle Swarm Optimization algorithms. Data and Knowledge Engineering 2022, 140, 102050. https://doi.org/10.1016/j.datak.2022.102050.

Comment 4: A Notations section could be provided for better readability.
Answer:
We have reviewed the notation and created a dedicated section to improve readability.

Comment 5: It is not clear to me which formulas were devised by the authors themselves and which ones are derived from other references.
Answer:
All the formulas appearing in the paper related to data processing are based on prior work, and we have included references to the sources from which they are derived.
Reference 25: “Bishop, C.M. Pattern Recognition and Machine Learning; Springer, 2006.” has been added to cite the source of the supervised classification problem definition.

Comment 6: The pseudocode of Algorithm 1 is very short. Write the steps of the algorithm clearly.
Answer:
We have attempted to clarify the algorithm without losing its formal foundation. We hope we have achieved this.

Comment 7: A flow chart of the proposed algorithm must be given.
Answer:
A flow chart has been added to make the algorithm easier to understand.

Comment 8: A couple more practical cases must be added to the comparison. More discussion of the obtained outcomes must be added to the final sections.
Answer:
As we mentioned in the paper: “Access to communication data is often restricted, typically requiring research teams to enter into non-disclosure agreements and research contracts with private companies. This limited availability of open datasets presents a significant challenge for researchers seeking to conduct studies in this field.” That is why most of the works in this field are limited to studying a single data source.

Unfortunately, like many of the cited works, we only have a single data source available to evaluate our proposal. In our case, we use the CDRs provided by Telecom Italia in association with several universities and foundations for the city of Milan. Nevertheless, we believe the results clearly demonstrate the effectiveness of our proposal.

Comment 9: The references are not formatted uniformly. Make the reference style uniform.
Answer:
We have revised the BibTeX entries and fixed several references, including some proceedings entries.

Comment 10: Can you address in more detail in the discussion part why this work is better than the others?
Answer:
In this regard, and at the request of other reviewers, we have modified the abstract and added the following:

“For all types of CDRs, the best results are obtained with the 20x20 subgrid, indicating that the model performs better when supplied with information from neighboring cells with a close spatial relationship, establishing neighborhood relationships that allow the model to clearly learn to identify transitions between cells of different types. Considering that it is common for a place or cell to be labeled in multiple categories at once, this supervised approach opens the door to addressing the identification of meaningful places from a multilabel perspective, which is difficult to achieve using classical unsupervised methods.”

Although the analysis of results and conclusions highlight the benefits of our proposal, we have added the following paragraph in the conclusions to emphasize its potential.

“As a final conclusion, we can affirm that the results obtained in this work using the supervised approach proposed by SAMPLID are sufficiently good to consider its application in any population analysis traditionally addressed using unsupervised clustering techniques, especially when dealing with large amounts of unlabeled data, such as CDRs. In this study, we have demonstrated that labeling a small representative portion of the data can lead to substantial improvements by enabling the application of supervised learning techniques.”

Comment 11: The authors should present the results and discussion more clearly.
Answer:
In this regard, we have made several modifications that we hope will clarify not only the results and conclusions but the entire paper.

Comment 12: It would be better to analyze the worst-case time complexity of the proposed method.
Answer:
We have indicated the complexity in section 2 of the paper.
“The time complexity of the algorithm in the prediction phase is O(n · m + n log n).”
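
One plausible reading of this bound, offered here as an assumption because the response does not define the symbols, takes n as the number of labeled training samples and m as the number of features per sample, and counts the cost of classifying a single query with a kNN classifier that sorts all distances:

```latex
\begin{align*}
T_{\text{predict}}(n, m)
  &= \underbrace{O(n \cdot m)}_{\text{distance from the query to each of the } n \text{ samples}}
   + \underbrace{O(n \log n)}_{\text{sorting the } n \text{ distances}} \\
  &= O(n \cdot m + n \log n)
\end{align*}
```

Under this reading, the first term dominates when m exceeds log n, and the sorting term dominates otherwise.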

Comment 13: The diagrams are a bit off and could be improved.
Answer:
We have improved the visualization of the figures in the document and made sure that all of them are properly aligned.

Comment 14: There are many grammatical and typo errors throughout the document that should be corrected before this manuscript is published.
Answer:
The manuscript has been revised for grammatical and typographical errors, and we hope all of them are now fixed.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Based on my review of the revised paper, I recommend accepting it, with no further changes needed
