Employing Source Code Quality Analytics for Enriching Code Snippets Data
Abstract
1. Introduction
2. Materials and Methods
2.1. Metrics and Violations Analyzer
2.2. Readability Analyzer
2.3. Source Code Parser
2.4. Abstract Syntax Tree Analyzer
Algorithm 1. pq-Grams algorithm pseudocode for building the pq-Extended tree.
procedure pqExtendedTreeConstruction()
    Add p-1 ancestors to the root of the tree
    For each non-leaf node in tree:
        Add q-1 children before the first child
        Add q-1 children after the last child
    For each leaf node in tree:
        Insert q children
    Return the pq-Extended tree
end procedure
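The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used to build the dataset: the `Node` class, the `pq_extend` function and the `*` label for the added dummy nodes are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List

DUMMY = "*"  # assumed label for the dummy nodes added by the extension


@dataclass
class Node:
    """Hypothetical ordered, labeled tree node (for example, an AST node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def pq_extend(root: Node, p: int, q: int) -> Node:
    """Return a pq-extended copy of the tree rooted at `root`."""

    def extend(node: Node) -> Node:
        if not node.children:
            # Leaf node: insert q dummy children.
            kids = [Node(DUMMY) for _ in range(q)]
        else:
            # Non-leaf node: q-1 dummies before the first child
            # and q-1 dummies after the last child.
            before = [Node(DUMMY) for _ in range(q - 1)]
            after = [Node(DUMMY) for _ in range(q - 1)]
            kids = before + [extend(c) for c in node.children] + after
        return Node(node.label, kids)

    extended = extend(root)
    # Add p-1 dummy ancestors above the original root.
    for _ in range(p - 1):
        extended = Node(DUMMY, [extended])
    return extended


# Usage: extend a three-node tree with p = 2 and q = 3.
example = pq_extend(Node("a", [Node("b"), Node("c")]), p=2, q=3)
```

The function copies the tree instead of modifying it in place, so the original AST remains available for other analyses.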
Algorithm 2. pq-Grams algorithm pseudocode for computing pq-Grams profile.
procedure computePqGramsProfile()
    List pqGramsProfile = empty list
    For each node in tree:
        If node has p-1 ancestors and q children:
            Extract pq-Gram pattern consisting of the node, p-1 ancestors, and q children
            Add the extracted pq-Gram pattern to pqGramsProfile
    Return pqGramsProfile
end procedure
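A possible Python rendering of Algorithm 2 is given below. It is a sketch under stated assumptions: trees are represented as hypothetical `(label, children)` tuples, dummy nodes are labeled `*`, and the dummy padding is generated on the fly instead of materializing the pq-Extended tree first. The sliding window over q consecutive children follows the usual definition of pq-grams in the literature and goes slightly beyond what the pseudocode spells out.

```python
from collections import Counter


def pq_gram_profile(tree, p=2, q=3, dummy="*"):
    """Return the bag (multiset) of pq-grams of a (label, children) tree."""
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the node itself preceded by its p-1 nearest ancestors.
        stem = (ancestors + (label,))[-p:]
        if not children:
            # Leaf node: its base consists of q dummy children.
            profile[stem + (dummy,) * q] += 1
            return
        # Non-leaf node: pad the child labels with q-1 dummies on each side
        # and slide a window of q consecutive children over them.
        labels = [child[0] for child in children]
        padded = [dummy] * (q - 1) + labels + [dummy] * (q - 1)
        for i in range(len(padded) - q + 1):
            profile[stem + tuple(padded[i:i + q])] += 1
        for child in children:
            visit(child, stem)

    # The root starts with p-1 dummy ancestors.
    visit(tree, (dummy,) * (p - 1))
    return profile


# Usage: profile of a three-node tree with the default p = 2 and q = 3.
example_profile = pq_gram_profile(("a", [("b", []), ("c", [])]))
```

A `Counter` is used because the profile is a bag: the same pq-gram may occur several times in one tree, and the multiplicities matter when two profiles are compared.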
2.5. Tree Distance Extractor
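As an illustration of how a distance between two trees can be derived from their pq-Grams profiles, the sketch below uses the normalized pq-gram distance known from the literature: one minus twice the size of the bag intersection of the two profiles, divided by the sum of their sizes. The function name and the toy profiles are assumptions made for this example, and whether the dataset uses exactly this normalization is not confirmed here.

```python
from collections import Counter


def pq_gram_distance(profile_a: Counter, profile_b: Counter) -> float:
    """Normalized pq-gram distance between two profiles, in [0, 1]."""
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        # Two empty profiles are treated as identical (an assumption).
        return 0.0
    # Counter "&" keeps the minimum count per key, i.e. the bag intersection.
    shared = sum((profile_a & profile_b).values())
    return 1.0 - 2.0 * shared / total


# Usage with two toy profiles (hypothetical pq-grams).
a = Counter({("*", "m", "if"): 2, ("*", "m", "return"): 1})
b = Counter({("*", "m", "if"): 1, ("*", "m", "for"): 1})
distance = pq_gram_distance(a, b)
```

A distance of 0 means the two profiles are identical, while 1 means they share no pq-gram at all.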
2.6. Hierarchical Clusterer
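To make the clustering step concrete, the following sketch groups items bottom-up from a precomputed pairwise distance matrix. It is only an illustration: average linkage, the distance threshold used as the stopping rule, and the function name are all assumptions, not the configuration used to produce the clusters of the dataset.

```python
from typing import List


def agglomerative_clusters(dist: List[List[float]], threshold: float) -> List[List[int]]:
    """Merge the closest clusters until none are within `threshold`."""
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Average linkage: mean pairwise distance between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # the closest pair is already too far apart
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Usage: four items, where 0 and 1 are close and 2 and 3 are close.
matrix = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
groups = agglomerative_clusters(matrix, threshold=0.5)
```

This naive version rescans all cluster pairs after every merge, which is fine for illustration but would be far too slow for hundreds of thousands of snippets.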
Listing 1. Example snippet that is in the same cluster as the snippet of Listing 2.

    public static EarthEllipsoid getType(String name) {
        if (name == null) return null;
        return hash.get(name);
    }
Listing 2. Example snippet that is in the same cluster as the snippet of Listing 1.

    public Object getUserProperty(Object key) {
        if (userMap == null) return null;
        return userMap.get(key);
    }
3. Results
- Snippets: This collection contains the code and the docstring of each snippet, information about its origin, the AST and the id of the cluster it belongs to, which can be used to group snippets into clusters (i.e., similar snippets).
- AnalysisMetrics: This collection includes the static analysis metrics calculated by the SourceMeter analysis tool.
- Violations: This collection contains the source code violations identified by the PMD tool.
- ReadabilityMetrics: This collection includes the extracted readability metrics, which are split into four categories, with respect to the research approach they refer to (Buse and Weimer—BW, Posnett, Dorn, and Scalabrino).
Listing 3. Example query that retrieves static analysis metrics and readability metrics of a cluster.

    db.getCollection("snippets").aggregate([
        { "$match": { "clusterID": 1 } },
        { "$lookup": {
            "from": "analysismetrics",
            "localField": "url",
            "foreignField": "url",
            "as": "satmetrics"
        } },
        { "$lookup": {
            "from": "readabilitymetrics",
            "localField": "url",
            "foreignField": "url",
            "as": "readmetrics"
        } }
    ])
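
To clarify what the query of Listing 3 returns, the sketch below emulates its `$match` and `$lookup` stages in plain Python over in-memory lists of documents. The field names `clusterID`, `url`, `satmetrics` and `readmetrics` come from the listing; the helper functions, the sample documents and their `LOC` and `score` fields are hypothetical and serve only as an illustration.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]


def lookup(left: List[Doc], right: List[Doc], local_field: str,
           foreign_field: str, as_field: str) -> List[Doc]:
    """Emulate $lookup: attach to each left document the matching right documents."""
    index: Dict[Any, List[Doc]] = {}
    for doc in right:
        index.setdefault(doc.get(foreign_field), []).append(doc)
    return [{**doc, as_field: index.get(doc.get(local_field), [])} for doc in left]


def cluster_metrics(snippets: List[Doc], analysis_metrics: List[Doc],
                    readability_metrics: List[Doc], cluster_id: int) -> List[Doc]:
    """Emulate the pipeline of Listing 3 for one cluster."""
    # $match: keep only the snippets of the requested cluster.
    matched = [s for s in snippets if s.get("clusterID") == cluster_id]
    # First $lookup: join the static analysis metrics on the snippet url.
    joined = lookup(matched, analysis_metrics, "url", "url", "satmetrics")
    # Second $lookup: join the readability metrics on the snippet url.
    return lookup(joined, readability_metrics, "url", "url", "readmetrics")


# Usage with hypothetical documents.
snippets = [{"url": "u1", "clusterID": 1}, {"url": "u2", "clusterID": 2}]
analysis = [{"url": "u1", "LOC": 4}]
readability = [{"url": "u1", "score": 0.8}, {"url": "u2", "score": 0.5}]
result = cluster_metrics(snippets, analysis, readability, cluster_id=1)
```

Each returned snippet carries two extra array fields, one per joined collection, which mirrors how `$lookup` stores the joined documents under the name given in `as`.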
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Crnkovic, I.; Larsson, M. Challenges of Component-Based Development. J. Syst. Softw. 2002, 61, 201–212.
- Brandt, J.; Guo, P.J.; Lewenstein, J.; Klemmer, S.R. Opportunistic Programming: How Rapid Ideation and Prototyping Occur in Practice. In Proceedings of the 4th International Workshop on End-User Software Engineering, New York, NY, USA, 10–18 May 2008; pp. 1–5.
- Nguyen, T.; Rigby, P.C.; Nguyen, A.T.; Karanfil, M.; Nguyen, T.N. T2API: Synthesizing API Code Usage Templates from English Texts with Statistical Translation. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, New York, NY, USA, 13–18 November 2016; pp. 1013–1017.
- Xu, C.; Sun, X.; Li, B.; Lu, X.; Guo, H. MULAPI: Improving API method recommendation with API usage location. J. Syst. Softw. 2018, 142, 195–205.
- Nguyen, P.T.; Di Rocco, J.; Di Ruscio, D.; Ochoa, L.; Degueule, T.; Di Penta, M. FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns. In Proceedings of the 41st International Conference on Software Engineering, IEEE Press, Montréal, QC, Canada, 25–31 May 2019; pp. 1050–1060.
- Gu, X.; Zhang, H.; Zhang, D.; Kim, S. Deep API Learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, New York, NY, USA, 13–18 November 2016; pp. 631–642.
- Cai, L.; Wang, H.; Huang, Q.; Xia, X.; Xing, Z.; Lo, D. BIKER: A Tool for Bi-Information Source Based API Method Recommendation. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA, 3–9 December 2019; pp. 1075–1079.
- Li, X.; Jiang, H.; Kamei, Y.; Chen, X. Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding. IEEE Trans. Softw. Eng. 2018, 46, 1–17.
- Chen, C.; Peng, X.; Sun, J.; Xing, Z.; Wang, X.; Zhao, Y.; Zhang, H.; Zhao, W. Generative API Usage Code Recommendation with Parameter Concretization. Sci. China Inf. Sci. 2019, 62, 192103.
- Ponzanelli, L.; Bacchelli, A.; Lanza, M. Seahawk: Stack Overflow in the IDE. In Proceedings of the 2013 International Conference on Software Engineering, Piscataway, NJ, USA, 18–26 May 2013; pp. 1295–1298.
- Campbell, B.A.; Treude, C. NLP2Code: Code Snippet Content Assist via Natural Language Tasks. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution, Los Alamitos, CA, USA, 17–24 September 2017; pp. 628–632.
- Diamantopoulos, T.; Oikonomou, N.; Symeonidis, A. Extracting Semantics from Question-Answering Services for Snippet Reuse. In Proceedings of the 23rd International Conference on Fundamental Approaches to Software Engineering, Dublin, Ireland, 25–30 April 2020; pp. 119–139.
- Gu, X.; Zhang, H.; Kim, S. Deep Code Search. In Proceedings of the 40th International Conference on Software Engineering, New York, NY, USA, 26–27 May 2018; pp. 933–944.
- Papathomas, E.; Diamantopoulos, T.; Symeonidis, A. Semantic Code Search in Software Repositories using Neural Machine Translation. In Proceedings of the 25th International Conference on Fundamental Approaches to Software Engineering, Munich, Germany, 2–7 April 2022; pp. 225–244.
- ISO/IEC 25010:2011. 2011. Available online: https://www.iso.org/obp/ui/#iso:std:iso-iec:25010:ed-1:v1:en (accessed on 28 August 2023).
- Spinellis, D. Code Quality: The Open Source Perspective; Adobe Press: San Jose, CA, USA, 2006.
- Sedano, T. Code Readability Testing, an Empirical Study. In Proceedings of the 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), 5–6 April 2016; pp. 111–117.
- Pfleeger, S.L.; Atlee, J.M. Software Engineering: Theory and Practice; Pearson Education India: Noida, India, 1998.
- Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv 2019, arXiv:1909.09436.
- Kamiya, T.; Kusumoto, S.; Inoue, K. CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE Trans. Softw. Eng. 2002, 28, 654–670.
- Jiang, L.; Misherghi, G.; Su, Z.; Glondu, S. DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones. In Proceedings of the 29th International Conference on Software Engineering, Minneapolis, MN, USA, 19–27 May 2007; pp. 96–105.
- White, M.; Tufano, M.; Vendome, C.; Poshyvanyk, D. Deep Learning Code Fragments for Code Clone Detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA, 3–7 September 2016; pp. 87–98.
- Aktas, M.S.; Kapdan, M. Structural Code Clone Detection Methodology Using Software Metrics. Int. J. Softw. Eng. Knowl. Eng. 2016, 26, 307–332.
- Terragni, V.; Liu, Y.; Cheung, S.C. CSNIPPEX: Automated Synthesis of Compilable Code Snippets from Q&A Sites. In Proceedings of the 25th International Symposium on Software Testing and Analysis, New York, NY, USA, 18–20 July 2016; pp. 118–129.
- Raghothaman, M.; Wei, Y.; Hamadi, Y. SWIM: Synthesizing What i Mean: Code Search and Idiomatic Snippet Synthesis. In Proceedings of the 38th International Conference on Software Engineering, New York, NY, USA, 18–20 May 2016; pp. 357–367.
- Haiduc, S.; Aponte, J.; Marcus, A. Supporting Program Comprehension with Source Code Summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, New York, NY, USA, 1–8 May 2010; pp. 223–226.
- Katirtzis, N.; Diamantopoulos, T.; Sutton, C. Summarizing Software API Usage Examples using Clustering Techniques. In Proceedings of the 21st International Conference on Fundamental Approaches to Software Engineering, Thessaloniki, Greece, 14–21 April 2018; pp. 189–206.
- Janjic, W.; Hummel, O.; Schumacher, M.; Atkinson, C. An Unabridged Source Code Dataset for Research in Software Reuse. In Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, 18–19 May 2013; pp. 339–342.
- Gelman, B.; Obayomi, B.; Moore, J.; Slater, D. Source code analysis dataset. Data Brief 2019, 27, 104712.
- Scalabrino, S.; Linares Vasquez, M.; Oliveto, R.; Poshyvanyk, D. A Comprehensive Model for Code Readability. J. Softw. Evol. Process 2018, 30, e1958.
- Buse, R.P.L.; Weimer, W.R. Learning a Metric for Code Readability. IEEE Trans. Softw. Eng. 2010, 36, 546–558.
- Posnett, D.; Hindle, A.; Devanbu, P. A Simpler Model of Software Readability. In Proceedings of the 8th Working Conference on Mining Software Repositories, New York, NY, USA, 21–22 May 2011; pp. 73–82.
- Dorn, J. A General Software Readability Model. Master’s Thesis, The University of Virginia, Charlottesville, VA, USA, 2012.
- Parr, T.J.; Quong, R.W. ANTLR: A Predicated-LL(k) Parser Generator. Softw. Pract. Exper. 1995, 25, 789–810.
- Donnelly, C.; Stallman, R. Bison: The Yacc-Compatible Parser Generator; Free Software Foundation: Boston, MA, USA, 2015.
- Tai, K.C. The Tree-to-Tree Correction Problem. J. ACM 1979, 26, 422–433.
- Augsten, N.; Böhlen, M.; Gamper, J. The Pq-Gram Distance between Ordered Labeled Trees. ACM Trans. Database Syst. 2008, 35, 1–36.
- Diamantopoulos, T.; Symeonidis, A. Localizing Software Bugs using the Edit Distance of Call Traces. Int. J. Adv. Softw. 2014, 7, 277–288.
- Parker, Z.; Poe, S.; Vrbsky, S. Comparing NoSQL MongoDB to an SQL DB. In Proceedings of the 51st ACM Southeast Conference, Savannah, Georgia, 4–6 April 2013; pp. 1–6.
- Mi, Q.; Keung, J.; Xiao, Y.; Mensah, S.; Gao, Y. Improving code readability classification using convolutional neural networks. Inf. Softw. Technol. 2018, 104, 60–71.
- Choi, S.; Kim, S.; Kim, J.; Park, S. Metric and Tool Support for Instant Feedback of Source Code Readability. Inf. Softw. Technol. 2020, 15, 221–228.
- Karanikiotis, T.; Papamichail, M.D.; Gonidelis, I.; Karatza, D.; Symeonidis, A.L. A Data-driven Methodology towards Interpreting Readability against Software Properties. In Proceedings of the 15th International Conference on Software Technologies, Held Online, 7–9 July 2020; pp. 61–72.
- Fakhoury, S.; Roy, D.; Hassan, S.A.; Arnaoudova, V. Improving Source Code Readability: Theory and Practice. In Proceedings of the 27th International Conference on Program Comprehension, Montreal, QC, Canada, 25–26 May 2019; pp. 2–12.
- Roy, D.; Fakhoury, S.; Lee, J.; Arnaoudova, V. A Model to Detect Readability Improvements in Incremental Changes. In Proceedings of the 28th International Conference on Program Comprehension, New York, NY, USA, 13–15 July 2020; pp. 25–36.
- Papoudakis, A.; Karanikiotis, T.; Symeonidis, A. A Mechanism for Automatically Extracting Reusable and Maintainable Code Idioms from Software Repositories. In Proceedings of the 17th International Conference on Software Technologies (ICSOFT), Lisbon, Portugal, 11–13 July 2022; pp. 79–90.
- Diamantopoulos, T.; Thomopoulos, K.; Symeonidis, A.L. QualBoa: Reusability-aware Recommendations of Source Code Components. In Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories, Austin, TX, USA, 14–16 May 2016; pp. 488–491.
- Michailoudis, A.; Diamantopoulos, T.; Symeonidis, A. Towards Readability-aware Recommendations of Source Code Snippets. In Proceedings of the 18th International Conference on Software Technologies (ICSOFT), Rome, Italy, 10–12 July 2023; pp. 688–695.
- Gil, Y.; Marcovitch, O.; Orrú, M. A Nano-Pattern Language for Java. J. Comput. Lang. 2019, 54, 100905.
- Diamantopoulos, T.; Karagiannopoulos, G.; Symeonidis, A. CodeCatch: Extracting Source Code Snippets from Online Sources. In Proceedings of the IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE), Gothenburg, Sweden, 27 May–3 June 2018; pp. 21–27.
- Kuhn, A.; Ducasse, S.; Gírba, T. Semantic Clustering: Identifying Topics in Source Code. Inf. Softw. Technol. 2007, 49, 230–243.
- Sillito, J.; Maurer, F.; Nasehi, S.M.; Burns, C. What Makes a Good Code Example? A Study of Programming Q&A in StackOverflow. In Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM), Trento, Italy, 23–28 September 2012; pp. 25–34.
Metric | Value |
---|---|
Number of Documents/Snippets | 496,685 |
Number of Repositories | 500 |
Number of Clusters | 893 |
Data Size | 11.3 GB (940.3 MB compressed) |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Karanikiotis, T.; Diamantopoulos, T.; Symeonidis, A. Employing Source Code Quality Analytics for Enriching Code Snippets Data. Data 2023, 8, 140. https://doi.org/10.3390/data8090140