*Proceeding Paper* **RGen: Data Generator for Benchmarking Big Data Workloads †**

**Rubén Pérez-Jove <sup>1,</sup>\*, Roberto R. Expósito <sup>2</sup> and Juan Touriño <sup>2</sup>**


**Abstract:** This paper presents RGen, a parallel data generator for benchmarking Big Data workloads that integrates existing features and new functionalities in a standalone tool. The main functionalities developed in this work are the generation of text and graphs that meet the characteristics defined by the 4 Vs of Big Data. Text generation relies on the Latent Dirichlet Allocation (LDA) model, which extracts the topics covered in a collection of documents. Graph generation is based on the Kronecker model. An experimental evaluation on a 16-node cluster shows that RGen provides very good weak and strong scalability results. RGen is publicly available for download at https://github.com/rubenperez98/RGen (accessed on 30 September 2021).

**Keywords:** Data generator; MapReduce; HDFS; Apache Hadoop; Java; Big Data; Benchmarking
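As an illustration of the Kronecker-based graph generation mentioned in the abstract, the following minimal Java sketch samples directed edges by recursive quadrant descent, a common way to draw edges from a stochastic Kronecker graph with a 2×2 initiator matrix. It is not taken from RGen: the class name `KroneckerSketch`, the method names, and the initiator probabilities are hypothetical choices made for this example, and RGen's own parallel MapReduce implementation may differ.

```java
import java.util.Random;

/** Illustrative sketch only; not RGen's implementation. */
final class KroneckerSketch {
    // Hypothetical 2x2 initiator probabilities (assumed values).
    // The fourth entry is implicitly 1 - A - B - C.
    static final double A = 0.57, B = 0.19, C = 0.19;

    /** Samples one directed edge of a graph with 2^scale vertices. */
    static long[] sampleEdge(int scale, Random rng) {
        long src = 0, dst = 0;
        for (int level = 0; level < scale; level++) {
            // Pick one quadrant of the adjacency matrix at this level.
            double r = rng.nextDouble();
            int rowBit, colBit;
            if (r < A) {
                rowBit = 0; colBit = 0;
            } else if (r < A + B) {
                rowBit = 0; colBit = 1;
            } else if (r < A + B + C) {
                rowBit = 1; colBit = 0;
            } else {
                rowBit = 1; colBit = 1;
            }
            // Each level contributes one bit to the source and target ids.
            src = (src << 1) | rowBit;
            dst = (dst << 1) | colBit;
        }
        return new long[] { src, dst };
    }

    /** Generates numEdges edges; duplicates and self-loops are not filtered. */
    static long[][] generateEdges(int scale, int numEdges, long seed) {
        Random rng = new Random(seed);
        long[][] edges = new long[numEdges][];
        for (int i = 0; i < numEdges; i++) {
            edges[i] = sampleEdge(scale, rng);
        }
        return edges;
    }
}
```

Calling `KroneckerSketch.generateEdges(4, 100, 42L)` would return 100 source/target pairs over 16 vertices. Because each edge is sampled independently, a parallel generator could plausibly assign a disjoint share of the edges and a distinct seed to each task, which is one reason this family of models suits distributed data generation.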
