#### **1. Introduction**

Personas are constructs that represent user archetypes and have been used extensively in various stages of human-computer interaction design. Personas have traits that subsets of potential users exhibit [1]. The traits can be mapped to user requirements, usually not in a one-to-one manner but rather as abstracted characteristics from which the user requirements can be derived or fine-tuned. Personas may be useful in various stages of a design. For the requirements or needfinding phase, designers use them to formulate the requirements, and explain them to the design team by applying them to the personas. For the design phase, prototypes may be tested against the personas. For the evaluation, the product is applied to the personas in order to identify whether, and to what extent, the expected goals and needs are met. For the deployment stage, personas may be applied accordingly, depending on the actual persona and product, from market share penetration evaluation to research objectives fulfilment to ethical and inclusiveness verification [2].

Personas can be integrated into use case models, resulting in revised or adapted use case models that describe more compact sets of use cases [3]. They may also be created and utilized as part of stories and scenarios [4] for enriching scenario or case-driven interaction design processes. On the other hand, other studies recommend separating scenarios and persona descriptions [5].

Designing personas well is important, since only a small number of personas can be created if they are to remain useful. Therefore, personas have to be concise enough to accurately express the design requirements, as well as conceptually abstract enough to provide the required coverage of the user types and requirements [6]. One of the major shortcomings of personas is that they can hardly account for change, especially fast change. Even the most well-constructed personas may become partially obsolete or inaccurate after a period of time, resulting in the need for additional effort, time and expense in order to repair inconsistencies and lost credibility [7].

Salminen et al. conducted an extensive study to evaluate how persona creation and utilization are affected by statistical online analytics using big data [8]. The critical points that were reported are the main challenges that this paper addresses and are as follows:


There are several approaches to data-driven persona construction, systems and methodologies for quantitatively generating personas using large amounts of online social media data [9]. However, data-generated personas were found to suffer from similar shortcomings to the traditional manually constructed personas. Coherency and consistency are inherent problems and a challenge for persona designers when constructing and utilizing the elements of information in a meaningful and usable form [10].

Automatically generated personas address challenges 1 and 2 by utilizing web data to automatically create a number of personas, requiring minimal human involvement. However, they cannot address trustworthiness and inaccuracy over time. Designer-generated personas are costly and take time to create. Additionally, they may be biased by their creators. However, they generally address challenges 3 and 4 better than automatically created personas, since designers may pick sources and specific data that seem trustworthy, as well as select representative information that they deem to be as futureproof as possible. From the above information, it would be interesting to test an approach that may address all challenges to a certain degree.

This work examines how traditional persona design may be assisted by persona metadata derived from fast-changing big data. Building on the identified shortcomings of the manual and the data-generated persona construction and their individual advantages, this paper proposes a hybrid approach that is simple enough to apply, yet contextual and analytical so as to provide useful insights.

The structure of the paper is outlined as follows: Section 2 presents the related work. Section 3 presents the motivation and rationale behind this work. Section 4 details the experimental setup and method. Section 5 presents the results of the user study experiments in persona construction, while Section 6 presents the evaluation of the cultural event persona construction. Section 7 discusses the paper's results and outlines future work.

#### **2. Related Work**

Personas are not constant. One of the major points of critique in the use of personas is that they become irrelevant or non-applicable very soon. Therefore, the designer needs to account for variations and update the personas. The change can be significant, even for six-month or yearly periods. An advantage of data-driven persona construction is that continuous time-stamped data may be used to account for persona variation over time [11]. The challenge for data-driven persona construction is to monitor and identify how change happens over time: the veracity, velocity and volume. Findings show that topical interests, as reflected by personas constructed using data from online sources, change by an average of over 20%, while only a third of the personas in those cases experience topical consistency [12]. This shows the necessity for a constant update of the personas in order to reflect the changes in topical interests. The frequency of the employed routine data analysis to achieve the updates reflects upon the design lifecycle [13].

Automatic Persona Generation is the implementation of a methodology for quantitatively generating data-driven personas from online social media data [14]. Personas may be generated automatically, in real time, using very rich timestamped social media data, such as those from YouTube, eliminating most of the labor associated with persona construction [15]. On the other hand, personas that are built from user data may be incomplete or incomprehensible, and therefore unusable as they are, requiring the designer to step in and fill in the blanks [16]. Therefore, the interpretation of the persona characteristics is designer-dependent and sometimes designer-biased.

Tapping into social web information is a challenging task, mainly due to reasons that are associated with the processing and analysis of social web data [17,18]. On the other hand, the personas themselves are created for various tasks; there are personas that are required for marketing, for social research and for educational designs, amongst others. There are personas that can cover all possible cases (elastic) or personas that are only useful for narrow or very specific cases. Christoforakos et al. examined marketing stakeholder personas for prototyping [19], while Schoch et al. created personas to understand social barriers and used them for prototyping a web app [20]. Ozkan et al. showed the importance of how designers or product owners, in their case the university faculty, regard the disconnection between them and users, in their case the students, using personas as a design technique for revamping a university school curriculum [21].

Personas may be influenced by their designers or by researchers who are making assertions about the expectations of other users. The designer team, as well as the example data and their size, form a dynamic mix that unknowingly assigns bias to the personas [22,23]. Salminen et al. examined data-generated personas under the assumption that bias may be affected by the age and gender of the persona as well as by the number of generated personas [24]. The study found that a small number of personas increased the bias, which would be a valid hypothesis since the bias-inducing parameters would be exaggerated in a small set of personas. In their study, female personas were found to be underrepresented for small persona sets. Therefore, algorithmic bias is present and a manual validation by experts is necessary.

The vast amount of data may lead to personas that either summarize user requirements or contain overly precise information, so that they have an insignificant impact [25]. Demographics are a textbook example of broad data that require proper taming so that they neither result in an unnecessarily large number of personas nor are spread thin and consumed by other persona attributes. A study using YouTube data from videos utilizing the full demographic classification showed that 2772 demographic-based personas would be generated using the existing demographic groupings for gender, age and origin [26]. An et al. used aggregated data to define customer behavior segments and created personas based on the demographics from those segments [27].
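The combinatorial growth behind such a figure can be illustrated with a short sketch. The source reports only the total of 2772; the breakdown below (2 gender groups, 7 age brackets, 198 origins) and the bracket labels are assumptions, chosen for illustration because their product matches the reported total.

```python
from itertools import product

# Assumed grouping sizes (not stated in the source): one factorisation
# whose product equals the reported total of 2772.
genders = ["female", "male"]
age_brackets = ["13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
origins = [f"origin_{i}" for i in range(198)]  # placeholder origin labels

# One candidate persona per combination of the three demographic attributes.
candidate_personas = list(product(genders, age_brackets, origins))
persona_count = len(candidate_personas)

print(persona_count)
```

Every additional attribute multiplies the count again, which is why demographic groupings usually have to be merged or pruned before persona generation.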

Co-creating personas benefits users by engaging them in accessible design, and it achieves a broader inclusion of demographics in the co-created personas [28]. Extending the use of personas outside user archetype modelling, personas can be used for roleplay simulation with real users for collaborative design [29]. In a recent study, interaction design across cultures could be aided using child-generated personas [30]. The study found that children could be more expressive, providing details based on enthusiasm, which in turn provided behavioral and activity-based thematic scenarios.

Depending on the number of generated personas, unsupervised learning methods, such as clustering or topic modelling, may be used to cluster the personas based on their attributes, thereby providing a means to go from raw data to understandable semantics. The actual attributes and the way they are presented to designers affect their perception of the personas. Salminen et al. determined that using actual numbers to describe attributes had a positive effect on the perception of the persona usefulness by users such as analysts but a significantly negative effect on the perception of the persona completeness by both analysts and market experts [31].

Transparency in data-driven generated personas is achieved by providing the sources of the information on the personas. Transparency affects credibility (decrease), completeness (increase) and clarity (increase). The persona gender also affects the perceived completeness of the persona by the user, but this was evident only for female personas [32].

Incomplete personas that may not contain certain types of information constitute an attempt by researchers to eliminate factors that induce bias and uncertainty. For example, "thin" personas that do not contain personalization samples such as a name and picture but retain demographic (gender, origin, age) and behavioral attributes are used for automated categorization methods, such as clustering to reduce the numbers of generated personas [33]. That way, persona sets are described by their clustered core objective information and avoid the causes of subjectiveness that the personal attributes would induce.

Another way to fine-tune or reduce persona sets is by traditional large scale online surveys and a quantitative analysis of the questionnaire information so as to revise the persona sets or even create additional personas that were not generated by the data-driven methods [34]. Xu and Lee identified persona types for online shopping communities using large scale surveys [35]. They analyzed the data in terms of social connections and characteristics, such as reading and posting behavior, which led, via clustering, to a limited number of personas as categories of users. Those "very thin" personas were described by their main representative social behavior characteristic and an accompanying descriptive sentence. An additional aspect that designers can keep in check is perceived likability. Studies show that, similar to designer bias, users and designers are affected by visual properties. To prevent this effect, pictures (stock, generated or otherwise, e.g., sketched) may be omitted so that the acceptance of the persona by the users will not be affected by the likability of the persona picture [36].

#### **3. Motivation**

Data-driven persona generation may utilize multiple analysis techniques as well as traditional methods for accurate persona construction. Kim et al. used a trend analysis as well as face-to-face interviews and online surveys to extract cybersecurity-attributed user characteristics [37]. They used this hybrid technique to compare the data from the three sources (trend analysis, face-to-face interviews and online surveys) in order to formulate the personas. Such a post-analysis was quite difficult, since the data collection from the sources was performed in parallel and the data were not cross-fed during or after the collection process. Therefore, the datasets had different granularity and coarseness values, as well as no automatic connections, which the users were then tasked to understand and correlate.

All the aforementioned identified issues related to the collection of data, analysis of the persona attributes, construction and use of personas result in problems that end users, designers, marketeers and researchers ultimately face [38,39]. Matthews et al. identified the main issues with personas with regard to users, finding them misleading and distracting as well as abstract and impersonal [40]. The authors argue that perhaps a more prudent approach to persona formulation would be to avoid persona attributes that mislead and distract the users. Furthermore, they deem this aspect as being more important than striving to create engaging personas.

From the above information, it is argued that automatically data-generated personas cannot fully replace the designers delving into the data and the insights and intuitions that they gain for the design requirements. Several studies also identify specific persona shortcomings that trigger mistrust, causing the designers to refrain from adopting them fully for their design approaches. Achieving a balance between data-generated personas and human intuition is the motivation of this work.

The hypothesis is that, based on the related literature, data-assisted persona construction may yield more accurate personas. The approach of this work is that, instead of collecting human knowledge from questionnaires and interviews and combining or fusing the knowledge with the data-generated personas, the designers can be assisted on a higher level with data-processed information related to persona construction. The information is stripped of any data or aspects that affect human decisions on a sentimental or likeness level, thereby shielding the designers from knowingly or unknowingly induced bias. This way, the design process is supported by the data analysis, while at the same time allowing the designers to utilize higher-level data knowledge in their traditional persona construction approach.

In the following subsections, we present the experimental approach to big data assisted persona construction, examining the effectiveness of an elaborate data analysis for the created personas, and comparing it with the traditional and frequently used data collection and analysis by human designers.

#### **4. Experimental Setup and Method**

In the following paragraphs, we elaborate on (i) the experimental design, (ii) the data-driven persona metadata-assisted designer user study and (iii) the evaluation of the persona designs using standard metrics.

#### *4.1. Data*

The data were collected from Twitter for the well-known live music event @rockamring from January to February 2020 (inclusive). The collection crawled the most frequent relevant hashtags, such as #rar2020 and #rockamring. The former was used for the topic modelling, and both were used for the pictures, videos and links. We used a pipelined process to clean up and validate the data [41]. Out of the posts collected, 1811 were used for topic modelling using the Latent Dirichlet Allocation (LDA) approach [42,43].
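The source does not state which LDA implementation or hyperparameters were used. As a rough illustration of the technique only, the following is a minimal sketch of collapsed Gibbs sampling, a common inference scheme for LDA. The function name `lda_gibbs`, the hyperparameter defaults and the toy tokenised posts are all invented for this example and are not the authors' pipeline or data.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: per-document topic counts, per-topic word counts, topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from all counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy, invented posts (already tokenised and cleaned).
toy_posts = [
    ["lineup", "band", "stage", "band"],
    ["ticket", "price", "camping", "ticket"],
    ["band", "stage", "lineup", "headliner"],
    ["camping", "ticket", "price", "weather"],
]
doc_topic, topic_word = lda_gibbs(toy_posts, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, _ in counts.most_common(3)])
```

In practice, an established library implementation would normally be preferred over a hand-written sampler; the sketch is only meant to show what the topic assignment step does with the cleaned posts.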

The topics were modelled as interesting based on LDA clusters, the tie strength of the context words from a sentiment analysis, and quantitative social sharing information from their associated Twitter posts [44]. User information, including gender, demographics and name/photo were excluded to avoid biasing the designers for or against specific information. Figure 1 shows the topics of interest as extracted for the aforementioned period.
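The exclusion of user information can be sketched as a simple filtering step. The record layout and field names below are hypothetical, since the source states only which kinds of information were withheld from the designers, not how the data were stored.

```python
# Hypothetical field names for the kinds of user information the study excluded.
USER_INFO_FIELDS = {"name", "photo", "gender", "age", "origin"}


def strip_user_info(record):
    """Return a copy of a post record without user-identifying fields."""
    return {key: value for key, value in record.items()
            if key not in USER_INFO_FIELDS}


# Invented example record, for illustration only.
post = {
    "text": "Can't wait for the lineup announcement!",
    "hashtags": ["#rar2020"],
    "shares": 12,
    "name": "example_user",
    "gender": "unknown",
    "photo": "avatar.png",
}
thin_post = strip_user_info(post)
print(sorted(thin_post))
```

Returning a new dictionary rather than mutating the input keeps the original records available for the analysis steps that still need them.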

**Figure 1.** Topics presented to the designer participants. The main topics are on the left and related keywords are on the right. The keyword colors represent the identified sentiment (green: positive, yellow: neutral, red: negative, blue: no sentiment).

#### *4.2. Participants*

The user study and evaluation participants were recruited through the University forum and social networks. Thirty English speaking participants were selected, 57% male and 43% female. The average age of the participants was 22 years. All were undergraduate and graduate students, while 70% of them reported having taken an HCI-related course and all had previously participated in human studies. All reported familiarity with the use of social networks for obtaining information. All participants attended an informal lecture on personas and user design. Examples of personas and their use were provided during that session. After this familiarization, the study specifics were explained to them and they were given the tasks (Table 1).


**Table 1.** Persona construction and evaluation task breakdown.

The participants were randomly placed into two groups of 15 people each. The task was to construct thin personas, so photos, biographical information, personal status, quotes, work and background text were optional. The reason for this was to eliminate or minimize the potential bias for the persona peer evaluation. The participants were given a minimum of one hour and a maximum of two hours to construct personas for the selected music event. They were told to use an online translation app for the non-English content, knowing that the specific event had a large amount of German language content. The study facilitators recorded details of interest on paper during the sessions.
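The random split into two equal groups can be sketched as follows. The source does not describe the randomisation procedure, so the shuffle-and-halve approach, the function name `assign_groups` and the seed value are assumptions made for illustration.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Shuffle the participants and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


# Thirty participants, identified here by placeholder numbers 1..30.
groups = assign_groups(range(1, 31), seed=2020)
print(len(groups["A"]), len(groups["B"]))
```

Fixing the seed makes the assignment reproducible, which is convenient when the allocation has to be documented alongside the study results.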

#### **5. Persona Construction**

All participants from both groups created personas in the allotted timeframe. Since the participant experience in creating personas varied, the time spent to finish the tasks could not reflect on the data usefulness for either group. A total of 159 personas were constructed by all participants. Table 2 shows the breakdown of the number of personas created per group and participant gender.


**Table 2.** Number of personas constructed per participant group (task) and gender.

Group A participants constructed 66 personas in total, while Group B participants constructed 93, which was 41% more. It is also evident that the female and male participants of Group A constructed about the same average number of personas, while in Group B, male participants constructed more than one additional persona than their female group partners did.
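The reported totals can be checked with a few lines of arithmetic. The per-participant averages below are derived from the stated group size of 15 and are not figures quoted in the source.

```python
# Persona counts per group as reported in the text.
personas = {"A": 66, "B": 93}
group_size = 15  # participants per group

total = sum(personas.values())
relative_increase = (personas["B"] - personas["A"]) / personas["A"]
mean_per_participant = {g: n / group_size for g, n in personas.items()}

print(total)
print(f"{relative_increase:.1%}")
print(mean_per_participant)
```

The total matches the 159 personas reported overall, and the relative increase of Group B over Group A rounds to the stated 41%.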

The task of the Group B participants was more demanding, since they were not provided with the topic analysis information and had to explore the data on their own. In order to do so, they utilized several online Twitter analytics tools, such as the Tweet Sentiment Visualizer (https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/), which enables insight into sentiment, topics, timelines, connections and maps (Figure 2), and the Floom (https://floom.app/) app, which enables a quick look into the keywords mentioned in Twitter streams (Figure 3).

**Figure 2.** Looking into the data for persona construction: sentiment (**top**) and topics (**bottom**). The users select and view the sentiment for the entities of interest and the cluster information for a deeper view of how topics are semantically identified.

**Figure 3.** Standard keyword cloud for Twitter streams. The designer may use the keywords to search social media and the web for user comments that they can utilize for the persona construction.

The Group B participants also used Twitter and Facebook as a source. They utilized the information that was automatically generated to quickly access and verify the content and select representative posts and threads for useful information.
