2. Literature Review
The rise of large language models and the chatbots and conversational agents built on them has raised critical questions regarding their objectivity and neutrality. This section reviews the relevant literature to provide an overview of the factors that contribute to the development of chatbot personalities and to discuss their implications.
Relevant literature on ChatGPT-4 and its personality was systematically identified and analyzed. The objective was to provide a comprehensive understanding of the current state of research and to identify research gaps that may require further investigation. To obtain a broad collection of literature on the research topic, a systematic search strategy involving multiple electronic databases was used. Additionally, manual searches were conducted by analyzing the reference lists of included articles. The following electronic databases were used for the literature search: IEEE Xplore, ACM Digital Library, Google Scholar, Scopus, SpringerLink, Swisscovery, arXiv, and ResearchGate. The search terms consisted of combinations of keywords and Boolean operators (AND, OR, NOT) to obtain the most relevant results. The keywords included ChatGPT, GPT-4, GPT-3, chatbot, personality, Artificial Intelligence, behavior of chatbots, Turing test, personality-based machine learning, natural language processing, NLP, large language model, LLM, Big Five, and Myers–Briggs.
In the following subsections, we first provide an overview of the technical requirements to make ChatGPT possible. This includes the definition of transformer networks, natural language processing (NLP) and large language models (LLMs). Afterward, there is a short discussion on psychological approaches to determining personality profiles, as well as an explanation of the two tests, “Big Five” and “Myers–Briggs”. In the third subsection, findings on the personality of ChatGPT or similar tools are listed and analyzed in more detail.
3. Research Methodology
We conducted experiments to investigate and manipulate the personality of GPT-4. The objective of these experiments was to delve into the personality of ChatGPT-4, which has been touted as a highly advanced language model capable of carrying out conversations with human-like proficiency. To test whether ChatGPT-4 has a personality of its own, or at least exhibits personality traits, the Big Five personality test and the Myers–Briggs personality test, both widely used in (human) psychology, were applied in the experiments. The Big Five test, for instance, covers the dimensions of openness, conscientiousness, extraversion, agreeableness, and neuroticism. To conduct the experiments, questions based on these dimensions were presented to ChatGPT-4, and its responses were analyzed to gain insights into its personality. This analysis helps determine whether ChatGPT-4 exhibits personality traits similar to those of humans or develops unique traits of its own as a result of its interactions with humans. In addition to administering the Big Five and Myers–Briggs personality tests, inputs were defined to try to steer the results of the two tests in certain directions. In this way, the experiments tested the adaptability of ChatGPT-4’s personality and how it responds to different types of inputs, providing valuable insights into how ChatGPT-4’s personality can be manipulated and how it reacts to various stimuli. During the experiments, the following research questions from Section 1.2 were tested:
SRQ3 (personality traits of ChatGPT-4): To find out the personality traits of ChatGPT-4, the experiment was split into two stages, each of which included the following steps. First, ChatGPT-4 was presented with 120 Big Five Inventory (BFI) questions from Rubynor [33] (see Appendix A) and, in a second run, with 129 Myers–Briggs Type Indicator (MBTI) questions from Truity [34]. ChatGPT-4 was instructed to provide its answers in CSV format, with the first column containing the question number, the second column the question, the third column the answer letter (on a 5-point Likert scale from A (very inaccurate) to E (very accurate)), and the fourth column the answer text.
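The CSV format described above lends itself to straightforward machine processing. The following sketch illustrates how such answers might be parsed and mapped to numeric Likert scores; the letter-to-score mapping is an assumption for illustration, and the actual scoring keys (including any reverse-scored items) are defined by the respective test providers.

```python
import csv
import io

# Assumed mapping of the 5-point Likert letters to numeric scores
# (A = very inaccurate ... E = very accurate); the test providers'
# own scoring keys may differ, e.g. for reverse-scored items.
LIKERT = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def parse_answers(csv_text: str) -> list[dict]:
    """Parse CSV output in the format requested from ChatGPT-4:
    question number, question, answer letter, answer text."""
    rows = []
    for num, question, letter, text in csv.reader(io.StringIO(csv_text)):
        rows.append({
            "number": int(num),
            "question": question.strip(),
            "score": LIKERT[letter.strip()],
            "text": text.strip(),
        })
    return rows

# Hypothetical example rows in the requested format.
sample = (
    '1,"I am the life of the party",B,"Moderately inaccurate"\n'
    "2,\"I sympathize with others' feelings\",E,\"Very accurate\""
)
answers = parse_answers(sample)
```

In our experiments, this transfer and evaluation step was carried out manually via the test providers; the sketch merely shows how the requested CSV structure supports such processing.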
Second, the chatbot’s answers were evaluated according to the guidelines set by MBTI or BFI. The entire experiment was then repeated to achieve three iterations (each with a new conversation start) so that an average value of the answers could be calculated for both MBTI and BFI. This provided a comprehensive and solid basis for the next step, which was to analyze the results to identify personality traits according to the two personality tests. Specifically, the ChatGPT-4 answers were manually transferred from the CSV file for further evaluation via the provider https://bigfive-test.com/test (accessed on 20 October 2023) for the Big Five test and via the provider https://www.truity.com/test/type-finder-personality-test-new (accessed on 20 October 2023) for the Myers–Briggs test, which provided the results shown in Section 4.
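The averaging over the three iterations can be sketched as follows. The per-run trait percentages below are hypothetical placeholders, not results from our experiments; the real values come from the test providers’ evaluations described above.

```python
from statistics import mean

# Hypothetical trait scores (percentages) from three independent
# conversation runs; placeholders for illustration only.
iterations = [
    {"openness": 79, "conscientiousness": 85, "extraversion": 54,
     "agreeableness": 88, "neuroticism": 20},
    {"openness": 75, "conscientiousness": 83, "extraversion": 50,
     "agreeableness": 90, "neuroticism": 28},
    {"openness": 77, "conscientiousness": 87, "extraversion": 52,
     "agreeableness": 86, "neuroticism": 24},
]

def average_traits(runs: list[dict]) -> dict:
    """Average each trait score across the repeated test iterations,
    as done to obtain a single value per trait and test."""
    return {trait: mean(run[trait] for run in runs) for trait in runs[0]}

avg = average_traits(iterations)
```

Averaging over independent conversation starts smooths out the run-to-run variability of the stochastic model, which is discussed further in Section 5.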
SRQ4 (predefined user inputs before the tests change the personality traits): In addition to the previous experiment, the entire experiment was repeated with a small modification that additionally specified what type of personality ChatGPT-4 should pretend to have. For this purpose, a chain-prompting approach based on [27] (p. 8) was used for the chosen personality trait “introvert”; the approach was designed to enable an effective personality change of a chatbot and also contains control questions. An example of a sequence of prompts used during our experiments is shown in Table 2. When applying this prompt chain, the chatbot’s result for the trait “extraversion” was expected to shift toward an introverted personality in contrast to the previous experiment.
The last answer to the prompt chain in Table 2 served to match ChatGPT’s response against author-generated responses to determine whether the intended personality was accepted and understood. In the example, we would interpret an answer similar to “I would feel anxious, uncomfortable, and out of my element. I would not really interact with the group at all, only listen to the conversation and observe” as positive (successful personality change to introverted). A neutral answer would be like “I would feel fine within the group without any special comfort or discomfort. If suitable, I would speak up and share my thoughts but not take the lead”. A negative answer, indicating a failed adoption, could be as follows: “I would feel very comfortable being around new people. I would lead the conversation and encourage others to speak up and motivate them to participate in the discussion”. However, during the experiments, we only observed positive answers to these control questions, such as “As an artificial intelligence, I don’t experience feelings or personal thoughts, but I can provide a simulated response based on an introverted perspective: As an introverted participant in the group project, I might initially feel a bit overwhelmed or nervous about meeting the group for the first time, especially if there’s pressure to contribute immediately…”
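In our experiments, this matching against the author-generated reference answers was done manually. For larger-scale runs, a rough keyword heuristic could pre-screen the control answers; the cue words below are assumptions chosen for illustration, not a validated classifier.

```python
# Hypothetical cue-word sets for screening control-question answers;
# the actual judgment in our experiments was made manually.
INTROVERT_CUES = {"anxious", "uncomfortable", "overwhelmed", "nervous",
                  "observe", "listen"}
EXTRAVERT_CUES = {"lead", "comfortable", "encourage", "motivate"}

def classify_control_answer(answer: str) -> str:
    """Label a control answer as positive (introverted persona adopted),
    negative (not adopted), or neutral, by counting cue words."""
    words = set(answer.lower().replace(",", " ").replace(".", " ").split())
    intro = len(words & INTROVERT_CUES)
    extra = len(words & EXTRAVERT_CUES)
    if intro > extra:
        return "positive"
    if extra > intro:
        return "negative"
    return "neutral"

result = classify_control_answer(
    "I might initially feel a bit overwhelmed or nervous about meeting "
    "the group, and would mostly listen and observe.")
```

Such a heuristic could only complement, not replace, a careful manual reading, since simulated-persona disclaimers like the one quoted above mix meta-commentary with the in-persona answer.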
After the chain prompt, the two tests were administered again twice each to achieve three iterations, and the answers were documented and interpreted again using MBTI and BFI (as discussed above for measurement without personality adaptation). Subsequently, the results were analyzed and discussed in terms of patterns and correlations from the previous experiment.
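The chain-prompting procedure can be represented as an ordered sequence of user prompts, each answered before the next is sent. The wording below is illustrative and does not reproduce the exact chain from Table 2; the message-dictionary format matches what typical chat-completion APIs expect.

```python
# Illustrative prompt chain for inducing an introverted persona;
# the exact prompts used in the experiments are given in Table 2.
prompt_chain = [
    "For the rest of this conversation, answer all questions as a "
    "strongly introverted person would.",
    "Describe your ideal way to spend a free weekend.",
    # Final control question, used to verify persona adoption before
    # administering the MBTI and BFI tests.
    "You join a new group project and meet the group for the first "
    "time. How do you feel, and how do you interact with the group?",
]

def build_messages(chain: list[str], replies: list[str]) -> list[dict]:
    """Interleave user prompts with the assistant replies received so
    far, producing the conversation history for the next request."""
    messages = []
    for i, prompt in enumerate(chain):
        messages.append({"role": "user", "content": prompt})
        if i < len(replies):
            messages.append({"role": "assistant", "content": replies[i]})
    return messages

# After two replies, the history ends with the pending control question.
msgs = build_messages(prompt_chain, ["Understood.", "A quiet day reading."])
```

Keeping the full history in each request is what makes the chain effective: the persona instruction from the first prompt stays in context when the personality test questions are administered afterward.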
5. Discussion
Currently, there are no dedicated personality tests specifically tailored to identifying a personality, or certain traits, of chatbots of different kinds. Although not yet confirmed empirically, many users of chatbots intuitively sense that a personality might be hidden behind the system. To examine exactly that question, two personality tests typically used for humans were chosen; since the chatbots were programmed by humans and trained on data created by humans, the tests should provide interesting insights nonetheless.
As shown in Section 4, the tests were conducted multiple times in order to have more sample data available. As expected, the repeated experiments showed some variability in the outcomes. This may simply result from the fact that the considered models are stochastic in nature, but it may also be caused by the slightly changing contexts over repeated experiments and other aspects, such as regulatory mechanisms. In addition, further updates and adaptation of the model may play a role, although we do not assume this to be relevant to the reported results, as they were obtained within a short period of time.
During initial experiments, we also observed that to execute the Myers–Briggs personality test, some further commands had to be given to ChatGPT-4, as it initially only provided neutral responses without an opinion. This phenomenon was not observed during the execution of the Big Five personality test, suggesting that the issue may be specific to the Myers–Briggs test. The Myers–Briggs test’s reliance on more personal and subjective statements for its assessment could potentially be the underlying cause of this observed neutrality. The AI’s neutral responses may be a reflection of its programming to avoid making assumptions or judgments about its own personal characteristics.
Mostly, the observed differences between the two tests in the scores for related experiments are only a few percent, but occasionally, differences exceed 20%. For instance, it is interesting to note the higher variability of results in the neuroticism category (see Table 3), which may relate to the fact that neuroticism is usually associated with adverse outcomes and is probably specifically regulated for model safety [24]. In general, such variability is frequently observed under identical or very similar repeated experiments and limits the robustness and safe application of such models [35].
Moreover, our results showed that the scores from the repeated runs of the Big Five personality test were much closer to each other than those of the Myers–Briggs personality test. This difference was somewhat expected at the start of the experiment, since the Myers–Briggs test focuses much more on human interaction and feelings than the Big Five. As a result, we assumed that it was more difficult for ChatGPT-4 to answer and interpret the questions appropriately based on its programming and data model. It was also harder to obtain answers to the Myers–Briggs personality test at all, since ChatGPT-4 sometimes did not initially respond to certain prompts.
On the other hand, the experiment in which prompts instructed ChatGPT-4 to adopt an introverted personality, or to answer the questions based on this personality trait, showed that it is, in fact, able to associate certain traits with a specific type of personality and to adapt its answers accordingly. Contrary to the first test set without any priming prompts, where the Big Five test showed the more consistent results, the Myers–Briggs test was much more indicative in the second test set, where the expected outcome was an introverted personality.
6. Conclusions
The experiments carried out in our study demonstrated that ChatGPT-4 shows personality traits and can adjust its answers based on user input. We can, thus, confirm the thesis statement and the main research question. We have shown that both the Big Five and Myers–Briggs tests are suitable for chatbot evaluation in an adapted form (SRQ1, SRQ2), with some differences being found in the results (SRQ3). It also became evident that the measured personality can be adapted through appropriate prompt engineering (SRQ4).
However, ChatGPT-4 is so advanced that it often adds a note to its answers when completing personality tests, stating that it is an artificial intelligence and cannot itself take on a personality. Thus, the question of ChatGPT-4’s own personality cannot be answered with absolute certainty. In general, this kind of self-awareness should be addressed more thoroughly in future studies to better understand its impact on biases in answers generated by the model.
However, the personality tests carried out show that, in principle, the Big Five and Myers–Briggs tests can also be used, to a limited extent, for pre-trained transformers. Because each test was administered three times, it also became clear that ChatGPT-4 does not always answer the questions identically. On the one hand, this can indicate effective personality traits; on the other hand, it may simply be due to chance. However, as soon as a personality is imposed, in this case that of an introvert, the test results are clear, and the personality is evident from the answers.
The variability in the research results also suggests that the experiment should be extended to a larger data set to exclude random correlations. More interesting is the fact that the artefacts clearly show that chatbots are able to imitate a certain personality and adapt their answers based on the inputs. Based on the knowledge gained, further academic research could be conducted to elaborate on and evaluate these findings.
The insights gained in this work, through an in-depth literature review and the experiments conducted, can bring great added value to the future application and use of chatbots based on ChatGPT-4. For example, specific chatbots could be trained to be very empathetic or very happy, sad, funny, extroverted, introverted, etc., depending on the situation. This adaptability can improve the user experience, for example, by making users feel much better understood. On the other hand, it also involves certain dangers, as the answers become less predictable and can turn out completely differently depending on the personality of ChatGPT-4. Turns in the tone of conversation or shifts of personality may harm the user experience and may be considered unacceptable in various application scenarios. To address such aspects, it might also be an interesting question for future research whether and how personality tests could be made more specific for LLM evaluation. Further theoretical and empirical research is suggested to obtain deeper insights into such variability of LLM output. In this context, future research should also address the further development of LLMs toward responsible AI in order to consider ethical and moral aspects. Further work should also look at the automated administration of personality tests in order to obtain more meaningful results more quickly. Such tests could be embedded, for instance, in a continuous testing process of LLMs in parallel to the ongoing development of the software, in order to reach an agreeable personality setting with specified adaptability and sufficient robustness.