Article

GenAI-Assisted Database Deployment for Heterogeneous Indigenous–Native Ethnographic Research Data

1 Department of Computer Science and Information Engineering, National Taitung University, Taitung 950309, Taiwan
2 Department of Information Science and Management Systems, National Taitung University, Taitung 950309, Taiwan
3 Interdisciplinary Bachelor’s Program, National Taitung University, Taitung 950309, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 7414; https://doi.org/10.3390/app14167414
Submission received: 26 June 2024 / Revised: 13 August 2024 / Accepted: 15 August 2024 / Published: 22 August 2024
(This article belongs to the Topic Innovation, Communication and Engineering)

Abstract

In ethnographic research, data collected through surveys, interviews or questionnaires in the fields of sociology and anthropology often appear in diverse forms and languages. Building a powerful database system to store and process such data, and to support good and efficient queries, is very challenging. This paper extensively investigates modern database technology to identify the best technologies for storing these varied and heterogeneous datasets. The study examines several database categories: traditional relational databases, the NoSQL family of key-value databases, graph databases, document databases, object-oriented databases and vector databases, the last of which is crucial for the latest artificial intelligence solutions. The research shows that when it comes to field data, the NoSQL family is the most appropriate, especially document and graph databases. The simplicity and flexibility of document databases, and the advanced ability of graph databases to handle complex queries and rich data relationships, make these two types of NoSQL database the ideal choice when a large amount of data has to be processed. Advancements in vector databases that embed custom metadata offer new possibilities for detailed analysis and retrieval. However, converting contents into vector data remains challenging, especially in regions with unique oral traditions and languages. Constructing such databases is labor-intensive and requires domain experts to define metadata and relationships, posing a significant burden for research teams with extensive data collections. To this end, this paper proposes using Generative AI (GenAI) to assist in the data-transformation process, a recommendation supported by testing in which GenAI proved a strong supplement to document and graph databases. The paper also discusses two currently viable methods of vector database support, each with its own drawbacks and benefits.

1. Introduction

Numerous studies in community-based research and education generate rich fieldwork and survey materials. After data collection, a database must be established to store the data, facilitate subsequent analysis and support the presentation of findings. Dohan and Sanchez-Jankowski [1] demonstrated that computer software for ethnographic data analysis offers several advantages, including enhanced data management that streamlines the handling of large volumes of information. It increases reliability by documenting analytical processes, thereby improving research credibility. Integrating computational aid also enhances researchers’ analytical capabilities, resulting in more comprehensive and insightful findings in ethnographic studies.
Since 2019, the Center for Innovation in Humanities and Social Practices at National Taitung University has assembled an interdisciplinary research team to focus on the Dazhugao Creek watershed in the South-Link area of Taitung County, Taiwan. This initiative centers on the agency of local residents as its core value. Taitung University serves as a “bridge” between collaborative learning and action with the community to forge a “path” towards an alternative modernity. The project aims to assist the residents in developing sustainable development strategies that align with their cultural, social, environmental, political and economic conditions. Over more than five years, the team has accumulated a variety of valuable and widely applicable ethnographic research data (ERD), which include:
  • Background Information: This provides background information on subjects such as age, gender, education, occupation and family background.
  • Environmental Data: These describe geographical location, weather, traffic conditions, social environment and cultural background.
  • Fieldwork Notes: These are detailed records of the subjects’ conduct, activities, expressions and conversations.
  • Interview Records: These contain the content of interviews conducted with subjects, but do not include observations of their reactions and emotions.
  • Imagery Data: These include the photographing or videoing of locations, environments and objects to help researchers better understand and describe their subjects and settings.
  • Researchers’ Personal Notes: Researchers’ observations and analyses on fieldwork are vital records, aiding in understanding the research process and outcomes.
  • Survey Data: These include subjects’ answers to questionnaires, gathering information and opinions relevant to a research topic.
The data are recorded in various languages, including Mandarin and indigenous languages such as East-Paiwan and South-Paiwan; there are Paiwan language data documented in Chinese, Paiwan language data recorded in Romanized script and mixed Paiwan and Japanese records from the Japanese colonial period.
These various and non-uniform data types are often referred to as heterogeneous data. For example, the data in Table 1 may have different attributes with the characteristics described below, requiring appropriate storage methods to improve data usability and manageability.
  • Data Attributes Variations: Even within the same data type or structure, attribute values are inconsistent. For example, the three field surveys and interview records may differ significantly in the number and type of attributes they include.
  • Structural Inconsistency in Data: Upon review of the sample data, it becomes clear that the data fall into several categories: comprehensive tables, fieldwork records, transcripts and presentations.
  • Containing Images: Among the samples, a significant portion of the data contain images, with all types incorporating visuals, except for verbatim transcripts. Typically, these data entries contain between two and six images. Consequently, these data cannot be stored using purely text-based methods.
Table 1. Heterogeneous field study records in the relational SQL database. The asterisks in the table represent masked portions of personally identifiable information (PII).
No. | Topic | Date/Time | Location | Interviewers | Keywords | Respondents | Language
1 | Ladan’s Pre-meeting for Millet Sowing Festival | 26 February | Tusan, Daren Township, Taitung | ***Chen, ***Ko, ***Wu | Agriculture, Food | None | Chinese
2 | Queen Wang’s Ma Chai Terrace | None | None | 110***24 | None | None | Paiwan
3 | Impact of the Epidemic on Settlements | 17 August | Tusan, Daren Township, Taitung | 110***11 | Social issues, epidemic prevention | ***Chen | Chinese
A challenge for current databases is effectively managing heterogeneous data. To preserve the integrity of both visual and textual content, a database must be able to host diverse data types. The complexity of data relationships such as geographic connectivity, ancestral connectivity, seasonal separation, sequence timing and socio-political hierarchies is traditionally a struggle for databases. Chen et al. [2] demonstrated that even a simple location description can yield eight types of information, such as place semantics and contextual knowledge. Only with proper database construction can the knowledge contained in the data be understood and utilized. This creates a need for technologies able to overcome these challenges for data such as ERD, for which traditional database operations are cumbersome. Burns and Wark [3] introduced two critical perspectives. First, databases should be viewed not merely as technical tools but as social and cultural artifacts that encapsulate specific ways of understanding the world. They emphasized that databases reflect values, norms and power dynamics, influencing how individuals and groups interact with data and each other. Second, database ethnography complements traditional digital ethnography by specifically focusing on the database as a site of knowledge production. They suggested that this approach can yield valuable insights into the social and political implications of database use and design. To develop cross-cultural databases, Watts et al. [4] discussed several methodologies focusing on best practices and principles that enhance the reliability and validity of the data. Key methodologies include the use of existing ethnographic materials, the transformation of qualitative descriptions, standardization of coding procedures, retention of source information, addressing biases and limitations, incorporation of metadata and interdisciplinary collaboration. These methodologies are used as the reference criteria in our evaluations.
The current lack of theoretical or practical technical comparisons in the transition from ethnography to databases makes it difficult to find appropriate reference models when constructing these databases. This research evaluates various database technologies, namely traditional relational databases, NoSQL key-value databases, graph databases, document databases, object-oriented databases and vector databases, assessing their capability to improve on traditional database architecture for managing diverse data types such as ERD from field surveys and questionnaires. The various technologies are compared against the specific needs of storing heterogeneous data that combine modern technology and traditional knowledge.
In parallel, the recent surge in artificial intelligence has introduced the complex challenge of integrating AI with database construction. Zhou et al. [5] conducted a detailed analysis of the early relationship between AI and databases. They described learning-based database configuration that automates tasks such as knob tuning and index and view selection, learning-based database optimization that improves performance through cost estimation and join-order selection, and learning-based database design, monitoring and security, all using AI to improve efficiency. These techniques extend the capabilities of traditional database systems, making them more intelligent and capable of handling large-scale applications. The term “early” refers to the period before the advent of generative AI, despite being relatively recent. Some scholars have also explored the primary aim of this paper: addressing database issues pertinent to humanities data. Zhao [6] developed a database system specifically for the preservation and management of intangible cultural heritage using modern database technology, but only MySQL with an MVC framework is addressed. With the rapid advancement of generative AI in recent years, research has increasingly focused on integrating large language models with databases. Some scholars have endeavored to transform substantial data from traditional databases into the knowledge base of large language models. Jindal et al. [7] identified three core challenges in applying generative AI to databases: accuracy, scale and privacy. They also proposed a framework that automates the process of turning data from databases into actionable insights. It leverages a large language model to generate relevant data models in response to user questions and employs a scalable retrieval approach to handle large databases, though it can only be applied to SQL databases. Other researchers have sought to leverage large language models to tackle issues related to knowledge graphs. Yang et al. [8] proposed a large language model with knowledge-graph enhancement to solve the challenge of complex knowledge structures between actions and events. This is the same concept as what we do with graph databases and vector databases, as seen below. Furthermore, our research adopts a more pragmatic approach, aiming to deploy large language models in real-world environments to facilitate the development of humanities databases.
Another prominent concern with integrating ethnographic data into digital formats is research ethics, particularly when AI is used in data processing. Since AI cannot tell whether transformed data align with the interests of indigenous communities, numerous controversies have arisen. This concern is extensively discussed in the literature. Pinhanez et al. [9] outlined key ethical concerns in AI and indigenous languages, emphasizing the need to avoid data extractivism and colonial thinking, and advocating the principle of “Nothing for us without us”, ensuring indigenous communities actively participate in projects that benefit them. Sustainability is crucial, as technologies must not harm these communities or their cultures. Respect for traditional knowledge is essential, particularly in addressing societal challenges. The paper also highlighted the importance of balancing social impact with technical opportunities, and called for ethical guidelines to align AI development with the values and needs of indigenous peoples. Junker [10] emphasized key ethical concerns regarding data mining of indigenous languages, including the need for indigenous control over their data and the importance of consent to prevent exploitation, warning against the potential misuse of data in AI models, which may propagate errors and misrepresent cultures. The concept of “digital colonialism” is introduced, emphasizing the exploitation of Indigenous data as a free resource without proper acknowledgment or compensation. There is also a call for increased awareness among stakeholders of the implications of data mining and generative AI, and for respectful and equitable co-operation in language-preservation efforts.
This paper is structured as follows: firstly, the research methodology is addressed. Secondly, a comprehensive review of the current database types capable of storing ERD is conducted. Thirdly, based on actual construction experience, the advantages and disadvantages of these databases are discussed. Next, for the three most appropriate types of databases, practical methods of GenAI-assisted support are proposed. Lastly, the final section provides a comprehensive discussion and conclusion.

2. Research Methodology

This study employed a practical construction approach to evaluate the advantages and disadvantages of integrating ethnographic data into various databases. We sampled 50 heterogeneous ethnographic data records and inserted them into different types of databases, which could be either cloud-based or on-premises, depending on their characteristics. Given the extensive nature of the process, the step-by-step procedure is not detailed in this paper. We evaluate the pros and cons using several observational metrics:
  • Ease of Initial Database Construction: This metric refers not to the configuration of the database system itself but to the process of preparing the schema (if necessary) to the point where data can be inserted into the database that is already in place;
  • Smoothness of Importing Heterogeneous Data: These include, but are not limited to, text data, images, audio data, video files and geographic co-ordinate data;
  • Effectiveness in Querying Data: This measures whether a database can retrieve relevant data based on the intent of a user, which includes the accuracy of keyword searches and the ability to perform semantic searches. This is particularly important for indigenous languages, where language system differences often mean that simple keyword searches may not meet users’ expectations.
The evaluations of the first two criteria are conducted by the system engineers responsible for actual database construction and data import. They use a percentage scoring method based on five indicators: cost, configuration, performance, schema definition and data insert/modify/delete, to measure feasibility and convenience. The third criterion is assessed by social science experts through precise and semantic keyword searches, using percentage scoring to evaluate accuracy and accessibility. Lastly, the research team held a consensus meeting to conduct a comprehensive analysis.
Our study established several stringent criteria for research ethics to determine the eligibility of data for inclusion in the database, regardless of whether the conversion is performed manually or with AI assistance. These criteria include:
  • Obtaining written consent from the individuals involved in the data;
  • Submitting the data to a tribal council for review if it pertains to traditional tribal knowledge;
  • Ensuring that all processed data undergo manual review by experts in the field before public dissemination.
All research activities are currently conducted under the supervision of the Institutional Review Board (IRB) at National Cheng Kung University Governance Framework for Human Research Ethics, a national-level IRB funded by the National Science and Technology Council of Taiwan, under contract number 111-468-2.

3. Comprehensive Review of Database Types for ERD

3.1. Relational Databases

In a relational database, data are organized into tables with unique names, rows and columns. For instance, Table 1 illustrates how field data can be structured in a relational database system. Tables are structured at the most basic level, and relationships are one of the important features of this approach, which is queried through Structured Query Language (SQL) and uses normalization to avoid storing redundant data. Each table is reserved for a unique entity such as people, festivals or ceremonies, whereas columns represent attributes of interest such as priest name, activity, ceremony process and so on. Normalization is based on the simple concept of separating related data into simple, atomic tables connected by links. This method goes a long way toward preventing insertion, deletion and update anomalies, and it also guarantees data integrity.
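As a minimal sketch of this modeling style, the following uses Python’s built-in sqlite3 module to normalize the records of Table 1 into separate entity tables connected by a junction table; the table and column names are illustrative, not the project’s actual schema.

    import sqlite3

    conn = sqlite3.connect("erd.db")
    cur = conn.cursor()

    # One table per entity; a junction table links interviews to interviewers.
    cur.executescript("""
    CREATE TABLE interview (
        id INTEGER PRIMARY KEY,
        topic TEXT NOT NULL,
        date_time TEXT,
        location TEXT,
        language TEXT
    );
    CREATE TABLE interviewer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE interview_interviewer (
        interview_id INTEGER REFERENCES interview(id),
        interviewer_id INTEGER REFERENCES interviewer(id),
        PRIMARY KEY (interview_id, interviewer_id)
    );
    """)

    # Every row must fit the pre-defined columns; missing values become NULL.
    cur.execute(
        "INSERT INTO interview (topic, date_time, location, language) "
        "VALUES (?, ?, ?, ?)",
        ("Ladan's Pre-meeting for Millet Sowing Festival", "26 February",
         "Tusan, Daren Township, Taitung", "Chinese"),
    )
    conn.commit()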

3.2. Key-Value NoSQL Databases

NoSQL means “Not only SQL”, which incorporates various differences from traditional SQL databases. There is an opinion that NoSQL databases are poor at storing relational data. In reality, NoSQL simply stores the data differently from how SQL tables do, while still preserving the relational aspects. Mihai [11] compared the data-handling differences between these two types of databases and concluded that NoSQL databases were particularly well suited for applications that require high scalability, flexibility and the ability to handle diverse data types. As described by Kamal et al. [12], “Non-relational databases are generally known for their schema-less data models, improved performance and scalability”. The abandonment of table structure in NoSQL also leads people to believe that NoSQL databases can be easier to create than SQL databases.
A rise in labor costs combined with falling hardware prices has spurred the development of a new wave of NoSQL databases. Over the past decade, NoSQL databases have surged in popularity for various reasons, primarily their ability to integrate seamlessly into existing commercial systems and simplify the development process. Moreover, NoSQL databases serve as essential tools for developing storage solutions capable of handling the ever-growing mass of data. Unlike standard SQL databases, which may struggle to manage and store large amounts of data from various sources, NoSQL databases excel in this area. For instance, consider a database that tracks the activities of a tribe and details of yearly festivals. A relational database would require activity tables with columns for activity names, locations and participants, potentially including dozens or even hundreds of fields for each entry. A NoSQL database can store all relevant information about a record in a single document, as shown in Figure 1, streamlining data management and enhancing flexibility.

3.3. Document Databases

Categorized as a form of NoSQL, document databases are specifically intended to process, manage and store a range of data types, including temporal and geographic data, images and searchable text, and are often used in conjunction with geospatial data.
Document databases do not impose a fixed structure on data, so they provide a lot of flexibility for storing unstructured data and data whose attribute values differ. Traditional relational databases require pre-established table structures and attribute fields, and are therefore less suitable for heterogeneous data storage.
By using flexible attribute definitions, developers can store variable attribute data in document databases without needing pre-defined attribute fields. This is incredibly helpful for preserving things like interview records, which may contain quite different attributes; an example is shown in Figure 2. Each document represents an interview record containing numerous attributes that are specific to the interview. The model permits the dynamic addition or removal of attributes per document to accommodate the heterogeneity of the data.
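A minimal sketch of this flexibility, assuming a local MongoDB instance and the pymongo driver (field names are illustrative): two interview records with different attribute sets can be inserted into the same collection without any schema change.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    records = client["erd"]["interviews"]

    # Two records with different attribute sets share one collection.
    records.insert_one({
        "topic": "Pre-meeting for Millet Sowing Festival",
        "location": "Tusan, Daren Township, Taitung",
        "interviewers": ["***Chen", "***Ko", "***Wu"],
        "keywords": ["Agriculture", "Food"],
    })
    records.insert_one({
        "topic": "Queen Wang's Ma Chai Terrace",
        "language": "Paiwan",          # attributes absent above...
        "respondent_id": "110***24",   # ...and attributes unique to this record
    })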

3.4. Graph Databases

Graph databases are another type of NoSQL database, characterized by their use of a graph structure for data storage and relational querying. As described by Origlia et al. [13], “Graph databases represent an ideal way to combine multiple-source data but, to be successful, strategies accounting for inconsistencies and format differences have to be defined to support coherent analysis. Also, the continuously changing nature of crowd-sourced data makes it difficult, for the research community, to compare technological approaches to the different tasks that are linked to cultural heritage, from recommendation to management”. Graph databases utilize nodes and edges to represent data and the relationships between data. Because they allow for direct connections between data relationships, searches can often be conducted through a single operation. The inherent storage of relational properties within the database enables rapid and intuitive querying of relationships, making graph databases particularly useful for highly interconnected data.
Pansara [14] concluded that graph databases offer significant advantages in data management by effectively representing complex relationships between entities. These characteristics improve data traversal and query performance, surpassing traditional relational databases in navigating complex connections. Their flexible data model adapts to evolving business needs, ensuring scalability for large datasets. For instance, consider a graph database storing ethnographic fieldwork sample data, as shown in Figure 3. The nodes could represent different aspects of the research, such as locations, interviewers and topics, while the edges would represent the relationships between these aspects, such as “subtopic”, “make”, or “verbatim”.
Graph databases excel at combining data relationships with the data themselves, making them well-suited for applications where data interconnectivity is high and complexity is significant. Compared to traditional relational databases, graph databases are more capable of handling large-scale, unstructured and semi-structured data, and they find applications in scenarios such as social network analysis, fraud detection, network analysis, risk management, intelligent search and recommendation systems.
A key feature of graph databases is their efficient data querying and analysis capabilities. They employ query languages such as Gremlin [15], SPARQL [16] and Cypher [17] for complex querying and analytical operations such as recursive queries, graph-pattern matching and community detection. These functionalities enable the discovery of implicit relationships within the graph structure and facilitate a deeper understanding and analysis of the data.
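As a minimal sketch of such a query, the following uses the Neo4j Python driver and Cypher to find every topic reached through the records an interviewer has made; the connection details, node labels and relationship types are illustrative, echoing the “make” and “subtopic” relationships described for Figure 3.

    from neo4j import GraphDatabase  # assumes a locally running Neo4j instance

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    # Graph-pattern matching: follow MAKE and SUBTOPIC edges in one query,
    # with no join tables to traverse.
    query = """
    MATCH (i:Interviewer {name: $name})-[:MAKE]->(r:Record)-[:SUBTOPIC]->(t:Topic)
    RETURN t.name AS topic
    """
    with driver.session() as session:
        for row in session.run(query, name="***Chen"):
            print(row["topic"])
    driver.close()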

3.5. Object-Oriented Databases (OODB)

OODBs are a NoSQL database model that organizes and stores datasets as objects. Compared to conventional relational databases, object-oriented databases provide a more adaptable and intuitive data configuration. In an object-oriented database, data and behavior are organized together as objects. Each object has its own properties and methods, and supports rich features such as inheritance and polymorphism. Objects can represent concrete real-world entities, for example people, sacrificial utensils or houses, or abstract concepts such as records or oblations. Cattell et al. [18] detailed the characteristics of the object model, object definition language, object interchange format, object query language, transaction model and database operations in the Object Data Standard.
The development of object-oriented databases is often attributed to the increasing usage of object-oriented programming languages. Most data storage relies on traditional relational databases, but it was difficult to manage and manipulate stored abstractions in relational databases using object-oriented technology. Aware of this, database developers began developing new systems specifically for managing such abstract objects, which led to the advent of object-oriented databases.
Applications with well-organized data models, especially those aligned with object-oriented design, often do well when paired with object-oriented databases. Such software is typically easier to understand, reproduce and prototype than software that is quickly thrown together. Instead of bending objects to fit into a relational database, object-oriented databases closely align with the way objects are modeled and manipulated during programming, and are thus intended to relieve the burden placed on developers by the inflation of Object-Relational Mapping (ORM) translations. Torres et al. [19] stated that the main challenges of ORM include the impedance mismatch between object-oriented programming and relational databases, which complicates mapping. Key issues include overlooked primary key mapping, foreign key representation and handling complex relationships like many-to-many and inheritance. Performance concerns arise with large datasets and complex queries, necessitating optimization. Moreover, achieving a balance between transparency and low coupling is difficult, as high coupling can hinder the maintenance and evolution of applications. These challenges underscore the importance of careful design and informed decision-making when implementing ORM solutions to ensure effective data persistence and application performance. ORM is nevertheless able to accommodate, and efficiently fetch and display, all types of objects.
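The impedance mismatch can be illustrated with a small hypothetical sketch: an object that is natural to express in code must be decomposed into several tables when persisted relationally, whereas an object database stores it as-is.

    from dataclasses import dataclass, field

    @dataclass
    class Interviewer:
        name: str

    @dataclass
    class InterviewRecord:
        topic: str
        # A many-to-many relationship: several interviewers per record,
        # several records per interviewer.
        interviewers: list = field(default_factory=list)

    record = InterviewRecord("Millet Sowing Festival",
                             [Interviewer("***Chen"), Interviewer("***Ko")])

    # An object database persists `record` directly. A relational backend
    # instead needs three tables (record, interviewer, junction) and must
    # reassemble the object graph on every read: the ORM impedance
    # mismatch described by Torres et al. [19].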

3.6. Vector Databases

Vector databases, a novel type of database system, harness machine-learning technologies (like the transformers used in natural language processing, image processing and audio analysis) to transform diverse and complex data types (such as images, videos, text and audio files) into vector embeddings. This approach preserves the original data features and enables similarity searches based on the characteristics of vectors, using various distance-calculation methods such as Cosine, Manhattan, Euclidean and Chebyshev distances (a small sketch of these measures follows the list below). There are numerous benefits to be gained from using vector databases, including:
  • Enhanced Search Techniques: Vector databases hold information in vectors, series of numbers that represent specific features; these vectors are created using machine-learning models. Vector databases enable researchers to find nearby results in vector distance. As shown in Figure 4, this allows users to perform searches by finding the closest results in vector distance, thereby facilitating more nuanced and precise search capabilities than traditional keyword-based searches.
  • Diverse Data Type Compatibility: Vector databases are capable of effectively processing diverse data types, covering a variety of formats including text, audio, image and video. They can also handle both unstructured and structured data. Through machine-learning techniques, vector databases embed the features of a dataset as vectors and store them for later retrieval, which improves efficiency and manageability.
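The four distance measures named above reduce to a few lines of arithmetic; the sketch below computes them with NumPy on toy three-dimensional vectors (real embeddings typically have hundreds of dimensions).

    import numpy as np

    a = np.array([0.2, 0.7, 0.1])  # embedding of a query text
    b = np.array([0.3, 0.6, 0.2])  # embedding of a stored record

    cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    manhattan = np.abs(a - b).sum()
    euclidean = np.linalg.norm(a - b)
    chebyshev = np.abs(a - b).max()

    # The smaller the distance, the more similar the record is to the query.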
Vector databases excel in situations that traditional database systems cannot handle. Kukreja et al. [20] explored vector databases and vector embedding techniques, and presented the evolution, architecture, advantages and challenges of vector databases in their study. They also discussed the challenges of vector databases, which include selecting distance metrics, dimensionality, data integrity, cost, etc. These databases are especially useful for managing complex and intricate data types. For instance, they can conduct searches on the basis of form or style similarity in image or audio identification systems, which is not feasible using conventional keyword-based techniques. Additionally, they are well suited for use in AI and data analytics for implementing mathematical models in machine learning, and for naturally storing and querying complex data.

4. The Pros and Cons of Different Databases in Real Implementation

4.1. Relational Databases

Relational databases that rely on SQL achieve powerful data processing by querying, inserting, updating and deleting records. Adhering to the ACID (atomicity, consistency, isolation and durability) principles of transaction integrity and historical accuracy, relational databases have transformed from basic data-integration platforms in the 1970s into critical hubs for business intelligence operations today. Medjahed et al. [21] presented these four key properties for ensuring reliable database transactions (a minimal transaction sketch follows the list below) as:
  • Atomicity ensures the transactions are all-or-nothing—rolling back if any part fails;
  • Consistency guarantees that transactions move the database from one valid state to another in compliance with integrity rules;
  • Isolation allows transactions to operate independently, thus preventing interference and ensuring that uncommitted changes are not visible to others;
  • Durability ensures that once a transaction is committed, its effects are permanent, surviving system failures through mechanisms such as transaction logs. Together, these properties maintain data integrity and reliability in database systems.
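A minimal sketch of atomicity in practice, reusing Python’s sqlite3 module and the illustrative schema from the Section 3.1 sketch: the `with conn:` block opens a transaction that commits on success and rolls back entirely on failure.

    import sqlite3

    conn = sqlite3.connect("erd.db")
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute("UPDATE interview SET location = ? WHERE id = ?",
                         ("Tusan, Daren Township, Taitung", 1))
            conn.execute("INSERT INTO interviewer (name) VALUES (?)",
                         ("***Wu",))
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass  # atomicity: neither statement above took effect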
Through sophisticated indexing and schema design, relational databases provide great efficiency and productivity in processing business data. Nevertheless, relational databases have constraints: scaling processing performance through vertical growth requires significant financial investment, which dissuades financially constrained research teams from designing cost-effective information solutions with relational databases. A rigid structure implies that particular data formats must be applied, which can constrain adaptability and make changes more complex. Managing the interrelationships among numerous tables also complicates things beyond the simplicity offered by massive single-table designs. It is therefore important to note the tension between the data-management benefits of relational databases and their flexibility and financial requirements.

4.2. Key-Value NoSQL Databases

Key-value NoSQL databases bring a lot of benefits, especially in their simplicity and scalability. The pairing of a basic data model’s attribute name, or “key”, with its value simplifies data management and is suitable for massive data scalability. These types of databases are also particularly adept at efficient key-based queries, ensuring swift data access and high data availability even in the face of system faults, with features such as master–slave replication in the Redis [22] database system.
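A minimal sketch using the redis-py client against a local Redis server (key names and fields are illustrative): an entire record lives under one key, which makes retrieval a single lookup but leaves richer queries to the application layer.

    import json
    import redis  # assumes the redis-py client and a local Redis server

    r = redis.Redis(host="localhost", port=6379)

    # The whole record is stored under a single key: no schema, no joins.
    record = {"topic": "Impact of the Epidemic on Settlements",
              "location": "Tusan, Daren Township, Taitung",
              "keywords": ["Social issues", "epidemic prevention"]}
    r.set("record:3", json.dumps(record))

    # Retrieval is a single key lookup...
    data = json.loads(r.get("record:3"))
    # ...but any query beyond the key (e.g., "all records about epidemics")
    # must be implemented by the application itself.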
Nevertheless, the drawbacks of key-value databases must also be considered. Given that they do not have relational database features such as table joins and structured tables, key-value data stores are not well suited to some types of application, particularly those that require complex data structures. They also require a lot of extra work at the application layer to deal with a number of the simplest tasks that are native to traditional relational databases.
In short, while key-value NoSQL databases are powerful for user-friendliness, scalability and reliability, their limitations in relational and complex query functions make them unsuitable for certain data storage and analysis requirements.

4.3. Document Databases

Among modern data-management options, document databases are exceptional because their information structures are truly adaptable, accommodating different structures, formats and sizes. This adaptability is advantageous for software engineers, as it avoids the need to introduce new fields to existing records when requirements change, allowing all document types already stored in the database to continue being used. When querying data, entire documents can also be located using flexible criteria such as festival dates or tribal positions.
Document databases, while flexible, have certain disadvantages compared with traditional relational databases. One major drawback is their lack of stringent data-consistency controls, which can lead to inconsistencies in stored data; this is potentially problematic for applications that require high data integrity. In addition, when handling high-volume transaction processing, document databases struggle due to the lack of rigorous consistency protocols, making them less effective than relational databases in such scenarios. Furthermore, even regular queries might perform more slowly in document databases, because their querying processes are not as optimized or stable as those found in relational databases, affecting overall performance.

4.4. Graph Databases

Graph databases stand out for handling complex, interconnected data, efficiently managing unstructured and semi-structured data with dynamic attributes. Designed to represent and query relationships directly, they speed up access to connected nodes, eliminating the need for extensive searches and supporting scalability from small databases up to peta- and exabyte scales.
Despite their advantages, graph databases present challenges, including a steep learning curve and the necessity for both deep technical knowledge in graph theory and domain knowledge of stored data. Their complex nature demands careful planning to fully leverage their performance capabilities. While excellent for scenarios with complex relationships, graph databases may not fit cases needing highly normalized data or simple queries, where relational or key-value databases could perform better.
In essence, graph databases are potent for specific, complex data-relationship scenarios, but their application is limited by the required technical expertise and the intricacies of managing interconnected data, making them somewhat difficult to adopt yet well suited to this project’s needs.

4.5. Object-Oriented Databases

Object-oriented databases provide significant advantages in data management by simplifying the data model and eliminating the need for conversion, thereby speeding up and streamlining development. They allow for direct interaction with data objects and their relationships, enabling a more natural representation of complicated object networks and hierarchies, which is ideal for applications that require rich object interactions.
Despite their advantages, object databases have certain limitations. First, their query performance is often not as good as that of relational databases, which rest on a solid mathematical foundation; this has various consequences and generally makes complex queries difficult to execute. The second limitation is that, while SQL has a wide range of tools for reporting, online analytical processing (OLAP), data backup and retrieval, object databases lack these tools, and specialized tools are needed to handle data manipulation. Thomsen [23] defined OLAP as a technology that enables users to quickly access information from multidimensional data warehouses. It provides a flexible way to view data, supports complex computation and facilitates trend analysis and sophisticated data modeling. These features are not just technical tools but are integral to understanding business processes, reflecting the values and power dynamics within organizations. The third limitation is that, because of the many problems inherent in the mismatch between object-oriented programming and databases, the more complex the fields and relationships, the more difficult it is to reconstruct database objects. Another problem is that there is currently no purely object-oriented database system on the market.
To sum up, even though object-oriented databases make information easier to represent and are useful in facilitating complex interactions, they face complications in query effectiveness, integration with development tooling and the ability to manage increasingly demanding data.

4.6. Vector Databases

Vector databases provide creative strategies for data management, particularly in complex search technologies. They enhance search by calculating vectors that represent not only what the data are, but what they are in relation to everything else: the point in multidimensional space where the data live. They also provide versatile support for a variety of data types, including numbers, text, images and sounds. They are adept at sorting and interpreting both unstructured and structured data, using machine learning to mimic, in an organized way, how people relate ideas in the real world.
While vector databases offer many benefits, they also present some important challenges. One of the foremost is the high cost involved in using vector databases. They require significant storage space, which can be particularly expensive when dealing with high-dimensional vectors or large datasets. This need for large amounts of storage can result in increased storage costs and heightened hardware requirements. Another major challenge is the configuration of the model and distance algorithms. When constructing a vector database, it is necessary to select the appropriate dimensions for the vectors and the appropriate methods for performing distance calculations on the overall dataset. Different algorithms may perform well under different conditions, and it is difficult to know in advance which method of distance calculation will be most effective under which conditions. It takes careful configuration and a good understanding of the data and the algorithms to achieve an optimal combination that makes the most of the advantages of vector databases.

5. GenAI-Assisted Database Deployment

Based on the aforementioned study, document, graph and vector databases provide valuable benefits for ERD requirements. They are also the most difficult to populate automatically: converting field survey data into the required database format often necessitates substantial manual intervention. Therefore, this study attempts to use novel generative AI methods to assist in the normalization and conversion of data.

5.1. Document Databases

In document databases, manually entering each datum would inevitably require substantial manpower and time. Therefore, this study examines the use of OpenAI [24], which is currently easy to access and utilize, to assist in identifying specific tags. This approach significantly reduces the human effort required, leaving mainly the task of supervising the correctness of the results produced by the AI.
Firstly, as shown in Figure 5, a set of prompts was designed using the method of Prompt Engineering. These prompts enabled the transformation of a reflective report, like the one shown in Figure 6, into the necessary data format, as seen in Figure 7, within seconds. Subsequently, by processing the text-to-JSON conversion through programming, the format illustrated in Figure 8 was generated, facilitating the direct import of data into a document database.
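The exact prompts are those shown in Figure 5; the sketch below only approximates the pipeline with the openai Python client, an illustrative model name and illustrative field names, followed by an insert into the document database.

    import json
    from openai import OpenAI
    from pymongo import MongoClient

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    note_text = open("reflective_report.txt", encoding="utf-8").read()

    # Ask the model to extract tags and reply with JSON only.
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content":
                   "Extract topic, date, location, interviewers and keywords "
                   "from the following field note. Reply with one JSON object "
                   "and nothing else:\n" + note_text}],
    )
    doc = json.loads(resp.choices[0].message.content)

    # Import the converted record directly into the document database.
    MongoClient("mongodb://localhost:27017")["erd"]["interviews"].insert_one(doc)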

5.2. Graph Databases

In this study, the Cypher language with the Neo4j database was used as an example. Before importing data for desirable search methods, it was necessary to determine the types of data nodes and the relationships to be stored in the database. This report initially established several data node types and relationships based on examples from document databases, as illustrated in Table 2. From this table it was evident that certain search scenarios would be essential, such as “Subject build by an Author” or “Subject having a Label”.
Thus, when establishing information in the database based on a file from the field study data, the commands shown in Figure 9 were necessary. The first block of code primarily set up the data nodes required for this file, while the subsequent blocks established the relationships between these nodes. Compared to creating a record in a document database, constructing a graph database is more complex due to the need to consider the relationships between nodes. This complexity improves readability and potentially improves search efficiency (as shown in Figure 10).
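Figure 9’s commands are not reproduced here; the following sketch mirrors their structure through the Neo4j Python driver, with node labels and relationship types loosely based on Table 2’s “Subject build by an Author” and “Subject having a Label” scenarios (all names are illustrative).

    from neo4j import GraphDatabase  # assumes a locally running Neo4j instance

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))
    title = "Pre-meeting for Millet Sowing Festival"

    with driver.session() as session:
        # First block: set up the data nodes required for this file.
        session.run("""
            MERGE (:Subject {title: $title})
            MERGE (:Author  {name:  $author})
            MERGE (:Label   {name:  $label})
        """, title=title, author="***Chen", label="Agriculture")

        # Subsequent blocks: establish the relationships between the nodes.
        session.run("""
            MATCH (s:Subject {title: $title}), (a:Author {name: $author})
            MERGE (a)-[:BUILD]->(s)
        """, title=title, author="***Chen")
        session.run("""
            MATCH (s:Subject {title: $title}), (l:Label {name: $label})
            MERGE (s)-[:HAS_LABEL]->(l)
        """, title=title, label="Agriculture")
    driver.close()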
However, manual input for each data entry would be expensive in terms of human resources. Therefore, exploring the capability of the large language model-based OpenAI to analyze the nodes and relationships in each field survey dataset is a viable option. If it can identify these details with reasonable accuracy, checking the correctness of the results is the only remaining requirement.
Initially, the text content and file name of the field survey document were put into OpenAI, and the necessary information was requested. Prompts, designed for this purpose, are shown in Figure 11, using the same data as presented in Figure 6. OpenAI’s response, shown in Figure 12, appeared fairly accurate and indicated that OpenAI’s language model can, indeed, analyze the specific data required and even provide the corresponding Cypher commands for importing the data, as shown in Figure 13. This demonstrates that OpenAI can significantly assist in the construction of graph databases.

5.3. Vector Databases

Since vector databases themselves do not assist in converting data into vector embeddings, this part needs to be handled externally by the creators using models trained for this purpose. Currently, many models are capable of converting textual data into vector embeddings, such as Gensim’s Doc2Vec [25] and Google’s BERT [26], among others. As Google’s BERT segments text at the character level, it does not align with the phrase-based segmentation of Chinese; this research therefore uses Doc2Vec. In practical applications, it is also necessary to carefully adjust the sample data and supervise the accuracy.
The model training and conversion process is outlined below. Given the characteristics of Chinese text, the Jieba [27] segmentation method was used to split each article into words. The segmented data were then fed into a Doc2Vec model, configured to generate embeddings with a dimensionality of 512 as specified by the database settings, and the model was trained. Subsequently, new articles were processed by the model to convert and output their vector embeddings. Due to the large volume of data, only the pseudocode in Figure 14 is provided as an example:
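Figure 14 is not reproduced here; the following is a minimal runnable sketch of the same steps, assuming the jieba and gensim packages, where the placeholder texts stand in for the project’s articles.

    import jieba  # Chinese word segmentation
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    articles = ["部落小米播種祭的田野紀錄……", "疫情對聚落影響的訪談……"]  # placeholders
    new_article = "豐年祭籌備會議的逐字稿……"

    # Segment each article into words; Chinese text has no spaces to split on.
    corpus = [TaggedDocument(words=jieba.lcut(text), tags=[i])
              for i, text in enumerate(articles)]

    # 512 dimensions, matching the database setting described above.
    model = Doc2Vec(vector_size=512, min_count=1, epochs=40)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Convert a new article into its vector embedding for insertion.
    vector = model.infer_vector(jieba.lcut(new_article))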
This example demonstrates how to prepare data, train a Doc2Vec model, and convert new text into vector embeddings, taking into account Chinese text segmentation. This approach is beneficial for integrating complex textual data into databases with enhanced search and retrieval capabilities.
Since OpenAI’s services are themselves backed by vector representations, using the embeddings pre-trained by OpenAI directly is an alternative approach, as illustrated in Figure 15, with results as shown in Figure 16. The main advantage of this method is its simplicity and speed. However, the drawbacks include the costs involved and, since OpenAI’s vectors are designed for general-purpose language models, indigenous languages are not originally included in its training set, which can limit its applicability for the primary goal of handling regional ERD data.
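A minimal sketch of this alternative, assuming the openai Python client and an illustrative embedding-model name; each call is billed, which contributes to the long-term operating costs noted above.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative general-purpose model
        input="(Romanized Paiwan field text)",
    )
    vector = resp.data[0].embedding  # a list of floats, ready for the database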
For specialized applications like ERD involving indigenous languages, it might be necessary to develop custom models or adapt existing models to better capture the linguistic nuances and cultural contexts specific to these languages. This could involve additional training on specialized corpora or tweaking the model parameters to better handle the unique features of the data concerned.

6. Discussion and Conclusions

Among the six database solutions compared, the object-oriented database was excluded due to its overly theoretical nature. Although some database systems claim to possess object-oriented features, they cannot be independently used for practical analysis. The remaining five types of database are readily available in cloud-based, on-premises or hybrid solutions for experimentation. Excluding the object-oriented database, a comprehensive comparison is presented in Table 3.
Relational databases, while traditional and simple, struggle with ERD, which is distributed and heterogeneous. Constructing a schema that encompasses all possible fields becomes extremely challenging, resulting in a sparse database configuration that significantly increases construction costs. Although native SQL supports a wide array of relational queries, it is not feasible to perform semantic querying directly.
Key-Value NoSQL databases do not require a pre-defined schema, which provides significant flexibility in storing a variety of data types. However, due to the lack of relationships between data elements, semantic retrieval is not achievable with this type of database. Although it is known that generative AI can assist in initializing database configurations, the inability to perform semantic searches does not meet required needs and, therefore, the potential assistance that GenAI could offer is not emphasized in this research.
Document databases are particularly proficient at organizing a broad spectrum of data. This makes them particularly well suited to researchers working in the social sciences who want to keep survey data in the form it takes in the field. These databases can also absorb binary files directly, such as Word documents, PowerPoint presentations, Excel spreadsheets and Adobe Acrobat PDFs. The efficiency that comes with storing files directly within the database is especially convenient for initiating backup systems and managing the data.
Document databases provide excellent adaptability for preserving information: they impose no uniform structure on field details, which can pay off later. Without any hassle, they can preserve text with a substantial set of attributes in one document. Given their self-contained document structure, these databases can offer a faster response rate than other databases when queries are issued by client software.
Using generative AI can automate the generation of metadata during data construction, and even achieve the goal of automatic semantic extension tagging, making document databases more practical than manual tagging. GenAI adoption not only improves the process but also enhances the richness and accuracy of the metadata, enabling more efficient data retrieval and utilization.
Graph databases excel at storing highly relational data, and their usefulness shows in managing the complex relationships hidden within survey records. They efficiently manage the interconnections between disparate data, such as survey records, manuscripts and family links, appropriately capturing the sophisticated correlations that run through social science research.
In constructing graph databases, the most challenging aspect is the development of semantic topology. This process is often time-consuming and labor-intensive, and any necessary changes may require extensive adjustments as they tend to have cascading effects throughout the system. It is demonstrated in this research how GenAI can be leveraged to assist in the construction process. This technology facilitates the automation of tasks such as metadata generation and relationship connection, enhancing the efficiency and accuracy of building complex data systems. By optimizing prompt engineering and providing support, importing individual data records has shown promising results. However, for future applications involving overlapping relationships between multiple data entries, it will be crucial to design a knowledge base for GenAI to use as a benchmark. This is necessary to prevent the generated relations from diverging too broadly, thereby maintaining the advantages of semantic topologies.
The challenges for Vector databases, renowned for their advanced search power, lie in complex modeling and algorithmic configuration, as well as in storage demands. Despite these issues, their promise to process multilingual data and integrate with cutting-edge AI technology makes them unbeatable future options.
Although vector databases theoretically offer the most ideal form of semantic querying, matching not just keywords but semantic similarity, perfecting a vector database construction that fits local and regional applications is still exceedingly challenging. The study attempts to assist in the construction process using two different generative AI methods, with the following observations:
  • Generating vector embeddings through self-trained models shows better results for data containing indigenous languages and scripts, such as the Paiwan language written in Roman script. However, developing these models demands considerable computational resources and entails high hardware construction costs;
  • Using commercial services like the OpenAI API can quickly produce results, but the accuracy for indigenous linguistic data is not ideal. This is because current commercial platforms primarily use English training data, making it nearly impossible to meet the needs of languages like Paiwan in our research field in the short term. For example, the paragraph in Figure 17 is an oral conversation in the South-Paiwan language recorded in Romanized script. The paragraph states that “when I was young, my father and uncle would go out at night with hunting knives and rifles. By dawn, they would return with their catches. Sometimes it was wild boar, sometimes flying squirrels and occasionally pheasants. They would process the catch at home and share it with the members of our community”. However, if a commercial service like ChatGPT 4o is used to analyze and understand the meaning, the results are as shown in Figure 18. The responses from ChatGPT 4o show no recognition of indigenous weapons or understanding of indigenous prey. The model also fails to grasp that hunting takes place at night and that sharing the catch after hunting is a common practice. It is only capable of summarily identifying the discourse as a familial activity based on kinship terminology. Even within kinship terminology, the distinction between “kama” (father) and “ti kama” (uncle) cannot be discerned. This approach is not only inadequate for our database construction but could also lead to erroneous correlations. Additionally, each data insertion and retrieval incurs extra costs, creating another type of financial burden for long-term database operations.
Regardless of whether approach one or two is used, both share a common issue: if the input data are incorrect, the feedback mechanism of vector embedding often leads to data contamination. Even after contaminated data are removed, residual effects may remain. This is the primary reason the overall evaluation is deemed incomplete, underscoring the need for further technological development to tackle these challenges.
To summarize, document and graph databases provide valuable benefits for certain research requirements, while vector databases may have possibilities for the future. This extended understanding supports the process of choosing the right technology for the specific data storage and analysis requirements.

Author Contributions

Conceptualization, all authors; methodology, R.-C.W.; software, D.Y.; validation, R.-C.W.; formal analysis, R.-C.W.; investigation, D.Y.; resources, W.L.; data curation, R.-C.W.; writing—original draft preparation, R.-C.W.; writing—review and editing, R.-C.W.; visualization, R.-C.W. and D.Y.; supervision, M.-C.H.; project administration, Y.-C.C.; funding acquisition, M.-C.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the project “Flow (mavalidi) and Morphogenesis (sivaikan): Co-creating the Conditions for Sustainable Development of the South-Link Area in Taitung County” of the National Taitung University, sponsored by the National Science and Technology Council, Taiwan, R.O.C. under Grant no. NSTC 111-2420-H-143-007-HS1.

Institutional Review Board Statement

All research activities are currently conducted under the supervision of the Institutional Review Board (IRB) at National Cheng Kung University Governance Framework for Human Research Ethics, a national-level IRB funded by the National Science and Technology Council of Taiwan, under contract number 111-468-2.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data is available on request due to personal information privacy reasons.

Acknowledgments

The authors would like to thank all colleagues and students at the Center for Innovation in Humanities and Social Practices at National Taitung University who contributed to collecting ethnographic research data for this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dohan, D.; Sanchez-Jankowski, M. Using computers to analyze ethnographic field data: Theoretical and practical considerations. Annu. Rev. Sociol. 1998, 24, 477–498. [Google Scholar] [CrossRef]
  2. Chen, H.; Vasardani, M.; Winter, S.; Tomko, M. A graph database model for knowledge extracted from place descriptions. ISPRS Int. J. Geo-Inf. 2018, 7, 221. [Google Scholar] [CrossRef]
  3. Burns, R.; Wark, G. Where’s the database in digital ethnography? Exploring database ethnography for open data research. Qual. Res. 2020, 20, 598–616. [Google Scholar] [CrossRef]
  4. Watts, J.; Jackson, J.C.; Arnison, C.; Hamerslag, E.M.; Shaver, J.H.; Purzycki, B.G. Building quantitative cross-cultural databases from ethnographic records: Promise, problems and principles. Cross-Cult. Res. 2022, 56, 62–94. [Google Scholar] [CrossRef]
  5. Zhou, X.; Chai, C.; Li, G.; Sun, J. Database meets artificial intelligence: A survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 1096–1116. [Google Scholar] [CrossRef]
  6. Zhao, H. The database construction of intangible cultural heritage based on artificial intelligence. Math. Probl. Eng. 2022, 2022, 8576002. [Google Scholar] [CrossRef]
  7. Jindal, A.; Qiao, S.; Madhula, S.; Raheja, K.; Jain, S. Turning Databases Into Generative AI Machines. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), Santa Cruz, CA, USA, 14–17 January 2024. [Google Scholar]
  8. Yang, L.; Chen, H.; Li, Z.; Ding, X.; Wu, X. Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling. IEEE Trans. Knowl. Data Eng. 2024, 36, 3091–3110. [Google Scholar] [CrossRef]
  9. Pinhanez, C.S.; Cavalin, P.R.; Vasconcelos, M.; Nogima, J. Balancing Social Impact, Opportunities, and Ethical Constraints of Using AI in the Documentation and Vitalization of Indigenous Languages. In Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI), Macao, China, 19–25 August 2023; pp. 6174–6182. [Google Scholar]
  10. Junker, M.O. Data-mining and Extraction: The gold rush of AI on Indigenous Languages. In Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages, St. Julians, Malta, 21–22 March 2024; pp. 52–57. [Google Scholar]
  11. Mihai, G. Comparison between relational and NoSQL databases. Econ. Appl. Inf. 2020, 3, 38–42. [Google Scholar] [CrossRef]
  12. Kamal, S.H.; Elazhary, H.H.; Hassanein, E.E. A qualitative comparison of nosql data stores. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 330–338. [Google Scholar] [CrossRef]
  13. Origlia, A.; Rossi, S.; Martino, S.D.; Cutugno, F.; Chiacchio, M.L. Multiple-source data collection and processing into a graph database supporting cultural heritage applications. J. Comput. Cult. Herit. (JOCCH) 2021, 14, 1–27. [Google Scholar] [CrossRef]
  14. Pansara, R.R. Graph Databases and Master Data Management: Optimizing Relationships and Connectivity. Int. J. Mach. Learn. Artif. Intell. 2020, 1, 1–10. [Google Scholar]
  15. Rodriguez, M.A. The gremlin graph traversal machine and language. In Proceedings of the 15th Symposium on Database Programming Languages, Pittsburgh, PA, USA, 27 October 2015; pp. 1–10. [Google Scholar]
  16. Buil-Aranda, C.; Arenas, M.; Corcho, O.; Polleres, A. Federating queries in SPARQL 1.1: Syntax, semantics and evaluation. J. Web. Semant. 2013, 18, 1–17. [Google Scholar] [CrossRef]
  17. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A.; et al. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 1433–1445. [Google Scholar]
  18. Cattell, R.G.G.; Barry, D.K.; Berler, M. (Eds.) The Object Data Standard: ODMG 3.0; Morgan Kaufmann: San Francisco, CA, USA, 2000. [Google Scholar]
  19. Torres, A.; Galante, R.; Pimenta, M.S.; Martins, A.J.B. Twenty years of object-relational mapping: A survey on patterns, solutions, and their implications on application design. Inf. Softw. Technol. 2017, 82, 1–18. [Google Scholar] [CrossRef]
  20. Kukreja, S.; Kumar, T.; Bharate, V.; Purohit, A.; Dasgupta, A.; Guha, D. Vector Databases and Vector Embeddings-Review. In Proceedings of the 2023 International Workshop on Artificial Intelligence and Image Processing (IWAIIP), Yogyakarta, Indonesia, 1–2 December 2023; pp. 231–236. [Google Scholar]
  21. Medjahed, B.; Ouzzani, M.; Elmagarmid, A. Generalization of Acid Properties; Encyclopedia of Database Systems; Springer: Boston, MA, USA, 2009; pp. 1221–1222. [Google Scholar]
  22. Eddelbuettel, D. A brief introduction to redis. arXiv 2022, arXiv:2203.06559. [Google Scholar]
  23. Thomsen, E. OLAP Solutions: Building Multidimensional Information Systems; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  24. OpenAI. Available online: https://openai.com/ (accessed on 26 June 2024).
  25. Gensim: Topic Modelling for Humans—models.doc2vec: Doc2vec Paragraph Embeddings. Available online: https://radimrehurek.com/gensim/models/doc2vec.html (accessed on 26 June 2024).
  26. Getting Started with the Built-in Bert Algorithm. Available online: https://cloud.google.com/ai-platform/training/docs/algorithms/bert-start (accessed on 26 June 2024).
  27. Gao, S.; Xu, Z. Research on Automatic Clustering Technique of Chinese Words in Statistical Language Model. Comput. Eng. Appl. 2003, 11, 69–70. [Google Scholar]
Figure 1. ERD stored in a key-value NoSQL database.
Figure 2. ERD stored in a document database.
Figure 3. ERD stored in a graph database.
Figure 4. 3-nearest query in a vector database.
Figure 5. Prompt to fetch the metadata from a record.
Figure 6. An experience report as an example.
Figure 7. Response from GenAI.
Figure 8. Conversion to JSON format for DocumentDB import.
Figure 9. A record inserted with Cypher commands.
Figure 10. A simple ERD database established.
Figure 11. Prompt for finding the Node, Relation, and Property of a record.
Figure 12. The response from OpenAI.
Figure 13. Cypher commands generated automatically.
Figure 14. Pseudocode of vector embedding.
Figure 15. Getting the vector from OpenAI.
Figure 16. The response from the OpenAI API.
Figure 17. A sample South-Paiwan oral conversation record in Romanized script.
Figure 18. The result of analyzing Figure 17 with ChatGPT-4o.
Table 2. The element nodes and relations used to construct an ERD graph.

Node:
  Name      Property
  Author    Name, Number
  Subject   Topic, File, Date
  Label     Tag

Relationship:
  Node      Relation   Node
  Author    Build      Subject
  Subject   Have       Label
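To make the schema in Table 2 concrete, the following is a minimal sketch of how a single record could be inserted into this ERD graph with the Neo4j Python driver, in the spirit of the Cypher commands shown in Figure 9. The connection URI, credentials, and sample field values are illustrative placeholders, not the project's actual configuration.

# A minimal sketch of building the Table 2 ERD graph in Neo4j.
# URI, credentials, and the sample record are illustrative placeholders.
from neo4j import GraphDatabase

URI = "bolt://localhost:7687"   # assumed local Neo4j instance
AUTH = ("neo4j", "password")    # placeholder credentials

INSERT_RECORD = """
MERGE (a:Author {Name: $author_name, Number: $author_number})
MERGE (s:Subject {Topic: $topic, File: $file, Date: $date})
MERGE (l:Label {Tag: $tag})
MERGE (a)-[:Build]->(s)
MERGE (s)-[:Have]->(l)
"""

def insert_record(driver, record):
    # MERGE rather than CREATE keeps repeated imports idempotent:
    # re-running the same record does not duplicate nodes or edges.
    with driver.session() as session:
        session.run(INSERT_RECORD, **record)

if __name__ == "__main__":
    sample = {
        "author_name": "A. Researcher",          # hypothetical values
        "author_number": "001",
        "topic": "Harvest festival interview",
        "file": "interview_042.txt",
        "date": "2024-05-17",
        "tag": "oral-history",
    }
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        insert_record(driver, sample)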
Table 3. Comparison of different databases for ERD storage.

                           Relational     Key-Value      Document   Graph                 Vector
                           Database       NoSQL          Database   Database              Database
Cost                       High           Low            Low        Low                   High
Configuration              Easy           Easy           Medium     Medium                Hard
Performance                High           High           Low        Medium                Low
Schema-defined             Hard           No need        No need    Hard                  Difficult
Data Insert/Modify/Delete  Easy           Easy           Easy       Hard                  TBD
Semantic Query             No             No             Keywords   Keywords, Relations   Keywords, Semantic Similarity
GenAI-assisted             Not evaluated  Not evaluated  Metadata   Metadata, Relations   Metadata, Vector embedding
Overall                    Not good       Not good       Good       Good                  Incomplete
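As a concrete illustration of the GenAI-assisted vector-embedding row in Table 3, and of the workflow sketched in Figures 14–16, the following is a minimal sketch using the OpenAI Python client. The model name and sample text are assumptions rather than the exact configuration used in the study.

# A minimal sketch of obtaining a text embedding from the OpenAI API.
# Model choice and the input text are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # Request a dense vector representation of the text; the returned
    # vector can then be stored in a vector database with its metadata.
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model name
        input=text,
    )
    return response.data[0].embedding

if __name__ == "__main__":
    vector = embed("A field note about the harvest festival.")  # placeholder text
    print(len(vector))  # 1536 dimensions for this model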