*3.3. Semiotic Framework*

Semiotics is the study of signs and symbols and of how they convey meaning to their users. Data quality researchers have adopted the semiotic perspective on data; for instance, Price and Shanks [29] identified three data quality levels: syntactic, semantic, and pragmatic. Semiotic theory concerns the use of symbols to convey knowledge and defines levels for analysing the structure, physical form, meaning, and usage of data. A closer examination of these levels shows that the pragmatic level is associated with knowledge, the semantic level with information, and the syntactic level with data. In other words, the dimensions operating at the pragmatic, semantic, and syntactic levels pertain to the quality of knowledge, information, and data, respectively.

According to Falkenberg et al. [66], data are meaningful symbolic creations consisting of a limited arrangement of signs and symbols. Thus, the semiotic framework was used in this study to define data quality dimensions. The semiotic framework consists of four levels: empiric, syntactic, semantic, and pragmatic. Each level facilitates data quality evaluation from a different perspective (structure, data, information, or knowledge) when assessing highway infrastructure data for decision-making at the various levels of the highway decision-making hierarchy, for instance, when selecting a treatment technique for damaged pavement in a highway construction project.

Each decision-making level bases its decisions on the raw data, information, and knowledge available at that level. The strategic level is the top level of an organisation and is responsible for strategic planning. This involves making long-term, big-picture decisions and establishing policies that affect the whole organisation. For the treatment-technique decision, system performance policies are established at this level (policymaking), which requires knowledge. At the network level, fund distribution (planning) decisions are made, i.e., funds are allocated according to project requirements. At the program level, pavement evaluation and prioritisation are carried out for each project. At the project selection level, projects are selected according to the prioritisation made at the program level. Finally, treatment selection is made at the project level.

Kahn et al. [67] addressed the relationships between semiotic levels, the data-information-knowledge (DIK) hierarchy, and associated data quality issues, as shown in Figure 2. The relationship between semiotic levels and structure, data, information, and knowledge facilitates the identification of unique data quality issues that may require specialised skills to resolve. Knoke and Yang [68] claimed that, in the DIK hierarchy, information originates with data and is transferred to knowledge. Depending on how the meaning, structure, and operation of data are communicated at different semiotic levels of the DIK hierarchy, such transference could increase or decrease the meaningfulness, transferability, and applicability of the data.

**Figure 2.** Semiotic levels and the data-information-knowledge hierarchy, adapted from [32]. 2018, Huang.

The empiric level focuses on the quality of data access and the means of communication. It considers how much raw data is available to stakeholders for decision-making, and in what way. In highway projects, decision-makers at each project phase (preconstruction, construction, and post-construction) consider data availability essential for effective decision-making. At the empiric level, accessibility, security, and timeliness (currentness) are used to evaluate the communication and access perspective of the raw data stored in the data lake [68]. For example, the accessibility dimension of highway data could be the availability of real-time traffic data on a particular highway. If the data are easily accessible through an open data portal such as a data lake, API, or mobile app, they would have a high level of accessibility. On the other hand, if the data are only available through a difficult-to-navigate website or require complex technical skills to retrieve, they would have a low level of accessibility.

The syntactic level, by contrast, concentrates on the form and structure of data, that is, their physical form rather than their content. Once the accessibility of raw data has been assessed, the second crucial constraint for decision-makers is the kind and format of the accessible data. To quantify the structure of raw data stored in a data lake, the syntactic level considers accuracy, concise presentation, ease of operation, consistency, integrity, and completeness as data quality dimensions [49]. For instance, the accuracy dimension in highway infrastructure data could be the precision of the measurements taken for the width of a particular road lane. Inaccurate measurements could lead to lanes that are too narrow, potentially causing safety issues or impeding traffic flow.

The semantic level of data quality is concerned with the meaning of data for information generation rather than with the data themselves [69]. The decision-makers at the program and project selection decision-making levels require information regarding project performance for decisions such as budget allocation and project prioritisation. The dimensions at the semantic level are credibility, interpretability, and understandability, which assess the interpretation of data that provides meaning. For example, the interpretability dimension refers to the ease with which stakeholders can understand and use data. In the context of highway data, interpretability could be the use of visualisations or dashboards that make it easier for stakeholders to understand complex data sets. This could include interactive maps or charts that allow users to explore different aspects of highway infrastructure data, such as traffic volume or accident rates.

The pragmatic level is concerned with the relationship between data, information, and behaviour in a specific decision-making context [69]. Generating knowledge from the available data and information for policymaking and planning at the strategic and network levels of highway infrastructure projects requires data of high utilisation quality. Dimensions of data quality associated with the pragmatic level include appropriateness, value-addition, reputation, relevancy, and usefulness [68]. The contextual features of pragmatic concerns relate to the relevance and utility of data and information for making decisions. As a dimension, reputation focuses on the user's expectations of data utility, while value-addition aims to capture the user's intent. These facets concern the data's suitability for the task at hand. Related data quality dimensions are concerned with the intended application, i.e., how the data would be utilised in connection with the current issue [70]. For instance, value-addition refers to the extent to which data are valuable and add value to the organisation or the individual stakeholders using them. In the context of highway data, this could involve using data analytics and machine learning algorithms to identify patterns and trends that are not immediately apparent. This could help highway agencies to identify areas of the highway system that require additional investment or maintenance and to prioritise their efforts accordingly.

Consequently, each semiotic level handles particular data quality and communication concerns. Understanding the overall utilisation of highway infrastructure data stored in the data lake for decisions at each decision-making level depends on the quality dimensions of the semiotic levels [32]. Within each semiotic level, it is crucial to identify the data quality requirements of decision-makers at their respective decision-making levels. For instance, strategic-level decision-makers focus on the utility of data and information for making effective policies throughout the organisation. Similarly, the other decision-making levels require data quality specific to the needs of their decision-makers. Table 2 shows the data quality dimensions and their perspectives, together with the semiotic framework categories.
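The grouping of dimensions by semiotic level described above can be sketched as a simple lookup structure. This is only an illustration: it lists the dimensions named in the text for each level rather than reproducing Table 2 (the text introduces the pragmatic dimensions with "include", so the list may not be exhaustive), and the names `SEMIOTIC_DIMENSIONS` and `level_of` are hypothetical.

```python
# Dimensions named in the text for each semiotic level.
# Illustrative only: this is not a copy of Table 2 and may be incomplete.
SEMIOTIC_DIMENSIONS = {
    "empiric": ["accessibility", "security", "timeliness"],
    "syntactic": [
        "accuracy",
        "concise presentation",
        "ease of operation",
        "consistency",
        "integrity",
        "completeness",
    ],
    "semantic": ["credibility", "interpretability", "understandability"],
    "pragmatic": [
        "appropriateness",
        "value-addition",
        "reputation",
        "relevancy",
        "usefulness",
    ],
}


def level_of(dimension: str) -> str:
    """Return the semiotic level under which a dimension is listed."""
    for level, dimensions in SEMIOTIC_DIMENSIONS.items():
        if dimension in dimensions:
            return level
    raise KeyError(f"unknown dimension: {dimension}")
```

For example, `level_of("interpretability")` returns `"semantic"`, which matches the level at which the text discusses dashboards and visualisations.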

Applying a semiotic framework can be considered one of the philosophical approaches to studying data and its quality. In a semiotic framework, a top-down approach involves starting with high-level concepts or theories and breaking them into their constituent parts to understand how they work. NHAI's decision-making hierarchy also follows a top-down approach: higher officials at the top of the organisational structure make the authority's decisions and then communicate them to lower-level employees for implementation. Overall, by using a semiotic framework for data quality assessment, NHAI can help ensure that its decision-making processes are informed by high-quality data that are relevant, accurate, and consistent. This can help to improve the efficiency and effectiveness of NHAI's operations and ensure that its highway and road networks are developed and maintained to the highest standard. However, the semiotic perspective has not become popular among researchers and practitioners to date [71]. The present study uses semiotic categories to describe highway infrastructure data quality, specifically to identify the dimensions needed to assess data quality for effective decision-making [29]. To date, no research has been reported on the link between data quality dimensions for highway infrastructure data and the semiotic levels that represent them.


**Table 2.** Data quality (DQ) dimensions and perspectives as per the semiotic framework levels.

### **4. Methodology**

To meet the research objectives, this study was carried out in three steps. The first step was to identify the data quality dimensions of highway infrastructure data using the semiotic framework; the dimensions most applicable to highway infrastructure projects were selected. In the second step, a questionnaire was prepared for the data quality dimensions finalised in step one, and responses were collected from highway infrastructure stakeholders. Finally, in the third step, the responses were analysed to identify the critical dimensions and to rank them according to their mean value. These steps are described in detail in the following sub-sections.

#### Step 1: Identification of Data Quality Dimensions of Highway Infrastructure Data Using the Semiotic Framework

The semiotic framework consists of 43 data quality dimensions, as defined by Tejay et al. [72]. These dimensions were defined in the context of information system security. For the study of data quality in highway infrastructure projects, the 43 dimensions were reduced to 20 according to the relevant literature sources. Dimensions that are synonymous were combined and treated as a single dimension. For example, accessibility, portability, and locatability have a similar meaning in the context of data quality; thus, accessibility was taken as the primary dimension. The established data quality dimensions were used to determine the data quality of highway infrastructure data. To verify the exhaustiveness and comprehensiveness of the selection, the 20 dimensions were reviewed in person with three highway stakeholders: one chief general manager from the headquarters office, responsible for network-level decision-making; one regional officer from the regional office, responsible for program-level and project selection-level decision-making; and one project director from the project-implementing unit, responsible for project-level decision-making. The chief general manager had more than ten years of experience, the project director had eight years, and the regional officer had six years in highway construction projects. The responses were not uniform, and the experience of the stakeholders was considered a limitation. Hence, all 20 dimensions were retained for the questionnaire survey to obtain a comprehensive understanding of highway data quality for effective decision-making.

#### Step 2: Data Collection

The questionnaire was designed based on the 20 data quality dimensions identified in Step 1. The survey targeted decision-makers at the National Highway Authority of India (NHAI) who use these data in decision-making. A pilot study with 40 responses was undertaken to test the language and comprehensibility of the questionnaire. These responses were from site engineers, deputy engineers, and managers in the project-implementing units and regional offices. Following the suggestions from the pilot study, some significant changes were made to the questionnaire to make it more understandable for the stakeholders. The questionnaire was then shared via Google Forms with 220 stakeholders, including members, chief general managers, managers, regional officers, deputy general managers, and project directors. A total of 105 experts participated in the survey, a response rate of 48%. These stakeholders have significant experience and deal with critical decisions at NHAI, representing the strategic, network, program, project selection, and project levels. The questionnaire consists of three parts. Part 1 covers basic contact details, role, responsibility, and decision-making level in the decision-making hierarchy. Part 2 evaluates the importance of each attribute of the available data at each decision-making level. Part 3 deals with ranking the dimensions, which states the priority of the dimensions required in decision-making within each category of the semiotic framework.

A five-point Likert scale (1 to 5) was used to record the level of importance that decision-makers assigned to each data quality attribute. Here, '1' refers to "no importance," '2' refers to "low importance," '3' refers to "somehow important," '4' refers to "important," and '5' refers to "high importance" [73].

#### Step 3: Data Analysis

The data were analysed using the software package SPSS 25. The analysis was carried out in two parts. The first part assessed the reliability of the data using Cronbach's alpha test. Cronbach's alpha was found to be 0.875 at a 5% significance level, which is greater than the threshold of 0.5 and hence confirmed the reliability of the data. In the second part, the dimensions were ranked according to their mean value to measure the consensus in the experts' opinions. When the mean values of two or more dimensions were identical, the dimension with the lowest standard deviation was placed higher [74]. The ranking of the dimensions based on the data collected through the questionnaire survey is shown in Table 3.
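The two analysis parts can be sketched in a few lines of Python. This is not the SPSS procedure used in the study; it is a minimal illustration of the standard Cronbach's alpha formula, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$, followed by a ranking by mean with ties broken by the lower standard deviation. The function names, the rounding of means to two decimals before comparison, and the sample responses are all assumptions made for the example.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one Likert score per item), using sample variances."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def rank_dimensions(names, responses):
    """Rank dimensions by mean score (highest first); when rounded means
    are equal, the dimension with the lower standard deviation ranks higher."""
    stats = []
    for i, name in enumerate(names):
        scores = [row[i] for row in responses]
        stats.append((name, mean(scores), stdev(scores)))
    # Rounding to two decimals is an assumption about how ties are detected.
    stats.sort(key=lambda s: (-round(s[1], 2), s[2]))
    return [(rank, name, m, sd) for rank, (name, m, sd) in enumerate(stats, 1)]


# Hypothetical responses on the 1-5 Likert scale (not data from the survey).
names = ["accessibility", "accuracy", "timeliness"]
responses = [
    [5, 4, 3],
    [4, 4, 5],
    [5, 4, 4],
    [4, 4, 4],
]

for rank, name, m, sd in rank_dimensions(names, responses):
    print(rank, name, round(m, 2), round(sd, 2))
```

In this made-up sample, accuracy and timeliness have the same mean, so accuracy ranks above timeliness because its scores show less spread, which mirrors the tie-breaking rule described above.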


**Table 3.** Ranking of data quality dimensions.
