**2. Literature Background**

The literature review method proposed by Webster and Watson (2002) was followed to methodically analyze and synthesize quality literature. The goal of the literature review is to gain an understanding of the current knowledge base with regard to the role of data governance in creating trust in data science decision-making outcomes. In order to understand the duality of data governance, we discuss literature which helps us understand how data governance structures organizations, taking into account research into the adoption and impact of technology on organizations as suggested by research on other disruptive technologies such as artificial intelligence (AI) and the internet of things (IoT). This paper utilizes the duality of technology theory (Orlikowski 1992) as a practice lens for studying the role of data governance in the creation of trust in data science and follows the case study methodology to investigate this phenomenon. The propositions that are investigated in the case studies are synthesized from the literature following the logic of duality of technology.

Based on Giddens' (1976) theory of structuration, duality of technology (Orlikowski 1992) describes technology as assuming structural properties while being the product of human action. Giddens (1976) recognizes that "human actions are both enabled and constrained by structures, yet that these structures are the results of previous actions" (Orlikowski 1992, p. 404). In her structuration model of technology, Orlikowski (1992) identifies four main relationships, namely: (1) technology as a product of human agency, (2) technology as a medium of human agency, (3) organizational conditions of interaction with technology and, (4) organizational consequences of interaction with technology. The technology referred to in this article is data science. Data governance is about the coordination and control of the use and management of data (Janssen et al. 2020; Khatri and Brown 2010). The objective of this article is to understand the role of data governance as a boundary condition for data science. As such, this article looks at the role of data governance in data science using the duality of technology as a guiding logic. Figure 1 below shows how the synthesized propositions and their elements are linked following the logic of the duality of technology.

**Figure 1.** The relationship of the propositions with duality of technology.

According to Orlikowski (1992), technology is created as a result of human agency. In order for the process of technology creation to be successful, certain organizational boundary conditions need to be met. The resulting technology also has consequences for the organization, which need to be coordinated and controlled. For example, in order to develop a data science capability for an organization, it is necessary to have the required information technology (IT) infrastructure in place, available data, and sufficient data scientists with the necessary knowledge, requiring large investments (Adrian et al. 2017).

### *2.1. The Role of Data Governance with Regards to Data Science as a Product of Human Agency*

According to Gao et al. (2015), data scientists develop domain expertise over time and apply this knowledge in big data analysis to gain the best results. However, data scientists are subject to their own intellectual limitations as well as the computational limitations of the available technology (Gigerenzer and Selten 2002). As a result, although data scientists often seek to compensate for limited resources by exploiting known regularity, bias and variance can create errors in the decision outcomes, which can be exacerbated by large data sets (Brain and Webb 2002). Following the logic of bounded rationality (Simon 1947), data scientists develop models based on their own limited knowledge. The models are therefore constrained by the intellectual limitations of their makers, the quality of the data from which they learn, and the technical infrastructure in which they operate.

Big data can present organizations with complex challenges in the management of data quality. According to Saha and Srivastava (2014), the massive volumes, high velocity, and large variety of automatically generated data can lead to serious data quality management issues, which can be difficult to manage in a timely manner (Hazen et al. 2014). For example, IoT sensors calibrated to measure the salinity of water may, over time, begin to provide incorrect values due to biofouling. Data science information products often rely on near real-time data to provide timely alerts, and, as such, problems may arise if these data quality issues are not detected and corrected in time (Gao et al. 2015; Passi and Jackson 2018).
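The sensor example can be made concrete with a small sketch of an automated data quality check. This is an illustration only and is not drawn from the cited studies: the function name `detect_drift`, the window sizes, and the tolerance are hypothetical, and the check assumes that the true salinity at the measuring point is roughly stable, which will not hold in every setting.

```python
from statistics import median


def detect_drift(readings, baseline_window=5, recent_window=5, tolerance=0.5):
    """Flag possible sensor drift in a series of readings.

    Compares the median of the earliest readings (assumed to come from a
    freshly calibrated sensor) with the median of the most recent readings.
    All parameter values are illustrative assumptions.
    """
    if len(readings) < baseline_window + recent_window:
        return False  # too little data to compare the two windows
    baseline = median(readings[:baseline_window])
    recent = median(readings[-recent_window:])
    return abs(recent - baseline) > tolerance


# Hypothetical salinity readings from two sensors
stable = [35.0, 35.1, 34.9, 35.0, 35.1, 35.0, 34.9, 35.1, 35.0, 35.0]
fouled = [35.0, 35.1, 34.9, 35.0, 35.1, 34.6, 34.2, 33.9, 33.5, 33.1]

print(detect_drift(stable))
print(detect_drift(fouled))
```

A check of this kind only signals that a sensor may need attention. Deciding who is alerted, and who is responsible for recalibration, remains a matter for the governance arrangements discussed in this section.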

Modern data processing systems, which are required to ingest large amounts of varied big data (Dwivedi et al. 2017) without compromising the data structure, are generally immediately accessible, allowing users to utilize dynamic analytical applications (Miloslavskaya and Tolstoy 2016; Ullah et al. 2018). This immediate accessibility, as well as the retention of data in its original format, presents a number of challenges regarding the governance of the data, including data security and access control (Madera and Laurent 2016), as well as maintaining compliance with regard to privacy (Morabito 2015). As such, data governance has increasingly gained popularity as a means of ensuring and maintaining compliance, and Madera and Laurent (2016) have gone so far as to posit that data governance principles should be key components of data science technologies for managing risk related to privacy and security. According to Kroll (2018), a responsible data governance strategy should include strategies and programs in both information security and privacy.

**Proposition 1.** *Organizations with an established data governance capability are more likely to have a well-functioning data science capability.*

Proposition 1 considers the interaction of data science with human agency from a product perspective. In other words, data governance is believed to play an essential role in coordinating and controlling the development of data science as a capability of the organization.

### *2.2. The Role of Data Governance with Regards to Data Science as a Medium of Human Agency*

Data science differs from traditional science in a number of ways (Dhar 2013; Provost and Fawcett 2013). Traditionally, scientists study a specific subject and gather data about that subject. This data is then analyzed to gain in-depth knowledge about that subject. Data scientists tend to approach this process by gathering a wide variety of existing data and identifying correlations within the data which provide previously unknown or unexpected practical insights. Data scientists gain domain expertise and apply this knowledge in big data analysis to gain the best results (Gao et al. 2015). However, the trustworthiness of data science outcomes in practice is often affected by tensions arising through ongoing forms of work (Passi and Jackson 2018). According to Passi and Jackson (2018), data science is a socio-material practice in which human agency and technology are mutually intertwined. de Medeiros et al. (2020) therefore stress the importance of developing a "data-driven culture."

Data governance is important for creating value and moderating risk in data science initiatives (Foster et al. 2018; Jones et al. 2019), as it can help organizations make use of data as a competitive asset (Morabito 2015). Data governance aims at maximizing the value of data assets in enterprises (Otto 2011; Provost and Fawcett 2013). For example, capturing electric and gas usage data every few minutes benefits the consumer as well as the provider of energy. With active governance of big data, isolation of faults and quick fixing of issues can prevent systemic energy grid collapse (Malik 2013).

**Proposition 2.** *Organizations with established data governance capability are more likely to generate trusted data science outcomes.*

Proposition 2 looks at the interaction of data science with human agency from a medium perspective. In other words, data governance is expected to play an important role in coordinating and controlling the use of data science in organizations.

### *2.3. The Role of Data Governance with Regards to Organizational Conditions of Interaction with Data Science*

A common challenge in data science is aligning the data science inputs and outcomes with the structure of an organization (Janssen et al. 2020). This mismatch can result in unclear responsibilities and a lack of coordination mechanisms which give organizations control of the data over its entire life-cycle. This is particularly the case for data science projects which require data inputs from multiple departments. There is often a lack of established mechanisms for data governance leading to the ad hoc handling of data (Janssen et al. 2020). According to Wang et al. (2019) it is necessary to develop data governance mechanisms beginning with policy development to define governance goals and strategies, followed by the establishment of organizational data governance structures. Top management support (Gao et al. 2015), well-defined roles and responsibilities (Saltz and Shamshurin 2016), and the choice of the data governance approach (Koltay 2016) are considered critical. According to Janssen et al. (2020), data governance contains mechanisms to encourage preferred behavior. Incentives such as monetary rewards or public recognition should be complemented by mechanisms such as audits. Creating sound data governance requires a balance between complete control, which does not allow for flexibility, and lack of control (Janssen et al. 2020).

Research has shown that favoring analytical techniques over domain knowledge can lead to risks related to the incorrect interpretation of the data (Provost and Fawcett 2013). Waller and Fawcett (2013) therefore believe that a data scientist should have a good understanding of the subject matter as well as strong analytical skills. For example, recent years have seen a surge of interest in predictive maintenance and anomaly detection in the asset management domain (Raza and Ulansky 2017); however, when implementing data science for predictive maintenance or anomaly detection, data scientists also need to have a strong understanding of how assets deteriorate over time. Furthermore, according to Kezunovic et al. (2013), much of the data may not be correlated in time and space, or may not have a common data model, making it difficult to understand without in-depth knowledge of how or why the data has been generated. As the number of people with data science skills as well as in-depth domain knowledge is limited (Waller and Fawcett 2013), these insights suggest that data science initiatives should be governed by people with in-depth domain knowledge. According to Wang et al. (2019), organizations should develop comprehensive data governance mechanisms, beginning with policy development to define governance goals and strategies, followed by the establishment of organizational data governance structures.

**Proposition 3.** *Organizations with an established data governance capability are more likely to ensure that organizational conditions of data science are met.*

Proposition 3 considers the role of data governance as being important for coordinating and controlling the organizational requirements of the data science capability.

### *2.4. The Role of Data Governance with Regards to the Organizational Consequences of Data Science*

As well as establishing data management processes that manage data quality, data governance should also ensure that the organization's data management processes are compliant with laws, directives, policies, and procedures (Wilbanks and Lehman 2012). According to Cato et al. (2015), policies and principles should be aligned with business strategies in an enterprise data strategy. Panian (2010) states that establishing and enforcing policies and processes around the management of data should be the foundation of effective data governance practice, as using big data for data science often raises ethical concerns. For example, automatic data collection may cause privacy infringements (Cecere et al. 2015; van den Broek and van Veenstra 2018), such as in the case of cameras used to track traffic on highways, which often record personally identifiable data such as number plates or faces of persons in the vehicles.

Data governance processes should ensure that personally identifiable features are removed before data is shared or used for purposes other than legally allowed (Narayanan et al. 2016). Data governance should, therefore, establish what specific policies are appropriate (Khatri and Brown 2010) and applicable across the organization (Malik 2013). For example, Tallon (2013) states that organizations have a social and legal responsibility to safeguard personal data, whilst Power and Trope (2006) suggest that risks and threats to data and privacy require diligent attention from organizations.
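As a minimal illustration of removing personally identifiable features before data is shared, the following sketch drops a policy-defined set of fields from a traffic observation record. The field names, the `PII_FIELDS` set, and the `strip_pii` function are hypothetical and are not taken from the cited literature. Removing direct identifiers in this way does not by itself guarantee that individuals cannot be re-identified from the remaining attributes.

```python
# Hypothetical set of fields that a data governance policy marks as
# personally identifiable for traffic camera records.
PII_FIELDS = frozenset({"number_plate", "face_image_id"})


def strip_pii(record, pii_fields=PII_FIELDS):
    """Return a copy of the record without the fields listed in pii_fields."""
    return {key: value for key, value in record.items() if key not in pii_fields}


# Hypothetical observation captured by a highway camera
observation = {
    "camera_id": "A12-north",
    "timestamp": "2020-03-01T08:15:00",
    "vehicle_class": "car",
    "number_plate": "XX-123-Y",
    "face_image_id": "img-000451",
}

shareable = strip_pii(observation)
print(sorted(shareable))
```

Keeping the list of identifying fields in one policy-controlled place, rather than scattered across analysis scripts, is one way in which a rule set by data governance could be applied consistently across the organization.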

**Proposition 4.** *Organizations with an established data governance capability are more likely to be able to manage organizational and process changes introduced by the data science outcomes.*

Proposition 4 considers the role of data governance as being important for coordinating and controlling the organizational consequences of data science outcomes.
