**4. Findings**

The results of the case studies were analyzed using a combination of within-case and cross-case analysis. In Sections 4.1 and 4.2 the within-case analysis is reported using the theory of the duality of technology as a guiding logic. The cases have been anonymized. Section 4.3 reports the cross-case analysis.

### *4.1. Project A: Asphalt Life Expectancy*

The organization under whose auspices project A is managed is a public organization in Europe tasked with the management and maintenance of public infrastructure, including the construction and maintenance of roads. The organization has an annual budget of approximately €200 million for asphalt maintenance, with operational parameters traditionally focused on traffic safety. According to interviewee 4, "this has led to increasing overspend due either to premature maintenance, or too expensive emergency repairs in the past." Interviewee 5 stated that the prediction of asphalt lifetime based on traditional parameters has been shown to be correct "one-third of the time."

According to staff members, the organization has implemented data governance for its big data in order to remain "future-proof, agile, and to improve digital interaction with citizens and partners." According to interviewee 3, "(*the organization*) wants to be careful, open, and transparent about the way in which it handles big and open data and how it organizes itself."

### 4.1.1. Data Science as Product of Human Agency

The data science model utilized more than 40 different datasets, which were fed into a data lake from the various source systems using data pipelines. These datasets included data related to traditional inspections, historical data generated during the laying of the asphalt, road attribute data, and planning data, as well as automatically generated streaming data, such as weather data, traffic data, and IoT sensor data. The current model takes about 400 parameters into consideration. According to interviewee 2, "this number will only grow, as the (*project partners*) continue to supply new data." The ultimate goal of the project is a model that can accurately predict the lifespan of a highway. In the model, higher-order relationships between the datasets were discovered using machine learning techniques such as decision trees, random forests, and naïve Bayes algorithms. Neural networks were used to reduce overfitting and improve generalization error, and gradient boosting was used to efficiently minimize the selected loss function.
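As an illustration of how gradient boosting minimizes a loss function, the following is a minimal sketch under stated assumptions, not the project's model. It assumes a squared-error loss, a single hypothetical feature, and decision stumps as weak learners; the function names and the toy data are invented for illustration, and the source does not state which loss, features, or library were used.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    # Excluding the largest value guarantees a non-empty right-hand side.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit an additive model of stumps; each round fits the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)


# Hypothetical toy data: a load index (x) against observed lifetime in years (y).
xs = [1, 2, 3, 4, 5, 6]
ys = [14.0, 13.0, 12.0, 9.0, 8.0, 7.0]
model = fit_gradient_boosting(xs, ys)
```

Each round takes a small step (scaled by `learning_rate`) in the direction that most reduces the training loss, which is the sense in which boosting "minimizes the selected loss function". A production model over some 400 parameters would normally rely on an established library rather than hand-written code.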

The organization has implemented a policy of providing knowledge, tools, and a government-wide contact network in which best practices are shared with other government organizations. These best practices relate to the organization of data management, data exchange with third parties, data processing methods, and individual training. Furthermore, the organization has introduced the policy of assessing and publishing the monetary cost of data assets in order to raise awareness of the importance of data quality management. According to interviewee 2, "managers are required to know the cost of producing their data." This means that every process and every organizational unit is encouraged to be aware of its data needs and the incurred costs. The data is then considered a strategic asset and a production input.

### 4.1.2. Data Science as Medium of Human Agency

The goal of project A is to reduce spending by extending the lifespan of asphalt where possible while reducing the number of emergency repairs made through predictive, "just-in-time" maintenance. Using available big data in a more detailed manner, such as raveling data combined with vehicle overloading data, has doubled the prediction consistency. According to interviewee 1, improving the accuracy of asphalt lifetime prediction "has enabled better maintenance planning, which has significantly reduced premature maintenance, improving road safety and cost savings, and reducing the environmental impact due to reduced traffic congestion and a reduction in CO<sub>2</sub> emissions."

### 4.1.3. Organizational Conditions of Data Science

The organization has translated their policy and principles into a data strategy in which the opportunities, risks, and dilemmas of their policies and ambitions are identified in advance and are made measurable and practicable. Interviewee 3 reported that the organization has also asked the data managers in the organization to appoint a sponsor or data owner. By means of the above control and design measures, the organization ensures that the data ambitions are operationalized.

The organization has invested heavily in the fields of big data, open data, business intelligence, and analytics. Interviewee 5 believed that "the return (of the investment) stands or falls with the quality of data and information." As such, according to interviewee 5, "the underlying quality of the data and information is very important to work in an information-driven way and as much as 70% of production time has been lost in almost every department due to inadequate data quality." The organization has, therefore, implemented a data quality framework to improve its control of data quality. The data quality management process consists of eight steps, the first three of which identify (1) the data to be produced, (2) the value of the data for the primary processes, and (3) a data owner. The data owner is the business sponsor.

### 4.1.4. Organizational Consequences of Data Science

Once ownership had been established, the current and desired future situations were assessed in terms of production and delivery. Interviewee 2 reported that a roadmap was then established, which was translated into concrete actions. According to the interviewee, "the final step in the process was the actual production and delivery of data in accordance with the agreement." To further improve its grip on data quality, the organization has developed its own automatic auditing tool, used in combination with a manual auditing tool, to monitor the quality of the data as a product. According to interviewee 3, these tools "ensure that quality measurements were mutually comparable," and "... cause changes in the conscious use of data as a strategic asset." Data quality measurement is centralized; the goal is to ensure a standardized working method. However, the organization maintains the policy that every data owner is responsible for improvements to the data management process and the data itself. The data quality framework is based on fitness for use, and data quality measurement is maintained according to 8 main dimensions and 47 subdimensions. Terms and definitions are coordinated with legal frameworks related to the environment to ensure compliance. Responsibilities relating to compliance with privacy laws are centralized, and privacy officers are assigned to this role. The CIO has the final responsibility for ensuring that privacy and security are managed and maintained; however, data owners are responsible for ensuring compliance with dataset-specific policy and regulations.
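The idea of centrally measured, mutually comparable quality scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the organization's auditing tool: the two dimensions shown (completeness and validity), the field names, and the validity rule are hypothetical, since the source does not name the 8 dimensions or 47 subdimensions.

```python
def is_present(value):
    """Treat None and the empty string as missing; zero counts as present."""
    return value not in (None, "")


def completeness(records, required_fields):
    """Share of required field values that are present across all records."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in required_fields if is_present(r.get(f)))
    return filled / total


def validity(records, field, is_valid):
    """Share of the present values of `field` that pass the `is_valid` rule."""
    values = [r[field] for r in records if is_present(r.get(field))]
    if not values:
        return 1.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def quality_report(records, required_fields, validators):
    """Compute every score the same way so that datasets can be compared."""
    report = {"completeness": completeness(records, required_fields)}
    for field, check in validators.items():
        report[f"validity:{field}"] = validity(records, field, check)
    return report


# Hypothetical road-section records; field names are invented for illustration.
records = [
    {"road_id": "A1", "layer_age_years": 7},
    {"road_id": "A2", "layer_age_years": None},
    {"road_id": "", "layer_age_years": -3},
    {"road_id": "A4", "layer_age_years": 12},
]
report = quality_report(
    records,
    required_fields=["road_id", "layer_age_years"],
    validators={"layer_age_years": lambda v: 0 <= v <= 60},
)
```

Because each score is a ratio between 0 and 1 computed by one shared routine, results from different data owners can be set side by side, while the thresholds that define "fit for use" remain a decision for each owner.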

### *4.2. Project B: Fraud Detection in Electrical Grids*

Project B is a data science project designed to detect the fraudulent use of electricity within medium and low voltage grids without infringing on personal privacy rights. The project is managed under the auspices of a large European distribution grid operator (DGO). The role of the DGO is to transport the electricity from the high voltage grid to the end-user. The project was developed to improve on the discovery rates of the traditional methods utilized by the expensive commercial off-the-shelf (COTS) system that was in place at the time.

According to interviewee 5, the organization has implemented data governance as "an integral part of their digital transformation strategy." Interviewee 4 reported that "the data governance team, the data science team, and the data engineering teams are managed within the same department and report directly to the Chief Data Officer."

### 4.2.1. Data Science as Product of Human Interaction

Project B was one of the first data science projects undertaken in the organization. Developing the data science capability within the organization required the development of a managed data lake and data pipelines to ensure connectivity from the data sources. Initially, the data science model was developed in an external data lake. According to interviewee 1, this meant that "no automatic data pipelines between the internal data systems and the data science model could be established, so we were forced to improvise." This meant that the data science model needed to be initially fed with batch uploads of data. This situation was eventually rectified with the development of an internal data lake which allowed connectivity with the original data source systems.

According to interviewee 2, the data science model initially utilized two sets of data originating from smart grid terminals, "but was eventually expanded to include ten data sets after we had spent quite some time on discovery and after many hours of discussion and investigation." Two years of training data were made available to the data scientists. According to the data scientists involved in the case, understanding the data was exceptionally difficult. For example, during the project, it was discovered that the values in certain columns had been incorrectly labeled and needed to be corrected to attain the correct value, which corresponded to the required units. The data were not supplied with metadata, and finding subject matter experts with in-depth knowledge about the data was very difficult. The data scientists also discovered during the project that the OBIS codes did not follow the standardized values. The OBIS code is a unique identification of the registers in the smart meter's memory, according to IEC 62056-61.
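A check of the kind that could surface non-standard OBIS codes is sketched below. It is an illustration only: it assumes the codes arrive as text in the common `A-B:C.D.E*F` notation with an optional `F` group, and it tests only the shape of the code and that each value group fits in one byte (0 to 255). The function name is hypothetical, and the source does not describe how the team actually detected the deviations.

```python
import re

# Six value groups A to F; the F group and its separator are optional here.
OBIS_PATTERN = re.compile(
    r"^(\d{1,3})-(\d{1,3}):(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?:[*&](\d{1,3}))?$"
)


def parse_obis(code):
    """Return the value groups (A, B, C, D, E, F) as a tuple, or None if malformed.

    F is None when the optional last group is absent.
    """
    match = OBIS_PATTERN.match(code.strip())
    if match is None:
        return None
    groups = tuple(int(g) if g is not None else None for g in match.groups())
    # Each value group is a single byte, so anything above 255 is rejected.
    if any(g is not None and g > 255 for g in groups):
        return None
    return groups


def find_suspect_codes(codes):
    """List the codes that do not parse, for review by a subject matter expert."""
    return [c for c in codes if parse_obis(c) is None]


suspects = find_suspect_codes(["1-0:1.8.0*255", "1-0:2.8.0", "1.8.0", "1-0:1.8.999"])
```

A syntactic check like this only flags malformed codes. Confirming that a well-formed code refers to the intended register still requires the standard's code tables or an expert, which is precisely the knowledge the project lacked.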

Data were supplied by a subsidiary of the organization. The subsidiary was eventually sold to a third party during the project. This led to a situation whereby data owners were not available, and no single person could be found with definitive knowledge of how the data were collected and collated. Data were collated and managed by two data engineers assigned to the project. Interviewee 4 reported that collaboration between the data engineers and the data scientists was not optimal, as code was sometimes changed without sufficient documentation or collaboration. According to data scientist 1, "the engineer changed quite a lot of code without checking with us (the data scientists) first."

### 4.2.2. Data Science as Medium of Human Interaction

Reducing fraudulent usage of electricity on the middle and low voltage electrical grids without infringing on personal privacy rights is of importance for a number of reasons, although few of the reasons are directly related to the DGO itself. Fraudulent usage of electricity is essentially theft, as electricity is being used without paying the provider for the service. According to interviewee 4, "in middle and low tension grids it is especially hard to decide from whom the fraudster is stealing electricity, because there are multiple electricity providers who sell their electricity directly to the end-user but use the common grid to transport the electricity." The fraudster is essentially taking electricity out of a shared service, so it is impossible to know from whom electricity is being stolen. Furthermore, fraudsters that are caught generally only have to pay the net stolen kWh, although damage is also suffered by the network operator. This amount, the so-called "grid loss", is 70% lower than the price that consumers pay.

It is important to know how much energy is being used on the grid in advance in order to be able to balance the use of energy with the supply so that the grid is not overloaded. However, balancing of the entire electricity supply is generally performed by the transmission system operator (TSO), which manages the high voltage grid.

Catching fraudsters also requires collaboration with a number of parties, including the police. Moreover, European privacy laws dictate that the end-user is the owner of the data collected by the electricity meters, which means that DGOs are not able to read the values without permission from the end-users, which fraudsters are unlikely to give. According to interviewee 5, it is often difficult to coordinate a response to combating fraud, whilst the rewards for fraudulent usage remain high: "we are always behind fraudsters as catching them is expensive, whilst there is almost no risk for them." From a data governance perspective, this makes it especially difficult to coordinate and control the proper collection and collation of the required data.

### 4.2.3. Organizational Conditions of Data Science

The data science projects in the organization are decided upon and prioritized by managers of the primary business processes. The data scientists work in sprints of two weeks, following directions suggested by the product owner. The disruption to the project caused by the sale of the subsidiary meant that a new product owner as well as new data owners needed to be found within the organization. According to one of the data scientists, the data owners are necessary "to be able to coordinate and control the proper collection, collation, and management of the data, provide input to the data scientists regarding the content of the data (metadata) and accept and control the quality of the data science outcomes." This means that data governance officers and privacy officers attached to the department were required to develop a roles and responsibilities matrix for the management of the data and the use of the data, in concurrence with privacy regulations.

### 4.2.4. Organizational Consequences of Data Science

Despite the technological and social challenges faced during the project, the data science team reported that after an extended period of 18 months, they were able to present a workable model that greatly outperformed traditional methods of fraud detection. The model was presented to the energy management team, which had been identified as the client and the main data owner. The data science team reported that the presentation was not well-received and that the model was eventually not adopted, despite the proven improvements. The data science team believed that the reason for this was that "they didn't want to believe the results. (The organization) has spent millions on the COTS system, and they are reluctant to accept that they've made a procurement error. Their argument was that the data was unreliable, but technically it's the same data being used by the COTS system." This reaction suggests that end-users as well as data owners should be an integral part of the data science project and that not only results but also intentions should be tested throughout the project.
