#### **1. Introduction**

Competitive markets today demand the use of new digital technologies to promote innovation, improve productivity, and increase profitability [1]. The growing interest in digital technologies and their promotion across various aspects of economic activity [2] has led to a wave of applications of these technologies in manufacturing sectors. Over the years, advances in digital technologies have initiated different levels of change in manufacturing, including but not limited to the replacement of paper processing with computers, the nurturing and promotion of the Internet and digital communication [1], the use of programmable logic controllers (PLCs) and information technology (IT) for automated production [3], and the current movement towards a fully digitalized manufacturing cycle [4]. These digitalization waves have enabled a broad range of applications, from upstream supply chain management and shop floor control and management to post-manufacturing product tracing and tracking.

Among the new digital advancements, the development of artificial intelligence (AI) [5], Internet of Things (IoT) devices [3,5], and digital twins (DTs) has received attention from governments, agencies, academic institutions, and industries [6]. The idea of Industry 4.0 has been put forward by the community of practice to achieve a higher level of automation for increased operational efficiency and productivity. Smart technologies under the umbrella of Industry 4.0, such as the IoT, big data analytics (BDA), cyber-physical systems (CPS), and cloud computing (CC), are playing critical roles in stimulating the transformation of current manufacturing into smart manufacturing [7–10]. With the development of these Industry 4.0 technologies to assist data flow, a number of manufacturing activities such as remote sensing [11,12], real-time data acquisition and monitoring [13–15], process visualization (data, augmented reality, and virtual reality) [16,17], and control of all devices across a manufacturing network [18,19] are becoming more feasible. The implementation of Industry 4.0 standards encourages institutions and companies to adopt a more robust, integrated data framework that connects physical components to the virtual environment [1], enabling a more accurate representation of the physical parts in digitized space and leading to the realization and application of DTs.

The concept of creating a "twin" of a process or a product can be traced back to the late 1960s, when NASA assembled two identical space vehicles for its Apollo project [20–22]. One of the two was used as a "twin" to mirror all the parts and conditions of the one sent into space. In this case, the "twin" was used to simulate the real-time behavior of its counterpart.

The term "digital twin" was first defined in 2002 by Michael Grieves in an industry presentation concerning product lifecycle management (PLM) at the University of Michigan [23–25]. As described by Grieves, the DT is a digital informational construct of a physical system, created as an entity on its own and linked with the physical system [24].

Since this first definition, interpretations from different perspectives have been proposed, with the most popular one given by Glaessgen and Stargel, who note that a DT is an integrated multiphysics, multiscale, probabilistic simulation of a complex product that uses the best available data, sensors, and models to mirror the life of its corresponding twin [26]. It is generally accepted that a complete DT consists of a physical component, a virtual component, and automated data communications between the two [2]. Ideally, the digital component should include all information about the system that could potentially be obtained from its physical counterpart. This ideal representation of the real physical system is the ultimate goal of a DT, but in practice, simplified or partial DTs currently dominate in industry. These include the digital model, where a digital representation of a physical system exists without automated two-way data communication, and the digital shadow, where the model exists with one-way data transfer from the physical to the virtual component [2].

In line with the US Food and Drug Administration (FDA)'s vision of a maximally efficient, agile, flexible pharmaceutical manufacturing sector that reliably produces high-quality drugs without extensive regulatory oversight [27], the pharmaceutical industry is embracing the general digitalization trend. Industries, with the help of academic institutions and regulatory agencies, are starting to adopt Industry 4.0 and DT concepts and apply them to research and development, supply chain management, and manufacturing practice [9,28–31]. The digitalization move that combines Industry 4.0 with International Council for Harmonisation (ICH) guidelines to develop an integrated manufacturing control strategy and operating model is referred to as Pharma 4.0 [32].

However, according to a recent survey conducted by Reinhardt et al. [33], the industry's preparedness for this digitalization move is still unsatisfactory. Most pharmaceutical and biopharmaceutical processes currently rely on quality control checks, laboratory testing, in-process control checks, and standard batch records to assure product quality, whereas process data and models have a lower impact. Within pharmaceutical companies, there are gaps in knowledge of and familiarity with the new digitalization move, resulting in a roadblock to strategic and shop-floor implementation of such technologies.

Despite the rapid development of DTs and their building blocks, state-of-the-art review studies concerning pharmaceutical and biopharmaceutical manufacturing are limited. This paper provides a literature review and a discerning summary of the current status of DT development and application in the pharmaceutical industry, focusing on small and large molecule drug product manufacturing, with the aim of identifying current and future research directions in this area. The remainder of the paper is structured as follows. A description of the general DT framework is provided in Section 2, followed by a detailed review of DTs in pharmaceutical and biopharmaceutical manufacturing in Sections 3 and 4, respectively. More specifically, we intend to provide readers with a summary of the critical components of an effective DT and the progress of implementing these components in pharmaceutical and biopharmaceutical manufacturing. After discussing the current status, we discuss the challenges associated with the development and application of DTs in each section, with conclusions at the end.

#### **2. Digital Twin Framework**

As mentioned in Section 1, a DT has a physical component, a virtual component, and automated data communication in between, realized through an integrated data management system. The synergy between the physical space, the virtual space, and the integrated data management platform is illustrated in Figure 1. The physical component consists of all manufacturing sources of data, including different sensors and network equipment (e.g., routers, workstations) [34]. The virtual component needs to be a comprehensive digital representation of the physical component in all aspects [8]. Its models are built on prior knowledge, historical data, and data collected in real time from the physical component to continuously improve their predictions, thus capturing the fidelity of the physical space. The data management platform includes databases, data transmission protocols, operation data, and model data. The platform should also support data visualization tools in addition to process prediction, dynamic data analysis, and optimization [34]. Sections 2.1–2.3 discuss each component in more detail.

**Figure 1.** Physical component, virtual component, and data management platform of a general digital twin (DT) framework.

#### *2.1. Physical Component*

Sourcing data from the physical process and component is one of the most essential elements in the development of a DT. The critical process parameters (CPPs) of equipment can be obtained either manually from the human–machine interface (HMI) generally provided by the equipment manufacturer or automatically through machine–machine interfaces (MMIs). Several standard MMIs exist, such as Open Platform Communications (OPC), OPC Data Access (OPC DA), OPC Unified Architecture (OPC UA), and Modbus [35], for automating the data transfer between equipment software and control or historian software. OPC UA is considered the current standard, as it adds features such as multiple tags along with their properties [36]. Data can also be transmitted over the network using message queue telemetry transport (MQTT), Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc. The critical quality attributes (CQAs) of the product are determined using soft sensors, which usually employ network protocols for data transmission [37]. Soft sensors combine hardware sensors with proprietary software-enabled models that help obtain information about the process [38]. Soft sensors have been implemented in several process industries for process monitoring and control, for example to measure cake resistance in freeze-drying applications [39], measure temperature with pyrometers [40], and estimate product quality during crude distillation [41], and they have found several other industrial applications [42–45]. Continuous acquisition of large amounts of data requires a systematic framework, such as a data historian, to store the historical data. Several studies have employed local data historians [46,47] to create an information infrastructure enabling the synchronous collection of process and sensor data. Zidek [48] demonstrated the Industry 4.0 concept for small and medium-sized enterprises (SMEs), where the quality of the product was assessed by a DT and the communication between the OPC server and the PLC system was achieved using OPC UA. A combination of network and OPC communication protocols was used by Kabugo [35] to develop a cloud-based analytics platform for a waste-to-energy plant. Several other studies focusing on smart factories according to the Industry 4.0 standard have utilized similar communication protocols [49–51].
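
To make the data-acquisition path more concrete, the minimal sketch below publishes a single CPP reading as a JSON message over MQTT, the kind of payload a data historian or cloud platform could subscribe to. The broker address, topic hierarchy, tag name, and payload fields are illustrative assumptions, not details from the cited studies, and the paho-mqtt 1.x client API is assumed.

```python
# Minimal sketch: publishing a critical process parameter (CPP) reading to an
# MQTT broker for collection by a data historian. Broker, topic, and payload
# fields are hypothetical; paho-mqtt 1.x API shown (v2.x additionally requires
# a callback API version argument in the Client constructor).
import json
import time

import paho.mqtt.client as mqtt

BROKER = "historian.example.local"    # hypothetical plant-network broker
TOPIC = "plant/line1/granulator/cpp"  # hypothetical topic hierarchy

client = mqtt.Client(client_id="granulator-edge-gateway")
client.connect(BROKER, port=1883, keepalive=60)

reading = {
    "tag": "granulator.inlet_air_temp",  # hypothetical tag mirrored from OPC UA
    "value": 72.4,                       # degrees Celsius
    "timestamp": time.time(),            # epoch seconds for historian alignment
}
client.publish(TOPIC, json.dumps(reading), qos=1)  # QoS 1: at-least-once delivery
client.disconnect()
```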

#### *2.2. Virtual Component*

The virtual component consists of a collection of models to simulate the physical process and to analyze the current and future state of the system. With appropriate models, the virtual component can be used to perform real-time process simulation and system analyses, including but not limited to sensitivity studies that identify the most influential factors [52], design space studies that yield feasible operating conditions [53], and system optimization [54]. Results from real-time process simulation can be sent to the data management platform to visualize the process, and the results of system analyses, together with preprogrammed expert knowledge, can be used to deliver control commands to the physical counterpart to ensure process and component conformity.

Different model types exist for use in DTs, namely mechanistic models, data-driven models, and hybrid models. Mechanistic models rely strongly on process knowledge and understanding, as their development is based on fundamental principles and process mechanisms [55]. The resulting models are highly generalizable, with physically interpretable variables and parameters, and have relatively low requirements on process data. Often, however, this comes with high development and computation costs [54,56]. In contrast, data-driven models depend only on process data, and no prior knowledge is needed [55]. Their advantages include more straightforward implementation, relatively low development and computational expenses, and convenient online usage and maintenance. However, poor interpretability, poor generalizability, and the need for large amounts of data are limitations of this modeling method [55,57,58]. The hybrid modeling strategy was introduced to balance the advantages and disadvantages of the other two model types [57,59–61]. With different hybrid structures, the hybrid modeling method offers improved predictability and flexibility in process modeling [58,61,62].
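
As a simple illustration of the hybrid strategy, the sketch below combines a first-order mechanistic drying model with a data-driven correction fit to its residuals. The rate constant, measurements, and correction form are purely illustrative assumptions and are not taken from any cited work.

```python
# Minimal sketch of a hybrid (serial) model: a first-order mechanistic drying
# rate supplies the structure, and a small data-driven term corrects its
# residuals. All numbers and the correction form are hypothetical.
import numpy as np

def mechanistic_moisture(t, x0=0.30, k=0.05):
    """First-principles part: exponential moisture decay x(t) = x0 * exp(-k t)."""
    return x0 * np.exp(-k * t)

# Hypothetical plant measurements (time in minutes, moisture fraction)
t_obs = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
x_obs = np.array([0.30, 0.19, 0.12, 0.08, 0.06])

# Data-driven part: fit a low-order polynomial to the mechanistic residuals
residuals = x_obs - mechanistic_moisture(t_obs)
coeffs = np.polyfit(t_obs, residuals, deg=2)

def hybrid_moisture(t):
    """Hybrid prediction = mechanistic backbone + learned residual correction."""
    return mechanistic_moisture(t) + np.polyval(coeffs, t)

print(hybrid_moisture(np.array([15.0, 35.0])))  # predictions at unseen times
```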

In addition to model development, computational cost is a main concern for the virtual component of a DT. Since a fully developed DT aims to represent the physical counterpart and perform system analyses, it requires extensive computational power. For a large system, local desktops and consumer-grade central processing units (CPUs) cannot meet the demand. Many computationally intensive models can be run in parallel using high-performance computing (HPC) to enhance computational speed and achieve real-time or near-real-time simulations [63–65].
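
As a small illustration of how such workloads parallelize, the sketch below evaluates a toy process model over a grid of candidate operating conditions across local CPU cores; an HPC cluster extends the same pattern to many nodes. The surrogate model and parameter grid are illustrative assumptions only.

```python
# Minimal sketch of parallel scenario evaluation: the same (toy) process model
# is run for many candidate operating conditions across CPU cores.
from multiprocessing import Pool

import numpy as np

def simulate_yield(params):
    """Toy surrogate standing in for an expensive unit-operation simulation."""
    temperature, residence_time = params
    # Hypothetical response surface; a real DT would call a mechanistic model here
    return -(temperature - 45.0) ** 2 - 2.0 * (residence_time - 12.0) ** 2 + 100.0

if __name__ == "__main__":
    # Grid of candidate operating conditions (temperature in C, time in min)
    grid = [(t, rt) for t in np.linspace(30, 60, 16) for rt in np.linspace(5, 20, 16)]
    with Pool() as pool:                  # one worker process per available core
        yields = pool.map(simulate_yield, grid)
    best = grid[int(np.argmax(yields))]
    print("Best candidate (temperature, residence time):", best)
```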

To develop models, perform simulations, and conduct system analyses for the virtual component of the DT framework, appropriate modeling platforms are needed. Various commercial modeling platforms and software packages are available. Among them, MATLAB and Simulink (MathWorks) [66], COMSOL Multiphysics (COMSOL) [67], gPROMS FormulatedProducts (Process Systems Enterprise/Siemens) [68], aspenONE products (AspenTech) [69], and STAR-CCM+ (Siemens) [70] are commonly seen in the process industries. These platforms offer a large collection of models and/or tools that enable users to create or incorporate unit operation and flowsheet models based on the actual process. Some of these companies have also been developing local and cloud platforms (e.g., the gPROMS Digital Applications Platform [71] from Process Systems Enterprise/Siemens, Siemens Mindsphere [72]) for hosting and computing models, integrating the physical component, and providing data management functions, thereby offering end-to-end DT solutions. Others have focused on improving compatibility with common data management and Internet of Things (IoT) integration platforms, which are described next in Section 2.3.

#### *2.3. Data Management*

In addition to model management and simulation platforms, several commercial IoT Platform as a Service (PaaS) offerings, such as Predix (General Electric) [73], Mindsphere (Siemens) [72], SEEQ [74], TrendMiner [75], and TIBCO Cloud [76], have been developed. These platforms offer a large collection of tools that enable users to develop, visualize, analyze, and manage data on cloud servers. Some cloud service companies, such as Amazon Web Services (AWS) [77], Microsoft Azure [78], Google Cloud [79], and IBM Watson [80], offer multipurpose platforms that are more versatile [81]. These platforms also offer distributed computing, data analysis tools, interaction protocols, and data and device management tools. Several of the interface protocols mentioned in Section 2.1 are also applicable to data transfer in the cloud. These platforms also provide large data storage capacities at affordable prices. Industrial-grade IoT platforms are developed with a greater emphasis on secure device connectivity and cyber-security [82].

In most cases, seamless data integration is hindered by the large heterogeneity among manufacturers and services in the software used and the data formats supported [83]. Some cloud services provide optional application programming interfaces to integrate with other software, but many packages are left out due to the large number of software products in use. Thus, a standard file format is needed to encourage cross-platform integration. The World Wide Web Consortium (W3C) has proposed the Extensible Markup Language (XML) and the Resource Description Framework (RDF), among other markup languages, to model information explicitly [84]. XML [85] gives the user the freedom to define tags and data structures that are readable by both machines and humans. This syntax is further developed to incorporate the graph structure of the information within the RDF framework. The W3C also proposed the Web Ontology Language (OWL) for information modeling; OWL is a vocabulary extension of RDF and is currently used together with XML and RDF. Unfortunately, these files become cumbersome when large databases need to be stored [86]; thus, the Structured Query Language (SQL) was recommended by the American National Standards Institute (ANSI) as the standard language for relational databases [87]. SQL databases are commonly found on cloud servers; however, their difficulty with horizontal scalability has led to the development of non-SQL (NoSQL) databases, which are easily scalable both vertically and horizontally [88] and can be hosted on cloud servers. Cloud servers are not limited to storage; they also offer large and scalable compute capabilities that can be leveraged for quick data analysis and simulations. A web service can also be hosted on a cloud server to create an online dashboard that visualizes both real-time physical data and data from simulation and data analysis.
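
To illustrate the path from vendor-neutral markup to relational storage described above, the minimal sketch below parses a small XML batch record and writes it into an SQL table using Python's standard library. The tag names, table schema, and values are illustrative assumptions rather than any standardized pharmaceutical data model.

```python
# Minimal sketch: an XML process record parsed and stored in a relational (SQL)
# table. All tag names, schema, and values are hypothetical.
import sqlite3
import xml.etree.ElementTree as ET

xml_record = """
<batchRecord batchId="B-1023">
  <reading tag="blender.speed_rpm" timestamp="2020-06-01T10:00:00Z">250</reading>
  <reading tag="blender.torque_Nm" timestamp="2020-06-01T10:00:00Z">12.7</reading>
</batchRecord>
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.execute("CREATE TABLE readings (batch_id TEXT, tag TEXT, ts TEXT, value REAL)")

root = ET.fromstring(xml_record)
batch_id = root.attrib["batchId"]
for node in root.findall("reading"):
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        (batch_id, node.attrib["tag"], node.attrib["timestamp"], float(node.text)),
    )

# Query the stored process data back, e.g. for a dashboard or model input
for row in conn.execute("SELECT tag, value FROM readings WHERE batch_id = ?", (batch_id,)):
    print(row)
```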

#### *2.4. Applications of Digital Twin*

DT frameworks, as presented in Sections 2.1–2.3, are implemented across various industries [2,4,89] for simulation, real-time monitoring, control, and optimization to handle "what-if" or risk-prone [89] scenarios, with the goals of improving process efficiency, safety analysis, maintenance, and decision-making [24]. This section provides a brief overview of such applications [4] in various industries, including aerospace, energy, manufacturing, automobile, chemical, healthcare, semiconductor, and city planning, as shown in Table 1.



A commercial application of a fully integrated DT was first demonstrated by General Electric (GE) at the Minds + Machines event in 2017 for the GE90 engine [104], with 300 engines integrated to supply historical and real-time process information for predicting process failure, mitigating risks, and optimizing maintenance costs. Similar applications in the aviation industry include DTs of airplanes used for training simulations [100] and aircraft health management [98,99,105,106] for damage assessment and rectification. The aerospace industry focuses on DT applications for the development of next-generation outer-space vehicles, following NASA's successful use of a twin to rectify maintenance problems during the Apollo 13 mission [26,101]. DT applications in the energy sector include GE's wind farm [92] and steam turbines [4,90–92]. These DTs are capable of integrating historical data on the process, fuel costs, electricity, process wear and tear, and weather forecasts to suggest possible real-time modifications for reducing operating costs. Smart manufacturing is another sector benefitting from DT applications through the digitization of product manufacturing [96,97] and the development of the digital twin shop floor (DTS) [2,18,93–96], incorporating real-time information on the manufacturing plant, the state of production machinery, environmental conditions, and their effects on manufactured products. DT applications in the automobile and transportation area focus on the automation of vehicles [107] and long-distance transportation [102], along with the analysis of maintenance [22] and risk-prone issues [108]. Healthcare applications include virtual replicas of patients used for surgical operation training [4], sensors for health monitoring [109], the study of the health of a country's population [110], and the "Living Heart" project [111] developed for the analysis of blood circulation. Furthermore, city planning is another domain where virtual replicas of cities, known as "smart cities" [103], are used for urban planning and optimal resource allocation [112]. Such efforts promote the construction of smart, sustainable cities [113] while providing a holistic view of cross-vertical optimization of overall city infrastructure [114].

From the applications reviewed, it is clear that the concept of DT is rapidly being employed across various domains, given its advantages. However, it is important to identify the challenges associated with the development and application of integrated frameworks for the systematic utilization of DTs.
