A DMP serves the objectives of the H2020 ORD pilot by promoting the FAIR data principles and fostering improved data management practices. In turn, making data FAIR renders them more easily available to other researchers, potentially resulting in the generation of new scientific knowledge. While the ever-increasing body of data will be of value to the scientific community, the individual researchers who produce the data may question the value of data management because the required time, effort and resources could be perceived as detracting from their research, whilst benefitting others. Any such misgivings are likely to be further compounded by the increased transparency and scrutiny that will result from easier access to research data. Consequently, attention needs to be given to the workflows, infrastructure and incentives needed to convince the research community of the value of data management.
3.1. Challenges
While there are convincing arguments that favour data management becoming a useful (and possibly indispensable) feature of the mainstream research process, the accompanying risks and pitfalls need to be acknowledged and accommodated.
Research culture—until recently, the research community has operated to good effect in the complete absence of any systematic approach to data management. That said, it is reasonable to assume that the failure of the scientific community to preserve (and enable access to) the larger proportion of the considerable body of data it has generated has impeded scientific progress, because it prevents the re-examination and reuse of data and risks duplication of effort. Even so, such arguments may still be insufficient to motivate project partners to engage in data management activities. There is thus a need to demonstrate to researchers the value of data management, such as data quality improvement; minimisation of data loss; better utilisation of resources; re-examination and reuse; and increased citations.
Data sharing—ultimately, scientific research aims to benefit society and while this principle applies equally to data sharing practices, it is not necessarily the case that Open Data always satisfies this requirement. Although the H2020 ORD principle of “as open as possible, as closed as necessary” encourages partners to make data open to third parties, partners may be reluctant to engage in the activities required by the DMP. For example, there is the risk that the term “open” can be (mis)interpreted as mandatory, which may act as an obstacle to organisations participating in the pilot.
Even where there is broad support for data management as a core component of a project, it only takes one dissenting partner to undermine the entire initiative, because the resulting body of data is incomplete and hence of limited value. Moreover, there is the risk of data sharing being restricted to subgroups of partners, leading to fragmented cooperation that is potentially alienated from the rest of the consortium. This claim is supported by evidence from seven projects (comprising more than 50 unique partners) in which IRES explored the data access preferences of the different project partners, asking whether their data sets should be open (public), restricted (shared within the consortium) or closed (not shared at all). The results are summarised in Figure 2.
Despite the protections afforded by the terms of the Grant Agreement and Consortium Agreement, the results indicate that more than half the project partners are unwilling to share their data even within the project consortium. Instead, they prefer to avoid data sharing wherever possible unless it is essential to achieving certain project objectives, in which case they opt to share their data only with targeted partners in order to complete the project tasks. Such circumstances risk undermining the co-ordination of the project.
Another potentially problematic feature of data sharing is the increased transparency and scrutiny it brings. Scientific data are often complex and generated by emerging protocols, i.e., protocols that do not yet follow recognised standards. By sharing their data, whether openly or to a lesser extent as part of a managed co-operation, researchers expose their data and related findings to potential criticism, and there is also the potential for results to be misinterpreted. The authors have encountered such concerns, which have, on occasion, undermined the data management activity.
Infrastructure—another obstacle to data management is the likely absence of the required infrastructure, including interoperability standards. It is not unreasonable to claim that the pursuit of Open Data has been motivated by the success of the Open Access initiative, and whereas the infrastructure and workflows for the latter have been established over many hundreds of years, data management infrastructures and workflows are only just emerging. Thus, while expectations of the benefits arising from improved data management are high, much work is required to establish a robust infrastructure on which to build a vibrant data management ecosystem. In turn, the time and effort needed to realise such an infrastructure risks a loss of momentum and enthusiasm. Furthermore, even with the establishment of a robust infrastructure, until the various scientific disciplines have developed the data technologies (i.e., interoperability standards) required for seamless data exchange and aggregation, data gathering tasks will remain a significant burden that detracts from rather than contributes to the research process.
Project coordination—perhaps of most immediate concern are the practical problems the authors have encountered relating to resourcing and coordination. When drafting a proposal, there is the risk that consortia pay insufficient attention to the DMP and data management. This circumstance is understandable given that investing resources in Open Access and data management could be perceived as detracting from the research. Unfortunately, the outcome is that authoring the DMP (and hence formulating the data management policy) is often left to a single partner, such that, during project implementation, other partners are unaware of their responsibilities. This circumstance can lead to problems that risk undermining the data management component of a project. First, most partners remain unaware of the expected data flow during the project and of their role and responsibilities in that process, which leads to misunderstandings about the handling of the data during the active phase of the project. Second, the resources allocated to DMP tasks are often insufficient to account for the person months and infrastructure needed for their implementation. As a result, partners may find themselves lacking the resources to cover the data management activities throughout the term of the project. The authoring of the DMP by a single partner also risks misunderstandings about the data to be shared between project partners, as well as about the expected data quality. In addition, if data are not handled in a standardised way, sharing among the partners will be difficult, resulting in ineffective use of resources and delaying the outcome of the research.
The DMP may itself also present obstacles. For example, academic and industrial partners alike might have good reason to keep their data closed and therefore be resistant to participation in the ORD pilot [36] on the understanding that Open Data is a mandatory requirement, whereas the actual requirement is “as open as possible, as closed as necessary”. Such misunderstandings have serious implications given that recognition in the sciences relies largely on knowledge discovery, which in turn relies on data. It is thus unsurprising that a researcher who has invested significant effort in designing and undertaking data creation activities may be reluctant to make the results openly accessible and thereby risk a loss of intellectual advantage. Similarly, commercial organisations may fear that Open Data entails a loss of competitive advantage. Hence, market competition not only discourages partners from making their data open but also risks companies declining to participate in the ORD pilot. While data accessibility should not present any particular difficulties because of the “as open as possible, as closed as necessary” principle, in reality academic and commercial sensitivities are such that any suggestion of Open Data being mandatory will prove an obstacle to data management. This is somewhat unfortunate given that restricting access to data to mitigate loss of competitive advantage is entirely compatible with the FAIR principles [18].
Beyond the authoring of the DMP, ill-considered scheduling of tasks also has the potential to detract from the data management activities. Whereas there is a tendency for data management tasks to be a feature of dissemination and exploitation work packages, any circumstance that decouples data collection from data creation will likely hinder the data management activities of a project. Instead, data collection tasks need to be a feature of the activity that generates the data; otherwise, there is a risk of delays and hence a failure to share data in a timely manner amongst project partners.
3.2. Solutions
Data management has to overcome various challenges if its potential benefits are to be realised. While the underlying issues are invariably inter-related, the authors address each in turn.
Research culture—there are good reasons to believe that research data management will become (or perhaps already has become) a feature of mainstream research, insofar as the entities upstream and downstream of research, namely the funding agencies and the publishing houses, respectively, are mandating improved data management practices. What remains is for the research community to be convinced that data management adds value to the research process. Perhaps most important is to demonstrate that, rather than simply delivering a body of data towards the end of the project, data management supports the scientific objectives of a project. Simply delivering a body of data does not contribute to those objectives and can reasonably be claimed to consume resources better used for research. Instead, where data management is an integral feature of data creation activities from the outset of a project, opportunities will exist for exchanging data between partners as they become available; for ensuring data consistency and quality; and for entering into data sharing arrangements. With systematic research data management being a relatively new phenomenon, concrete examples of the benefits are sparse. That said, over the relatively short period of the H2020 ORD pilot, the authors have direct experience of both anticipated and unforeseen benefits. For example, in consequence of data peer review, the INCEFA-PLUS project was able to deliver a body of high-quality test results that provided the basis for data mining activities and an international data sharing arrangement in a subsequent H2020 proposal, which has since been favourably evaluated [37]. Presently, however, it is still not uncommon to encounter a lack of data sharing culture and motivation [17,24]. On occasion, the authors have encountered quite hostile reactions, and in such circumstances the failure to adhere to the DMP becomes a self-fulfilling prophecy.
Given the DMP is a relatively recent phenomenon that is still somewhat unfamiliar to the research community, a gradual introduction to its concepts and purpose is required that will allow researchers to become aware of potential pitfalls and benefits. This circumstance favours engagement with early-career researchers, who by definition are in the process of learning the skills needed for a career in the sciences. Early-career researchers are also likely to be more favourably inclined due to their not having been exposed to a research environment where data management has been entirely lacking. Hence, giving responsibility for the DMP and data management tasks to early-career researchers would provide a concrete opportunity to contribute both to their career development and to the implementation of the projects in which they are participating.
For consortia as a whole, it is important that all partners participate in the DMP activities: a single weak link risks undermining the data management effort. In this context, one value proposition that may motivate participation is the development of organisational data documentation standards and protocols, which have the potential to improve data quality, facilitate cross-department data exchange and minimise data loss within an organisation. In particular, thorough documentation of the data lifecycle at every research phase provides all the details about data generation, management and processing, thereby allowing validation of the research results. This and other potential benefits, such as data sharing and data citation, could be the subject of training workshops that feature as milestones of any given project.
In the context of EC-funded projects where the JRC has participated in a data management role, the appointment of a Data Committee has proven particularly effective in engaging researchers in data management activities [38,39]. Perhaps most importantly, the collective responsibility for authoring and maintaining the DMP ensures that the larger proportion of the project partners are familiar with and supportive of the data policy of the project. Furthermore, by undertaking actions such as the formulation of data protocols and data review, partners become familiar with the practicalities of DMP implementation and the accompanying added value, whereby data protocols identify the data needed to achieve the scientific objectives and data review ensures the quality and consistency of the data coming from the different partners.
Data sharing—following the success of the FP7 Open Access pilot, it is unsurprising that funding agencies have sought a similar access paradigm for research data. Certainly, data can be completely open and freely available for reuse, such as public sector data, which although typically lacking inherent value can yield derived value, such as services developed to locate preferred parking zones based on municipal parking fine data or suitable bathing areas based on municipal water quality data [40]. There will, though, be legitimate reasons for data remaining entirely closed, e.g., where there is inherent commercial, financial or intellectual value. To address such circumstances, i.e., where consortia consider that little or no data are available or suitable for open access, the European Commission allows complete withdrawal from the ORD pilot (commonly known as opt-out) at any stage during the project lifecycle, even after the Grant Agreement has been signed. While the authors have experience of both circumstances, typically data will be positioned somewhere between the extremes of the data access spectrum. Again, the European Commission recognises this circumstance and although promoting open access to research data is a core principle of the H2020 ORD pilot, making the entirety of project data open has never been mandatory [41]. Specifically, the principle of “as open as possible, as closed as necessary” allows for keeping certain data sets open and others closed. Even so, the notion that data access can simply be open or closed ignores the complexities of data sharing. Hence, further measures may be required.
Data may be open but the owner may expect acknowledgement in derivative works, which requires that the data be accompanied by bibliographic data and a licence. Conversely, data may be nominally closed but made available if sharing is on mutually beneficial terms, which requires that the data can be discovered and access requests submitted. Both circumstances can be addressed by enabling data for citation. Irrespective of whether data are open or closed, data citation ensures they can be discovered and the data creators acknowledged. Beyond these benefits, citing data also allows for data transparency, thereby facilitating the verification of results. Furthermore, where data citation relies on a digital identifier, such as the digital object identifier (DOI), there is scope for machine-readability and long-term preservation. Identifiers ensure the data with which they are associated can be discovered irrespective of usage licences. The practice of citing data typically relies on infrastructures that generate and assign DOIs and store metadata for data sets. Data repositories that register identifiers automatically via such infrastructures are becoming more common, e.g., Zenodo [33] assigns DataCite DOIs for all uploads [22]. Similarly, in the context of those EC-funded projects where the JRC has participated, extensive use has been made of the DataCite framework. The DataCite metadata schema ensures sufficient bibliographic data are available for citation and that the terms of (re)use are specified. The platforms that DataCite operates expose the metadata in various formats, thereby promoting discoverability and reuse [42,43].
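The mechanics of such citation can be illustrated with a short sketch. The field names below follow the general shape of a DataCite-style metadata record (creators, titles, publisher, publication year, DOI), but the record itself and the citation format are hypothetical, intended only to show how publicly accessible bibliographic metadata make a data set citable regardless of whether the data themselves are open or closed:

```python
# Build a human-readable data citation from a DataCite-style metadata record.
# The record below is illustrative: field names echo the DataCite metadata
# schema (creators, titles, publisher, publicationYear, identifier), but the
# values and the citation layout are hypothetical.

def format_citation(record):
    """Return a citation of the form
    'Creator1; Creator2 (Year): Title. Publisher. DOI URL'."""
    creators = "; ".join(c["name"] for c in record["creators"])
    title = record["titles"][0]["title"]
    return (f"{creators} ({record['publicationYear']}): {title}. "
            f"{record['publisher']}. https://doi.org/{record['doi']}")

example = {
    "doi": "10.1234/example.5678",  # hypothetical DOI
    "creators": [{"name": "Doe, J."}, {"name": "Roe, R."}],
    "titles": [{"title": "Fatigue test data set"}],
    "publisher": "Example Repository",
    "publicationYear": 2020,
}

print(format_citation(example))
```

Because only the metadata are needed to build the citation, the underlying data set can remain closed while still being discoverable and citable.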
Using the DataCite framework, the JRC engineering materials database hosted at https://odin.jrc.ec.europa.eu (accessed 30 October 2020) allows data to be cited in the same way as traditional scientific publications, so that Open Data remain open and closed data remain closed, but all data sets are citable and discoverable by way of their publicly accessible bibliographic metadata. In the case of nominally closed data, the platform supports the submission of data access requests, allowing bilateral discussion and an informed decision by both parties about whether to enter into data sharing. Consequently, all circumstances where data owners may have reservations about data sharing can be accommodated, whether because of the complexities of the data, the protocols used to produce them or their commercial sensitivity. Where there is an underlying willingness to share data, this combination of data citation and on-demand data access delivers a “managed” data access paradigm that can reasonably be claimed to be more effective than that of Open Data, accommodating as it does the reuse of both open and closed data; the competitive interests (intellectual and commercial) of project partners; and opportunities for dialogue between the owners and consumers of the shared data. In those projects where the JRC has participated, “managed” data access has alleviated the data sharing concerns of industrial partners, thereby encouraging their participation in the data management activities, typically more so than partners from the academic sector, where it appears that concerns about increased transparency are an obstacle to data sharing.
Similarly, in those EC-funded projects where IRES has participated, effective data management practices ensure data generated during a project are gathered and stored for use over the entire term of the project and for reuse thereafter. In this context, ensuring that the data are citable will promote their discovery; the transparency of project results; reuse by other members of the research community; and accreditation [44]. The H2020 eNanoMapper project is a case that strongly supports this argument: eNanoMapper developed a computational infrastructure for the management of engineered nanomaterials toxicological data by integrating related public data stored in several databases [45].
Enabling access to scientific data may also add credibility to the results of a project, whereby data can be validated by scientists from outside organisations and institutes. In consequence, the reputation of organisations that deliver quality data increases. Additionally, the existence of large and well-organised bodies of data that result from effective data management may also become a useful training set for algorithms, which can be exploited to the benefit of the owning organisation(s) as well as third parties.
The experience of the authors has been that academic and industrial partners can be unwilling to share their data publicly where doing so would harm their competitive position in their respective fields. In such circumstances, the benefits that accompany open access are insufficient to offset the lost advantage, and alternative data sharing paradigms need to be explored, such as restricted data sharing or data licensing, which may constitute a significant value proposition for the data owner. Restricted data sharing refers to the exchange of data between the members of a specific group or project consortium to their mutual benefit. Alternatively, data licensing can be adopted to promote data brokerage, whereby a transaction takes place, either in-kind by reciprocal data exchange or financially. Data licensing approaches have the potential to become increasingly popular, especially given the development of data marketplaces, such as those of the VIMMP [46] and MarketPlace [47] projects. Again, however, data integration with marketplaces requires standardised vocabularies and technical infrastructure that can ensure data interoperability with minimal effort for the data provider. As long as these requirements remain unmet, these approaches cannot easily be pursued.
Infrastructure—open data standards enable interoperability and the development of a robust infrastructure that supports data management activities, whereby interoperability is feasible between humans, between humans and machines, and between machines alone. Such communication will face critical challenges where incompatibilities exist in the language used to express the meaning of the data and in the manner in which data are collected, stored and shared.
For data collection, measures have to be taken to determine the personal data to be shared and to recommend predetermined procedures and methods for the acquisition and handling of the research data for the duration of a project. Additionally, the people and organisations that exchange data will form heterogeneous network systems, i.e., networks that consist of computers and other devices that use different protocols and configurations. Moreover, the data that are exchanged may also be heterogeneous. To address these circumstances, network architectures and communication technologies are needed that enable the interconnectivity of the devices and the seamless transfer of digital data. Such specifications may concern the file format, the data structure and the methods used for data exchange, including the ways in which access is granted.
Efficient data exchange requires standards for the terminology used to describe data and metadata. Such vocabulary harmonisation is important since misunderstandings about the meaning of words can lead to incorrect assumptions. A shared vocabulary that consists of rigorously defined words helps to communicate knowledge accurately. Aligned to the development of vocabularies is the development of ontologies, which are formal descriptions of knowledge that consist of concepts and relationships. It is important not only that new domain ontologies be developed but also that existing ones are maintained and expanded. Where different terms and relationships are used to represent data, means exist to create connections among them, such as ontology alignment methods, which map entities from one ontology to another. Furthermore, shared vocabularies and ontologies need to be machine-readable, thereby rendering the sharing of data among machines easier and allowing for further computational analysis. Hence, semantic interoperability contributes not only to a common understanding of the knowledge among the partners but also to the enhanced inference of information by machines.
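The mapping step at the heart of ontology alignment can be sketched in a few lines. The vocabularies and term names below are entirely hypothetical, chosen only to illustrate how records expressed against one vocabulary can be rewritten against another, with unmapped terms flagged for human review; real alignment methods derive such mappings from lexical and structural similarity rather than by hand:

```python
# Toy illustration of ontology alignment as entity mapping. Both vocabularies
# are hypothetical; the hand-written mapping stands in for what an alignment
# method would compute from lexical and structural similarity.

ALIGNMENT = {
    # source term         -> target term
    "specimen":              "TestPiece",
    "ultimate_strength":     "TensileStrengthUltimate",
    "test_temperature":      "EnvironmentTemperature",
}

def translate(record, alignment):
    """Rewrite a record's keys from the source vocabulary to the target one.
    Keys without a mapping are kept unchanged and flagged for review."""
    translated, unmapped = {}, []
    for key, value in record.items():
        if key in alignment:
            translated[alignment[key]] = value
        else:
            translated[key] = value
            unmapped.append(key)
    return translated, unmapped

source_record = {"specimen": "S-01", "ultimate_strength": 505.0, "operator": "lab A"}
target_record, needs_review = translate(source_record, ALIGNMENT)
```

Records translated this way can be merged with data already expressed in the target vocabulary, while the flagged terms ("operator" above) indicate where the alignment still needs extending.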
A DMP can begin to address interoperability insofar as data management activities should employ existing standards. Thereafter, DMP tasks can extend to the development of new standards for those domains and applications where data harmonisation is limited or non-existent. A case where the DMP included such initiatives is that of the OYSTER H2020 project [28], in which a standardised vocabulary was developed for the description of materials characterisation methodologies and the resulting data. OYSTER also undertook to harmonise the DMP process by creating the Data Management Ontology [34]. This ontological representation, in which the entities of data management and materials characterisation co-exist in a unified linked data schema, can serve as a paradigm for the harmonisation of high-level DMP concepts with domain-specific terminology. Such data representations from various domains can be used to standardise the data management process even further. In this regard, the Data Committee responsible for the DMP would operate a software product built on those data structures to allow for the integration of the project data with either open access repositories or data marketplaces. Moreover, the Data Committee would consult the partners on the operation of the system and supervise the data integration process. This approach would accommodate both public data sharing and data licensing/brokerage, as well as allow the DMP to deliver more predictable and effective results. Although such a digital infrastructure does not yet exist, its development could be considered important given the EC investment in data marketplaces, open innovation environments and open innovation test beds. It would be reasonable to envision such a software product as an extension of the CORDIS platform, with its development funded by one or more calls of Horizon Europe (HE).
In those EC-funded projects where the JRC participated in a data management role, consortia have been encouraged to develop interoperability standards for data, whereby solutions are implemented for test types of immediate interest in accordance with a common methodology. In so doing, as well as addressing its data exchange and systems integration requirements, each project contributes to a body of data standards of use to the wider engineering materials community. Such work has typically taken place in the context of the standards setting environment, with CEN Workshops providing a platform to engage a broad representation of stakeholders, promote widespread adoption and address the longer-term prospects of the resulting technologies. While these standards serve the needs of any given project, their use by project partners serves to validate the standards and allow their further development to an increased technology readiness level. The result has been the delivery of standards for fatigue and nanoindentation test data, where the technical specifications are described in CEN Workshop Agreements [48,49] and corresponding reference implementations are publicly available from http://uri.cen.eu/cen (accessed 30 October 2020). For example, the reference implementation for fatigue test data is available as an XML Schema Definition from http://uri.cen.eu/cen/cwa/17157/1/xsd/iso-12106.xsd (accessed 30 October 2020).
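To illustrate what consuming such a standardised format involves, the sketch below parses a small fatigue-test XML document with the Python standard library. The element names are hypothetical and do not reproduce the actual CWA schema; note also that validating a document against the published XSD would require a schema-aware library, since the standard library parser does not perform XSD validation:

```python
# A consumer of a standardised XML format can use an ordinary XML parser to
# extract test data. The document below is hypothetical: its element names do
# NOT reproduce the actual CWA fatigue schema; they merely illustrate the
# pattern of reading a schema-conformant document.

import xml.etree.ElementTree as ET

DOCUMENT = """\
<FatigueTest>
  <Specimen id="S-01"/>
  <Result>
    <StressAmplitude unit="MPa">240</StressAmplitude>
    <CyclesToFailure>150000</CyclesToFailure>
  </Result>
</FatigueTest>
"""

root = ET.fromstring(DOCUMENT)
stress = float(root.findtext("Result/StressAmplitude"))
cycles = int(root.findtext("Result/CyclesToFailure"))
unit = root.find("Result/StressAmplitude").get("unit")

print(f"specimen {root.find('Specimen').get('id')}: "
      f"{stress} {unit}, {cycles} cycles")
```

The value of the standard lies in this predictability: any partner's tooling can rely on the agreed structure rather than on bilateral conventions.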
Beyond addressing the technical challenges of developing interoperability standards for data, the CEN Workshops have given particular attention to the implications for the industrial and standards setting communities, whereby data standards have the potential to be disruptive. In the standards setting domain, for example, the workflows and publication routes for ICT standards differ from those of traditional, documentary standards, with the result that entirely new policies are presently the subject of attention at both CEN and ISO. In those EC-funded projects where IRES has participated, CEN Workshops have also provided the platform for the development of documentation for material modelling data [50] and material characterisation data [51]. This trend continues with a number of recently funded H2020 projects in which the authors are participating, including nanoMECommons, OntoCommons and OpenModel, whereby interoperability standards and domain ontologies will be delivered in support of data exchange, semantic interoperability and data science, now adopting materials modelling and characterisation data workflow templates [52,53,54,55].
Project co-ordination—overcoming practical issues that arise during proposal preparation and project implementation will mitigate the risk associated with the data management policies of any given project. As indicated already, potential difficulties extend to misunderstandings about data management roles and responsibilities, adoption of practices tailored to project needs and the allocation of the necessary resources.
There is a need for a precise description of the data management roles and responsibilities of everyone involved. Responsibilities may concern the collection and curation of the data, their upload to the platform used for their storage, the reporting of research progress and the backup of the data in case of incidents that endanger their security. The clear description of the role of each partner in the DMP, as well as the corresponding responsibilities, should be specified in the proposal. In practice, this can be managed by anticipating the establishment of a Data Committee, so that during proposal preparation, individual partners have an opportunity to enrol in the Data Committee and hence give consideration to their contributions. While it would be neither effective nor appropriate for all partners to participate, the membership of the Data Committee should be sufficiently representative to ensure the consortium as a whole becomes aware of their roles and responsibilities. Moreover, the partner responsible for data management should communicate this information to the other partners as early as possible during proposal preparation. In this regard, partners have to be aware of the data sets to which they have access. Furthermore, consideration is needed of the legal restrictions concerning the dissemination and use of the data after the end of a project. Such policies should define the data to be shared; the duration of their availability; whether they will be open, closed or restricted; and their use by third parties. These policies ensure that research data will be used ethically and with due respect to any personal or sensitive information. Establishing a common understanding about the accessibility of research data is also important.
To encourage participation in the DMP, the effort required from the project partners needs to be minimised. To gather the information required for the synthesis of an effective DMP, one option is to undertake a survey of the partners. It is important that any such survey consist of questions tailored to the data management needs of the specific project. Simple questionnaires targeted towards the needs of a project, in combination with comprehensive guidance, not only prevent the researchers from becoming distracted from the execution of their tasks but also render data collection easier. Thereafter, the template provided by the EC for H2020 projects is sufficiently general to be applicable to any type of research project but can be adjusted to serve a particular case based on the results of the survey.
To allow for the implementation of all DMP tasks, explicit estimates of resourcing should be provided at the time of the proposal. Awareness of the required budget prior to the start of a project helps organisations manage their resources effectively. Such practice also allows the identification of mismatches between project tasks regarding the allocation of resources [56] and may prevent misunderstandings among the partners. Examples of the costs that have to be taken into consideration for data management include personnel time for data entry tasks and participation in the Data Committee; platform design and development; accompanying hardware; and archival beyond the term of the project, since fees may be charged depending on the storage duration.