Scientists routinely analyse and share data for others to use. Successful data (re)use relies on metadata that describes the context in which the data were analysed. In many disciplines the creation of such contextual metadata is referred to as reporting. One method of implementing analyses is with workflows. A stand-out feature of workflows is their ability to record provenance from executions. Provenance is useful when analyses are executed with changing parameters (changing contexts) and results need to be traced back to their respective parameters. In this paper we investigate whether provenance can be exploited to support reporting. Specifically, we outline a case study based on a real-world workflow and a set of reporting queries. We observe that provenance, as collected from workflow executions, is of limited use for reporting because it supports the queries only partially. We identify that this is due to the generic nature of provenance, that is, its lack of domain-specific contextual metadata. We observe that the required information is available in implicit form, embedded in the data. We describe
LabelFlow, a framework comprising four Labelling Operators for decorating provenance with domain-specific Labels. LabelFlow can be instantiated for a domain by plugging in domain-specific metadata extractors. We provide a tool that takes a workflow as input and produces as output a Labelling Pipeline for that workflow, composed of Labelling Operators. We revisit the case study and show how Labels provide a more complete implementation of the reporting queries.