1. Introduction
Organizations such as large companies, institutions or government bodies (e.g., central banks, regional and municipal authorities, credit institutions, and so on) have built powerful information systems to support their business processes and institutional missions. Very large volumes of data are stored and managed by these information systems, both internally generated and periodically uploaded from external sources: in fact, in many cases, external data flows feed the information systems with a certain regularity (e.g., daily, weekly and monthly). Thus, if from one side the underlying technology has been improved to deal with ever-increasing amounts of data, from the other side database administrators and system integrators have practically defined and implemented solutions to efficiently collect and manage these data flows, optimizing the upload processes as much as possible.
A very critical activity that many organizations must perform is “continuous reporting”, i.e., periodically generating reports: in fact, periodical reports are required by many governmental and regulatory authorities, to monitor the economic and social operators whose activity is subject to strong rules, to verify that they actually respect them. However, data gathering and management for generating reports present issues that are independent of the specific application domain; these issues can significantly reduce the efficiency of the processes to generate reports. (i) Typically, reports must provide aggregate data, but they often come from many data sources; consequently, data to aggregate are not defined based on a standard dictionary and common semantics; usually, this issue is addressed by manually writing queries that make data homogeneous. (ii) The previous issue can be solved only by a very small number of experts, who have both technical skills and domain knowledge: these are very rare people, difficult to find and enroll. (iii) The users who are in charge of preparing reports typically do not have complete knowledge about the meaning of fields in data sources; in contrast, they need a uniform and homogeneous view of data they must aggregate into reports, possibly with clear semantic annotations. (iv) Users in charge of generating reports are not database experts: they need a high-level tool that provides access to data in a transparent way with respect to technicalities of data management. (v) Terms and concepts often are informally defined by communities and their meaning often varies from one community to another; furthermore, specialists use a lot of complex interconnected concepts, including those defined by international standards, developed to provide global services. (vi) A common comprehension of terms and their semantics is essential to define reports effectively and efficiently; this issue could be solved by building a common dictionary of terms.
Different business areas experience these problems. This work considers the application contexts of the “financial market”. This is usually characterized by high volumes of data generated every day. Banks and financial institutions continuously exchange data flows describing financial transactions. These data flows, from different sources, constantly feed information systems and their databases. Unlike traditional activities, these data are essential for reporting activities, since reports are necessary not only for internal use (e.g., auditing and decision making). In fact, the European Central Bank (known as BCE) constantly monitors the financial status of banks; for example, BCE asks for periodical reports that summarize the performance of loans managed by banks.
The large amount of data to aggregate to produce summaries, as well as the repeated and possibly frequent generation of reports, overseen by employees who usually do not have technical competences in data management, are practical obstacles for organizations. Even though they are provided with powerful information systems that can deal with large volumes of data for operational activities, reporting activities are often not supported by software tools in an integrated way. This situation causes waste of time, delays, errors and high costs.
On the market, there are many systems that enable the analysis of large amounts of data, such as Microsoft Power BI, QlikView and Tableau. These tools are devised for the activity called ”business intelligence” and aim to retrieve, analyze, transform and report on dashboards data to enable easy and accurate analysis of data usually contained in data warehouses [
1]. However, users involved in business-intelligence activities are analysts, unusually with good knowledge of the application domain; furthermore, reports are generated as the output of the analysis, usually performed in a one-shot way. Finally, a good preliminary knowledge about data to analyze is necessary, to effectively use these tools. In other words, BI tools are not suitable for “continuous reporting”, which is a regular business activity to be performed by employees.
The novel framework named
RADAR (acronym for “Resilient Application for Dependable Aided Reporting”) aims to provide a “bridge” between data and employees who must generate reports. The acronym synthetically summarizes its goal: the framework is aimed to aid reporting activities, so that data put into reports are dependable; furthermore, the application should be resilient, meaning that it collects all the knowledge about data and application domain for a possibly-long period of time. Many different features characterize the
RADAR Framework. (i) A knowledge base is managed, to collect all terms that characterize the application domain in a common dictionary. (ii) The framework applies the concept of ”Operational Data Store” (ODS) [
2,
3], i.e., a database able to provide an integrated but still operational view of data; this way, all data sources are integrated into one unique database (called
RADAR DB), but they are not transformed towards a multidimensional representation. (iii) Data are modeled by means of the
RADAR Data Model, i.e., a hybrid solution between an ontology of terms and concepts and the actual operational and relational schema of data sources; the resulting
RADAR Schema gives a high-level view of the collected data, which are semantically characterized by ontological concepts. This way, data collected within the
RADAR DB actually gives a ”concrete view” to ontological concepts. (iv) The
RADAR Data Model also explicitly maintains mapping between data gathered from source databases and their representation within the framework; this way, the work of database administrators is simplified and the knowledge about mapping is explicitly maintained. (v) The
RADAR Data Model encompasses the
RADAR Rule Language, whose goal is to provide domain experts with a practical yet high-level formalism to encode how to complete properties that do not correspond to attributes in the source data. (vi) Employees in charge of generating reports are provided with a clear and high-level view of data; they can browse the knowledge base in general and the
RADAR Schema in particular, to easily retrieve and aggregate data to put into reports, for which they define the layouts.
Notice that the features of the RADAR Framework are quite effective in addressing the above-mentioned issues regarding the activity of reporting: (i) database experts do not have to write low-level queries, and the mapping between source databases and the RADAR DB is clearly described and documented; (ii) domain experts provide their knowledge about the application domain in a high-level way through the RADAR Data Model, obtaining the RADAR Schema, which lasts over time and documents the knowledge; (iii) employees can easily browse the RADAR Schema by moving from ontological classes to retrieve data on a semantic basis; (iv) reports are directly tied to data in the form provided by the RADAR Schema, so as technicalities to retrieve data and generate reports are automatically dealt with by the RADAR Framework.
In comparison with our previous work [
4], which for the first time gave a (still partial) presentation of the
RADAR Framework, the present paper provides many contributions. (i) The application scenario is clearly described, by exploiting a running example. (ii) The
RADAR Data Model is extensively and formally defined in all its aspects. (iii) A complete
RADAR Schema is provided for the running example. (iv) The architecture of the
RADAR Framework and its components are extensively presented, showing screenshots of the user interfaces to clarify the usage perspective of the framework.
The paper is organized as follows:
Section 2 introduces relevant related works and technical background.
Section 3.1 presents the running example that is considered in this work, while
Section 3.2 describes, respectively, the problem and the approach we devised, together with the vision we identified to solve the problem.
Section 3.3 presents the general architecture of the
RADAR Framework.
Section 4 introduces the
RADAR Data Model. Then,
Section 5.1 describes each component of the
Knowledge Base, while
Section 5.2 presents the
Report Designer in details. Finally, concluding remarks are reported in
Section 6.
3. Running Example, Approach and Architecture
This section is devoted to illustrate the practical problem, from which the idea of developing the RADAR Framework came out, the approach we followed to address it and the architecture of the RADAR Framework. The practical case is simplified into a running example that will be exploited throughout the remainder of the paper.
3.1. Running Example
Consider the case of a bank that operates on the market of loans. For various reasons, some banks sell their loans to other banks, which acquire them and become the real owners of sold loans. Let us suppose that the bank called MyBank buys loans from other banks. The contract by means of which loans are bought is called “Cession” and describes the acquired loan. Although MyBank is the owner of these loans, the interaction with the end customer is still performed by the original bank, so MyBank receives large data flows concerning the acquired loans from other banks: specific technical solutions are implemented by technicians to receive these data flows and store them into (usually relational) databases. Data flows usually update the list of paid installments and the amortization plan of each loan.
Depending on the economical capability of customers, i.e., their capability of regularly paying installments, loans are specifically classified: (i) a “Performing Loan” is a loan that has neither arrears nor unpaid installments; (ii) a “Defaulted Loan” is a loan that is either close to default or with many unpaid installments. (iii) A “Delinquent Loan” denotes a loan that has significant troubles but there is still the possibility that becomes again performing. Each type of loan is further subdivided according to other features, such as (a) “Mortgage Loan” (i.e., a loan that is secured by a mortgage), (b) “Guaranteed Loan” (i.e., a loan that is not secured by mortgages but by other guarantees, such as pledges), and, finally, (c) “Unguaranteed Loan” (i.e., loans that are not supported by any kind of guarantee).
National and international regulatory bodies, as well as financial-rating agencies (one of the most famous is Moodie’s) ask the bank to send them reports concerning its degree of reliability with respect to non-performing loans, to be informed if the financial situation of the bank is getting bad. Employees responsible for auditing and control of the financial state of the bank must prepare these reports. Some typical reports to prepare and send are the following ones.
Portfolio 1. This report provides summary data based on a few categories: (i) “Outstanding Principal”, which summarizes the theoretical residual debt; (ii) “Accrued Interest”, which summarizes the accrued interest and delinquent installments; (iii) “Unpaid Outstanding Principal”, which summarizes the actual remaining debts. These aggregations are also disaggregated based on loan status (“Performing”, “Delinquent” and “Defaulted”) and related guarantees (“Mortgage”, “Guaranteed” and “Unguaranteed”).
Portfolio 2. This report provides aggregate data related to advance payments (closure of debts) during a given quarter, together with an immediately preceding quarter.
Loan by Loan. This report summarizes the status of each single loan. For each loan, apart from the usual identification data of the bank, the customer and the loan, the current status of the loan, the current rate, the amount of the loan secured by mortgage (and so on) are reported.
To produce the above-mentioned reports, employees must retrieve raw data previously received through the incoming data flows, aggregate them and put the aggregate values into specific report layouts. To effectively perform this task, they also need to know the semantics of data to aggregate, as well as the data structures of the database where data are stored. However, these competences are hardly owned by the same person. At least two different types of employees are involved: experts of the application domain (usually, economists) and computer technicians. This situation makes the process to generate a new report quite hard and long.
For this reason, the primary objective of the tool that we have developed was to allow employees without in-depth knowledge of the application domain to generate reports. An important feature of the reports is the possibility of generating different versions of the same report over time, as regulators such as the BCE require monthly, quarterly and annual reports.
4. RADAR Data Model
This section introduces the RADAR Data Model. Based on it, data managers and domain experts design the RADAR Schema, which will be provided to report designers to retrieve desired data and aggregate them to design reports.
The RADAR Data Model is structured in three different layers:
Ontological Layer. This layer defines the Reference Ontology that provides the conceptual framework for managed data.
Concrete Layer. This layer provides a high-level and uniform view of actual data, in a way suitable for report designers.
Mapping Layer. This layer maps classes defined in the Concrete Layer to source data, used to feed the RADAR DB, i.e., the internal database of the RADAR Framework.
Figure 2 illustrates the overall
RADAR Schema for the running example. Notice the three layers in which the model is subdivided. In the remainder of this section, we introduce all the components of the
RADAR Data Model.
4.2. Concrete Layer (Core RADAR Data Model)
The core of the RADAR Data Model is the Concrete Layer, i.e., the layer where ontological classes are made concrete, to be able to fit real data.
4.2.1. Concrete Classes
Definition 2. Concrete Class.A “Concrete Class” represents a specific concept or object described by data. It is defined as a tuple
where “” uniquely identifies the concrete class, while “P” (with ) is the set of properties (property descriptors) (later defined in Definition 3) that denote the properties of the concrete class; “K” denotes the key (or set of key-property names), whose value uniquely identifies each single instance of the class. The “” field is a textual message that explains the meaning of the class. To be valid, the set of property namesmust include the set K of key-property names; this constraint is formally expressed as
The notion of property (descriptor) for a concrete class is formalized by the following Definition 3.
Definition 3. Property and Property Descriptor.A “Property” represents an aspect or feature or attribute that characterizes a class. It is defined by a “Property Descriptor”, which is the following tuple:
where “” is the unique name (within the class) of the property, while “” denotes the type of property, i.e.,“categorical-closed"(the property is categorical and has a closed set of predefined values),“categorical-open"(the property is categorical and has an open set of values),“measure"(the property is numerical). The “” field defines the data type of property values, i.e.,“integer",“decimal",“boolean",“string",“date". The ”” field is the set of predefined values in the case that the property type is“categorical-closed".
The final “” field is a textual message that explains the property and its values.
In the Concrete Layer, inheritance is considered as well. Definition 4 defines the concept of “subclass”.
Definition 4. Concrete Subclass.A concrete class can be specialized into a “Concrete Subclass”, which is formally defined by the tuple:
where “” uniquely identifies the subclass, “” is the set of “local properties” of the subclass (where a property is defined as in Definition 3), and “” is the name of the super- class that is specialized by . Finally, the “” field is a textual message that explains the meaning of the subclass. A subclass inherits the properties and the key from its direct super-class. Thus, the full set of properties and the key of a concrete subclass are denoted as:
1. | | |
2. | | |
To be valid, the subclass must respect the following constraint:
that is, no property names inherited from the super-class can be redefined. Once defined the concepts of concrete class and subclass, the notion of “instance” must be defined.
Definition 5. Instance.Consider a concrete class and the pool of its subclasses. Let us denote with the aggregated set of property names.
An “Instance” o of the concrete class is a key-value map , where is a property name belonging either to the concrete class or to any of its subclasses. If a name is not defined in the o instance, it is assumed that null, meaning that the value for k is missing.
Notice that one single instance o of the concrete class covers all the subclasses of . Depending on the specific data source the instance comes from, it could happen that some property is not defined; in this case, the property has the null value.
Furthermore, based on Definition 5, it is possible to say that the RADAR DB is a “container of instances”, in which all known instances of concrete classes and their subclasses are stored.
Example 2. The concrete layer in the RADAR Schema for the running example is zoomed in Figure 4; Figure 5 depicts the properties for the concrete classes. Four concrete classes are in the layer, which are hereafter explained. The Cession concrete class denotes the contract by means of which a loan has been bought by the bank; in practice, it denotes the loan itself. The Financing_State concrete class denotes the state of a loan at a given time (typically, a month). It reports the number of paid and unpaid installments, the residual debt and so on; there is an instance for each passed installments, to represent the overall history of the loan.
The Amortization_Plan concrete class reports the full amortization plan for each instance of the Financing_State concrete class; for example, if the loan has 120 installments, for each instance of the Financing_State concrete class there are 120 instances of the Amortization_Plan concrete class. Why this apparent redundancy? Because the amortization plan can change over time and flows coming from external banks report, for each new state of the loan, the full list of past and future installments.
Finally, the Deposit concrete class denotes deposits that guarantee the loan.
The Financing_State class has two subclasses, named Moodys_State and BCE_State, which describe how the state of a loan is evaluated based on Moody’s (one of the World’s most important private rating agencies) rules and BCE (the European Central Bank) rules, respectively.
4.3. Mapping Layer
Actual sources for data to import into the RADAR DB and manage by means of concrete classes are external tables possibly stored in relational databases. Many different tables could provide instances of the same concrete class. The role of the Mapping Layer is to map external tables to concrete classes.
A preliminary definition is necessary.
Definition 9. Attribute Descriptor.An “Attribute Descriptor” is defined by the tuple
where “” corresponds to the name of the attribute and “” corresponds to its data type (i.e.,“integer",“decimal",“boolean",“string"and“date"). The “” field is a textual message provided to explain the meaning of the attribute. The attribute descriptor is used to define the schema of an external table.
Definition 10. External Table.An “External Table” is described by the following tuple:
where “” is the name of the table in the database denoted by “”, while ”A” is a list of attribute descriptors (see Definition 9) in the table (thus, “A” is the ”schema” of the table). Finally, “” is a textual message provided by technicians that import external tables. By means of Definition 10, it is possible to define a generic external table that could be either stored within a relational database or stored as a CSV (Comma-Separated Value) file. In this regard, is the connection string to the data source.
4.3.2. RADAR Rule Language
The RADAR Data Model is completed by the RADAR Rule Language. In fact, not all the properties of a concrete class are necessarily mapped to attributes of external tables. To cope with this situation, the RADAR Rule Language was introduced: the goal of a RADAR Rule is to assign a value to properties without value, which, otherwise, would have the null value. These properties are called “derived properties”. Therefore, the RADAR Rule Language provides a non-procedural way to define the values of derived properties.
RADAR Rules are defined as “Condition-Action” (CA) rules [
47] since their action is executed when their condition is true. By means of the following definitions, rules and their execution semantics are defined.
Definition 12. Rule.A “RADAR Rule” is defined by the following tuple:
where “” uniquely identifies the rule, “” is the concrete class which the rule is applied to, “” is a logical condition based on comparison predicates possibly composed by logical operators; finally, “” is a non-empty list of assignments based on the=operator, where the left-hand side is the name of a property p, while the right-hand side is an expression that possibly refers to properties. The “” field is a textual annotation provided by rule designers. Both the condition and the assignment expressions can refer to properties of , as well as to properties in concrete classes reachable from by navigating either Look-Up Relationships or Virtual Relationships by means of thereachoperator (see Definition 13).
As reported in Definition 12, conditions and assignment expressions possibly must navigate the RADAR Schema by means of the Reach operator, defined in Definition 13.
Definition 13. ReachOperator.The syntax of theReachoperator in the RADAR Rule Language is
where “” is the concrete class to reach though the operator, by navigating both Look-Up Relationships and Virtual Relationships specified in the non-empty repetition of pairs “via”. The operator returns the instances of the target class reached by navigating the list of relationships reported in the parentheses; if the sequence of relationships does not reach the target class (broken sequence or wrong class) the operator returns no instance.
The dot notation can be applied after theReachoperator, to refer to properties in the instances returned by the operator application.
The reader can find examples of
RADAR Rules in
Listing 1, which will be discussed in the example below.
At this point, Definition 14 can define the execution semantics of RADAR Rules.
Definition 14. Execution Semantics of Radar Rules.Consider a RADAR Rule and the set of instances of the concrete class for which the rule is defined; these instances are known as the “Native Instances” and are denoted as .
Each native instance is combined with the set of instances returned by applying the i-th occurrence of theReachoperator in , by moving from the n native instance; the set of “Compound Instances” is obtained (where k is the number of occurrences of theReachoperator in the rule).
For each compound instance for which is true, the assignments in are performed on the corresponding source instances (i.e., while is obtained by combining the images of the instances, the assignments are performed on instances actually stored in the RADAR DB).
The rationale behind the RADAR Rule Language is to provide the Mapping Layer with a tool to complete missing properties. In fact, properties in concrete classes are independent of the presence of a corresponding attribute in the input external tables; often, many properties must be derived or computed based on other properties. RADAR Rules provide the way to complete new instances of concrete classes.
Example 7. The RADAR Schema of the running example is completed by RADAR Rules that are necessary to evaluate those properties in concrete classes that cannot be mapped onto attributes in external tables. Listing 1 reports three rules. TheJunk_BCE_Ratingrule must evaluate the value for theBCE_Ratingproperty in theBCE_Stateconcrete class, which is a subclass of theFinancing_Stateclass (remember that a concrete class and its subclasses share the same instances, see Definition 5). Once a new instance is added to the class, the condition is evaluated; specifically, it is true when the number of non-paid installments is greater than the number of residual installments multiplied by 1.3. If the condition is true, the action assigns the string“Junk"to theBCE_Ratingproperty.
Listing 1.
Three RADAR Rules for the running example.
Rule: Junk_BCE_Rating |
Class: BCE_State |
Condition: Not_Paid_Installments > Residual_Debt * 1.3 |
Action: BCE_Rating=“Junk" |
|
Rule: Default_Moodys_Rating |
Class: Moodys_State |
Condition: BCE_Rating == “Junk" AND |
Reach(Cession via Related_Cession).PerformingCategory == “Default" |
Action: Moodys_Rating=“D" |
|
Rule: Performing_Cession |
Class: Cession |
Action: Performing_Category = “Performing" |
Rule: Junk_BCE_Rating |
Rule: Junk_BCE_Rating |
TheDefault_Moodys_Ratingrule behaves similarly. Its condition exploits theReachoperator (see Definition 13) to get the value of thePerforming_Categoryproperty in the related instance of theCessionconcrete class; it moves from the current concrete class (i.e.,Moodys_State, which is a subclass of theFinancing_Stateconcrete class) and via theRelated_Cessionlook-up relationship obtains the associated instance of theCessionconcrete class; the dot notation completes the reference to the desired property. Substantially, the rule behaves this way: it assigns the“D"value (that stands for “Default") to theMoodys_Ratingproperty of the evaluated instance of theMoodys_Stateconcrete class, which is a subclass of theFinancing_Stateconcrete class, if itsBCE_Ratingproperty is“Junk"and thePerforming_Stateproperty in the referred instance of theCessionconcrete class is“Default". Notice that theBCE_Ratingproperty is not within theMoodys_Statesubclass; however, theFinancing_Stateclass and its subclasses (BCE_StateandMoodys_State) share the same instances, so theBCE_Ratingproperty is visible to the rule.
Finally, thePerforming_Cessionrule is associated with theCessionconcrete class and has the goal of assigning the value"Performing"to thePerforming_Categoryproperty of the instance of theCessionconcrete class it is evaluated on. This time, the condition exploits theReachoperator (see Definition 13) to reach instances of theDepositconcrete class via theReference_Depositvirtual relationship; if there is at least one associated deposit whose value of theDeposit_Amountproperty is greater than the value of theAmountproperty of the cession that the rule is evaluating, the condition is true and the action is executed.