Software-intensive systems are frequently attacked. High fines or loss of reputation are potential consequences of not maintaining confidentiality, which is an important security objective. Detecting confidentiality issues in early software designs enables cost-efficient fixes. A Data Flow Diagram (DFD) is a modeling notation that focuses on essential, functional aspects of such early software designs. Existing confidentiality analyses on DFDs support either information flow control or access control, which are the most common confidentiality mechanisms. Combining both mechanisms can be beneficial, but existing DFD analyses do not support this. This lack of expressiveness requires designers to switch modeling languages to consider both mechanisms, which can lead to inconsistencies. In this article, we present an extended DFD syntax that supports modeling both information flow and access control in the same language. This improves expressiveness compared to related work and avoids inconsistencies. We define the semantics of extended DFDs by clauses in first-order logic. A logic program made of these clauses enables the automated detection of confidentiality violations by querying it. We evaluate the expressiveness of the syntax in a case study. We attempt to model nine information flow cases and six access control cases. We successfully modeled fourteen out of these fifteen cases, which indicates good expressiveness. We evaluate the reusability of models when switching confidentiality mechanisms by comparing the cases that share the same system design, which are three pairs of cases. We successfully show improved reusability compared to the state of the art. We evaluate the accuracy of confidentiality analyses by executing them for the fourteen cases that we could model. The analyses achieved good accuracy.
In software-intensive systems, software contributes an essential influence on the design, construction, deployment, and evolution of the system as a whole (Institute of Electrical and Electronics Engineers, 2000). Consequently, software-intensive systems certainly cover all software systems but also cover, for example, modern production systems, cyber–physical systems or the internet of things. Many attacks target software-intensive systems (Deogirikar and Vidhate, 2017, Sadeghi et al., 2015). Thus, establishing and maintaining security of software-intensive systems is necessary. There are various security objectives that shall be established. Confidentiality, which is one of these security objectives, ensures that “information is not made available or disclosed to unauthorized individuals, entities, or processes” (International Organization for Standardization, 2018). Confidentiality is hard to achieve in software-intensive systems (Alguliyev et al., 2018) but it is important to consider in order to avoid high penalties and loss of reputation. Strong data protection regulations such as the General Data Protection Regulation (GDPR) (Union, 2016) of the European Union carry high financial penalties for failing to protect the data of users. For instance, British Airways is facing a penalty (Denham, 2020a) of £20m and Marriott International is facing a £18.4m penalty (Denham, 2020b) because of confidentiality breaches. Another threat to companies is loss of reputation after information disclosure. For instance, Facebook users lost trust (Weisbaum, 2018), which also affected the market value, after the Cambridge Analytica scandal (Isaak and Hanna, 2018).
Considering confidentiality is not a small polishing step in the development process but has to be addressed from the very beginning. Big software vendors like Microsoft already consider confidentiality in all development phases (Microsoft Corporation, 2020). Considering confidentiality in the software design is especially crucial to avoid a significant increase in the overall development effort: Boehm et al. (1975) reported that fixing an issue becomes more expensive the later it is fixed. Therefore, issues should be fixed as early as possible in the development process. The same holds for security issues in the development process (Microsoft Corporation and iSEC Partners, Inc., 2009, Hoo et al., 2001, McGraw, 2006). This is critical because design issues cause about 50 % of all security issues (McGraw, 2006). Ensuring proper software designs does not free developers from considering confidentiality in the remaining phases, but it builds a solid foundation for further phases by identifying and fixing fundamental issues that can barely be fixed later, even with considerable effort.
Model-based confidentiality analyses are appropriate for identifying confidentiality violations caused by a confidentiality issue in software design, as Jürjens (2005b) demonstrated as part of a case study. A confidentiality violation is a detectable violation of a confidentiality requirement such as a system that receives data, to which it should not have access. A confidentiality issue is the reason why a confidentiality violation occurs. For instance, a system might acquire wrong data because of a wrong service call. Manual inspections of system designs can detect confidentiality violations but this task is complex and labor-intensive, which impedes fast and early detection of violations. A modeling language that is not capable of representing the important aspects for detecting confidentiality violations makes the detection process even harder. Automated model-based confidentiality analyses operating on appropriate models have the potential to speed up finding violations (Tuma et al., 2020). Especially, model-based confidentiality analyses operating on DFDs are promising because security problems tend to follow the data flow (Shostack, 2014), i.e. to identify the cause of a violation, it is often necessary to follow the path that the data took. We already demonstrated that model-based confidentiality analyses based on software designs given as data flows can yield valuable results in Industry 4.0 settings in previous work (Al-Ali et al., 2019). DFDs are part of, among others, the curriculum of requirements engineering certifications, such as the IREB certification (Pohl and Rupp, 2015), and textbooks on requirements engineering, such as (Dick et al., 2017, Wiegers, 2005), which is why designers are usually familiar with DFDs and do not require a steep learning curve.
Confidentiality analyses must support access control and information flow control because both are important confidentiality mechanisms: Access control is the standard for protecting confidential data (Sabelfeld and Myers, 2003). Therefore, it is commonly used in practice. For instance, a system might violate an access control requirement by providing a user with information of a certain type, which should be kept secret from that particular user. Information flow control can detect information leaks by data propagation that allow drawing conclusions without direct data flows (Hedin et al., 2017). For instance, a system might violate an information flow requirement by providing a user with information that has been derived from other information, which in turn should be kept secret from that particular user. Simple information flow control approaches such as taint analysis (Arzt et al., 2014) are applied in practice but more powerful information flow control approaches such as fine-grained noninterference enforcements are not (Staicu et al., 2019). Access control and information flow control are valid options to use depending on the system and the development context. Even combinations of simple information flow control and access control are possible at implementation level (Xu et al., 2006, Wang et al., 2009), which can improve the protection of information. If modeling and analysis approaches are not capable of representing information flow and access control, the chances are high that they are not applicable in a significant amount of cases in practice.
This article addresses the automatic detection of confidentiality violations in data-oriented software designs. Related work such as Tuma et al. (2019), van den Berghe et al. (2018) and Alghathbar and Wijesekera (2003) (discussed in detail in Section 4) as well as our previous work (Seifermann et al., 2019) already suggested modeling languages and analysis semantics in order to realize automated confidentiality analyses of software designs. Nevertheless, we still see the need for further research because of the following challenges that neither related work nor our previous work has addressed comprehensively so far: (Ch1) A systematic consideration of all possible paths that data can take in a system design is necessary to find violations systematically. (Ch2) Modeling and analyzing information flow and access control within separate artifacts introduces consistency issues, so a consistent modeling and analysis approach that supports both confidentiality mechanisms is necessary. (Ch3) User-defined analyses are necessary to cope with specific analysis needs, which are hard or tedious to define in terms of established confidentiality mechanisms. We describe these challenges in more detail in Section 2. The following two contributions address these challenges:
(C1) Extended DFD Syntax. We specify an extended DFD syntax by a metamodel that addresses the previously described challenges via syntactical extensions for representing confidentiality mechanisms. The metamodel introduces the concept of alternative data flows via pins to represent multiple data sources and destinations (Ch1). The metamodel distinguishes between system parts that depend on particular confidentiality mechanisms and system parts that do not. Everything related to specific confidentiality mechanisms is encapsulated in extensions that can be defined by users (Ch3). An extension consists of confidentiality properties and behavior descriptions, i.e. descriptions of how the system changes these properties during its execution. The metamodel can represent information flow and access control (Ch2) by such extensions.
(C2) DFD Semantics for Confidentiality Analyses. We introduce analysis semantics based on label propagation that support various types of confidentiality analyses. Confidentiality properties are mapped to labels. Behavior descriptions are mapped to label propagation functions. An analysis is defined by a comparison of labels resulting from the label propagation with expected labels stemming from requirements. The comparison can cover information flow and access control analyses (Ch2) as well as user-defined analyses (Ch3). The semantics explicitly consider all possible data flows as well as their combinations, i.e. all data flow paths (Ch1).
We evaluate the presented modeling and analysis approach in a case study including fifteen cases. A case consists of a system, confidentiality requirements given in terms of a particular confidentiality mechanism as well as the properties and behaviors required to reason about confidentiality. We evaluate three aspects of the approach: the expressiveness in specifying systems and analyses, the reusability when replacing confidentiality mechanisms as well as the accuracy of analyses. We evaluate information flow analyses on nine cases and access control analyses on six cases. All cases used to evaluate information flow analyses and half of the cases used to evaluate access control analyses stem from related work. The results indicate good expressiveness and accuracy as well as improved reusability compared to the state of the art.
The remainder of this article is structured as follows. Section 2 describes the three challenges that we address. We describe the running example to illustrate our approach throughout the article in Section 3. Section 4 covers the discussion of the state of the art in DFD semantics as well as design time confidentiality analyses. An overview on how the approach works is given in Section 5. The core contributions are the syntax and the semantics, which we describe in Sections 6 and 7, respectively. We show how to detect confidentiality violations using both contributions in Section 8. We briefly report on our tooling in Section 9. Section 10 presents the evaluation of the expressiveness and reusability of the syntax as well as the accuracy of defined analyses. Section 11 concludes the article.
In this section, we describe the challenges in using the DFD syntax of DeMarco (1979) for detecting violations of confidentiality requirements. DFDs as introduced by DeMarco (1979) are graphs presenting a functional viewpoint on systems based on data processing. There are only four fundamental elements: Data flows are unidirectional edges that connect nodes to describe a data transmission between them. Source and sink nodes (also called actors) start or terminate a sequence of data flows. Process nodes transform incoming data to outgoing data. File nodes (also called stores) persist and emit data. DeMarco describes the semantics of DFDs in an intuitive but incomplete way, so there is no standard semantics.
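The four fundamental elements can be pictured as a small graph data structure. The following is only a minimal sketch in Python, not the metamodel of this article: all class names, field names and the example nodes (`bookFlight`, `Bookings`, `flightRequest`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    ACTOR = "actor"      # source or sink node
    PROCESS = "process"  # transforms incoming data to outgoing data
    STORE = "store"      # file node: persists and emits data


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind


@dataclass(frozen=True)
class DataFlow:
    source: Node   # node that sends the data
    target: Node   # node that receives the data
    data: str      # name of the transmitted data item (hypothetical field)


@dataclass
class DataFlowDiagram:
    nodes: set = field(default_factory=set)
    flows: list = field(default_factory=list)

    def add_flow(self, source, target, data):
        """Add a unidirectional data flow and register both end nodes."""
        self.nodes.update((source, target))
        flow = DataFlow(source, target, data)
        self.flows.append(flow)
        return flow

    def outgoing(self, node):
        """Return all data flows that start at the given node."""
        return [f for f in self.flows if f.source == node]


# Illustrative diagram: an actor sends a request to a process,
# which writes the result to a store.
user = Node("User", NodeKind.ACTOR)
book = Node("bookFlight", NodeKind.PROCESS)
bookings = Node("Bookings", NodeKind.STORE)

dfd = DataFlowDiagram()
dfd.add_flow(user, book, "flightRequest")
dfd.add_flow(book, bookings, "booking")
```

The sketch captures structure only. It says nothing about how a process transforms data, which is exactly the part that the intuitive semantics leave open.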
The lack of full-fledged semantics and shortcomings of the simple syntax make automated analyses of DFDs challenging. Especially, we see the following three open challenges that have not been addressed sufficiently yet.
(Ch1) Exploration of multiple data flow paths. A data flow path is a sequence of nodes that a data item passes to reach a particular node. Multiple paths providing the same type of data to the same node commonly occur in realistic applications. For instance, branches can change call destinations and thereby also the destination of sent data. Multiple calls arriving at a certain location imply multiple sources of data for the callee. Modeling approaches have to provide means for describing these multiple paths to represent realistic system designs. The corresponding analysis approaches have to consider all of these paths in a systematic way to detect possible violations. Often, not all combinations of data flows form a valid data flow path from a logical point of view. Therefore, modeling approaches should provide means to specify valid combinations. A common approach to treat multiple data flows is to require an explicit selection of one particular path before the analysis, but this does not scale well: in theory, the cross product of all possible choices at every node in a DFD has to be considered if no specification of valid paths is available.
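To make the scaling argument concrete, the following Python sketch enumerates all cycle-free data flow paths between two nodes and counts the explicit selections a designer would otherwise have to make. It is an illustration under assumptions, not the analysis defined in this article: the function names, the node names and the example topology are hypothetical.

```python
from math import prod


def data_flow_paths(flows, start, end):
    """Enumerate all cycle-free node sequences leading from start to end."""
    successors = {}
    for source, target in flows:
        successors.setdefault(source, []).append(target)

    paths = []

    def visit(node, path):
        if node == end:
            paths.append(path)
            return
        for nxt in successors.get(node, []):
            if nxt not in path:  # do not follow cycles
                visit(nxt, path + [nxt])

    visit(start, [start])
    return paths


def selection_count(alternatives_per_node):
    """Size of the cross product of choices if one alternative
    has to be selected manually at every node."""
    return prod(alternatives_per_node.values())


# Hypothetical topology: the airline can be reached directly
# or via the travel agency.
flows = [
    ("User", "TravelPlanner"),
    ("TravelPlanner", "TravelAgency"),
    ("TravelAgency", "Airline"),
    ("TravelPlanner", "Airline"),
]
paths = data_flow_paths(flows, "User", "Airline")

# Three nodes with 2, 3 and 2 alternatives already yield 12 selections.
selections = selection_count({"A": 2, "B": 3, "C": 2})
```

The number of selections grows multiplicatively with every node that offers alternatives, which is why a manual choice of one path per analysis run quickly becomes impractical.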
(Ch2) Coverage of multiple confidentiality mechanisms. Usually, DFDs require extensions to capture the information required to conduct confidentiality analyses. Single-purpose models and analyses cover the targeted phenomena well and provide accurate analyses. However, the downside of single-purpose approaches is the lack of flexibility, i.e. designers have to choose a particular confidentiality mechanism, e.g. information flow or access control, before they start modeling. Switching to another confidentiality mechanism implies remodeling large parts of the system in the new modeling language even if fundamental parts, such as the system structure, could be reused. Remodeling large parts may imply consistency problems: software designers have to ensure that the shared part of both models actually represents the same design. Creating (automated) mappings between two single-purpose models is possible in general, but this kind of consistency management is challenging if the languages diverge too much (Torres et al., 2020). A feasible approach for addressing this consistency problem when switching confidentiality mechanisms is necessary.
(Ch3) User-defined confidentiality analyses. Requirements to keep information confidential can be formulated in various ways. However, when designers are forced to use predefined confidentiality mechanisms, even simple requirements, such as that a certain piece of information must not flow to one specific node, can become complex: In Role-based Access Control (RBAC), a designer has to specify roles and assign these roles to data and nodes in a way that the simple policy can be checked by comparing roles. In information flow, a designer has to do roughly the same steps but for labels instead of roles. Defining custom analyses can be easier. To do so, designers need means for specifying custom analyses and corresponding modeling concepts. As a side effect, this would also allow integrating new confidentiality mechanisms. An underlying formalism supporting analyses of various confidentiality mechanisms as well as an appropriate modeling language is needed to provide such means.
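The difference in effort can be illustrated with a small Python sketch that expresses the same simple policy twice: once as a custom check and once via an RBAC-style encoding. This is a hedged illustration only. The flows, the role name `ccd_reader` and both functions are hypothetical, and the RBAC check is a strongly simplified stand-in for a real RBAC model.

```python
# Each flow is (source node, target node, data item); all names are illustrative.
flows = [
    ("CreditCardCenter", "TravelPlanner", "ccd"),
    ("TravelPlanner", "Airline", "ccd"),
    ("TravelPlanner", "TravelAgency", "flightQuery"),
]


def custom_violations(flows, data, forbidden_node):
    """Custom analysis: `data` must never arrive at `forbidden_node`."""
    return [f for f in flows if f[2] == data and f[1] == forbidden_node]


# RBAC-style encoding of the same policy: the designer first has to
# invent a role and assign it to every node and data item.
node_roles = {
    "CreditCardCenter": {"ccd_reader"},
    "TravelPlanner": {"ccd_reader"},
    "Airline": {"ccd_reader"},
    "TravelAgency": set(),
}
data_roles = {"ccd": {"ccd_reader"}, "flightQuery": set()}


def rbac_violations(flows, node_roles, data_roles):
    """A flow is a violation if the data item is restricted to some roles
    and the receiving node holds none of them."""
    violations = []
    for source, target, data in flows:
        permitted = data_roles[data]
        if permitted and not (permitted & node_roles[target]):
            violations.append((source, target, data))
    return violations


# A faulty design that leaks the credit card data to the travel agency.
leak = ("TravelPlanner", "TravelAgency", "ccd")
faulty_flows = flows + [leak]

custom_result = custom_violations(faulty_flows, "ccd", "TravelAgency")
rbac_result = rbac_violations(faulty_flows, node_roles, data_roles)
```

Both checks report the same leaking flow, but the RBAC variant needs two additional assignment tables that exist only to encode a policy about a single data item and a single node.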
3. Running example
To illustrate the concepts described in this article as well as the limitations of the state of the art, we use the TravelPlanner case study (Katkalov et al., 2013) of iFlow as a running example. The case study consists of the four systems shown in Fig. 1: The travel planner app queries flights and books them on behalf of the user. The credit card center app manages the credit card information of a user. An airline service provides flight information and allows booking flights. A travel agency service mediates between the travel planner and the airline. The scenario is that users query flights, load their credit card data (CCD), book the flight with the airline and the airline pays a commission for mediating to the travel agency.
With respect to confidentiality, there are three totally ordered security levels: The first level User,Airline,Agency contains information accessible to all parties. The travel agency, airline and user have clearance for this level. The travel planner and credit card center apps belong to the user. Both apps and the user always have the same clearance. The second level User,Airline dominates, i.e., it is bigger than or at least equal to (⩾), the first level and contains information regarding the flight booking. The airline and user have clearance for this level. The third level User dominates the previous levels and contains information only meant for the user. The user has clearance for this level. The critical part of the system is that credit card information from level three must not be disclosed to entities with lower clearance level. However, the airline needs the credit card information to process the booking. Therefore, a declassification of the credit card data explicitly lowers the security level to the second level. If this declassification is missing, there is a violation of the information flow requirements.
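The three levels and the role of the declassification can be sketched in a few lines of Python. The sketch assumes that the total order can be encoded by integers and that a party may receive data if its clearance dominates the level of the data. The identifiers `Level`, `CLEARANCE`, `may_receive` and `declassify` are hypothetical and not taken from the case study.

```python
from enum import IntEnum


class Level(IntEnum):
    """Totally ordered security levels; a higher value dominates a lower one."""
    USER_AIRLINE_AGENCY = 1  # accessible to all parties
    USER_AIRLINE = 2         # information regarding the flight booking
    USER = 3                 # information only meant for the user


# Clearance of each party as described in the running example.
CLEARANCE = {
    "User": Level.USER,
    "Airline": Level.USER_AIRLINE,
    "TravelAgency": Level.USER_AIRLINE_AGENCY,
}


def may_receive(entity, data_level):
    """An entity may receive data if its clearance dominates the data level."""
    return CLEARANCE[entity] >= data_level


def declassify(data_level, target_level):
    """Explicitly lower the level of data; raising it is not a declassification."""
    if target_level > data_level:
        raise ValueError("declassification must not raise the security level")
    return target_level


ccd = Level.USER  # credit card data starts at the third level

# Without declassification, sending the data to the airline is a violation.
violation_without_declassification = not may_receive("Airline", ccd)

# After the explicit declassification to the second level, the airline
# may receive the data, but the travel agency still may not.
declassified_ccd = declassify(ccd, Level.USER_AIRLINE)
```

Using `IntEnum` keeps the comparison with `>=` aligned with the dominance relation (⩾) of the three levels.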