/ Documentation / LiSA’s Structure

LiSA’s Structure

Note:
This website describes LiSA’s architecture and provides guides on how to use and extend it. It is intended to be valid for the latest stable release of LiSA, but should be compatible with versions 0.2 and later. Signatures or packages might differ in older versions, but the overall architecture and design principles should remain the same.

Warning:
The documentation is still under construction: some pages are still missing.

This section describes the internal structure of LiSA, providing an overview of its main components and how they interact with each other.

The codebase of LiSA is split into four projects: lisa-sdk, lisa-analyses, lisa-program, and lisa-imp. The lisa-sdk project contains the core infrastructure of LiSA, including the analysis engine, the control flow graph representation, and the framework for defining analyses and checks. The lisa-analyses project implements a collection of ready-to-use analyses that can be employed out of the box. The lisa-program project provides simple Statement implementations that can be used for quick prototyping of analyzers and for reference. Finally, the lisa-imp project contains an implementation of the IMP language, which is used for testing and demonstration purposes. The documentation in this section focuses primarily on the lisa-sdk project, as it forms the foundation of LiSA’s architecture. Documentation on the IMP language can be found in the IMP section, while a list of availble analyses is present under Configuration. For information on how to build LiSA and build upon it, refer to the Get Started section.

A high-level overview of LiSA’s structure is shown in the following diagram:

Intuitively, the analysis starts on a program P, written in a programming language i, composed by a set of files. The Frontend for language i parses the source files of P and translates them into a LiSA’s Program, which is a mainly composed by a set of control flow graph (CFGs). To keep things simple, we assume that the program is written in a single language, but the process can be extended by combining multiple frontends. The CFGs are then passed to LiSA, where the analysis starts.

The analysis begins by applying some validation checks on the program to ensure its correctness. Then, the Syntactic Checks provided in the configuration are executed, which inspect the program’s structure without performing any semantic analysis. After that, the Interprocedural Analysis is carried out, which computes a whole-program fixpoint. Such a fixpoint is computed by analyzing each CFG in the program, delegating to each Statement (the nodes of the CFG) the task of defining how it affects the program state. Each statement can perform a combination of operations, since it represent high-level constructs (e.g., calls, conditionals) rather than low-level instructions (e.g., arithmetic operations, comparisons). Each operation is either:

a call to some other CFG, in which case the statement delegates back to the Interprocedural Analysis the computation of the result;
an atomic operation whose effect depends on the abstract domain used for the analysis, in which case the statement forwards the operation (in the form of a Symbolic Expression) to the abstract domain currently in use.

Upon delegation to the Interprocedural Analysis, the latter computes the result of the call by analyzing each possible target. Target resolution is performed by the Call Graph, which computes all possible call targets by relying on both the program structure and the language-specific algorithms for call resolution (part of the Language Features). This process enables the simplification of abstract domains, as they do not need to handle call resolution on their own.

When a global fixpoint is reached, the analysis ends. The results of the analysis are then passed to the Semantic Checks, which inspect the analysis results to detect potential issues in the program. Finally, the Outputs of the analysis are generated, which include warnings, graphs, and reports. During the whole analysis, each component can emit Events to an event queue, where registered listeners can process them, either synchronously or asynchronously, to produce messages, logs, output files, or implement any custom behavior.

In the schematics above, components with blue borders are language-dependent, meaning that they must be defined once for each programming language that LiSA is used to analyze. Components with green borders are language-independent and take part modularly in the analysis, allowing for easy extension and customization of LiSA’s capabilities.

A more in-depth explanation of each component can be found in the pages linked below, together wiht pointers to the corresponding classes. For a step-by-step guide on how to build each component, refer to the Get Started section.

Analysis Components

Analysis components are the building blocks of LiSA’s analysis engine. They define how the analysis is performed, how the results are produced, and the general workflow of the analysis. As highlighted by the diagram above, all components are configurable: they are defined through interfaces and abstract classes, allowing for easy extension and customization.

Lattices

The Lattice interface defines the core operations that the information produced by the analysis must support, following the Abstract Interpretation theory. This interface is implemented by all information produced by domains, fixpoints, and analysis results. Despite the name of the interface, it does not enforce a lattice structure, but rather a poset with some extra operations for retrieving abstractions of unknown values (top elements) and unreachable or erroneous values (bottom elements). Other operations have default implementations, but can be overridden to construct possibly complete lattices. LiSA also comes with common lattice implementations ready to use. Read more about lattices in the Lattices page.

Symbolic Expressions

SymbolicExpressions represent atomic semantic operations that the program executes. They are used to deconstruct high-level syntacitc constructs (i.e., the Statements, that are the instructions of the program) that can have ambiguous meaning depending on the context into simpler operations that have a well-defined semantics for abstract domains. As an example, Java’s addition operator’s (+) semantics can be produce either a NumericAddition symbolic expression (if both operands have numeric type), or a StringConcatenation symbolic expression (if at least one of the operands is of a string type). SymbolicExpressions are what abstract domains analyze: this allows domain definitions independent from the source programming language, since domains can then interpret this symbolic expression according to their own logic. Read more about symbolic expressions in the Symbolic Expressions page.

Semantic Domains

The SemanticDomain interface defines the operations that an abstract domain must implement to be used in LiSA’s analyses. It is also implemented by the non-extensible Analysis class, that is the outer-most domain that Statements, Interprocedural Analysis, and the fixpoints interact with during the analysis. Abstract domains are responsible for defining how the program state is manipulated during the analysis. They provide operations for handling assignments, expressions, and conditionals. Read more about semantic domains in the Semantic Domains page.

The Simple Abstract Domain

In LiSA, an abstract domain is responsible for tracking the whole program state, including the values and types of variables and expressions, and the structure of the memory. This can be a complex task, especially for languages with rich features and constructs. To ease the development of new abstract domains, LiSA provides the SimpleAbstractDomain class, which implements a simplified model in which the program state is divided into three main components:

the memory domain, which tracks the structure of the memory;
the type domain, which tracks the types of variables and expressions;
the value domain, which tracks the values of variables and expressions.

This separation allows domain developers to focus on specific aspects of the program state, without worrying about the entire state management. Each component can be implemented independently, and combined to form a complete abstract domain. Read more about the simple abstract domain in the Simple Abstract Domain page.

The Interprocedural Analysis

The InterproceduralAnalysis interface defines how the whole-program analysis is performed. It is responsible for orchestrating the analysis of each CFG in the program, and for computing the results of calls. These two tasks are tightly coupled: if the analysis has to proceed top-down starting from a main function, then calls are resolved by analyzing the target CFGs on-the-fly; if the analysis proceeds bottom-up, then calls are resolved by retrieving pre-computed summaries of the target CFGs. For this reason, a single entity is responsible for both tasks. Individual CFGs are analyzed by delegating to a Fixpoint instance, which computes the fixpoint for that specific CFG. Read more about the interprocedural analysis in the Interprocedural Analysis page.

The Call Graph

Solving calls is a fundamental task in interprocedural analyses, mainly relying on the types of the call’s parameters and on how the programming languages matches call signatures to their targets. While the latter is fixed for a given language (a detailed discussion is present later on this page, in the Language Features and Type System section), the former can be carried out in different ways. To decouple call resolution from the interprocedural analysis, LiSA introduces the CallGraph interface, which defines how call targets are resolved. The call graph is queried by the interprocedural analysis whenever a call is encountered, and it returns the set of possible targets for that call. Read more about the call graph in the Call Graph page.

Semantics of Statements and Expressions

The aim of LiSA is to be applicable not only to several programming languages, but also to a variety of analyses. This entails that the specification of the semantics of the code under analysis must be independent from the abstract domains used in the analysis. LiSA adopts an analysis-time rewriting towards Symbolic Expressions, that allows the semantics to be tuned relying on the invariants computed by the fixpoint engine. Each Statement and Expression defines its own forwardSemantics and backwardSemantics methods, that is implemented by feeding symbolic expressions to the analysis being executed. Read more on semantics definitions in the Instruction Semantics page.

Syntactic and Semantic Checks

A Check is simply a visitor of the program, that provides hooks to inspect various components of the program structure. There are two types of checks:

SyntacticChecks, which inspect the program structure without relying on any semantic information;
SemanticChecks, which inspect the results of the analysis to detect potential issues in the program.

Checks are executed at specific points during the analysis: syntactic checks are executed right after the validation phase, while semantic checks are executed after the interprocedural analysis has reached a fixpoint. Read more about checks in the Checks page.

Analysis Events

Several Events are emitted during the analysis, to signal the occurrence of specific situations. Events can be consumed by EventListeners, which can process them either synchronously or asynchronously. This mechanism allows for decoupling the analysis from side-tasks, such as logging, output generation, or custom behaviors. Read more about events in the Analysis Events page.

Outputs

Several outputs can be generated by LiSA, either after the analysis completes or during its execution. Outputs can include graphs, reports, and other support files that provide insights into the analysis results. Outputs files are generated using the FileManager class, that provides a simple interface for creating and managing output files. Read more about outputs in the Outputs page.

Program Structure

LiSA’s Program is a data structure that contains all of the code that has been parsed from the input files, together with the Type System and Language Features specific to the programming language of the input code.

Units

A Unit represents a logical grouping of code, such as a source file, a class, or a module. The Program itself is a Unit. A Unit contains a set of code members (CFGs with a descriptor) and a set of Globals (global variables or constants). Read more about units in the Units page.

Control Flow Graphs

A Control Flow Graph (CFG) represents the control flow of a single function, method, or procedure. It is composed of Statements (the nodes of the graph) and Edges (the directed connections between statements). Each CFG has a CodeMemberDescriptor, which provides metadata about the CFG, such as its name, its parameters, and its return type. A special kind of CFGs, called NativeCFGs, can be used to compactly represent the behavior of library or runtime functions. Read more about control flow graphs in the Control Flow Graphs page.

Types

As the aim of LiSA is to be language-agnostic, no assumptions on types are made. All types are represented by instances of the Type interface, which can be extended to represent the types of a specific programming language. Sub-interfaces are used to identity specific categories of types, such as primitive types, reference types, array types, and function types. Read more about types in the Types page.

Language Features and Type System

The LanguageFeatures interface defines language-specific algorithms and behaviors that are required during the analysis. These include algorithms for resolving call targets, traversing type hierarchies, and handling specific language constructs. The TypeSystem interface defines how types are managed and manipulated during the analysis. It provides operations for type checking, type inference, and type compatibility. Both interfaces must be implemented for each programming language that LiSA is used to analyze. Read more about language features and type systems in the Language Features and Type System page.

Statements, Expressions, and Edges

Statements are the nodes of a CFG, representing high-level constructs such as assignments, conditionals, loops, and calls. Each statement defines how it affects the program state during the analysis, by providing an implementation of the forwardSemantics() method. Expressions are Statements that evaluate to a value, such as literals, variable accesses, and binary operations. Edges represent the directed connections between Statements in a CFG. Similarly to Statements, each Edge defines its own traverseForward() method, which specifies how the edge affects the program state when traversed during the analysis. Read more about statements, expressions, and edges in the Statements, Expressions, and Edges page.

Frontends

Frontends are responsible for translating source code or compiled code into LiSA’s program representation. They parse the input files, build the corresponding CFGs, and assemble them into a Program instance that can be analyzed by LiSA. They also provide language-specific algorithms for, e.g., matching a call to its target or traverse a type hierarchy (see the Language Features and Type System section). Each programming language requires its own frontend, as the translation process is highly language-dependent. Read more about frontends in the Frontends page.

The IMP Language

The IMP language is a high-level, imperative, and dynamically typed programming language insipred by Java. In LiSA, it is used internally for testing analyses end-to-end, to showcase what a frontend should look like, and to provide a simple playground for prototyping new analyses. Read more about the IMP language in the IMP page.