An Embedded system is designed to perform a specialized task consistently and deterministically, with minimal intervention. It is safety-critical if its failure can cause significant harm to life, property, or the environment. A nuclear power plant's temperature control system and a home thermostat perform a similar function but differ greatly in the impact of a failure: the former is safety-critical and the latter is not. Safety-critical systems are more common in some domains than in others, notably power generation, medical diagnostics, transportation (aerospace and aviation, railways, and automotive), and defense.
IEC 61508 defines safety as “freedom from unacceptable risk of physical injury or of damage to the health of people, either directly or indirectly as a result of damage to property or to the environment.” ISO 26262, the automotive standard derived from IEC 61508, defines functional safety as the “absence of unreasonable risk due to hazards caused by malfunctioning behaviour of E/E systems,” where E/E stands for electrical and/or electronic.
Is it possible to build a failure-proof system?
It is impossible to build a failure-proof system. What a safety-critical Embedded system can do is handle failures predictably and minimize their impact. The project team should specify the system's failure modes, failure rates, and failure handling mechanisms in the product definition phase, and the operators and regulatory authorities should approve them.
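As a rough illustration of specifying failure modes together with their handling, the sketch below maps every failure mode to exactly one reaction in a lookup table. The fault codes, the reactions, and the names `fault_t`, `reaction_t`, and `fault_reaction` are hypothetical; a real project would derive them from its own risk analysis.

```c
#include <assert.h>   /* static_assert (C11) */

/* Hypothetical failure modes, as they might appear in a product definition. */
typedef enum {
    FAULT_NONE = 0,
    FAULT_SENSOR_OUT_OF_RANGE,
    FAULT_WATCHDOG_TIMEOUT,
    FAULT_MEMORY_CORRUPTION,
    FAULT_COUNT                 /* number of specified fault codes */
} fault_t;

/* Hypothetical reactions. The safe state is deliberately value 0, so a
 * table entry that was never filled in defaults to the safe state. */
typedef enum {
    REACTION_SAFE_STATE = 0,    /* switch outputs off and stay there */
    REACTION_DEGRADED,          /* keep running with reduced functionality */
    REACTION_CONTINUE           /* normal operation */
} reaction_t;

/* One specified reaction per failure mode. */
static const reaction_t reaction_table[] = {
    [FAULT_NONE]                = REACTION_CONTINUE,
    [FAULT_SENSOR_OUT_OF_RANGE] = REACTION_DEGRADED,
    [FAULT_WATCHDOG_TIMEOUT]    = REACTION_SAFE_STATE,
    [FAULT_MEMORY_CORRUPTION]   = REACTION_SAFE_STATE
};

/* The build fails if a fault code is appended to fault_t
 * without extending the table. */
static_assert(sizeof reaction_table / sizeof reaction_table[0]
                  == (unsigned)FAULT_COUNT,
              "reaction_table must cover every fault code");

/* Returns the specified reaction. Any code outside the specified set is
 * treated as an unknown failure and leads to the safe state. */
reaction_t fault_reaction(int fault_code)
{
    if (fault_code < 0 || fault_code >= (int)FAULT_COUNT) {
        return REACTION_SAFE_STATE;
    }
    return reaction_table[fault_code];
}
```

The design choice worth noting is that unknown or unlisted faults never fall through to normal operation, which keeps the behavior predictable even for failures nobody anticipated.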
The design of a safety-critical system starts with “Risk Analysis”. In this step, the project team identifies the system's failure scenarios with their likelihood and consequences. The analysis concludes with safety and system requirements for the product, both for hardware and software. The safety requirements include expected failure rates, error handling, and other applicable safety constraints, while the system requirements list the functional and non-functional system behavior needed for system modelling. The safety processes followed in hardware and software development differ only in a few phases.
In this article, we focus on the processes recommended before starting software development for safety-critical embedded systems.
Importance of a process
The V-model represents the recommended phases for developing safety-critical software. Before moving from one phase to the next, the project team must complete all the steps of the current phase, including verification and validation of its outputs.
A safety-critical system must adhere to its specifications on failure rates. Even in the case of a failure, the system must follow a predictable course. To achieve this, every project team member must religiously follow the processes defined for each step of software development, laid out right at the beginning of the project. In some applications, the authorities do not permit the system's use unless approved by third-party auditors and government regulatory bodies. Hence the team should involve third-party auditors to review the system's safety specifications. The team should also factor in ease of audit while defining processes or workflows. For example, the process for code review workflow should mandate the documentation of any informal code walk-throughs for ready reference by the auditors.
Documentation
Documentation is one of the most important steps in developing safety-critical Embedded systems. The availability of the documents and review reports during audits is a good indicator of adherence to the processes.
The project team must document the software requirements unambiguously. The system architects and safety experts must check and approve the requirements. Clearly defined and documented requirements reduce the chances of buggy software and save significant time and effort later. The team members can also use the documentation as a reference during development and testing.
After finalizing the software requirements, the team should define, document, and review the software architecture, which is the first step in the development process. Poorly architected software is likely to be unstable, unpredictable, and difficult to maintain. Hence software architects must weigh safety, predictability, performance, and maintainability in every decision. They should also consider the microcontroller's peripheral features, limitations, and access times, and document how the software uses them and what the impact is. The architects should also record the design approaches they considered but rejected, along with the reasons. For software partitioned into modules or units, the project team should document each unit's concurrency analysis and critical data along with its functional requirements, non-functional requirements, and interfaces. It should also record the dependencies between units so that the linkage can be understood and analyzed.
Testing
For safety-critical Embedded systems, the team should prepare the test plan along with the system requirements. Every requirement or design document should be accompanied by a draft test specification at the corresponding level. This test specification may not be complete, but writing it before implementation prevents “code bias”: once the code exists, a tester may unconsciously write tests that satisfy the implementation rather than the real requirements. Tests defined at the requirement level largely avoid this bias, and when they are executed, any deviation from the requirements stands out. Requirement engineers and software architects should review the tests, and for certain test levels it is advisable to involve safety experts as well. The test plan can be organized so that a set of test cases is executed at each development milestone. The results can then guide changes to the development approach or the architecture to improve behavior and performance. All test levels must be executed: unit tests, software integration tests, software/hardware integration tests, and overall software tests. Separate test levels simplify root cause analysis because each level narrows down where an issue can originate.
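To make the idea of requirement-level tests concrete, here is a minimal sketch in which each test vector is written from a requirement and carries that requirement's ID. The requirement IDs (`SWR-HEAT-01`, `SWR-HEAT-02`), the temperature limits, and the function `heater_allowed` are all invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical requirements:
 * SWR-HEAT-01: the heater shall be off at or above 85 degC.
 * SWR-HEAT-02: readings outside -40..150 degC shall be treated as
 *              implausible and the heater shall be off. */
#define TEMP_CUTOFF_C     85
#define TEMP_SENSOR_MIN_C (-40)
#define TEMP_SENSOR_MAX_C 150

/* Unit under test: decides whether the heater may be switched on. */
bool heater_allowed(int temp_c)
{
    if (temp_c < TEMP_SENSOR_MIN_C || temp_c > TEMP_SENSOR_MAX_C) {
        return false;               /* implausible reading: stay off */
    }
    return temp_c < TEMP_CUTOFF_C;
}

/* One test vector, derived from the requirement text, not from the code. */
typedef struct {
    const char *requirement_id;
    int         temp_c;
    bool        expected;
} test_vector_t;

static const test_vector_t vectors[] = {
    { "SWR-HEAT-01",  84, true  },  /* just below the cutoff */
    { "SWR-HEAT-01",  85, false },  /* boundary value */
    { "SWR-HEAT-01", 120, false },
    { "SWR-HEAT-02", -41, false },  /* below the valid sensor range */
    { "SWR-HEAT-02", 151, false }   /* above the valid sensor range */
};

/* Returns the number of failed vectors; 0 means every vector passed. */
int run_heater_tests(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof vectors / sizeof vectors[0]; ++i) {
        if (heater_allowed(vectors[i].temp_c) != vectors[i].expected) {
            ++failures;
        }
    }
    return failures;
}
```

Because the vectors can be written as soon as the requirements exist, the table can be reviewed by requirement engineers before any implementation is available.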
Traceability
Traceability is the practice of linking each input requirement to the requirements and architecture elements derived from it and to the tests that verify it. Each software requirement must correspond to at least one system or safety requirement, i.e., it must be “up-traced” to the system/safety level. The same software requirement must be realized in at least one software architecture element, i.e., “down-traced”. In addition, every requirement must be linked to at least one test case, i.e., “side-traced”. The up, down, and side directions refer to positions in the V-model. The traceability matrix makes the impact of a change in the software architecture or requirements directly visible, so after such a change the test team can re-run only the affected tests, which reduces the testing effort. The project team should aim for full traceability so that every requirement is analyzed, implemented, and tested. With the help of traceability matrices, an auditor can follow any safety requirement to its related system requirements, code, and test results from the different test levels.
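The up, down, and side links can be pictured as one row of a traceability matrix per software requirement. The sketch below is only an illustration of that idea: the structure `trace_row_t`, its fields, and the functions are hypothetical, and real projects usually keep this data in a requirements management tool rather than in code.

```c
#include <stdbool.h>
#include <stddef.h>

/* One row of a traceability matrix (all names are illustrative). */
typedef struct {
    const char *sw_req_id;        /* the software requirement */
    const char *parent_req_id;    /* up-trace: system or safety requirement,
                                     NULL if the link is missing */
    const char *arch_element;     /* down-trace: architecture element,
                                     NULL if the link is missing */
    unsigned    test_case_count;  /* side-trace: number of linked test cases */
} trace_row_t;

/* A row is complete when it has an up-trace, a down-trace,
 * and at least one test case. */
bool trace_row_complete(const trace_row_t *row)
{
    return row != NULL
        && row->parent_req_id != NULL
        && row->arch_element != NULL
        && row->test_case_count > 0u;
}

/* Counts the rows with at least one missing link. */
size_t trace_gap_count(const trace_row_t *rows, size_t n)
{
    size_t gaps = 0;
    if (rows == NULL) {
        return 0;                 /* nothing to check */
    }
    for (size_t i = 0; i < n; ++i) {
        if (!trace_row_complete(&rows[i])) {
            ++gaps;
        }
    }
    return gaps;
}
```

A gap count of zero corresponds to the full traceability the team should aim for; any other value points to requirements that are not yet covered in one of the three directions.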
Change management
Module design changes, new feature implementations, and bug fixes are a part of the product life cycle. Hence the team should finalize the change management process before starting development. Change management includes version management and feature list management. The team should record the list of changes between two successive software versions together with the impact analysis of those changes. Change management also helps in tracking and avoiding unwanted changes. A group of experts should review every change request. They should analyze the impact of the proposed change and ensure that the testing team validates the change before it is incorporated.
Minimizing the dark corners
Software failures are systematic and reproducible. Safety-critical software should behave predictably even when it fails. Unknown areas in the software make it less predictable and increase the risk of systematic failures. Typically, a project team understands and exhaustively tests the code its own members write. But modern software also uses many libraries and open-source components. Such components are often the unknown areas, or dark corners, that the team neither understands nor tests very well. The team should minimize such dark corners by understanding the error behavior of external code in detail. In the following sections, we cover the external components commonly used in modern Embedded software development.
Microcontrollers
A microcontroller is the brain of an Embedded system; its failure can lead to unexpected consequences. For a safety-critical system, the system architects should consider only safety-certified microcontrollers. Manufacturers test these devices thoroughly, ensure that they behave predictably, and publish the failure rates observed in the field. Many safety-certified microcontrollers have internal mechanisms to handle errors, which ensures a predictable course after failures. The failure rate of a microcontroller is a significant determinant of the overall system failure rate.
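How a component's failure rate feeds into the system figure can be sketched with simple arithmetic. The example below assumes the simplest possible model, which the article itself does not prescribe: constant failure rates, every component failure counted as a system failure (a series model), and no credit for diagnostics or redundancy. Rates are expressed in FIT (failures per 10^9 device-hours), and all function names and numbers are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sums component failure rates given in FIT (failures per 1e9 hours).
 * Assumes a series model: any component failure is a system failure. */
double total_fit(const double *component_fit, size_t n)
{
    double sum = 0.0;
    if (component_fit == NULL) {
        return 0.0;
    }
    for (size_t i = 0; i < n; ++i) {
        sum += component_fit[i];
    }
    return sum;
}

/* Checks the summed rate against the target specified for the system. */
bool within_fit_budget(const double *component_fit, size_t n,
                       double target_fit)
{
    return total_fit(component_fit, n) <= target_fit;
}
```

With made-up rates of 50, 20, and 10 FIT for a microcontroller, a power supply, and a sensor, the series sum is 80 FIT, so a 100 FIT target would be met and a 60 FIT target would not. The microcontroller alone accounts for more than half of the total in this example.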
Manufacturers provide errata for their microcontrollers that list known issues. Software developers must understand the known issues and handle them in software. The team should include the errata information in the relevant design documents. It should also monitor updates to the errata throughout the product lifecycle and perform impact analysis on the software with every update. With this approach, one can minimize the unpredictability introduced by the microcontroller. To further minimize the risk, the system designer may introduce a redundant microcontroller from a different manufacturer to avoid the same silicon bugs. But this is not in the scope of software development.
Compilers
A compiler is a crucial tool in software development. A buggy compiler may introduce errors that manifest only in limited scenarios. It is challenging to catch such issues and attribute their root cause to the compiler. A software architect should only consider a safety-certified compiler and monitor its errata throughout the product lifecycle.
Compiler optimization can also introduce dark corners. While it improves performance or reduces memory utilization, it adds a less well understood aspect to the code: the optimized machine code no longer corresponds line by line to the source. The differences could cause side effects and adversely impact software safety, and reviewing optimized assembly code to catch potential issues is time-consuming. To avoid such errors and keep full control of the code, the team could decide to disable compiler optimization, which keeps the compiled code closer to the source. The team should favor algorithm optimization over compiler optimization.
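One well-known source of differences between optimized and unoptimized builds is a flag that is written outside the normal program flow, for example by an interrupt handler. Without the `volatile` qualifier, an optimizing compiler is allowed to read the flag once and reuse the value. The sketch below shows a bounded polling loop that stays correct at any optimization level; the function name and the polling limit are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Polls a flag that may be set outside the normal program flow
 * (for example by an interrupt handler). The volatile qualifier obliges
 * the compiler to read the flag from memory on every iteration.
 * The wait is bounded so that a missing event cannot hang the system. */
bool wait_for_flag(const volatile bool *flag, uint32_t max_polls)
{
    for (uint32_t i = 0u; i < max_polls; ++i) {
        if (*flag) {
            return true;        /* event observed */
        }
    }
    return false;               /* timeout: the caller decides what to do */
}
```

Note that `volatile` only controls how the compiler treats the accesses; it does not make them atomic or ordered with respect to other memory operations.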
Off-the-shelf libraries
Modern software systems utilize many off-the-shelf libraries and third-party software in the overall stack, such as operating systems, file systems, and communication drivers. The project's software architect should only consider safety-certified libraries. Additionally, the architect must choose libraries with failure rates much lower than the targeted failure rate of the system. The team should review the available documentation and source code of the chosen libraries to understand them in detail. If source code is unavailable, the team should refer to past usage of the libraries in other safety-critical systems, preferably within the organization.
Undefined behavior of programming languages
Most programming languages leave some behaviors undefined or unspecified, and compilers are free to handle them differently. Differences across compilers in handling those behaviors create portability issues.
Code that exercises such behaviors may act differently across compilers, compiler versions, and optimization settings. Developers of safety-critical software should know these behaviors and avoid triggering them. C and C++ are widely used for Embedded software, and their ISO standards document these cases; the C standard, for instance, collects unspecified, undefined, and implementation-defined behaviors in an annex. For example, the C standard does not define the initial value of an automatic (local) variable that has no initializer. Developers sometimes omit the initialization of such variables, incorrectly assuming that they start at zero, as variables with static storage duration do. Such implementations may behave differently and unpredictably across compiler versions and microcontrollers, leading to bugs that are difficult to identify and reproduce. Such issues may also manifest only in some batches. Hence developers should consult the language standard to identify undefined areas, and the compiler documentation to learn how the compiler treats implementation-defined behavior.
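Two habits that keep code out of undefined territory are initializing every object explicitly and checking arithmetic before it can overflow, since signed integer overflow is undefined behavior in C. The following sketch illustrates both; the function names `add_checked` and `sum_samples` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Adds two values and reports overflow instead of performing a signed
 * addition that would overflow (undefined behavior in C).
 * The result is written only when the addition is safe. */
bool add_checked(int32_t a, int32_t b, int32_t *result)
{
    if (result == NULL) {
        return false;
    }
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        return false;           /* the sum would not fit in int32_t */
    }
    *result = a + b;
    return true;
}

/* Sums an array of samples. Every local object is initialized explicitly,
 * so the result never depends on an indeterminate value. */
bool sum_samples(const int32_t *samples, size_t n, int32_t *sum_out)
{
    int32_t sum = 0;            /* explicit initialization */

    if (samples == NULL || sum_out == NULL) {
        return false;
    }
    for (size_t i = 0; i < n; ++i) {
        if (!add_checked(sum, samples[i], &sum)) {
            return false;       /* overflow detected: no result produced */
        }
    }
    *sum_out = sum;
    return true;
}
```

Both functions return a status instead of a raw value, so the caller is forced to decide what happens when the computation cannot be carried out safely.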
In conclusion
The project team must be aware of these recommendations and prepare for them before starting safety-critical Embedded software development. The recommendations also apply to non-safety-critical Embedded software, but teams on such projects often skip them because of business constraints. Teams building safety-critical systems must not ignore them: they can significantly improve software quality and, most importantly, the predictability and safety of the software.