Soft audit Checklist・No9: Reliability/Availability Requirements Management Technology

10/02/2021

Reliability and availability: the first two categories of the requirements management checklist

This series introduces, item by item, the audit checklist used for software development audits. The checklist is divided into (1) development process, (2) requirements management, (3) testing, and (4) design and implementation. This article introduces the individual reliability and availability items under (2) requirements management. (The checklist itself can be found in the article "Software Audit Practice / Checklist No. 8: Development Technology / Requirements Management (General Management)", so please refer to that.)

Check reliability by focusing on the device's ability to operate without failure

Reliability and availability are easy to confuse because the ideas are similar. In simple terms, reliability is the ability to keep a device running without failure, and availability is the ability to keep using a device. Put that way, the difference may still be hard to grasp, so let me explain more concretely with an example.

Reliability concerns whether the device keeps operating stably 24 hours a day, 365 days a year, and whether it can detect a hang caused by a software bug and recover from it by itself. Availability, on the other hand, concerns how much of the required time the device can actually be used. For example, if a device is meant to run continuously for 365 days but normal use stops once a year for half a day of regular maintenance, it can be used for 364.5 of those 365 days. Availability also covers arrangements such as a regular maintenance service for 10 years after delivery, under which a unit that breaks down is swapped for a replacement within one day.

The checklist contains items that check whether such performance is clearly stated in the requirements specification. First, let's introduce the reliability check items, whose item numbers start with RE-.

[Item number: RE-01]

Abnormal states of a device or piece of equipment as a whole include hardware failures such as a blown fuse or smoke coming out of a board, while software abnormal states can mostly be summed up by the word "hang-up", which includes functions stopping. The basic policy for how the device should behave when a software-caused abnormality such as a hang-up occurs depends on the market and environment in which the device is used.

Should the device stay in the abnormal state and wait for an operator, or should it try to recover automatically? The basic policy for handling abnormal states is decided first, and the specific design and implementation then follow that policy. It is therefore important that the policy for responding to and recovering from abnormal states is clearly written in the requirements specification, and this item checks that point.
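To make the RE-01 point concrete, here is a minimal Python sketch, assuming the specification chooses between the two policies mentioned above. The enum `AbnormalStatePolicy`, the function `handle_hang`, and the fallback to the operator after a failed recovery are hypothetical illustrations, not part of the checklist.

```python
from enum import Enum
from typing import Callable


class AbnormalStatePolicy(Enum):
    """Hypothetical encoding of the two basic policies named in RE-01."""
    HOLD_FOR_OPERATOR = "hold"  # keep the abnormal state and wait for an operator
    AUTO_RECOVER = "recover"    # attempt automatic recovery


def handle_hang(policy: AbnormalStatePolicy,
                recover: Callable[[], bool],
                notify_operator: Callable[[str], None]) -> str:
    """Behave as the policy fixed in the requirements specification dictates."""
    if policy is AbnormalStatePolicy.HOLD_FOR_OPERATOR:
        notify_operator("hang detected: holding state for operator")
        return "held"
    # AUTO_RECOVER. Assumption: a failed recovery falls back to the operator.
    if recover():
        return "recovered"
    notify_operator("automatic recovery failed")
    return "recovery_failed"


SPEC_POLICY = AbnormalStatePolicy.AUTO_RECOVER  # hypothetical value taken from the spec
log = []
print(handle_hang(SPEC_POLICY, recover=lambda: True, notify_operator=log.append))  # recovered
```

The code branches on a policy value decided up front; until the specification fixes that value, this part of the design cannot be written, which is why RE-01 checks for it.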

[Item number: RE-02]

The most common way for a device or piece of equipment to recover by itself from an abnormal state is watchdog recovery. Watchdog recovery offers several design options in two respects: (1) how to detect the abnormal state and (2) how to recover from it.

For the former, watchdog designs use what is called a heartbeat signal. In a living animal the heart beats constantly, so when the heartbeat stops, we can judge that the animal's activity has stopped. The watchdog function constantly monitors this heartbeat signal, and if the heartbeat is interrupted, it "barks" and revives the system.

A computer system generates a heartbeat signal from one of its various signals and sets up a mechanism that monitors it at regular intervals. When the watchdog detects that the heartbeat signal has been interrupted, it barks in a preset manner and attempts to restore system functionality.

Therefore, if a watchdog function is installed, the requirements specification must state specifically what is used as the heartbeat signal and what method is used to restore the system's functions. This item checks those points carefully.
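The detection side of a watchdog can be sketched as a timer that each heartbeat resets; if no heartbeat arrives within the timeout, the watchdog barks by calling a preset recovery action. This is a minimal, single-threaded Python sketch under stated assumptions: time is passed in explicitly instead of coming from a hardware timer, and the class `Watchdog`, the `timeout` value, and the `on_bark` callback are hypothetical choices, not something the checklist prescribes.

```python
from typing import Callable


class Watchdog:
    """Minimal heartbeat watchdog (illustrative sketch)."""

    def __init__(self, timeout: float, on_bark: Callable[[], None], now: float = 0.0):
        self.timeout = timeout  # max allowed gap between heartbeats (seconds)
        self.on_bark = on_bark  # preset recovery action, e.g. a restart
        self.last_beat = now    # time of the most recent heartbeat

    def heartbeat(self, now: float) -> None:
        """Called by the monitored software to show it is still alive."""
        self.last_beat = now

    def check(self, now: float) -> bool:
        """Called periodically; barks and returns True if the heartbeat stopped."""
        if now - self.last_beat > self.timeout:
            self.on_bark()
            self.last_beat = now  # restart the timer after the recovery action
            return True
        return False


restarts = []
wd = Watchdog(timeout=2.0, on_bark=lambda: restarts.append("restart"))
wd.heartbeat(1.0)
print(wd.check(2.5))  # False: last beat was 1.5 s ago
print(wd.check(4.0))  # True: no beat for 3.0 s, so the watchdog barks
print(restarts)       # ['restart']
```

In many embedded systems the same idea is realized with a hardware watchdog timer that the software must reset periodically; either way, the specification has to say which signal plays the heartbeat role and what the bark does.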

[Item number: RE-03]

RE-02 checked the specification of the watchdog function for the device or equipment as a whole. As software grows larger, however, it becomes worth considering a watchdog for each function. If the device provides multiple functions that are highly independent of one another, then when only one function stops working, it can be recovered by restarting just that function, minimizing the impact on the other functions.

For this reason, a software watchdog function is sometimes implemented. In this case too, the requirements specification must state what serves as the heartbeat signal and what method is used for recovery. This item checks those points carefully.
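A per-function software watchdog extends the same idea: each highly independent function has its own heartbeat record, and only the function whose heartbeat stopped is restarted. In this sketch the class `FunctionWatchdog`, the `restart` callback, and the function names used in the example are hypothetical.

```python
from typing import Callable, Dict, List


class FunctionWatchdog:
    """Tracks one heartbeat per function and restarts only stalled functions."""

    def __init__(self, timeout: float, restart: Callable[[str], None]):
        self.timeout = timeout
        self.restart = restart  # restarts a single named function
        self.last_beat: Dict[str, float] = {}

    def register(self, name: str, now: float) -> None:
        self.last_beat[name] = now

    def heartbeat(self, name: str, now: float) -> None:
        self.last_beat[name] = now

    def check(self, now: float) -> List[str]:
        """Restart every function whose heartbeat is overdue; the others keep running."""
        stalled = [n for n, t in self.last_beat.items() if now - t > self.timeout]
        for name in stalled:
            self.restart(name)
            self.last_beat[name] = now  # give the restarted function a fresh timer
        return stalled


restarted = []
fw = FunctionWatchdog(timeout=2.0, restart=restarted.append)
fw.register("network", now=0.0)
fw.register("display", now=0.0)
fw.heartbeat("display", now=2.5)  # display is alive; network has gone silent
print(fw.check(now=3.0))          # ['network']
```

Only `network` is restarted; `display` is left untouched, which is the impact-limiting benefit RE-03 describes.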

[Item number: RE-04]

In embedded devices and equipment, the executable program is stored in built-in flash memory or on a hard disk. When the power is turned on, the software is loaded into the device's main memory and starts providing the required functions.

Because flash memory and hard disks are rewritable media, a software bug can occasionally write to them and corrupt the stored program. Depending on the surrounding environment, prolonged exposure to high temperatures can also erase the recorded contents. If that happens, the device will not operate normally even when the power is turned on.

To guard against this, devices are often built to store the same software in two places so that, if one copy fails, the device can load the backup copy from the other place and keep operating. This is called firmware duplication. When it is used, the requirements specification must describe specifically which range of software is duplicated and how the duplicated software is updated. This item checks how concretely that is written.
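Firmware duplication can be sketched as two image slots, each stored with a checksum: at boot, the device verifies the primary slot and falls back to the backup if the primary is corrupted. This is an assumption-laden illustration: the slot layout, the use of CRC-32 (Python's `zlib.crc32`) as the integrity check, and the helpers `write_slot` and `select_boot_image` are hypothetical, and real boot loaders may use other checks.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FirmwareSlot:
    """One stored copy of the firmware plus the CRC-32 recorded when it was written."""
    image: bytes
    stored_crc: int

    def is_intact(self) -> bool:
        return zlib.crc32(self.image) == self.stored_crc


def write_slot(image: bytes) -> FirmwareSlot:
    """Store an image together with its checksum (done at update time)."""
    return FirmwareSlot(image, zlib.crc32(image))


def select_boot_image(slots: List[FirmwareSlot]) -> Optional[int]:
    """Return the index of the first intact slot, or None if every copy is corrupt."""
    for index, slot in enumerate(slots):
        if slot.is_intact():
            return index
    return None


primary = write_slot(b"firmware v1.2")
backup = write_slot(b"firmware v1.2")
primary.image = b"firmware v1.\x00"          # simulate corruption by a stray write
print(select_boot_image([primary, backup]))  # 1: boot from the backup copy
```

The sketch also shows why RE-04 asks about the update method: if an update rewrites both slots at the same time, one fault can corrupt both copies, so the specification needs to say how and in what order each copy is updated.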

Check availability by focusing on the ability to continue using the device

Availability, as the word itself suggests, is the ability to be used. That alone does not say much, so let's consider availability using a car as an example. A car is normally inspected once every two years; suppose the inspection keeps it at a maintenance shop for five days. Then, out of 2 years = 365 days × 2 = 730 days, the car cannot be used for 5 days. This means the car can be used for 725 of 730 days, so its availability is 725/730 = 99.3%.

In other words, availability is how much of the time a device or piece of equipment can be used even when no abnormality such as a failure occurs. Put the other way around, it reflects how much time must be deducted from the required operating period to keep the device functioning normally. The check items that confirm whether availability is specifically described in the requirements specification are the ones whose item numbers start with AV-.
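The car calculation generalizes to a simple ratio: usable time divided by required time. A minimal sketch, with `availability` as a hypothetical helper name, reproducing the two figures used in this article:

```python
def availability(required_time: float, downtime: float) -> float:
    """Fraction of the required operating time during which the item is usable."""
    if required_time <= 0:
        raise ValueError("required_time must be positive")
    return (required_time - downtime) / required_time


# Car: inspected once every 2 years, off the road for 5 days.
print(f"{availability(365 * 2, 5):.1%}")  # 99.3%
# Device from the introduction: 365 days required, half a day of maintenance.
print(f"{availability(365, 0.5):.2%}")    # 99.86%
```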

[Item number: AV-01]

The basis for considering availability is the time during which the device must operate continuously. Network station equipment must operate continuously 24 hours a day, 365 days a year, whereas for a rice cooker the maximum is 37 hours: 1 hour of cooking plus 36 hours of keeping warm. When considering availability, the time the device needs to keep running continuously is taken as 100%, and availability is the percentage of that time during which the device can actually keep running once the time required for maintenance and the like is subtracted. It is therefore important to confirm first the required continuous operating time, which is the denominator. It should be stated specifically in the requirements specification, and this item checks whether it is.
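Why the denominator matters can be shown with a small sketch that subtracts the same downtime from the two required operating times mentioned above. The 1 hour of downtime, the dictionary `REQUIRED_HOURS`, and the helper `availability_by_product` are hypothetical illustrations.

```python
from typing import Dict

# Required continuous operating time (the denominator), in hours, per product type.
REQUIRED_HOURS: Dict[str, float] = {
    "network station equipment": 24 * 365,  # 24 hours a day, 365 days a year
    "rice cooker": 37,                      # maximum continuous operation stated above
}


def availability_by_product(required: Dict[str, float],
                            downtime_hours: float) -> Dict[str, float]:
    """Availability of each product when the same downtime is subtracted."""
    return {name: (hours - downtime_hours) / hours for name, hours in required.items()}


# Hypothetical: the same 1 hour of lost operation for both products.
for product, ratio in availability_by_product(REQUIRED_HOURS, 1.0).items():
    print(f"{product}: {ratio:.2%}")
# network station equipment: 99.99%
# rice cooker: 97.30%
```

The same hour costs very different percentages, so an availability figure in a specification means little unless the required operating time is stated with it.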

[Item number: AV-02]

The next thing needed to consider availability is the time required for preplanned maintenance. For a car, this is the number of days needed for the vehicle inspection. Since the time left after deducting this maintenance time is the time the equipment can actually be used, availability is largely determined by it.

For network backbone equipment, which must operate 24 hours a day, 365 days a year, maintenance time has to be zero. Such equipment is built as a dual system so that its functions keep operating even during maintenance: while one system continues operating, the other can be maintained without any problem.

TV broadcasting equipment is a different case. Some TV stations now broadcast around the clock, but in general broadcasting ends around midnight, there is no broadcast for several hours (called the "wave stop time"), and broadcasting starts again early in the morning. Suppose the broadcasting equipment's wave stop time is 1 hour and maintenance is required once a day for 1 hour during that time. The maintenance does not affect actual broadcasting, but one can also regard the daily availability as 23 hours / 24 hours.

The maintenance time, which is the basic input when considering availability, must therefore also be specified in the requirements specification, and this item checks how it is stated.
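The broadcasting example can be read two ways, and a short sketch makes the difference explicit. The function names are hypothetical, and `service_availability` assumes maintenance is scheduled inside the wave stop time first, with only the overflow counting against the broadcast hours.

```python
HOURS_PER_DAY = 24.0


def equipment_availability(maintenance_hours: float) -> float:
    """Counts every maintenance hour against the full 24-hour day."""
    return (HOURS_PER_DAY - maintenance_hours) / HOURS_PER_DAY


def service_availability(maintenance_hours: float, off_air_hours: float) -> float:
    """Counts only maintenance that spills outside the wave stop time,
    measured against the hours the station is actually required to broadcast."""
    required = HOURS_PER_DAY - off_air_hours
    lost = max(0.0, maintenance_hours - off_air_hours)
    return (required - lost) / required


print(f"{equipment_availability(1.0):.2%}")     # 95.83%  (23 h / 24 h)
print(f"{service_availability(1.0, 1.0):.2%}")  # 100.00% (broadcasting unaffected)
```

Which view applies is exactly the kind of point the requirements specification should settle; a dual system, as in the backbone example, keeps the service figure at 100% even when maintenance cannot be fitted into an off-air window.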

[Item number: AV-03]

The previous item, AV-02, checks regular maintenance time, but other time also enters the availability calculation. For example, consider a device that hangs up because something has gone wrong, so that the watchdog function is activated and automatic recovery by restart is executed.

If this restart takes 4 hours and such automatic recovery is executed once a year, the time the device can be used is reduced by 4 hours per year. The time required to recover from an abnormal state is therefore also needed to determine availability, so this item checks how it is described in the requirements specification.
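Recovery time simply joins planned maintenance in the downtime total. A minimal sketch, assuming continuous 24 × 365 operation is required; `yearly_availability` and the 12-hour maintenance figure in the second line are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # required continuous operation (assumed)


def yearly_availability(planned_maintenance_h: float,
                        recoveries_per_year: float,
                        recovery_time_h: float) -> float:
    """Availability after subtracting planned maintenance and automatic-recovery time."""
    downtime = planned_maintenance_h + recoveries_per_year * recovery_time_h
    return (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR


# One 4-hour watchdog restart per year, no planned maintenance (e.g. a dual system).
print(f"{yearly_availability(0, 1, 4):.3%}")   # 99.954%
# Same restart plus a hypothetical 12 hours of planned maintenance.
print(f"{yearly_availability(12, 1, 4):.3%}")  # 99.817%
```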

[Item number: AV-04]

There is another way to think about availability. Upgrading the software to add functions may let the same device stay in use for many more years. If the product's original functions would have kept it useful for only about 7 years, but a software upgrade lets it be used for another 3 years, the usable period is extended to 10 years.

This too is availability in a sense; here the question is whether the product's life can be extended by expanding its functions and performance after the device or equipment has been sold. This item checks what the requirements specification says about such post-release revisions.

After availability comes maintainability

Following the availability items, the next article introduces the check items for maintainability.

Next: Soft audit Checklist・No10: Maintainability requirements management technology