Release judgment criteria・The third type of test is the RAS function

13/11/2020 Release Judgment

The third type of test to check is the implementation status of the RAS function tests.

This series of articles introduces the types of tests to look at when judging the quality of testing. Following the stability and robustness tests introduced in the previous article, this time we check the status of the RAS function tests. Much of this overlaps with stability and robustness, but these aspects are now often grouped under the term "RAS function", so Father Gutara also used it as an item when checking the types of tests.

Implementation status of tests related to RAS function

By the way, what is the RAS function? RAS is an acronym for Reliability, Availability, and Serviceability, but what does that mean concretely when applied to software?

Simply put, it is the set of features that make failures rare, allow quick recovery even when a failure does occur, and keep the product maintainable so that it can be used with peace of mind for a long time. As part of the test quality check at release judgment, Father Gutara checked whether tests to confirm these RAS functions had been planned and carried out.

If the RAS function is specified in the requirement specification, there will be test items that check whether the specification is satisfied, so there is little to worry about. But what if the requirement specification does not mention the RAS function?

Even in such a case, it is necessary to confirm whether the product provides the RAS functions that are expected as common sense in the market where it is sold, or whether it reaches the level of RAS function that the company or organization defines as necessary. In the world of quality control, this is called taken-for-granted quality.

Therefore, even if the requirement specification says nothing about the RAS function, the criterion for judging test quality from the RAS viewpoint is whether there are test items that confirm the RAS function meets the quality the company takes for granted.

Reliability test is replaced by stability confirmation test

R stands for Reliability, which means that failures, malfunctions, and defects are unlikely to occur. Since it is often discussed in relation to hardware, it is sometimes expressed with an index such as the mean time between failures (MTBF), that is, the operating time divided by the number of failures.

When it comes to software, the stability discussed earlier is a closely related idea. Therefore, at software release judgment, Father Gutara treated the status check of the stability tests as the status check of the Reliability tests, and for the RAS function checked the test status of the remaining A and S.

For Availability, check the maintenance time and restart time

A stands for Availability, which means high uptime, that is, little downtime due to failure or maintenance. It may also be expressed as an index such as the ratio of operating hours to total time (the operating rate).

The time from the occurrence of a failure until the device has restarted counts as time when it is not operating, so the shorter the non-operating time, the better the availability. The continuous operation time also matters, because whether the non-operating time is long or short can only be judged by comparing it with the continuous operation time of the device. In general, a period of one month or one year is used.

Again, let’s take a wireless LAN router as an example and calculate the availability when the continuous operation time is about one year. 

Once installed, a wireless LAN router keeps running, but let's assume that the firmware is updated once a year. Let's also assume that a hang occurs twice a year, and that a watchdog function detects each hang and a built-in recovery function resets the router automatically.

Under these assumptions, the values used to calculate Availability are as follows.

  • Continuous operation time of wireless LAN router: 8,760 hours = 365 days
  • Time when the wireless LAN router is not operating: 0.83 hours = 50 minutes

The non-operating time breaks down into one firmware update per year (30 minutes) and two recoveries from a hang, each counted from the occurrence of the hang to the resumption of normal service (10 minutes + 10 minutes).

From these values, the annual availability of the wireless LAN router is 99.99% (100 × (8760 - 0.83) / 8760).
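The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation in this article; the function name `availability_percent` and the dictionary keys are hypothetical names chosen for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours of continuous operation


def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Return operating time as a percentage of total time (hypothetical helper)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return 100.0 * (total_hours - downtime_hours) / total_hours


# Assumed non-operating time per year, in minutes, as in the example above.
downtime_minutes = {
    "firmware_update": 30,   # one update per year
    "hang_recovery_1": 10,   # first hang, watchdog reset
    "hang_recovery_2": 10,   # second hang, watchdog reset
}

downtime_hours = sum(downtime_minutes.values()) / 60  # 50 minutes, about 0.83 hours
router_availability = availability_percent(HOURS_PER_YEAR, downtime_hours)

print(f"Downtime: {downtime_hours:.2f} h, availability: {router_availability:.2f}%")
```

Changing the entries in `downtime_minutes` shows directly how each kind of non-operating time moves the result.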

Pay attention to the difference between the terminal in the house and the equipment in the station building

How about that? 99.99% looks like a pretty good value. The value is this good because the router is a terminal installed in a home and needs no downtime for things like regular maintenance.

On the other hand, what happens to the Availability of the aggregation equipment installed in the station building as the counterpart of this wireless LAN router? Such equipment generally holds various user setting information, so maintenance and inspection work is usually carried out on a regular basis. This work covers the periodic checks needed for stable operation, such as inspecting the equipment for faults and backing up various data.

Regular maintenance and inspection work takes a considerable amount of time. For example, if 24 hours of maintenance and inspection is needed once a year and the service has to be stopped during that time, the Availability calculated in the same way as before is 99.7% (100 × (8760 - 24) / 8760). In this way, firmware update time and maintenance inspection time both affect Availability.

In other words, a test related to Availability is a test that measures the time taken by work during which the product is not operating, such as firmware updates and maintenance inspection, and the time taken by hang recovery processing.
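As a rough illustration of such a measurement, the following Python sketch times one operation and compares the result with a time budget. Everything here is an assumption made for the example: `simulated_hang_recovery` is a hypothetical stand-in for triggering a real hang and waiting for the service to come back, and the 10-minute budget is simply the recovery time assumed for the wireless LAN router earlier in this article.

```python
import time
from typing import Callable


def measure_seconds(operation: Callable[[], None]) -> float:
    """Run the operation once and return the elapsed wall-clock seconds."""
    start = time.monotonic()  # monotonic clock is not affected by clock adjustments
    operation()
    return time.monotonic() - start


def simulated_hang_recovery() -> None:
    """Hypothetical stand-in: a real test would trigger a hang on the device
    and block until normal service has resumed."""
    time.sleep(0.05)


# Assumed budget: the 10 minutes per hang used in the availability example.
RECOVERY_BUDGET_SECONDS = 10 * 60

elapsed = measure_seconds(simulated_hang_recovery)
within_budget = elapsed <= RECOVERY_BUDGET_SECONDS

print(f"Recovery took {elapsed:.2f} s, within budget: {within_budget}")
```

The same helper could time a firmware update or a maintenance procedure, each with its own budget.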

As part of checking test quality at release judgment, Father Gutara confirmed that tests to check Availability had been carried out. He did not go into the amount or content of those tests. Whether tests to confirm Availability had been carried out at all was one of the criteria for judging test quality.

Serviceability confirms the investigation function when a problem occurs

The last letter, S, stands for Serviceability, which indicates how easy failure recovery and maintenance work are. It is sometimes expressed with an index such as the mean time to repair (MTTR), measured from the occurrence of a failure to recovery. From a software perspective, however, it is mainly about the functions used to investigate the cause of a defect that has occurred in the field.

If a problem occurs in the field and is likely to be caused by a software bug, how do you track down the cause? If the same phenomenon can be reproduced in-house, the cause can be investigated by various means, so there is no need to worry too much. But what if it cannot be reproduced? What if it does not reproduce, and yet the cause must be identified and a corrected version of the software produced?

First of all, gather information on what happened when it occurred

To estimate the cause, we have to collect information from the time the problem occurred in the field. That means investigating by analyzing the error logs built into the product or, on a Unix-like OS, by analyzing a core dump.

To do so, the error log and core dump data must first be brought in-house. The error log and core dump are stored in the secondary storage area of the product. If the product is connected to a network, they can be retrieved over the network, and in some cases, if the device can be accessed and diagnosed remotely, various other information can be extracted to advance the estimation of the cause.

From a software perspective, Serviceability is this set of functions for investigating defects in the field. It can be roughly divided into offline functions, such as collecting error logs and core dumps, and online functions, such as remote diagnosis. Remote firmware updates are also often counted among the Serviceability functions.

Serviceability quality needs attention

These are functions such as error log collection and remote diagnosis, but in terms of software quality there are two points that need attention. The first is that they require higher quality than normal functions, and the second is that the tests to confirm that quality are difficult to carry out.

Why is higher quality required than for normal functions? Imagine the situation in which a Serviceability function is used. In most cases a problem is occurring right now in actual operation in the field, and the Serviceability function is being used to investigate the cause. Naturally, the customer is often pressing for the investigation and countermeasures to proceed as quickly as possible. If, in that situation, the Serviceability function itself has a bug and does not work as expected, you end up in a situation you would rather not think about.

The Serviceability function is much like an ambulance rushing to the scene of a traffic accident. It has to work reliably on site so that the situation can be understood and countermeasures taken. It requires a higher level of quality than the functions that provide the normal service.

And it is not the customer who demands this higher level of quality, but the developers themselves, who have to investigate defects and take countermeasures. So even if the specification does not mention a quality level for the Serviceability functions, developers need to achieve a high quality level for them for their own sake.

Serviceability is actually hard to test

The second point comes up when you try to test enough to raise the quality of Serviceability. Take a test of the error logging function as an example. An error log is a function that, whenever an error occurs, records the content of the error and the time it occurred.

So to test the error log function, you need to make an error occur. Errors that can be triggered easily pose no problem, but many of the errors recorded in the error log rarely occur. If an error occurred easily, it would have been triggered in pre-release testing and handling for it would have been built in to begin with.

Therefore, the errors that the error log needs to capture in actual operation are, by their nature, errors that are unlikely to occur. This means that unless you devise some way to trigger them, the errors will not occur, and you cannot confirm that the error log function is working properly. This difficulty of carrying out the test itself is the second point to note in ensuring the quality of Serviceability.
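One common way to "devise something" is to inject the fault deliberately. The following Python sketch is only an illustration under assumptions not found in this article: `poll_once`, `failing_read`, and the logger name `device.errorlog` are hypothetical, and the error is forced by passing in a function that always raises, instead of waiting for the real error to happen.

```python
import io
import logging
from typing import Callable, Optional, TextIO


def make_error_logger(stream: TextIO) -> logging.Logger:
    """Build a logger that writes timestamped error records to the given stream."""
    logger = logging.getLogger("device.errorlog")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.ERROR)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def poll_once(read_sensor: Callable[[], int], logger: logging.Logger) -> Optional[int]:
    """Hypothetical product code: read a value and log any I/O error."""
    try:
        return read_sensor()
    except OSError as exc:
        logger.error("sensor read failed: %s", exc)
        return None


def failing_read() -> int:
    """Injected fault: stands in for an error that rarely occurs in practice."""
    raise OSError("injected fault: bus timeout")


# Inject the fault, then check that the error reached the log.
log_stream = io.StringIO()
result = poll_once(failing_read, make_error_logger(log_stream))
log_text = log_stream.getvalue()

print(log_text, end="")
```

Passing the failing function in as a parameter keeps the product code unchanged between the test and actual operation, which is the point of the exercise: the logging path that runs in the test is the same one that will run in the field.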

Is it being tested at all? A check at that level

Seen this way, the quality level required of the Serviceability functions is often unclear because it is not written in the specifications, and the tests to confirm that quality are often difficult to carry out. It is therefore hard to set a standard for what level the serviceability tests should reach.

In such cases, Father Gutara did not think too hard about it and made only a superficial check of whether tests to confirm serviceability had been carried out at all, using that as one of the test quality judgments at release judgment.

The next type of test is the version upgrade test

When judging the quality of testing, we look at the types of tests that were performed. After the RAS function tests comes the version upgrade function test, which is the last step in software quality assurance. It is introduced in the next article, so please have a look if you are interested.

Next : Release judgment criteria・The fourth type of test is the version upgrade function