Time bomb bug

23/04/2019 Bug's nest

What is a time bomb bug?

Doesn't "time bomb bug" sound like a dangerous name? I wondered what kind of bug it was, but when I searched for "time bomb bug" on Google, nothing came up. In fact, the name is unknown to the general public because it is simply what Father Gutara calls a certain kind of bug.

So why does Father Gutara bother to use such a dubious name? Some kinds of bugs have a very wide range of impact once they reach the market, so they deserve special care. The bugs referred to here as time bomb bugs are of two types: ① bugs that leak dynamic resources, and ② bugs in the rollover handling of timers and counters.

These two types of bugs are called time bomb bugs because they share two characteristics: (a) the device works without any problems for a while after the power is turned on, but (b) problems occur in all devices after a certain period of time.

Because of characteristic (a), no problem appears during the in-house test period, so the bug often escapes testing and leaks to the market as a latent bug. "A while" here can mean months or years, so the bug passes in-house tests, which have only a limited amount of time. Once it leaks to the market, almost all devices will develop problems after working fine for a while. Shortly after a new release, the number of devices actually operating in the market is often small, but several months later many devices are in operation. Since problems occur in almost all of them, the situation is very frightening for the manufacturer.

Father Gutara paid particular attention to these bugs and called them time bomb bugs, because the situation is like having a time bomb quietly attached to every product.

Time bomb due to a dynamic resource leak

Let's take a concrete look at time bomb bugs. The first is the bug that causes a dynamic resource leak. Dynamic resources are resources that are acquired when needed and returned when they are no longer needed. The best known is the dynamic memory provided by the OS. Searching the net for memory leaks turns up many concrete examples, but let's take a quick look here.

The OS has a function to provide dynamic memory. With dynamic memory, application software borrows memory from the OS when it needs it and returns it to the OS when the memory is no longer needed. It is called dynamic because the memory, a resource managed by the OS, moves back and forth between the OS and the application.

In this sequence, a memory leak occurs when a bug in the application software causes it to skip returning memory to the OS even though the memory is no longer needed. The memory the OS has left gradually decreases as the software runs, much like water leaking from a tank, which is why the word leak is used. Leaked memory is never returned, so if the leak continues until the OS's remaining memory reaches zero, the OS can no longer lend memory to the application software, the application cannot continue processing, and the device malfunctions.

A memory leak is caused by the application software skipping the step that returns memory to the OS. The speed of the leak depends on the amount of memory borrowed at one time and the frequency with which it is borrowed. Whatever the speed, leaked memory is not recovered, so the OS's remaining memory keeps decreasing and is never restored. As a result, every operating device will one day malfunction because of the leak. That "one day" is what makes this so troublesome: depending on the leak rate, it may be half a year or two years away.
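As a rough illustration of why the failure can be so far away, the time to exhaustion is simply the free memory divided by the leak rate. The sketch below uses made-up figures (64 MiB of free memory, a 128-byte leak, 1,000 leaking events per day); the function name and every number are assumptions for illustration only.

```python
def days_until_exhaustion(free_bytes, leak_bytes_per_event, events_per_day):
    """Estimate how many days pass before a steady leak uses up free memory."""
    leaked_per_day = leak_bytes_per_event * events_per_day
    return free_bytes / leaked_per_day


# Hypothetical device: 64 MiB free, 128 bytes lost per event, 1,000 events a day.
days = days_until_exhaustion(64 * 1024 * 1024, 128, 1000)
print(f"memory runs out after about {days:.0f} days")
```

With these assumed figures the device would run for well over a year before failing, far longer than a typical in-house test period.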

If the memory leak occurs during normal processing, it may still be found in in-house testing. If you build special verification software that makes the OS's memory pool extremely small and accelerates various operations and functions, the remaining memory can sometimes be driven to zero in a short period of time. However, if the leak occurs in abnormal-case processing triggered by some particular condition, accelerating normal processing does not accelerate the leak, the remaining memory does not reach zero in a short time, and the leak remains hard to detect in in-house testing. Unfortunately, software bugs are especially annoying because they like to settle in exception handling.
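A minimal sketch of a leak that hides in the abnormal path might look like the following. `BufferPool`, `handle_buggy`, and `handle_fixed` are hypothetical names, and the pool is only a stand-in for OS dynamic memory. Ten thousand normal requests leave the pool untouched, while each abnormal request loses one buffer.

```python
class BufferPool:
    """Hypothetical stand-in for OS dynamic memory: a fixed number of buffers."""

    def __init__(self, capacity):
        self.remaining = capacity

    def acquire(self):
        if self.remaining == 0:
            raise MemoryError("no buffers left")
        self.remaining -= 1

    def release(self):
        self.remaining += 1


def handle_buggy(pool, payload):
    pool.acquire()
    if payload is None:      # abnormal case
        return False         # BUG: returns without releasing the buffer
    pool.release()
    return True


def handle_fixed(pool, payload):
    pool.acquire()
    try:
        return payload is not None
    finally:
        pool.release()       # runs on both the normal and the abnormal path


pool = BufferPool(capacity=8)
for _ in range(10_000):      # accelerated normal processing: no leak shows
    handle_buggy(pool, b"data")
normal_remaining = pool.remaining

for _ in range(3):           # only the abnormal path leaks
    handle_buggy(pool, None)
abnormal_remaining = pool.remaining
```

Putting the release in a `finally` clause, as in `handle_fixed`, is one common way to make sure the return step cannot be skipped on an early exit.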

Is dynamic memory the only dynamic resource?

I've introduced memory leaks as time bomb bugs, but are there dynamic resources other than memory? In fact, there are two other kinds of dynamic resources that can leak. One is a dynamic resource, other than memory, provided by the OS, and the other is a dynamic resource provided by application software.

Besides memory, the dynamic resources provided by the OS include sockets for communication. Here too, the maximum number is fixed by the OS's initial settings, and application software may be designed to borrow a socket from the OS when needed and return it when finished. Of course, if there are plenty of sockets, another design is to acquire the required number of sockets at startup, keep using them, and never return them to the OS. In that case the sockets are used as static resources rather than dynamic ones, so no leak occurs. In embedded software, however, various hardware restrictions often rule out the static approach. Sockets are then dynamic resources that can leak, which can cause a time bomb bug.

The remaining kind, dynamic resources provided by application software, gives rise to a nasty bug that is easily overlooked. A dynamic resource is any resource that is borrowed when it is used and returned when it is finished. If you forget to return it when you finish using it, a leak occurs. In other words, any resource that follows this borrow-and-return pattern is a dynamic resource, even if it is managed by an application rather than by the OS, and the risk of a leak lurks there too.

Let me give a somewhat concrete example. Suppose application A keeps a management table in a shared area, and the table has a maximum of 10 entries. Suppose also that application A provides function B only to application software registered in the management table. When application C wants to use function B, it obtains one entry from the management table and records in it that application C wants to use function B. It then requests function B from application A. When function B finishes, application C deletes its information from the management table and returns the entry.

In such a mechanism, the entries of the management table in the shared area are dynamic resources. There are 10 table entries in total, and various application software uses them as needed, but if any application forgets to return its entry, the number of free entries gradually decreases. Once the number of free entries reaches zero, no application software can use function B any longer. In this way, wherever application software, rather than the OS, holds dynamically used resources such as table entries, a resource leak bug may be hiding there as well.
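The management table scenario can be sketched as follows. The class and function names are hypothetical, and the real table would live in a shared area rather than in a Python list. Here application C forgets to return its entry on every call, so the eleventh request fails, and so does a request from a well-behaved application D.

```python
MAX_ENTRIES = 10  # table size taken from the example in the text


class ManagementTable:
    """Hypothetical model of application A's management table."""

    def __init__(self):
        self.entries = [None] * MAX_ENTRIES

    def register(self, app_name):
        for index, owner in enumerate(self.entries):
            if owner is None:
                self.entries[index] = app_name
                return index
        return None  # table is full

    def unregister(self, index):
        self.entries[index] = None

    def free_entries(self):
        return self.entries.count(None)


def use_function_b(table, app_name, forget_to_return=False):
    index = table.register(app_name)
    if index is None:
        return False             # function B is unavailable
    # ... application A would perform function B here ...
    if not forget_to_return:
        table.unregister(index)  # the step a leaky application skips
    return True


table = ManagementTable()
results = [use_function_b(table, "C", forget_to_return=True) for _ in range(11)]
d_result = use_function_b(table, "D")  # well-behaved, but the table is already full
```

The leak in one application takes function B away from every application, which is what makes this kind of resource easy to overlook and costly to miss.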

The dynamic resources provided by the OS are relatively easy to keep in mind, but dynamic resources created by application software are often overlooked, and they too can cause a time bomb bug. So be careful.

Find dynamic resource leaks with in-house testing

Dynamic resource leaks can be prevented to some extent by adding tests aimed specifically at leaks to the in-house tests. There are various methods, but the tests become possible once the software includes some mechanism for reporting the remaining amount of each dynamic resource. The idea of the test is simple: first check the remaining amount of dynamic resources, then perform various processes. After the processes finish, check the remaining amount again; if it is the same as before, you can consider that no dynamic resource leak occurred in the meantime.
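The before-and-after check can be sketched as a small helper. `leaked_amount` and `CountedResource` are hypothetical names; the only real requirement is some hook that reports the remaining amount of the resource.

```python
def leaked_amount(get_remaining, workload, repeats):
    """Run the workload repeatedly and return how much of the resource went missing."""
    before = get_remaining()
    for _ in range(repeats):
        workload()
    after = get_remaining()
    return before - after


class CountedResource:
    """Hypothetical resource with an inspection hook, used only to exercise the check."""

    def __init__(self, total):
        self.free = total

    def borrow(self):
        self.free -= 1

    def give_back(self):
        self.free += 1


resource = CountedResource(total=100)


def clean_workload():
    resource.borrow()
    resource.give_back()


def leaky_workload():
    resource.borrow()  # never given back


clean_leak = leaked_amount(lambda: resource.free, clean_workload, repeats=50)
leaky_leak = leaked_amount(lambda: resource.free, leaky_workload, repeats=5)
```

A result of zero suggests that no leak occurred during that workload; any positive result is the amount lost and is worth investigating.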

The "various processes" should include not only ordinary normal processing but also long runs under high load and long runs in which abnormal conditions occur repeatedly. It is also effective to devise ways to accelerate processing so that any latent leak stands out. Because it is difficult to reproduce every environment and condition, such as repeated abnormal conditions, this is not a complete test that detects all leaks, but it can greatly reduce the risk of missing leak bugs.

Timer and counter rollover issues are also time bombs

Alongside dynamic resource leaks, the other time bomb bug is the bug related to rollover handling of timers and counters. Well-known examples are the 49-day and 497-day problems. Something goes wrong 49 or 497 days after the device is started, and searching the net turns up various explanations. In both cases the OS timer is an unsigned 32-bit integer, which rolls over about 49 days after startup (for a timer with 1 ms resolution) or about 497 days after startup (for a timer with 10 ms resolution). Because the trouble occurs at that moment, it is called the 49-day problem or the 497-day problem.
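The 49-day and 497-day figures follow directly from the counter width and the tick length. The helper below is a small worked calculation; the function name is an assumption for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rollover_days(tick_ms, bits=32):
    """Days until an unsigned counter of the given width rolls over."""
    total_ms = (2 ** bits) * tick_ms
    return total_ms / 1000 / SECONDS_PER_DAY


print(f"1 ms tick:  {rollover_days(1):.2f} days")
print(f"10 ms tick: {rollover_days(10):.2f} days")
```

A 32-bit counter holds 4294967296 ticks, which works out to roughly 49.7 days at 1 ms per tick and roughly 497.1 days at 10 ms per tick.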

So what are bugs related to timer and counter rollover handling? Let's review. A typical timer or counter is set to 0 at startup and increases by 1 from there. If it is an unsigned 32-bit integer, its maximum value is 0xFFFFFFFF, which is 4294967295 in decimal. If you add 1 to the maximum value of 4294967295, the value returns to 0 instead of becoming 4294967296. This return to zero after exceeding the maximum is called rollover. Explanations of why the value becomes 0 can be found by looking up how unsigned 32-bit integers are represented in a computer, so please look for them if you are interested.

Now, the fact that a timer or counter made of an unsigned 32-bit integer returns to 0 when incremented from 4294967295 is not itself the problem. The problem lies on the side of the application software that uses the timer or counter. The mechanism is the same for timers and counters, so let's take a timer as the example. Timers are usually used to measure the passage of time. Calculating (current timer value) - (timer value a little earlier) gives the time elapsed since that earlier moment. This is used to judge whether XX seconds have passed. To know that a specific time has arrived, you can test whether (current timer value) >= (timer value indicating the specific time). This is used to do something when the time reaches MM minutes YY seconds.

However, when the timer rolls over, its value decreases as time passes. Put simply, if the timer value was 4294967295 a moment ago, then rolls over to 0 and counts up 103 more times, the current value is 103. The timer value has decreased although time has moved forward. If you then measure the elapsed time by calculating (current timer value) - (timer value a little earlier) without allowing for the rollover, the result can be wrong. This is the problem with timer rollover handling.
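The following sketch contrasts naive and rollover-aware versions of both calculations. All function names are hypothetical. Python integers never overflow, so the timer is emulated with a 32-bit mask, and the naive subtraction behaves like code that performs it in a wider or signed type. In C, subtracting two `uint32_t` values typically wraps modulo 2^32 by itself, which is what the wrap-safe version reproduces.

```python
MASK32 = 0xFFFFFFFF  # emulate an unsigned 32-bit timer register


def tick(value, count=1):
    return (value + count) & MASK32


def elapsed_naive(now, earlier):
    return now - earlier                 # goes negative after a rollover


def elapsed_wrap_safe(now, earlier):
    return (now - earlier) & MASK32      # modulo 2**32, like uint32_t arithmetic


def expired_naive(now, start, timeout):
    deadline = (start + timeout) & MASK32
    return now >= deadline               # wrong if the deadline wrapped past zero


def expired_wrap_safe(now, start, timeout):
    return elapsed_wrap_safe(now, start) >= timeout


earlier = 4294967295
now = tick(tick(earlier), 103)           # rolls over to 0, then counts up 103 times

naive = elapsed_naive(now, earlier)
safe = elapsed_wrap_safe(now, earlier)
```

Comparing elapsed time against a timeout, rather than comparing the current value against an absolute deadline, keeps the check valid across a single rollover as long as the interval is shorter than the full counter range.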

Don't forget that there are down counters too

The explanation so far assumes that the values of timers and counters increase with time. Timers normally do, but counters come in two kinds: up counters, whose value increases over time, and down counters, whose value decreases over time. A down counter is often used to trigger some processing when its value reaches zero, so rollover rarely causes problems. Depending on how the program code is written, however, rollover problems can still occur.

Of course, if the application software that uses the timer or counter is coded on the assumption that the timer or counter will roll over, no problem occurs. Most of the time that is the case, but for trivial reasons, such as an inexperienced programmer happening to be in charge of the code or a bug fix being written in a hurry, rollover handling for the timer or counter sometimes gets left out of the code.

And even if rollover handling is left out, the software works fine and passes in-house tests until the rollover occurs. Until the timer rolls over, nothing goes wrong, so the latent bug is hard to find. Then, when the device has been used in the market for a long time and the timer rolls over, the device suddenly goes wrong. Like a resource leak, this is a time bomb bug, because it causes problems in all devices long after they start operating.

Catch rollover problems in advance by choosing the initial value

Bugs related to timer and counter rollover handling are not discovered until the counter or timer actually rolls over. So unless the in-house test is specially designed, they are often hard to find. Once a timer or counter does roll over, however, the bug readily shows itself. Therefore, if you use a little ingenuity to make the timer or counter roll over, rollover problems become easier to find in in-house testing.

The idea is to set the initial value of the timer or counter to a value close to the maximum instead of 0. Timers and counters often have no requirement to start counting from zero, so an initial value close to the maximum does no harm. With such an initial value, the timer or counter rolls over even when the software runs only briefly, so any latent rollover bug can be found.
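A minimal simulation of this test technique is sketched below, assuming a hypothetical periodic task that should fire every 50 ticks. The function name and all parameters are illustrative. With the timer started at 0, the code without rollover handling passes a 300-tick test; with the timer started 100 ticks below the maximum, the same short test exposes the bug.

```python
MASK32 = 0xFFFFFFFF  # emulate an unsigned 32-bit timer


def count_firings(initial, ticks, period, wrap_safe):
    """Count how often a periodic task fires while the timer advances."""
    now = initial & MASK32
    last = now
    fired = 0
    for _ in range(ticks):
        now = (now + 1) & MASK32
        if wrap_safe:
            elapsed = (now - last) & MASK32
        else:
            elapsed = now - last          # no rollover handling
        if elapsed >= period:
            fired += 1
            last = now
    return fired


near_max = MASK32 - 100                   # initial value chosen to force an early rollover

buggy_from_zero = count_firings(0, 300, 50, wrap_safe=False)
buggy_near_max = count_firings(near_max, 300, 50, wrap_safe=False)
fixed_near_max = count_firings(near_max, 300, 50, wrap_safe=True)
```

Started from zero, the buggy version fires the expected six times; started near the maximum, it fires twice and then stops after the rollover, while the wrap-safe version still fires six times.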

We have introduced two kinds of time bomb bug: dynamic resource leaks and timer and counter rollovers. Both cause problems in all devices after they have been in operation for a long time, so once they occur their impact is very wide. Because that is so unpleasant, Father Gutara gave them the strange name "time bomb bug" and took particular care to confirm that none were present. I hope the software you are developing has no time bomb bugs.
