Anatomy of a Bug


Josh


The model editor animation bug was the second-worst bug to hit Leadwerks Game Engine in its entire history. Reported multiple times, it caused animated models to discard triangles, but only in the model editor and only on Linux.

http://www.leadwerks.com/werkspace/topic/10856-model-editor-freaks-out/
http://www.leadwerks.com/werkspace/topic/12678-model-animation-vs-flashing-bodyparts/

Since our animation commands have worked solidly for years, I was at my wits' end trying to figure this out. I strongly suspected a driver bug related to sharing uniform buffers across multiple contexts, but the fact that it happened on both AMD and Nvidia cards did not support that, unless the problem lay at a lower level within the Linux distro. An engineer from Nvidia wasn't able to find the cause. Had my suspicion been correct, this would not have been the first driver bug I have found and had confirmed by the Nvidia, AMD, and Intel driver teams.

To make things even more difficult, the error only occurred in the release build. Debug builds could not be debugged because no error would occur!

It never even occurred to me that the bone matrix data itself could be wrong until Leadwerks user Roland reported that the bug was occurring in his game. This was the first time anyone had reported the error occurring anywhere but the model editor.

I finally determined that the actual bone matrices being sent to the animation shader contained many values of "-nan", meaning the negative form of "not a number". I was shocked. How could this possibly be when our animation commands have been completely reliable for years?
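
A check of this kind can be sketched in a few lines. This is only an illustration, not the engine's code: the helper name CountNaNs and the idea of treating the bone matrices as one flat float array are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of Leadwerks): counts the NaN entries in a
// flat float array, e.g. bone matrices packed as 16 floats each.
std::size_t CountNaNs(const float* data, std::size_t count)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        // std::isnan is true for both "nan" and "-nan"; the sign bit is ignored.
        if (std::isnan(data[i])) ++n;
    }
    return n;
}
```

Calling something like this right before the matrices are uploaded to the shader would show whether the bad values originate on the CPU side. One caveat: if a build uses GCC's -ffast-math (which implies -ffinite-math-only), the compiler is allowed to assume NaNs never occur, and a check like this may be optimized away.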

I started printing values out and finally traced the problem back to the Quaternion spherical linear interpolation, or Slerp function. Slerp is a function that smoothly interpolates between two quaternion rotations without the problem of gimbal lock. This is the code for the function:

void Quat::Slerp(const Quat& q, float a, Quat& result)
{
   bool f = false;
   float b = 1.0f - a;
   float d = x*q.x + y*q.y + z*q.z + w*q.w;
   if (d<0.0f) {
       d = -d;
       f = true;
   }
   if (d<1.0f) {
       float om = Math::ACos(d);
       float si = Math::Sin(om);
       a = Math::Sin(a*om) / si;
       b = Math::Sin(b*om) / si;
   }
   if (f == true) a *= -1.0f;
   result.x = x*b + q.x*a;
   result.y = y*b + q.y*a;
   result.z = z*b + q.z*a;
   result.w = w*b + q.w*a;
}
In the function above, "a" is an interpolation value. I found that when a was equal to 0.0 the function would sometimes return the -nan values, but only when compiled in release mode with the GCC compiler! Adding a quick check at the beginning of the function fixed the problem:
if (a==0.0f)
{
   result.x = x;
   result.y = y;
   result.z = z;
   result.w = w;
   return;
}
And with that, it appears this issue can finally be put to rest. I think the lesson I learned here is to always go where the bug leads, even if you are sure there isn't a problem there.
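
The early return for a == 0.0 handles the case observed here. A more general defensive pattern, shown below purely as a sketch, is to clamp the dot product before the arccosine and to fall back to the linear weights whenever the sine is too small to divide by safely. The names Quat4 and SlerpSafe and the 1e-6 threshold are assumptions for this example, and it uses the standard library's radian-based functions rather than the engine's Math class.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical stand-in for the engine's Quat class.
struct Quat4
{
    float x, y, z, w;
};

// Slerp sketch: interpolates from p (a = 0) to q (a = 1).
Quat4 SlerpSafe(const Quat4& p, const Quat4& q, float a)
{
    float b = 1.0f - a;
    float d = p.x*q.x + p.y*q.y + p.z*q.z + p.w*q.w;
    bool flip = d < 0.0f;
    if (flip) d = -d;

    // Clamp so rounding can never push the acos argument above 1.
    d = std::min(d, 1.0f);
    float om = std::acos(d); // angle in radians
    float si = std::sin(om);

    // Use the spherical weights only when the sine is safely non-zero;
    // otherwise keep the linear weights a and b computed above.
    if (si > 1e-6f)
    {
        a = std::sin(a * om) / si;
        b = std::sin(b * om) / si;
    }
    if (flip) a = -a;

    return { p.x*b + q.x*a, p.y*b + q.y*a, p.z*b + q.z*a, p.w*b + q.w*a };
}
```

With this structure a division by zero cannot occur even when both inputs are the same rotation, because the division is skipped whenever the sine is at or near zero. Whether that is the actual failure path in the release build was not established.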

8 Comments


Usually when a bug happens in release but not debug it's a failure to initialize a value since initial values can be set differently between debug and release unless specifically given a value by the programmer.

 

I know you figured it out but in the code you have where are the standalone x,y,z,w variables defined and initial values set?

Link to comment

Those are members of the Quat class, and they are initialized to 0,0,0,1. I think it might be a divide by zero, but it is still perplexing. The numbers go in fine, and out comes a bunch of -nan values.

Link to comment

From Stack Overflow:

 

If the value is outside of [-1,+1] and passed to asin(), the result will be nan

 

divide by zero would produce a different bug but getting a nan I would have to think is because of sin/acos function calls and something they don't like having passed to them.
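
For reference, the cases being discussed can be told apart with a small experiment. The sketch below assumes IEEE 754 floating point and a build without -ffast-math; the struct and function names are made up for the example. Under those assumptions a non-zero value divided by zero gives infinity, while 0/0 and an out-of-range asin both give NaN.

```cpp
#include <cmath>

// Hypothetical container for the three results, for illustration only.
struct FloatEdgeCases
{
    float asinOutOfRange;  // asin(1.5)
    float nonZeroOverZero; // 1.0 / 0.0
    float zeroOverZero;    // 0.0 / 0.0
};

FloatEdgeCases EvaluateEdgeCases()
{
    // volatile keeps the compiler from folding the divisions at compile time.
    volatile float zero = 0.0f;
    volatile float one = 1.0f;

    FloatEdgeCases r;
    r.asinOutOfRange = std::asin(1.5f); // argument outside [-1, 1]
    r.nonZeroOverZero = one / zero;     // infinity under IEEE 754
    r.zeroOverZero = zero / zero;       // NaN under IEEE 754
    return r;
}
```

So a plain divide by zero would normally show up as "inf" rather than "nan", but a 0/0 would look the same as a bad asin or acos argument.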

Link to comment

Okay, so here was the previous Math::ASin function:

inline static float ASin(const float a)
{
   return asin(a)*RADTODEG;
}

 

And it looks like something like this will work instead:

inline static float ASin(const float a)
{
   return asin(fmod(a + 1.0f, 2.0f) - 1.0f)*RADTODEG;
}

 

I'll test it and make sure. I can't think right now.
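
One thing worth checking when testing: fmod wraps the input rather than clamping it, so an input of exactly 1.0 becomes -1.0 and the result flips sign. A clamping variant is sketched below next to the wrapping one for comparison. The free-function names and the kRadToDeg constant are assumptions standing in for the engine's Math class and RADTODEG.

```cpp
#include <algorithm>
#include <cmath>

// Assumed value for the engine's RADTODEG constant (degrees per radian).
const float kRadToDeg = 57.2957795f;

// Wrapping variant from the comment above, reproduced for comparison.
inline float ASinWrapped(float a)
{
    return std::asin(std::fmod(a + 1.0f, 2.0f) - 1.0f) * kRadToDeg;
}

// Clamping variant: out-of-range inputs are pinned to the nearest valid value.
inline float ASinClamped(float a)
{
    return std::asin(std::max(-1.0f, std::min(a, 1.0f))) * kRadToDeg;
}
```

With these definitions ASinWrapped(1.0f) comes out near -90 degrees, while ASinClamped(1.0f) and ASinClamped(1.0001f) both come out near 90 degrees.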

Link to comment

That's quite an interesting bug you have right there. What flags are you using in release mode with gcc? Also, have you tried printing out all the values after each intermediate step to find out exactly where the values turn to -NaN? From the function you posted, the only two ways of getting NaN would be a division of 0/0, or calculating acos of a value outside the range [-1,1]. Both cases would only be possible inside the if(d<1.0f) branch. However, in order to get to this branch, the value of d must be in the range [0.0, 1.0[, thus the acos function can't produce NaN. Furthermore, the value of si = sin(acos(d)) is only zero for d==-1 or d==1, which again cannot happen. All the other math functions can only return NaN if (at least) one of their inputs is already NaN, so I would be very interested in seeing where the values actually start turning crazy. Can you provide input values for "q", "result", and "this" that reliably produce this misbehaviour?

Link to comment

I didn't narrow it down any further than that. The values must be printed out because the bug does not occur in debug mode.

Link to comment