Debugging


*** glibc detected *** free(): invalid next size (fast): 0x083de008 ***

One of the very frustrating problem I have seen ever. We first observed this problem in an application of ours on one of our test machine which is a RHEL4 using glibc 2.3.x.

From the error message it is pretty much clear that the problem is because of freeing some invalid pointer. Since our application is a multi-threaded application, an obvious fix was to protect the area where the deletion is happening. We did that. But, the fix didn’t solve the issue but delayed it by 8 hours.

The variable that is being deleted is a char array which has been allocated memory using new operator. We copy the contents in to it using memcpy. Since the new operator handles memory allocation properly, we are not  very worried about the contents being copied. If the app is trying to copy some thing out of the boundary, it is expected that it would crash there. So we continued looking ahead and tried to lock the areas where the variable got created and initialized.  This didn’t solve the problem. but the problem got delayed for 16 days :)..

Yes!! Now comes the major problem… How the hell would you be able to solve an issue that happens only after 16 days run?

We went ahead and looked at all the possibilities. Couldn’t find a clue about what might be gone wrong.

For some reason I got a doubt that the memcpy is trying to copy something out of the bounds. I had this nagging feeling and so I went ahead and came up with a small test sample that replicates the code where we are seeing the crash. It gave us surprising results…

Here is the code that i have used….

#include <stdio.h>
#include <string.h>
int main()
{

char *temp;

int len=12;

temp = new char[len]; // Allocate 12 bytes of memory

printf(“Got the new object\nAllocating the memory now\n”);
temp[0]=’A’; // used the first
printf(“Allocated size =1\n”);
memcpy(temp+1,”BBBB”,4); // used 5 bytes
printf(“Allocated size =5\n”);
memcpy(temp+5,”CCCC”,4);//using 9 bytes
printf(“Allocated size =9\n”);
memcpy(temp+9,”DDDD”,4);// using 13 bytes… I expected to crash here.. But it didnt 😦
printf(“Allocated size =13\n”);
memcpy(temp+13,”EEEE”,4);// using 17 bytes… No crash either
printf(“Allocated size =17\n”);
printf(“%s”,temp); // Printing all the 17 bytes…  Yes it does

delete[] temp; // deleting the variable
return 1;
}

I ran this piece of code on SUSE linux with glibc 2.2 and it didnt core. I saw the same results in HPUX, Solaris and AIX machines. No crashes either at memcpy or at delete. That was very surprising for me. I went ahead and ran the test program on RHEL 4 box.

Bingoo!! It cored and it cored with same errror. The machine has glibc2.3

*** glibc detected *** free(): invalid next size (fast): 0x083de008 ***

Initially I thought it depends on might depend on glibc version as the SUSE linux I had was using glibc2.2.

I tried running the test on all the machines where i could lay my hand on and the test results werent conclusive.

Most surprising thing was when i ran the tests on two RHEL5 machines with glibc 2.5.34. One of them cored with same error and other didn’t. I am still not sure the exact reason for this issue. But I made a few changes in my code to ensure that we dont copy out of the bounds.. some thing like this…

if(sizeof(temp1)> k){
memcpy(temp1,temp2,k);}
else{
return fail;}

This solved our issue. I hope this helps

Advertisements

Memory leaks are very dangerous especially when our applications run for an extended period of time. They tend to eat up your memory (depending on the  amount of leak) and might result in a hang of your application or worse your entire machine.

I have worked with different memory debugging tools and one of them I found very interesting was “libumem“.

The problem: One of my JNI application was leaking badly on Solaris OS. Valgrind was always my first choice, but as valgrind doesn’t support Solaris, I had no other choice but to look for other solutions. On googling, I stumbled upon libumem and found it very promising for my problem. Surprisingly, I found it very easy to use than any other tool.

A brief description of the library: libumem is a memory management library provided by Solaris. If you have Solaris 9(update 3)/10. You should be already having  the library.

libumem is a user space slab allocation library which performs object caching so that the memory that is frequently allocated and freed is cached. This reduces the overhead of creating and releasing the memory. Whenever the application requests memory from the system, the system will allocate the memory from this umem cache. And the buffer, which is the building block of the cache contains the redzone section and debug metadata section which holds the that data helps in analyzingthe memory leaks and memory corruption. And one can view all this information using MDB(Modular Debugger)

Enough of theory… Roll up your sleeves now… Letz get in to some action…

Before starting your rouge application, we need to set a few things.

bash-3.00#LD_PRELOAD=libumem.so

bash-3.00#export LD_PRELOAD

bash-3.00#UMEM_DEBUG=default

bash-3.00#export UMEM_DEBUG

bash-3.00#UMEM_LOGGING=transaction

bash-3.00#export UMEM_LOGGING

The above settings ensures that libumem is loaded on to the application and the information like thread-ID, timestamp and stack trace are recorded for each memory transaction initiated by the application.

Now start your application

bash-3.00#./Myapp.bin

Get the  process Id of the application and attach it to the mdb…

bash-3.00# mdb -p 24814
Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 libuutil.so.1 ]
>

If your application got cored, you can even run mdb with core. But we must ensure that libumem is preloaded in to the application. Now moving to the point, letz get the leaks using findleaks

>::findleaks
BYTES             LEAKED VMEM_SEG CALLER
40960                  8 d0fc0000 MMAP
8192                   1 fe308000 MMAP
8192                   1 fe302000 MMAP
8192                   1 fbfb8000 MMAP
8192                   1 fbfb6000 MMAP
8192                   1 f9e72000 MMAP
16384                  1 d15d0000 MMAP
24576                  1 d0fca000 MMAP
————————————————————————
Total       8 oversized leaks, 122880 bytes

CACHE     LEAKED   BUFCTL CALLER
000dae08       1 0012a258 libc.so.1`strdup+0xc
000db188    8902 005674a0 libmysocket_jni.so`Java_com_my_tempsock_mysocket_DSA_1wrap_Mylan_1socketConnect+0x2c
000e3c08       3 005e4708 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3888       1 001930e0 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3c08       7 007cd158 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000dae08       2 0012d0e0 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       2 00791680 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3c08       6 0032e690 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       3 005661e0 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3c08       2 007ccd20 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       3 00269680 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3c08       3 005e4960 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       6 00269f68 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000e3c08       5 005e2ff0 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       7 0078a168 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000de388       1 00168ff0 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000de708       1 000efd10 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       4 0015be00 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db188       5 009599c8 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000de708       1 000eff68 libjvm.so`__1cCosGmalloc6FI_pv_+0x24
000db888    8943 00243428 libsocket.so.1`get_addr+0x3e0
000db188    8718 00567338 libsocket.so.1`get_addr+0x410
000df508       1 004f9860 pkcs11_kernel.so.1`get_bigint_attr_from_template+0x28
000df508       1 004f9950 pkcs11_kernel.so.1`get_bigint_attr_from_template+0x28
000dae08       1 003e3ba8 pkcs11_kernel.so.1`get_bigint_attr_from_template+0x28
000db508       1 005547f8 pkcs11_kernel.so.1`get_bigint_attr_from_template+0x28
———————————————————————-
Total   26630 buffers, 812808 bytes

It shows many leaks, but most of them are just false alerts. So just ignore them. Look at the leaks that were bolded. They are more serious. To get more information about the leaks, we need to print the stack trace. The command to get the stack is “<CACHE_mem_add>$<bufctl_audit”.

> 00243428$<bufctl_audit
ADDR          BUFADDR        TIMESTAMP           THREAD
CACHE          LASTLOG         CONTENTS
243428           241120 219a283c058e25               27
db888                0                0
libumem.so.1`umem_cache_alloc+0x210
libumem.so.1`umem_alloc+0x60
libumem.so.1`malloc+0x28
libsocket.so.1`get_addr+0x3e0
libsocket.so.1`_getaddrinfo+0x710
libmysocket.so`Mygetaddrinfo+0x78
libmysocket_jni.so`Java_com_my_tempsock_mysocket_DSA_1wrap_Mylan_1socketConnect+0x100
0xfc00cd40
0xfc00ccec
0xfc0059b8
0xfc005850
0xfc005850
0xfc005850
0xfc005970
0xfc005970

The above stack trace clearly tells me that there has been some memory allocated by getaddrinfo, which hasnt been freed up by freeaddrinfo.  I went ahead and fixed up that bug. But not every memory leak / buffer overflowis so simple. We might need some more information about the buffer. The above bolded string is the buffer address of the leak. Printing information will tell us more details like size of the allocation, if there is any corruption etc…

> 241120/10X

will give us the whole information( my buffer is too huge her to copy here). To find if there is any corruption in the memory, get the last two values of the buffer and XOR them. It should return 0xa110c8ed. If not, then our memory is definitely corrupted.

Thats all about how we can use libumem to debug a memory leak in Solaris. So simple right? :). Hope this helps