Monday, August 8, 2011

Linux guest on VMware problem solved...

So we've had this intermittent issue over the last year+ where sometimes, when we add memory to a Linux VM in our VMware environment, the box acts really weird with the new memory. It looks like there's a memory leak or something, because as soon as we start applications and begin using the box, we run out of memory and it becomes slow as it starts swapping, etc. I've dug into it before, but could never identify where the memory was going. I could tell that there was no more available memory, but even once we stopped the applications, we wouldn't have as much free memory as we should have (free here meaning including RAM that was currently being used for fs cache/buffers). I think the ultimate solution was usually to rebuild the box =( This also happened sometimes when building a new VM that was cloned from a different one. I always suspected it was something in VMware, maybe a memory leak, maybe something in the memory add and cloning process, but I could never nail it down.
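For reference, here is a rough sketch of the kind of accounting I mean, assuming the standard /proc/meminfo field names. It adds up "free including cache/buffers" and then estimates how much of MemTotal isn't explained by the usual categories. The function names and the sample numbers are made up for illustration, and the "unaccounted" figure is only a heuristic, not an exact measurement.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_including_cache_kb(info):
    """Free memory counting fs cache/buffers as available."""
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def unaccounted_kb(info):
    """Rough estimate of memory not explained by the common categories.

    Heuristic only: a large value means something other than processes,
    cache, or slab is holding pages (for example a kernel driver).
    """
    explained = (
        free_including_cache_kb(info)
        + info.get("AnonPages", 0)
        + info.get("Slab", 0)
        + info.get("PageTables", 0)
    )
    return info["MemTotal"] - explained


# Hypothetical sample, not captured from the real box.
SAMPLE = """\
MemTotal:        6000000 kB
MemFree:          200000 kB
Buffers:           50000 kB
Cached:           250000 kB
AnonPages:        300000 kB
Slab:             100000 kB
PageTables:        20000 kB
"""

info = parse_meminfo(SAMPLE)
print(free_including_cache_kb(info), unaccounted_kb(info))
```

On a real guest you would feed it the contents of /proc/meminfo instead of SAMPLE, e.g. parse_meminfo(open("/proc/meminfo").read()).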

So it happened again recently: we added memory and the box started acting up, and additional reboots resulted in the same issue. I jumped on the box right away and actually saw that the vmmemctl process in the guest was one of the top processes by CPU usage. I remember that last time this happened I got down the path of suspecting it was something with vmmemctl and/or memory ballooning, but not knowing VMware I didn't put two and two together and stopped digging before things clicked for me. This time, after looking at what was happening on the box and doing some quick googling, I found this forum post: http://communities.vmware.com/thread/133290. I then went into the vSphere client and checked that guest's memory limit... sure enough, the limit was set to 1 GB of RAM even though there was 6 GB of RAM allocated to the guest!! I had the sysadmins change it to unlimited, but the box didn't start responding right away, so we did one more reboot... and problem solved.

Looking back, it all makes sense now. I talked to the sysadmin who is most familiar with these VMware controls (not the one I had worked with on this in the past), and I can see how adding memory or cloning could definitely cause this misconfiguration to suddenly become a problem. I'm not sure how the limit got set in the first place, since it's not something our sysadmins use... we are wondering if upgrading the virtual hardware or VMware Tools sets it sometimes??

The other thing that still bugs me is how vmmemctl "uses" memory when VMware is doing memory ballooning or enforcing memory limits to make the guest think there's no more available memory... Did I miss a way to detect that it was the process using the memory? Or is there no way, from the guest's point of view, to tell that ballooning/limit enforcement is kicking in at the time?
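One thing that might help answer this from inside the guest: as far as I can tell, VMware Tools ships a vmware-toolbox-cmd utility with "stat balloon" and "stat memlimit" subcommands. Below is a minimal sketch that shells out to it. I'm assuming the output looks like "1024 MB", which I have not verified across Tools versions, and the helper names and the 6144 MB configured size are just placeholders for illustration.

```python
import re
import subprocess


def parse_mb(output):
    """Pull an integer out of output assumed to look like '1024 MB'."""
    match = re.match(r"\s*(\d+)\s*MB", output)
    return int(match.group(1)) if match else None


def toolbox_stat(name):
    """Run `vmware-toolbox-cmd stat <name>`; return MB, or None on any failure."""
    try:
        result = subprocess.run(
            ["vmware-toolbox-cmd", "stat", name],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # tool missing, not executable, or timed out
    if result.returncode != 0:
        return None
    return parse_mb(result.stdout)


def limit_looks_low(limit_mb, configured_mb):
    """True when a reported limit is below the memory configured for the VM."""
    return limit_mb is not None and limit_mb < configured_mb


def report(configured_mb):
    """Print what the guest can see; configured_mb is supplied by the caller."""
    balloon = toolbox_stat("balloon")
    limit = toolbox_stat("memlimit")
    print("balloon MB:", balloon)
    print("limit MB:", limit)
    if limit_looks_low(limit, configured_mb):
        print("limit is below configured memory, check the VM settings")
```

Calling report(6144) on a box like ours would, if the assumptions hold, have shown a 1024 MB limit against 6 GB configured. My understanding is that ballooned pages are held by the driver inside the guest kernel rather than by a normal process, which would explain why top never pointed at an obvious culprit.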