This is one of the very difficult issues we resolved which took us through a lot of trouble. The issue was first reported on an application A(name changed) that it wouldn’t start during the reboot of the machine.  Once the machine has started, the customer was able to start the application using the same commands that we used in SCM.

The application A is dependent on our application C. And in the SCM logs they see this… (esecuzione =execution && avvio=start)

06/09/2007       12:32:54 Service Control Manager Information        None     7036     N/A       34100PC023      The C service entered the esecuzione state.

06/09/2007        12:32:53 Service Control Manager Information        None     7036     N/A       34100PC023      The A service entered the esecuzione state.

At this point, C and then A are able to complete their initialization.  A starts before C and presumably is none the worse for wear for doing so.

06/09/2007        12:32:51            Service Control Manager Information        None     7036     N/A       34100PC023      The Servizi terminal service entered the esecuzione state.

06/09/2007        12:32:51            Service Control Manager Information        None     7036     N/A       34100PC023      The Windows Installer service entered the esecuzione state.

…..

06/09/2007        12:32:50            Service Control Manager Information        None     7036     N/A       34100PC023      The NLA (Network Location Awareness) service entered the esecuzione state.

06/09/2007        12:32:50            Service Control Manager Information        None     7035     NT AUTHORITY\SYSTEM            34100PC023      The NLA (Network Location Awareness) service was successfully sent a avvio control.

06/09/2007        12:32:50            Service Control Manager Error     None     7022     N/A       34100PC023      The A service hung on starting.

A hung, It also does something which can cause it to block BEFORE telling SCM that it has started.

06/09/2007        12:32:10            Service Control Manager Error     None     7022     N/A       34100PC023      The C service hung on starting.

Application C hung because comms. isn’t available. We were reasonably sure from the C’s logs that it hangs on a WSAStartup call.

So here in our scenario, application C doesnt fail to start but it times out during the startup. This results in the failure in the start up of application A. Our analysis after this was on that particular machine there was an unknown service which needs to start before a WSAStartup call will complete (when IP is initialized).  Application C calls WSAStartup early in its initialization, before it tells SCM it has started.  This call blocks until the unknown service has asked other optional services to complete IP initialization.  As SCM will only start one service at a time while booting, service startup freezes until SCM times out App C’s startup.

When C’s startup is timed out, whatever is necessary to complete IP initialization can continue but.. As A is dependent on C, SCM fails A. Eventually, IP initializes and the blocks on A and C are removed and everything starts up.

With the above analysis, we decided to fix it by reporting SCM that it has started before calling WSAStartup.  We made the changes and when we ran it over the customer environment, we faced same problem again. :(.

The WSAStartup call didnt hang but it failed with an error code.

Completely fed up… we asked the customer to run bootlogxp. This tool gives us better information about the loading of dll’s during the boot up. It gave the following update…

C:\WINDOWS\SYSTEM32\IMM32.DLL Start: 28.401 sec

C:\Programmi\C\bin\C.dll Start: 28.173 sec

C:\WINDOWS\SYSTEM32\DNSAPI.DLL Start: 173.365 sec

This along with the C’s logs didnt tell us any thing new, but confirmed one thing. The WSAStartup doesnt hang but the subsequent socket calls do hang till the comm’s are up.

The basic problem here was that service control manager (SCM) starts one service at a time and if one blocks, it slows down the whole boot sequence.  A blocked service will eventually be timed out, at which time, SCM will let it continue and start the next it has selected.  To get round this, we changed things so that we report to SCM that we have started at an earlier place in our initialization than we would like.  The downside of this is that we may be trying to use service dependencies as a means of getting things started in the right order, so that the services we need will be available before we start.  Unfortunately, reporting to SCM early means that we re-introduce the race conditions we were trying to avoid and this makes service dependencies less useful.

Another thing we did was to remove the dependency between A and C. After making these changes and running the applications on the customer’s machine, we encountered the same problem again.

And finally everyone lost hopes and started blaming on the system. Everyone suggested that it could be machine specific. The customer re-imaged another machine and it faced the same problem. Dead End….

The solution to this problem came in a quite interesting way… Our L2 tried to get more information about the system’s environment and by chance found the customer’s environment the system variable PATH contains entries for two network shares.

This answered a lot of question… The explanation seems to be this… When A starts up it tries to load E.DLL using LoadLibraryEx. Unfortunately, this DLL is not present on the system and it starts looking into all the directories specified in %PATH%.  This takes time as the network is not yet up on the machine.  Once these variables are removed, everything started working fine.

Before this, I didn’t know we could put network shares on the system path, as network shares tend to be visible to particular users and would normally be invisible to the ‘system’ user which services run under.  I’ve learned quite a lot from this issue… About how SCM works and the network paths in the system path variable. This info was helpful for me in resolving some more issues in the same space. I hope this helps everyone else…

Advertisements