Event log errors: "server down - failed to connect&quot

After some examination, I've noticed that our Equalizer 350si event log has been randomly spitting out "server down - failed to connect" errors for quite some time. What doesn't add up is that neither the web servers, the switches, nor even our network monitoring software reports any downtime. Anyone else seen this?

Topology:
EQ1 in a 'single network' w/ redundancy, meaning it also has a backup EQ (EQ2), both utilizing only the server (internal) ports.

Error Message: "server down - failed to connect: Operation timed out".

In our testing environment topology, I've also seen this:
EQ3 = testing load balancer in a 'single network' with no backup EQ

Error Message: "server down - failed to connect: Host is down" and "server down - failed to connect: Socket is not connected".

Almost all the servers are dual-core 3 GHz Xeons running Windows Server 2003 SP2 with IIS 6. Primary services being load balanced = TCP/80 and TCP/443.

Replies

Submitted by mark.hoffmann on

Thanks for your question.
Could you please send us the logs as an attachment?
Thank You.

Mark Hoffmann
Coyote Point Engineering

Submitted by cjordan on

Sent

Submitted by mark.hoffmann on

OK -- thanks for the logs and eqcollect.

We have a number of suggestions:

1. There are complaints from ARP about duplicate IP addresses. Please verify that all IP addresses on your subnet are uniquely assigned, all netmasks are correct, and all cabling is correct and functional.

2. There is a spurious file on lb01: /var/eq/licenses/sysinstall.debug. This may or may not be related to your question, but only license-related files should be placed in this directory by Equalizer. Remove this file as follows, and reboot:

a. log in as root
b. mount -w /
c. rm /var/eq/licenses/sysinstall.debug
d. rm /.master/var/eq/licenses/sysinstall.debug
e. shutdown -r now
f. Check to see if this file is also on lb02 and remove it if so, as above.

3. The log entries that indicate loss of contact with a server appear at seemingly random times and for very short durations. It could be that there is a bad cable somewhere, or that the servers are occasionally overloaded and unable to respond to Equalizer's probes before timing out. If it's overloading, then increasing the probe timeout value may make these log entries go away. NOTE that timeouts are in tenths of a second, so a "15" in this field means "1.5 seconds". Try a longer timeout value, say 5 seconds (50) and see if this helps.
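
If you want to check that independently of Equalizer, a rough sketch (the server name "web01" and port 80 below are placeholders for one of your own servers and the balanced service port) is to run a timed TCP connect test from another host on the same subnet and log any attempt that doesn't complete quickly:

# Rough sketch: try a TCP connect to a server every few seconds with a
# 2-second limit, and log any attempt that fails or times out.
# "web01" and port 80 are placeholders for your own server and service port.
while true; do
    if ! nc -z -w 2 web01 80 > /dev/null 2>&1; then
        echo "$(date '+%Y-%m-%d %H:%M:%S')  connect to web01:80 failed or exceeded 2s"
    fi
    sleep 5
done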

If this does not work, then I'd suggest you forward us a network diagram in private email so we can take a closer look at your configuration.

Thank you.

Mark Hoffmann
Coyote Point Engineering

Submitted by cjordan on

Ok, thanks.

1. Resolved already from a previous discussion.

2. Will perform this weekend since it requires a reboot.

3. Per the web server logs, we don't see any overloads. However, I'll increase this timeout and test it tonight. By the way, you say the timeouts are in tenths of a second, but when I mouse over the field it says "SECONDS".

Submitted by cfarley on

We get the same messages every now and then about some of our IIS 6 servers. I'm sure increasing the timeout would stop the messages, but it happens so infrequently and for such a short period of time that I've decided to leave it alone.

Submitted by cjordan on

I changed the probe timeout from a field value of 15.0 to 50.0 last night. However, I just received another "server down - failed to connect: Operation timed out" message this morning.

Of note is the fact that it's always 5 seconds from the time the Equalizer reports the server down to the time it reports it back up again. This applies to both probe timeout values, 15.0 and 50.0.

Submitted by mark.hoffmann on

Has the number of log messages decreased at all since you changed the probe timeout?

Are there still just 5 seconds between the 'server down' and 'server up' messages in the log?

In a previous post, I suggested that you forward us a network diagram of your configuration...please private message a simple diagram so we can take a look.

Thank you.

Mark Hoffmann
Coyote Point Engineering

Submitted by cjordan on

Nope. I've also removed the spurious file, which was only on the primary Equalizer...no difference. Still seems to output the same message about once per day. Always 5 seconds long.

Do you need a Visio of the entire network, or just the upstream and downstream connections?

Submitted by mark.hoffmann on

OK. One more suggestion would be to increase the strikeout_threshold, which I see is set to 4 on lb01 and 3 on lb03. This is the number of times Equalizer will retry a probe before considering the probe to have failed. Try setting it to 5 or 6 and see if this quiets down these spurious messages.

We're pretty convinced that this behavior is due to some link in your network being renegotiated just when Equalizer happens to be performing a probe. It's the probe that writes the failure message, and the "server up" message is written on the very next pass of the probe daemon through the server probes.
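
To make the interplay between the probe timeout and the strikeout_threshold concrete, the general shape of that logic (just an illustrative sketch, not Equalizer's actual probe daemon; the server name and values are placeholders) is roughly:

# Illustrative sketch of a strikeout-style TCP probe; not Equalizer's actual code.
# The server is only reported down after every retry in a probe pass fails;
# the "back up" message comes from the next pass in which a probe succeeds.
SERVER=web01; PORT=80            # placeholders
STRIKEOUT_THRESHOLD=5            # retries before the server is declared down
PROBE_TIMEOUT=5                  # seconds allowed per connect attempt

failures=0
while [ "$failures" -lt "$STRIKEOUT_THRESHOLD" ]; do
    if nc -z -w "$PROBE_TIMEOUT" "$SERVER" "$PORT" > /dev/null 2>&1; then
        exit 0                   # any success ends the pass with the server up
    fi
    failures=$((failures + 1))
done
echo "server down - failed to connect"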

We don't need a full network diagram...what does the route between Equalizer and the servers look like? That is, are they directly connected or does the route go through other devices?

Mark Hoffmann
Coyote Point Engineering

Submitted by cjordan on

"Single network"

Both the Equalizer (1 int. port) and servers connect to a switch (Cisco CAT 2960G).

Submitted by mark.hoffmann on

If you examine the logs on your Cisco switch, I would bet that you'll find a correlation between some operation being performed on the switch and the times at which these spurious server down messages occur.
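
One quick way to pull out the timestamps to line up against the switch log (assuming you export the Equalizer event log to a file first; "equalizer.log" below is just a placeholder filename):

# List the timestamps of the spurious events from an exported Equalizer log,
# for comparison with the switch's "show logging" output.
grep "server down - failed to connect" equalizer.log | cut -d'|' -f1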

The way to prove that this is the case would be to attach a server directly to one of Equalizer's internal ports and see if it returns these messages with no switch between Equalizer and the server.

Mark Hoffmann
Coyote Point Engineering

Submitted by cjordan on

Ok, I've increased all strikeout thresholds from their current value to 5. I'll run like this for the next couple weeks to see if we still get noise.

Also, after a quick review I don't see anything in the configs or logs which would cause random hiccups like this. But I'll download and analyze the tech support dump on the switches connecting to the EQ later on this week just in case.

Submitted by cjordan on

I've increased the threshold from 4 to 5 to 6. Unfortunately, both the redundant pair of EQs (A, B) and a separate EQ I have set up for a staging environment (C) report the same errors regularly. All are 5 secs in length from DOWN to UP time.

Nothing in the web server logs or our network management software reflects this problem at any time. Only the Equalizer seems to think there is a problem.

Btw, I'm not sure what we could accomplish by running your testing scenario, since we run a single network environment. If I tested a server by connecting it directly to the EQ's internal ports, I'd be changing it to a dual network environment.

Submitted by databasics on

Were you able to solve this problem? We are having the exact same issue, down to the 5 seconds between the Coyote marking the server as down and back up.

This is causing a big problem. If a customer is connected to a web app through the Coyote and that server is marked down, they are kicked out of the web session and sent to the other server.

Submitted by mark.hoffmann on

Hi,

I'm not sure how the original case in this thread ended up, since this customer was working with the support group after posting here. Some recent experimentation with timeouts leads me to believe that this problem might be alleviated by increasing the 'connect timeout' value, either the global value, or the per-cluster value. Try this and see if it helps.

Thanks,

Mark

Mark Hoffmann
Coyote Point Engineering

Submitted by databasics on

Thanks, but I thought the 'connect timeout' value was for L7 only. These are L4 clusters.

Submitted by mark.hoffmann on

I should have provided more background about this...

Yes, that is the way 'connect timeout' has always been documented, and 'probe timeout' has always been documented as the timeout for TCP and ACV probes. Unfortunately, the documentation on timeouts has not kept pace over the years with changes to the system.

Recent experimentation indicates, however, that the 'connect timeout' is also used for the TCP probe connection timer -- regardless of whether the cluster in question is Layer 4 or Layer 7 -- and that the 'probe timeout' only takes effect if there is an ACV probe defined.

So, it looks to me like increasing 'connect timeout' and possibly 'probe delay' might be the way to stop these 5-second-duration server down/server up msgs. I can't, however, say this definitively, as this behavior has proved difficult to reproduce in our development/testing environment.

Hope this helps,

Mark

Mark Hoffmann
Coyote Point Engineering

Submitted by databasics on

Thanks for the explanation; I was wondering why that setting was in the L4 cluster settings. I've already tried increasing the 'probe delay', so I'll try the 'connect timeout' and see what happens.

Submitted by Alpha Coyote on

Here's another parameter to adjust. You can also try disabling the global "probe_icmp" flag, which will cause the Eq to stop trying to use ICMP to determine server liveness.
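
Before turning it off, one simple way to see whether ICMP replies themselves are occasionally being lost (the server name "web01" is a placeholder) is to run a long ping from a host on the same subnet and check the loss summary:

# Ping a server once per second for an hour and report the packet-loss summary.
ping -c 3600 web01 | tail -2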

-=Alpha

Submitted by cshannahan on

Any updates to this?

We are only using Layer 4 TCP clusters, but on one of the clusters we're getting "server down - failed to connect|Host is down".

We are using regular TCP probes and ICMP probes. We have people blaming the LB, who think it's the cause of the problems for some reason.

See the log below; the down/up pairs are always 5 or 10 seconds apart! What would cause this exact message?

Nov 10 06:05:45| lbd|w| EWAPPS_443| WL2|server down - failed to connect|Host is down|
Nov 10 06:05:45| lbd|w| EWAPPS_7001| WL2|server down - failed to connect|Host is down|
Nov 10 06:05:50| lbd|w| EWAPPS_443| WL2|back up||
Nov 10 06:05:50| lbd|w| EWAPPS_7001| WL2|back up||
Nov 10 06:06:00| lbd|w| EWAPPS_443| WL3|server down - failed to connect|Host is down|
Nov 10 06:06:00| lbd|w| EWAPPS_7001| WL3|server down - failed to connect|Host is down|
Nov 10 06:06:05| lbd|w| EWAPPS_443| WL3|back up||
Nov 10 06:06:05| lbd|w| EWAPPS_7001| WL3|back up||
Nov 10 08:00:03| lbd|w| EWAPPS_443| WL6|server down - failed to connect|Host is down|
Nov 10 08:00:03| lbd|w| EWAPPS_7001| WL6|server down - failed to connect|Host is down|
Nov 10 08:00:08| lbd|w| EWAPPS_443| WL6|back up||
Nov 10 08:00:08| lbd|w| EWAPPS_7001| WL6|back up||

Chris

Submitted by mark.hoffmann on

Hi and thanks for posting.

These messages can indicate a variety of issues: network misconfiguration, firewalls blocking probes, a misconfiguration on Equalizer itself, or other problems.

Basic probe troubleshooting involves taking packet captures to see whether the probe queries and responses are actually occurring, and whether they are being routed properly through the network from Equalizer to the servers and back again. This is the first step in determining where the actual problem lies.
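
For example, a capture along these lines on Equalizer's server-side interface (the interface name "em0" and the address 10.0.0.21 are placeholders for your own interface and one server's address) will show whether the probe SYNs are going out and whether the server is answering them:

# Capture probe traffic between Equalizer and one server, writing it to a file
# for later inspection.
tcpdump -i em0 -nn -w probe-capture.pcap host 10.0.0.21 and port 80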

I've asked the Support group to contact you directly to work through debugging your probing configuration.

Mark

Mark Hoffmann
Coyote Point Engineering
