Cluster Troubleshooting

Troubleshooting a cluster can be complex. You should collect as much information as possible to discover the cause(s).

A useful way to collect this information is to generate the cluster logs. A cluster log is not a live trace of what is happening at the moment; it is generated from a time window in the past.

A short example of such a log file:

00001df4.00001498::2011/10/06-08:00:20.732 INFO  [GUM] Node 3: Processing RequestLock 3:468
00001df4.000016bc::2011/10/06-08:00:20.763 INFO  [GUM] Node 3: Processing GrantLock to 3 (sent by 2 gumid: 1749)
00001df4.00000af8::2011/10/06-08:00:33.698 INFO  [GUM] Node 3: Processing RequestLock 1:122
00001df4.00000af8::2011/10/06-08:00:33.698 INFO  [GUM] Node 3: Processing GrantLock to 1 (sent by 3 gumid: 1750)
00001df4.000016bc::2011/10/06-08:00:40.282 INFO  [GUM] Node 3: Processing RequestLock 2:303
00001df4.00000af8::2011/10/06-08:00:40.283 INFO  [GUM] Node 3: Processing GrantLock to 2 (sent by 1 gumid: 1751)

Notice that each entry in the cluster log has a particular format. The process ID and thread ID of the thread that issues the log entry are the first part of each line, and are separated by a period. The next part of the line is the system time in Coordinated Universal Time. The third part of the line is the level of the event. This can be an error (ERR), a warning (WARN), or an information message (INFO). Following the event level is the cluster component that generated the message. Messages that the SQL Server resource generates show as [RES]. The message that the component is logging follows at the end. Note that you can obtain the error description from the error number. For system error codes, you can obtain the error description by using the command with the error code.
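
The line format described above can be split into its parts with a small script, which helps when filtering a large log for errors or for one component. The following Python sketch is an illustration only, not a tool that ships with the cluster. The names LINE_RE, ClusterLogEntry and parse_line are hypothetical. It assumes that the process ID and thread ID are written in hexadecimal, as the sample lines suggest, and it recognizes only the three levels named above.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumed layout, taken from the sample lines:
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL>  [<COMPONENT>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>ERR|WARN|INFO)\s+"
    r"\[(?P<component>[^\]]+)\]\s*"
    r"(?P<message>.*)$"
)


class ClusterLogEntry(NamedTuple):
    process_id: int      # decoded from hex (assumption)
    thread_id: int       # decoded from hex (assumption)
    timestamp: datetime  # UTC, as stated in the text
    level: str           # ERR, WARN or INFO
    component: str       # e.g. GUM or RES
    message: str


def parse_line(line: str) -> Optional[ClusterLogEntry]:
    """Return the parsed entry, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(
        match["ts"], "%Y/%m/%d-%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return ClusterLogEntry(
        process_id=int(match["pid"], 16),
        thread_id=int(match["tid"], 16),
        timestamp=timestamp,
        level=match["level"],
        component=match["component"],
        message=match["message"],
    )


if __name__ == "__main__":
    sample = (
        "00001df4.00001498::2011/10/06-08:00:20.732 INFO  "
        "[GUM] Node 3: Processing RequestLock 3:468"
    )
    entry = parse_line(sample)
    print(entry.level, entry.component, entry.message)
```

Lines that do not follow the layout, such as continuation lines or entries with a level other than the three listed, come back as None, so a caller can count or inspect them separately instead of silently dropping them.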