Diagnosing an ABEND
Upon encountering an ABEND, an administrator's first instinct is to
restart the Server. However, spending a few minutes researching and
recording information related to the ABEND can help to prevent its
recurrence and further stabilize the Server.
Due to its non-preemptive task switching design, NetWare can often
isolate and identify the specific source of the error. However, it
will occasionally mis-identify the source due to the pseudo
preemption which takes place during certain I/O operations. For
these situations, and just to generally confirm any suspicions, it
can prove useful to initiate NetWare's internal debugger and collect
additional information.
The System Console will display an error message about the event
which NetWare reports to have caused the ABEND. Before entering the
internal debugger, the exact ABEND error information should be
recorded, including the first line of the Stack hex dump information
since it may prove useful to someone providing NetWare kernel
support. Very rarely will it be necessary "to copy diagnostic
image to diskette" since that image can only be interpreted by
the most knowledgeable engineers at Novell and is rarely requested
by Novell's technical support engineers.
(Note: If a diagnostic image is desired, be prepared to insert an
unformatted, high density diskette for each Mbyte of installed RAM.
This process may take some time . . . and patience.)
At this point, it is usually possible to enter the internal debugger
by either entering 386debug at the System Console or by pressing and
holding the Left Alt, both Shift, and the Escape keys. Should the
Server fail to enter the internal debugger, try pressing the Caps
Lock key. If the Caps Lock light fails to toggle (on/off), it
generally indicates a hard conflict which has left the Server
completely inoperable. In this case the only option is to power down
the Server.
Once in the internal debugger, it is possible to acquire information
about the ABEND which can help technical support personnel ascertain
the exact cause of the problem. To retrieve this information, enter
each command shown after the # prompt (i.e., the debugger prompt).
Record the Server response in the space provided and follow the
directions given in parentheses.
Once the information has been collected, forwarding it to the
appropriate technical support personnel can help them identify and
isolate the potential cause of the ABEND.
To retrieve more information about the ABEND, enter the following command at the debugger prompt:
# .a
Since the reply varies, be sure to record all of the information
returned.
Once the details about the cause of the ABEND have been recorded, it is also important to retrieve information about the Running Process. To do so, enter the following command at the debugger prompt:
# .r
Running process pointer: ________
Process name: __________________________________
Address:________
Stack pointer: ________
Stack limit: ________
Scheduling priority: _
Wait state: __
________ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ________________
________ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ________________
(Record the returned information in the blanks provided.)
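For administrators who prefer to keep this worksheet electronically, the blanks above could be mirrored in a small record. The following Python sketch is only an illustration: the class name, field names, and helper are hypothetical and are not part of NetWare or its debugger.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunningProcessRecord:
    """Hypothetical worksheet mirroring the blanks of the .r response."""
    process_pointer: str = ""
    process_name: str = ""
    address: str = ""
    stack_pointer: str = ""
    stack_limit: str = ""
    scheduling_priority: str = ""
    wait_state: str = ""
    # The two hex dump lines shown beneath the wait state.
    dump_lines: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of the blanks that have not been filled in yet."""
        return [name for name, value in vars(self).items() if not value]
```

Calling missing_fields() before leaving the debugger is a quick way to confirm that nothing was skipped.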
There are times when it can also be beneficial to scroll through the active NLM screens to check for any messages or alerts which may have been displayed immediately prior to the ABEND. To view the active NLM screens, enter the following command at the debugger prompt:
# .v
Pressing "CR" will scroll to the next NLM screen until
returning to the debugger screen.
NetWare stores status information about its operational mode in an I/O Port. To retrieve the NetWare status byte from this port, enter the following command at the debugger prompt:
# i61
Port(61) = __
NetWare should respond with a hex value which can be interpreted as follows:
i61 return value    Possible conflict
----------------    -----------------
80h or greater      Bus, CPU, or RAM related
40h - 7Fh           Interface or adapter card
3Fh or less         Software related
(Record the returned value.)
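The table can also be expressed as a short lookup. The Python sketch below is an illustration only; the function name is hypothetical, and treating the boundary values 80h and 3Fh as belonging to the top and bottom ranges respectively is an assumption.

```python
def interpret_port_61(value: int) -> str:
    """Map the byte returned by the i61 command to the table's categories."""
    if not 0 <= value <= 0xFF:
        raise ValueError("port 61 returns a single byte (00h-FFh)")
    if value >= 0x80:  # assumption: 80h itself counts as the top range
        return "Bus, CPU, or RAM related"
    if value >= 0x40:
        return "Interface or adapter card"
    return "Software related"  # assumption: 3Fh itself counts as the bottom range


# The debugger prints the value in hex, so convert the recorded text first.
print(interpret_port_61(int("30", 16)))  # prints "Software related"
```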
At this point, more in-depth information about the code which was executing can be retrieved as follows:
# ?eip
Address in ________.___ at code start +________h
Current: ________ ________h
This debugger command queries the EIP register to identify the offset
in memory where the code execution was suspended. Some ABENDs may
reset the EIP register to zero. In such cases, record the EIP value
but not the Server response to the query.
If the offset in memory pointed to by the EIP register references a
code offset within SERVER.NLM, CLIB.NLM, DSAPI.NLM, or one of the
other Novell provided support modules for NetWare (i.e., not one of
the disk/LAN drivers or other third party NLMs), the source is most
likely another module making an Application Programming Interface
(API) request. To identify which module made the request, it is
necessary to examine the stack in an attempt to recreate the steps
which led to the request.
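The decision described above can be sketched in Python for anyone post-processing recorded responses. This is an illustration under stated assumptions: the response layout is taken from the worksheet line "Address in ________.___ at code start +________h", the function names are hypothetical, and the module set lists only the three names given in the text even though other Novell-provided modules exist.

```python
import re
from typing import Optional, Tuple

# Modules the text names as Novell-provided; deliberately incomplete.
NOVELL_MODULES = {"SERVER.NLM", "CLIB.NLM", "DSAPI.NLM"}

# Assumed response layout, copied from the worksheet line above.
EIP_RESPONSE = re.compile(r"Address in (\S+) at code start \+([0-9A-Fa-f]+)h")


def parse_eip_response(line: str) -> Optional[Tuple[str, int]]:
    """Return (module name, code offset), or None if the line does not match."""
    match = EIP_RESPONSE.search(line)
    if match is None:
        return None
    return match.group(1).upper(), int(match.group(2), 16)


def needs_stack_examination(module: str) -> bool:
    """True when EIP lies in a Novell support module, so the caller must be traced."""
    return module.upper() in NOVELL_MODULES
```

A recorded line such as "Address in CLIB.NLM at code start +0001A2B4h" (a made-up sample) would yield the module CLIB.NLM, for which the stack should be examined.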
Under NetWare v4, examining the stack is very easy since Novell
added a specific debugger command for providing such information.
To examine the stack under NetWare v4, enter the following at the
debugger prompt:
# dds
(If the value cannot be resolved to a valid memory offset.)
xxxxxxxx yyyyyyyy ( "NLM" |(Code Start)+xxxxxxxx)
xxxxxxxx yyyyyyyy ( "NLM" |(Data Start)+xxxxxxxx)
The debugger should list the last 16 possible offset values that
may have been pushed onto the stack. The first line which displays
a valid NLM code offset is the most likely source of the API
request. Record the NLM name and code offset.
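If the dds listing has been transcribed to a text file, the first code reference can be picked out mechanically. The Python sketch below assumes the lines follow the template shown above exactly; the real debugger output may differ, and the function name and sample lines are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed line layout, taken from the template above:
#   xxxxxxxx yyyyyyyy ( "NLM" |(Code Start)+xxxxxxxx)
CODE_REFERENCE = re.compile(
    r'^\s*([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+'
    r'\(\s*"?([^"|]+?)"?\s*\|\(Code Start\)\+([0-9A-Fa-f]+)\)'
)


def first_code_reference(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (NLM name, code offset) from the first line that names a code offset."""
    for line in lines:
        match = CODE_REFERENCE.match(line)
        if match:
            return match.group(3).strip(), int(match.group(4), 16)
    return None


# Made-up sample: an unresolved value, a data reference, then a code reference.
sample = [
    '00F1A2C0 00000010',
    '00F1A2C4 00D49000 ( "EXAMPLE.NLM" |(Data Start)+00000040)',
    '00F1A2C8 00D41234 ( "EXAMPLE.NLM" |(Code Start)+00001234)',
]
print(first_code_reference(sample))
```

Lines that resolve to a data offset, or that do not resolve at all, are skipped because only a code offset identifies a calling routine.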
Under NetWare v3, examining the stack is more difficult since there is not a specific debugger command for providing such information. To examine the stack under NetWare v3, it will be necessary to repeat the same command multiple times using different stack offsets. The first such command is as follows:
# ?[desp+00]
Address in ________.___ at code start +________h
Current: ________ ________h
This debugger command queries specific offsets on the stack in an
attempt to identify those which point to a valid NLM code offset.
It may be necessary to repeat this command numerous times, adjusting
the stack offset by a value of 4 (in hex - i.e., 00, 04, 08, 0C,
10, 14, 18, 1C, 20, 24, 28, 2C, 30, etc.) until a valid offset can
be located.
After each entry, the Debugger will reply with a memory location.
Some of the entries may result in an 'Unknown' memory location due
to the fact that they are really values passed as parameters rather
than pointers. What is being sought is a code offset within an NLM.
In most cases, it should be the first reference which is not
SERVER.NLM or CLIB.NLM.
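To avoid mistakes in the hex arithmetic, the sequence of commands can be generated ahead of time. The following Python sketch is illustrative only; both function names are hypothetical, and replies recorded as 'Unknown' are assumed to be stored as None.

```python
from typing import Iterable, Iterator, Optional


def stack_probe_commands(count: int, step: int = 4) -> Iterator[str]:
    """Yield ?[desp+XX] commands for offsets 00, 04, 08, 0C, ..."""
    for index in range(count):
        yield f"?[desp+{index * step:02X}]"


def first_outside_reference(modules: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first recorded module that is not SERVER.NLM or CLIB.NLM."""
    for name in modules:
        if name and name.upper() not in ("SERVER.NLM", "CLIB.NLM"):
            return name
    return None


for command in stack_probe_commands(4):
    print(command)
```

The loop prints the first four commands to type, ending with ?[desp+0C].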
If the running process at the time of the ABEND can be confirmed to
be a Server xx Process or a non-kernel process (i.e., not a core
NetWare function) and the EIP has not been reset to zero, it may be
possible to restart the Server ON A SHORT-TERM, TEMPORARY BASIS.
The objective is to give Users enough time to close files and log
out (not to complete work or tasks in process) so that the Server
can be properly downed in order to minimize the potential for data
or file system corruption.
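The two conditions above can be written down as a simple check. The Python sketch below is an illustration, not a NetWare facility: the function name is hypothetical, the operator must still determine whether the process is a kernel process, and the assumption that the name reported by .r looks like "Server 01 Process" is made only for the pattern match.

```python
import re

# Assumption: the name reported by .r looks like "Server 01 Process".
SERVER_PROCESS = re.compile(r"^Server \S+ Process$")


def may_attempt_temporary_restart(process_name: str,
                                  is_kernel_process: bool,
                                  eip: int) -> bool:
    """Apply the two conditions from the text; the operator supplies the facts."""
    eligible_process = (
        bool(SERVER_PROCESS.match(process_name.strip())) or not is_kernel_process
    )
    # An EIP that has been reset to zero rules out a restart attempt.
    return eligible_process and eip != 0
```

A True result only means that an attempt is reasonable; it does not guarantee that the Server will resume.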
To attempt to restart the Server, enter the following debugger commands:
# eip=CSleepUntilInterrupt
The debugger should respond that the register has been changed.
Note that the command is case-sensitive.
# g
The first debugger command changes the code execution pointer to
an internal NetWare routine which will put the current process to
sleep until awakened by an interrupt and the second resumes
NetWare's execution. In most cases, the offending thread should
not wake up (which also means it will not complete the task it was
attempting).
The second debugger command will attempt to restart the Server.
If the Server Console screen appears without a new ABEND message
appearing (the previous ABEND message is not erased or removed so
do not be disconcerted by its appearance), chances are that the
Server is running. If you can type on the Console keyboard,
BROADCAST a message that the "Server will be coming down
shortly so LOG OUT NOW!" Wait a reasonable period and then
DOWN the Server, EXIT to DOS, and power off the Server (powering
off is advised to clear any hardware conflicts which may exist).
If the Server is running but the Server Console does not respond
to the keyboard, you probably put the Console Command Process
(or some other process linked into it) to sleep. In such cases,
you can down the Server gracefully via FCONSOLE (which is why you
should keep a copy around even if you are running NetWare v4) or
via other third-party utilities which provide such capability
from a workstation.
If the Server cannot be restarted, you can exit the debugger and return to DOS by entering the following debugger command:
# q
Confirm exit back to DOS (y/n): n
(Enter 'y'. If REMOVE DOS has not been issued, the Server will
return to DOS and can be rebooted. Otherwise, the Server should
perform a warm reboot.)
While this information may seem a bit overwhelming or even cryptic
to someone who does not develop NLMs on a regular basis, it
provides a wealth of critical details which will help the developer
further isolate and correct the anomaly. In those cases where there
might have been question as to the source of the ABEND, this
information will help properly identify the source and eliminate
the potential for finger-pointing or blame shifting.
Qualified developers will welcome in-depth information which is so
rarely available!
While hardware and software failures are the most common sources
of ABENDs, the following scenarios have also been reported to
cause ABENDs: