Abnormal End (ABEND) and |
NetWare is designed to protect itself from most cases of kernel
corruption. It accomplishes this level of protection by trapping most
hardware related errors, monitoring its internal allocation processes,
and verifying the parameters passed to critical internal functions. If
NetWare detects a situation which could present a risk to the kernel,
it will ABnormally END (ABEND) the currently running process and display
a screen similar to the following:
System halted [date and time]
Abend: [reason (Error code #)]
Due to its non-preemptive task switching design, NetWare can often
isolate and identify the specific source of the error. However, it will
occasionally mis-identify the source due to the pseudo preemption which
takes place during certain I/O operations. For these situations, and
just to generally confirm any suspicions, it can prove useful to initiate
NetWare's internal debugger and collect additional information.
Pressing the Left Shift, Right Shift, Alt, and Escape keys simultaneously
(or entering '386debug' at an ABEND screen) will activate the Debugger.
Before entering the internal debugger, the exact ABEND error information
should be recorded, including the first line of the Stack hex dump
information since it may prove useful to someone providing NetWare
kernel support. Very rarely will it be necessary "to copy
diagnostic image to diskette" since that image can only be
reviewed by the most knowledgeable of engineers at Novell and is rarely
asked for by Novell's technical support engineers.
(Note: If a diagnostic image is desired, be prepared to insert an
unformatted, high density diskette for each Mbyte of installed RAM.
This process may take some time . . . and patience.)
At this point, it is safe to collect more information. To enter the
internal debugger, a relatively simple, albeit convoluted, combination
of keys must be pressed on the Server keyboard. By pressing down and
holding both the left and right shift keys, plus the left Alt key and
the Escape key, the Server should enter the internal debugger. The user
should see the ABEND error string, a set of registers (EAX, EBX, etc.),
and a '#' prompt.
To narrow the cause of the problem further, enter 'i61'. NetWare should respond with a hex value which can be interpreted as follows:
i61 return value    Possible conflict
----------------    -----------------
80h or greater      Bus, CPU, or RAM related
40h - 7Fh           Interface or adapter card
3Fh or less         Software related
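The table above can be expressed as a small lookup. The following is a
minimal Python sketch, not part of NetWare; the function name
classify_i61 is hypothetical, and it assumes the boundary values 80h and
3Fh belong to the adjacent ranges so that every byte value is covered.

```python
def classify_i61(value: int) -> str:
    """Map the byte returned by the debugger's 'i61' command to a likely cause."""
    if not 0 <= value <= 0xFF:
        raise ValueError("i61 returns a single byte (00h-FFh)")
    if value >= 0x80:          # assumption: 80h itself counts as hardware
        return "Bus, CPU, or RAM related"
    if value >= 0x40:          # 40h - 7Fh
        return "Interface or adapter card"
    return "Software related"  # assumption: 3Fh itself counts as software


if __name__ == "__main__":
    for sample in (0x10, 0x40, 0x7F, 0x80):  # illustrative values only
        print(f"{sample:02X}h -> {classify_i61(sample)}")
```

The result is only a first hint; the remaining debugger commands should
still be used to confirm it.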
To identify the location of the aborted instruction, enter '?eip'.
NetWare should reply with the name of the NLM where the instruction is
located and its code offset.
To confirm the currently executing process and to list those processes
in the queue awaiting processing time, enter '.p'. This information may
be useful for the technical personnel in charge of supporting the
offending NLM.
To determine additional information about the currently running process,
enter '.r'. This information may be useful for the technical personnel
in charge of supporting the offending NLM.
It is also possible to scroll through the Server screens by entering 'v'
then pressing enter after each screen until returned to the internal
debugger. This can be useful to determine if an error message appeared
on any of the NLM screens which could further help isolate the source of
the ABEND.
At this point, unless the ABEND was related to a hardware failure,
entering 'q' and responding 'y'es should abort the internal debugger.
If DOS has not been removed from the Server, the system will return to
an MS-DOS prompt. If DOS has been removed from the Server, the system
should perform a warm boot.
While there are numerous possible sources for an ABEND, some of the most
common ABENDs are:
General Protection Processor Exception (GPPE)
This type of ABEND can occasionally be recovered from but only if the
user is well skilled in assembler and understands 32-bit flat memory
model programming techniques. In most cases, it is best to collect as
much information as possible and then restart the Server anew.
Upon encountering a GPPE ABEND, enter the Debugger (either by typing
386debug on the keyboard or by pressing the Left Shift, Right Shift,
Alt, and Escape keys simultaneously).
At the Debugger prompt (#), type '? eip' and note the module name and
code offset that is reported. In most cases, that will be the culprit.
However, if it reports CLIB as the offender, a module probably called
a CLIB routine and passed it an invalid pointer. In that case, walking
the stack is one option.
Under NetWare v4, examining the stack is very easy since Novell added a specific debugger command for providing such information. To examine the stack under NetWare v4, enter the following at the debugger prompt:
# dds
(If the value cannot be resolved to a valid memory offset.)
xxxxxxxx yyyyyyyy ( "NLM" |(Code Start)+xxxxxxxx)
xxxxxxxx yyyyyyyy ( "NLM" |(Data Start)+xxxxxxxx)
The debugger should list the last 16 possible offset values that may
have been pushed onto the stack. The first line which displays a
valid NLM code offset is the most likely source of the API request.
Record the NLM name and code offset.
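If the stack listing has been copied down, the search for the first code
offset can be sketched in Python as below. This is an illustration only:
it assumes each resolved line follows the general shape shown above, and
the sample addresses, the module name EXAMPLE.NLM, and the function name
first_code_reference are all hypothetical. Actual debugger output may
differ.

```python
import re
from typing import Optional, Tuple

# Matches lines shaped like:
#   xxxxxxxx yyyyyyyy ( "NLM" |(Code Start)+xxxxxxxx)
LINE = re.compile(
    r'^\s*(?P<addr>[0-9A-Fa-f]{8})\s+(?P<value>[0-9A-Fa-f]{8})\s+'
    r'\(\s*"?(?P<nlm>[^"|]+?)"?\s*\|\s*'
    r'\((?P<kind>Code|Data) Start\)\+(?P<offset>[0-9A-Fa-f]+)\s*\)'
)


def first_code_reference(dump: str, skip=()) -> Optional[Tuple[str, int]]:
    """Return (NLM name, code offset) for the first code reference, or None.

    Module names listed in `skip` are ignored.
    """
    ignored = {name.upper() for name in skip}
    for line in dump.splitlines():
        match = LINE.match(line)
        if match is None or match.group("kind") != "Code":
            continue  # unresolved value or data offset
        nlm = match.group("nlm").strip().upper()
        if nlm not in ignored:
            return nlm, int(match.group("offset"), 16)
    return None


# Hypothetical sample listing, for illustration only.
SAMPLE = """\
F0123450 00000010
F0123454 000B2000 ( "EXAMPLE.NLM" |(Data Start)+00000200)
F0123458 000A1234 ( "EXAMPLE.NLM" |(Code Start)+00001234)
"""

if __name__ == "__main__":
    print(first_code_reference(SAMPLE))
```

Passing skip=("SERVER.NLM", "CLIB.NLM") ignores references inside the
operating system and CLIB, which is useful when looking for the module
that made the call.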
Under NetWare v3, examining the stack is more difficult since there is not a specific debugger command for providing such information. To examine the stack under NetWare v3, it will be necessary to repeat the same command multiple times using different stack offsets. The first such command is as follows:
# ?[desp+00]
This debugger command queries specific offsets on the stack in an
attempt to identify those which point to a valid NLM code offset.
It may be necessary to repeat this command numerous times, adjusting the
stack offset by a value of 4 (in hex - i.e., 00, 04, 08, 0C, 10, 14,
18, 1C, 20, 24, 28, 2C, 30, etc.) until a valid offset can be located.
After each entry, the Debugger will reply with a memory location. Some
of the entries may result in an 'Unknown' memory location due to the
fact that they are really values passed as parameters rather than
pointers. What is being sought is a code offset within an NLM. In most
cases, it should be the first reference which is not SERVER.NLM or
CLIB.NLM.
While this technique is not fool-proof and makes assumptions that cannot
be proved without additional Debugger research, in many cases it will
provide insight into the potential source of the problem.
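The sequence of NetWare v3 queries described above can be written out
ahead of time. The following Python sketch simply generates the command
strings with the hex offsets stepping by 4; the function name
stack_probe_commands and the default of 16 entries are assumptions made
for illustration.

```python
from typing import Iterator


def stack_probe_commands(count: int = 16, step: int = 4) -> Iterator[str]:
    """Yield '?[desp+NN]' commands with the offset NN in two-digit hex."""
    for index in range(count):
        yield f"?[desp+{index * step:02X}]"


if __name__ == "__main__":
    # Print the commands to type, one per line, at the '#' prompt.
    for command in stack_probe_commands():
        print("#", command)
```

Each command is typed at the Debugger prompt in turn, stopping at the
first reply that names a code offset in an NLM other than SERVER.NLM or
CLIB.NLM.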
At this point, you can enter 'q' at the Debugger prompt (#). If DOS has
not been removed, NetWare should ask whether you want to return to DOS;
type 'y' to return to DOS, at which point you can restart the Server.
If DOS has been removed, the Server will reboot.
Invalid OpCode Processor Exception (IOPE)
Upon encountering an IOPE ABEND, enter the Debugger (either by typing
386debug on the keyboard or by pressing the Left Shift, Right Shift,
Alt, and Escape keys simultaneously).
At the Debugger prompt (#), type '? eip' and note the module name and
code offset that is reported. In most cases, that will be the culprit.
If the return indicates a data offset or some other invalid code
location, chances are that the stack has become corrupted for the
active process and it attempted to return execution to an invalid
address after a function call.
Non-Maskable Interrupts (NMI)
The ABEND "NMI parity error generated by System Board"
indicates that the problem is hardware and related to the System Board
(Bus, memory, or CPU SRAM cache). These ABENDs are often the result
of power fluctuations or the failure of a RAM module. A mismatch in
the CPU Cache SRAM can also cause NMI ABENDs. Disabling the CPU's
external Cache (and then the internal Cache if the ABEND recurs) can
often eliminate this ABEND.
The ABEND "NMI parity error generated by IO check" is most often
memory related. However, there are rare cases where software can
trigger this ABEND. Unfortunately, the catastrophic nature of the
error makes diagnosing such software problems difficult, at best.
While an NMI error may occur for the first time shortly after a new
NLM has been loaded, it is often the result of previously unused
memory being exercised, rather than being a problem with the new NLM.
If this error occurs more than once, the RAM should be removed from
the Server and checked with a hardware based RAM testing device, or
replaced. Testing the Server RAM with software utilities will rarely
isolate the problem, because NetWare typically manipulates the Server
RAM far more thoroughly than such utilities.
With newer, high performance CPUs, heat can also become a factor.
Check that there is adequate cooling in the Server for all components,
especially the CPU and memory.
Page Fault Protection Exceptions (PFPE)
NetWare v4 has memory protection which prevents access outside of allocated or common regions. To minimize the possibility of ABENDs in such situations, enable the following SET Parameters:
SET ALLOW INVALID POINTERS = ON
SET READ FAULT EMULATION = ON
SET READ FAULT NOTIFICATION = ON
SET WRITE FAULT EMULATION = ON
SET WRITE FAULT NOTIFICATION = ON
The emulation parameters will allow the operation to succeed while
the notification parameters will report the occurrence via Console
Alerts. Note the Console Alerts (which will also display the
Module name and the code offset) then report them to the developer.
Another event which can cause the Server to cease normal operation is if the code execution path changes to a memory location that NetWare recognizes as being invalid. This can occur if the stack becomes corrupted causing an invalid return point; or if a function pointer is not properly defined and references an invalid code location. Such events are usually manifested by the following message appearing on the System Console:
Breakpoint at 00000001 because of INT 3 breakpoint
followed by an entry into the Debugger.
Typically, a "Breakpoint at 00000001" debugger message is
the result of an NLM trying to change execution to a NULL pointer.
NetWare writes the value 0xCC (the Intel opcode for INT 3, the
Breakpoint Interrupt) in the first four bytes of memory starting at
location zero. Memory location zero is the beginning of the
Interrupt Vector Table and, as such, is an invalid location for code
execution. Thus, if the stack or a function pointer becomes
corrupted and the NLM tries to change execution to a NULL pointer,
the Server enters the internal debugger to minimize potential damage
caused by invalid code execution.
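The reported address of 00000001, rather than 00000000, follows from
INT 3 being a one-byte instruction: once the byte at location zero has
executed, the saved instruction pointer refers to the next byte. The
Python sketch below is a toy model of that behaviour only, not NetWare
code; the function name simulate_null_call is hypothetical.

```python
INT3 = 0xCC  # one-byte Intel opcode for INT 3 (breakpoint)


def simulate_null_call(memory: bytes, target: int = 0) -> int:
    """Return the instruction pointer reported after jumping to `target`.

    Toy model: only the one-byte INT 3 opcode is understood.
    """
    opcode = memory[target]
    if opcode != INT3:
        raise ValueError(f"unmodelled opcode {opcode:#04x} at {target:#010x}")
    # INT 3 is a trap, so the saved instruction pointer is the address
    # of the byte that follows the opcode.
    return target + 1


if __name__ == "__main__":
    low_memory = bytes([INT3] * 4)  # first four bytes filled with 0xCC
    reported = simulate_null_call(low_memory)
    print(f"Breakpoint at {reported:08X} because of INT 3 breakpoint")
```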
At the Debugger prompt (#), type '? [desp]' and note the module name
and code offset that is reported. In most cases, that will be the
culprit.
Novell has also produced the following documents on Debugging issues:
Apr. '99 Novell AppNotes (PN# 464-000056-004)
Oct. '97 Novell AppNotes (PN# 464-000052-010)
Mar. '97 Novell AppNotes (PN# 464-000052-003)
Jun. '95 Novell AppNotes (PN# 164-000047-006)
Aug. '91 Novell AppNotes (PN# 164-000030-008)
Any of these can be ordered by calling 800/377-4136 (303/297-2725).