Diagnosing an ABEND




Upon encountering an ABEND, an administrator's first instinct is to restart the Server. However, spending a few minutes researching and recording information related to the ABEND can help to prevent its recurrence and further stabilize the Server.

Due to its non-preemptive task switching design, NetWare can often isolate and identify the specific source of the error. However, it will occasionally mis-identify the source due to the pseudo preemption which takes place during certain I/O operations. For these situations, and just to generally confirm any suspicions, it can prove useful to initiate NetWare's internal debugger and collect additional information.

The System Console will display an error message about the event which NetWare reports to have caused the ABEND. Before entering the internal debugger, the exact ABEND error information should be recorded, including the first line of the Stack hex dump information since it may prove useful to someone providing NetWare kernel support. Very rarely will it be necessary "to copy diagnostic image to diskette" since that image can only be reviewed by the most knowledgeable of engineers at Novell and is rarely asked for by Novell's technical support engineers.

(Note: If a diagnostic image is desired, be prepared to insert an unformatted, high density diskette for each Mbyte of installed RAM. This process may take some time . . . and patience.)

At this point, it is usually possible to enter the internal debugger by either entering 386debug at the System Console or by pressing and holding the Left Alt, both Shift, and the Escape keys. Should the Server fail to enter the internal debugger, try pressing the Caps Lock key. If the Caps Lock light fails to toggle (on/off), it generally indicates a hard conflict which has the Server completely inoperable. In this case the only option is to power down the Server.

Once in the internal debugger, it is possible to acquire information about the ABEND which can help technical support personnel ascertain the exact cause of the problem. To retrieve this information, enter each command shown after the # prompt (i.e., the debugger prompt). The Server will respond with data in the form shown beneath the command. Record the Server response in the space provided and follow the directions outlined within the parentheses.

Once the information has been collected, forwarding it to the appropriate technical support personnel can help them identify and isolate the potential cause of the ABEND.

To retrieve more information about the ABEND, enter the following command at the debugger prompt:

              # .a
            

Since the reply varies, be sure to record all of the information returned.

Once the details about the ABEND cause have been recorded, it is also important to retrieve information about the Running Process. To do so, enter the following command at the debugger prompt:

              # .r
    
              Running process pointer: ________
              Process name: __________________________________
              Address:________
              Stack pointer: ________
              Stack limit: ________
              Scheduling priority: _
              Wait state: __
              ________ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ________________
              ________ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ________________
            

(Record the returned information in the blanks provided.)
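
For administrators who prefer to keep these values in a file rather than on paper, the .r worksheet above could be captured in a small record such as the following. This is only a sketch in Python: the class name, field names, and report layout are illustrative assumptions and are not part of NetWare or its debugger.

```python
from dataclasses import dataclass, fields


@dataclass
class RunningProcessRecord:
    """Values copied by hand from the debugger's .r response (hypothetical layout)."""
    running_process_pointer: str
    process_name: str
    address: str
    stack_pointer: str
    stack_limit: str
    scheduling_priority: str
    wait_state: str

    def as_report(self) -> str:
        """Return one 'Label: value' line per field, in the order .r lists them."""
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"{label}: {getattr(self, f.name)}")
        return "\n".join(lines)


# Placeholder values for illustration only; real values come from the debugger.
record = RunningProcessRecord(
    running_process_pointer="00000000",
    process_name="EXAMPLE PROCESS",
    address="00000000",
    stack_pointer="00000000",
    stack_limit="00000000",
    scheduling_priority="0",
    wait_state="00",
)
print(record.as_report())
```

Keeping the values as strings avoids any reinterpretation of the hex digits; the goal is simply to pass on exactly what the debugger displayed.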

There are times when it can also be beneficial to scroll through the active NLM screens to check for any messages or alerts which may have been displayed immediately prior to the ABEND. To view the active NLM screens, enter the following command at the debugger prompt:

              # .v
            

Pressing "CR" will scroll to the next NLM screen until returning to the debugger screen.

NetWare stores status information about its operational mode in an I/O Port. To retrieve the NetWare status byte from this port, enter the following command at the debugger prompt:

              # i61
              Port(61) = __
            

NetWare should respond with a hex value which can be interpreted as follows:

              i61 return value         Possible conflict
              ----------------         -----------------
              80h or greater           Bus, CPU, or RAM related
              40h - 7Fh                Interface or adapter card
              3Fh or less              Software related
            

(Record the returned value.)
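
The lookup in the table above can be expressed as a short function, which may be handy when logging many ABENDs. This is a minimal sketch in Python: the function name is hypothetical, and it assumes that 80h itself belongs to the top range and 3Fh to the bottom range.

```python
def classify_port61(value: int) -> str:
    """Map the byte returned by the debugger's i61 command to a conflict category."""
    if not 0 <= value <= 0xFF:
        raise ValueError("port 61 returns a single byte (00h - FFh)")
    if value >= 0x80:   # assumption: 80h is included in the top range
        return "Bus, CPU, or RAM related"
    if value >= 0x40:   # 40h - 7Fh
        return "Interface or adapter card"
    return "Software related"  # assumption: 3Fh is included in the bottom range


# The debugger shows the value in hex, so parse the recorded digits as base 16.
recorded = "A0"  # example value only
print(classify_port61(int(recorded, 16)))
```

The result is only the "possible conflict" hint from the table; it narrows where to look and does not replace the other debugger output.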

At this point, more in-depth information into the actual code which is executing can be retrieved as follows:

              # ?eip
              Address in ________.___ at code start +________h
              Current: ________ ________h
            

This debugger command queries the EIP register to identify the offset in memory where the code execution was suspended. Some ABENDs may reset the EIP register to zero. In such cases, record the EIP value but not the Server response to the query.

If the offset in memory pointed to by the EIP register references a code offset within SERVER.NLM, CLIB.NLM, DSAPI.NLM, or one of the other Novell provided support modules for NetWare (i.e., not one of the disk/LAN drivers or other third party NLMs), the source is most likely another module making an Application Programming Interface (API) request. To identify which module made the request, it is necessary to examine the stack in an attempt to recreate the steps which led to the request.

Under NetWare v4, examining the stack is very easy since Novell added a specific debugger command for providing such information. To examine the stack under NetWare v4, enter the following at the debugger prompt:

            # dds
            
      xxxxxxxx --yyyyyyyy ?
      (If the value cannot be resolved to a valid memory offset.)

      xxxxxxxx yyyyyyyy ( "NLM" |(Code Start)+xxxxxxxx)
      (If the value resolves to a valid code offset within an NLM.)

      xxxxxxxx yyyyyyyy ( "NLM" |(Data Start)+xxxxxxxx)
      (If the value resolves to a valid data offset within an NLM.)


The debugger should list the last 16 possible offset values that may have been pushed onto the stack. The first line which displays a valid NLM code offset is the most likely source of the API request. Record the NLM name and code offset.

Under NetWare v3, examining the stack is more difficult since there is not a specific debugger command for providing such information. To examine the stack under NetWare v3, it will be necessary to repeat the same command multiple times using different stack offsets. The first such command is as follows:

              # ?[desp+00]
              Address in ________.___ at code start +________h
              Current: ________ ________h
            

This debugger command queries specific offsets on the stack in an attempt to identify those which point to a valid NLM code offset. It may be necessary to repeat this command numerous times, adjusting the stack offset by a value of 4 (in hex - i.e., 00, 04, 08, 0C, 10, 14, 18, 1C, 20, 24, 28, 2C, 30, etc.) until a valid offset can be located.
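
Because the offsets are in hex, it is easy to slip when stepping by 4 by hand. The following Python sketch prints the sequence of probe commands to type at the debugger prompt. The function name is hypothetical, and the default of 16 probes is an assumption chosen for illustration; continue further if no valid offset has appeared.

```python
def stack_probe_commands(count: int = 16, step: int = 4) -> list[str]:
    """Return the ?[desp+XX] commands for the first `count` stack slots."""
    return [f"?[desp+{offset:02X}]" for offset in range(0, count * step, step)]


# Print the commands one per line, ready to be typed at the # prompt.
for command in stack_probe_commands():
    print(command)
```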

After each entry, the Debugger will reply with a memory location. Some of the entries may result in an 'Unknown' memory location due to the fact that they are really values passed as parameters rather than pointers. What is being sought is a code offset within an NLM. In most cases, it should be the first reference which is not SERVER.NLM or CLIB.NLM.
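
The selection rule just described can be sketched in Python as follows. The function name, the tuple layout, and the module name EXAMPLE.NLM are assumptions made for illustration; the entries are expected to be transcribed by hand from the debugger replies, with None standing in for an 'Unknown' location.

```python
NOVELL_SUPPORT_MODULES = {"SERVER.NLM", "CLIB.NLM"}


def first_outside_reference(entries, skip=NOVELL_SUPPORT_MODULES):
    """Return the first (stack_offset, module, code_offset) entry whose module
    is known and is not one of the modules in `skip`, or None if none qualifies.

    `entries` must be listed in the order the stack offsets were probed.
    """
    for stack_offset, module, code_offset in entries:
        if module is None:
            continue  # 'Unknown' reply: probably a parameter value, not a pointer
        if module.upper() in skip:
            continue  # support module; keep looking for the caller
        return stack_offset, module, code_offset
    return None


# Hypothetical transcription of four probes.
probes = [
    ("00", "CLIB.NLM", "00001234"),
    ("04", None, None),
    ("08", "SERVER.NLM", "0000ABCD"),
    ("0C", "EXAMPLE.NLM", "00000456"),
]
print(first_outside_reference(probes))
```

Other Novell provided support modules can be added to the skip set where appropriate.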

If the running process at the time of the ABEND can be confirmed to be a Server xx Process or a non-kernel process (i.e., not a core NetWare function) and the EIP has not been reset to zero, it may be possible to restart the Server ON A SHORT-TERM, TEMPORARY BASIS. The objective is to give Users enough time to close files and log out (not to complete work or tasks in process) so that the Server can be properly downed in order to minimize the potential for data or file system corruption.
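
The conditions above can be summarized as a simple check. This Python sketch is an illustration only: the function name is hypothetical, the pattern used to recognize a "Server xx Process" name is an assumption about how that name is displayed, and whether a process is a kernel process is left as a judgment supplied by the administrator.

```python
import re

# Assumed display form of the process name, e.g. "Server 03 Process".
SERVER_PROCESS = re.compile(r"^Server \w+ Process$", re.IGNORECASE)


def may_attempt_temporary_restart(process_name: str,
                                  is_kernel_process: bool,
                                  eip: int) -> bool:
    """Return True only if the conditions for a short-term restart are met."""
    if eip == 0:
        return False  # EIP was reset to zero, so a restart should not be attempted
    is_server_process = bool(SERVER_PROCESS.match(process_name.strip()))
    return is_server_process or not is_kernel_process


print(may_attempt_temporary_restart("Server 03 Process", True, 0x00123456))
```

A True result means only that an attempt is reasonable; it does not mean the restart will succeed.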

To attempt to restart the Server, enter the following debugger commands:

              # eip=CSleepUntilInterrupt
            

The debugger should respond that the register has been changed.

Note that the command is case-sensitive.

              # g
            

The first debugger command changes the code execution pointer to an internal NetWare routine which will put the current process to sleep until awakened by an interrupt, and the second resumes NetWare's execution. In most cases, the offending thread should not wake up (which also means it will not complete the task it was attempting).

The second debugger command will attempt to restart the Server.

If the Server Console screen appears without a new ABEND message appearing (the previous ABEND message is not erased or removed so do not be disconcerted by its appearance), chances are that the Server is running. If you can type on the Console keyboard, BROADCAST a message that the "Server will be coming down shortly so LOG OUT NOW!" Wait a reasonable period and then DOWN the Server, EXIT to DOS, and power off the Server (powering off is advised to clear any hardware conflicts which may exist).

If the Server is running but the Server Console does not respond to the keyboard, you probably put the Console Command Process (or some other process linked into it) to sleep. In such cases, you can down the Server gracefully via FCONSOLE (which is why you should keep a copy around even if you are running NetWare v4) or via other third-party utilities which provide such capability from a workstation.

If the Server cannot be restarted, you can exit the debugger and return to DOS by entering the following debugger command:

              # q
              Confirm exit back to DOS (y/n): n
            

(Enter 'y'. If REMOVE DOS has not been issued, the Server will return to DOS and can be rebooted. Otherwise, the Server should perform a warm reboot.)

While this information may seem a bit overwhelming or even cryptic to someone who does not develop NLMs on a regular basis, it provides a wealth of critical details which will help the developer further isolate and correct the anomaly. In those cases where there might have been question as to the source of the ABEND, this information will help properly identify the source and eliminate the potential for finger-pointing or blame shifting.

And qualified developers will welcome the in-depth information so rarely available!



Specific ABENDs: Their cause and solution

While hardware and software failures are the most common sources of ABENDs, the following scenarios have also been reported to cause ABENDs:

Scenario:
GPPE ABEND in Backup process on an irregular basis.
Efforts:
Applied all patches, updated all drivers, and tested the RAM without eliminating the occurrence of the ABENDs.
Solution(s):
1. Ran PURGE prior to Backup.
2. Enabled the following NetWare SET Parameter: Immediate Purge Of Deleted Files = ON
3. As an alternate solution, excluding the DELETED.SAV directory from the Backup process might also have worked.

The applications run on the Server created a high volume of temporary files which were subsequently deleted by the application. As a result, there existed a large quantity of deleted files which did not yet meet the criteria for automatic purging by NetWare. These deleted but unpurged files created additional overhead for the Backup process, both because it was not set up to exclude the DELETED.SAV directory and because it created some temporary files as a part of its operation which forced NetWare to purge some files prior to their reaching the time in which they would be automatically purged.


Scenario:
GPPE ABEND in PSERVER process during heavy printing activity.
Efforts:
Applied all patches, updated all drivers, and tested the RAM without eliminating the occurrence of the ABENDs.
Solution(s): Deleted and recreated all Print Queues.
PSERVER appears to be unable to recover from a corrupted Print Queue.

Scenario:
GPPE ABEND in Server xx Process during the transfer of a large file from the workstation to the Server
Efforts:
Applied all patches, updated all drivers, and tested the RAM without eliminating the occurrence of the ABENDs.
Solution(s): Reconfigured LAN I/O adapter not to use IRQ 15.
NetWare is known to use IRQ 15 internally. Any Server adapter configured to use IRQ 15 has the potential to conflict with NetWare.

Scenario:
NMI ABEND in various processes during heavy loads
Efforts:
Applied all patches, updated all drivers, and tested the RAM (with software RAM testing tools) without eliminating the occurrence of the ABENDs.
Solution(s): Tested the RAM with a hardware testing product which could better simulate load and create greater heat stress on the RAM, then replaced the bad SIMMs.
A well exercised version of NetWare is one of the best software RAM testing tools. It will often detect NMI errors which are not found by most software based RAM testers.

Scenario:
NMI ABEND in various processes during heavy loads
Efforts:
Applied all patches, updated all drivers, and tested the RAM (with software RAM testing tools) without eliminating the occurrence of the ABENDs.
Solution(s): Reconfigured the Server disabling the External Cache for the CPU.
NetWare's code efficiency optimization can be compromised by a mismatch in SRAM speeds between the Internal and External Cache.

Scenario:
Periodic PFPE (Page Fault) ABENDs occur under NetWare v4
Solution(s): Enable the following NetWare SET Parameters:
    Allow Invalid Pointers = ON
    Read Fault Emulation = ON
    Read Fault Notification = ON
    Write Fault Emulation = ON
    Write Fault Notification = ON

Enabling the Allow Invalid Pointers and Read/Write Emulation SET Parameters will allow NetWare v4 to operate similarly to NetWare v3, allowing NLMs to access almost any location in the flat memory model. Enabling the Read/Write Notification SET Parameters will cause NetWare to generate a Server Console Alert message and SYS$LOG.ERR entry every time that memory is accessed outside of the NLM's allocated blocks.

The Server Console Alerts enabled through the Notification SET Parameters should identify the offending NLM, the type of the invalid memory access which occurred, and the memory location for the attempted access. Record this information and then contact the NLM developer. It is their responsibility to properly design their code so as to avoid such system integrity violations.

These SET Parameters should not be enabled except when trying to diagnose and isolate errors. Though they can help toward ensuring continued Server operation, they are not designed to correct the design problem within the NLM, nor are they intended for long-term use.
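
Since these settings are meant to be temporary, it can help to prepare both the commands that enable them and the commands that restore the earlier values before starting. The Python sketch below assumes the console accepts the form SET <parameter> = <value>. The function name is hypothetical, and the "previous" values are placeholders rather than NetWare defaults; note the Server's actual values before changing anything.

```python
DIAGNOSTIC_PARAMETERS = [
    "Allow Invalid Pointers",
    "Read Fault Emulation",
    "Read Fault Notification",
    "Write Fault Emulation",
    "Write Fault Notification",
]


def set_commands(values: dict[str, str]) -> list[str]:
    """Build one console line per parameter, assuming 'SET name = value' syntax."""
    return [f"SET {name} = {value}" for name, value in values.items()]


# Commands to enable the diagnostic settings listed above.
enable = set_commands({name: "ON" for name in DIAGNOSTIC_PARAMETERS})

# Placeholder only: replace with the values recorded from the Server beforehand.
previous = {name: "OFF" for name in DIAGNOSTIC_PARAMETERS}
restore = set_commands(previous)

print("\n".join(enable))
print("\n".join(restore))
```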



This document is copyright © 1999 by avanti technology, inc.
