Changes To Generic NQS v3.50.0
June 1996
Stuart Herbert (S.Herbert@sheffield.ac.uk)
Document copyright ©. All rights reserved.
Abstract
This document contains a summary of the changes made to produce
Generic NQS v3.50.0.
Introduction
This is a summary of changes to Generic NQS 3.50.0.
We are most grateful for the contributions made by other individuals
and organisations.
About This Release
Purpose
Generic NQS 3.50.0 features some major internal changes, which are
aimed at further addressing problems in the following key areas.
- Portability
- Robustness
- Ease of installation
Generic NQS 3.50.0 is the `reference' release for the purposes of
our funded work; the `Official Manual Set' will be based upon this
release.
We recommend that all sites running Monsanto NQS, or Generic NQS
v3.4x, should upgrade to this version. There will be no support for
any previous versions of Monsanto-NQS or Generic NQS available from
us.
Compatibility
This release features a number of changes which affect upgrading
existing installations of Generic NQS. If you are installing
Generic NQS for the first time, you can skip this section.
- All message logging (and debugging output) now goes via the
syslog mechanism. Generic NQS currently uses `local0' as the
facility it logs to (this is configurable at compile-time).
The old NQS logdaemon (and its logfile) are no longer used or
supported.
System administrators will have to ensure that their syslog
daemon does not discard messages which come from the `local0'
facility.
- Pipe queues and queue complexes have been modified so that a pipe
queue can be a member of a queue complex. Sites which use either
pipe queues or queue complexes will have to remove their existing
NQS installations, and re-install from scratch.
This is an unfortunate side-effect of the way NQS stores its own
information. Hopefully, this will be the last file format change
for a very long time.
- Support for resource limits in terms of words, kilo-words,
mega-words, and giga-words, has been removed.
The size of a `word' varies so much from host to host that a
meaningful comparison is not possible.
- The number of inodes used to hold transaction state information
has been doubled.
The side effect of this change is that any requests which have
been queued will be corrupted when you install this version. Our
advice is to use pipe queues to syphon queued requests to another
NQS node, shut down NQS, upgrade to this release, and then use pipe
queues to move the requests back onto this machine (courtesy of
Thomas Eifert).
My apologies about these incompatibilities; they are the result of
important changes to Generic NQS, and could not be avoided.
Project Conclusion
With the release of Generic NQS 3.50.0, the work under JISC grant
NTI/48.2 has come to an end. This does not mean that this will be
the last release of Generic NQS.
The University of Sheffield has agreed to continue to host the
Generic NQS World-Wide Web site, and FTP site, at least until the end
of June, 1997. In addition, they will continue to maintain the
existing Mailbase mailing lists.
Stuart Herbert is leaving the University for a job in industry.
However, thanks to the loan of equipment from the University, he will
continue to act as world-wide maintainer of Generic NQS and its
related technologies. This will be done in his spare time, and it is
plainly obvious that he won't have anywhere near as much time to
spend on Generic NQS as he did when he was employed full-time to
support it, so please bear with us for a month or two until things
settle into a new routine.
Administrators: New Features
Installation Now Done By SETUP
Installation of Generic NQS is now a matter of running the new
`SETUP' script. This shell script will guide the administrator
through configuring, compiling, and installing the Generic NQS
software; the administrator should not need to edit any Makefiles
any more.
SETUP includes a number of automatic tests, to determine a number of
compile-time constants which previously were supplied by Makefiles.
One of these tests is to actually determine what type of computer
you are trying to install Generic NQS on.
This software is undergoing testing.
All platforms are affected.
This code was contributed by Stu.
Cluster-Wide Dynamic Scheduling Added
Generic NQS v3.50.0 can now perform dynamic scheduling on the pipe
queues which are used to perform load-balancing across a cluster.
To use dynamic scheduling, add all of the pipe queues on your
scheduling host to a queue complex, and set the user_limit for that
queue complex to something suitable. Users who submit more jobs
than the complex's user_limit will find their excess jobs deferred
in favour of jobs from other users, as is the case with dynamic
scheduling across batch queues.
This code has not been tested.
All platforms are affected.
This code was contributed by Stu, and is based on Dave Safford's
dynamic scheduling for batch queues.
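The effect of a user_limit on a queue complex can be pictured with a small sketch. This is not the Generic NQS scheduler; the function name `order_requests` and the (user, request id) data shape are hypothetical, and the sketch only assumes that a user's requests beyond the limit wait behind everybody else's.

```python
from collections import defaultdict

def order_requests(requests, user_limit):
    """Return the requests with each user's excess jobs deferred.

    requests   -- list of (user, request_id) tuples in submission order
    user_limit -- stand-in for the queue complex's user_limit
    """
    seen = defaultdict(int)      # requests counted so far, per user
    within, deferred = [], []
    for user, request_id in requests:
        seen[user] += 1
        if seen[user] <= user_limit:
            within.append((user, request_id))
        else:
            # Over the limit: this request waits behind other users' jobs.
            deferred.append((user, request_id))
    return within + deferred

queue = [("alice", 1), ("alice", 2), ("alice", 3), ("bob", 4)]
print(order_requests(queue, user_limit=2))
```

With a limit of 2, the third request from `alice` is moved behind the request from `bob`, while the order of everything else is preserved.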
Processor Set Support For Digital UNIX Added
If this feature is enabled, Generic NQS will place all processes for
any batch queue `x' into the processor set of the same name. As
Digital UNIX uses integers for processor set names, to make use of
this feature you will have to use numeric names for your batch queues.
These changes have not been tested.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [ ] HPUX 8 [ ] HPUX 9
> [ ] HPUX 10 [ ] IRIX 5
> [ ] IRIX 6 [ ] LINUX
> [ ] NCR [x] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
Code contributed by Stuart Herbert (S.Herbert@sheffield.ac.uk)
Prologue/Epilogue Scripts Feature Added
If this feature is enabled, Generic NQS will attempt to execute
`NQS_LIBEXE/nqs.prologue' directly before running a batch request,
and `NQS_LIBEXE/nqs.epilogue' directly after running a batch
request. Both programs run as root user, and are passed the name of
the current queue as their only argument.
One possible use of this facility is support for restricted
processors on IRIX; the prologue script could ensure that, for a
given queue, a given processor has been restricted (using mpadmin),
and the epilogue script could unrestrict the processor once the
request has run to completion.
This code has not been tested.
All platforms affected.
Code contributed by Stu.
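As an illustration only, a prologue program might look something like the sketch below. It assumes nothing beyond what is stated above: the program receives the queue name as its only argument. The queue names, the `RESTRICTED_CPU` table and the `plan_for_queue` helper are all hypothetical, and the sketch only reports what it would do rather than calling any processor-control tool.

```python
import sys

# Hypothetical mapping from queue name to the processor that queue owns.
RESTRICTED_CPU = {"batch1": 1, "batch2": 2}

def plan_for_queue(queue):
    """Describe the action a prologue could take for the given queue."""
    cpu = RESTRICTED_CPU.get(queue)
    if cpu is None:
        return f"queue {queue}: nothing to do"
    # A real prologue would invoke the site's own tool at this point.
    return f"queue {queue}: would restrict processor {cpu}"

if __name__ == "__main__" and len(sys.argv) == 2:
    # The queue name arrives as the only argument.
    print(plan_for_queue(sys.argv[1]))
```

A matching epilogue would use the same lookup to undo whatever the prologue did. Because both programs run as root, it is worth keeping them short and validating the queue name before acting on it.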
Support For Dynix Added
Generic NQS should now compile, and operate, out of the box on
Dynix/Ptx v4.1.3, running on Sequent 5000 hardware.
This change has been tested.
Support For HP-UX v10 Updated
Changes have been made to support Generic NQS on platforms running
the HP-UX 10 operating system.
These changes have not been tested.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [ ] HPUX 8 [ ] HPUX 9
> [x] HPUX 10 [ ] IRIX 5
> [ ] IRIX 6 [ ] LINUX
> [ ] NCR [ ] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
Code contributed by Holger Busse (busse@chemie.fu-berlin.de)
Administrators: Changes To Existing Features
Debugging/Message Logging Replaced
The old `logdaemon' used for logging messages has been removed.
Generic NQS now logs all messages through the syslog system daemon,
using the `local0' facility.
I recommend that you update your syslog.conf file so that all
messages from `local0' are logged to a single file. I have the
following entry in my syslog.conf file (on Linux) :
> local0.* /var/log/nqs
This code has been tested.
All platforms are affected.
This code was contributed by Stu.
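One way to check that your syslog daemon really does keep `local0' messages is to send a test message to that facility yourself. The sketch below uses Python's standard syslog module; the ident `nqs-example` and the function name `log_via_local0` are made up for this example and are not what Generic NQS itself uses.

```python
import syslog

def log_via_local0(message, priority=syslog.LOG_INFO):
    """Send one message to syslog using the local0 facility."""
    # "nqs-example" is a hypothetical ident chosen for this sketch.
    syslog.openlog(ident="nqs-example", facility=syslog.LOG_LOCAL0)
    syslog.syslog(priority, message)
    syslog.closelog()

log_via_local0("test message for the local0 facility")
```

If the message does not turn up in the file named in your syslog.conf entry, the daemon is probably discarding `local0' traffic, or has not re-read its configuration.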
Debugging Levels Rationalised
While I was removing the logdaemon, I sorted out the messages that
Generic NQS produces at each debugging level.
- Level 0: Fatal errors only.
- Level 1: Level 0 + information messages.
- Level 2: Level 1 + high-priority debugging messages.
- Level 3: Level 2 + medium-priority debugging messages.
- Level 4: Level 3 + low-priority debugging messages.
- Level 5: Level 4 + temporary debugging (trace) messages.
Levels 6-10 are reserved for future use. If you set the debugging
level above `5', the behaviour is `unspecified' (i.e. it'll most
likely crash rather horribly), so please don't do it.
Upon installation, Generic NQS defaults to level 1. You can change
the debugging level using the `qmgr set debug' command.
This code has been tested.
All platforms are affected.
This code was contributed by Stu.
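The levels are cumulative, which the following sketch tries to make concrete. The class names in `CLASSES` and the function `classes_logged` are invented for this illustration; they are not identifiers from the Generic NQS source.

```python
# Message classes, in the order in which levels 0 to 5 introduce them.
CLASSES = ["fatal", "info", "debug-high", "debug-medium", "debug-low", "trace"]

def classes_logged(level):
    """Return the message classes produced at the given debugging level."""
    if not 0 <= level <= 5:
        # Levels 6-10 are reserved, so this sketch refuses them outright.
        raise ValueError("debugging level must be between 0 and 5")
    return CLASSES[: level + 1]

print(classes_logged(1))
```

At the default level of 1 this gives the fatal and information classes only; each higher level adds exactly one more class.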
More Machine ID's Supported
Nmapmgr has been adapted to work with up to 1,000 machine IDs in
the machine ID database.
This code has not been tested.
Code contributed by Stu, thanks to Mark Loveridge.
Pipe Queues Can Now Be Part Of Queue Complexes
Pipe queues can now be members of queue complexes. This change
means that any existing installation of Generic NQS which uses pipe
queues will have to be removed, because of file format
incompatibilities.
This code has not been tested.
All platforms are affected.
This code was contributed by Stu.
Support For Word-Size Limits Removed
Resource limits could previously be specified in terms of words,
kilo-words, mega-words and giga-words. Support for these units has
been removed.
This code has not been tested; requests which contain one of these
units may cause NQS to crash. Further work is anticipated before
this change is complete.
All platforms are affected.
This code was contributed by Stu.
Staging Support For Pre-Releases
Generic NQS now understands the difference between a pre-release,
and a final release, and can upgrade from 3.50.0.1 (v3.50.0,
pre-release #1) to 3.50.0 (full release of v3.50.0).
This code has not been tested.
All platforms are affected.
This code was contributed by Stu.
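The ordering this implies can be sketched as a sort key. The sketch assumes, based only on the single example above, that a fourth component marks a pre-release of the three-component version and that a pre-release sorts before its full release; `version_key` is a hypothetical helper, not the comparison Generic NQS performs.

```python
def version_key(version):
    """Build a sort key in which pre-releases precede the full release."""
    parts = [int(p) for p in version.split(".")]
    release = tuple(parts[:3])
    if len(parts) > 3:
        # Fourth component present: pre-release number of this release.
        return release + (0, parts[3])
    # No fourth component: the full release, which sorts last.
    return release + (1, 0)

versions = ["3.50.0", "3.50.0.1", "3.40.2", "3.50.0.2"]
print(sorted(versions, key=version_key))
```

Sorting with this key places 3.40.2 first, then the two pre-releases in numeric order, then 3.50.0 itself.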
Administrators: Fixes For Problems
Compilation Fix On HP-UX v8
Generic NQS 3.40.2 did not compile on HP-UX v8; compilation failed
in shoqbydesc.c. Fixed.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [x] HPUX 8 [ ] HPUX 9
> [ ] HPUX 10 [ ] IRIX 5
> [ ] IRIX 6 [ ] LINUX
> [ ] NCR [ ] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
This code has been tested.
Contributed by Michael Andrews.
Compilation Problems On HP-UX v9
Generic NQS v3.40.2 did not compile on HP-UX v9; compilation failed
during `netdaemon.c'. This has been fixed.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [x] HPUX 8 [x] HPUX 9
> [ ] HPUX 10 [ ] IRIX 5
> [ ] IRIX 6 [ ] LINUX
> [ ] NCR [ ] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
Code contributed by Stu.
Compilation Problems On IRIX
Generic NQS v3.40.2 did not compile on IRIX 6; compilation failed
during qmgr because of a remaining reference to some SGI-specific
memory debugging code. This has been fixed.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [ ] HPUX 8 [ ] HPUX 9
> [ ] HPUX 10 [x] IRIX 5
> [x] IRIX 6 [ ] LINUX
> [ ] NCR [ ] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
Code contributed by Stu.
Jobs Stuck In The Arriving State
I've had a number of reports of cases where a job has been moved
from one machine to another; the job is deleted from the original
machine, and gets stuck in the arriving state on the remote host.
This has happened on a wide range of platforms.
Code in the machine id database library has been replaced in order
to attempt to resolve this problem.
This code has not been tested.
All platforms are affected.
This code has been contributed by Stu, based on a contribution by
Mark Whidby (M.Whidby@mcc.ac.uk).
Incorrect Resource Limits
All versions of Generic NQS from v3.40 onwards did not correctly
support the limits which an administrator can use to prevent their
users from running completely amok on a host. The main symptom was
that file and memory resource limits were reported correctly by
Generic NQS, but were significantly smaller when actual NQS jobs
were executing.
The implementation of resource limits has been completely re-written
from scratch; the new implementation should cure the problem.
UNICOS users will, however, have to wait for the next pre-release
before their resource limits are correctly supported.
This code has been partially tested.
All platforms are affected.
This code was contributed by Stu; the UNICOS support is based on the
previous support written by Dave Safford.
Device Queues And Queue Complexes
Previous versions of Generic NQS (also Monsanto NQS and CERN NQS,
and presumably anything derived from the original COSMIC NQS source
code ...) did not correctly handle situations where device queues
were members of one or more queue complexes, and either a device
queue or such a queue complex was deleted. The result of deleting
such a device queue, or such a queue complex, was `unspecified';
most likely, the NQS software would crash intermittently.
The implementation of device queues and queue complexes has been
completed, so that all operations on device queues and queue
complexes are safe.
This code has not been tested.
All platforms are affected.
This code has been contributed by Stu.
Memory Leak On HP-UX
Code in libnqs/queues/all-systems/shoqbydesc.c for HP-UX did not
release memory it malloc()ed. This has been fixed.
Platforms affected :
> [ ] AIX 3 [ ] AIX 4
> [ ] DYNIX/PTX [ ] FUJITSU
> [x] HPUX 8 [x] HPUX 9
> [x] HPUX 10 [ ] IRIX 5
> [ ] IRIX 6 [ ] LINUX
> [ ] NCR [ ] OSF/1
> [ ] SOLARIS 2 [ ] SUNOS 4
> [ ] ULTRIX [ ] UNICOS
This code has not been tested.
Code contributed by Stu, based on points raised by Michael Andrews.
Incorrect TMPDIR Support
The code which created the TMPDIR temporary working directory
neglected to set the permissions so that users could actually write
into the directory. Fixed.
All platforms are affected.
Contributed by Stu.
Updated Transaction Handling Code
Previous versions of Generic NQS (and before it Monsanto-NQS and
COSMIC NQS) used inodes in order to store state information in
non-volatile memory. Unfortunately, the original implementation
assumed that a 32-bit value would fit into an inode's mtime field;
this is not the case on a number of platforms.
Our original intention was to remove the use of inodes completely;
however, this approach requires more time than we can afford to spare
at this stage of the project.
Instead, the transaction handling code has been updated to use twice
as many inodes, and to store 16-bit values into each inode. This
allows the existing transaction mechanisms to work unmodified on all
known UNIX platforms, and still leaves open the option of completely
replacing this part of NQS in the future.
This code has been tested.
All platforms are affected.
Contributed by Stu.
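The arithmetic behind storing one 32-bit value as two 16-bit values is shown in the sketch below. Only the splitting and rejoining is illustrated; which half goes into which inode, and how the inodes are written, is not described here and is not modelled. The names `split_32` and `join_16` are hypothetical.

```python
def split_32(value):
    """Split an unsigned 32-bit value into (high, low) 16-bit halves."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    return (value >> 16) & 0xFFFF, value & 0xFFFF

def join_16(high, low):
    """Rebuild the 32-bit value from its two 16-bit halves."""
    return (high << 16) | low

high, low = split_32(0x12345678)
print(hex(high), hex(low), hex(join_16(high, low)))
```

Each half fits comfortably in a field that can hold at least 16 bits, which is why twice as many inodes are needed to carry the same amount of state.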
List Of Changes For Developers
New Source Tree Layout
All of the source code for Generic NQS 3.50.0 can now be found in
the `Source-Tree' sub-directory. I have broken Generic NQS up into
its individual components, one directory per component. The new
source tree layout allows for platform-specific code to be moved
into separate sub-directories, although this has not yet been done.
A full description of the source tree layout, and its management
through the SETUP software, will be included in the Developer's
Manual I'll shortly be working on; in the meantime, if there is
anything you want to know, drop me an email.
Library Re-organisations
I've made some changes to the libraries built as part of Generic NQS :
There is plenty more to do with the libraries. For example, the
prototypes for libnqs and all of the binaries are stored in libnqs;
they should be broken up so that the prototypes are near the
functions they relate to (see libsal for what I'd like to achieve).
Other Credits
In addition to the changes listed above, the following people
contributed fixes for problems introduced during the pre-release
testing programme for Generic NQS 3.50.0.
- Michael Andrews
- Ulrich Bernhard
- Mark Loveridge
(I'm sure this list should be longer - Stu. My apologies to anyone
I've forgotten to mention.)