Progress Report (November 1994)
Academic Computing Services, University of Sheffield
Stuart Herbert (S.Herbert@Sheffield.ac.uk)
Document copyright ©. All rights reserved.
Abstract
As part of the JISC New Technologies Initiative, the University of
Sheffield is funded for one year to supply and support batch
processing systems to the UK Higher Educational Community.
Introduction
About This Progress Report
This report documents the progress achieved in the months of October
and November, 1994, on work related to batch processing systems.
Excerpt From The Funding Bid
The following subsection, taken from the successful funding bid,
lists the major aims and objectives of the project which were
outstanding at the start of October, 1994.
Aims And Objectives
The main objective is to help sites match their users' demand for
computer power to the available equipment through the use of
distributed batch systems. Currently the systems are available, but
little or no independent information about the suitability of
products is available. Getting to know the products sufficiently
well to understand them without training is a very time-consuming
and hence expensive task. Specific aims for the year of funding
would be as follows:
- to implement and evaluate commercial and public domain
distributed batch systems and in particular NQS and DQS;
- to provide a report, comparing the systems' utility;
- to provide support on implementation, configuration and use
through the setting up and monitoring of a list on Mailbase;
- to provide a training course on selected systems at which the
systems will be described and information on implementation and
configuration will be given;
- to provide packaged releases for popular systems and in
particular Sun Solaris 1 and Sun Solaris 2;
- to provide simple end user documentation on selected systems (to
augment the inevitably terse manual pages).
Activities
Preamble
The first two months of the project have concentrated on laying the
foundations on which to build. Much of this is inevitably
intangible, such as building connections with people both inside
and outside Sheffield; but much more can be easily identified.
Below you will find a summary of all the activities which have been
undertaken in the last two months. Full versions of the papers
mentioned here are available via our World-Wide Web service.
NOTE
Items marked with a (*) are activities which (perhaps) do not fall
directly under the initial funding bid, but which we believe are
important to the project.
Tom, we would appreciate it if you could indicate whether you
consider these activities to be suitable or not.
Batch Processing At Other UK Sites
A questionnaire, looking to identify the current practice, and the
needs, of other UK sites was sent out on the 14th October, 1994.
The replies have been studied, and the following data extracted to
date.
- 36 UK sites have replied to the questionnaire. Of these, 28 are
being kept informed of the progress of this project (via email).
- 11 sites currently run NQS-class software. They are relatively
happy with NQS, but would like to see work done to add a proper
scheduler to improve the service NQS provides.
6 of these sites use 4D/NQS, by Silicon Graphics. 4D/NQS has
not been supported, or updated, in some time, and the binaries
available to sites will not work with the latest release of
Silicon Graphics' UNIX (IRIX 6.0).
- 7 sites run some version of DQS. One site is a new installation,
two sites are actively looking for a replacement, and one site is
actually running the direct ancestor of DQS.
- Silicon Graphics' IRIX is the version of UNIX most widely used
for batch processing, followed by Sun Microsystems' Solaris 2.
Together, these two versions of UNIX account for nearly half the
existing installations of a batch processing system.
The other platforms are AIX, Alliant, Convex, Cray, Fujitsu,
HP-UX, OSF/1, SunOS 4, and Ultrix.
- From a range of features, sites consider the core functionality,
user documentation, and ease of configuration to be the most
important aspects of a batch processing system. Other features,
such as user interfaces for administration purposes, are lowly
rated by comparison.
- Sites which do not perform batch processing cite `NO DEMAND' as
the reason why. Local experience has shown that if a service is
provided, the demand will materialise.
- Sites are unwilling to spend significant amounts of money
purchasing commercial batch processing systems.
In summary, we believe that other UK sites want some form of NQS
software, which is easy to configure, does a good job of scheduling
submitted requests, and includes good user documentation.
Evaluation Of Batch Processing Systems
Much of October 1994 was spent looking at a number of available
batch processing systems, comparing them against 4D/NQS by Silicon
Graphics.
- Condor (commercially available as IBM LoadLeveler) was examined
but not evaluated. According to the accompanying documentation,
Condor requires applications to have code included to assist it.
- DQS v3.1.1, from Florida State University, was evaluated. This
evaluation was paper-based due to unresolved problems with the
software (a patch to at least ensure it compiles on Solaris 2.3
was forwarded to the authors by us).
DQS has some very good ideas on managing heterogeneous clusters of
workstations, but unfortunately the implementation has created a
very confusing system. In addition, DQS does not provide support
for essential limits, such as CPU time and memory usage, that are
normally available from the underlying operating system.
Documentation is poor to the point of useless - major topics such
as configuration are conspicuous by their absence.
- Manchester Computer Centre have put work into extending NQS, but
in some respects they have been re-inventing the wheel because
their work is based on a (relatively) old version of NQS. What
has been written works well - this was the only software which
compiled `out-of-the-box' - but this was never intended to be
publicly released.
- Monsanto-NQS is the most comprehensively maintained version of NQS
which is freely available. Its feature set is a superset of
other versions (most notably including work initially by CERN and
Boeing), and its documentation is of a higher standard than other
versions. In addition, the code has been well commented.
- Sterling NQS is the main commercial option open to sites urgently
requiring an alternative to 4D/NQS. The evaluation was a
paper-based exercise as this site does not run Sterling NQS.
The main problem with Sterling NQS is that it does not support
the special processing features of IRIX - non-degradable priority
and processor sets - and unlike freely available versions of NQS,
such support cannot be added by ourselves.
As a result of this evaluation, we believe that Monsanto NQS is the
best batch processing system which is currently freely available.
We have still to look at a commercial system, Network Queueing
Environment, which has been developed by Cray.
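The per-process limits referred to in the DQS evaluation above (CPU
time, memory usage) are exposed by most UNIX systems through the
setrlimit() interface. As a rough illustration only, the following
Python sketch shows how a batch system might apply such limits to a job
before running it. The names run_with_limits and _set_soft_limit, and
their parameters, are hypothetical and are not taken from NQS or DQS;
support for RLIMIT_AS in particular varies between platforms.

```python
import resource
import subprocess


def _set_soft_limit(which, value):
    """Set the soft limit for one resource, leaving the hard limit alone."""
    _soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)  # a soft limit may never exceed the hard limit
    resource.setrlimit(which, (value, hard))


def run_with_limits(argv, cpu_seconds=None, memory_bytes=None):
    """Run argv as a child process under optional CPU and memory limits.

    Returns the exit status; a negative value means the child was
    terminated by a signal (for example after exceeding its CPU limit).
    """
    def apply_limits():
        # Called in the child after fork() and before exec(), so the
        # limits affect only the job and not the calling process.
        if cpu_seconds is not None:
            _set_soft_limit(resource.RLIMIT_CPU, cpu_seconds)
        if memory_bytes is not None:
            _set_soft_limit(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(argv, preexec_fn=apply_limits).returncode
```

For instance, run_with_limits(["sh", "-c", "while :; do :; done"],
cpu_seconds=1) should return a negative status, because the looping
shell is terminated once it has used its CPU allowance.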
System Administrator Support
A major part of the project is to provide support for batch
processing systems to other UK sites. The `Project Aims' state that
we intended to do this through the Mailbase service.
Mailbase
Due to problems at Mailbase's end, the email support was not
available until the 9th November 1994. There are currently three
electronic mailing lists:
- NQS-Announce (moderated)
This list is used to keep other UK sites informed of the progress
of this project. It carries all announcements relevant to this
project.
- NQS-Support (unmoderated)
This list is intended to be the first place for people to turn to
for help about installing, configuring, or just using batch
processing systems.
- NQS-Developers
This list carries discussion about changes needed to batch
processing systems. Currently, all those subscribed to the list
have written actual code for batch processing systems, providing
a high-calibre, and highly-experienced, resource for consultation
on technical issues.
We have sought to encourage interested parties to contact us through
the mailing lists (which we monitor daily) rather than to contact us
directly. This will hopefully serve to build a user community which
can help sustain work in NQS once this project is complete.
World Wide Web (*)
We have turned to the World Wide Web as an excellent means of
providing a large set of information about our work to all
interested parties.
All completed reports can be accessed via the Web, which provides a
better way to get at reports than having to download the files using
ftp.
Draft copies of reports are also made available, usually as they are
written, providing the user community with the opportunity to
comment on every step of the project. The feedback from this
process has already been of great help to us.
There is also a searchable database on the Web, which will grow over
time, allowing administrators to search by subject or by keyword on
topics of interest. The main source of data will be conversations
held on the mailing lists.
Finally, the daily Project Diary is also accessible via WWW. An
unusual move, but it is intended to promote a sense of open
development, allowing anyone to read (and comment on) what's going
through my head.
All documentation is updated automatically on a daily basis, and
ASCII versions of all documentation are also available.
FTP Site
We have set aside space on ftp.shef.ac.uk where the very latest
version of NQS is stored - our `NQS Archive' is mirrored by other
sites to ensure that the entire NQS user community benefits from one
single release which includes all the optional extras (*).
The contents of our NQS archive are currently:
- v3.36 of Monsanto-NQS
This is the version of NQS which we believe to be the most
suitable to support.
- Upgrades to v3.36.4 of Monsanto-NQS
We have released four upgrades to NQS so far (more details below).
- ASCII versions of all the documentation available via WWW.
- Unsupported tools (*)
A collection of numerous small tools developed to aid the main
project work. These are freely available, but we don't have the
time to offer help and support for them.
Installation and Configuration via SuperJANET (*)
To help encourage new installations of NQS at sites where resources
are not available for the job, we have decided to offer to install,
and configure NQS at any UK HE site, via the SuperJANET network.
We have publicised this service amongst all interested parties (via
the NQS-Announce mailing list) and also to sites in general via a
future article in Engineering Computing News. Further publicity
about this service in particular, and this project in general, is
required.
Training
To help promote this work, and provide information about the
configuration of NQS, we are attending the December Silicon Graphics
SIG meeting at Leicester. At the time of writing, we have not
determined the content of our presentation material, but we expect
to be able to use this material as the basis of a more rigorous
training package in the future.
Source Code
After consultation with UK Higher Education sites, and our
evaluation of freely available batch processing systems, we believe
that Monsanto NQS forms the best basis from which to proceed.
Monsanto Software, based in the USA, have licensed their code under
the Free Software Foundation's CopyLeft, which allows anyone to
release new versions of the software.
They no longer work on NQS. This project is now responsible for
co-ordinating new releases of NQS. (*)
There have been four such releases during October and (mainly)
November.
- v3.36.1
Fixed support for Solaris 2. Done by Stuart.
- v3.36.2
Included support for Linux, a high-quality, and highly-popular
version of UNIX for IBM PCs (and clones). Done by Stuart, based
on original work done by Dr. Karsten Steffens.
- v3.36.3
Fixed some silly errors in the accounting support. Done by
Michael Hamilton.
- v3.36.4
More fixes to the accounting, plus support for the new IRIX 6.0
release, support for processor sets, and inclusion of ANSI
prototyping. Done by John Roman and Dave Safford.
At the moment, major changes to the code have been supplied from
other institutions around the world, which we have merged into the
main code and tested (where possible).
NQS is over 140,000 lines of code, and time is currently being spent
on reading this source code to learn what it does and how, so that
we can then perform any necessary bug fixes or other modifications
to a high standard in a short period of time.
Time has also been spent looking at how to improve the quality of
the source code. NQS is an old piece of software, dating back to
1985, and major modifications, such as adding a scheduler, may prove
significantly easier if some restructuring of the NQS source code
has been done first (*).
To date, the following improvements are being considered (*):
- Update the code to be compliant with the POSIX.1 standard.
POSIX was not around when NQS was first written, and as a result,
porting NQS to new versions of UNIX is major work. By ensuring
POSIX compliance, future porting (even in years to come) will
prove to be significantly easier than it currently is.
- Sweep the code to improve modularity
We have been informed by other institutions that the NQS source
code is not highly modular (our examination of the source code is
not complete at this time). For example, email support is
scattered through a number of unrelated areas, rather than being
implemented in one place.
Improved modularity will reduce the size of the source code, and
improve consistency of behaviour throughout. It will also reduce
the impact of future changes to the source code, by localising
such changes.
- Runtime debugging
NQS currently has a (very) crude runtime debugging system, which
allows the developer to see what is going on. A complete
replacement, which has a proper stack-based tracing mechanism,
can be quickly used (especially at remote sites) to at least
locate the point of failure - a single-step debugger can then be
applied to the locality to provide an exact diagnosis.
In addition, one of the major problems encountered during the
evaluation of freely available source code was that programs
always failed without an error message. Even where the program
did generate error messages, these messages were never related to
the failure itself. A complete debugging system would do away
with this problem.
- Network protocols
NQS is not secure, although we have yet to determine to what
extent the current protocols can be abused. This area requires
looking at.
The NQS protocols also rely on the fixed size of integers and
longs under C. As the size of these types is not fixed by any
international standard, future architectures may prove
unable to run NQS. This area requires further examination.
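The integer-size concern above can be illustrated with a small sketch.
NQS itself is written in C; Python's struct module is used here only
because it exposes both the platform's native C sizes and a fixed-size,
network-byte-order layout. The two-field header (request_id, sequence)
is hypothetical and is not the actual NQS wire format.

```python
import struct

# Native layout: the width of "l" (a C long) is whatever the platform's
# C compiler chose, so two machines may disagree about it.
NATIVE_LONG = struct.Struct("@l")

# Standard layout: "!" selects network byte order with fixed sizes,
# so "i" is always 4 bytes and "q" is always 8 bytes, with no padding.
WIRE_HEADER = struct.Struct("!iq")


def encode_request(request_id, sequence):
    """Pack the hypothetical header into a fixed 12-byte message."""
    return WIRE_HEADER.pack(request_id, sequence)


def decode_request(data):
    """Unpack a 12-byte message back into (request_id, sequence)."""
    return WIRE_HEADER.unpack(data)
```

Stating the width and byte order of every field explicitly, rather than
sending whatever an int or long happens to be on the sending machine,
means both ends of a connection agree on the message layout regardless
of the architecture on which each was compiled.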
Summary
Aims And Objectives
Goals Achieved
The following goals from the Funding Bid have been satisfied.
- to provide support on implementation, configuration and use
through the setting up and monitoring of a list on Mailbase;
In addition, the following goals which did not feature in the
Funding Bid have been satisfied.
- consultation of other UK Higher Education sites to determine
their needs to help ensure that we meet those needs;
Goals Worked On
The following goals from the Funding Bid have been worked on.
- to implement and evaluate commercial and public domain
distributed batch systems and in particular NQS and DQS;
- to provide a report, comparing the systems' utility;
- to provide a training course on selected systems at which the
systems will be described and information on implementation and
configuration will be given;
- to provide packaged releases for popular systems and in
particular Sun Solaris 1 and Sun Solaris 2;
Goals Not Worked On
- to provide simple end user documentation on selected systems (to
augment the inevitably terse manual pages).
Expected Work - December and January 1994/1995
We believe that the following work should be undertaken during the
next two months of the project (*):
- Complete the examination of the NQS source code.
- Make all necessary structural changes to the source code.
- Bring the source code up to POSIX compliance.
In addition, these existing commitments should also continue to be
honoured.
- Monitor traffic on the Mailbase mailing lists, and provide
whatever information/assistance is required.
- Oversee bug fix releases of Monsanto NQS (*).