Progress Report - February 1995
Academic Computing Services, University of Sheffield
Stuart Herbert (S.Herbert@Sheffield.ac.uk)
Document copyright ©. All rights reserved.
Abstract
JISC, as part of its New Technologies Initiative, has funded the
University of Sheffield to supply and support batch processing
systems to UK Higher Education.
Introduction
About This Progress Report
This report documents the progress achieved in the months of
December, 1994, January and February, 1995, on work related to batch
processing systems.
Excerpt From The Funding Bid
The following subsection, taken from the successful funding bid,
lists the major aims and objectives of the project which were
outstanding at the start of December, 1994.
Aims And Objectives
The main objective is to help sites match their users' demand for
computer power to the available equipment through the use of
distributed batch systems. Currently the systems are available, but
little or no independent information about the suitability of
products is available. Getting to know the products sufficiently
well to understand them without training is a very time-consuming
and hence expensive task. Specific aims for the year of funding
would be as follows :
Goals Achieved
The following goals from the Funding Bid were satisfied by the
start of December, 1994.
- to provide support on implementation, configuration and use
through the setting up and monitoring of a list on Mailbase;
In addition, the following goals which did not feature in the
Funding Bid have been satisfied.
- Consultation of other UK Higher Education sites to determine
their needs to help ensure that we meet those needs.
Goals Worked On
The following goals from the Funding Bid had been worked on by the
start of December, 1994.
- to implement and evaluate commercial and public domain distributed
batch systems and in particular NQS and DQS
- to provide a report, comparing the systems' utility
- to provide a training course on selected systems at which the
systems will be described and information on implementation and
configuration will be given
- to provide packaged releases for popular systems and in
particular Sun Solaris 1 and Sun Solaris 2
Goals Not Worked On
The following goals from the Funding Bid had not been worked on by
the start of December, 1994.
- to provide simple end user documentation on selected systems (to
augment the inevitably terse manual pages)
Expected Work - December And January 1994/1995
The following work was scheduled for the months of December 1994 and
January 1995.
- Complete examination of the NQS source code
- Make all necessary structural changes
- Bring the source code up to POSIX.1 compliance
In addition, the following existing commitments were to be
honoured.
- Monitor traffic on the Mailbase mailing lists, and provide
whatever information/assistance is required.
- Oversee bug fix releases of Monsanto-NQS
Activities
Preamble
The last three months of the project, the third, fourth and fifth
months of our NTI funding period, have seen the emphasis of the
project move from a simple, direct support service to an active
support-through-development service.
Below you will find a summary of all the activities which have been
undertaken in the last three months. Full versions of the papers
mentioned here are available via our World-Wide Web service.
Monsanto-NQS
Introduction
Monsanto-NQS, previously supported by John Roman of the Monsanto
Company, is the leading freely-available version of NQS. It
incorporates the functionality found in the other freely-available
version of NQS, CERN NQS.
The University of Sheffield, through this NTI project, has accepted
responsibility for world-wide maintenance, development and
distribution of this product. Our policy has been that we will
co-ordinate and incorporate new functionality developed by third
parties, while providing only `bug fixes' ourselves.
We are working to produce a new set of source code, derived from
Monsanto-NQS, to which we will then add new functionality.
Maintenance
Much time has been spent supporting Monsanto-NQS.
Monsanto-NQS is not a bug-ridden product; thanks to the work of John
Roman, Monsanto-NQS is remarkably bug free. However, it is a
complex product with a large amount of source code, which makes
resolving even the smallest of problems a long and time-consuming
process.
This process is made harder when errors occur which are specific to
a flavour of UNIX which is not available here at the University of
Sheffield. In this case, a new release (incorporating our proposed
solution) has to be prepared, and once feedback has been received, a
second release (incorporating the final, fixed solution) has to be
prepared.
The following fixes were made to Monsanto-NQS during the three month
period covered by this report :
- Networking support for Solaris 2 was fixed
Tracing this one-line error took several weeks of my time; it
turned out to be the use of an incorrect structure at just
one point in the code (which was unrelated to the network support).
- HP-UX process statistics support was fixed
This prevented compilation of NQS on HP-UX platforms, and took
several weeks to resolve. Once this was corrected, and NQS could
be compiled on HP-UX, it revealed other problems in the HP-UX
support, which thankfully proved easy to solve.
- Solaris 2.1 networking bug was avoided
The networking libraries supplied with Solaris 2.1 do not work as
documented, and a simple fix was installed to avoid incorrect
behaviour when attempting to start the NQS service.
- IRIX 5/6 signal bug was fixed
The behaviour of some signals on IRIX changed between IRIX 4 and
subsequent versions of IRIX, and changes to NQS were necessary to
ensure that kernel-based resources were not ignored by users'
jobs.
- OSF/1 support was fixed
NQS failed to compile on OSF/1 v2, because of previous changes made
to correct other problems (mainly to do with HP-UX support). The
previous changes were corrected, and the Makefile changed so that
NQS would compile correctly.
- HP-UX 8 support was fixed
Many sites worldwide still run the older HP-UX 8, and changes
were made to NQS to ensure that it would work with this older
version of HP-UX.
- AIX compiler problems were solved
The AIX compiler suite is very poor in the way it handles memory;
alternative drivers were developed which alter the way the AIX
compiler suite works to simplify the work of the forthcoming NQS
scheduler.
A significant proportion of these fixes were made in conjunction
with other developers around the world, but even where complete
contributions came from off-site, a significant amount of time is
required to examine, understand, and test each contribution before
it can be considered safe for inclusion in the next release.
Outstanding Problems
Time has been spent (so far unsuccessfully) on resolving the following
problems :
- a number of sites (so far, all based outside UK academia) have
reported problems related to passing an NQS request from one
machine to another.
Analysis of supplied log files shows that this failure is due to
an internal database error. Code has been added to the database
support in NQS in order to generate more information about this
problem. Unfortunately, so far we have been unable to trip this
extra debugging code; my current conclusion is that the bug lies
elsewhere.
- Submission of a request to NQS fails under known conditions on
an IRIX 5 machine in Australia.
Extra code has been added to NQS to generate further information
about this problem, and we are awaiting the arrival of this extra
information. However, this appears to be linked to the previous
error, described above.
- There are (reportedly) unknown problems relating to Solaris 2.4.
So far, I have been unsuccessful in learning what the problems
are, and so I must budget time to investigate and resolve these
problems once we install Solaris 2.4 here in Sheffield.
Several weeks have been spent on these unresolved problems to
date.
Porting
At the end of February, 1995, Monsanto-NQS is known to be available
on the following distinct platforms :
- IBM AIX
- Hewlett-Packard HP-UX 8
- Hewlett-Packard HP-UX 9
- Silicon Graphics IRIX 4
- Silicon Graphics IRIX 5
- Silicon Graphics IRIX 6
- Linux
- AT&T (NCR) UNIX
- Digital Equipment Corp. OSF/1 v1.x
- Digital Equipment Corp. OSF/1 v2.x
- Digital Equipment Corp. OSF/1 v3.x
- Sun Microsystems Solaris 2.1
- Sun Microsystems Solaris 2.2
- Sun Microsystems Solaris 2.3
- Sun Microsystems SunOS 4.1.x
In addition, porting to other generic BSD or System 5r3 versions of
UNIX should prove a simple task, although the Monsanto-NQS code is
still not POSIX.1 compliant at this time.
One should note that commercial competitors cannot claim to support
15 versions of UNIX; Cray support 5, Unison support 6, OCS list 4.
Summary
Monsanto-NQS is the product which we currently supply and support to
UK Academia. The on-going maintenance of Monsanto-NQS is considered
to be very expensive in terms of time, and this activity has
seriously affected the time available to work on other activities
(notably development of Sheffield-NQS).
Sheffield-NQS
Introduction
Inside UK Academia, there is a lack of confidence in the overall
quality of the source code for existing implementations of NQS.
Existing implementations of NQS also suffer from an inherent lack of
portability and extensibility, and require significant amounts of
time to maintain.
The root cause of this is simply how old NQS is. First written in
1985, and now carrying the baggage of 10 years' worth of bug fixes,
NQS is internally a mess.
The solution is to re-engineer the product, and the result of this
work will be known as Sheffield-NQS. Once the initial release is
stable, this project can then begin adding new functionality, as
requested by UK Academia, to the product.
Design
NQS has become such a mess over the last ten years because of
poor design. Maintainer after maintainer has found it
far quicker to add their own code to perform some task (for example,
support of electronic mail) than to check to see what support is
already there and make use of that.
In order to avoid this, Sheffield-NQS from the outset must promote
code reusability internally; essentially, it must provide a toolkit
which is so good that whoever is maintaining it five years on will
still turn to this toolkit rather than write his or her own little
`fix'.
To achieve this, we have worked on the following :
- Object-orientated technology
C, as a language, is unsuited to the development of large
toolkits, because of its simplistic structuring support. We
quickly turned to C++, as the natural successor to C, as the
language to use.
If one avoids the use of multiple inheritance, a C++ class
hierarchy proves easy to trace through and understand.
- Base functionality
With the adoption of C++, it was decided to produce an
`application framework' C++ class library which would be used to
build all of the NQS programs.
Application frameworks are the ultimate in reusability, as they
are intended to be used as the building blocks in all
applications. The use of an application framework also
simplifies porting between operating systems; one simply
encapsulates the differences between operating systems inside the
framework, and presents a single, coherent interface for the
programmer to use. Porting software then becomes the far smaller
task of porting the application framework.
The success of Microsoft's MFC application framework, which
provides an easy transition for Windows developers from 16-bit
Windows 3.1 to 32-bit Windows 95, lends weight to the belief that
this approach will bring excellent long term benefits.
Current work has concentrated on developing the built-in
debugging support for the class library; function call tracing,
error logging and reporting, memory leakage detection, incorrect
usage of memory detection; plus basic services like standard
generic container classes for use in building complex data
structures. The next stage, which is not expected to take long,
is to encapsulate UNIX networking and file services, providing
transparent server/client support, and portable filing and
directory handling. With that complete, work can begin on
building actual NQS software.
(The class library also supports message catalogues in a
similar, but superior, style to X/Open's own work in this area.)
- Internal redesign
While it is important that Sheffield-NQS is backwards compatible
with Monsanto-NQS, this only applies to two external interfaces -
the supported command line utilities, and the networking
protocols. The internal behaviour is in the process of being
completely rebuilt, in order to produce a better product.
For example, the current (and highly inefficient) database
mechanisms are to be replaced by a decent distributed database
engine. This provides multi-node fault-tolerance for scheduling
purposes, something no other product can offer, and mechanisms
for easy extension.
For example, we are investigating the STREAMS model to see if a
portable implementation can be added to the class library; this
would allow us to build NQS as a set of stacked modules, making
modification of NQS in the future at runtime a trivial task.
For example, we are discussing on the NQS-Developers mailing list
what generic interfaces we need to provide, so that future
changes to the source code can be as trivial as possible.
- New SETUP system
In order to assist in porting to new architectures, we have
developed a source tree layout, coupled with a SETUP script for
installation purposes.
Architecture-specific code now resides in separate
subdirectories, and it is the role of SETUP to walk the user step
by step through the installation of NQS for the current platform.
Modified versions of the GNU Autoconf tools are used to assist
in this task.
Eventually, the SETUP script will be joined by a compiled
program, which will feature a good user interface (both text-mode
and X Windows based) to assist the installation. This will allow
us to provide the same ease-of-installation as enjoyed by users
of DOS and Windows software.
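The stacked-module idea raised under `Internal redesign' above can be
illustrated with a minimal sketch. It is written in Python purely for
brevity (the class library itself is C++), and every name in it
(Module, ModuleStack, UpperCase, Tag) is hypothetical; it is neither
the STREAMS interface nor any part of NQS. It only shows why pushing
and popping a module at runtime changes behaviour without touching
the other modules.

```python
class Module:
    """One processing stage; subclasses override process()."""

    def process(self, message):
        return message


class UpperCase(Module):
    """Hypothetical stage: upper-cases the message."""

    def process(self, message):
        return message.upper()


class Tag(Module):
    """Hypothetical stage: prefixes the message with a fixed label."""

    def __init__(self, label):
        self.label = label

    def process(self, message):
        return f"[{self.label}] {message}"


class ModuleStack:
    """Passes each message through the modules, most recently pushed first."""

    def __init__(self):
        self._modules = []

    def push(self, module):
        self._modules.append(module)

    def pop(self):
        return self._modules.pop()

    def send(self, message):
        for module in reversed(self._modules):
            message = module.process(message)
        return message


stack = ModuleStack()
stack.push(Tag("queue-a"))
stack.push(UpperCase())
with_both = stack.send("job 42 submitted")   # upper-cased, then tagged
stack.pop()                                  # remove UpperCase at runtime
tag_only = stack.send("job 42 submitted")    # tagged only
print(with_both)
print(tag_only)
```

Because each module sees only the message handed to it, adding or
removing a stage needs no change to its neighbours, which is the
property we hope to gain from such a design.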
Third Party Involvement
Due to the low staffing level of this project, we have been forced
to seek as much outside help as possible in order to produce this
product within a reasonable deadline.
- Staff at the Aachen Technology Institute are developing an
advanced load-balancer and scheduler for use with Sheffield-NQS.
This was originally an in-house project, but they have kindly
agreed to donate their final code to the main NQS distribution.
As I understand matters, work has been underway for several
months now.
- Development of an X Windows user interface will come from code
written in Germany, and will hopefully be supported by the Linux
community.
Linux International are a non-profit organisation working to
promote the use of the Linux operating system; one of their
projects is to develop a capable administration tool for Linux.
The C++ class library and distributed database from NQS will be
used in this project, and as part of this project, they will need
to add X Windows support to the class library.
Once the X Windows support has been added, I am confident of
seeing the C++ application framework ported to 32 bit Windows,
which will give us the opportunity to provide NQS clients running
on PCs.
Summary
Development of Sheffield-NQS has begun. Attention so far has
concentrated on producing a high quality underlying `toolkit' which
can be used to produce the rest of the NQS software. Development is
proceeding well, although progress has suffered due to time spent on
other commitments, notably maintenance of Monsanto-NQS.
Other Activities
Meetings
During December, the project officer made a presentation about this
project and the services and products offered, to the Silicon
Graphics Special Interest Group, held in Leicester.
UK Academic sites using Silicon Graphics are seen as major customers
for this project, because these sites currently use 4D/NQS,
developed (and no longer supported) by Silicon Graphics. 4D/NQS is
known not to work on the latest version of IRIX, IRIX 6.0x.
All sites present, when asked, indicated that they make use of batch
processing, and concerns were also raised over the quality of the
current NQS source code, and over what will happen once funding for
this project ceases. These concerns formed the basis on which the
bid for extra funding, and the schedule mentioned in the bid, were
built.
We expect to be invited to the next meeting of this group, so that
we can present participants with the latest information about our
work.
Publicity
The January edition of Engineering Computing Newsletter carried a
full page article publicising the products of this project. The
emphasis of the article was on the use of NQS to make use of a
number of UNIX workstations (as emphasised in the original Funding
Bid) rather than the more traditional non-distributed batch work
commonly found.
The impact of this article is difficult to assess, but certainly one
site, the University of Keele, has since contacted us and indicated
that, because of this article, they will definitely be using our
products in the future.
In January, a letter was sent to the directors of all Computing
Services at UK Universities, informing them about the project and
inviting them to make use of our products.
Again, the impact of this letter is difficult to assess, but our ftp
logs show that no fewer than 22 UK Academic sites downloaded files
from our NQS area during January alone.
Further publicity is planned, both electronically and paper-based,
but has been held off while development of Sheffield-NQS continues.
Contacts With Vendors
Over the last three months, we have been actively contacting vendors
of distributed batch processing systems.
- We are seeking to compile information about the available products,
so that we can publish a paper detailing the commercial systems
available.
Such a paper complements our existing paper on public domain
codes, and should prove useful to those sites who feel that they
require a commercial product.
No date has been set for completion of this paper, as we are
waiting for information from several of the NTI Cluster projects
regarding commercial vendors before compiling the paper.
- We are seeking to produce, and ratify, an Internet standard
protocol for distributed systems.
Even if funding is found from an external source once NTI funding
ceases, this project must realistically be considered finite. It
is therefore important, from the viewpoint of UK Academia, to
seek to promote the inter-operability of proprietary commercial
products, so that there will be a migration path from public
domain NQS to supported commercial alternatives.
Success so far has been varied.
- We have established a contact with the NQE development team inside
Cray Research, Inc., who have forwarded to us copies of their NQS
Protocol documentation.
Unfortunately, since then we have been unsuccessful in our
technical queries over extending the NQS Protocol.
We consider Cray, as the NQS market leader, to be essential to any
plans to change the current protocols. Because of the role Cray
has played in the past in relation to other NQS developers, any
standard adopted by Cray will eventually filter down to products
produced by other NQS developers. Without the support of Cray, we
consider any attempt to change the standards to be uncertain of
outcome.
- We have established a contact with the sales team in the UK
marketing Express/UX.
Unfortunately, we have been unable to translate this into a
contact inside the development team of Express/UX so far.
- We have established a contact with the sales arm of Unison, who
sell Load Balancer.
We recently forwarded on information about the project, and our
interests in other developers, and are waiting for them to
contact us.
- We are looking to establish contacts in Sterling Software, authors
of Sterling NQS, and IBM, authors of LoadLeveler.
Previous attempts to establish a contact inside Sterling were
unsuccessful.
Uptake By UK Academia
Our report `Interest In NTI Project : Distributed Batch Systems In A
UNIX Environment' concludes that 27 UK Academic sites are using our
NQS product to date.
In addition, we have undertakings from at least three more sites
(Keele, Birmingham, Reading) that they will use NQS at some point in
the future, making for a total of 30 sites.
We are working to promote NQS beyond these 30 sites, and to provide
a migration path for these sites away from NQS towards commercially
supported products at some point in the future.
Benefits To UK Academia
Introduction
This chapter looks at the benefits available to UK Academia if they
use our version of NQS rather than a commercial product.
One should note that the survey of UK Academia showed conclusively
that interest rests very firmly with some form of NQS as the
preferred solution, although only one of the products mentioned
below is part of the NQS family.
One should also note that the emphasis in this project is on the use
of distributed batch processing, across clusters of workstations,
and therefore the cost analysis below is based on clustered
workstations rather than single-server installations.
We have agreed, with suppliers, costings for the following
configuration, which we believe to be about right for one department
running NQS.
One should also consider that, from the attitudes expressed in
response to our original questionnaire, many sites would rather do
without a batch processing system than purchase a commercial
system, which increases the need for a freely-available (and
supported) UNIX batch processing system.
Finally, this report assumes that it is not necessary to make a case
for the concept of batch processing within these pages.
Commercial Products
There are four commercial products which we have information about.
Cray Research, Inc. NQE
Cray's Network Queueing Environment (NQE) is seen as the market
leader, both in the NQS market, and in the wider, commercial batch
processing system market. Cray view the product as enjoying a high
priority.
Cray have (very) recently negotiated a CHEST deal for NQE to UK
Academia. Their prices, however, are not cheap.
- For our typical installation, an unlimited user license would have
to be purchased, costing 14,375 pounds sterling.
One would then purchase twenty load balancing servers
(according to the documentation we have), at an additional cost
of 5,250 pounds sterling.
- This makes the total price, for each of our installations, to be
19,375 pounds sterling.
For that, one gets a 30-day money back guarantee, product
maintenance by telephone and email, and `minor revision releases'.
- The total cost, to 30 sites, would therefore be 581,250 pounds
sterling.
On top of this, one must consider the (unknown) cost of future
upgrades. We have been unable to obtain any pricing information on
their `Release 2.0' product line.
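The arithmetic behind these figures can be checked with a short
sketch (Python is used here only as a calculator, and the variable
names are illustrative). The two component prices quoted above add up
to 19,625 pounds, whereas the per-installation total on which the
30-site figure is built is 19,375 pounds; the sketch reports both
rather than guessing which figure is intended.

```python
# Figures quoted in this section, in pounds sterling.
unlimited_user_license = 14_375
load_balancing_servers = 5_250      # price quoted for twenty servers
stated_per_installation = 19_375    # per-installation total as quoted
sites = 30

sum_of_components = unlimited_user_license + load_balancing_servers
stated_total_all_sites = stated_per_installation * sites
component_total_all_sites = sum_of_components * sites

print(f"Sum of quoted components : {sum_of_components:,}")
print(f"Stated per installation  : {stated_per_installation:,}")
print(f"30 sites at stated price : {stated_total_all_sites:,}")
print(f"30 sites at summed price : {component_total_all_sites:,}")
```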
Contact Val Emerson at Cray Research (UK) on (0344) 722152 for
further enquiries.
Express/UX
Express/UX is developed by OCS, and marketed in Europe by Open Seas.
This product's main strength is its high quality user interface
(clients are available for MS-Windows and, I am told, Motif as
well); the scheduling support is, as I understand matters, currently
undergoing revision.
This product, however, is expensive.
- One would purchase a `first host', providing the main scheduling
capabilities, for 8,000 pounds sterling.
- One would then purchase an agent license for each of the
workstations, at 1,000 pounds sterling each.
- This makes the total cost for a single installation to be 28,000
pounds sterling.
- The total cost, to 30 sites, would therefore be 840,000 pounds
sterling.
Contact Jason Kent, at Open Seas, on (0865) 744656 for further
enquiries.
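The Express/UX pricing is a fixed charge plus a per-workstation
charge, so the cost for other cluster sizes follows directly. The
sketch below (Python, with a hypothetical helper named
express_ux_cost) reproduces the quoted totals. The figure of twenty
agent licences is an assumption inferred from the 28,000 pound total;
it is not stated explicitly in the pricing above.

```python
FIRST_HOST = 8_000      # pounds sterling, as quoted
AGENT_LICENSE = 1_000   # pounds sterling per workstation, as quoted


def express_ux_cost(workstations):
    """Cost of one installation: first host plus one agent per workstation."""
    return FIRST_HOST + AGENT_LICENSE * workstations


per_installation = express_ux_cost(20)   # 20 agents inferred from the total
all_sites = per_installation * 30

print(f"One installation : {per_installation:,} pounds")
print(f"Thirty sites     : {all_sites:,} pounds")
```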
Sterling NQS
Sterling NQS is the commercial version of NQS most widely known in
the UK to date, although Cray's recent entry into the market is
expected to change this.
As noted in our evaluation of batch processing systems, Sterling NQS
has acquired a reputation for not delivering.
Attempts to contact Sterling have been unsuccessful to date.
Unison Load Balancer
Load Balancer is one of several products developed and marketed by
Unison Software. Load Balancer is seen as best suited for working
in interactive environments, and has very good scalability and job
submission time - the fastest product available.
Prices look very good too :
Contact Janet Aitchison at Unison Software, on (0582) 462424 for
further enquiries.
Conclusion
The total cost for funding this project for two complete years is
32,600 pounds sterling. This provides all 213 UK Academia sites
with :
- An established batch processing system, which has been modified to
meet the direct stated needs of UK Academia.
- A highly portable, high quality implementation providing strong
support for extension and customisation.
- High quality support for installation and problem-solving.
- High quality documentation written specifically for British users.
- Compatibility with the de facto UNIX standard for batch processing,
and compliance with major international UNIX standards.
The only equivalent commercial system, from Cray Research, would cost
4,126,875 pounds sterling, for running on just 21 UNIX machines at
each of 213 UK Academic sites.
The commercial equivalent would therefore cost in the region of
12,659 per cent of the cost of this project (roughly 127 times as
much), a saving to UK Academia of over 99 per cent.
While these figures are above the true savings, one should remember
that those UK sites which are using NQS could include 4-5 separate
departments at each site, and each department is likely to have
twenty or so workstations, making these figures closer to the true
saving than they first appear ...
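For readers who wish to check the comparison, the sketch below
reworks it from the figures quoted in this chapter (Python used as a
calculator; the variable names are illustrative). It assumes, as the
text does, one Cray installation at 19,375 pounds at each of 213
sites, and expresses the result both as a ratio of the two costs and
as a saving relative to the commercial cost.

```python
project_cost = 32_600            # two years of funding, pounds sterling
cray_per_installation = 19_375   # per-installation total quoted earlier
sites = 213

commercial_cost = cray_per_installation * sites
ratio = commercial_cost / project_cost
saving = commercial_cost - project_cost
saving_fraction = saving / commercial_cost

print(f"Commercial cost : {commercial_cost:,} pounds")
print(f"Cost ratio      : {ratio:.2f} times ({ratio * 100:,.0f} per cent)")
print(f"Saving          : {saving:,} pounds ({saving_fraction:.1%})")
```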
JISC Criteria
The following criteria were published at the outset of the NTI :
- Proposals must demonstrate mechanisms for transferring results
and benefits to other HE institutions.
We continue to satisfy this criterion, through our use of
Mailbase, and our continued publicity, and by making the product
available, 24 hours a day, at the convenience of anyone who
wishes to take a copy.
- The JISC must be satisfied that the projects show vision;
The continued development of public domain NQS to meet the stated
needs of UK Academia, and the emphasis in that development of
producing a product which will require little/no support, surely
qualifies.
- are demonstrably effective;
If batch processing were not effective, UK Universities would not
be using it. And if NQS in particular were not effective, the
same would surely apply.
- and will involve key technologies for the future which would not
be available to students and researchers without the support of
this Initiative.
High performance batch processing is a key technology if one
wishes to get any work done in an otherwise overloaded UNIX
environment. As our questionnaire shows, if this project did not
exist, most UK sites would be unwilling to spend money on
commercial alternatives, and so this key technology is only
available to students and researchers through NTI.
World-Wide Usage
As the figures in our report show, this NTI project provides a
product which is used around the world.
This must surely have some influence upon the international standing
of UK Academia.
Summary
Aims And Objectives
Goals Achieved
No new goals have been satisfied during the three months covered by
this report.
Goals Worked On
The following goals from the Funding Bid have been worked on.
- to implement and evaluate commercial and public domain
distributed batch systems and in particular NQS and DQS;
- to provide a report, comparing the systems' utility;
- to provide a training course on selected systems at which the
systems will be described and information on implementation and
configuration will be given;
- to provide packaged releases for popular systems and in
particular Sun Solaris 1 and Sun Solaris 2
Goals Not Worked On
- to provide simple end user documentation on selected systems (to
augment the inevitably terse manual pages).
Expected Work - March - May, 1995
- Continue the construction of Sheffield-NQS
We hope to achieve the first release no later than the end of
April, but it must be pointed out that this may not be possible
due to the amount of time currently spent supporting the existing
source code.
- Prepare publicity for Sheffield-NQS
Given that the current takeup in UK Academia is impressive, we
consider that time should not be devoted to publicity until the
new batch processing system, Sheffield-NQS, is a completed
product. Publicising further the existing product simply takes
time away from development of the new product.
- Prepare and publish a report on commercial batch processing
systems.
One of the major functions of a support service is to provide
information, and currently, there appears to be no information
available to UK Academia on the commercial alternatives.
- Monitor traffic on the Mailbase mailing lists, and provide
whatever information/assistance is required.
Despite the amount of time consumed by this task, we consider
that downgrading the amount of time given to this will only
result in bad publicity, and will therefore be counter-productive.
- Oversee bug fixes of Monsanto-NQS.