An Operational Comparison Between CODINE And Monsanto-NQS
Academic Computing Services, University of Sheffield
Stuart Herbert (S.Herbert@Sheffield.ac.uk)
Document copyright ©. All rights reserved.
Abstract
This paper provides a detailed review of CODINE, a detailed
comparison of CODINE against Monsanto-NQS in light of the stated
needs of UK HE [1], and a look at what each product hopes to deliver
in its next revision.
This paper concludes that currently neither product is superior on
features, and that Monsanto-NQS is currently superior on cost.
This paper is based upon an earlier paper by Professor R Hynds
of Imperial College, London, presented at the JISC NTI Clusters
Workshop, 12th-13th December 1994. We gratefully acknowledge
the assistance of Professor Hynds and his staff during the
preparation of this paper.
Introduction
Purpose And Scope
UNIX workstations are often idle (especially overnight), and batch
processing systems are seen as a suitable technology for utilising
such spare capacity.
The Joint Information Systems Committee (JISC), through its New
Technologies Initiative (NTI), has looked at the two very different
approaches to this technology - Imperial College has looked at
CODINE, a product based on the DQS approach, while the University of
Sheffield has looked at Monsanto-NQS, a product based on the NQS
approach.
Imperial College presented a paper in December 1994 which looked at
CODINE, and subsequently the NTI committee has decided to fund
support for CODINE.
The purpose of this paper is to provide an alternative review of
CODINE, to provide a detailed comparison of CODINE against
Monsanto-NQS, and to show what both products intend to provide in
their next revisions.
Relevant Literature
There are three other papers in this area :
- A Comparison of Queueing, Cluster and Distributed Computing
Systems, Joseph A. Kaplan, Michael L. Nelson, NASA Langley
Research Center, June 1994.
This paper looks at a large number of batch systems, including
CODINE, and several versions of NQS. All systems are compared
against a prepared checklist, and this is followed by discussion
of major points about each product.
Much of the evaluation is based solely on supplied documentation
and discussion with the vendors - only two of the twelve systems
were actually installed and tested.
No conclusions about the superiority of one product over another
are drawn.
- Systems Analysis - Batch Processing Systems, Stuart Herbert,
Academic Computing Services, University of Sheffield, October
1994.
This paper looks at a small number of batch processing systems,
with a view to recommending a product to be further developed to
meet the requirements of UK Higher Educational sites [1]. The
selection of products was determined by knowledge of the
availability of the systems. The features of the different
products are compared, and the concerns about each product are
documented.
Much of the evaluation is based on the attempted installation of
each product, with secondary evaluation based on documentation
where installation attempts were unsuccessful.
This paper concludes that Monsanto-NQS is the most suitable
product to develop further to meet the stated needs of UK Higher
Education.
- CODINE And LoadLeveller - Packages for Running Serial Batch Work
in a Workstation Cluster Environment, Professor R.J.Hynds, Centre
for Computing Services, Imperial College, December 1994.
This paper looks at the suitability of CODINE and LoadLeveller
for batch processing work over a period of several months. The
features of each product are discussed, along with their
implications for practical application, and a limited comparison
with Monsanto-NQS is included.
The evaluation is the result of a JISC NTI project to look at
these packages.
This paper concludes that the use of CODINE and LoadLeveller is
of significant advantage, that both products are superior to
Monsanto-NQS, and that checkpointing is a feature which users are
loath to use.
Investigations
The author visited Imperial College and was shown how CODINE is
installed and configured at that site. The author then examined
the CODINE software and its documentation, and discussed any
relevant issues with experienced staff at Imperial College. These
issues were then reviewed by the author and staff at Imperial
College.
These issues were then raised with GENIAS, the authors of CODINE,
and their comments were incorporated into a revision of Professor
Hynds' original paper.
The Issues
The following issues were identified :
- CODINE has features which simplify administration when compared
to Monsanto-NQS.
However, batch processing systems are intended to be installed,
configured, and then left unaltered - both CODINE and NQS are
well suited to this task.
- CODINE and NQS cannot interact - sites must realistically choose
one or the other.
- While elements of the Graphical User Interface (GUI) of CODINE
are of value, using the GUI places heavy demands on computing
hardware.
- CODINE does not provide specific support for the extra features
of the Silicon Graphics IRIX operating system. A number of UK
sites require support of these features.
- CODINE cannot be used to share resources between different
institutions (over SuperJANET), because CODINE requires that user
and group id values are the same across all CODINE nodes.
- CODINE's support of interactive use is limited to starting an
`xterm' process on a remote machine, something which Monsanto-NQS
can be configured to do.
CODINE
Purpose of Original Study
In many scientific/engineering departments there are often powerful
workstations standing idle. The purpose of this study is to see
if there exists software that enables this 'spare' capacity to be
used for batch processing purposes without interfering with
the 'normal' use of the workstation.
Such software should :
- protect the owner's normal usage
- permit scheduling and load balancing of jobs
- control the number of jobs
- provide job checkpointing
- provide job migration
About Imperial College
Imperial College Computing Environment - South Kensington Campus
- Main computing resources distributed to departments
- Large research income - many workstations
- Substantial batch work requirement
- Four main vendors - DEC, IBM, SGI, SUN
- Wide variety of types of batch work
- NQS already in use in many departments
The CODINE Software
The commercial package evaluated, CODINE, is developed from two
other products :
- CONDOR, developed at the University of Wisconsin-Madison by
Miron Livny and co-workers, is used to provide the checkpointing
facilities.
- DQS, developed at Florida State University by Tom Green and
co-workers, is used to provide the queueing and scheduling
facilities.
Testing Of CODINE
Part of the Centre for Computing Services' support facilities consists of
a client-server workstation pair for each of the main vendors on
Campus, i.e. DEC, IBM, SGI, SUN.
CODINE was tested on each client-server pair (HOMOGENEOUS
CONFIGURATION) then in combination between client-server pairs
(HETEROGENEOUS CONFIGURATION). The DEC pair were tested under the
OSF operating system.
Generally speaking the package performed according to its
specifications.
General Features Of CODINE And LoadLeveller
- Graphical User Interface
- Support for batch jobs
- Support for job checkpointing
- Support for migration of checkpointed jobs
- Intelligent load balancing
- Definable job limits for specific users
- Definable priority levels for particular users
- Configurable queues with specific resource limits
- Suspension and enabling of queues
- Definable queue owners
- Definable 'managers' and 'users'
- Supports centralised job accounting
- Usable in both homogeneous and heterogeneous clusters
- (Processes NQS scripts)*
- (Supports parallel job environments, e.g. PVM)
* Support for NQS scripts is at extra cost.
Checkpointing And Migration Of Jobs
An essential point of the 'CONDOR' philosophy is that a workstation
being used to run batch work will suspend the job if the keyboard or
mouse of the workstation is activated. Subsequently, if the
workstation is used continuously over a specified period, the batch
job will be checkpointed, then migrated to another (idle)
workstation.
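The policy described above can be sketched as a small decision function. This is an illustrative Python sketch only, not CODINE or CONDOR code; the names `Action`, `next_action` and `busy_threshold_s`, and the 300-second default for the 'specified period', are assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    RUN = "run"                          # workstation idle: batch job runs
    SUSPEND = "suspend"                  # owner active: job is paused in place
    MIGRATE = "checkpoint and migrate"   # owner active for too long


def next_action(owner_busy_s: float, busy_threshold_s: float = 300.0) -> Action:
    """Decide what to do with a batch job on a workstation.

    owner_busy_s     -- seconds of continuous keyboard/mouse activity (0 if idle)
    busy_threshold_s -- hypothetical 'specified period' before migration
    """
    if owner_busy_s <= 0:
        return Action.RUN
    if owner_busy_s < busy_threshold_s:
        return Action.SUSPEND
    return Action.MIGRATE


if __name__ == "__main__":
    for busy in (0, 30, 600):
        print(busy, next_action(busy).value)
```

The point of the sketch is that suspension is immediate and cheap, while checkpointing and migration are deferred until the owner has clearly reclaimed the machine.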
Operationally, this mode of working poses problems, notably the time
needed to checkpoint and migrate a job.
Tests were done on the time needed to checkpoint jobs across an
ethernet to a 'remote' disc. On an unloaded client-server pair jobs
of up to 30 MBytes took up to 1 minute. Jobs of 50 to 60 MBytes
took 2 to 3 minutes. These times need to be doubled to cover
migration to another unloaded workstation. On ethernets carrying any
other traffic, times are correspondingly longer.
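As a rough worked example, the measurements quoted above imply an effective transfer rate on the order of 0.3 to 0.5 MBytes per second. The Python sketch below makes that arithmetic explicit. The pairing of sizes with times (30 MBytes with 60 seconds, 50 with 120, 60 with 180) is an assumption drawn from the quoted ranges, and the function names are invented for the illustration.

```python
def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Effective checkpoint throughput implied by one measurement."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


def migration_time_s(checkpoint_time_s: float) -> float:
    """Checkpoint time doubled, as the text states for migration."""
    return 2 * checkpoint_time_s


# (job size in MBytes, checkpoint time in seconds); pairings assumed
# from the ranges quoted in the text.
MEASUREMENTS = [(30, 60), (50, 120), (60, 180)]

if __name__ == "__main__":
    for size_mb, secs in MEASUREMENTS:
        rate = throughput_mb_per_s(size_mb, secs)
        print(f"{size_mb} MBytes: {rate:.2f} MBytes/s, "
              f"migration about {migration_time_s(secs):.0f} s")
```

On these assumptions a 60 MByte job would need around six minutes to migrate on an otherwise idle network.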
The ability to checkpoint jobs is not widely used at Imperial College.
Operational Evaluation Of CODINE
In April 1994, CODINE was installed on an SGI Indigo workstation
cluster in the Department of Aeronautics, which is used for teaching
and research.
Simplified CODINE administration and user guides were produced and
the cluster manager given training in the use of CODINE.
Four categories of job classes were set up initially :
- Small - 20 MB Memory Limit
- Medium - 35 MB Memory Limit
- Large - 70 MB Memory Limit
- Huge - Unlimited
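The four classes above can be read as a lookup from a job's memory requirement to the smallest class that will accept it. The Python sketch below illustrates that reading; it is not CODINE configuration syntax, the name `smallest_fitting_class` is hypothetical, and routing a job to the smallest fitting class is an assumption (in practice the user chose the class at submission).

```python
import math

# (class name, memory limit in megabytes) as initially set up on the cluster.
JOB_CLASSES = [
    ("small", 20),
    ("medium", 35),
    ("large", 70),
    ("huge", math.inf),   # unlimited
]


def smallest_fitting_class(memory_mb: float) -> str:
    """Return the first class whose memory limit covers the requirement."""
    if memory_mb < 0:
        raise ValueError("memory_mb must not be negative")
    for name, limit_mb in JOB_CLASSES:
        if memory_mb <= limit_mb:
            return name
    raise AssertionError("unreachable: 'huge' accepts any size")


if __name__ == "__main__":
    for need in (10, 35, 36, 500):
        print(need, smallest_fitting_class(need))
```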
In August the 'small' job classes were modified to be 'largeoutput'
classes, with a 25 MB memory limit. These were placed on machines
with large scratch directories. To cope with users who submitted
jobs from outside the cluster (and so were not under CODINE control)
a daemon was written to kill non-CODINE jobs. We have made the
source code for this daemon available for use at other sites.
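The selection step of such a daemon might look like the Python sketch below. This is a hypothetical illustration, not the Imperial College daemon: the `Proc` record, the exempt-user set and the CPU-time threshold are all assumptions, and a real daemon would also have to obtain the process list and the set of batch-controlled process ids from the system.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Proc:
    pid: int
    user: str
    cpu_seconds: float


def pids_to_kill(
    procs: Iterable[Proc],
    batch_pids: FrozenSet[int],
    exempt_users: FrozenSet[str] = frozenset({"root"}),
    cpu_limit_s: float = 600.0,
) -> List[int]:
    """Pick long-running processes that are not under batch-system control.

    batch_pids   -- pids known to have been started by the batch system
    exempt_users -- hypothetical accounts that are never touched
    cpu_limit_s  -- hypothetical CPU-time threshold for a 'job'
    """
    return [
        p.pid
        for p in procs
        if p.pid not in batch_pids
        and p.user not in exempt_users
        and p.cpu_seconds > cpu_limit_s
    ]
```

Keeping the selection as a pure function makes the policy easy to test separately from the code that actually signals processes.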
Results of CODINE Evaluation
- Elements of the graphical user interface work well
- Easy supervision and control of jobs
- Distribution of jobs between machines, and matching of jobs to
machine resources, are better than occurs without the use of
load-balancing software.
- Very useful job accounting
- No one prepared to checkpoint long running jobs
- Need to migrate jobs did not occur
- Job resource requirements must be known in advance, otherwise the
job is terminated at runtime.
- Requires the creation of large numbers of queues in order to run
jobs simultaneously, making serious interactive use more difficult.
- Ability to group queues in different ways simplifies administration.
- No support for processor sets or non-degradable priorities on SGI.
Comparison Of CODINE and Monsanto-NQS
An essential question to be answered is whether CODINE provides a
significant improvement over Monsanto-NQS.
The facilities provided by CODINE which are NOT provided by NQS are :
- Graphical user interface
- Ability to restrict number of jobs per user executing at any one
time over the complete cluster
- Ability to restrict users from a particular queue or from using
CODINE at all.
- Ability to checkpoint jobs
- Ability to migrate jobs
The facilities provided by CODINE which can be emulated by NQS are :
- Centralised job accounting.
- Interactive work (CODINE's qsh command).
The facilities provided by NQS which are NOT provided by CODINE are :
- Run more than one job simultaneously per queue.
- Truly distributed - does not require a central node managing a
cluster.
- Mapping of users and groups between machines.
- Interoperate with other commercial and public-domain products
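The user mapping item is what allows jobs to cross between sites whose user ids differ. The idea can be sketched in Python as a lookup table from the submitting host and remote user to a local account. This is only an illustration of the concept: the table contents, host names and the `map_user` function are hypothetical and do not reflect the NQS configuration format.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping table: (submitting host, remote user) -> local user.
USER_MAP: Dict[Tuple[str, str], str] = {
    ("site-a.example", "alice"): "guest01",
    ("site-b.example", "bob"): "bob",
}


def map_user(host: str, remote_user: str) -> Optional[str]:
    """Return the local account for a remote submitter, or None to reject."""
    return USER_MAP.get((host, remote_user))


if __name__ == "__main__":
    print(map_user("site-a.example", "alice"))
    print(map_user("site-c.example", "mallory"))
```

A system that instead requires identical user and group ids on every node has no equivalent of this table, which is why sharing between institutions is impractical in that case.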
Future Directions
Next Release - CODINE v4
GENIAS reports that the next release of CODINE will see the
following new features :
- A queue can run more than one simultaneous job
- POSIX 1003 Draft 15 compliant
- (E)PS output of the accounting graphical user interface
- Real-time usage monitor
- Hierarchical management
- DCE/DFS Support
- Kerberos/AFS Support
- Scheduler can be replaced/configured
- Access to CODINE can be restricted by group
- Improved GUI
- Parallel Make
- Improved Monitoring & Error Tracing
- Based on DQS 3
Next Release - Sheffield-NQS (formerly Monsanto-NQS) v4
The University of Sheffield reports that the next release of NQS
will see the following new features :
- Distributed Database
- Improved Cluster Configuration
- Improved Monitoring & Error Tracing
- Improved Load Balancing
- Kerberos/AFS Support
- Larmouth Scheduling
- RFC 1413 Support
- Runtime loadable, stackable module support for extensibility
- Tailored to stated UK requirements
The aim of Sheffield-NQS is to provide a product which can be very
easily adapted in the future to meet changing requirements.
Pricing
Pricing - CODINE
Charges To Imperial College
Imperial College were quoted the following prices in December 1993
for a site license :
    --------------------------------
    No. Of Students        Cost (DM)
    --------------------------------
    1 - 20,000                  6200
    20,001 - 35,000             8600
    35,001+                    14800
    --------------------------------
Prices in the UK are 10% higher. The cost figure is paid yearly,
with an additional one-off payment of the same amount at the start of
the site license.
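The pricing scheme above can be worked through with a short Python sketch. It assumes that the 10% UK uplift applies to the table figures and that the total over a period is the initial one-off payment plus one payment per year; the function names are invented for the illustration.

```python
# Annual site-license fee in DM, keyed by the upper bound of each student
# band. The last band has no upper bound.
PRICE_BANDS_DM = [(20_000, 6200), (35_000, 8600), (None, 14800)]


def annual_fee_dm(students: int, uk: bool = True) -> int:
    """Yearly fee in DM for a site with the given number of students."""
    if students < 1:
        raise ValueError("students must be at least 1")
    for upper, fee in PRICE_BANDS_DM:
        if upper is None or students <= upper:
            # UK prices are stated to be 10% higher; integer arithmetic
            # is exact for these particular fees.
            return fee * 11 // 10 if uk else fee
    raise AssertionError("unreachable: last band is unbounded")


def total_cost_dm(students: int, years: int, uk: bool = True) -> int:
    """Initial one-off payment plus one payment per year."""
    return annual_fee_dm(students, uk) * (years + 1)


if __name__ == "__main__":
    print(annual_fee_dm(15_000), total_cost_dm(15_000, 3))
```

On these assumptions a UK site with 15,000 students would pay 6820 DM a year, or 27280 DM over the first three years including the initial payment.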
Charges To The University of Sheffield
The University of Sheffield reports that the latest prices sent on
request by GENIAS are :
- Pay an annual fee of 13,500 DM (5869 stlg), plus an initial fee
of 13,500 DM (5869 stlg). This includes maintenance for three
months.
or
- Pay a one-off fee of 45,000 DM (19,565 stlg). This includes
maintenance for three months. Upgrades to new releases are extra.
Maintenance is an annual fee of 8,100 DM (3521 stlg). Prices in
brackets assume an exchange rate of 2.3 DM to the pound. All prices
include academic discounts.
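The sterling figures and the relative cost of the two options can be checked with a few lines of Python. The sketch below uses the stated rate of 2.3 DM to the pound and truncates to whole pounds, which is what the quoted figures appear to do. It deliberately leaves maintenance and upgrade charges out of the comparison, since the text does not say exactly how they apply to each option; the function names are assumptions.

```python
DM_PER_POUND = 2.3  # exchange rate stated in the text


def to_sterling(dm: float) -> int:
    """Convert DM to whole pounds, truncating any fraction."""
    return int(dm / DM_PER_POUND)


def annual_option_dm(years: int) -> int:
    """Initial fee plus one annual fee per year (maintenance not modelled)."""
    return 13_500 * (years + 1)


ONE_OFF_DM = 45_000  # one-off fee; upgrades and maintenance not modelled

if __name__ == "__main__":
    print(to_sterling(13_500), to_sterling(ONE_OFF_DM))
    for years in range(1, 5):
        cost = annual_option_dm(years)
        print(years, cost, cost > ONE_OFF_DM)
```

Under these simplifications the annual-fee option costs less than the one-off fee for the first two years and more from the third year onwards.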
Conclusions
The use of batch processing software, such as CODINE, provides
significant advantages in the running of batch jobs on a workstation
cluster. The extra features of CODINE - the graphical user
interface, checkpointing and job migration - which make it an
improvement on NQS, are seldom used.
Imperial College has decided to adopt CODINE as its standard campus
batch control package.
References
[1] Batch Processing Systems In The UK HE Community, S.Herbert,
Academic Computing Services, University of Sheffield, October 1994.