An Operational Comparison Between CODINE And Monsanto-NQS

Academic Computing Services , University of Sheffield

Stuart Herbert (S.Herbert@Sheffield.ac.uk)

Document copyright ©. All rights reserved.


Abstract

This paper provides a detailed review of CODINE, a detailed comparison of CODINE against Monsanto-NQS in light of the stated needs of UK HE [1], and a look at what each product hopes to deliver in its next revision.

This paper concludes that neither product is currently superior on features, and that Monsanto-NQS is superior on cost.

This paper is based upon an earlier paper by Professor R Hynds of Imperial College, London, presented at the JISC NTI Clusters Workshop, 12th-13th December 1994. We gratefully acknowledge the assistance of Professor Hynds and his staff during the preparation of this paper.



Introduction


Purpose And Scope

UNIX workstations are often idle (especially overnight), and batch processing systems are seen as a suitable technology for utilising such spare capacity.

The Joint Information Systems Committee (JISC), through its New Technologies Initiative (NTI), has looked at two very different approaches to this technology: Imperial College has looked at CODINE, a product based on the DQS approach, while the University of Sheffield has looked at Monsanto-NQS, a product based on the NQS approach.

Imperial College presented a paper in December 1994 which looked at CODINE, and subsequently the NTI committee has decided to fund support for CODINE.

The purpose of this paper is to provide an alternative review of CODINE, to provide a detailed comparison of CODINE against Monsanto-NQS, and to show what both products intend to provide in their next revisions.


Relevant Literature

There are three other papers in this area :

  • A Comparison of Queueing, Cluster and Distributed Computing Systems, Joseph A. Kaplan, Michael L. Nelson, NASA Langley Research Center June 1994.

    This paper looks at a large number of batch systems, including CODINE, and several versions of NQS. All systems are compared against a prepared checklist, and this is followed by discussion of major points about each product.

    Much of the evaluation is based solely on supplied documentation and discussion with the vendors - only two of the twelve systems were actually installed and tested.

    No conclusions about the superiority of one product over another are drawn.

  • Systems Analysis - Batch Processing Systems, Stuart Herbert, Academic Computing Services, University of Sheffield, October 1994.

    This paper looks at a small number of batch processing systems, with a view to recommending a product to be further developed to meet the requirements of UK Higher Educational sites [1]. The selection of products was determined by knowledge of the availability of the systems. The features of the different products are compared, and the concerns about each product are documented.

    Much of the evaluation is based on the attempted installation of each product, with secondary evaluation based on documentation where installation attempts were unsuccessful.

    This paper concludes that Monsanto-NQS is the most suitable product to develop further to meet the stated needs of UK Higher Education.

  • CODINE And LoadLeveller - Packages for Running Serial Batch Work in a Workstation Cluster Environment, Professor R.J.Hynds, Centre for Computing Services, Imperial College, December 1994.

    This paper looks at the suitability of CODINE and LoadLeveller for batch processing work over a period of several months. The features of each product are discussed, along with their implications for practical application, and a limited comparison with Monsanto-NQS is included.

    The evaluation is the result of a JISC NTI project to look at these packages.

    This paper concludes that the use of CODINE and LoadLeveller is of significant advantage, that both products are superior to Monsanto-NQS, and that checkpointing is a feature which users are loath to use.


Investigations

The author visited Imperial College, and was shown how CODINE is installed and configured at that site. From there, the author examined the CODINE software and its documentation, and discussed any relevant issues with experienced staff at Imperial College. These issues were then reviewed by the author and staff at Imperial College.

The issues were subsequently raised with GENIAS, the authors of CODINE, and their comments were incorporated into a revision of Professor Hynds' original paper.


The Issues

The following issues were identified :

  • CODINE has features which simplify administration when compared to Monsanto-NQS.

    However, batch processing systems are intended to be installed, configured, and then left unaltered - both CODINE and NQS are well suited to this task.

  • CODINE and NQS cannot interact - sites must realistically choose one or the other.

  • While elements of the Graphical User Interface (GUI) of CODINE are of value, using the GUI places heavy demands on computing hardware.

  • CODINE does not provide specific support for the extra features of the Silicon Graphics IRIX operating system. A number of UK sites require support of these features.

  • CODINE cannot be used to share resources between different institutions (over SuperJANET), because CODINE requires that user and group id values are the same across all CODINE nodes.

  • CODINE's support of interactive use is limited to starting an `xterm' process on a remote machine, something which Monsanto-NQS can be configured to do.
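The user and group id requirement noted in the list above can be illustrated with a small check. This is only a sketch: the function name, the node names and the account data are all hypothetical, and nothing here is part of CODINE or NQS. It reports users whose (uid, gid) pair differs between nodes, which is the situation that would prevent those nodes from sharing one CODINE cluster.

```python
from collections import defaultdict


def find_id_conflicts(node_accounts):
    """Return {username: {node: (uid, gid)}} for users whose ids differ between nodes.

    node_accounts maps a node name to {username: (uid, gid)}.
    All names and numbers used with this function are hypothetical sample data.
    """
    seen = defaultdict(dict)
    for node, accounts in node_accounts.items():
        for user, ids in accounts.items():
            seen[user][node] = ids
    # A user is in conflict when more than one distinct (uid, gid) pair is seen.
    return {
        user: per_node
        for user, per_node in seen.items()
        if len(set(per_node.values())) > 1
    }


if __name__ == "__main__":
    cluster = {
        "site-a-node": {"alice": (1001, 100), "bob": (1002, 100)},
        "site-b-node": {"alice": (1001, 100), "bob": (2417, 250)},
    }
    for user, per_node in find_id_conflicts(cluster).items():
        print(user, per_node)
```

A product that maps users and groups between machines does not need this check to pass, which is why the restriction matters mainly for sharing between institutions.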


CODINE


Purpose of Original Study

In many scientific/engineering departments there are often powerful workstations standing idle. The purpose of this study is to see if there exists software that enables this 'spare' capacity to be used for batch processing purposes without interfering with the 'normal' use of the workstation.

Such software should :

  • protect the owner's normal usage

  • permit scheduling and load balancing of jobs

  • control the number of jobs

  • provide job checkpointing

  • provide job migration


About Imperial College

Imperial College Computing Environment - South Kensington Campus

  • Main computing resources distributed to departments

  • Large research income - many workstations

  • Substantial batch work requirement

  • Four main vendors - DEC, IBM, SGI, SUN

  • Wide variety of types of batch work

  • NQS already in use in many departments


The CODINE Software

The commercial package evaluated, CODINE, is developed from two other products :

  • CONDOR, developed at the University of Wisconsin-Madison by Miron Livny and co-workers, is used to provide the checkpointing facilities.

  • DQS, developed at Florida State University by Tom Green and co-workers, is used to provide the queueing and scheduling facilities.


Testing Of CODINE

Part of the Centre for Computing Services support facilities consists of a client-server workstation pair for each of the main vendors on campus, i.e. DEC, IBM, SGI, SUN.

CODINE was tested on each client-server pair (HOMOGENEOUS CONFIGURATION) then in combination between client-server pairs (HETEROGENEOUS CONFIGURATION). The DEC pair were tested under the OSF operating system.

Generally speaking the package performed according to its specifications.


General Features Of CODINE And LoadLeveller

  • Graphical User Interface

  • Support for batch jobs

  • Support for job checkpointing

  • Support for migration of checkpointed jobs

  • Intelligent load balancing

  • Definable job limits for specific users

  • Definable priority levels for particular users

  • Configurable queues with specific resource limits

  • Suspension and enabling of queues

  • Definable queue owners

  • Definable 'managers' and 'users'

  • Supports centralised job accounting

  • Usable in both homogeneous and heterogeneous clusters

  • (Processes NQS scripts)*

  • (Support parallel job environments e.g. PVM)

* Support for NQS scripts is at extra cost.


Checkpointing And Migration Of Jobs

An essential point of the 'CONDOR' philosophy is that a workstation being used to run batch work will suspend the job if the keyboard or mouse of the workstation is activated. Subsequently, if the workstation is used continuously over a specified period, the batch job will be checkpointed, then migrated to another (idle) workstation.
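The policy described above can be sketched as a single decision function. This is a minimal illustration of the rule as stated in the text, not code from CONDOR or CODINE; the function name, the parameter names and the threshold value are assumptions made for the example.

```python
def batch_job_action(owner_active, seconds_of_continuous_use, migrate_after_seconds):
    """Decide what should happen to a batch job on an owned workstation.

    owner_active: True if the keyboard or mouse is currently in use.
    seconds_of_continuous_use: how long the owner has been using the machine.
    migrate_after_seconds: the 'specified period' (hypothetical parameter name).

    Returns "run", "suspend", or "checkpoint_and_migrate".
    """
    if not owner_active:
        return "run"
    if seconds_of_continuous_use >= migrate_after_seconds:
        return "checkpoint_and_migrate"
    return "suspend"


if __name__ == "__main__":
    threshold = 600  # hypothetical value; the text does not give the period
    print(batch_job_action(False, 0, threshold))
    print(batch_job_action(True, 30, threshold))
    print(batch_job_action(True, 900, threshold))
```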

In practice, this mode of operation poses problems.

Tests were done on the time needed to checkpoint jobs across an Ethernet to a 'remote' disc. On an unloaded client-server pair, jobs of up to 30 MBytes took up to 1 minute, and jobs of 50 to 60 MBytes took 2 to 3 minutes. These times need to be doubled to cover migration to another unloaded workstation. On Ethernets carrying other traffic, times are correspondingly longer.
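The figures above can be turned into a rough worst-case estimate. The sketch below uses only the reported upper bounds and the doubling rule for migration; the function names are hypothetical. No measurement is given for jobs between 30 and 50 MBytes, so treating them like the 50 to 60 MByte jobs is an assumption that errs on the pessimistic side.

```python
# (largest job size in MBytes, reported worst-case checkpoint time in minutes)
# on an unloaded client-server pair.
REPORTED_CHECKPOINT_TIMES = [(30, 1.0), (60, 3.0)]


def checkpoint_minutes(job_mbytes):
    """Worst-case checkpoint time, in minutes, on an unloaded network."""
    for max_size, minutes in REPORTED_CHECKPOINT_TIMES:
        if job_mbytes <= max_size:
            return minutes
    raise ValueError("no measurement reported for jobs larger than 60 MBytes")


def migration_minutes(job_mbytes):
    """Checkpoint plus restart elsewhere: the text says to double the time."""
    return 2 * checkpoint_minutes(job_mbytes)


if __name__ == "__main__":
    for size in (20, 55):
        print(size, "MBytes:", migration_minutes(size), "minutes to migrate")
```

On a network carrying other traffic these estimates are lower bounds rather than upper bounds.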

The ability to checkpoint jobs is not widely used at Imperial College.


Operational Evaluation Of CODINE

In April 1994, CODINE was installed on a SGI Indigo workstation cluster in the Department of Aeronautics, which is used for teaching and research.

Simplified CODINE administration and user guides were produced and the cluster manager given training in the use of CODINE.

Four job classes were set up initially :

  • Small - 20 Mb Memory Limit

  • Medium - 35 Mb Memory Limit

  • Large - 70 Mb Memory Limit

  • Huge - Unlimited
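As an illustration of how a user might choose among the four initial classes, the sketch below picks the smallest class whose memory limit covers a job's stated requirement. The helper is hypothetical and is not part of CODINE; the text does not say whether class selection was manual or automatic.

```python
# Initial job classes and their memory limits in Mb; None means unlimited.
JOB_CLASSES = [("small", 20), ("medium", 35), ("large", 70), ("huge", None)]


def pick_job_class(required_mb):
    """Return the name of the smallest class that can hold the job."""
    for name, limit_mb in JOB_CLASSES:
        if limit_mb is None or required_mb <= limit_mb:
            return name


if __name__ == "__main__":
    for need in (12, 35, 64, 200):
        print(need, "Mb ->", pick_job_class(need))
```

Choosing a class this way depends on the user knowing the job's memory requirement in advance.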

In August the 'small' job classes were modified to be 'largeoutput' classes, with a 25 Mb memory limit. These were placed on machines with large scratch directories. To cope with users who submitted jobs from outside the cluster (and so were not under CODINE control) a daemon was written to kill non-CODINE jobs. We have made the source code for this daemon available for use at other sites.


Results of CODINE Evaluation

  • Elements of the graphical user interface work well

  • Easy supervision and control of jobs

  • Distribution of jobs between machines, and matching of jobs to machine resources, are better than occurs without the use of load-balancing software.

  • Very useful job accounting

  • No one was prepared to checkpoint long-running jobs

  • The need to migrate jobs did not arise

  • Job resource requirements must be known in advance; otherwise the job is terminated at runtime.

  • Requires the creation of large numbers of queues in order to run jobs simultaneously, making serious interactive use more difficult.

  • Ability to group queues in different ways simplifies administration.

  • No support for processor sets or non-degradable priorities on SGI.


Comparison Of CODINE and Monsanto-NQS


An essential question to be answered is whether CODINE provides a significant improvement over Monsanto-NQS.

The facilities provided by CODINE which are NOT provided by NQS are :

  • Graphical user interface

  • Ability to restrict number of jobs per user executing at any one time over the complete cluster

  • Ability to restrict users from a particular queue or from using CODINE at all.

  • Ability to checkpoint jobs

  • Ability to migrate jobs

The facilities provided by CODINE which can be emulated by NQS are :

  • Centralised job accounting.
  • Interactive work (CODINE's qsh command).

The facilities provided by NQS which are NOT provided by CODINE are :

  • Run more than one job simultaneously per queue.

  • Truly distributed - does not require a central node managing a cluster.

  • Mapping of users and groups between machines.

  • Interoperate with other commercial and public-domain products


Future Directions


Next Release - CODINE v4

GENIAS reports that the next release of CODINE will see the following new features :

  • A queue can run more than one simultaneous job

  • POSIX 1003 Draft 15 compliant

  • (E)PS output of the accounting graphical user interface

  • Real-time usage monitor

  • Hierarchical management

  • DCE/DFS Support

  • Kerberos/AFS Support

  • Scheduler can be replaced/configured

  • Access to CODINE can be restricted by group

  • Improved GUI

  • Parallel Make

  • Improved Monitoring & Error Tracing

  • Based on DQS 3


Next Release - Sheffield-NQS (formerly Monsanto-NQS) v4

The University of Sheffield reports that the next release of NQS will see the following new features :

  • Distributed Database

  • Improved Cluster Configuration

  • Improved Monitoring & Error Tracing

  • Improved Load Balancing

  • Kerberos/AFS Support

  • Larmouth Scheduling

  • RFC 1413 Support

  • Runtime loadable, stackable module support for extensibility

  • Tailored to stated UK requirements

The aim of Sheffield-NQS is to provide a product which can be very easily adapted in the future to meet changing requirements.


Pricing


Pricing - CODINE


Charges To Imperial College

Imperial College were quoted the following prices in December 1993 for a site license :

  ---------------------------------
  No. Of Students        Cost (DM)
  ---------------------------------
       1 - 20,000            6,200
  20,001 - 35,000            8,600
  35,001+                   14,800
  ---------------------------------

Prices in the UK are 10% higher. The cost figure is paid yearly, and an initial one-off payment of the same amount is due at the start of the site license.
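The pricing scheme above can be worked through as follows. This sketch rests on one reading of the text: the total paid after n years is the initial one-off payment plus n yearly payments, and the 10% UK uplift applies to both. The function names are hypothetical.

```python
# (largest student count in the band, cost in DM); None means no upper bound.
PRICE_BANDS_DM = [(20_000, 6200), (35_000, 8600), (None, 14800)]


def yearly_cost_dm(students, uk=True):
    """Yearly site license cost in DM for a given number of students."""
    for upper, cost in PRICE_BANDS_DM:
        if upper is None or students <= upper:
            # UK prices are 10% higher than the listed figure.
            return round(cost * 1.10, 2) if uk else cost


def total_cost_dm(students, years, uk=True):
    """Initial one-off payment plus one yearly payment per year of use."""
    return yearly_cost_dm(students, uk) * (1 + years)


if __name__ == "__main__":
    # Hypothetical site with 18,000 students over three years.
    print(yearly_cost_dm(18_000), total_cost_dm(18_000, 3))
```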


Charges To The University of Sheffield

The University of Sheffield reports that the latest prices sent on request by GENIAS are :

  • Pay an annual fee of 13,500 DM (5869 stlg), plus an initial fee of 13,500 DM (5869 stlg). This includes maintenance for three months.

    or

  • Pay a one-off fee of 45,000 DM (19,565 stlg). This includes maintenance for three months. Upgrades to new releases are extra.

Maintenance is an annual fee of 8,100 DM (3521 stlg). Prices in brackets assume an exchange rate of 2.3 DM to the pound. All prices include academic discounts.
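The two license options can be compared with a little arithmetic. The sketch below converts the license fees at the stated rate of 2.3 DM to the pound and finds the first year in which the cumulative cost of the annual option exceeds the one-off fee. It deliberately leaves out maintenance and upgrade charges, since the text does not say exactly how they combine with each option; the function names are hypothetical.

```python
DM_PER_POUND = 2.3   # exchange rate stated in the text
ANNUAL_FEE_DM = 13_500
INITIAL_FEE_DM = 13_500
ONE_OFF_FEE_DM = 45_000


def dm_to_pounds(dm):
    """Convert Deutschmarks to pounds sterling at the stated rate."""
    return dm / DM_PER_POUND


def annual_plan_dm(years):
    """Cumulative license fees under the annual option after `years` years."""
    return INITIAL_FEE_DM + ANNUAL_FEE_DM * years


def first_year_annual_plan_exceeds_one_off():
    """First year in which the annual option has cost more than the one-off fee."""
    years = 1
    while annual_plan_dm(years) <= ONE_OFF_FEE_DM:
        years += 1
    return years


if __name__ == "__main__":
    print(int(dm_to_pounds(ANNUAL_FEE_DM)), int(dm_to_pounds(ONE_OFF_FEE_DM)))
    print(first_year_annual_plan_exceeds_one_off())
```

On license fees alone, the annual option is the cheaper of the two only for sites that expect to use the product for two years or less.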


Conclusions


The use of batch processing software, such as CODINE, provides significant advantages in the running of batch jobs on a workstation cluster. The extra features of CODINE - the graphical user interface, checkpointing and job migration - which make it an improvement over NQS, are seldom used.

Imperial College has decided to adopt CODINE as its standard campus batch control package.


References


[1] Batch Processing Systems In The UK HE Community, S.Herbert, Academic Computing Services, University of Sheffield, October 1994.


