This is www.gnqs.org, The Home Of Batch Processing




Progress Report - February 1995

Academic Computing Services, University of Sheffield

Stuart Herbert (S.Herbert@Sheffield.ac.uk)

Document copyright ©. All rights reserved.


Abstract

JISC, as part of its New Technologies Initiative, has funded the University of Sheffield to supply and support batch processing systems to UK Higher Education.




Introduction


About This Progress Report

This report documents the progress achieved during December 1994, January 1995 and February 1995 on work related to batch processing systems.


Excerpt From The Funding Bid

The following subsection, taken from the successful funding bid, lists the major aims and objectives of the project which were outstanding at the start of December, 1994.


Aims And Objectives

The main objective is to help sites match their users' demand for computer power to the available equipment through the use of distributed batch systems. Currently the systems are available, but little or no independent information about the suitability of products is available. Getting to know the products sufficiently well to understand them without training is a very time-consuming and hence expensive task. Specific aims for the year of funding would be as follows :


Goals Achieved

The following goals from the Funding Bid were satisfied by the start of December, 1994.

  • to provide support on implementation, configuration and use through the setting up and monitoring of a list on Mailbase;

In addition, the following goals which did not feature in the Funding Bid have been satisfied.

  • Consultation of other UK Higher Education sites to determine their needs to help ensure that we meet those needs.


Goals Worked On

The following goals from the Funding Bid had been worked on by the start of December, 1994.

  • to implement and evaluate commercial and public domain distributed batch systems and in particular NQS and DQS

  • to provide a report, comparing the systems' utility

  • to provide a training course on selected systems at which the systems will be described and information on implementation and configuration will be given

  • to provide packaged releases for popular systems and in particular Sun Solaris 1 and Sun Solaris 2


Goals Not Worked On

The following goals from the Funding Bid had not been worked on by the start of December, 1994.

  • to provide simple end user documentation on selected systems (to augment the inevitably terse manual pages)


Expected Work - December And January 1994/1995

The following work was scheduled for the months of December 1994 and January 1995.

  • Complete examination of the NQS source code

  • Make all necessary structural changes

  • Bring the source code up to POSIX.1 compliance

In addition, the following existing commitments were to be honoured.

  • Monitor traffic on the Mailbase mailing lists, and provide whatever information/assistance is required.

  • Oversee bug fix releases of Monsanto NQS


Activities


Preamble

The last three months of the project, the third, fourth and fifth months of our NTI funding period, have seen the emphasis of the project move from a simple, direct support service to an active support-through-development service.

Below you will find a summary of all the activities which have been undertaken in the last three months. Full versions of the papers mentioned here are available via our World-Wide Web service.


Monsanto-NQS


Introduction

Monsanto-NQS, previously supported by John Roman of the Monsanto Company, is the leading freely-available version of NQS. It incorporates the functionality found in the other freely-available version of NQS, CERN NQS.

The University of Sheffield, through this NTI project, has accepted responsibility for world-wide maintenance, development and distribution of this product. Our policy has been that we will co-ordinate and incorporate new functionality developed by third parties, while providing only `bug fixes' ourselves.

We are working to produce a new set of source code, derived from Monsanto-NQS, to which we will then add new functionality.


Maintenance

Much time has been spent supporting Monsanto-NQS.

Monsanto-NQS is not a bug-ridden product; thanks to the work of John Roman, Monsanto-NQS is remarkably bug free. However, it is a complex product with a large amount of source code, which makes resolving even the smallest of problems a long and time-consuming process.

This process is made harder when errors occur which are specific to a flavour of UNIX which is not available here at the University of Sheffield. In this case, a new release (incorporating our proposed solution) has to be prepared, and once feedback has been received, a second release (incorporating the final, fixed solution) has to be prepared.

The following fixes were made to Monsanto-NQS during the three month period covered by this report :

  • Networking support for Solaris 2 was fixed

    The tracing of this one-line error took several weeks of my time, and turned out to be the usage of an incorrect structure at just one point in the code (which was unrelated to the network support).

  • HP-UX process statistics support was fixed

    This prevented compilation of NQS on HP-UX platforms, and took several weeks to resolve. Once this was corrected, and NQS could be compiled on HP-UX, it revealed other problems in the HP-UX support, which thankfully proved easy to solve.

  • Solaris 2.1 networking bug was avoided

    The networking libraries supplied with Solaris 2.1 do not work as documented, and a simple fix was installed to avoid incorrect behaviour when attempting to start the NQS service.

  • IRIX 5/6 signal bug was fixed

    The behaviour of some signals on IRIX changed between IRIX 4 and subsequent versions of IRIX, and changes to NQS were necessary to ensure that kernel-based resources were not ignored by users' jobs.

  • OSF/1 support was fixed

    NQS failed to compile on OSF/1 v2, because of previous changes made to correct other problems (mainly to do with HP-UX support). The previous changes were corrected, and the Makefile changed so that NQS would compile correctly.

  • HP-UX 8 support was fixed

    Many sites worldwide still run the older HP-UX 8, and changes were made to NQS to ensure that it would work with this older version of HP-UX.

  • AIX compiler problems were solved

    The AIX compiler suite is very poor in the way it handles memory; alternative drivers were developed which alter the way the AIX compiler suite works to simplify the work of the forthcoming NQS scheduler.

A significant proportion of these fixes were made in conjunction with other developers around the world, but even where complete contributions came from off-site, a significant amount of time is required to examine, understand, and test each contribution before it can be considered safe for inclusion in the next release.


Outstanding Problems

Time has been spent (so far unsuccessfully) resolving the following problems :

  • a number of sites (so far, all based outside UK academia) have reported problems related to passing an NQS request from one machine to another.

    Analysis of supplied log files shows that this failure is due to an internal database error. Code has been added to the database support in NQS in order to generate more information about this problem. Unfortunately, so far we have been unable to trip this extra debugging code; my current conclusion is that the bug lies elsewhere.

  • Submission of a request to NQS fails under known conditions on an IRIX 5 machine in Australia.

    Extra code has been added to NQS to generate further information about this problem, and we are awaiting the arrival of this extra information. However, this appears to be linked to the previous error, described above.

  • There are (reportedly) unknown problems relating to Solaris 2.4.

    So far, I have been unsuccessful in learning what the problems are, and so I must budget time to investigate and resolve these problems once we install Solaris 2.4 here in Sheffield.

Several weeks have been spent on these unresolved problems to date.


Porting

At the end of February, 1995, Monsanto-NQS is known to be available on the following distinct platforms :

  • IBM AIX

  • Hewlett-Packard HP-UX 8

  • Hewlett-Packard HP-UX 9

  • Silicon Graphics IRIX 4

  • Silicon Graphics IRIX 5

  • Silicon Graphics IRIX 6

  • Linux

  • AT&T (NCR) UNIX

  • Digital Equipment Corp. OSF/1 v1.x

  • Digital Equipment Corp. OSF/1 v2.x

  • Digital Equipment Corp. OSF/1 v3.x

  • Sun Microsystems Solaris 2.1

  • Sun Microsystems Solaris 2.2

  • Sun Microsystems Solaris 2.3

  • Sun Microsystems SunOS 4.1.x

In addition, porting to other generic BSD or System V Release 3 versions of UNIX should prove a simple task, although the Monsanto-NQS code is still not POSIX.1 compliant at this time.

One should note that commercial competitors cannot claim to support 15 versions of UNIX; Cray support 5, Unison support 6, OCS list 4.


Summary

Monsanto-NQS is the product which we currently supply and support to UK Academia. The on-going maintenance of Monsanto-NQS is considered to be very expensive in terms of time, and this activity has seriously affected the time available to work on other activities (notably development of Sheffield-NQS).


Sheffield-NQS


Introduction

Inside UK Academia, there is a lack of confidence in the overall quality of the source code for existing implementations of NQS. Existing implementations of NQS also suffer from an inherent lack of portability and extensibility, and require significant amounts of time to maintain.

The root cause of this is simply how old NQS is. First written in 1985, and now carrying the baggage of 10 years' worth of bug fixes, NQS is internally a mess.

The solution is to re-engineer the product, and the result of this work will be known as Sheffield-NQS. Once the initial release is stable, this project can then begin adding new functionality, as requested by UK Academia, to the product.


Design

The reason NQS has become such a mess over the last ten years is poor design. Maintainer after maintainer has found it far quicker to add their own code to perform some task (for example, support of electronic mail) than to check what support is already there and make use of that.

In order to avoid this, Sheffield-NQS from the outset must promote code reusability internally; essentially, it must provide a toolkit which is so good that whoever is maintaining it five years on will still turn to this toolkit rather than write his or her own little `fix'.

To achieve this, we have worked on the following :

  • Object-orientated technology

    C, as a language, is unsuited to the development of large toolkits, because of its simplistic structuring support. We quickly turned to C++, as the natural successor to C, as the language to use.

    If one avoids the use of multiple inheritance, a C++ class hierarchy proves easy to trace through and understand.

  • Base functionality

    With the adoption of C++, it was decided to produce an `application framework' C++ class library which would be used to build all of the NQS programs.

    Application frameworks are the ultimate in reusability, as they are intended to be used as the building blocks in all applications. The use of an application framework also simplifies porting between operating systems; one simply encapsulates the differences between operating systems inside the framework, and presents a single, coherent interface for the programmer to use. Porting software then becomes the far smaller task of porting the application framework.

    The success of Microsoft's MFC application framework, which provides an easy transition for Windows developers from 16-bit Windows 3.1 to 32-bit Windows 95, lends weight to the belief that this approach will bring excellent long term benefits.

    Current work has concentrated on developing the built-in debugging support for the class library (function call tracing, error logging and reporting, memory leak detection, and detection of incorrect memory usage), plus basic services such as standard generic container classes for use in building complex data structures. The next stage, which is not expected to take long, is to encapsulate UNIX networking and file services, providing transparent server/client support, and portable filing and directory handling. With that complete, work can begin on building actual NQS software.

    (The class library also supports message catalogues in a similar, but superior, style to X/Open's own work in this area.)

  • Internal redesign

    While it is important that Sheffield-NQS is backwards compatible with Monsanto-NQS, this only applies to two external interfaces - the supported command line utilities, and the networking protocols. The internal behaviour is in the process of being completely rebuilt, in order to produce a better product.

    For example, the current (and highly inefficient) database mechanisms are to be replaced by a decent distributed database engine. This provides multi-node fault-tolerance for scheduling purposes, something no other product can offer, and mechanisms for easy extension.

    For example, we are investigating the STREAMS model to see if a portable implementation can be added to the class library; this would allow us to build NQS as a set of stacked modules, making modification of NQS in the future at runtime a trivial task.

    For example, we are discussing on the NQS-Developers mailing list what generic interfaces we need to provide, so that future changes to the source code can be as trivial as possible.

  • New SETUP system

    In order to assist in porting to new architectures, we have developed a source tree layout, coupled with a SETUP script for installation purposes.

    Architecture-specific code now resides in separate subdirectories, and it is the role of SETUP to walk the user step by step through the installation of NQS for the current platform. Modified versions of the GNU Autoconf tools are used to assist in this task.

    Eventually, the SETUP script will be joined by a compiled program, which will feature a good user interface (both text-mode and X Windows based) to assist the installation. This will allow us to provide the same ease-of-installation as enjoyed by users of DOS and Windows software.
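The encapsulation idea described under `Base functionality' above can be sketched in a few lines. This is an illustration only, written in Python for brevity rather than in the C++ the project uses; the class names, method names and spool paths are all hypothetical and are not taken from the Sheffield-NQS class library.

```python
from abc import ABC, abstractmethod


class PlatformServices(ABC):
    """Hypothetical interface: the only place application code meets the OS."""

    @abstractmethod
    def spool_dir(self) -> str:
        """Return the directory in which queued requests are stored."""

    @abstractmethod
    def path_join(self, *parts: str) -> str:
        """Join path components using the platform's own rules."""


class UnixServices(PlatformServices):
    def spool_dir(self) -> str:
        return "/var/spool/nqs"  # assumed location, for illustration only

    def path_join(self, *parts: str) -> str:
        return "/".join(parts)


class WindowsServices(PlatformServices):
    def spool_dir(self) -> str:
        return "C:\\NQS\\SPOOL"  # assumed location, for illustration only

    def path_join(self, *parts: str) -> str:
        return "\\".join(parts)


def request_file(platform: PlatformServices, request_id: int) -> str:
    """Application code: written once, against the interface only."""
    return platform.path_join(platform.spool_dir(), f"req{request_id:05d}")


print(request_file(UnixServices(), 42))
print(request_file(WindowsServices(), 42))
```

In a layout of this kind, porting to a new operating system means writing one more subclass; `request_file`, and everything else built on the interface, is left untouched.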


Third Party Involvement

Due to the low staffing level of this project, we have been forced to seek as much outside help as possible in order to produce this product within a reasonable deadline.

  • Staff at the Aachen Technology Institute are developing an advanced load-balancer and scheduler for use with Sheffield-NQS.

    This was originally an in-house project, but they have kindly agreed to donate their final code to the main NQS distribution.

    As I understand matters, work has been underway for several months now.

  • Development of an X Windows user interface will come from code written in Germany, and will hopefully be supported by the Linux community.

    Linux International are a non-profit organisation working to promote the use of the Linux operating system; one of their projects is to develop a capable administration tool for Linux. The C++ class library and distributed database from NQS will be used in this project, and as part of this project, they will need to add X Windows support to the class library.

    Once the X Windows support has been added, I am confident of seeing the C++ application framework ported to 32 bit Windows, which will give us the opportunity to provide NQS clients running on PCs.


Summary

Development of Sheffield-NQS has begun. Attention so far has concentrated on producing a high quality underlying `toolkit' which can be used to produce the rest of the NQS software. Development is proceeding well, although progress has suffered due to time spent on other commitments, notably maintenance of Monsanto-NQS.


Other Activities


Meetings

During December, the project officer made a presentation about this project and the services and products offered, to the Silicon Graphics Special Interest Group, held in Leicester.

UK Academic sites using Silicon Graphics are seen as major customers for this project, because these sites currently use 4D/NQS, developed (and no longer supported) by Silicon Graphics. 4D/NQS is known not to work on the latest version of IRIX, IRIX 6.0x.

All sites present, when asked, indicated that they make use of batch processing, and concerns were also raised over the quality of the current NQS source code, and over what will happen once funding for this project ceases. These concerns formed the basis on which the bid for extra funding, and the schedule mentioned in the bid, were built.

We expect to be invited to the next meeting of this group, so that we can present participants with the latest information about our work.


Publicity

The January edition of Engineering Computing Newsletter carried a full page article publicising the products of this project. The emphasis of the article was on the use of NQS to make use of a number of UNIX workstations (as emphasised in the original Funding Bid) rather than the more traditional non-distributed batch work commonly found.

The impact of this article is difficult to assess, but certainly one site, the University of Keele, has since contacted us and indicated that, because of this article, they will definitely be using our products in the future.

In January, a letter was sent to the directors of all Computing Services at UK Universities, informing them about the project and inviting them to make use of our products.

Again, the impact of this letter is difficult to assess, but our ftp logs show that no fewer than 22 UK Academic sites downloaded files from our NQS area during January alone.

Further publicity is planned, both electronic and paper-based, but has been held off while development of Sheffield-NQS continues.


Contacts With Vendors

Over the last three months, we have been actively contacting vendors of distributed batch processing systems.

  • We are seeking to compile information about the available products, so that we can publish a paper detailing the commercial systems available.

    Such a paper complements our existing paper on public domain codes, and should prove useful to those sites who feel that they require a commercial product.

    No date has been set for completion of this paper, as we are waiting for information from several of the NTI Cluster projects regarding commercial vendors before compiling the paper.

  • We are seeking to produce, and ratify, an Internet standard protocol for distributed systems.

    Even if funding is found from an external source once NTI funding ceases, this project must realistically be considered finite. It is therefore important, from the viewpoint of UK Academia, to seek to promote the inter-operability of proprietary commercial products, so that there will be a migration path from public domain NQS to supported commercial alternatives.

Success so far has been varied.

  • We have established a contact with the NQE development team inside Cray Research, Inc., who have forwarded to us copies of their NQS Protocol documentation.

    Unfortunately, since then we have been unsuccessful in our technical queries over extending the NQS Protocol.

    We consider Cray, as the NQS market leader, to be essential to any plans to change the current protocols. Because of the role Cray has played in the past with respect to other NQS developers, any standard adopted by Cray will eventually filter down to products produced by other NQS developers. Without the support of Cray, we consider any attempt to change the standards to be uncertain of outcome.

  • We have established a contact with the sales team in the UK marketing Express/UX.

    Unfortunately, we have been unable to translate this into a contact inside the development team of Express/UX so far.

  • We have established a contact with the sales arm of Unison, who sell Load Balancer.

    We recently forwarded on information about the project, and our interests in other developers, and are waiting for them to contact us.

  • We are looking to establish contacts in Sterling Software, authors of Sterling NQS, and IBM, authors of LoadLeveler.

    Previous attempts to establish a contact inside Sterling were unsuccessful.


Uptake By UK Academia


Uptake By UK Academia

Our report `Interest In NTI Project : Distributed Batch Systems In A UNIX Environment' concludes that 27 UK Academic sites are using our NQS product to date.

In addition, we have undertakings from at least three more sites (Keele, Birmingham, Reading) that they will use NQS at some point in the future, making for a total of 30 sites.

We are working to promote NQS beyond these 30 sites, and to provide a migration path for these sites away from NQS towards commercially supported products at some point in the future.


Benefits To UK Academia


Introduction

This chapter looks at the benefits available to UK Academia if they use our version of NQS rather than one of the commercial products.

One should note that the survey of UK Academia showed conclusively that interest rests very firmly with some form of NQS as the preferred solution, although only one of the products mentioned below is part of the NQS family.

One should also note that the emphasis in this project is on the use of distributed batch processing, across clusters of workstations, and therefore the cost analysis below is based on clustered workstations rather than single-server installations.

We have agreed, with suppliers, costings for the following configuration, which we believe to be about right for one department running NQS.

  • One server with a license for fifty users, clustered with 20 workstations, each with a license for ten users.

    This bottom-end installation represents the very minimum installation we expect to see across sites using NQS for clustered arrangements.

One should also consider that, judging from the attitudes expressed in response to our original questionnaire, many sites would rather do without a batch processing system than purchase a commercial one, which increases the need for a freely-available (and supported) UNIX batch processing system.

Finally, this report assumes that it is not necessary to make a case for the concept of batch processing within these pages.


Commercial Products

There are four commercial products which we have information about.


Cray Research, Inc. NQE

Cray's Network Queueing Environment (NQE) is seen as the market leader, both in the NQS market, and in the wider, commercial batch processing system market. Cray view the product as enjoying a high priority.

Cray have (very) recently negotiated a CHEST deal for NQE to UK Academia. Their prices, however, are not cheap.

  • For our typical installation, an unlimited user license would have to be purchased, costing 14,375 pounds sterling.

    One would then purchase twenty load balancing servers (according to the documentation we have), at an additional cost of 5,250 pounds sterling.

  • This makes the total price for each of our installations 19,375 pounds sterling.

    For that, one gets a 30-day money back guarantee, product maintenance by telephone and email, and `minor revision releases'.

  • The total cost, to 30 sites, would therefore be 581,250 pounds sterling.

On top of this, one must consider the (unknown) cost of future upgrades. We have been unable to obtain any pricing information on their `Release 2.0' product line.

Contact Val Emerson at Cray Research (UK) on (0344) 722152 for further enquiries.


Express/UX

Express/UX is developed by OCS, and marketed in Europe by Open Seas. This product's main strength is its high quality user interface (clients are available for MS-Windows and, I am told, Motif as well); the scheduling support is, as I understand matters, currently undergoing revision.

This product, however, is expensive.

  • One would purchase a `first host', providing the main scheduling capabilities, for 8,000 pounds sterling.

  • One would then purchase an agent license for each of the workstations, at 1,000 pounds sterling each.

  • This makes the total cost for a single installation 28,000 pounds sterling.

  • The total cost, to 30 sites, would therefore be 840,000 pounds sterling.

Contact Jason Kent, at Open Seas, on (0865) 744656 for further enquiries.


Sterling NQS

Sterling NQS is the commercial version of NQS most widely known in the UK to date, although Cray's recent entry into the market is expected to change this.

As noted in our evaluation of batch processing systems, Sterling NQS has acquired a reputation for not delivering.

Attempts to contact Sterling have been unsuccessful to date.


Unison Load Balancer

Load Balancer is one of several products developed and marketed by Unison Software. Load Balancer is seen as best suited for working in interactive environments, and has very good scalability and job submission time - the fastest product available.

Prices look very good too :

  • A single `Master' license (essentially a scheduling server) costs 780 pounds sterling for an unlimited number of users.

  • A single `Agent' license (essentially client software) costs 370 pounds sterling, again for an unlimited number of users.

  • The cost for each of our installations comes to 8,180 pounds sterling, although discounts are available for volume and academic purchases.

    I have no information from Unison as to the availability of product support, but doubtless it is there.

  • The total cost, to 30 sites, would therefore be 245,400 pounds sterling.

Contact Janet Aitchison at Unison Software, on (0582) 462424 for further enquiries.
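The per-installation and 30-site totals quoted in this chapter can be reproduced with a short calculation. The sketch below is illustrative only: the function and variable names are hypothetical, and the prices are simply those printed above. The two NQE component prices as printed sum to 19,625 rather than 19,375, so the sketch takes the stated NQE total, which is the figure the later calculations rely on, and does not recompute it.

```python
WORKSTATIONS = 20  # workstations clustered with one server, as in the report
SITES = 30         # sites using, or committed to using, NQS


def installation_cost(server_price: int, per_workstation_price: int,
                      workstations: int = WORKSTATIONS) -> int:
    """One server licence plus one licence per clustered workstation."""
    return server_price + per_workstation_price * workstations


# Prices in pounds sterling, as quoted in this chapter.
express_ux = installation_cost(8_000, 1_000)  # `first host' plus agents
load_balancer = installation_cost(780, 370)   # `Master' plus `Agent' licences
nqe = 19_375                                  # stated total, not recomputed

for name, cost in [("NQE", nqe),
                   ("Express/UX", express_ux),
                   ("Load Balancer", load_balancer)]:
    print(f"{name}: {cost:,} per installation, "
          f"{cost * SITES:,} for {SITES} sites")
```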


Conclusion

The total cost for funding this project for two complete years is 32,600 pounds sterling. This provides all 213 UK Academia sites with :

  • An established batch processing system, which has been modified to meet the direct stated needs of UK Academia.

  • A highly portable, high quality implementation providing strong support for extension and customisation.

  • High quality support for installation and problem-solving.

  • High quality documentation written specifically for British users.

  • Compatibility with the de facto UNIX standard for batch processing, and compliance with major international UNIX standards.

The only equivalent commercial system, from Cray Research, would cost 4,126,875 pounds sterling to run on just 21 UNIX machines at each of 213 UK Academic sites.

This project therefore costs UK Academia less than one per cent of the price of the commercial equivalent; put another way, the commercial equivalent would cost some 12,659 per cent of the cost of this project, or roughly 127 times as much.

While these figures are above the true savings, one should remember that those UK sites which are using NQS could include 4-5 separate departments at each site, and each department is likely to have twenty or so workstations, making these figures closer to the true saving than it first appears ...
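The comparison above can be checked with a few lines of arithmetic. The sketch below uses only figures stated in this chapter; the variable names are hypothetical. It shows that the ratio of the two costs is about 126.59, which is where the figure of 12,659 per cent comes from, and that the saving expressed as a fraction of the commercial price is about 99.2 per cent.

```python
PROJECT_COST = 32_600  # two years of project funding, pounds sterling
NQE_PER_SITE = 19_375  # stated NQE price for one typical installation
ALL_SITES = 213        # UK Academic sites

commercial_total = NQE_PER_SITE * ALL_SITES

# How many times more the commercial route costs than the project.
ratio = commercial_total / PROJECT_COST

# Saving as a fraction of what the commercial route would have cost.
saving_fraction = (commercial_total - PROJECT_COST) / commercial_total

print(f"Commercial total: {commercial_total:,} pounds sterling")
print(f"Commercial cost as a multiple of project cost: {ratio:.2f}")
print(f"Saving relative to commercial cost: {saving_fraction:.1%}")
```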


JISC Criteria

The following criteria were published at the outset of the NTI :

  • Proposals must demonstrate mechanisms for transferring results and benefits to other HE institutions.

    We continue to satisfy this criterion, through our use of Mailbase, and our continued publicity, and by making the product available, 24 hours a day, at the convenience of anyone who wishes to take a copy.

  • The JISC must be satisfied that the projects show vision;

    The continued development of public domain NQS to meet the stated needs of UK Academia, and the emphasis in that development of producing a product which will require little/no support, surely qualifies.

  • are demonstrably effective;

    If batch processing were not effective, UK Universities would not be using it. And if NQS in particular were not effective, the same would surely apply.

  • and will involve key technologies for the future which would not be available to students and researchers without the support of this Initiative.

    High performance batch processing is a key technology if one wishes to get any work done in an otherwise overloaded UNIX environment. As our questionnaire shows, if this project did not exist, most UK sites would be unwilling to spend money on commercial alternatives, and so this key technology is only available to students and researchers through NTI.


World-Wide Usage

As the figures in our report show, this NTI project provides a product which is used around the world.

This must surely have some influence upon the international standing of UK Academia.


Summary


Aims And Objectives


Goals Achieved

No new goals have been satisfied during the three months covered by this report.


Goals Worked On

The following goals from the Funding Bid have been worked on.

  • to implement and evaluate commercial and public domain distributed batch systems and in particular NQS and DQS;

  • to provide a report, comparing the systems' utility;

  • to provide a training course on selected systems at which the systems will be described and information on implementation and configuration will be given;

  • to provide packaged releases for popular systems and in particular Sun Solaris 1 and Sun Solaris 2


Goals Not Worked On

  • to provide simple end user documentation on selected systems (to augment the inevitably terse manual pages).


Expected Work - March - May, 1995

  • Continue the construction of Sheffield-NQS

    We hope to achieve the first release no later than the end of April, but it must be pointed out that this may not be possible due to the amount of time currently spent supporting the existing source code.

  • Prepare publicity for Sheffield-NQS

    Given that the current takeup in UK Academia is impressive, we consider that time should not be devoted to publicity until the new batch processing system, Sheffield-NQS, is a completed product. Publicising further the existing product simply takes time away from development of the new product.

  • Prepare and publish a report on commercial batch processing systems.

    One of the major functions of a support service is to provide information, and currently, there appears to be no information available to UK Academia on the commercial alternatives.

  • Monitor traffic on the Mailbase mailing lists, and provide whatever information/assistance is required.

    Despite the amount of time consumed by this task, we consider that downgrading the amount of time given to this will only result in bad publicity, and will therefore be counter-productive.

  • Oversee bug fixes of Monsanto NQS.


