I wanted to scrap the existing Generic NQS source tree
and re-engineer it completely from scratch. After a debate on NQS-Developers
[more], I lost that argument.
So, the idea is to re-engineer Generic NQS in stages.
This is going to require parallel development. We have tried
(and failed) to do parallel development twice in the past.
Here's hoping it is a case of third time lucky.
Maintaining Production Releases
During the re-engineering work, I need to ensure that there is
always a production-quality version of Generic NQS which new
users can safely start using. The new policy I outline here is
designed to achieve that above all else.
Starting with Generic NQS 3.50.6, I will only accept patches
which fix bugs or portability issues against the current production
release. If you find a bug which is in both the production
and pre-production source trees, it will save me a lot
of time if you send me a patch against the production source
tree. I will make sure the fixes also go into the pre-production
source tree.
I will only accept patches for new functionality against the pre-production
source tree. The new functionality will not be patched
into the existing production source tree. This limits the
overhead of parallel development, at the expense of delaying new
functionality becoming available.
We will continue to make new production releases when the
quality of the pre-production source tree has reached a
satisfactory state. The list of milestones below is designed to
make sure that we make more production releases than we
ever have before. This will (hopefully) ensure that we don't
find ourselves once again scrapping a pre-production source
tree because end-users need the new features but the remaining
changes would take too long to finish.
Milestones
This is my outline of what order things need to be done in. I
expect that this document will become much more detailed as we get
stuck in.
- Generic NQS v3.52 Production
Strip out and replace the existing SETUP software.
New SETUP software must offer an autoconf-like configure script, must
produce usable Makefiles for experienced users & developers, and
must be able to create a local binary package (e.g. RPM, .pkg file)
for operating systems that support one.
My intention is to make Generic NQS as easy as possible to
install for all my users, as soon as possible. Hopefully
it will make it possible for more users to download and play with
the pre-releases, and so help ensure that the testing of Generic NQS
is better than it otherwise would be.
All compile-time features to become runtime features.
Ensure all choices which have to be made at compile-time can
instead be made through editing a new /etc/nqs.conf configuration
file at runtime.
NOTE: This is a temporary solution, and will
require replacing at a later stage.
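As a rough illustration of moving compile-time choices into a runtime file, the sketch below parses a simple "key = value" file and falls back to built-in defaults. The file syntax, the option names (spool_dir, max_queues, debug), their default values, and the helpers parse_config and load_config are all assumptions made for this example. Only the /etc/nqs.conf path comes from the plan above.

```python
# Hypothetical sketch: options that used to be fixed at compile time
# are looked up at runtime instead. Option names and defaults are invented.
DEFAULTS = {
    "spool_dir": "/usr/spool/nqs",  # assumed default, not taken from Generic NQS
    "max_queues": "32",
    "debug": "no",
}


def parse_config(text, defaults=DEFAULTS):
    """Parse 'key = value' lines; '#' starts a comment."""
    settings = dict(defaults)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected 'key = value'")
        key, value = (part.strip() for part in line.split("=", 1))
        if key not in defaults:
            raise ValueError(f"line {lineno}: unknown option {key!r}")
        settings[key] = value
    return settings


def load_config(path="/etc/nqs.conf"):
    """Read the file if it exists, otherwise fall back to the defaults."""
    try:
        with open(path, encoding="utf-8") as handle:
            return parse_config(handle.read())
    except FileNotFoundError:
        return dict(DEFAULTS)


example = """
# site-local overrides
max_queues = 64
debug = yes   # verbose logging
"""
settings = parse_config(example)
print(settings["max_queues"], settings["debug"])
```

Rejecting unknown option names is a design choice of the sketch: a typo in the file is reported at startup instead of silently leaving a default in place.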
- Generic NQS v3.53 Production
Isolate all file formats behind a new file-managing API.
Before existing file formats can be replaced, they have to be
hidden behind a suitable API. The new API will be designed so
that eventually Generic NQS can store and retrieve data from other
services, such as LDAP servers or SQL database servers.
NOTE: This is a large piece of work, and will require changes
to the very heart of Generic NQS.
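To make the idea of hiding file formats behind an API more concrete, here is a minimal sketch. It assumes a hypothetical RecordStore interface with two interchangeable backends: FlatFileStore, which stands in for the existing on-disk formats (JSON is used purely for brevity), and MemoryStore, which stands in for a future LDAP or SQL backend. None of these names or record layouts come from Generic NQS.

```python
from abc import ABC, abstractmethod
import json
import os
import tempfile


class RecordStore(ABC):
    """Hypothetical interface: callers never see the storage format."""

    @abstractmethod
    def get(self, kind, name):
        """Return the record as a dict, or None if it does not exist."""

    @abstractmethod
    def put(self, kind, name, record):
        """Create or overwrite a record."""


class FlatFileStore(RecordStore):
    """One JSON file per record; stands in for the existing file formats."""

    def __init__(self, root):
        self.root = root

    def _path(self, kind, name):
        return os.path.join(self.root, kind, name + ".json")

    def get(self, kind, name):
        try:
            with open(self._path(kind, name), encoding="utf-8") as handle:
                return json.load(handle)
        except FileNotFoundError:
            return None

    def put(self, kind, name, record):
        path = self._path(kind, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(record, handle)


class MemoryStore(RecordStore):
    """Stands in for a future LDAP or SQL backend."""

    def __init__(self):
        self._data = {}

    def get(self, kind, name):
        return self._data.get((kind, name))

    def put(self, kind, name, record):
        self._data[(kind, name)] = record


def raise_priority(store, queue_name, amount):
    """Caller code: written once, works with any backend."""
    queue = store.get("queue", queue_name)
    queue["priority"] += amount
    store.put("queue", queue_name, queue)
    return queue["priority"]


with tempfile.TemporaryDirectory() as root:
    for store in (FlatFileStore(root), MemoryStore()):
        store.put("queue", "batch1", {"priority": 10})
        print(type(store).__name__, raise_priority(store, "batch1", 5))
```

The point of the sketch is that raise_priority does not change when the backend does, which is the property the new API needs before any file format can be replaced.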
- Generic NQS v3.54 Production
Isolate all internal communications behind a new IPC API.
Before the existing NQS daemons can be replaced with a new
structure, all communications between the existing NQS daemons have
to be placed behind a new API.
As part of this work, a new NQS daemon (the dispatcher) will be
added. Its role is to manage IPC between all other NQS
daemons.
NOTE: This is a large piece of work, and will require changes
to the very heart of Generic NQS.
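The role described for the dispatcher can be sketched as a small in-process message router. This is only an illustration under stated assumptions: the Dispatcher class, the daemon names ("scheduler", "filed"), and the message layout are hypothetical, and a real implementation would sit on top of pipes or sockets between processes.

```python
from collections import deque


class Dispatcher:
    """Hypothetical stand-in for the planned dispatcher daemon."""

    def __init__(self):
        self._handlers = {}      # daemon name -> callable(sender, message)
        self._pending = deque()  # (sender, recipient, message), oldest first

    def register(self, name, handler):
        if name in self._handlers:
            raise ValueError(f"{name!r} is already registered")
        self._handlers[name] = handler

    def send(self, sender, recipient, message):
        """Queue a message; daemons never address each other directly."""
        if recipient not in self._handlers:
            raise KeyError(f"no daemon registered as {recipient!r}")
        self._pending.append((sender, recipient, message))

    def run_once(self):
        """Deliver queued messages until the queue is empty.

        Messages sent by handlers during delivery are delivered too.
        Returns the number of messages delivered.
        """
        delivered = 0
        while self._pending:
            sender, recipient, message = self._pending.popleft()
            self._handlers[recipient](sender, message)
            delivered += 1
        return delivered


log = []
bus = Dispatcher()


def scheduler(sender, message):
    log.append(("scheduler", sender, message))


def file_daemon(sender, message):
    log.append(("filed", sender, message))
    # Reply through the dispatcher, not straight back to the sender.
    bus.send("filed", sender, {"reply_to": message["op"], "ok": True})


bus.register("scheduler", scheduler)
bus.register("filed", file_daemon)
bus.send("scheduler", "filed", {"op": "load-queue", "name": "batch1"})
print(bus.run_once(), len(log))
```

Because every message passes through one place, a daemon can later be split or replaced without its peers needing to know.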
- Generic NQS v3.55 Production
Move all file-management code out into a separate daemon.
With the NQS dispatcher in place, it will now be possible to
start moving functionality out of the NQSdaemon and into separate
processes. Eventually we will be able to switch off the
NQSdaemon.
The first stage of this is to move all file-management code out
of all of the existing daemons into a separate daemon. The
daemon can then be safely modified later on (without affecting any
other part of NQS) to support LDAP, SQL databases, and any other
data source.
- Generic NQS v3.56 Production
Implement a new 'queue-manager' daemon.
At the moment, the Generic NQS scheduler is tightly coupled with
the basic facilities available to set up and maintain queues. A
new daemon will be introduced which exists purely to manage the
existence of the queues. The existing code inside the
NQSdaemon will then become more of a pure scheduler.
This means that we can then clean up all the administration
utilities, so that items in queues can be manipulated (and finally deleted)
without any hassle whatsoever.
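A minimal sketch of this split follows, assuming a hypothetical QueueManager that owns the queues and a Scheduler that only decides what runs next. The class names, the first-in-first-out policy, and the fixed queue order are illustrative assumptions, not the existing Generic NQS design.

```python
class QueueManager:
    """Hypothetical: owns the queues and their contents, nothing else."""

    def __init__(self):
        self._queues = {}  # queue name -> list of job ids, oldest first

    def create_queue(self, name):
        self._queues.setdefault(name, [])

    def submit(self, name, job_id):
        self._queues[name].append(job_id)

    def delete_job(self, name, job_id):
        """Remove a job wherever it sits in the queue."""
        self._queues[name].remove(job_id)

    def waiting(self, name):
        """Read-only snapshot, oldest job first."""
        return tuple(self._queues[name])


class Scheduler:
    """Hypothetical: decides what runs next; every queue change goes
    through the QueueManager."""

    def __init__(self, manager, queue_order):
        self.manager = manager
        self.queue_order = list(queue_order)

    def next_job(self):
        """Take the oldest job from the first non-empty queue, or None."""
        for name in self.queue_order:
            waiting = self.manager.waiting(name)
            if waiting:
                self.manager.delete_job(name, waiting[0])
                return name, waiting[0]
        return None


manager = QueueManager()
manager.create_queue("fast")
manager.create_queue("slow")
manager.submit("slow", "job-1")
manager.submit("fast", "job-2")
manager.submit("fast", "job-3")
manager.delete_job("fast", "job-2")  # an administrator removes a queued job

scheduler = Scheduler(manager, ["fast", "slow"])
print(scheduler.next_job(), scheduler.next_job(), scheduler.next_job())
```

In this arrangement an administration utility only ever talks to the queue manager, so deleting a queued job does not depend on the scheduler's internal state.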
- Generic NQS v3.57 Production
Implement a new 'NQS-Protocol v1' network daemon.
NQS-Protocol v1 is the network protocol currently used by NQS to
talk to other NQS systems, and to any commercial system which
inter-operates with NQS. At the moment, NQSdaemon is
structured around serving this protocol.
All support for servicing this protocol will be moved out into a
separate daemon. (Actually, I plan on totally re-implementing
this protocol, because of historical bugs in our existing
implementation which I've failed to track down and remove over the
years).
At the same time, changes will need to be made to what's left of
NQSdaemon to support the information flow the new network daemon
will require.
I think it goes without saying that this is, most definitely, the
single riskiest piece of work on the plan.
However, once this is done, we can add daemons to support other
network protocols (DQS, PBS, NQS-Protocol v2) without running the
risk of breaking the existing support.
- Generic NQS v3.58 Production
Separate out the batch spawning from the scheduling code.
The creation of a batch process should be triggered by the
scheduler, but actually managed by a separate daemon.
Once the code to manage batch processes has been separated out,
the daemon can then be modified to monitor running processes and
stomp on them if they abuse their resource limits.
We will also introduce a new type of queue, which we are calling
the 'ghost queue'. The basic idea is that each scheduler will
have a ghost queue for every batch or pipe queue on other machines
in the cluster. Each ghost queue will contain all the
information in the real queue. This means that the scheduler for
the very first time will know what is going on cluster-wide,
and will be able to provide the fine-grained control currently
supported by commercial systems.
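The ghost queue idea can be sketched as a local mirror that is refreshed from snapshots sent by the machine owning the real queue. Everything specific here is an assumption for illustration: the GhostQueue class, the snapshot fields (jobs, run_limit, running), and the "most free run slots" placement rule are not taken from Generic NQS.

```python
class GhostQueue:
    """Hypothetical local mirror of a queue that lives on another machine."""

    def __init__(self, host, name):
        self.host = host
        self.name = name
        self.jobs = []      # copy of the remote queue's waiting jobs
        self.run_limit = 0  # copy of the remote queue's run limit
        self.running = 0    # how many jobs the remote queue is running

    def apply_update(self, update):
        """Replace the mirrored state with a snapshot sent by the owner."""
        self.jobs = list(update["jobs"])
        self.run_limit = update["run_limit"]
        self.running = update["running"]

    def free_slots(self):
        return max(self.run_limit - self.running, 0)


def pick_destination(ghosts):
    """Choose the remote queue with the most free run slots, if any.

    Ties are broken in favour of the queue with fewer waiting jobs.
    """
    candidates = [g for g in ghosts if g.free_slots() > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (g.free_slots(), -len(g.jobs)))


ghosts = [GhostQueue("hostA", "batch"), GhostQueue("hostB", "batch")]
ghosts[0].apply_update({"jobs": ["a1", "a2"], "run_limit": 2, "running": 2})
ghosts[1].apply_update({"jobs": ["b1"], "run_limit": 4, "running": 1})

best = pick_destination(ghosts)
print(best.host, best.free_slots())
```

The scheduler makes its decision entirely from local data here; how often the snapshots are refreshed, and what happens when one is stale, is left open by the sketch.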
- Generic NQS v3.59 Production
Implement a 'managed objects' environment.
I want to be able to literally 'plug' new features in, to be
able to write a new 'object', register it with Generic NQS, and then
away we go.
The emphasis of v3.59 will be to take our newly architected code,
and add support for plugging in new features throughout the source
base.
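One common way to get this "write an object, register it, and away we go" behaviour is a registry that the core consults by name. The sketch below assumes that approach. The register and create helpers, the "queue-type" category, and the BatchQueue class are hypothetical and are not part of any agreed managed-objects design.

```python
_REGISTRY = {}  # (kind, name) -> class


def register(kind, name):
    """Class decorator: make a new object type known to the core."""
    def decorator(cls):
        key = (kind, name)
        if key in _REGISTRY:
            raise ValueError(f"{kind} {name!r} is already registered")
        _REGISTRY[key] = cls
        return cls
    return decorator


def create(kind, name, **options):
    """Instantiate a registered object by name.

    The core only needs the name, not the class itself.
    """
    try:
        cls = _REGISTRY[(kind, name)]
    except KeyError:
        raise LookupError(f"no {kind} named {name!r} is registered") from None
    return cls(**options)


# A new feature plugs itself in without any change to create().
@register("queue-type", "batch")
class BatchQueue:
    def __init__(self, run_limit=1):
        self.run_limit = run_limit

    def describe(self):
        return f"batch queue, run_limit={self.run_limit}"


queue = create("queue-type", "batch", run_limit=4)
print(queue.describe())
```

Refusing duplicate registrations keeps two plug-ins from silently replacing one another, which matters once features come from more than one author.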
- Generic NQS v3.60 Production
This will be v3.59 with bug fixes. I doubt I'll be able to
enforce a complete feature-freeze, but the priority for v3.60 is to
ensure that, after the restructuring, Generic NQS v3.60 is robust
and stable.
During this work, platform-specific code will be moved out into a
new libsal2 library as and when it is uncovered.
Beyond GNQS v3.60
Generic NQS v3.61 and onwards will seek to add support for new
features not possible under GNQS v3.50. More details will be
agreed later, but these features could include:
- Management of parallel jobs (PVM, MPI)
- Port to Win32 platform
- Inter-operability with PBS
- Inter-operability with DQS
- Scripting language to allow easy prototyping of new features