Document Code : MNQS0017
------------------------

The first part of this document is the instructions for installing this
implementation of NQS for the first time. Following that are the
instructions for upgrading to this version from the previous version of
NQS from Monsanto. At the end is some information for NQS load balancing
and Version staging.

------------------------------------------------------------------------

Steps required to install NQS for the first time:

1.  Unload the save set. It creates a directory tree called nqs-3.36.

2.  As superuser cd to the proto directory and edit `Makefile' :

    Review the makefile to ensure that you understand what it is doing.
    In particular, if you wish to install in "non-standard" locations,
    you must make some modifications.

    Several environment variables are used throughout the installation.
    These are defined as follows:

    NQS_ROOTDIR
        Set this to the directory where you want NQS to install. The
        default is /usr/local. Everything installed into this directory
        can be shared via NFS between multiple machines of the same
        architecture, running the same version of UNIX.

    NQS_ROOTPRIV
        Set this to the directory where NQS can create its transaction
        database. The default is /var/spool. Everything installed into
        this directory CANNOT be shared with any other machine.

    The following environment variables are defined using the two
    variables above; new NQS users should not need to change any of
    them. Existing NQS users may wish to change these variables
    accordingly.

    NQS_HOME
        This directory contains the NQS configuration file (nqs.config)
        that defines the rest of these environment variables to NQS.
        By default, this directory is /usr/local/lib/nqs. In a local
        network this directory can be shared by multiple heterogeneous
        nodes. This variable is not used in the Makefile, but see the
        note below.

    NQS_LIBEXE
        This directory contains the NQS daemon programs and
        administrative shell scripts.
        By default, this directory is /usr/local/lib/nqs.
        In a local network, this directory can be shared by multiple
        homogeneous nodes.

    NQS_NMAP
        This directory contains the NQS network mapping database. This
        directory is only required by NQS. No one needs to directly
        access it. By default, this directory is
        /usr/local/lib/nqs/nmap. In a local network, this directory can
        be shared by multiple heterogeneous nodes.

    NQS_SPOOL
        This directory contains the NQS spool files and queue database.
        This directory will contain around 1,000 i-nodes associated
        with NQS. By default, this directory will be /var/spool/nqs.
        This directory must be private to each node that executes NQS.

    NQS_STAGE
        This directory contains the version of NQS to be installed the
        next time NQS is quiescent. There is no default location. In a
        local network, this directory can be shared by multiple
        homogeneous nodes.

    NQS_USREXE
        This directory contains the NQS user interface and utility
        programs. This directory will be used by NQS users and must be
        in their search path. By default, this directory is
        /usr/local/bin. In a local network this directory can be shared
        by multiple homogeneous nodes.

    These symbols are defined near the top of the makefiles. Make
    changes as desired. If NQS is installed in locations other than the
    standard ones, then each NQS user must have the environment
    variable NQS_HOME defined, pointing to the directory which contains
    the file nqs.config. This file contains pointers to the various NQS
    directories.

    Finally, uncomment the `include' line in the Makefile for your
    particular version of UNIX, and save the Makefile.

    The commands to make and install NQS are:

        make                to compile and load the NQS binaries
        make directories    to build the NQS directory
        make install

3.  [SGI only] Install the man pages by doing a

        make maninst

    Rebuild the whatis database:

        /usr/lib/makewhatis

    For other systems, you will need to install the man pages manually
    from the man subdirectory.
    There is a man page called nqsconfig which is provided for local
    information on the NQS configuration.

4.  Edit /etc/services (or modify your YP database) to add nqs as port
    607/tcp:

        nqs     607/tcp         # Network Queueing System

5.  Set up the Machine ID database using nmapmgr. Each machine you wish
    to have in your NQS network must have a unique MID. There are two
    ways to set this up. One way is to explicitly assign MIDs. The
    usual way is to start numbering them at 1, in this manner:

    (nmapmgr will be installed in the location pointed to by the symbol
    NQS_USREXE in the makefile.)

        # nmapmgr
        NMAPMGR>: add mid 1 node
        NMAPMGR>: add name fqdn 1

    where node is the name of the node (beaker) and fqdn is the fully
    qualified domain name of the node (beaker.monsanto.com). Repeat for
    each node.

        NMAPMGR>: list

    This will list all the mids and names.

        NMAPMGR>: exit

    The other way is to implicitly assign MIDs based on the IP address
    of the various nodes. Follow this pattern:

        # nmapmgr
        NMAPMGR>: add host node
        NMAPMGR>: add alias fqdn node

    where node is the name of the node (beaker) and fqdn is the fully
    qualified domain name of the node (beaker.monsanto.com). Repeat for
    each node.

        NMAPMGR>: list

    This will list all the mids and names.

        NMAPMGR>: exit

    Consult the nmapmgr man pages for more information. Help is
    available at the NMAPMGR prompt by typing help.

6.  Start up NQS by typing:

    (The nqsdaemon will be installed in the location pointed to by the
    symbol NQS_LIBEXE in the makefile.)

        # /usr/lib/nqs/nqsdaemon

    If there is an error in the startup it will be written to the
    terminal. This command will cause three daemons to run: the main
    NQSdaemon, the logdaemon, and the netdaemon.

    If, after NQS is configured, you are satisfied with it, you can
    shutdown NQS using the "qmgr shutdown" command, and start it up
    using the command:

        # /usr/lib/nqs/nqsdaemon > /dev/null &

    (assuming that NQS_LIBEXE is in the standard place).
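    When the background command above is run from a startup script, it
    can be wrapped in a small guard so that a missing or relocated
    daemon is reported instead of being silently skipped. The following
    is only a sketch: the function name start_nqs is hypothetical (it
    is not part of NQS), and the fallback path shown in the usage
    comment assumes NQS_LIBEXE is /usr/lib/nqs as in the example above.

```shell
# Sketch only: start_nqs is a hypothetical helper, not part of NQS.
# Usage in a startup script (the path is an assumption; adjust it to
# the NQS_LIBEXE value chosen in the makefile):
#     start_nqs "${NQS_LIBEXE:-/usr/lib/nqs}"
start_nqs() {
    dir=$1
    if [ -x "$dir/nqsdaemon" ]; then
        # Same command as in step 6: discard stdout, run in background.
        "$dir/nqsdaemon" > /dev/null &
        return 0
    fi
    echo "nqsdaemon not found in $dir" >&2
    return 1
}
```

    The function returns non-zero when the daemon is not found, so the
    calling script can decide whether that is fatal.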
    Once you are satisfied with the system, you will want to put this
    nqsdaemon start-up command in your startup script.

7.  Now you should use the qmgr program to configure your system and
    add queues. Invoke qmgr as root:

        # qmgr
        Mgr: # Direct the log information to a file
        Mgr: set log_file /tmp/nqs-logfile
        Mgr: # Indicate the level of information
        Mgr: set debug 2
        Mgr: # Add a manager other than root
        Mgr: add managers yourself:m
        Mgr: # Create and enable a batch queue
        Mgr: create batch batch-queue
        Mgr: set default batch_request queue batch-queue
        Mgr: enable queue batch-queue
        Mgr: start queue batch-queue
        Mgr: show all
        Mgr: exit

    See the Qmgr man pages (or type help at the Mgr prompt) for more
    information on these commands.

    Exit root and test the system by typing "qstat -x". (You will
    probably have to rehash.) You should see information on the queue
    you just set up and it should indicate that the queue is
    "[ENABLED, INACTIVE]". Now submit a job by typing qsub, then date,
    then a control-d. Qsub will report that a batch request has been
    submitted. The stdout and stderr files will appear in your
    directory as STDOUT.o0 and STDERR.e0. Stderr should be empty unless
    your .profile (Bourne shell) or .cshrc + .login (C shell) execute
    commands which are not appropriate for a batch environment. Stdout
    will contain the output of the date command.

    If you want to create a pipe queue, the commands would be:

        Mgr: create pipe_queue pipe-queue destination=rqueue@there
        Mgr: enable queue pipe-queue
        Mgr: start queue pipe-queue

    These commands create a pipe queue called pipe-queue which will
    route jobs to the queue called rqueue on the machine called there.
    Note that the machine "there" will have to be defined using
    nmapmgr, and the local machine will have to be defined using
    nmapmgr on the "there" machine.

    NQS uses the rhosts mechanism for determining if access is
    permitted.
    When a remote request (such as a qstat or qsub from a remote
    system) is received, first the /etc/hosts.equiv file is checked for
    machine equivalency. If none is found, the .rhosts file in the
    user's home directory is checked. In this file, both the hostname
    and the username are expected. It may be necessary to include lines
    with both the hostname and the fully qualified hostname. Finally,
    if access is still not granted, NQS checks for a file called
    /etc/hosts.nqs. In its simplest form, it is similar to the .rhosts
    file, but it can also map remote usernames to local usernames. See
    the source file lib/mapuser.c for more information.

    Eventually, you can reduce the debug level to 0, but you should not
    direct the log_file to /dev/null, as failures report useful
    information to the log files.

8.  Create the file called NQS_LIBEXE/nqs-domain giving the names of
    all the machines in your "nqs domain". The format is a single
    machine name on each line. Lines beginning with a "#" are
    considered comments and are ignored. This file serves two purposes.
    It is the default list of machines that will be checked when a user
    does a "qstat -d" command. It is also the default list of machines
    that will receive broadcast messages if requested. This list is
    usually a subset of all the machines given Machine IDs. A sample is
    in misc/nqs-domain.dist.

    Note that this can be overridden on a per-user basis by creating a
    file called .qstat in the user's home directory. The format of the
    .qstat file is the same as the nqs-domain file.

9.  Build the msg programs required for broadcasts. cd to the msgd
    directory. Read the README there and check the Makefiles in the msg
    and msgd directories. The makefiles may need modifications for your
    environment. Then do a "make" and then as root "make install". This
    will install three new programs in /usr/local/bin. They are mesg,
    msg, and msgd.
    msg allows you to write to terminals anywhere on the network, msgd
    is the daemon which listens for requests to write to users
    remotely, and mesg is a replacement for the standard mesg program.
    See their man pages for more information.

    After the programs are installed, edit /etc/services and
    /usr/etc/inetd.conf to add the lines for msg. Then cause inetd to
    re-read its conf file by sending it a HUP signal (kill -HUP with
    the process ID of inetd).

10. If you want to use the NQS scheduler features and are running on
    RS6000s then you need the monitor package. A copy of that is on
    wuarchive in the nqs/unix directory. Build and install according to
    the instructions.

That should complete the installation.

------------------------------------------------------------------------

UPGRADE:

Steps required to upgrade to this release from the previous release from
Monsanto:

1.  Unload the save set. It creates a directory tree called nqs-3.36.

2.  If you have changed h/nqs.h or the Makefile for your architecture,
    then you must reconcile them with the previous versions. Since the
    format of the Makefiles has changed, you will have to check them
    manually.

3.  If you want to install NQS in a different location from the
    previous one, then you should consider this to be a new
    installation. There must be no queued requests if you want to move
    the location of the NQS database and spool files whose standard
    location is /usr/spool/nqs. The best way is to save all information
    about the current system by getting the qmgr snap file (indicated
    below) and getting a list of the currently defined Machine IDs
    using the nmapmgr program. Then follow the instructions above for
    the new install.

4.  Build the release in the proto subdirectory by doing a

        make -f Makefile.xxx

    where Makefile.xxx is the appropriate Makefile.?? file for your
    architecture/OS (sgi, ibm, hpux, etc.).

5.  Get a copy of the current NQS parameters by doing the following:

        $ qmgr
        Mgr: snap file=(file-name)
        Mgr: exit

    The qmgr commands are written to the specified file.

6.  Make sure there are no running nqs jobs and then shutdown nqs by
    doing a shutdown at the qmgr prompt.

7.  Install the new version by doing a

        make -f Makefile.xxx install

8.  [SGI only] Install the man pages by doing a

        make -f Makefile.sgi maninst

    Rebuild the whatis database:

        /usr/lib/makewhatis

    Everybody else will have to do this by hand out of the man
    subdirectory.

9.  [If desired] Rename the NQS accounting file to something else to
    keep a copy, but have all subsequent records get written to a clean
    file. The commands would be:

        # mv /usr/adm/nqs /usr/adm/nqs.old

    The old file will continue to be valid, but small changes in the
    accounting file make it convenient to start with a new file after
    the upgrade. There are no changes between 3.34 and 3.36, but there
    are from 3.3[123] to 3.34.

10. Start NQS up again:

        # qmgr start nqs

That will complete the upgrade.

SETTING UP FOR LOAD BALANCING:

NQS supports several levels of load balancing. In the simplest case,
there is no load balancing. If a pipe queue has several destinations,
and there is no load balancing, all requests are sent to one of the
destinations, ignoring the others, until the favored destination is
disabled.

A second level is that a pipe queue can be set up to be load balanced
outbound. This means that the destinations will be selected in a sort
of round robin algorithm, so that the jobs will be distributed more
evenly.

The third level is that the destination pipe queues for a local pipe
queue can be set load balanced inbound. This means that a destination
queue will refuse a request unless it can be run immediately. If none
of the destination queues can run the request immediately, it waits on
the source machine until one of the destination queues can run the
request.

The final level is implemented using the concept of an NQS scheduler.
A scheduler is a machine designated to distribute jobs to a set of
queues on several machines.
The scheduler has a generic batch queue, which is set to be load
balanced outbound and is directed to several pipe queues on the other
machines (and perhaps on itself). These queues are all load balanced
inbound. The scheduler machine will direct the available jobs to the
various compute resources based on available information on the power
and current load of the various machines. The power of the various
machines is specified by using the qmgr "set server performance"
command on the scheduler. The current load on the various machines is
provided by their load daemons. The load daemons provide the number of
NQS jobs running and the 1, 5 and 15 minute load averages. The
scheduler uses this information to order the possible destinations for
a request by their "ability" to run the job. The machines also report
completion of jobs to the scheduler. Then the scheduler can act right
away to attempt to run the next available job on that machine, thereby
minimizing idle cycles.

To set up this level of load balancing, do the following:

-   Designate a machine to be the scheduler. It will have a slightly
    greater compute load due to the scheduling processing, but it is
    more important that it be generally available than that it be the
    most powerful machine.

-   Create a pipe queue on the scheduler whose destinations are pipe
    queues on each of the remote machines which will be the compute
    engines.

-   The pipe queue on the scheduler machine is set to be load balanced
    outbound using the "qmgr set lb_out queuename" command.

-   The pipe queues on the remote machines are set to be load balanced
    inbound using the "qmgr set lb_in queuename" command. These queues
    must have a single destination which is the actual execution batch
    queue on that machine.

-   These execution queues can be set to be pipeonly, so that requests
    have to be submitted through the scheduler.

-   All of the machines within the "cluster" are configured to know the
    scheduler using the "qmgr set scheduler nodename" command.
    This means that the scheduler will be notified when jobs complete
    on the remote systems and will have some indication about the
    relative load on the various machines.

-   The default retry wait time can be increased, to reduce network
    traffic, using the "set default destination_retry wait" command.
    This command controls the interval at which the scheduler will
    retry delivering a request to a remote node. Since the scheduler
    will be informed of job completions and is aware of the number of
    running jobs on remote machines, frequent retries are superfluous.
    A value of 30 minutes may be appropriate.

NQS STAGING:

With Release 3.35 of Monsanto NQS, the concept of a "staged release"
has been introduced. A common problem with managing NQS is that
requests may run for several days, and if there are several on a
machine, it may be difficult to find an opportunity to install a new
release. It is possible to disable the queues, but that might be
inefficient under some circumstances.

A "staged release" is a new release of NQS that is placed in a
particular location. When NQS completes a request and would go into a
quiescent state it checks for a compatible new release in the staging
area. If one is found, NQS shuts itself down, installs the new version,
and starts itself up again.

This facility is set up by doing a "make -f Makefile.xxx stage" which
places the next release into the staging area determined by the
NQS_STAGE definition in the makefiles. A compatible release is defined
as one that has the same major number and a greater minor number or
patch level.

Due to my taking on other projects, it is unlikely that the Monsanto
NQS will be further enhanced. My ability to answer questions will be
severely limited, so I am glad to hear from you, but do not expect
timely responses to your questions. So be warned: If you want a
supported version of NQS, look elsewhere!

Also note, the only "official" ftp site for this distribution is
wuarchive.wustl.edu.
If you did not get it there then I cannot be sure you got an official
version. If this distribution is no longer on wuarchive, then you can
be sure that it is no longer supported.

------
John Roman
Monsanto Company                jrroma@beaker.monsanto.com
Chesterfield, MO 63198          (314) 537-7044