Reported Problems : Monsanto-NQS 3.37.0
Academic Computing Services , The University of Sheffield
Stuart Herbert (S.Herbert@Sheffield.ac.uk)
Document copyright ©. All rights reserved.
Abstract
JISC, as part of its New Technologies Initiative, has funded the
University of Sheffield to supply and support a freely-available
batch processing system for UNIX to the UK Higher Educational
community.
Introduction
This is the formal ``bug-list'' for Monsanto-NQS, based on actual
reports from the NQS user community.
Reporting Bugs
If you experience problems with Monsanto-NQS, please send a bug
report to `NQS-Support@mailbase.ac.uk', with the following
information :
> Reported By : (Who you are, and who you work for)
> Contact : (Preferred email address)
> Date : (Today's date)
>
> NQS Version : (Which version of NQS are you using?)
> Platforms : (Which operating systems are experiencing the problem?)
>
> Description : (What is the problem?)
> Solution : (Do you have a solution?)
Our dedicated staff (i.e., me) will attempt to get back to you as soon
as possible. Normally, if your mail is received before 5pm GMT on a
weekday, you should receive a reply the same day. Otherwise, I do
my best to reply by the end of the following weekday.
Reported Problems - January 1995
SunOS <-> AIX Routing Failure (UNSOLVED)
> Reported by : David Hernaiz, University of Barcelona
> Contact : <sistemes@probeta.qui.ub.es>
> Date : Mon, 9 Jan 95
>
> NQS Version : Monsanto-NQS v3.36.0
> Platforms : AIX, SunOS 4
>
> Description : Requests sent from an NQS node on SunOS 4 to an NQS
> : node on AIX result in the error message
> : ``Request not to be routed. Request deleted''
> Solution : None as yet
>
> Comments so far :
>
> Having looked at the logs, NQS is complaining that the pipeclient
> process cannot successfully read the nmap database. I have
> traced the error propagation back apparently as far as the
> routine ``nmap_get_nam''.
>
> Investigations continuing.
Linux Compilation Failure (UNSOLVED - CANNOT REPRODUCE)
> Reported by : Dan Rugotzke
> Contact : <rugotzke@nevada.edu>
> Date : Mon, 9 Jan 1995
>
> NQS Version : Monsanto-NQS v3.36.5
> Platforms : Slackware 2.0 distribution of Linux
>
> Description : ./src/lpserver.c failed to compile because the
> : header file <sgtty.h> should be <bsd/sgtty.h>.
> Solution : None as yet
>
> Comments so far :
>
> I have been unable to reproduce this problem. The Linux Makefile
> already tells GCC to look in /usr/include/bsd for BSD header
> files.
>
> No further action recommended. If the problem is reported again,
> I'll take another look at it.
OSF/1 v2.0 Compilation Failure (FIXED)
> Reported by : Andrew Cormack
> Contact : <scoanc@thor.cf.ac.uk>
> Date : Tue, 10 Jan 1995
>
> NQS Version : Monsanto-NQS v3.36.5
> Platforms : OSF/1 v2.0
>
> Description : Incomplete #if statement in ./lib/shoqbydesc.c
> : Massive complaints from the native compiler about
> : the ANSI prototypes.
> Solution : Use Monsanto-NQS v3.36.6 or later
>
> Comments so far :
>
> The problem with the #if statement was caused by my HPUX fixes in
> v3.36.5, and has been fixed in v3.36.6.
>
> The prototypes one is more serious. We added ANSI prototypes
> using `protoize', which really left quite a mess, imho. Anyway,
> I understand that using the `-std1' switch with cc(1) works
> around this, and I've added this to v3.36.6's Makefile.
>
> Situation is being monitored - hopefully v3.36.6 fixes these
> problems.
Unable To Apply The Patches (FIXED)
> Reported by : Many users
> Contact : N/A
> Date : First reported Fri, 20 Jan 1995
>
> NQS Version : Irrelevant
> Platforms : OSF/1, IRIX 5 & 6 for sure, probably others as well
>
> Description : When attempting to apply patches, the patch(1) program
> asks for a file to patch, and generally does not
> understand the patch files.
> Solution : use patch-2.1.tar.gz from your local GNU mirror
> (UK, use src.doc.ic.uk:/gnu)
>
> Comments so far :
>
> It appears that a number of vendors, most notably DEC and SGI,
> ship an old version of patch, which does not understand unified
> context diffs (the output of diff -u). The solution is to
> compile and install the latest version of patch.
Unable To Compile On HP/UX (FIXED)
> Reported by : Olivier Pirotte
> Contact : Pirotte@bavax.bartho.ulg.ac.be
> Date : Wed, 25 Jan 1995
>
> NQS Version : Monsanto-NQS 3.36.6
> (Probably affects earlier versions too)
> Platforms : HPUX 9
>
> Description : NQS fails to compile.
> Solution : Use Monsanto-NQS 3.36.7 or later.
>
> Comments so far :
>
> The change to ANSI C (in 3.36.4) broke the Makefile.hpux, as the
> compiler requires the -Ae switch in order to compile ANSI C.
No Account Authorization At Transaction Peer (FIXED)
> Reported by : Olivier Pirotte
> Contact : Pirotte@bavax.bartho.ulg.ac.be
> Date : Wed, 25 Jan 1995
>
> NQS Version : All versions
> Platform : HPUX 9 (but affects all others too)
>
> Description : Attempting to submit a job to a pipe queue results in
> the error message ``No account authorization at
> transaction peer.''
> Solution : Create the file /etc/hosts.nqs. Place in this file
> two lines for every machine in the nmapmgr database:
> one line for the long name of the machine, and one
> line for the short name.
>
> Comments so far :
>
> This is a common setup mistake, and is easily solved by adding
> a /etc/hosts.nqs file to each machine running NQS. In this text
> file, place an entry for every machine which is permitted to send
> NQS requests via pipe queues to the machine. Each entry consists
> of two lines - one line for the short name, and one line for the
> long name of the machine.
>
> Eg:
>
> stoat
> stoat.shef.ac.uk
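The two-line entry format described above can be generated mechanically. The following is a minimal sketch, not part of NQS: the function names `short_name` and `write_hosts_nqs_entry` are hypothetical, and it assumes the short name is simply the part of the long name before the first dot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the short name (the part of fqdn before the first '.') into out.
 * Returns 0 on success, -1 if out is too small. */
int short_name(const char *fqdn, char *out, size_t outlen)
{
    size_t n;

    assert(fqdn != NULL && out != NULL);
    n = strcspn(fqdn, ".");          /* length up to the first dot */
    if (n + 1 > outlen)              /* need room for the '\0' too */
        return -1;
    memcpy(out, fqdn, n);
    out[n] = '\0';
    return 0;
}

/* Write the two-line entry for one machine: short name, then long name.
 * Returns 0 on success, -1 on error. */
int write_hosts_nqs_entry(FILE *fp, const char *fqdn)
{
    char shortbuf[256];

    if (short_name(fqdn, shortbuf, sizeof shortbuf) != 0)
        return -1;
    return fprintf(fp, "%s\n%s\n", shortbuf, fqdn) < 0 ? -1 : 0;
}
```

Calling `write_hosts_nqs_entry(stdout, "stoat.shef.ac.uk")` writes the same two lines as the example entry. The output would still need to be collected into /etc/hosts.nqs on each machine by hand.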
Reported Problems - February 1995
Shared Installations Using NFS (FIXED)
> Reported by : Neil Smith
> Contact : neils@csrp.tamu.edu
> Date : Fri 10 Feb 1995
>
> NQS Version : All version 3
> Platform : All
>
> Description : Various NQS processes complain about being unable
> to access files or directories which reside on
> NFS-mounted partitions.
> Solution : Re-export your NFS partitions so that requests
> from processes running as root (uid 0) are NOT
> remapped to another user (typically nobody).
>
> Comments so far :
>
> A number of NQS components run as setuid root, including the
> daemons and qsub. These processes need to be able to access
> a number of files while running as user-id 0. A typical NFS
> setup will force user-id 0 on remote machines to be treated
> as `nobody', requiring world rights to files and directories.
> This breaks the setuid root components of NQS.
>
> NQS v4 will attempt to reduce, if not remove, much of this
> problem.
Environment Variables (FIXED)
> Reported by : Thomas Ziehmer, RHRK, University of Kaiserslautern
> Contact : ziehmer@rhrk.uni-kl.de
> Date : Tue, 21 Feb 1995
>
> NQS Version : Monsanto-NQS v3.36.6
> Platforms : SGI IRIX6, IRIX5.2, LINUX 1.1.52 and others
>
> Description : In the environment, the LOGNAME, MAIL, TZ and
> QSUB_HOST are concatenated in one line, and in
> another MAIL, TZ, and QSUB_HOST.
> Solution : Use Monsanto-NQS v3.36.7 or later
>
> Comments so far :
>
> The problem is caused by calculations in nqs_reqser.c failing
> to count the NUL byte terminating a string. Thomas submitted a
> patch which will be incorporated into v3.36.7.
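The kind of length calculation involved can be illustrated with a minimal sketch. This is not the code from nqs_reqser.c, nor Thomas's patch; the function names `env_entry_size` and `make_env_entry` are hypothetical. It shows the size needed for a single `NAME=value` string, where the final `+ 1` accounts for the terminating '\0'. If that byte is left out of the count and entries are packed one after another, the next entry can overwrite the terminator, which would run the strings together as described in the report.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Bytes needed to hold "NAME=value", including the terminating '\0'. */
size_t env_entry_size(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return strlen(name) + 1 /* '=' */ + strlen(value) + 1 /* '\0' */;
}

/* Allocate and build "NAME=value". The caller frees the result.
 * Returns NULL if memory cannot be allocated. */
char *make_env_entry(const char *name, const char *value)
{
    size_t size = env_entry_size(name, value);
    char *entry = malloc(size);

    if (entry == NULL)
        return NULL;
    snprintf(entry, size, "%s=%s", name, value);
    return entry;
}
```

For example, `env_entry_size("TZ", "GMT")` is 7: two bytes for the name, one for `=`, three for the value, and one for the terminator.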
Output Redirection (UNSOLVED - DEBUGGING ADDED)
> Reported by : Michael Shephard
> Contact : michaels@jake.chem.unsw.edu.ac
> Date : Wed, 22 Feb 1995
>
> NQS Version : Not stated
> Platforms : IRIX 5
>
> Description : Using the `-o' switch for output for some users
> results in an error message from qsub about
> being unable to determine the machine-id.
> Solution : None.
>
> Comments so far :
>
> This *appears* to be some form of configuration problem -
> existing users could use the -o switch, but all new users
> could not.
>
> The error code in qsub(1) which handles this problem is
> ambiguous, and a patch has been added to 3.36.7 in order
> to give more information.
>
> This problem has not been reported by anyone else using IRIX.
Pipe Queue Problems On OSF/1 v3 (UNSOLVED)
> Reported by : Andrea Testa, Ecole Polytechnique Federale de Lausanne
> system manager in the Physics Department
> Contact : andrea.testa@sd-p.dp.epfl.ch
> Date : Fri, 24 Feb 1995
>
> NQS Version : Monsanto-NQS 3.36.0 + in-house fixes for OSF/1
> support
> Platforms : OSF/1 v3
>
> Description : A pipe queue on the DEC is unable to deliver jobs
> correctly to remote queues - they get stuck
> in the arriving state.
> Solution : None.
>
> Comments so far :
>
> This appears to be of the same nature as the first problem reported
> in January 1995.
Reported Problems - March 1995
NQS Daemons Stopping (UNSOLVED - PERHAPS UPGRADE)
> Reported by : Cordula Reineke
> Contact : ratte@iam.uni-bonn.de
> Date : Wed, 29 Mar 1995
>
> NQS Version : Monsanto-NQS 3.35
> Platforms : IRIX 5.2, IRIX 5.3
>
> Description : On machines which only have pipe queues to route
> jobs to other machines, the daemons just lock up,
> and require killing/restarting by hand.
> This happens several times a day.
> Solution : Upgrade to 3.36.6 or later?
>
> Comments so far :
>
> We don't support anything before Monsanto-NQS 3.36.0, so it's
> difficult to investigate and comment. We are unaware of any
> such problem AT THIS TIME with Monsanto-NQS 3.36.x.
Bus Error With `qstat -a' on IRIX 6 (UNSOLVED)
> Reported by : Cordula Reineke
> Contact : ratte@iam.uni-bonn.de
> Date : Wed, 29 Mar 1995
>
> NQS Version : Monsanto-NQS 3.35
> Platforms : IRIX 6.0.1
>
> Description : qstat -a results in a bus error, while qstat -sa
> works fine.
>
> Solution : None.
>
> Comments so far :
>
> This appears to be a problem in the NQS code, but has not been
> actively investigated at this time.
Bus Error With `qmgr show managers' Or `qmgr' (WORKAROUND)
> Reported by : Cordula Reineke
> Contact : ratte@iam.uni-bonn.de
> Date : Wed, 29 Mar 1995
>
> NQS Version : Monsanto-NQS 3.36.6
> Platform : IRIX 5.3
>
> Description : qmgr gives a bus error when attempting to show the
> list of managers.
> Solution : Remove the file NQS_SPOOL/private/root/database/
> managers.
>
> Comments so far :
>
> This is caused by a combination of :
>
> o Assigning manager rights to a non-root user
> o Then changing the machine_id of the machine
>
> It turns out that the mid of the manager is stored in the database,
> and so, when the mid of the machine is changed, the mid of the
> manager is no longer valid.
>
> A patch against this problem will be produced shortly.
>
> Many thanks to Mark Grieshaber at Monsanto for an excellent
> investigation into this problem.
Reported Problems - April 1995
Compilation Failure On IRIX 6 (FIXED)
> Reported by : Cordula Reineke
> Contact : ratte@iam.uni-bonn.de
> Date : Mon, 3 Apr 1995
>
> NQS Version : Monsanto-NQS 3.36.6
> Platform : IRIX 6
>
> Description : NQS fails to compile, complaining about being
> unable to locate the file for -lnqs.
> Solution : Use Monsanto-NQS 3.36.7 or later
>
> Comments so far :
>
> This is just an oversight in the Makefile.sgi6 - simply add
> `-L.' to the front of the LINKLIBS line, and all is well.
Monsanto-NQS 3.36.7 Pre-Release 1 Doesn't Work (FIXED)
> Reported by : Cordula Reineke
> Contact : ratte@iam.uni-bonn.de
> Date : Tue, 11 Apr 1995
>
> NQS Version : Monsanto-NQS 3.36.7 Pre-Release 1
> Platform : All
>
> Description : Submitting a request results in an unanticipated
> transaction failure reported by qsub.
> Solution : Use 3.36.7 pre-release 2 or release version or later
>
> Comments so far :
>
> This was entirely my fault - debugging code was added to the
> request process which always failed, because the test was
> hopelessly wrong. This one fault took weeks to find, and caused
> high levels of inconvenience to all concerned.
Reported Problems - May 1995
Problem Compiling NQS 3.36 On OSF/1 v1.3 (FIXED)
> Reported by : Michael Pope
> Contact : ln1mgp@entoil.co.uk
> Date : Tue, 9 May 1995
>
> NQS Version : Monsanto-NQS 3.36.0
> Platform : OSF/1 v1.3
>
> Description : NQS fails to compile, complaining about BAD SYSTEM
> TYPE.
> Solution : Upgrade to 3.36.6 or later, and upgrade to a much
> later version of OSF/1.
>
> Comments so far :
>
> This problem *appears* to be caused simply by the C pre-processor
> on OSF/1 v1.3 being unable to correctly parse pre-processor
> directives regarding conditional compilation.
>
> I am informed that users should upgrade to at least OSF/1 v3.
Segmentation Fault On OSF/1 v2.1 (FIXED)
> Reported by : Matsushita Takashi
> Contact : matsu@phys.metro-u.ac.jp
> Date : Thu, 18 May 1995
>
> NQS Version : Monsanto-NQS 3.36.6
> Platform : OSF/1 v2.1
>
> Description : NQS core dumps while booting.
> Solution : Use Monsanto-NQS v3.36.7 or later.
>
> Comments so far :
>
> This was caused by printf("%s\n", mid); where mid was an
> unsigned long. Why no other version of UNIX has complained
> is beyond me.
>
> Many thanks to Matsushita Takashi for a post-mortem of the core
> using gdb.
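The `%s` conversion expects a pointer to a string, so passing an unsigned long is undefined behaviour in C: it may appear to work on some systems and crash on others. A minimal sketch of the matching conversion follows. The helper name `format_mid` is hypothetical and is not taken from the NQS source; only the type of `mid` comes from the report above.

```c
#include <assert.h>
#include <stdio.h>

/* Format a machine-id held in an unsigned long into buf.
 * Returns the number of characters needed (excluding the '\0'),
 * or a negative value on error, as snprintf() does. */
int format_mid(char *buf, size_t buflen, unsigned long mid)
{
    assert(buf != NULL);
    /* %lu matches unsigned long; %s would treat mid as a char pointer. */
    return snprintf(buf, buflen, "%lu", mid);
}
```

Compilers that check format strings (for example GCC with `-Wall`) can flag this class of mismatch at compile time.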
Queue Lockups On OSF/1 (UNSOLVED)
> Reported by : John Peden
> Contact : pdxjfp@evol.gene.nottingham.ac.uk
> Date : Tue, 30 May 95
>
> NQS Version : Monsanto-NQS 3.36.x
> Platform : OSF/1 v3.2
>
> Description : An NQS batch queue will spawn a request, and then
> fail to run any other requests in the queue until the
> request is deleted using qdel.
> Problem is intermittent, and appears to happen under
> load.
> Solution : None.
>
> Comments so far :
>
> John stress-tested NQS, and found a failure rate of just 1%.
> Debugging code will be added to 3.36.7 in order to provide
> further information about the cause of the problem.
Reported Problems - June 1995
Unanticipated Transaction Failure - Intermittent (WORKAROUND)
> Reported by : Philippe A. Bopp
> Contact : pab@hulot.lsmc.u-bordeaux.fr
> Date : Tue, 6 Jun 1995
>
> NQS Version : Monsanto-NQS 3.36
> Platform : AIX
>
> Description : The occasional RCM_UNAFAILURE message appears in
> the NQS log files. This appears to happen under
> load.
> Solution : None.
>
> Comments so far :
>
> There have been no similar reports. We are unable to actively
> look into the problem, as it does not affect known UKHE
> installations.
>
> Philippe has since reported that this is caused by having `qsub'
> as the last command of a NQS request. A workaround is to place a
> `sleep 10' after the `qsub' statement in the request.
Accounting Error (UNSOLVED)
> Reported by : Thomas Eifert
> Contact : Eifert@rz.rwth-aachen.de
> Date : Fri, 9 Jun 1995
>
> NQS Version : 3.36.7 pre-release 2 (and earlier versions)
> Platform : IRIX 5.2 (probably IRIX 6 too)
>
> Description : qacct reports that a process uses CPU-time which
> is 100 times what was actually used. The time
> reported by qstat is correct, suggesting a bug in
> qacct.
> Solution : None.
>
> Comments so far :
>
> Not investigated. Will investigate when time allows.
Process Time Inaccuracy - qstat (UNSOLVED)
> Reported by : Thomas Eifert
> Contact : Eifert@rz.rwth-aachen.de
> Date : Mon, 12 Jun 1995
>
> NQS Version : 3.36.7 pre-release 2
> Platform : IRIX 5.2
>
> Description : The time reported by qstat is no longer accumulated
> over several processes that run within one job.
> Solution : None.
>
> Comments so far :
>
> Not investigated. There was a change in signal handling back
> in 3.36.5, which may be relevant to the problem.
Environment Corruption (FIXED)
> Reported by : Chang Keng Seng
> Technical Consultant
> Computervision Services, Singapore
> Contact : chang@pspore.cv.com
> Date : Fri, 16 June 1995
>
> NQS Version : 3.36.6 (and earlier versions)
> Platform : SOLARIS 2
>
> Description : The environment variables set by NQS are concatenated
> together.
> Solution : Use Monsanto-NQS 3.36.7 or later.
>
> Comments so far :
>
> The routines which build the environment didn't count the NUL
> terminator at the end of each environment string. Fixed by
> Thomas Ziehmer.
/sbin/pset : unknown set name (SOLVED)
> Reported by : Phil Chambers
> Contact : P.A.Chambers@exeter.ac.uk
> Date : Tue, 27 Jun 1995
>
> NQS Version : 3.36.6 (I think - Stu)
> Platform : IRIX 6
>
> Description : A message appears in the NQS logs of the form
> /sbin/pset: unknown set name.
> Solution : Create processor sets with the same name as your
> batch queues, or recompile NQS without the `-DTAMU'
> option to disable pset support.
>
> Comments so far :
>
> At first sight, this appears to be a bug in pset. Investigations
> are continuing.
Solaris 2.3 stdout Delivery Problem (INSTALLATION ERROR)
> Reported by : Rob Creecy
> Contact : rcreecy@census.gov
> Date : Tue, 27 Jun 1995
>
> NQS Version : Monsanto-NQS v3.36.7 pre-release #2
> Platform : Solaris 2.3
>
> Description : The command `qsub<CR>ls<CR>^D' (as an example)
> results in email detailing that the NQS request
> was aborted by signal 11 (SIGSEGV).
> Solution : None at present.
>
> Comments so far :
>
> I cannot reproduce this bug locally - we've been running NQS on
> Solaris 2.3 for nearly a year now. I've asked for the analysis
> of the core file, which should help track down the problem
> further.
>
> Rob has since reported that he recompiled and reinstalled NQS,
> and NQS appears to be working fine. This appears to have simply
> been a subtle installation error.
>
> There is a good case for arguing that NQS should be immune to
> such problems ...
Defunct Processes On HP-UX 10.0 (UNSUPPORTED MACHINE)
> Reported by : Mouri Yoshihiro
> Contact : y-mouri@jkk.hitachi.co.jp
> Date : Thu, 29 Jun 1995
>
> NQS Version : Monsanto-NQS v3.36.?
> Platform : HP-UX v10.0
>
> Description : <defunct> processes appear, whose parent pid is
> NQS's netdaemon.
> Solution : None as yet.
>
> Comments so far :
>
> At the time of writing, we do not have support for HPUX 10.0
> in the Monsanto-NQS source tree. The problem should be solved
> by changing the way netdaemon.c handles SIGCHLD. The next
> release of Monsanto-NQS will include a preliminary patch to
> attempt a fix.
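One common POSIX approach to avoiding <defunct> children is to reap them from a SIGCHLD handler. The sketch below illustrates that general technique only; it is not the preliminary patch mentioned above and does not reflect how netdaemon.c is actually written. The names `reap_children` and `install_sigchld_reaper` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every child that has already exited, without blocking. */
static void reap_children(int signo)
{
    int saved_errno = errno;   /* waitpid() may overwrite errno */

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;                      /* keep going until no exited child remains */
    errno = saved_errno;
}

/* Install the handler. Returns 0 on success, -1 on error. */
int install_sigchld_reaper(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    /* Restart interrupted system calls; ignore stop/continue events. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

The loop matters because several children may exit before the handler runs, and pending SIGCHLD signals are not queued individually. A daemon that needs each child's exit status would have to collect it inside the loop instead of passing NULL.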
Reported Problems - July 1995
Netdaemon: Error Getting Local Host's MID (UNSOLVED)
> Reported by : Jim Talley
> Contact : talley@lexicus.mot.com
> Date : Thu 6 July 1995
>
> NQS Version : Monsanto-NQS v3.36.7 pre-release 2
> Platform : SunOS 4.1.x
>
> Description : Netdaemon fails to run on starting up NQS. It
> reports that it is unable to determine the machine-id
> of the local host, and reports the error value 2.
>
> Solution : None at present.