This is www.gnqs.org, The Home Of Batch Processing




Reported Problems : Monsanto-NQS 3.37.0

Academic Computing Services , The University of Sheffield

Stuart Herbert (S.Herbert@Sheffield.ac.uk)

Document copyright ©. All rights reserved.


Abstract

JISC, as part of its New Technologies Initiative, has funded the University of Sheffield to supply and support a freely-available batch processing system for UNIX to the UK Higher Educational community.



Introduction

This is the formal ``bug-list'' for Monsanto-NQS, based on actual reports from the NQS user community.


Reporting Bugs

If you experience problems with Monsanto-NQS, please send a bug report to `NQS-Support@mailbase.ac.uk', with the following information :

> Reported By : (Who you are, and who you work for) 
> Contact     : (Preferred email address) 
> Date        : (Today's date) 
> 
> NQS Version : (Which version of NQS are you using?) 
> Platforms   : (Which operating systems are experiencing the problem?) 
> 
> Description : (What is the problem?) 
> Solution    : (Do you have a solution?) 

Our dedicated staff (i.e., me) will attempt to get back to you as soon as possible. Normally, if your mail is received before 5pm GMT on a weekday, you should receive a reply the same day. Otherwise, I do my best to reply by the end of the following weekday.


Reported Problems - January 1995


SunOS <-> AIX Routing Failure (UNSOLVED)

> Reported by : David Hernaiz, University of Barcelona 
> Contact     : <sistemes@probeta.qui.ub.es> 
> Date        : Mon, 9 Jan 95 
> 
> NQS Version : Monsanto-NQS v3.36.0 
> Platforms   : AIX, SunOS 4 
> 
> Description : Requests sent from an NQS node on SunOS 4 to an NQS
>             : node on AIX result in the error message
>             : ``Request not to be routed. Request deleted''
> Solution    : None as yet 
> 
> Comments so far : 
> 
>  Having looked at the logs, NQS is complaining that the pipeclient 
>  process cannot successfully read the nmap database.  I have 
>  traced the error propagation back apparently as far as the 
>  routine ``nmap_get_nam''. 
> 
>  Investigations continuing. 


Linux Compilation Failure (UNSOLVED - CANNOT REPRODUCE)

> Reported by : Dan Rugotzke 
> Contact     : <rugotzke@nevada.edu> 
> Date        : Mon, 9 Jan 1995 
> 
> NQS Version : Monsanto-NQS v3.36.5 
> Platforms   : Slackware 2.0 distribution of Linux 
> 
> Description : ./src/lpserver.c failed to compile because the 
>	      : header file <sgtty.h> should be <bsd/sgtty.h>. 
> Solution    : None as yet 
> 
> Comments so far : 
> 
>  I have been unable to reproduce this problem.  The Linux Makefile 
>  already tells GCC to look in /usr/include/bsd for BSD header 
>  files. 
> 
>  No further action recommended.  If the problem is reported again, 
>  I'll take another look at it. 


OSF/1 v2.0 Compilation Failure (FIXED)

> Reported by : Andrew Cormack 
> Contact     : <scoanc@thor.cf.ac.uk> 
> Date	      : Tue, 10 Jan 1995 
> 
> NQS Version : Monsanto-NQS v3.36.5 
> Platforms   : OSF/1 v2.0 
> 
> Description : Incomplete #if statement in ./lib/shoqbydesc.c
>             : Massive complaints from the native compiler about
>             : the ANSI prototypes.
> Solution    : Use Monsanto-NQS v3.36.6 or later 
> 
> Comments so far : 
> 
>  The problem with the #if statement was caused by my HPUX fixes in 
>  v3.36.5, and has been fixed in v3.36.6. 
> 
>  The prototypes one is more serious.  We added ANSI prototypes 
>  using `protoize', which really left quite a mess, imho.  Anyway, 
>  I understand that using the `-std1' switch with cc(1) works  
>  around this, and I've added this to v3.36.6's Makefile. 
> 
>  Situation is being monitored - hopefully v3.36.6 fixes these 
>  problems. 


Unable To Apply The Patches (FIXED)

> Reported by : Many users 
> Contact     : N/A 
> Date	      : First reported Fri, 20 Jan 1995 
> 
> NQS Version : Irrelevant
> Platforms   : OSF/1, IRIX 5 & 6 for sure, probably others as well 
> 
> Description : When attempting to apply patches, the patch(1) program 
>		asks for a file to patch, and generally does not 
>		understand the patch files. 
> Solution    : use patch-2.1.tar.gz from your local GNU mirror 
>		(UK, use src.doc.ic.uk:/gnu) 
> 
> Comments so far : 
> 
>  It appears that a number of vendors, most notably DEC and SGI, 
>  ship an old version of patch, which does not understand unified 
>  context diffs (the output of diff -u).  The solution is to 
>  compile and install the latest version of patch. 


Unable To Compile On HP/UX (FIXED)

> Reported by : Olivier Pirotte 
> Contact     : Pirotte@bavax.bartho.ulg.ac.be 
> Date        : Wed, 25 Jan 1995 
> 
> NQS Version : Monsanto-NQS 3.36.6 
>               (Probably affects earlier versions too) 
> Platforms   : HPUX 9 
> 
> Description : NQS fails to compile. 
> Solution    : Use Monsanto-NQS 3.36.7 or later. 
> 
> Comments so far : 
> 
>  The change to ANSI C (in 3.36.4) broke the Makefile.hpux, as the 
>  compiler requires the -Ae switch in order to compile ANSI C. 


No Account Authorization At Transaction Peer (FIXED)

> Reported by : Olivier Pirotte 
> Contact     : Pirotte@bavax.bartho.ulg.ac.be 
> Date        : Wed, 25 Jan 1995 
> 
> NQS Version : All versions 
> Platform    : HPUX 9 (but affects all others too) 
> 
> Description : Attempting to submit a job to a pipe queue results in
>               the error message ``No account authorization at
>               transaction peer.''
> Solution    : Create the file /etc/hosts.nqs.  Place in this file
>               two lines for every machine in the nmapmgr database:
>               one line for the long name of the machine, and one
>               line for the short name.
> 
> Comments so far : 
> 
>  This is a common setup mistake, and is easily solved by adding 
>  a /etc/hosts.nqs file to each machine running NQS.  In this text 
>  file, place an entry for every machine which is permitted to send 
>  NQS requests via pipe queues to the machine.  Each entry consists 
>  of two lines - one line for the short name, and one line for the 
>  long name of the machine. 
> 
>  Eg: 
> 
>    stoat 
>    stoat.shef.ac.uk 
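
As an illustration only, the two-line entry format shown above could be generated with a small C helper. The function name `write_hosts_nqs_entry' is hypothetical and is not part of NQS. The sketch assumes the short-name-then-long-name order used in the example, and it writes to whatever stream the caller supplies rather than opening /etc/hosts.nqs itself.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of NQS): write one hosts.nqs entry,
 * i.e. the short name on one line and the long name on the next.
 * Returns 0 on success, -1 if the write failed.
 */
int write_hosts_nqs_entry(FILE *out, const char *short_name,
                          const char *long_name)
{
    assert(out != NULL && short_name != NULL && long_name != NULL);

    if (fprintf(out, "%s\n%s\n", short_name, long_name) < 0)
        return -1;
    return 0;
}
```

Calling the helper once per permitted machine, with the stream opened on the hosts.nqs file, would build the file one entry at a time.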

Reported Problems - February 1995


Shared Installations Using NFS (FIXED)

> Reported by : Neil Smith 
> Contact     : neils@csrp.tamu.edu 
> Date	      : Fri 10 Feb 1995 
> 
> NQS Version : All version 3 
> Platform    : All 
> 
> Description : Various NQS processes complain about being unable 
>		to access files or directories which reside on 
>		NFS-mounted partitions. 
> Solution    : Re-export your NFS partitions so that requests 
>		from processes running as root (uid 0) are NOT 
>		remapped to another user (typically nobody). 
> 
> Comments so far : 
> 
>  A number of NQS components run as setuid root, including the 
>  daemons and qsub.  These processes need to be able to access 
>  a number of files while running as user-id 0.  A typical NFS 
>  setup will force user-id 0 on remote machines to be treated 
>  as `nobody', requiring world rights to files and directories. 
>  This breaks the setuid root components of NQS. 
> 
>  NQS v4 will attempt to reduce, if not remove, much of this 
>  problem. 


Environment Variables (FIXED)

> Reported by : Thomas Ziehmer, RHRK, University of Kaiserslautern
> Contact     : ziehmer@rhrk.uni-kl.de 
> Date	      : Tue, 21 Feb 1995 
> 
> NQS Version : Monsanto-NQS v3.36.6 
> Platforms   : SGI IRIX6, IRIX5.2, LINUX 1.1.52 and others 
> 
> Description : In the environment, the LOGNAME, MAIL, TZ and
>               QSUB_HOST variables are concatenated on one line,
>               and MAIL, TZ and QSUB_HOST on another.
> Solution    : Use Monsanto-NQS v3.36.7 or later 
> 
> Comments so far : 
> 
>  The problem is caused by calculations in nqs_reqser.c failing
>  to count the NUL byte terminating a string.  Thomas submitted a
>  patch which will be incorporated into v3.36.7.
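
The class of bug described above (a length calculation that forgets the terminating byte) can be sketched in a few lines of C. This is not the code from nqs_reqser.c, nor Thomas's patch. The helper `make_env_entry' is hypothetical and only shows where the extra byte has to be counted when building a ``NAME=value'' string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from nqs_reqser.c): build a
 * "NAME=value" environment entry in freshly allocated memory.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
char *make_env_entry(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);

    size_t name_len = strlen(name);
    size_t value_len = strlen(value);

    /* name + '=' + value + terminating NUL.  Leaving out the final
       "+ 1" leaves no room for the terminator, so an entry packed
       next to others can run into its neighbour. */
    size_t total = name_len + 1 + value_len + 1;

    char *entry = malloc(total);
    if (entry == NULL)
        return NULL;

    memcpy(entry, name, name_len);
    entry[name_len] = '=';
    /* Copy value_len + 1 bytes so the NUL terminator comes along. */
    memcpy(entry + name_len + 1, value, value_len + 1);
    return entry;
}
```

For example, make_env_entry("TZ", "GMT") needs seven bytes in total: six for the visible characters and one for the terminator.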


Output Redirection (UNSOLVED - DEBUGGING ADDED)

> Reported by : Michael Shephard 
> Contact     : michaels@jake.chem.unsw.edu.ac 
> Date        : Wed, 22 Feb 1995 
> 
> NQS Version : Not stated 
> Platforms   : IRIX 5 
> 
> Description : Using the `-o' switch for output for some users 
>               results in an error message from qsub about 
>		being unable to determine the machine-id. 
> Solution    : None. 
> 
> Comments so far : 
> 
>  This *appears* to be some form of configuration problem - 
>  existing users could use the -o switch, but all new users 
>  could not. 
> 
>  The error code in qsub(1) which handles this problem is
>  ambiguous, and a patch has been added to 3.36.7 in order
>  to give more information.
> 
>  This problem has not been reported by anyone else using IRIX. 


Pipe Queue Problems On OSF/1 v3 (UNSOLVED)

> Reported by : Andrea Testa, Ecole Polytechnique Federale de Lausanne 
>		system manager in the Physics Department 
> Contact     : andrea.testa@sd-p.dp.epfl.ch 
> Date	      : Fri, 24 Feb 1995 
> 
> NQS Version : Monsanto-NQS 3.36.0 + in-house fixes for OSF/1
>		support 
> Platforms   : OSF/1 v3 
> 
> Description : A pipe queue on the DEC is unable to deliver jobs
>               correctly to remote queues - they get stuck in the
>               arriving state.
> Solution    : None. 
> 
> Comments so far : 
> 
>  This appears to be of the same nature as the first problem reported 
>  in January 1995. 

Reported Problems - March 1995


NQS Daemons Stopping (UNSOLVED - PERHAPS UPGRADE)

> Reported by : Cordula Reineke 
> Contact     : ratte@iam.uni-bonn.de 
> Date        : Wed, 29 Mar 1995 
> 
> NQS Version : Monsanto-NQS 3.35 
> Platforms   : IRIX 5.2, IRIX 5.3 
> 
> Description : On machines which only have pipe queues to route 
>		jobs to other machines, the daemons just lock up, 
>		and require killing/restarting by hand. 
>		This happens several times a day. 
> Solution    : Upgrade to 3.36.6 or later? 
> 
> Comments so far : 
> 
>  We don't support anything before Monsanto-NQS 3.36.0, so it's 
>  difficult to investigate and comment.  We are unaware of any 
>  such problem AT THIS TIME with Monsanto-NQS 3.36.x. 


Bus Error With `qstat -a' on IRIX 6 (UNSOLVED)

> Reported by : Cordula Reineke 
> Contact     : ratte@iam.uni-bonn.de 
> Date	      : Wed, 29 Mar 1995 
> 
> NQS Version : Monsanto-NQS 3.35 
> Platforms   : IRIX 6.0.1 
> 
> Description : qstat -a results in a bus error, while qstat -sa 
>               works fine. 
> 
> Solution    : None. 
> 
> Comments so far : 
> 
>  This appears to be a problem in the NQS code, but has not been 
>  actively investigated at this time. 


Bus Error With `qmgr show managers' Or `qmgr' (WORKAROUND)

> Reported by : Cordula Reineke 
> Contact     : ratte@iam.uni-bonn.de 
> Date        : Wed, 29 Mar 1995 
> 
> NQS Version : Monsanto-NQS 3.36.6 
> Platform    : IRIX 5.3 
> 
> Description : qmgr gives a bus error when attempting to show the 
>		list of managers. 
> Solution    : Remove the file
>               NQS_SPOOL/private/root/database/managers.
> 
> Comments so far : 
> 
>  This is caused by a combination of : 
> 
>  o  Assigning manager rights to a non-root user 
>  o  Then changing the machine_id of the machine 
> 
>  It turns out that the mid of the manager is stored in the database, 
>  and so, when the mid of the machine is changed, the mid of the 
>  manager is no longer valid. 
> 
>  A patch against this problem will be produced shortly. 
> 
>  Many thanks to Mark Grieshaber at Monsanto for an excellent 
>  investigation into this problem. 

Reported Problems - April 1995


Compilation Failure On IRIX 6 (FIXED)

> Reported by : Cordula Reineke 
> Contact     : ratte@iam.uni-bonn.de 
> Date	      : Mon, 3 Apr 1995 
> 
> NQS Version : Monsanto-NQS 3.36.6 
> Platform    : IRIX 6 
> 
> Description : NQS fails to compile, complaining about being 
>		unable to locate the file for -lnqs. 
> Solution    : Use Monsanto-NQS 3.36.7 or later 
> 
> Comments so far : 
> 
>  This is just an oversight in the Makefile.sgi6 - simply add 
>  `-L.' to the front of the LINKLIBS line, and all is well. 


Monsanto-NQS 3.36.7 Pre-Release 1 Doesn't Work (FIXED)

> Reported by : Cordula Reineke 
> Contact     : ratte@iam.uni-bonn.de 
> Date        : Tue, 11 Apr 1995 
> 
> NQS Version : Monsanto-NQS 3.36.7 Pre-Release 1 
> Platform    : All 
> 
> Description : Submitting a request results in an unanticipated 
>		transaction failure reported by qsub. 
> Solution    : Use 3.36.7 pre-release 2, the release version, or later
> 
> Comments so far : 
> 
>  This was entirely my fault - debugging code was added to the 
>  request process which always failed, because the test was 
>  hopelessly wrong.  This one fault took weeks to find, and caused 
>  high levels of inconvenience to all concerned. 

Reported Problems - May 1995


Problem Compiling NQS 3.36 On OSF/1 v1.3 (FIXED)

> Reported by : Michael Pope 
> Contact     : ln1mgp@entoil.co.uk 
> Date        : Tue, 9 May 1995 
> 
> NQS Version : Monsanto-NQS 3.36.0 
> Platform    : OSF/1 v1.3 
> 
> Description : NQS fails to compile, complaining about BAD SYSTEM 
>		TYPE. 
> Solution    : Upgrade to 3.36.6 or later, and upgrade to a much 
>		later version of OSF/1. 
> 
> Comments so far : 
> 
>  This problem *appears* to be caused simply by the C pre-processor 
>  on OSF/1 v1.3 being unable to correctly parse pre-processor 
>  directives regarding conditional compilation. 
> 
>  I am informed that users should upgrade to at least OSF/1 v3. 


Segmentation Fault On OSF/1 v2.1 (FIXED)

> Reported by : Matsushita Takashi 
> Contact     : matsu@phys.metro-u.ac.jp 
> Date	      : Thu, 18 May 1995 
> 
> NQS Version : Monsanto-NQS 3.36.6 
> Platform    : OSF/1 v2.1 
> 
> Description : NQS core dumps while booting. 
> Solution    : Use Monsanto-NQS v3.36.7 or later. 
> 
> Comments so far : 
> 
>  This was caused by printf("%s\n", mid); where mid was an 
>  unsigned long.  How come no other version of UNIX has complained 
>  is beyond me. 
> 
>  Many thanks to Matsushita Takashi for a post-mortem of the core 
>  using gdb. 
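
The mismatch described above is easy to show in isolation. The sketch below is not the patched NQS code. The helper name `format_mid' is hypothetical, and the only assumption carried over from the report is that the machine-id is held in an unsigned long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of NQS): format a machine-id held in
 * an unsigned long into buf.  Returns the number of characters that
 * the full result needs, as snprintf does.
 */
int format_mid(char *buf, size_t buflen, unsigned long mid)
{
    assert(buf != NULL && buflen > 0);

    /* %lu matches unsigned long.  With %s, printf would treat the
       number as the address of a string and try to read from it,
       which is undefined behaviour and may crash. */
    return snprintf(buf, buflen, "%lu", mid);
}
```

Compilers that check format strings (for example gcc with -Wformat) can flag this kind of mismatch at build time.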


Queue Lockups On OSF/1 (UNSOLVED)

> Reported by : John Peden 
> Contact     : pdxjfp@evol.gene.nottingham.ac.uk 
> Date	      : Tue, 30 May 95 
> 
> NQS Version : Monsanto-NQS 3.36.x 
> Platform    : OSF/1 v3.2 
> 
> Description : An NQS batch queue will spawn a request, and then 
>               fail to run any other requests in the queue until the 
>               request is deleted using qdel. 
>		Problem is intermittent, and appears to happen under 
>		load. 
> Solution    : None. 
> 
> Comments so far : 
> 
>  John stress-tested NQS, and found a failure rate of just 1%. 
>  Debugging code will be added to 3.36.7 in order to provide 
>  further information about the cause of the problem. 

Reported Problems - June 1995


Unanticipated Transaction Failure - Intermittent (WORKAROUND)

> Reported by : Philippe A. Bopp 
> Contact     : pab@hulot.lsmc.u-bordeaux.fr 
> Date	      : Tue, 6 Jun 1995 
>  
> NQS Version : Monsanto-NQS 3.36 
> Platform    : AIX 
> 
> Description : The occasional RCM_UNAFAILURE message appears in
>		the NQS log files.  This appears to happen under 
>		load. 
> Solution    : None. 
> 
> Comments so far : 
> 
>  There have been no similar reports.  We are unable to actively 
>  look into the problem, as it does not affect known UKHE 
>  installations. 
> 
>  Philippe has since reported that this is caused by having `qsub'
>  as the last command of an NQS request.  A workaround is to place a
>  `sleep 10' after the `qsub' statement in the request.


Accounting Error (UNSOLVED)

> Reported by : Thomas Eifert 
> Contact     : Eifert@rz.rwth-aachen.de 
> Date        : Fri, 9 Jun 1995 
> 
> NQS Version : 3.36.7 pre-release 2 (and earlier versions) 
> Platform    : IRIX 5.2 (probably IRIX 6 too) 
> 
> Description : qacct reports that a process uses CPU-time which 
>		is 100 times what was actually used.  The time 
>		reported by qstat is correct, suggesting a bug in 
>		qacct. 
> Solution    : None. 
> 
> Comments so far : 
> 
>   Not investigated.  Will investigate when time allows. 


Process Time Inaccuracy - qstat (UNSOLVED)

> Reported by : Thomas Eifert 
> Contact     : Eifert@rz.rwth-aachen.de 
> Date        : Mon, 12 Jun 1995 
> 
> NQS Version : 3.36.7 pre-release 2 
> Platform    : IRIX 5.2 
> 
> Description : The time reported by qstat is no longer accumulated
>               over several processes that run within one job.
> Solution    : None. 
> 
> Comments so far : 
> 
>   Not investigated.  There was a change in signal handling back 
>   in 3.36.5, which may be relevant to the problem. 


Environment Corruption (FIXED)

> Reported by : Chang Keng Seng 
> 		Technical Consultant 
>		Computervision Services, Singapore 
> Contact     : chang@pspore.cv.com 
> Date	      : Fri, 16 June 1995 
> 
> NQS Version : 3.36.6 (and earlier versions) 
> Platform    : SOLARIS 2 
> 
> Description : The environment variables set by NQS are concatenated 
>		together. 
> Solution    : Use Monsanto-NQS 3.36.7 or later. 
> 
> Comments so far : 
> 
>   The routines which build the environment didn't count the NUL
>   terminator at the end of each environment string.  Fixed by
>   Thomas Ziehmer.


/sbin/pset : unknown set name (SOLVED)

> Reported by : Phil Chambers 
> Contact     : P.A.Chambers@exeter.ac.uk 
> Date	      : Tue, 27 Jun 1995 
> 
> NQS Version : 3.36.6 (I think - Stu) 
> Platform    : IRIX 6 
> 
> Description : A message appears in the NQS logs of the form 
>		/sbin/pset: unknown set name. 
> Solution    : Create processor sets with the same name as your 
>		batch queues, or recompile NQS without the `-DTAMU' 
>		option to disable pset support. 
> 
> Comments so far : 
> 
>   At first sight, this appears to be a bug in pset.  Investigations 
>   are continuing. 


Solaris 2.3 stdout Delivery Problem (INSTALLATION ERROR)

> Reported by : Rob Creecy 
> Contact     : rcreecy@census.gov 
> Date 	      : Tue, 27 Jun 1995 
> 
> NQS Version : Monsanto-NQS v3.36.7 pre-release #2 
> Platform    : Solaris 2.3 
> 
> Description : The command `qsub<CR>ls<CR>^D' (as an example) 
>		results in email detailing that the NQS request 
>		was aborted by signal 11 (SIGSEGV). 
> Solution    : None at present. 
> 
> Comments so far : 
> 
>   I cannot reproduce this bug locally - we've been running NQS on 
>   Solaris 2.3 for nearly a year now.  I've asked for the analysis 
>   of the core file, which should help track down the problem 
>   further. 
> 
>   Rob has since reported that he recompiled and reinstalled NQS, 
>   and NQS appears to be working fine.  This appears to have simply 
>   been a subtle installation error. 
> 
>   There is a good case for arguing that NQS should be immune to
>   such problems ... 


Defunct Processes On HP-UX 10.0 (UNSUPPORTED MACHINE)

> Reported by : Mouri Yoshihiro 
> Contact     : y-mouri@jkk.hitachi.co.jp 
> Date	      : Thu, 29 Jun 1995 
> 
> NQS Version : Monsanto-NQS v3.36.? 
> Platform    : HP-UX v10.0 
> 
> Description : <defunct> processes appear, whose parent pid is 
>		NQS's netdaemon. 
> Solution    : None as yet. 
> 
> Comments so far : 
> 
>   At the time of writing, we do not have support for HPUX 10.0 
>   in the Monsanto-NQS source tree.  The problem should be solved 
>   by changing the way netdaemon.c handles SIGCHLD.  The next 
>   release of Monsanto-NQS will include a preliminary patch to 
>   attempt a fix. 

Reported Problems - July 1995


Netdaemon: Error Getting Local Host's MID (UNSOLVED)

> Reported by : Jim Talley 
> Contact     : talley@lexicus.mot.com 
> Date        : Thu 6 July 1995 
>  
> NQS Version : Monsanto-NQS v3.36.7 pre-release 2 
> Platform    : SunOS 4.1.x 
> 
> Description : Netdaemon fails to run on starting up NQS.  It 
>		reports that it is unable to determine the machine-id 
>		of the local host, and reports the error value 2. 
> 
> Solution    : None at present. 

Automagically produced by KTEpaper, part of The Knowledge Tree Engine

This site (www.gnqs.org) is copyrighted. You can view the terms & conditions here.
You can contact the webmaster here.



Please address all correspondence about this page to
