Thursday 19th November 1998
Being at my wits' end over the Signal 13 problem, I decided the only thing I can do is to produce a debugging patch which reports on every single file we attempt to close in the NQS daemon, and every single file we attempt to write to in the pipeclient.
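As a rough illustration of what "report on every single close" could look like, here is a minimal sketch in Python rather than in GNQS's own C. The helper name `traced_close` and the `where` label are hypothetical and are not taken from the GNQS sources.

```python
import errno
import os
import sys


def traced_close(fd, where, log=sys.stderr):
    """Close fd and report which call site asked for it and whether it worked.

    `where` is a free-form label for the call site (hypothetical).
    Returns True on success, False if the close failed.
    """
    try:
        os.close(fd)
    except OSError as exc:
        # Translate the errno into its symbolic name, e.g. EBADF.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        log.write(f"pid {os.getpid()}: close({fd}) at {where} FAILED: {name}\n")
        return False
    log.write(f"pid {os.getpid()}: close({fd}) at {where} ok\n")
    return True
```

Routing every close through one wrapper like this means an unexpected close of the IPC descriptor would show up in the log with its process ID and call site.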
There is one other possibility, which so far I've totally neglected to investigate. pipeclient writes to two pipes ... the other pipe being messages to syslog. Given the amount of data we generate, could we be exercising problems either in various implementations of syslog itself, or the FIFO support of the underlying operating system?
A simple change to the debugging in pipeclient will soon test that one. And I can get the patch out for everyone to test while I work on the much bigger debugging patch.
I hope one of these patches shows up what the problem is. Once we find it, it's fixable whether it's a problem with GNQS or an operating system bug. After all, we can always go back and use UNIX domain sockets instead of FIFOs if necessary.
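For reference, a connected pair of UNIX domain sockets can stand in for a pipe or FIFO with very little ceremony. The sketch below is in Python, not GNQS code, and the function names `make_channel` and `try_send` are made up for illustration. It relies on Python ignoring SIGPIPE at startup, so a dead peer shows up as an exception rather than a signal.

```python
import socket


def make_channel():
    """Return a connected (daemon_end, client_end) pair of UNIX domain sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)


def try_send(sock, payload):
    """Send payload; return False instead of raising if the peer has gone away."""
    try:
        sock.sendall(payload)
    except (BrokenPipeError, ConnectionResetError):
        return False
    return True
```

Whether this would actually cure the Signal 13 problem is an open question; the sketch only shows that the writer can find out that the other end has gone.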
I'm repeating myself here, but ... pipeclient is trying to write either to the original NQS daemon (which is listening on file descriptor #0, and which from the logs happily picks up messages from other pipeclients before and after) or to the syslog daemon (probably via UNIX domain sockets, making the patch I sent out tonight a total waste of space ;-). If pipeclient receives SIGPIPE, then that means that the NQS daemon has closed file descriptor #0. As far as I can tell, the NQS daemon simply never does this.
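The underlying rule is easy to reproduce in isolation: a write to a pipe with no remaining reader fails. The Python sketch below is not GNQS code and the function name is hypothetical. Python ignores SIGPIPE at startup, so the failure surfaces as an EPIPE error instead of killing the process.

```python
import errno
import os


def write_after_reader_closed():
    """Write to a pipe whose read end is already closed; return the errno name."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the only reader goes away
    try:
        # Python ignores SIGPIPE by default, so this raises instead of killing us.
        os.write(write_fd, b"reschedule\n")
        return None
    except OSError as exc:
        return errno.errorcode[exc.errno]
    finally:
        os.close(write_fd)
```

The error only appears once every copy of the read end has been closed, including any copies inherited by other processes.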
Well, it'll be the weekend before I cut the next patch. I'll have a look tomorrow night, and work out whether it is quicker to switch to UNIX domain sockets or to put in all of the debugging information.
Monday 16th November 1998
KDE has followed Enlightenment into the dark recesses of my hard drive, to be replaced by Window Maker. Not quite sure how I got there ... it all started out as trying to finally get Gnome working on my laptop. Hrm. Anyway, at least X works once more.
So, I now have a tasteful (honest ;-) X11 desktop, running nxterm (and vi), along with Code Crusader as my X-based editor, with Code Medic as the debugger frontend (mainly because DDD as distributed is dynamically linked against libXm.so.2, rather than LessTif). I'm open to suggestions for better tools to replace these, or to complement these. I'm particularly desperate for a decent X-based programmer's editor ...
That was yesterday.
Today is yet more Signal 13 hacking. Still haven't reproduced the problem, so I'm poring over the various debugging outputs which GNQS users have kindly supplied me with.
The SIGPIPE is received by the pipeclient, which doesn't appear to handle it at all. When this happens, the pipeclient has just failed to find a destination for a pipe job (because all of the remote destinations are too busy), and has decided to reschedule the job to retry later. Unfortunately, when the pipeclient attempts to tell the nqsdaemon about this, pipeclient finds that the nqsdaemon is not listening, and gets a SIGPIPE for its troubles.
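To show why an unhandled SIGPIPE gets reported as "Signal 13", the Python sketch below starts a child process that restores the default SIGPIPE action (as a plain C program would have it) and then writes to a pipe nobody is reading. The child script and the function name are illustrative only and have nothing to do with the real pipeclient code.

```python
import signal
import subprocess
import sys

# Child program: default SIGPIPE action, then a write with no reader.
CHILD = r"""
import os, signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)   # behave like a plain C program
r, w = os.pipe()
os.close(r)                                     # nobody is listening
os.write(w, b"reschedule\n")                    # SIGPIPE is delivered here
"""


def run_unhandled_writer():
    """Run the child and return the signal number that killed it, or None."""
    proc = subprocess.run([sys.executable, "-c", CHILD])
    # subprocess reports death-by-signal as a negative return code.
    return -proc.returncode if proc.returncode < 0 else None
```

A process that ignores or catches SIGPIPE would instead see the write fail with EPIPE and could log the failure before deciding what to do.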
According to pipeclient, it believes that it should be writing to file descriptor #4. This is the writing side of the NQS IPC pipe. It is opened in nqs_boot(), and preserved all of the way through to calling the pipeclient.
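The general mechanism, a pipe's write end surviving into a child program under the same descriptor number, can be sketched as follows. This is Python, not the NQS code; the descriptor number is whatever `os.pipe()` returns rather than #4, and the child script is a stand-in invented for this example.

```python
import os
import subprocess
import sys

# Stand-in child: writes one line to the descriptor number given in argv[1].
CHILD = "import os, sys; os.write(int(sys.argv[1]), b'hello from child\\n')"


def spawn_with_pipe():
    """Start a child that inherits the pipe's write end; return what it wrote."""
    read_fd, write_fd = os.pipe()
    try:
        # pass_fds keeps write_fd open, under the same number, in the child.
        subprocess.run(
            [sys.executable, "-c", CHILD, str(write_fd)],
            pass_fds=(write_fd,),
            check=True,
        )
    finally:
        os.close(write_fd)  # the parent no longer needs the write end
    try:
        return os.read(read_fd, 1024)
    finally:
        os.close(read_fd)
```

If anything along the chain closes the descriptor or reuses its number, the child ends up writing somewhere other than the intended pipe.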
Personally, I think the whole IPC for local NQS processes could do with replacing from scratch, if only because it's a total nightmare to maintain. GNQS v4 is going to have to do a lot better than COSMIC NQS did. And I want to look at the whole area of debugging messages again ... in particular, I'd love to have a logging system which logged events in exactly the order they occurred, successfully interleaving events from different processes into the one log.
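One possible approach to a shared log, sketched here in Python under the assumption that every process writes to the same local file: open it with O_APPEND and issue exactly one write() per record, so each record is added at the current end of the file. The function names are hypothetical, and this is not a description of how GNQS logs today.

```python
import os
import time


def open_shared_log(path):
    """Open (creating if needed) a log file for append-only writing."""
    return os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)


def log_event(fd, message):
    """Append one timestamped record using a single write() call."""
    record = f"{time.time():.6f} pid={os.getpid()} {message}\n".encode()
    os.write(fd, record)
```

The order of records in the file is the order in which the writes were performed. The timestamps are taken just before each write, so they are only a close approximation of that order. Network filesystems may not honour append semantics, so this sketch assumes a local disk.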
Anyway, my latest attempt to convince pipeclient to SIGPIPE has failed miserably, so I'm off to bed.
(Amusingly, today at work I managed to track down and fix a fatal problem in the custom RPC server I wrote for them, which had been bugging us for months. I take this as a good sign ;-)
Sunday 15th November 1998
(Rant mode on)
Made the mistake of installing the RedHat 5.1 Errata, only to find that my 16-bit depth X11 display no longer worked (sigh). Just to make life more interesting, of the various utilities available to generate XF86Config files, XF86Setup fails to display anything on the screen, xf86config fails to define any working display modes, and RedHat's XConfigurator totally fails to offer configuration for 1024x768 16-bit modes.
And certain Linux advocates are claiming that Linux is ready to take MS-Windows on (sigh). I'm a great fan of Linux, but the current rhetoric is nothing more than hot air from people with egos to maintain.
(Rant mode off)
Having said all that, apparently WINE is getting much closer to being able to run MS-Office 97.
More work on the Signal 13 problem, once I'd got X fixed.
Friday 13th November 1998
Off up to my adopted sister's ... back Saturday (late).
Thursday 12th November 1998
Out, social evening with everyone at work.
Wednesday 11th November 1998
Got the X11 TrueType font server up and running ... the StarOffice word processor module can't detect them (but can use them if you type the font settings in) but the rest of the modules work just fine.
Got nowhere near any code last night. Most of the evening was spent composing TheLetter(tm) to Sun, putting forward the basics of my request for sponsorship.
I have an email backlog ... I'll try and clear it at the weekend.
Tuesday 10th November 1998
Email, sorting out hardware from Sun, more email, StarOffice 5.0 for Linux (not bad at all so far ... wonder how I buy a few more copies?) and TrueType X11 font servers.
With the Signal 13 problem, I'm now thinking that the problem occurs when the pipeclient attempts to write back to the nqs shepherd process to tell it that the request needs to be re-scheduled. I'm still digging through the code (and still haven't reproduced the problem here) and will try and come up with a debugging patch soon.
Sunday 8th November 1998
Busy day with GNQS today.
Knocked up a quick patch for debugging GNQS startup problems. The first patch allows GNQS to pick up the default debugging level from the environment variable DEBUG.
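The idea of taking a default debugging level from the environment can be sketched like this in Python. It assumes DEBUG holds a non-negative integer and that anything else falls back to a default of 0; neither assumption is confirmed by the patch itself, and the function name is made up.

```python
import os


def default_debug_level(environ=os.environ, fallback=0):
    """Return the debug level named by $DEBUG, or fallback if unset or invalid."""
    raw = environ.get("DEBUG", "")
    try:
        level = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric: keep the built-in default.
        return fallback
    return level if level >= 0 else fallback
```

The attraction of this approach is that the level is known before any configuration has been read, which is exactly when startup problems occur.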
Next up is a look into a security hole reported to me Friday night. I've not checked whether Monsanto-NQS and CERN NQS users are likely to be affected, but I have forwarded on the security report to CERN NQS's maintainer, Christian Boissat, just in case.
A couple of hours of examination later, it turns out that the security hole is caused by the new prologue/epilogue patch. I think the best thing I can do is to get the Signal 13 problem dealt with, and then roll out a v3.50.5 with a secure version of prologue/epilogue.
Also spent a couple of hours transcribing the main melody for the traditional Irish song "Down In The Sally Gardens".
Saturday 7th November 1998
Out, taking Kristi to a Teddy Bear Fair in Cheltenham.
Monday 2nd November 1998
Hi Derek! My excuse for last weekend is lack of motivation - is that honest enough? ;-)
My copy of Solaris arrived from the States today.
In fact, Sun were nice enough to send me two copies - one for SPARC, and one for Intel - and charge me double for it. I wonder if they'd be kind enough to send me a SPARC to run it on ;-)
Plus I received Solaris 2.6, not Solaris 7. Maybe Solaris 7 isn't out yet, although that isn't the impression one gets from www.sun.com. But after having waited well over a month for the stuff to ship, it would have been a nice surprise to receive the current version of their operating system, and not the previous version.
With competition like this, it's no wonder Linux and the way Linux is distributed is gaining market share ...
Seeing as I need a SPARC to run my nice new (but out-of-date ;-) copy of Solaris on, it seems as good a time as any to approach Sun Microsystems to see if they will donate a single SPARC-based machine for use in Generic NQS maintenance.
I've also asked the Generic NQS user community to help me on this, by approaching their Sun salesman and voicing their support for my request.