4.1. Post Installation Script Debugging

This section will discuss some common mistakes made in post installation scripts and provide some techniques on how to help you debug them.

Below is an example node XML file that has a XML syntax error:

<?xml version="1.0" standalone="no"?>

<kickstart>

<post>
ech "yo" 2&>1 > /tmp/fun
</post>

</kickstart>

When the above node XML file is included in a roll and when a compute node asks for its kickstart file, no file will be returned and the compute node will not install. The compute node will display a screen asking for the user to input a "language" (compute node installations are 100% automated, so a compute node should never ask for user input).

The command rocks list host profile <hostname> executes the same functions as the code which automatically generates kickstart files. So, if we execute rocks list host profile compute-0-0, we'll see what error occurs:

# rocks list host profile compute-0-0 > /tmp/ks.cfg
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 267, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/profile/__init__.py", line 295, in run
    [
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1667, in command
    o.runWrapper(name, args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/xml/__init__.py", line 199, in run
    xml = self.command('list.node.xml', args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1667, in command
    o.runWrapper(name, args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/node/xml/__init__.py", line 514, in run
    handler.parseNode(node, doEval)
  File "/opt/rocks/lib/python2.4/site-packages/rocks/profile.py", line 409, in parseNode
    parser.feed(line)
  File "/opt/rocks/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/opt/rocks/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:91:11: not well-formed (invalid token)

The last line of the output above let's us know there is a problem, but it is not clear in which file and which line number. One way to debug the problem is to add ROCKSDEBUG=y to the rocks list host profile command:

# ROCKSDEBUG=y rocks list host profile compute-0-0 > /tmp/ks.cfg
  .
  .
  .
</kickstart>[parse1]
[parse1]<kickstart roll="valgrind">
[parse1]
[parse1]<post>
[parse1]ech "yo" 2&>1 > /tmp/fun
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 267, in ?
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper
    self.run(self._params, self._args)

The command produces a lot of output, but it does stop on the line with the bug. Now we have something to grep for. On a stock Rocks frontend, all the node XML files for a distribution can be found in /export/rocks/install/rocks-dist/*/build/nodes. So to find the file with the syntax error, execute:

# cd /export/rocks/install/rocks-dist/x86_64/build/nodes
# grep 'ech "yo" 2&>1 > /tmp/fun' *
valgrind-bug.xml:ech "yo" 2&>1 > /tmp/fun

The bug is in valgrind-bug.xml, so we can go to the source code for the Valgrind Roll and fix the node XML file.

But let's say we just fix the XML syntax error:

<?xml version="1.0" standalone="no"?>

<kickstart roll="valgrind">

<post>
ech "yo" > /tmp/fun
</post>

</kickstart>

There is still a bug (the command ech should be echo). After a host installs, there are several log files saved on a host. One that we'll look at is: /var/log/rocks-install.log:

./nodes/valgrind-bug.xml: begin post section
/tmp/ks-script-MP2uTd: line 2: ech: command not found
./nodes/valgrind-bug.xml: end post section

This may be enough information to help determine the root cause of the problem, but we have found that sometimes we need to "stall" an installation at a specific point. We'll modify the post section in valgrind-bug.xml to loop indefinitely:

<post>
ech "yo" > /tmp/fun

touch /tmp/stall
while [ -f /tmp/stall ]
do
	sleep 1
done

</post>

When a compute node installs, you wll see a screen that says "Running post-install scripts" -- the installation will not progress beyond this point due to the loop above. From the frontend, we can access the installing node by executing:

# ssh compute-0-0 -p 2200

We'll see a bash prompt -- we are now in the installation environment for compute-0-0. This environment is running out of a ramdisk, so the partitions on the hard disk are all mounted with the prefix /mnt/sysimage. For example, the root partition is /mnt/sysimage, the var partition is /mnt/sysimage/var, etc. Before a post installation script is run, the installer executes a chroot to /mnt/sysimage so the script has the illusion it is running on the hard disk environment (not the ramdisk environment).

After we debugged the issue, we can instruct the installaation to continue by executing:

# rm -f /mnt/sysimage/tmp/stall