This section will discuss some common mistakes made in post installation scripts and provide some techniques on how to help you debug them.
Below is an example node XML file that has a XML syntax error:
<?xml version="1.0" standalone="no"?> <kickstart> <post> ech "yo" 2&>1 > /tmp/fun </post> </kickstart> |
When the above node XML file is included in a roll and when a compute node asks for its kickstart file, no file will be returned and the compute node will not install. The compute node will display a screen asking for the user to input a "language" (compute node installations are 100% automated, so a compute node should never ask for user input).
The command rocks list host profile <hostname>
executes the same functions as the code which automatically generates
kickstart files.
So, if we execute
rocks list host profile compute-0-0
, we'll see what
error occurs:
# rocks list host profile compute-0-0 > /tmp/ks.cfg Traceback (most recent call last): File "/opt/rocks/bin/rocks", line 267, in ? command.runWrapper(name, args[i:]) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper self.run(self._params, self._args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/profile/__init__.py", line 295, in run [ File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1667, in command o.runWrapper(name, args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper self.run(self._params, self._args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/host/xml/__init__.py", line 199, in run xml = self.command('list.node.xml', args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1667, in command o.runWrapper(name, args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper self.run(self._params, self._args) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/list/node/xml/__init__.py", line 514, in run handler.parseNode(node, doEval) File "/opt/rocks/lib/python2.4/site-packages/rocks/profile.py", line 409, in parseNode parser.feed(line) File "/opt/rocks/lib/python2.4/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed self._err_handler.fatalError(exc) File "/opt/rocks/lib/python2.4/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError raise exception xml.sax._exceptions.SAXParseException: <unknown>:91:11: not well-formed (invalid token) |
The last line of the output above let's us know there is a problem, but it
is not clear in which file and which line number.
One way to debug the problem is to add ROCKSDEBUG=y
to the rocks list host profile
command:
# ROCKSDEBUG=y rocks list host profile compute-0-0 > /tmp/ks.cfg . . . </kickstart>[parse1] [parse1]<kickstart roll="valgrind"> [parse1] [parse1]<post> [parse1]ech "yo" 2&>1 > /tmp/fun Traceback (most recent call last): File "/opt/rocks/bin/rocks", line 267, in ? command.runWrapper(name, args[i:]) File "/opt/rocks/lib/python2.4/site-packages/rocks/commands/__init__.py", line 1981, in runWrapper self.run(self._params, self._args) |
The command produces a lot of output, but it does stop on the line with the
bug.
Now we have something to grep
for.
On a stock Rocks frontend, all the node XML files for a distribution can be
found in
/export/rocks/install/rocks-dist/*/build/nodes
.
So to find the file with the syntax error, execute:
# cd /export/rocks/install/rocks-dist/x86_64/build/nodes # grep 'ech "yo" 2&>1 > /tmp/fun' * valgrind-bug.xml:ech "yo" 2&>1 > /tmp/fun |
The bug is in valgrind-bug.xml
, so we can go to the
source code for the Valgrind Roll and fix the node XML file.
But let's say we just fix the XML syntax error:
<?xml version="1.0" standalone="no"?> <kickstart roll="valgrind"> <post> ech "yo" > /tmp/fun </post> </kickstart> |
There is still a bug (the command ech
should
be echo
).
After a host installs, there are several log files saved on a host.
One that we'll look at is: /var/log/rocks-install.log
:
./nodes/valgrind-bug.xml: begin post section /tmp/ks-script-MP2uTd: line 2: ech: command not found ./nodes/valgrind-bug.xml: end post section |
This may be enough information to help determine the root cause of the problem,
but we have found that sometimes we need to "stall" an installation at a
specific point.
We'll modify the post section in
valgrind-bug.xml
to loop indefinitely:
<post> ech "yo" > /tmp/fun touch /tmp/stall while [ -f /tmp/stall ] do sleep 1 done </post> |
When a compute node installs, you wll see a screen that says "Running post-install scripts" -- the installation will not progress beyond this point due to the loop above. From the frontend, we can access the installing node by executing:
# ssh compute-0-0 -p 2200 |
We'll see a bash prompt -- we are now in the installation environment for
compute-0-0.
This environment is running out of a ramdisk, so the partitions on the hard
disk are all mounted with the prefix /mnt/sysimage
.
For example, the root partition is /mnt/sysimage
,
the var partition is /mnt/sysimage/var
, etc.
Before a post installation script is run, the installer
executes a chroot
to
/mnt/sysimage
so the script has the
illusion it is running on the hard disk environment (not the ramdisk
environment).
After we debugged the issue, we can instruct the installaation to continue by executing:
# rm -f /mnt/sysimage/tmp/stall |