Configuring the Spare computer
The spare machine is used when one of the primary SOS computers has failed. Swapping in the spare for a failed machine is a significant undertaking and must be done with care. NOAA should be contacted before initiating this process.
From both a hardware and an operating system point of view, the spare is identical to the "nc" computer. The main difference lies in the specifics of the configuration. The basic idea is that the spare computer is given the identity of the failed or failing computer. This is accomplished by modifying configuration files on the spare so that they match those on the computer being replaced. The modifications vary depending on the type of machine that has failed ("nc" or a display node). This page is divided into two parts: replacing the "nc" system and replacing a display node. All of the actions require root privileges, which can be gained by logging into the system as root or by prefixing individual commands with "sudo" to run them as root. A reboot is required after all of the modifications are made.
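Since every step needs root, a small guard like the following can be placed at the top of any script that automates the edits. This is a minimal sketch; the function name require_root is hypothetical and not part of the SOS software.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the caller is root.
require_root() {
    # "id -u" prints the numeric user ID; root is always 0.
    if [ "$(id -u)" -ne 0 ]; then
        echo "This step needs root privileges; log in as root or use sudo." >&2
        return 1
    fi
    return 0
}
```

A script would typically call it as "require_root || exit 1" before touching any file under /etc.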
Steps for replacing the "nc" computer with the spare system.
All of these edits take place on the spare machine. Essentially, we are swapping identities. Since "nc" is a shared server for all of the other machines, it is probably a good idea to power down each of the display nodes.
- Find out the IP address of the machine being replaced. Use the "/etc/hosts" file as a reference
- Give the spare the identity of the failed machine. On the spare, edit the file "/etc/sysconfig/network". Change "HOSTNAME" to be the name of the failed machine.
- IP address change. On the spare edit the file, "/etc/sysconfig/network-scripts/ifcfg-eth0" and change the "IPADDR" variable to be that of the failed machine. For "nc" specifically, the IPADDR is typically "10.1.1.30". So, the line should read, "IPADDR=10.1.1.30".
- Edit the file "/etc/inittab". This file controls what type of window manager comes up when the machine goes into run level 5. There are two modes. In one mode (the way "nc" is configured), the system starts a window manager so that a user can interact with it. In the other mode (display node), the system displays only a black screen with a cursor. In the "/etc/inittab" file, a "#" (hash mark) designates a comment. When swapping the spare in for "nc", the last two lines should be the following:
x:5:respawn:/etc/X11/prefdm -nodaemon
#x:5:respawn:/usr/bin/X11/X -dpms -ac -noreset -s 0
- Edit the "/etc/exports" file. The "nc" computer functions as an NFS server for the display nodes. Edit this file to enable NFS and export the data disk to the other nodes. Again, "#" designates a comment. Un-comment the line starting with "/shared/sos". It should end up looking like the following:
/shared/sos 10.1.1.0/255.255.255.0(rw,async)
- Edit the file "/etc/fstab". There are a dozen lines in the file, but only two matter for this switch-over. Both lines are already in the file and simply need to be commented or un-commented. Edit the file so that they read:
#LABEL=/sos /sos ext3 defaults 1 2
LABEL=/sos /shared/sos ext3 defaults 1 2
- Edit the file "/home/sos/.ssh/known_hosts" on every machine so that ssh recognizes the new identities of the spare and of the "nc" machine.
- The "nc" computer has two Ethernet interfaces (NICs). In most cases, one NIC is set up to communicate with the other machines in the SOS cluster, and the other is set up to communicate with the outside Internet. The IP address of the Internet-facing interface must also be configured; the details are very site specific.
- Network time keeping. The "nc" computer synchronizes its internal clock to an Internet time source, and all of the other computers in the SOS cluster sync to "nc". To configure "nc" to sync to the Internet, copy the file "/etc/ntp.conf-nc" to the destination file "/etc/ntp.conf" (e.g. cp /etc/ntp.conf-nc /etc/ntp.conf).
- Power down the "old" nc.
- Reboot the "spare". It should come up as "nc".
- Power up the display nodes
- On "nc", bring up a terminal window and run the command "pdo ls". Because the identity of "nc" has changed, each machine will prompt for verification that it is OK to ssh. Answer "yes" to all of the questions.
- Test SOS
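The file edits in the steps above can be sketched as a few shell helpers. This is only an illustration, not an official SOS tool: the function names (set_var, use_window_manager, enable_export, mount_shared) are hypothetical, the patterns assume the lines appear in the files exactly as shown above, and "sed -i" assumes GNU sed. Each helper takes the file to edit as an argument, so it can be tried on a copy first.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for the "nc" swap. Nothing runs until called.

# set_var FILE NAME VALUE: rewrite the line "NAME=..." as "NAME=VALUE".
# Assumes VALUE contains no "|" or "&" (true for host names and IP addresses).
set_var() {
    sed -i "s|^$2=.*|$2=$3|" "$1"
}

# use_window_manager INITTAB: un-comment the prefdm line and
# comment out the bare X server line, as "nc" requires.
use_window_manager() {
    sed -i \
        -e 's|^#\(x:5:respawn:/etc/X11/prefdm\)|\1|' \
        -e 's|^\(x:5:respawn:/usr/bin/X11/X \)|#\1|' "$1"
}

# enable_export EXPORTS: un-comment the line starting with /shared/sos.
enable_export() {
    sed -i 's|^#[ \t]*\(/shared/sos\)|\1|' "$1"
}

# mount_shared FSTAB: comment out the /sos mount and
# un-comment the /shared/sos mount of the same labeled disk.
mount_shared() {
    sed -i \
        -e 's|^\(LABEL=/sos[ \t][ \t]*/sos[ \t]\)|#\1|' \
        -e 's|^#\(LABEL=/sos[ \t][ \t]*/shared/sos[ \t]\)|\1|' "$1"
}
```

Run as root on the spare, the calls would look like: set_var /etc/sysconfig/network HOSTNAME nc, set_var /etc/sysconfig/network-scripts/ifcfg-eth0 IPADDR 10.1.1.30, use_window_manager /etc/inittab, enable_export /etc/exports, and mount_shared /etc/fstab. Inspect each file afterward, since a line that differs from the form shown above is left unchanged.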
Steps for replacing a display node with the spare system.
- Find out the IP address of the machine being replaced. Use the "/etc/hosts" file as a reference
- Give the spare the identity of the failed machine. On the spare, edit the file "/etc/sysconfig/network". Change "HOSTNAME" to be the name of the failed machine.
- IP address change. On the spare edit the file, "/etc/sysconfig/network-scripts/ifcfg-eth0" and change the "IPADDR" variable to be that of the failed machine.
- Edit the file "/home/sos/.ssh/known_hosts" on every machine so that ssh recognizes the new identities of the spare and of the replaced machine.
- Reboot the "spare". It should come up as the display node it replaces.
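Two of the steps above lend themselves to small helpers: looking up the failed node's address in "/etc/hosts", and clearing stale entries from known_hosts. The sketch below is an assumption about how one might do this, not the documented SOS procedure. The function names are hypothetical, and forget_host relies on the standard OpenSSH command "ssh-keygen -R", which removes every key recorded for a host so that ssh asks for verification again on the next connection.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for the display-node swap.

# lookup_ip HOST HOSTSFILE: print the IP address listed for HOST.
lookup_ip() {
    awk -v host="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # ignore trailing comments
                if ($i == host) { print $1; exit }
            }
        }
    ' "$2"
}

# forget_host HOST KNOWN_HOSTS: drop stale ssh keys recorded for HOST.
forget_host() {
    ssh-keygen -R "$1" -f "$2"
}
```

For a failed node with the hypothetical name "sos1", "lookup_ip sos1 /etc/hosts" prints the address to put in IPADDR, and "forget_host sos1 /home/sos/.ssh/known_hosts" would be run on each machine in the cluster.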