Troubleshooting (Ubuntu)

High System Load

The system load is normally represented by the load average over the last 1, 5 and 15 minutes.

For example, the uptime command gives a single-line summary of system uptime and recent load:

user@server:~$ uptime
 14:28:49 up 9 days, 22:41,  1 user,  load average: 0.34, 0.36, 0.32

So in the above, as of 14:28:49 the server has been up for 9 days and 22 hours odd, has 1 user logged in, and the system load averages for the past 1, 5, and 15 minutes are 0.34, 0.36, and 0.32 respectively.

The load average for a given period indicates how many processes were, on average, running (or waiting to run) or in an uninterruptible (typically waiting for IO) state. What's bad depends on your system. On a single-CPU system, a load average greater than 1 could be considered bad, as there are more processes wanting to run than CPUs to service them. If you expect peaks in load, a high load over the last minute might not be a concern, whereas a high load over 15 minutes would be.
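As a rough illustration of the point about CPU count, the sketch below divides the 1-minute load average by the number of CPUs. The helper name load_per_cpu is made up for this example, and counting "processor" lines in /proc/cpuinfo is an assumption that holds on typical x86 systems.

```shell
#!/bin/bash
# Sketch: express the 1-minute load average per CPU.
# load_per_cpu is a hypothetical helper name, not a standard command.
load_per_cpu() {
    # $1 = load average, $2 = number of CPUs (treated as 1 if not positive)
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus < 1) cpus = 1
        printf "%.2f\n", load / cpus
    }'
}

if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
    load=$(cut -d ' ' -f 1 /proc/loadavg)          # 1-minute load average
    cpus=$(grep -c '^processor' /proc/cpuinfo)     # one line per CPU
    echo "1-minute load per CPU: $(load_per_cpu "$load" "$cpus")"
fi
```

A result persistently above 1.00 suggests there is more work queued than the CPUs can service.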

The problem with investigating performance issues is that you need to know what is normal in order to determine what's wrong once application/service performance deteriorates. But it's unlikely that you will have paid much attention to the underlying system metrics until things are already bad.

`top`

The top command allows some basic insight into the system's performance, and is akin to the Task Manager in Windows. It probably won't provide the answer as to what the problem is, but it will probably allow you to focus in on the process(es) that are causing grief.

user@server:~$ top
top - 14:32:09 up 9 days, 22:44,  1 user,  load average: 0.70, 0.44, 0.34
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1023360k total,   950520k used,    72840k free,    10836k buffers
Swap:  1757176k total,  1110228k used,   646948k free,   135524k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6608 zimbra    20   0  556m  69m  12m S 69.1  6.9   0:03.26 java
17284 zimbra    20   0  649m 101m 3604 S  4.6 10.1  31:34.74 java
 2610 zimbra    20   0  976m 181m 3700 S  0.7 18.1 133:06.68 java
    1 root      20   0 23580 1088  732 S  0.0  0.1   0:04.70 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
....

Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.

Overview of CPU Metrics, % of CPU time spent on
Code	`us`	`sy`	`ni`	`id`	`wa`	`hi`	`si`	`st`
Name	User CPU	System CPU	Nice CPU	Idle CPU	IO Wait	Hardware Interrupts	Software Interrupts	Steal
Description	user processes (excluding nice)	kernel processes	user nice processes (nice reduces the priority of process)	idling (doing nothing)	waiting for IO (high indicates disk/network bottleneck)	hardware interrupts	software interrupts	time stolen by the hypervisor to service other virtual machines

Task column heading descriptions (to change what columns are shown press `f`)
Key	Display	Name	Description
`a`	`PID`	Process ID	Task/process identifier
`b`	`PPID`	Parent PID	Task/process identifier of the process's parent (ie the process that launched this process)
`c`	`RUSER`	Real User Name	Real username of task's owner
`d`	`UID`	User ID	User ID of task's owner
`e`	`USER`	User Name	Effective username of task's owner
`f`	`GROUP`	Group Name	Group name of task's owner
`g`	`TTY`	Controlling TTY	Device that started the process
`h`	`PR`	Priority	The task's priority
`i`	`NI`	Nice value	Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
`j`	`P`	Last Used CPU	ID of the CPU last used by the task
`k`	`%CPU`	CPU Usage	Task's usage of CPU
`l`	`TIME`	CPU Time	Total CPU time used by the task
`m`	`TIME+`	CPU Time, hundredths	Total CPU time used by the task in sub-second accuracy
`n`	`%MEM`	Memory usage (RES)	Task's usage of available physical memory
`o`	`VIRT`	Virtual Image (kb)	Task's allocation of virtual memory
`p`	`SWAP`	Swapped size (kb)	Task's swapped memory (resident in swap-file)
`q`	`RES`	Resident size (kb)	Task's unswapped memory (resident in physical memory)
`r`	`CODE`	Code size (kb)	Task's virtual memory used for executable code
`s`	`DATA`	Data+Stack size (kb)	Task's virtual memory not used for executable code
`t`	`SHR`	Shared Mem size (kb)	Task's shared memory
`u`	`nFLT`	Page Fault count	Major/Hard page faults that have occurred for the task
`v`	`nDRT`	Dirty Pages count	Task's memory pages that have been modified since last write to disk, and so must be written out before they can be freed from physical memory
`w`	`S`	Process Status	D - Uninterruptible sleep; R - Running; S - Sleeping; T - Traced or Stopped; Z - Zombie
`x`	`Command`	Command Line	Command used to start task
`y`	`WCHAN`	Sleeping in Function	Name (or address) of function that the task is sleeping in
`z`	`Flags`	Task Flags	Task's scheduling flags
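The process status column ties back to the load average, which counts tasks that are running (R) or in uninterruptible sleep (D). A minimal sketch for listing just those tasks outside of top is shown below; load_tasks is a hypothetical helper name, and the ps field list assumes the procps version of ps.

```shell
#!/bin/bash
# Sketch: list tasks that count towards the load average, i.e. those
# running (R) or in uninterruptible sleep (D).
# load_tasks is a hypothetical helper name.
load_tasks() {
    # Reads "STATE PID USER COMMAND" lines on stdin; keeps states starting R or D
    awk '$1 ~ /^[RD]/'
}

# ps prints the state in the first column; the "=" suffixes suppress headers
ps -eo state=,pid=,user=,comm= | load_tasks
```

The ps process itself will normally show up as running, so expect at least one R line.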

Identify Process Causing Occasional High System Load

If the high load is constant, just fire up top and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.

If the high load is transient but repetitive, you'll need to capture the output of top at the right time. The following script creates a log of top output during periods of high load:

#!/bin/bash
#
# During high load, write output from top to file.
#
# Simon Strutt - July 2012

LOGFILE="/home/user/load_log.txt"  # Update to a valid file path
MAXLOAD=100                        # Load threshold multiplied by 100, as the 'if' comparison can only handle integers

LOAD=$(cut -d ' ' -f 1 /proc/loadavg)                                    # 1-minute load average
LOAD=$(echo "$LOAD * 100" | bc -l | awk -F '.' '{ print $1; exit; }')    # Convert load to x100 integer

if [ "$LOAD" -gt "$MAXLOAD" ]; then
        date '+%Y-%m-%d %H:%M:%S' >> "${LOGFILE}"
        top -b -n 1 >> "${LOGFILE}"
fi

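Schedule with something like the following (update with the correct path to load_log). An entry of * * * * * runs the check every minute, so that short peaks are not missed.

crontab -e
* * * * * /bin/bash  /home/user/load_log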

`vmstat`

vmstat is principally used for reporting on virtual memory statistics. For example, vmstat 5 3 creates an output every 5 seconds for 3 iterations:

user@server:~$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  42676 479556  34192 106944    5    5    31  3678   84   89  9  9 75  7
 0  0  42676 479548  34208 106948    0    0     0    10   50  105  6  0 88  5
 0  0  42676 479548  34216 106948    0    0     0    18   37   61  0  0 96  4

Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.

Overview of VMSTAT Metrics
Section	Procs		Memory				Swap		IO		System		CPU
Code	`r`	`b`	`swpd`	`free`	`buff`	`cache`	`si`	`so`	`bi`	`bo`	`in`	`cs`	`us`	`sy`	`id`	`wa`
Name	Run	Block	Swap (kB)	Free (kB)	Buffer (kB)	Cache (kB)	Swap In (kB/s)	Swap Out (kB/s)	Blocks In (blocks/s)	Blocks Out (blocks/s)	Interrupts (/s)	Context Switch (/s)	User (% time)	System (% time)	Idle (% time)	Wait (% time)
Description	Processes waiting for run time	Processes in uninterruptible sleep (eg waiting for IO)	Virtual memory used	Unused memory	Memory used as buffers	Memory used as cache	Memory swapped in from disk	Memory swapped out to disk	Blocks in from a storage device	Blocks out from a storage device	Interrupts	Context switches	CPU running user processes	CPU running kernel processes	CPU idle	CPU waiting for IO
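The si and so columns are often the quickest indicator that a machine is short of memory. The sketch below filters vmstat output down to samples that show swap activity, skipping the first sample as it covers the period since system start. swap_active is a hypothetical helper name, and the sample lines fed to it are made up in vmstat's layout.

```shell
#!/bin/bash
# Sketch: report vmstat samples that show swap activity.
# swap_active is a hypothetical helper name.
swap_active() {
    # Reads vmstat output on stdin. Skips the two header lines and the first
    # sample (averages since boot), ignores any repeated header lines, then
    # prints samples where si (field 7) or so (field 8) is above zero.
    awk 'NR > 3 && $7 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0)'
}

# Example input: only the last sample shows swapping after the first is skipped
swap_active <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  42676 479556  34192 106944    5    5    31  3678   84   89  9  9 75  7
 0  0  42676 479548  34208 106948    0    0     0    10   50  105  6  0 88  5
 2  1  43100 470112  34216 106948  120   64   480   256   90  140 10  5 60 25
EOF
```

Against a live system the same function could be used as vmstat 5 | swap_active.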

`mpstat`

mpstat reports on basic processor stats. It creates a timestamped output, which is useful to leave running on a console (or logged to a file) for when you hear or find out about service performance problems after the fact. A number of the metrics are also provided by vmstat, but are reported to a greater accuracy by mpstat.

It's not available by default, and comes as part of the sysstat package (to install, use apt-get install sysstat).

For example, mpstat 5 3 creates an output every 5 seconds for 3 iterations:

user@server:~# mpstat 5 3
Linux 2.6.32-41-server (server)   25/07/12        _x86_64_        (1 CPU)

11:50:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:51:04     all    1.00    0.00    0.80    1.60    0.00    0.00    0.00    0.00   96.60
11:51:09     all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00   92.40
11:51:14     all   43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00   47.80
Average:     all   16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00   78.93

Overview of MPSTAT Metrics
Code	`CPU`	`%usr`	`%nice`	`%sys`	`%iowait`	`%irq`	`%soft`	`%steal`	`%guest`	`%idle`
Name	CPU No.	User (% util)	Nice (% util)	System (% util)	IO Wait (% time)	Hard IRQ (% time)	Soft IRQ (% time)	Steal (% time)	Guest In (% time)	Idle (% time)
Description	CPU number (or ALL) Set with `-P <n>` option switch	CPU running user processes	CPU running nice (adjusted priority) user processes	CPU running kernel processes (excludes IRQs)	CPU waiting for (disk) IO	CPU servicing hardware interrupts	CPU servicing software interrupts	Virtual CPU wait due to CPU busy with other vCPU	CPU servicing vCPU(s)	CPU idle
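When mpstat output has been left running or logged to a file, it helps to pull out only the busy periods. The sketch below keeps samples where %idle (the last column) falls below a threshold. busy_samples is a hypothetical helper name, and it assumes the output uses a decimal point rather than a locale-specific decimal comma.

```shell
#!/bin/bash
# Sketch: pick out mpstat samples where the CPU was mostly busy.
# busy_samples is a hypothetical helper name.
busy_samples() {
    # $1 = idle threshold in percent; reads mpstat output on stdin.
    # %idle is the last column; headers and the Average: line are skipped.
    awk -v limit="$1" '$1 != "Average:" && $NF ~ /^[0-9.]+$/ && ($NF + 0) < (limit + 0)'
}

# Example input taken from the mpstat output shown earlier
busy_samples 50 <<'EOF'
11:50:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:51:09     all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00   92.40
11:51:14     all   43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00   47.80
Average:     all   16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00   78.93
EOF
```

The same filter can be applied to a saved log, for example busy_samples 20 < mpstat.log, where mpstat.log is whatever file the output was redirected to.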

`iostat`

iostat reports on IO (and CPU) stats.

It's not available by default, and comes as part of the sysstat package (to install, use apt-get install sysstat).

IO stats can be displayed either by device (default, and extra metrics with -x switch) or by partition (-p switch). Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.

Device stats output...

root@servername:~# iostat -x 5 3
Linux 2.6.32-41-server (servername)   25/07/12        _x86_64_        (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.56    0.54    2.17    6.67    0.00   79.06

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              18.89     9.63    6.36    8.83   367.22   146.02    33.78     0.34   22.18   5.73   8.70
dm-0              0.00     0.00    3.16   10.68   190.37    86.53    20.01     0.62   44.79   2.18   3.02
dm-1              0.00     0.00   22.11    7.44   176.85    59.48     8.00     0.71   23.92   2.21   6.52

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          45.18    0.00    5.02    0.40    0.00   49.40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.82    0.20    3.82     1.61    65.86    16.80     0.02    4.50   4.00   1.61
dm-0              0.00     0.00    0.20    8.23     1.61    65.86     8.00     0.07    7.86   1.90   1.61
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          95.80    0.00    4.20    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    11.80    0.00    8.20     0.00   156.80    19.12     0.06    7.07   0.24   0.20
dm-0              0.00     0.00    0.00   19.60     0.00   156.80     8.00     0.18    8.98   0.10   0.20
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Partition stats output...

root@servername:~# iostat -t -p 5 3
Linux 2.6.32-41-server (servername)   30/07/12        _x86_64_        (1 CPU)

30/07/12 12:05:15
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.33    0.32    0.12    0.27    0.00   98.96

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.12        13.57        14.71     721218     782038
sda1              0.00         0.02         0.00        804         14
sda2              0.00         0.00         0.00          4          0
sda5              0.91        13.54        14.71     719994     782024
dm-0              2.23        13.49        14.45     716850     768240
dm-1              0.04         0.05         0.26       2632      13784

30/07/12 12:05:20
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.20         0.00        14.40          0         72
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.80         0.00        14.40          0         72
dm-0              1.80         0.00        14.40          0         72
dm-1              0.00         0.00         0.00          0          0

30/07/12 12:05:25
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0

Overview of IOSTAT Device Metrics
Stats	Device IO Stats										Partition IO Stats
Code	`rrqm/s`	`wrqm/s`	`r/s`	`w/s`	`rsec/s`	`wsec/s`	`avgrq-sz`	`avgqu-sz`	`svctm`	`%util`	`tps`	`Blk_read/s`	`Blk_wrtn/s`	`Blk_read`	`Blk_wrtn`
Name	Read Merge (/s)	Write Merge (/s)	Read (/s)	Write (/s)	Read (sectors/s)	Write (sectors/s)	Av. Req. Size (sectors)	Av. Queue Len. (requests)	Av. Service Time (msec)	Utilisation (% CPU Time)	Transfers (/s)	Read (blocks/s)	Write (blocks/s)	Read (blocks)	Write (blocks)
Description	Read requests merged	Write requests merged	Read requests	Write requests	Sector reads	Sector writes	Average read/write request size	Average request queue length	Average time to service requests	Bandwidth utilisation / device saturation	IO transfer rate (TPS - Transfers Per Second)	Data read	Data write	Data read	Data write

Note that the await column shown in the device output is the average time (in msec) for IO requests to be served, including time spent waiting in the queue, whereas svctm covers the service time only.

When using the above tools you may be presented with disk/device names of dm-0, dm-1, etc., which won't mean much. These are LVM logical devices. To understand what they map to, use:

lvdisplay|awk  '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'
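An alternative that doesn't rely on the layout of lvdisplay output is to read the device-mapper name of each dm-N device from sysfs. This is only a sketch: dm_names is a hypothetical helper name, and it assumes a kernel that exposes dm/name under /sys/block. The names returned are device-mapper names, which for LVM are normally the volume group and logical volume joined by a hyphen.

```shell
#!/bin/bash
# Sketch: map dm-N kernel names to device-mapper (LVM) names using sysfs.
# dm_names is a hypothetical helper; the sysfs root is a parameter so the
# function can be pointed at a test directory.
dm_names() {
    local root="${1:-/sys/block}" dev
    for dev in "$root"/dm-*; do
        # Each dm device exposes its mapper name in dm/name
        [ -r "$dev/dm/name" ] && echo "${dev##*/} $(cat "$dev/dm/name")"
    done
    return 0
}

dm_names
```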


Network

No NIC

Especially after hardware changes, it's possible the networking config no longer refers to the right interface.

Use ifconfig to confirm the current network config
Use dmesg | grep -i eth to ascertain what's been detected at boot time
If it shows that, say, eth0 has been renamed to eth1, then just update the /etc/network/interfaces file to match
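The comparison in the steps above can be roughly automated by checking each interface named in the config against what the kernel currently knows about. configured_ifaces is a hypothetical helper name, and the sketch assumes interfaces are declared with iface stanzas directly in /etc/network/interfaces (it does not follow source lines to other files).

```shell
#!/bin/bash
# Sketch: compare interfaces named in the config with those the kernel sees.
# configured_ifaces is a hypothetical helper name.
configured_ifaces() {
    # $1 = interfaces file; prints each name declared with an "iface" stanza
    awk '$1 == "iface" { print $2 }' "$1" | sort -u
}

conf=/etc/network/interfaces
if [ -r "$conf" ]; then
    for nic in $(configured_ifaces "$conf"); do
        # /sys/class/net has one entry per interface known to the kernel
        [ -e "/sys/class/net/$nic" ] || echo "$nic is configured but not present"
    done
fi
```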

File System

Unable to Mount CD-ROM

Mounting the drive with the following command fails:

mount /dev/cdrom /media/cdrom/

If /media/cdrom/ doesn't exist

Create the directory with mkdir /media/cdrom

If /dev/cdrom special device doesn't exist

Check for existing mappings and devices
- ls -l /dev/ | grep cdrom
If an existing mapping exists but for a different drive number (eg cdrom2 -> sr0)
- Then try mounting with that number
- EG mount /dev/cdrom2 /media/cdrom/
If no existing mapping exists
- Then try creating one for one of the listed devices
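- EG ln -sf /dev/sr0 /dev/cdrom (use the block device, such as sr0, rather than the SCSI generic device sg0, which cannot be mounted)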
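The checks above can be sketched as a small script that looks for the first candidate that exists as a block device. first_block_dev is a hypothetical helper name, and the candidate device names are assumptions that will vary between machines.

```shell
#!/bin/bash
# Sketch: pick the first candidate that exists as a block device.
# first_block_dev is a hypothetical helper; candidate names are assumptions.
first_block_dev() {
    local dev
    for dev in "$@"; do
        if [ -b "$dev" ]; then
            echo "$dev"
            return 0
        fi
    done
    return 1
}

if dev=$(first_block_dev /dev/cdrom /dev/cdrom1 /dev/cdrom2 /dev/sr0); then
    echo "try: mount $dev /media/cdrom/"
else
    echo "no CD-ROM block device found"
fi
```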

Replacing a Software RAID 1 Disk

This procedure was written from the following starting point...

A machine originally with two disks in RAID 1 has failed, one disk has been replaced, and the machine has been started again

...and adapted from this post http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

Back up whatever you can before proceeding; one mistake or system error could destroy your machine
Confirm which disk is new, and which is old (if the new disk is blank this is easy as there will be no partition info!)
- fdisk -l
Partition the new disk the same as the original
- sfdisk -d /dev/sda | sfdisk /dev/sdb
Confirm that the layout of both disks is now the same
- fdisk -l
Add the newly created partitions to the RAID disks
- mdadm --manage /dev/md0 --add /dev/sdb1
- You may have more sd partitions than md partitions; the array size returned by mdadm -D /dev/md* should roughly match the number of blocks reported by fdisk -l
The arrays should now be syncing, check progress by monitoring /proc/mdstat
- more /proc/mdstat
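
To follow the rebuild without reading the whole of /proc/mdstat each time, the progress line can be picked out with a short script. This is a sketch only: resync_progress is a hypothetical helper name, and the parsing assumes the usual progress line layout of the form "recovery =  9.9% (...)".

```shell
#!/bin/bash
# Sketch: summarise rebuild progress from /proc/mdstat.
# resync_progress is a hypothetical helper; the parsing assumes the usual
# "recovery = N%" (or "resync = N%") layout of the progress line.
resync_progress() {
    awk '
        /^md/ { name = $1 }                       # remember the current array
        {
            for (i = 1; i <= NF - 2; i++)
                if (($i == "recovery" || $i == "resync") && $(i + 1) == "=")
                    print name, $(i + 2)          # array name and percentage
        }
    '
}

if [ -r /proc/mdstat ]; then
    resync_progress < /proc/mdstat
fi
```

No output means no array is currently rebuilding.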

SSH

Server Hostname Change

If the hostname (or IP) of the server you are SSH'ing to changes, the old entry needs to be removed from your SSH known hosts file:

ssh-keygen -R <name or IP>

Packages

Errors etc received from apt-get

Error 400 Bad Request
- Somewhat misleadingly, the problem is normally caused by being unable to contact the update server. Consider adding proxy server config to your machine
The following packages have been kept back
- Package manager can hold back updates because they will cause conflicts, or sometimes because they're major kernel updates. Running aptitude safe-upgrade normally seems to force kernel updates through.

Reboot Required?

If a package update/installation requires a reboot to complete, the following file will exist...

/var/run/reboot-required

To see which packages caused this to be set, inspect the contents of...

/var/run/reboot-required.pkgs
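
Both checks can be combined into one small script, sketched below. reboot_status is a hypothetical helper name, and the directory is taken as a parameter purely so the function can be tried against a test directory.

```shell
#!/bin/bash
# Sketch: report whether a reboot is pending and which packages asked for it.
# reboot_status is a hypothetical helper; the directory is a parameter so the
# function can be tried against a test directory.
reboot_status() {
    local dir="${1:-/var/run}"
    if [ -f "$dir/reboot-required" ]; then
        echo "reboot required"
        # The .pkgs file may list a package more than once
        [ -f "$dir/reboot-required.pkgs" ] && sort -u "$dir/reboot-required.pkgs"
        return 0
    fi
    echo "no reboot required"
}

reboot_status
```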
