High System Load (Ubuntu)
The system load of a Linux system is normally represented by the load average over intervals of the last 1, 5 and 15 minutes.
For example, the uptime command gives a single line summary of system uptime and recent load

    user@server:~$ uptime
     14:28:49 up 9 days, 22:41,  1 user,  load average: 0.34, 0.36, 0.32

So in the above, as of 14:28:49 hrs the server has been up for 9 days 22 hours odd, has 1 user logged in, and the system load averages for the past 1, 5, and 15 minutes are shown.
The load average for a given period indicates how many processes were, on average, running (or waiting to run) or in an uninterruptible (typically waiting for IO) state. What counts as bad depends on your system. On a single-CPU system a load average greater than 1 could be considered bad, as there are more processes wanting to run than CPUs to service them. If you expect peaks in load, a high load over the last minute might not be a concern, whereas a high load over 15 minutes would be.
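To apply the "greater than 1 on a single CPU" rule of thumb to a multi-CPU machine, the load average can be divided by the number of CPUs. The following is a minimal sketch only; the function name load_per_cpu is made up for illustration, and it assumes awk is available.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): divide a load average by a CPU count.
# Usage: load_per_cpu <load> <cpus>
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}
```

On a live system it could be called as load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)", where nproc comes from coreutils. A result persistently above 1 suggests more runnable (or IO-blocked) processes than CPUs.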
The problem with investigating performance issues is that you need to know what is normal in order to determine what's wrong once application/service performance deteriorates. But it's unlikely that you would have paid much attention to underlying system metrics until things are already bad.
top
The top
command allows some basic insight into the system's performance, and is akin to the Task Manager in Windows. It probably won't provide the answer as to what the problem is, but it will probably allow you to focus in on the process(es) that are causing grief.

    user@server:~$ top
    top - 14:32:09 up 9 days, 22:44,  1 user,  load average: 0.70, 0.44, 0.34
    Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
    Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   1023360k total,   950520k used,    72840k free,    10836k buffers
    Swap:  1757176k total,  1110228k used,   646948k free,   135524k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     6608 zimbra    20   0  556m  69m  12m S 69.1  6.9   0:03.26 java
    17284 zimbra    20   0  649m 101m 3604 S  4.6 10.1  31:34.74 java
     2610 zimbra    20   0  976m 181m 3700 S  0.7 18.1 133:06.68 java
        1 root      20   0 23580 1088  732 S  0.0  0.1   0:04.70 init
        2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd
        3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    ....

Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
Code | us | sy | ni | id | wa | hi | si | st |
---|---|---|---|---|---|---|---|---|
Name | User CPU | System CPU | Nice CPU | Idle CPU | IO Wait | Hardware Interrupts | Software Interrupts | Steal |
Description | user processes (excluding nice) | kernel processes | user nice processes (nice reduces the priority of a process) | idling (doing nothing) | waiting for IO (high indicates disk/network bottleneck) | hardware interrupts | software interrupts | time stolen from this virtual machine by the hypervisor to service other virtual machines |
Key | Display | Name | Description |
---|---|---|---|
a | PID | Process ID | Task/process identifier |
b | PPID | Parent PID | Task/process identifier of the process's parent (ie the process that launched this process) |
c | RUSER | Real User Name | Real username of task's owner |
d | UID | User ID | User ID of task's owner |
e | USER | User Name | Username of task's owner |
f | GROUP | Group Name | Group name of task's owner |
g | TTY | Controlling TTY | Device that started the process |
h | PR | Priority | The task's priority |
i | NI | Nice value | Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority |
j | P | Last Used CPU | ID of the CPU last used by the task |
k | %CPU | CPU Usage | Task's usage of CPU |
l | TIME | CPU Time | Total CPU time used by the task |
m | TIME+ | CPU Time, hundredths | Total CPU time used by the task, to hundredths of a second |
n | %MEM | Memory usage (RES) | Task's usage of available physical memory |
o | VIRT | Virtual Image (kb) | Task's allocation of virtual memory |
p | SWAP | Swapped size (kb) | Task's swapped memory (resident in swap-file) |
q | RES | Resident size (kb) | Task's unswapped memory (resident in physical memory) |
r | CODE | Code size (kb) | Task's virtual memory used for executable code |
s | DATA | Data+Stack size (kb) | Task's virtual memory not used for executable code |
t | SHR | Shared Mem size (kb) | Task's shared memory |
u | nFLT | Page Fault count | Major/Hard page faults that have occurred for the task |
v | nDRT | Dirty Pages count | Task's memory pages that have been modified since last write to disk, and so must be written to disk before they can be freed from physical memory |
w | S | Process Status | Task state: D (uninterruptible sleep), R (running), S (sleeping), T (traced or stopped), Z (zombie) |
x | Command | Command Line | Command used to start task |
y | WCHAN | Sleeping in Function | Name (or address) of function that the task is sleeping in |
z | Flags | Task Flags | Task's scheduling flags |
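In batch mode (top -b) the same columns are written as plain text, so the field names in the table above can be used to pick out the busiest tasks from a script. The following is a rough sketch; the function name busiest is made up for illustration, and it assumes the default column layout where the command is the last field.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): read "top -b -n 1" output on stdin
# and print "%CPU PID COMMAND" for the N busiest tasks (default 5).
busiest() {
    awk '
        $1 == "PID" {                       # header row: locate the %CPU column
            for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i
            next
        }
        col && NF >= col { print $col, $1, $NF }   # task rows follow the header
    ' | sort -k1,1nr | head -n "${1:-5}"
}
```

Typical use would be top -b -n 1 | busiest 3. Locating the column from the header row, rather than hard-coding its position, is intended to cope with differing field selections.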
Identify Process Causing Occasional High System Load
If the high load is constant, just fire up top and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
If the high load is transient but repetitive, then you'll need to capture the output of top at the right time. The following script will create a log of top output during periods of high load

    #!/bin/bash
    #
    # During high load, write output from top to file.
    #
    # Simon Strutt - July 2012
    LOGFILE="/home/user/load_log.txt"   # Update to a valid file path
    MAXLOAD=100                         # Load threshold multiplied by 100, as the 'if' comparison can only handle integers
    # Read the 1 minute load average and convert it to a x100 integer
    LOAD=$(awk '{ printf "%d", $1 * 100 }' /proc/loadavg)
    if [ "$LOAD" -gt "$MAXLOAD" ]; then
        # Timestamp the entry, then append a single batch-mode snapshot of top
        date '+%Y-%m-%d %H:%M:%S' >> "${LOGFILE}"
        top -b -n 1 >> "${LOGFILE}"
    fi

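Schedule with something like the following (update with the correct path to load_log)...

    crontab -e
    1 * * * * /bin/bash /home/user/load_log

Note that a schedule of 1 * * * * runs the script once an hour, at one minute past the hour. To check the load every minute, use * * * * * instead.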
vmstat
vmstat
is principally used for reporting on virtual memory statistics, for example vmstat 5 3
creates an output every 5 seconds for 3 iterations,

    user@server:~$ vmstat 5 3
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     0  0  42676 479556  34192 106944    5    5    31  3678   84   89  9  9 75  7
     0  0  42676 479548  34208 106948    0    0     0    10   50  105  6  0 88  5
     0  0  42676 479548  34216 106948    0    0     0    18   37   61  0  0 96  4

Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
Section | Procs | Procs | Memory | Memory | Memory | Memory | Swap | Swap | IO | IO | System | System | CPU | CPU | CPU | CPU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Code | r | b | swpd | free | buff | cache | si | so | bi | bo | in | cs | us | sy | id | wa |
Name | Run | Block | Swap (kB) | Free (kB) | Buffer (kB) | Cache (kB) | Swap In (kB/s) | Swap Out (kB/s) | Blocks In (blocks/s) | Blocks Out (blocks/s) | Interrupts (/s) | Context Switch (/s) | User (% time) | System (% time) | Idle (% time) | Wait (% time) |
Description | Processes waiting for run time | Processes in uninterruptible sleep (eg waiting for IO) | Virtual memory used | Unused memory | Memory used as buffers | Memory used as cache | Memory swapped in from disk | Memory swapped out to disk | Blocks in from a storage device | Blocks out from a storage device | Interrupts | Context switches | CPU running user processes | CPU running kernel processes | CPU idle | CPU waiting for IO |
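Since the first sample reports averages since boot, a script watching for IO pressure should discard it and only look at the later samples. The sketch below shows one way of doing that; the function name vmstat_io_pressure and the default threshold of 20 are assumptions for illustration, not part of vmstat.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): read vmstat output on stdin and
# print samples with blocked processes (b > 0) or IO wait at or above a limit.
# Usage: vmstat 5 3 | vmstat_io_pressure [wa_limit]
vmstat_io_pressure() {
    awk -v limit="${1:-20}" '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }  # column header row
        $1 !~ /^[0-9]+$/ || !("wa" in col) { next }                # banner or other non-data rows
        ++seen == 1 { next }                                       # first sample: averages since boot
        $(col["b"]) > 0 || $(col["wa"]) >= limit {
            print "b=" $(col["b"]), "wa=" $(col["wa"])
        }
    '
}
```

Columns are looked up by name from the header row because the set of CPU columns differs between vmstat versions.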
mpstat
mpstat reports on basic processor stats. It creates a timestamped output, which is useful to leave running on a console (or logged to a file) for when you hear or find out about service performance problems after the fact. A number of the metrics are also provided by vmstat, but are reported to a greater precision by mpstat.
It's not available by default, and comes as part of the sysstat package (to install, use apt-get install sysstat).
For example mpstat 5 3
creates an output every 5 seconds for 3 iterations,

    user@server:~# mpstat 5 3
    Linux 2.6.32-41-server (server)         25/07/12        _x86_64_        (1 CPU)

    11:50:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
    11:51:04     all    1.00    0.00    0.80    1.60    0.00    0.00    0.00    0.00   96.60
    11:51:09     all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00   92.40
    11:51:14     all   43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00   47.80
    Average:     all   16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00   78.93

Code | CPU | %usr | %nice | %sys | %iowait | %irq | %soft | %steal | %guest | %idle |
---|---|---|---|---|---|---|---|---|---|---|
Name | CPU No. | User (% util) | Nice (% util) | System (% util) | IO Wait (% time) | Hard IRQ (% time) | Soft IRQ (% time) | Steal (% time) | Guest In (% time) | Idle (% time) |
Description | CPU number (or all); set with the -P <n> option switch | CPU running user processes | CPU running nice (adjusted priority) user processes | CPU running kernel processes (excludes IRQs) | CPU waiting for (disk) IO | CPU servicing hardware interrupts | CPU servicing software interrupts | Virtual CPU wait due to CPU busy with other vCPU | CPU servicing vCPU(s) | CPU idle |
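When mpstat output has been logged to a file, the timestamps make it easy to pull out just the busy periods afterwards. A minimal sketch follows, assuming %idle is the last column (as in the output above); the function name mpstat_busy and the default limit of 50 are illustrative choices.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): read mpstat output on stdin and
# print the timestamp of each sample whose %idle is below a limit.
# Usage: mpstat 5 3 | mpstat_busy [idle_limit]
mpstat_busy() {
    awk -v limit="${1:-50}" '
        $NF == "%idle" { ok = 1; next }     # header row confirms %idle is the last column
        ok && NF && $1 != "Average:" && $NF + 0 < limit { print $1, "idle=" $NF }
    '
}
```

The Average: summary line is skipped so that only individual samples are reported.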
iostat
iostat
reports on IO (and CPU) stats.
It's not available by default, and comes as part of the sysstat package (to install, use apt-get install sysstat).
IO stats can be displayed either by device (default, and extra metrics with -x
switch) or by partition (-p
switch). Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
Device stats output...

    root@servername:~# iostat -x 5 3
    Linux 2.6.32-41-server (servername)     25/07/12        _x86_64_        (1 CPU)

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              11.56    0.54    2.17    6.67    0.00   79.06

    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
    sda              18.89     9.63    6.36    8.83   367.22   146.02    33.78     0.34   22.18   5.73   8.70
    dm-0              0.00     0.00    3.16   10.68   190.37    86.53    20.01     0.62   44.79   2.18   3.02
    dm-1              0.00     0.00   22.11    7.44   176.85    59.48     8.00     0.71   23.92   2.21   6.52

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              45.18    0.00    5.02    0.40    0.00   49.40

    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     4.82    0.20    3.82     1.61    65.86    16.80     0.02    4.50   4.00   1.61
    dm-0              0.00     0.00    0.20    8.23     1.61    65.86     8.00     0.07    7.86   1.90   1.61
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              95.80    0.00    4.20    0.00    0.00    0.00

    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00    11.80    0.00    8.20     0.00   156.80    19.12     0.06    7.07   0.24   0.20
    dm-0              0.00     0.00    0.00   19.60     0.00   156.80     8.00     0.18    8.98   0.10   0.20
    dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Partition stats output...

    root@servername:~# iostat -t -p 5 3
    Linux 2.6.32-41-server (servername)     30/07/12        _x86_64_        (1 CPU)

    30/07/12 12:05:15
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.33    0.32    0.12    0.27    0.00   98.96

    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    sda               1.12        13.57        14.71     721218     782038
    sda1              0.00         0.02         0.00        804         14
    sda2              0.00         0.00         0.00          4          0
    sda5              0.91        13.54        14.71     719994     782024
    dm-0              2.23        13.49        14.45     716850     768240
    dm-1              0.04         0.05         0.26       2632      13784

    30/07/12 12:05:20
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.00    0.00  100.00

    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    sda               1.20         0.00        14.40          0         72
    sda1              0.00         0.00         0.00          0          0
    sda2              0.00         0.00         0.00          0          0
    sda5              0.80         0.00        14.40          0         72
    dm-0              1.80         0.00        14.40          0         72
    dm-1              0.00         0.00         0.00          0          0

    30/07/12 12:05:25
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               0.00    0.00    0.00    0.00    0.00  100.00

    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    sda               0.00         0.00         0.00          0          0
    sda1              0.00         0.00         0.00          0          0
    sda2              0.00         0.00         0.00          0          0
    sda5              0.00         0.00         0.00          0          0
    dm-0              0.00         0.00         0.00          0          0
    dm-1              0.00         0.00         0.00          0          0

Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Device IO Stats | Partition IO Stats | Partition IO Stats | Partition IO Stats | Partition IO Stats | Partition IO Stats |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Code | rrqm/s | wrqm/s | r/s | w/s | rsec/s | wsec/s | avgrq-sz | avgqu-sz | svctm | %util | tps | Blk_read/s | Blk_wrtn/s | Blk_read | Blk_wrtn |
Name | Read Merge (/s) | Write Merge (/s) | Read (/s) | Write (/s) | Read (sectors/s) | Write (sectors/s) | Av. Req. Size (sectors) | Av. Queue Len. (requests) | Av. Service Time (msec) | Utilisation (% CPU Time) | Transfers (/s) | Read (blocks/s) | Write (blocks/s) | Read (blocks) | Write (blocks) |
Description | Read requests merged | Write requests merged | Read requests | Write requests | Sector reads | Sector writes | Average read/write request size | Average request queue length | Average time to service requests | Bandwidth utilisation / device saturation | IO transfer rate (TPS - Transfers Per Second) | Data read | Data write | Data read | Data write |
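To spot saturated devices in a longer capture of iostat -x output, the %util column can be filtered by a threshold, ignoring the first (since boot) report. This is a sketch under the assumption that %util is the last column, as in the output above; the function name iostat_hot and the default limit of 80 are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): read "iostat -x" output on stdin
# and print devices whose %util is at or above a limit, skipping the first report.
# Usage: iostat -x 5 3 | iostat_hot [util_limit]
iostat_hot() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ { report++; indev = ($NF == "%util"); next }  # device table header
        NF == 0 || $1 ~ /^avg-cpu/ { indev = 0; next }               # end of a device table
        indev && report > 1 && $NF + 0 >= limit { print $1, "util=" $NF }
    '
}
```

Each Device header row starts a new report, so counting those rows is enough to tell the since-boot report apart from the interval reports.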
When using the above tools you may be presented with disk/device names of dm-0, dm-1, etc., which won't mean much. These are LVM logical devices; to understand what they map to, use
lvdisplay|awk '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'
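Where the lvdisplay output format differs from what the awk pattern expects, the kernel's own device-mapper names can be read from sysfs instead. The following is a sketch only; the function name dm_names is made up for illustration, and it assumes the /sys/block/dm-N/dm/name files are present (they report names in device-mapper form, typically volumegroup-logicalvolume).

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative): list "dm-N name" pairs from sysfs.
# Usage: dm_names [sysfs_root]   (sysfs_root defaults to /sys)
dm_names() {
    local root="${1:-/sys}" f dev
    for f in "$root"/block/dm-*/dm/name; do
        [ -r "$f" ] || continue             # glob did not match: no dm devices
        dev=${f%/dm/name}                   # strip the trailing /dm/name
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
}
```

Called with no arguments it reads the live system; the optional root argument exists only so the function can be tried against a copy of the directory layout.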