Troubleshooting (Ubuntu)
The load average for a given period indicates how many processes were running or in an uninterruptible (waiting for IO) state. What's bad depends on your system: for a single-CPU system a load average greater than 1 could be considered bad, as there are more processes running than CPUs to service them. Though if you expect peaks in load, a high load over the last minute might not be a concern, whereas over 15 minutes it would.
The problem with investigating performance issues is that you need to know what is normal, so you can determine what's wrong once application/service performance deteriorates. But it's unlikely that you would have paid much attention to underlying system metrics until things are already bad.
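A quick way to put the load average in context is to compare it with the number of CPUs available to service it; a minimal sketch, using the standard <code>/proc/loadavg</code> interface:

```shell
# The three numbers are the 1, 5 and 15 minute load averages --
# the same values reported by uptime and top
cut -d' ' -f1-3 /proc/loadavg

# A load average persistently above this CPU count suggests CPU contention
nproc
```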
=== <code>top</code> ===
Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
{|class="vwikitable-equal"
|+ Overview of CPU Metrics, % of CPU time spent on
! Code
! <code> us </code>
! <code> sy </code>
! <code> ni </code>
! <code> id </code>
! <code> wa </code>
! <code> hi </code>
! <code> si </code>
! <code> st </code>
|-
! Name
| User CPU
| System CPU
| Nice CPU
| Idle CPU
| IO Wait
| Hardware Interrupts
| Software Interrupts
| Steal
|-
! Description
| user processes (excluding nice)
| kernel processes
| user nice processes (nice reduces the priority of a process)
| idling (doing nothing)
| waiting for IO (high indicates a disk/network bottleneck)
| hardware interrupts
| software interrupts
| servicing virtual machines
|}
{|class="vwikitable"
|-
| <code>h</code> || <code>PR</code> || Priority || The task's priority
|-
| <code>i</code> || <code>NI</code> || Nice value || Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
|-
| <code>j</code> || <code>P</code> || Last Used CPU || ID of the CPU last used by the task
|-
| <code>t</code> || <code>SHR</code> || Shared Mem size (kb) || Task's shared memory
|-
| <code>u</code> || <code>nFLT</code> || Page Fault count || Major/Hard page faults that have occurred for the task
|-
| <code>v</code> || <code>nDRT</code> || Dirty Pages count || Task's memory pages that have been modified since last written to disk; dirty pages must be written out before that physical memory can be reused
|}
==== Identify Process Causing Occasional High System Load ====
If the high load is constant, just fire up <code>top</code> and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
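To capture what <code>top</code> sees without the interactive display (for logging or piping), its batch mode can be used; a sketch:

```shell
# -b: batch (non-interactive) mode; -n 1: a single iteration.
# Pipe through head to keep just the summary area and the busiest tasks.
top -b -n 1 | head -15
```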
To catch occasional load spikes, a logging script can be run periodically from cron, e.g.
<pre>1 * * * * /bin/bash /home/user/load_log</pre>
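The body of the <code>load_log</code> script is not shown above; a minimal sketch of what such a script might contain (the log path is an assumption, adjust to taste):

```shell
#!/bin/bash
# Hypothetical load_log implementation: append a timestamped snapshot of the
# load averages so occasional spikes can be traced after the fact.
LOG=/tmp/load_log.txt   # assumed location
printf '%s load: %s\n' "$(date '+%F %T')" "$(cut -d' ' -f1-3 /proc/loadavg)" >> "$LOG"
```

Extending the snapshot with the busiest processes (e.g. from <code>top -b -n 1</code>) makes the log far more useful for pinning blame.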
=== <code> vmstat </code> ===
[http://www.linuxcommand.org/man_pages/vmstat8.html <code>vmstat</code>] is principally used for reporting on virtual memory statistics. For example, <code> vmstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  42676 479556  34192 106944    5    5    31  3678   84   89  9  9 75  7
 0  0  42676 479548  34208 106948    0    0     0    10   50  105  6  0 88  5
 0  0  42676 479548  34216 106948    0    0     0    18   37   61  0  0 96  4</pre>
Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
{|class="vwikitable-equal"
|+ Overview of VMSTAT Metrics
! Section
! colspan="2"| Procs
! colspan="4"| Memory
! colspan="2"| Swap
! colspan="2"| IO
! colspan="2"| System
! colspan="4"| CPU
|-
! Code
! <code> r </code>
! <code> b </code>
! <code> swpd </code>
! <code> free </code>
! <code> buff </code>
! <code> cache </code>
! <code> si </code>
! <code> so </code>
! <code> bi </code>
! <code> bo </code>
! <code> in </code>
! <code> cs </code>
! <code> us </code>
! <code> sy </code>
! <code> id </code>
! <code> wa </code>
|-
! Name
| style="text-align: center;" | Run
| style="text-align: center;" | Block
| style="text-align: center;" | Swap<br>(kB)
| style="text-align: center;" | Free<br>(kB)
| style="text-align: center;" | Buffer<br>(kB)
| style="text-align: center;" | Cache<br>(kB)
| style="text-align: center;" | Swap In<br>(kB/s)
| style="text-align: center;" | Swap Out<br>(kB/s)
| style="text-align: center;" | Blocks In<br>(blocks/s)
| style="text-align: center;" | Blocks Out<br>(blocks/s)
| style="text-align: center;" | Interrupts<br>(/s)
| style="text-align: center;" | Context Switch<br>(/s)
| style="text-align: center;" | User<br>(% time)
| style="text-align: center;" | System<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
| style="text-align: center;" | Wait<br>(% time)
|-
! Description
| Processes waiting for run time
| Processes in uninterruptible sleep (e.g. waiting for IO)
| Virtual memory used
| Unused memory
| Memory used as buffers
| Memory used as cache
| Memory swapped in from disk
| Memory swapped out to disk
| Blocks in from a storage device
| Blocks out from a storage device
| Interrupts
| Context switches
| CPU running user processes
| CPU running kernel processes
| CPU idle
| CPU waiting for IO
|}
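The CPU columns above are derived from the cumulative counters in the first line of <code>/proc/stat</code>; the same split can be computed by hand over an interval, which is a useful cross-check when <code>vmstat</code> output looks surprising. A sketch, assuming a standard Linux <code>/proc</code>:

```shell
# First /proc/stat line: "cpu user nice system idle iowait irq softirq ..."
# (cumulative jiffies since boot). Sample twice and diff to get percentages.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
echo "us: $(( 100 * (u2 - u1) / total ))%  sy: $(( 100 * (s2 - s1) / total ))%  id: $(( 100 * (i2 - i1) / total ))%  wa: $(( 100 * (w2 - w1) / total ))%"
```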
=== <code> mpstat </code> ===
[http://www.linuxcommand.org/man_pages/mpstat1.html <code>mpstat</code>] reports on basic processor stats. It produces timestamped output, which is useful to leave running on a console (or logged to a file) for when you hear about service performance problems after the fact. A number of the metrics are also provided by <code>[[#vmstat|vmstat]]</code>, but are reported to a greater accuracy by <code>mpstat</code>.
It's not available by default, and comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
For example, <code> mpstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~# mpstat 5 3
Linux 2.6.32-41-server (server)     25/07/12     _x86_64_     (1 CPU)

11:50:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:51:04     all    1.00    0.00    0.80    1.60    0.00    0.00    0.00    0.00   96.60
11:51:09     all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00   92.40
11:51:14     all   43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00   47.80
Average:     all   16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00   78.93</pre>
{|class="vwikitable-equal"
|+ Overview of MPSTAT Metrics
! Code
! <code> CPU </code>
! <code> %usr </code>
! <code> %nice </code>
! <code> %sys </code>
! <code> %iowait </code>
! <code> %irq </code>
! <code> %soft </code>
! <code> %steal </code>
! <code> %guest </code>
! <code> %idle </code>
|-
! Name
| style="text-align: center;" | CPU No.
| style="text-align: center;" | User<br>(% util)
| style="text-align: center;" | Nice<br>(% util)
| style="text-align: center;" | System<br>(% util)
| style="text-align: center;" | IO Wait<br>(% time)
| style="text-align: center;" | Hard IRQ<br>(% time)
| style="text-align: center;" | Soft IRQ<br>(% time)
| style="text-align: center;" | Steal<br>(% time)
| style="text-align: center;" | Guest<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
|-
! Description
| CPU number (or ''ALL'')<br>Set with <code>-P <n></code> option switch
| CPU running user processes
| CPU running nice (adjusted priority) user processes
| CPU running kernel processes (excludes [[Acronyms#I|IRQ]]s)
| CPU waiting for (disk) IO
| CPU servicing hardware interrupts
| CPU servicing software interrupts
| Virtual CPU wait due to CPU busy with other [[Acronyms#V|vCPU]]
| CPU servicing vCPU(s)
| CPU idle
|}
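Since <code>mpstat</code> is not part of a default install, scripts that depend on it should check for it first; a minimal guard (sketch):

```shell
# Fall back to a helpful message when the sysstat tools are missing
if command -v mpstat >/dev/null 2>&1; then
    mpstat 5 1          # one 5-second sample
else
    echo "mpstat not found; install it with: sudo apt-get install sysstat" >&2
fi
```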
=== <code> iostat </code> ===
[http://www.linuxcommand.org/man_pages/iostat1.html <code>iostat</code>] reports on IO (and CPU) stats.
It's not available by default, and comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
IO stats can be displayed either by device (default, with extra metrics via the <code>-x</code> switch) or by partition (<code>-p</code> switch). Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
Device stats output...
<pre>root@servername:~# iostat -x 5 3
Linux 2.6.32-41-server (servername)     25/07/12     _x86_64_     (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.56    0.54    2.17    6.67    0.00   79.06

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              18.89     9.63    6.36    8.83   367.22   146.02    33.78     0.34   22.18   5.73   8.70
dm-0              0.00     0.00    3.16   10.68   190.37    86.53    20.01     0.62   44.79   2.18   3.02
dm-1              0.00     0.00   22.11    7.44   176.85    59.48     8.00     0.71   23.92   2.21   6.52

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          45.18    0.00    5.02    0.40    0.00   49.40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.82    0.20    3.82     1.61    65.86    16.80     0.02    4.50   4.00   1.61
dm-0              0.00     0.00    0.20    8.23     1.61    65.86     8.00     0.07    7.86   1.90   1.61
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          95.80    0.00    4.20    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    11.80    0.00    8.20     0.00   156.80    19.12     0.06    7.07   0.24   0.20
dm-0              0.00     0.00    0.00   19.60     0.00   156.80     8.00     0.18    8.98   0.10   0.20
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00</pre>
Partition stats output...
<pre>root@servername:~# iostat -t -p 5 3
Linux 2.6.32-41-server (servername)     30/07/12     _x86_64_     (1 CPU)

30/07/12 12:05:15
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.33    0.32    0.12    0.27    0.00   98.96

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.12        13.57        14.71     721218     782038
sda1              0.00         0.02         0.00        804         14
sda2              0.00         0.00         0.00          4          0
sda5              0.91        13.54        14.71     719994     782024
dm-0              2.23        13.49        14.45     716850     768240
dm-1              0.04         0.05         0.26       2632      13784

30/07/12 12:05:20
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.20         0.00        14.40          0         72
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.80         0.00        14.40          0         72
dm-0              1.80         0.00        14.40          0         72
dm-1              0.00         0.00         0.00          0          0

30/07/12 12:05:25
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0</pre>
{|class="vwikitable-equal"
|+ Overview of IOSTAT Metrics
! Stats
! colspan="10"| Device IO Stats
! colspan="5"| Partition IO Stats
|-
! Code
! <code> rrqm/s </code>
! <code> wrqm/s </code>
! <code> r/s </code>
! <code> w/s </code>
! <code> rsec/s </code>
! <code> wsec/s </code>
! <code> avgrq-sz </code>
! <code> avgqu-sz </code>
! <code> svctm </code>
! <code> %util </code>
! <code> tps </code>
! <code> Blk_read/s </code>
! <code> Blk_wrtn/s </code>
! <code> Blk_read </code>
! <code> Blk_wrtn </code>
|-
! Name
| style="text-align: center;" | Read Merge<br>(/s)
| style="text-align: center;" | Write Merge<br>(/s)
| style="text-align: center;" | Read<br>(/s)
| style="text-align: center;" | Write<br>(/s)
| style="text-align: center;" | Read<br>(sectors/s)
| style="text-align: center;" | Write<br>(sectors/s)
| style="text-align: center;" | Av. Req. Size<br>(sectors)
| style="text-align: center;" | Av. Queue Len.<br>(requests)
| style="text-align: center;" | Av. Service Time<br>(msec)
| style="text-align: center;" | Utilisation<br>(% CPU Time)
| style="text-align: center;" | Transfers<br>(/s)
| style="text-align: center;" | Read<br>(blocks/s)
| style="text-align: center;" | Write<br>(blocks/s)
| style="text-align: center;" | Read<br>(blocks)
| style="text-align: center;" | Write<br>(blocks)
|-
! Description
| Read requests merged
| Write requests merged
| Read requests
| Write requests
| Sector reads
| Sector writes
| Average read/write request size
| Average request queue length
| Average time to service requests
| Bandwidth utilisation / device saturation
| IO transfer rate (TPS - Transfers Per Second)
| Data read
| Data write
| Data read
| Data write
|}
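The per-device counters that <code>iostat</code> reports come from <code>/proc/diskstats</code>, which can be inspected directly; a sketch (field 3 is the device name, field 4 reads completed, field 8 writes completed, all cumulative since boot):

```shell
# Print cumulative completed reads/writes per device, as raw counters
awk '{ printf "%-10s reads: %-12s writes: %s\n", $3, $4, $8 }' /proc/diskstats
```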
When using the above tools you may be presented with disk/device names of <code>dm-0</code>, <code>dm-1</code>, etc., which won't mean much. These are LVM logical devices; to understand what they map to, use
<pre>lvdisplay|awk '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'</pre>
== Network ==