Troubleshooting (Ubuntu)

== High System Load ==
The load average for a given period indicates how many processes were running or in an uninterruptible (waiting for IO) state.  What counts as bad depends on your system: on a single-CPU system a load average greater than 1 could be considered bad, as there are more processes running than CPUs to service them.  Though if you expect peaks in load, a high load over the last minute might not be a concern, whereas a high load over the last 15 minutes would be.
The problem with investigating performance issues is that you need to know what is normal, so you can determine what's wrong once application/service performance deteriorates.  But it's unlikely that you would have paid much attention to underlying system metrics until things are already bad.
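As a quick check against the rule of thumb above, the three load averages can be compared directly to the CPU count (a minimal sketch; <code>/proc/loadavg</code> and <code>nproc</code> are standard on Ubuntu):

```shell
# The first three fields of /proc/loadavg are the 1, 5 and 15 minute load averages
cat /proc/loadavg

# Flag when the 15-minute average exceeds the number of CPUs available
awk -v cpus="$(nproc)" '$3+0 > cpus+0 { print "15min load", $3, "exceeds", cpus, "CPU(s)" }' /proc/loadavg
```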


=== <code>top</code> ===
Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
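For logging or scripting, <code>top</code> can also run non-interactively in batch mode:

```shell
# One batch-mode iteration; output is plain text, so it can be redirected to a file
top -b -n 1 | head -n 12
```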


{|class="vwikitable-equal"
|+ Overview of CPU Metrics, % of CPU time spent on
! Code
! <code> us </code>
! <code> sy </code>
! <code> ni </code>
! <code> id </code>
! <code> wa </code>
! <code> ha </code>
! <code> si </code>
! <code> st </code>
|-
! Name
| User CPU
| System CPU
| Nice CPU
| Idle CPU
| IO Wait
| Hardware Interrupts
| Software Interrupts
| Steal
|-
! Description
| user processes (excluding nice)
| kernel processes
| user nice processes (nice reduces the priority of a process)
| idling (doing nothing)
| waiting for IO (high indicates disk/network bottleneck)
| hardware interrupts
| software interrupts
| servicing virtual machines
|}


{|class="vwikitable"
| <code>h</code> || <code>PR</code> || Priority || The task's priority
|-
| <code>i</code> || <code>NI</code> || Nice value || Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
|-
| <code>j</code> || <code>P</code> || Last Used CPU || ID of the CPU last used by the task
|-
| <code>t</code> || <code>SHR</code> || Shared Mem size (kb) || Task's shared memory
|-
| <code>u</code> || <code>nFLT</code> || Page Fault count || Major/Hard page faults that have occurred for the task
|-
| <code>v</code> || <code>nDRT</code> || Dirty Pages count || Task's memory pages that have been modified since last written to disk, and so must be written out before that physical memory can be freed
|}




==== Identify Process Causing Occasional High System Load ====
If the high load is constant, just fire up <code>top</code> and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.


1 * * * * /bin/bash  /home/user/load_log</pre>
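The contents of the <code>load_log</code> script itself are not shown above; a minimal sketch (the log path and the choice of <code>ps</code> columns are assumptions) might be:

```shell
#!/bin/bash
# Hypothetical load_log script: append a timestamped load snapshot to a log file
LOG=/home/user/load.log   # assumed log location

{
  date '+%F %T'
  cat /proc/loadavg
  # The five busiest processes by CPU at this instant
  ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 6
  echo
} >> "$LOG"
```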


=== <code> vmstat </code> ===
[http://www.linuxcommand.org/man_pages/vmstat8.html <code>vmstat</code>] is principally used for reporting on virtual memory statistics. For example, <code> vmstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b  swpd  free  buff  cache  si  so    bi    bo  in  cs us sy id wa
0  0  42676 479556  34192 106944    5    5    31  3678  84  89  9  9 75  7
0  0  42676 479548  34208 106948    0    0    0    10  50  105  6  0 88  5
0  0  42676 479548  34216 106948    0    0    0    18  37  61  0  0 96  4 </pre>
 
Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
 
{|class="vwikitable-equal"
|+ Overview of VMSTAT Metrics
! Section
! colspan="2"| Procs
! colspan="4"| Memory
! colspan="2"| Swap
! colspan="2"| IO
! colspan="2"| System
! colspan="4"| CPU
|-
! Code
! <code> r </code>
! <code> b </code>
! <code> swpd </code>
! <code> free </code>
! <code> buff </code>
! <code> cache </code>
! <code> si </code>
! <code> so </code>
! <code> bi </code>
! <code> bo </code>
! <code> in </code>
! <code> cs </code>
! <code> us </code>
! <code> sy </code>
! <code> id </code>
! <code> wa </code>
|-
! Name
| style="text-align: center;" | Run
| style="text-align: center;" | Block
| style="text-align: center;" | Swap<br>(kB)
| style="text-align: center;" | Free<br>(kB)
| style="text-align: center;" | Buffer<br>(kB)
| style="text-align: center;" | Cache<br>(kB)
| style="text-align: center;" | Swap In<br>(kB/s)
| style="text-align: center;" | Swap Out<br>(kB/s)
| style="text-align: center;" | Blocks In<br>(blocks/s)
| style="text-align: center;" | Blocks Out<br>(blocks/s)
| style="text-align: center;" | Interrupts<br>(/s)
| style="text-align: center;" | Context Switch<br>(/s)
| style="text-align: center;" | User<br>(% time)
| style="text-align: center;" | System<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
| style="text-align: center;" | Wait<br>(% time)
|-
! Description
| Processes waiting for run time
| Processes in uninterruptible sleep (eg waiting for IO)
| Virtual memory used
| Unused memory
| Memory used as buffers
| Memory used as cache
| Memory swapped in from disk
| Memory swapped out to disk
| Blocks in from a storage device
| Blocks out from a storage device
| Interrupts
| Context switches
| CPU running user processes
| CPU running kernel processes
| CPU idle
| CPU waiting for IO
|}
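Older <code>vmstat</code> builds have no timestamp option, so when logging to a file for later correlation it can help to prefix each line yourself (a sketch; the log filename is arbitrary):

```shell
# Prefix every vmstat line with a timestamp and append to a log
# (runs until interrupted with Ctrl-C)
vmstat 5 | while read -r line; do
  printf '%s %s\n' "$(date '+%F %T')" "$line"
done >> vmstat.log
```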
 
=== <code> mpstat </code> ===
[http://www.linuxcommand.org/man_pages/mpstat1.html <code>mpstat</code>] reports on basic processor stats.  It creates timestamped output, which is useful to leave running on a console (or logged to a file) for when you hear about service performance problems after the fact. A number of the metrics are also provided by <code>[[#vmstat|vmstat]]</code>, but are reported to greater accuracy by <code>mpstat</code>.
 
It's not installed by default; it comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
 
For example, <code> mpstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~# mpstat 5 3
Linux 2.6.32-41-server (server)  25/07/12        _x86_64_        (1 CPU)
 
11:50:59    CPU    %usr  %nice    %sys %iowait    %irq  %soft  %steal  %guest  %idle
11:51:04    all    1.00    0.00    0.80    1.60    0.00    0.00    0.00    0.00  96.60
11:51:09    all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00  92.40
11:51:14    all  43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00  47.80
Average:    all  16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00  78.93 </pre>
 
{|class="vwikitable-equal"
|+ Overview of MPSTAT Metrics
! Code
! <code> CPU </code>
! <code> %usr </code>
! <code> %nice </code>
! <code> %sys </code>
! <code> %iowait </code>
! <code> %irq </code>
! <code> %soft </code>
! <code> %steal </code>
! <code> %guest </code>
! <code> %idle </code>
|-
! Name
| style="text-align: center;" | CPU No.
| style="text-align: center;" | User<br>(% util)
| style="text-align: center;" | Nice<br>(% util)
| style="text-align: center;" | System<br>(% util)
| style="text-align: center;" | IO Wait<br>(% time)
| style="text-align: center;" | Hard IRQ<br>(% time)
| style="text-align: center;" | Soft IRQ<br>(% time)
| style="text-align: center;" | Steal<br>(% time)
| style="text-align: center;" | Guest<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
|-
! Description
| CPU number (or ''ALL'')<br>Set with <code>-P <n></code> option switch
| CPU running user processes
| CPU running nice (adjusted priority) user processes
| CPU running kernel processes (excludes [[Acronyms#I|IRQ]]s)
| CPU waiting for (disk) IO
| CPU servicing hardware interrupts
| CPU servicing software interrupts
| Virtual CPU wait due to CPU busy with other [[Acronyms#V|vCPU]]
| CPU servicing vCPU(s)
| CPU idle
|}
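On a multi-CPU system, the <code>-P ALL</code> form of the switch described above breaks the same metrics out per CPU, which helps spot a single core pegged by one process:

```shell
# Per-CPU breakdown (plus the "all" summary row), every 5 seconds for 3 iterations
mpstat -P ALL 5 3
```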
 
=== <code> iostat </code> ===
[http://www.linuxcommand.org/man_pages/iostat1.html <code>iostat</code>] reports on IO (and CPU) stats.
 
It's not installed by default; it comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
 
IO stats can be displayed either by device (default, and extra metrics with <code>-x</code> switch) or by partition (<code>-p</code> switch).  Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
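Output can also be narrowed to the device of interest; for example (<code>sda</code> is a placeholder device name, and <code>-d</code> drops the CPU section):

```shell
# Extended stats for one device only, every 5 seconds for 3 iterations
iostat -dx sda 5 3
```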
 
Device stats output...
<pre>root@servername:~# iostat -x 5 3
Linux 2.6.32-41-server (servername)  25/07/12        _x86_64_        (1 CPU)
 
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          11.56    0.54    2.17    6.67    0.00  79.06
 
Device:        rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda              18.89    9.63    6.36    8.83  367.22  146.02    33.78    0.34  22.18  5.73  8.70
dm-0              0.00    0.00    3.16  10.68  190.37    86.53    20.01    0.62  44.79  2.18  3.02
dm-1              0.00    0.00  22.11    7.44  176.85    59.48    8.00    0.71  23.92  2.21  6.52
 
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          45.18    0.00    5.02    0.40    0.00  49.40
 
Device:        rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda              0.00    4.82    0.20    3.82    1.61    65.86    16.80    0.02    4.50  4.00  1.61
dm-0              0.00    0.00    0.20    8.23    1.61    65.86    8.00    0.07    7.86  1.90  1.61
dm-1              0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  0.00  0.00
 
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          95.80    0.00    4.20    0.00    0.00    0.00

Device:        rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda              0.00    11.80    0.00    8.20    0.00  156.80    19.12    0.06    7.07  0.24  0.20
dm-0              0.00    0.00    0.00  19.60    0.00  156.80    8.00    0.18    8.98  0.10  0.20
dm-1              0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  0.00  0.00 </pre>
 
Partition stats output...
<pre>root@servername:~# iostat -t -p 5 3
Linux 2.6.32-41-server (servername)  30/07/12        _x86_64_        (1 CPU)
 
30/07/12 12:05:15
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          0.33    0.32    0.12    0.27    0.00  98.96
 
Device:            tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda              1.12        13.57        14.71    721218    782038
sda1              0.00        0.02        0.00        804        14
sda2              0.00        0.00        0.00          4          0
sda5              0.91        13.54        14.71    719994    782024
dm-0              2.23        13.49        14.45    716850    768240
dm-1              0.04        0.05        0.26      2632      13784
 
30/07/12 12:05:20
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          0.00    0.00    0.00    0.00    0.00  100.00
 
Device:           tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda              1.20        0.00        14.40          0        72
sda1              0.00        0.00        0.00          0          0
sda2              0.00        0.00        0.00          0          0
sda5              0.80        0.00        14.40          0        72
dm-0              1.80        0.00        14.40          0        72
dm-1              0.00        0.00        0.00          0          0
 
30/07/12 12:05:25
avg-cpu:  %user  %nice %system %iowait  %steal  %idle
          0.00    0.00    0.00    0.00    0.00  100.00
 
Device:            tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda              0.00        0.00        0.00          0          0
sda1              0.00        0.00        0.00          0          0
sda2              0.00        0.00        0.00          0          0
sda5              0.00        0.00        0.00          0          0
dm-0              0.00        0.00        0.00          0          0
dm-1              0.00        0.00        0.00          0          0</pre>
 
{|class="vwikitable-equal"
|+ Overview of IOSTAT Device Metrics
! Stats
! colspan="11"| Device IO Stats
! colspan="5"| Partition IO Stats
|-
! Code
! <code> rrqm/s </code>
! <code> wrqm/s </code>
! <code> r/s </code>
! <code> w/s </code>
! <code> rsec/s </code>
! <code> wsec/s </code>
! <code> avgrq-sz </code>
! <code> avgqu-sz </code>
! <code> await </code>
! <code> svctm </code>
! <code> %util </code>
! <code> tps </code>
! <code> Blk_read/s </code>
! <code> Blk_wrtn/s </code>
! <code> Blk_read </code>
! <code> Blk_wrtn </code>
|-
! Name
| style="text-align: center;" | Read Merge<br>(/s)
| style="text-align: center;" | Write Merge<br>(/s)
| style="text-align: center;" | Read<br>(/s)
| style="text-align: center;" | Write<br>(/s)
| style="text-align: center;" | Read<br>(sectors/s)
| style="text-align: center;" | Write<br>(sectors/s)
| style="text-align: center;" | Av. Req. Size<br>(sectors)
| style="text-align: center;" | Av. Queue Len.<br>(requests)
| style="text-align: center;" | Av. Wait Time<br>(msec)
| style="text-align: center;" | Av. Service Time<br>(msec)
| style="text-align: center;" | Utilisation<br>(% CPU Time)
| style="text-align: center;" | Transfers<br>(/s)
| style="text-align: center;" | Read<br>(blocks/s)
| style="text-align: center;" | Write<br>(blocks/s)
| style="text-align: center;" | Read<br>(blocks)
| style="text-align: center;" | Write<br>(blocks)
|-
! Description
| Read requests merged
| Write requests merged
| Read requests
| Write requests
| Sector reads
| Sector writes
| Average read/write request size
| Average request queue length
| Average time for requests to be served (queueing plus service time)
| Average time to service requests
| Bandwidth utilisation / device saturation
| IO transfer rate (TPS - Transfers Per Second)
| Data read
| Data write
| Data read
| Data write
|}


When using the above tools you may be presented with disk/device names of <code>dm-0</code>, <code>dm-1</code>, etc., which won't mean much.  These are LVM logical devices; to see what they map to, use
<pre>lvdisplay|awk  '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'</pre>
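Alternatively (assuming a kernel exposing device-mapper entries in sysfs), the mapping can be read without any LVM tools:

```shell
# Print each dm-N device alongside its device-mapper name from sysfs
for d in /sys/block/dm-*; do
  [ -e "$d" ] || continue          # skip when no dm devices exist
  printf '%s %s\n' "${d##*/}" "$(cat "$d/dm/name")"
done
```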


== Network ==
