Troubleshooting (Ubuntu)
The load average for a given period indicates how many processes were running or in an uninterruptible (waiting for IO) state. What's bad depends on your system: for a single-CPU system a load average greater than 1 could be considered bad, as there are more processes running than CPUs to service them. Though if you expect peaks in load, a high load over the last minute might not be a concern, whereas over 15 minutes it would.
The problem with investigating performance issues is that you need to know what is normal, so you can determine what's wrong once application/service performance deteriorates. But it's unlikely that you would have paid much attention to underlying system metrics until things are already bad.
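A quick way to put the load average in context is to compare it with the number of CPUs available to service it; a minimal sketch, using the standard <code>/proc/loadavg</code> interface:

```shell
# The three numbers are the 1, 5 and 15 minute load averages --
# the same values reported by uptime and top
cut -d' ' -f1-3 /proc/loadavg

# A load average persistently above this CPU count suggests CPU contention
nproc
```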
=== <code>top</code> ===
Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
{|class="vwikitable-equal"
|+ Overview of CPU Metrics, % of CPU time spent on
! Code
! <code> us </code>
! <code> sy </code>
! <code> ni </code>
! <code> id </code>
! <code> wa </code>
! <code> hi </code>
! <code> si </code>
! <code> st </code>
|-
! Name
| User CPU
| System CPU
| Nice CPU
| Idle CPU
| IO Wait
| Hardware Interrupts
| Software Interrupts
| Steal
|-
! Description
| user processes (excluding nice)
| kernel processes
| user nice processes (nice reduces the priority of a process)
| idling (doing nothing)
| waiting for IO (high indicates a disk/network bottleneck)
| hardware interrupts
| software interrupts
| servicing virtual machines
|}
{|class="vwikitable"
|-
| <code>h</code> || <code>PR</code> || Priority || The task's priority
|-
| <code>i</code> || <code>NI</code> || Nice value || Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
|-
| <code>j</code> || <code>P</code> || Last Used CPU || ID of the CPU last used by the task
|-
| <code>t</code> || <code>SHR</code> || Shared Mem size (kb) || Task's shared memory
|-
| <code>u</code> || <code>nFLT</code> || Page Fault count || Major/Hard page faults that have occurred for the task
|-
| <code>v</code> || <code>nDRT</code> || Dirty Pages count || Task's memory pages that have been modified since last written to disk; dirty pages must be written out before that physical memory can be reused
|}
==== Identify Process Causing Occasional High System Load ====
If the high load is constant, just fire up <code>top</code> and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
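To capture what <code>top</code> sees without the interactive display (for logging or piping), its batch mode can be used; a sketch:

```shell
# -b: batch (non-interactive) mode; -n 1: a single iteration.
# Pipe through head to keep just the summary area and the busiest tasks.
top -b -n 1 | head -15
```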
To catch occasional load spikes, a logging script can be run periodically from cron, e.g.
<pre>1 * * * * /bin/bash /home/user/load_log</pre>
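The body of the <code>load_log</code> script is not shown above; a minimal sketch of what such a script might contain (the log path is an assumption, adjust to taste):

```shell
#!/bin/bash
# Hypothetical load_log implementation: append a timestamped snapshot of the
# load averages so occasional spikes can be traced after the fact.
LOG=/tmp/load_log.txt   # assumed location
printf '%s load: %s\n' "$(date '+%F %T')" "$(cut -d' ' -f1-3 /proc/loadavg)" >> "$LOG"
```

Extending the snapshot with the busiest processes (e.g. from <code>top -b -n 1</code>) makes the log far more useful for pinning blame.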
=== <code> vmstat </code> ===
[http://www.linuxcommand.org/man_pages/vmstat8.html <code>vmstat</code>] is principally used for reporting on virtual memory statistics. For example, <code> vmstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  42676 479556  34192 106944    5    5    31  3678   84   89  9  9 75  7
 0  0  42676 479548  34208 106948    0    0     0    10   50  105  6  0 88  5
 0  0  42676 479548  34216 106948    0    0     0    18   37   61  0  0 96  4</pre>
Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
{|class="vwikitable-equal"
|+ Overview of VMSTAT Metrics
! Section
! colspan="2"| Procs
! colspan="4"| Memory
! colspan="2"| Swap
! colspan="2"| IO
! colspan="2"| System
! colspan="4"| CPU
|-
! Code
! <code> r </code>
! <code> b </code>
! <code> swpd </code>
! <code> free </code>
! <code> buff </code>
! <code> cache </code>
! <code> si </code>
! <code> so </code>
! <code> bi </code>
! <code> bo </code>
! <code> in </code>
! <code> cs </code>
! <code> us </code>
! <code> sy </code>
! <code> id </code>
! <code> wa </code>
|-
! Name
| style="text-align: center;" | Run
| style="text-align: center;" | Block
| style="text-align: center;" | Swap<br>(kB)
| style="text-align: center;" | Free<br>(kB)
| style="text-align: center;" | Buffer<br>(kB)
| style="text-align: center;" | Cache<br>(kB)
| style="text-align: center;" | Swap In<br>(kB/s)
| style="text-align: center;" | Swap Out<br>(kB/s)
| style="text-align: center;" | Blocks In<br>(blocks/s)
| style="text-align: center;" | Blocks Out<br>(blocks/s)
| style="text-align: center;" | Interrupts<br>(/s)
| style="text-align: center;" | Context Switch<br>(/s)
| style="text-align: center;" | User<br>(% time)
| style="text-align: center;" | System<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
| style="text-align: center;" | Wait<br>(% time)
|-
! Description
| Processes waiting for run time
| Processes in uninterruptible sleep (e.g. waiting for IO)
| Virtual memory used
| Unused memory
| Memory used as buffers
| Memory used as cache
| Memory swapped in from disk
| Memory swapped out to disk
| Blocks in from a storage device
| Blocks out from a storage device
| Interrupts
| Context switches
| CPU running user processes
| CPU running kernel processes
| CPU idle
| CPU waiting for IO
|}
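The CPU columns above are derived from the cumulative counters in the first line of <code>/proc/stat</code>; the same split can be computed by hand over an interval, which is a useful cross-check when <code>vmstat</code> output looks surprising. A sketch, assuming a standard Linux <code>/proc</code>:

```shell
# First /proc/stat line: "cpu user nice system idle iowait irq softirq ..."
# (cumulative jiffies since boot). Sample twice and diff to get percentages.
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
echo "us: $(( 100 * (u2 - u1) / total ))%  sy: $(( 100 * (s2 - s1) / total ))%  id: $(( 100 * (i2 - i1) / total ))%  wa: $(( 100 * (w2 - w1) / total ))%"
```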
=== <code> mpstat </code> ===
[http://www.linuxcommand.org/man_pages/mpstat1.html <code>mpstat</code>] reports on basic processor stats. It produces timestamped output, which is useful to leave running on a console (or logged to a file) for when you hear about service performance problems after the fact. A number of the metrics are also provided by <code>[[#vmstat|vmstat]]</code>, but are reported to a greater accuracy by <code>mpstat</code>.
It's not available by default, and comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
For example, <code> mpstat 5 3 </code> produces output every 5 seconds for 3 iterations,
<pre>user@server:~# mpstat 5 3
Linux 2.6.32-41-server (server)     25/07/12     _x86_64_     (1 CPU)

11:50:59     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:51:04     all    1.00    0.00    0.80    1.60    0.00    0.00    0.00    0.00   96.60
11:51:09     all    4.60    0.00    0.40    2.60    0.00    0.00    0.00    0.00   92.40
11:51:14     all   43.20    0.00    6.00    3.00    0.00    0.00    0.00    0.00   47.80
Average:     all   16.27    0.00    2.40    2.40    0.00    0.00    0.00    0.00   78.93</pre>
{|class="vwikitable-equal"
|+ Overview of MPSTAT Metrics
! Code
! <code> CPU </code>
! <code> %usr </code>
! <code> %nice </code>
! <code> %sys </code>
! <code> %iowait </code>
! <code> %irq </code>
! <code> %soft </code>
! <code> %steal </code>
! <code> %guest </code>
! <code> %idle </code>
|-
! Name
| style="text-align: center;" | CPU No.
| style="text-align: center;" | User<br>(% util)
| style="text-align: center;" | Nice<br>(% util)
| style="text-align: center;" | System<br>(% util)
| style="text-align: center;" | IO Wait<br>(% time)
| style="text-align: center;" | Hard IRQ<br>(% time)
| style="text-align: center;" | Soft IRQ<br>(% time)
| style="text-align: center;" | Steal<br>(% time)
| style="text-align: center;" | Guest<br>(% time)
| style="text-align: center;" | Idle<br>(% time)
|-
! Description
| CPU number (or ''ALL'')<br>Set with <code>-P <n></code> option switch
| CPU running user processes
| CPU running nice (adjusted priority) user processes
| CPU running kernel processes (excludes [[Acronyms#I|IRQ]]s)
| CPU waiting for (disk) IO
| CPU servicing hardware interrupts
| CPU servicing software interrupts
| Virtual CPU wait due to CPU busy with other [[Acronyms#V|vCPU]]
| CPU servicing vCPU(s)
| CPU idle
|}
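Since <code>mpstat</code> is not part of a default install, scripts that depend on it should check for it first; a minimal guard (sketch):

```shell
# Fall back to a helpful message when the sysstat tools are missing
if command -v mpstat >/dev/null 2>&1; then
    mpstat 5 1          # one 5-second sample
else
    echo "mpstat not found; install it with: sudo apt-get install sysstat" >&2
fi
```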
=== <code> iostat </code> ===
[http://www.linuxcommand.org/man_pages/iostat1.html <code>iostat</code>] reports on IO (and CPU) stats.
It's not available by default, and comes as part of the [http://sebastien.godard.pagesperso-orange.fr/ <code>sysstat</code>] package (to install, use <code>apt-get install sysstat</code>).
IO stats can be displayed either by device (default, with extra metrics via the <code>-x</code> switch) or by partition (<code>-p</code> switch). Note that the first line of output contains average/total counts since system start, with subsequent output being for the period since the last line of output.
Device stats output...
<pre>root@servername:~# iostat -x 5 3
Linux 2.6.32-41-server (servername)     25/07/12     _x86_64_     (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.56    0.54    2.17    6.67    0.00   79.06

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              18.89     9.63    6.36    8.83   367.22   146.02    33.78     0.34   22.18   5.73   8.70
dm-0              0.00     0.00    3.16   10.68   190.37    86.53    20.01     0.62   44.79   2.18   3.02
dm-1              0.00     0.00   22.11    7.44   176.85    59.48     8.00     0.71   23.92   2.21   6.52

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          45.18    0.00    5.02    0.40    0.00   49.40

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     4.82    0.20    3.82     1.61    65.86    16.80     0.02    4.50   4.00   1.61
dm-0              0.00     0.00    0.20    8.23     1.61    65.86     8.00     0.07    7.86   1.90   1.61
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          95.80    0.00    4.20    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    11.80    0.00    8.20     0.00   156.80    19.12     0.06    7.07   0.24   0.20
dm-0              0.00     0.00    0.00   19.60     0.00   156.80     8.00     0.18    8.98   0.10   0.20
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00</pre>
Partition stats output...
<pre>root@servername:~# iostat -t -p 5 3
Linux 2.6.32-41-server (servername)     30/07/12     _x86_64_     (1 CPU)

30/07/12 12:05:15
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.33    0.32    0.12    0.27    0.00   98.96

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.12        13.57        14.71     721218     782038
sda1              0.00         0.02         0.00        804         14
sda2              0.00         0.00         0.00          4          0
sda5              0.91        13.54        14.71     719994     782024
dm-0              2.23        13.49        14.45     716850     768240
dm-1              0.04         0.05         0.26       2632      13784

30/07/12 12:05:20
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               1.20         0.00        14.40          0         72
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.80         0.00        14.40          0         72
dm-0              1.80         0.00        14.40          0         72
dm-1              0.00         0.00         0.00          0          0

30/07/12 12:05:25
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.00    0.00    0.00  100.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sda1              0.00         0.00         0.00          0          0
sda2              0.00         0.00         0.00          0          0
sda5              0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0</pre>
{|class="vwikitable-equal"
|+ Overview of IOSTAT Metrics
! Stats
! colspan="10"| Device IO Stats
! colspan="5"| Partition IO Stats
|-
! Code
! <code> rrqm/s </code>
! <code> wrqm/s </code>
! <code> r/s </code>
! <code> w/s </code>
! <code> rsec/s </code>
! <code> wsec/s </code>
! <code> avgrq-sz </code>
! <code> avgqu-sz </code>
! <code> svctm </code>
! <code> %util </code>
! <code> tps </code>
! <code> Blk_read/s </code>
! <code> Blk_wrtn/s </code>
! <code> Blk_read </code>
! <code> Blk_wrtn </code>
|-
! Name
| style="text-align: center;" | Read Merge<br>(/s)
| style="text-align: center;" | Write Merge<br>(/s)
| style="text-align: center;" | Read<br>(/s)
| style="text-align: center;" | Write<br>(/s)
| style="text-align: center;" | Read<br>(sectors/s)
| style="text-align: center;" | Write<br>(sectors/s)
| style="text-align: center;" | Av. Req. Size<br>(sectors)
| style="text-align: center;" | Av. Queue Len.<br>(requests)
| style="text-align: center;" | Av. Service Time<br>(msec)
| style="text-align: center;" | Utilisation<br>(% CPU Time)
| style="text-align: center;" | Transfers<br>(/s)
| style="text-align: center;" | Read<br>(blocks/s)
| style="text-align: center;" | Write<br>(blocks/s)
| style="text-align: center;" | Read<br>(blocks)
| style="text-align: center;" | Write<br>(blocks)
|-
! Description
| Read requests merged
| Write requests merged
| Read requests
| Write requests
| Sector reads
| Sector writes
| Average read/write request size
| Average request queue length
| Average time to service requests
| Bandwidth utilisation / device saturation
| IO transfer rate (TPS - Transfers Per Second)
| Data read
| Data write
| Data read
| Data write
|}
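The per-device counters that <code>iostat</code> reports come from <code>/proc/diskstats</code>, which can be inspected directly; a sketch (field 3 is the device name, field 4 reads completed, field 8 writes completed, all cumulative since boot):

```shell
# Print cumulative completed reads/writes per device, as raw counters
awk '{ printf "%-10s reads: %-12s writes: %s\n", $3, $4, $8 }' /proc/diskstats
```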
When using the above tools you may be presented with disk/device names of <code>dm-0</code>, <code>dm-1</code>, etc., which won't mean much. These are LVM logical devices; to understand what they map to, use
<pre>lvdisplay|awk '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'</pre>
== Network ==