Troubleshooting (Ubuntu)

== High System Load ==
The system load is normally expressed as the load average over the last 1, 5 and 15 minutes.
For example, the <code>uptime</code> command gives a single-line summary of system uptime and recent load:
<pre>
user@server:~$ uptime
14:28:49 up 9 days, 22:41,  1 user,  load average: 0.34, 0.36, 0.32
</pre>
So in the above, as of 14:28:49 the server has been up for 9 days, 22 hours and 41 minutes, has 1 user logged in, and the load averages for the past 1, 5 and 15 minutes are 0.34, 0.36 and 0.32 respectively.
The load average for a given period indicates how many processes, on average, were either running or in an uninterruptible (waiting for IO) state.  What counts as bad depends on your system: on a single-CPU system a load average greater than 1 could be considered bad, as there are more runnable processes than CPUs to service them.
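A quick way to put the load average in context is to compare it against the number of CPUs the system actually has; a rough sketch (the output values below are just illustrative):
<pre>
user@server:~$ nproc
2
user@server:~$ cat /proc/loadavg
0.34 0.36 0.32 1/137 6608
</pre>
If the 1 and 5 minute figures sit persistently above the <code>nproc</code> count, processes are queuing for CPU (or stuck waiting on IO) and it's worth digging further with <code>top</code>.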
=== <code>top</code> ===
The <code>top</code> command gives some basic insight into the system's performance, and is akin to the Task Manager in Windows.
<pre>
user@server:~$ top
top - 14:32:09 up 9 days, 22:44,  1 user,  load average: 0.70, 0.44, 0.34
Tasks: 137 total,  1 running, 136 sleeping,  0 stopped,  0 zombie
Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  1023360k total,  950520k used,    72840k free,    10836k buffers
Swap:  1757176k total,  1110228k used,  646948k free,  135524k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
6608 zimbra    20  0  556m  69m  12m S 69.1  6.9  0:03.26 java
17284 zimbra    20  0  649m 101m 3604 S  4.6 10.1  31:34.74 java
2610 zimbra    20  0  976m 181m 3700 S  0.7 18.1 133:06.68 java
    1 root      20  0 23580 1088  732 S  0.0  0.1  0:04.70 init
    2 root      20  0    0    0    0 S  0.0  0.0  0:00.01 kthreadd
    3 root      RT  0    0    0    0 S  0.0  0.0  0:00.00 migration/0
....
</pre>
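In the sample above the busy processes all belong to the <code>zimbra</code> user; <code>top</code> can be limited to a single user's tasks with <code>-u</code>, which can help narrow things down (a rough example, re-using the user from the output above):
<pre>
user@server:~$ top -b -n 1 -u zimbra
</pre>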
Note that CPU usage figures are with respect to a single CPU, so on a multi-CPU system seeing values > 100% is valid.
{|class="vwikitable"
|+ Overview of CPU Metrics, % over time
! Code  !! Name !! Description
|-
| <code>us</code> || User CPU || % of CPU time spent servicing user processes (excluding nice)
|-
| <code>sy</code> || System CPU || % of CPU time spent servicing kernel processes
|-
| <code>ni</code> || Nice CPU || % of CPU time spent servicing user nice processes (nice reduces the priority of a process)
|-
| <code>id</code> || Idle CPU || % of CPU time spent idling (doing nothing)
|-
| <code>wa</code> || IO Wait || % of CPU time spent waiting for IO (high indicates disk/network bottleneck)
|-
| <code>hi</code> || Hardware Interrupts || % of CPU time spent servicing hardware interrupts
|-
| <code>si</code> || Software Interrupts || % of CPU time spent servicing software interrupts
|-
| <code>st</code> || Steal || % of CPU time stolen by the hypervisor to service other virtual machines
|}
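To keep an eye on just these percentages without the full interactive display, the summary line can be pulled out of a single batch-mode run of <code>top</code> (the figures below are simply those from the earlier sample):
<pre>
user@server:~$ top -b -n 1 | grep '^Cpu(s)'
Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
</pre>
A consistently high <code>wa</code> figure points at a disk or network bottleneck rather than a lack of CPU.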
{|class="vwikitable"
|+ Task column heading descriptions (to change which columns are shown, press <code>f</code>)
! Key !! Display  !! Name !! Description
|-
| <code>a</code> || <code>PID</code> || Process ID || Task/process identifier
|-
| <code>b</code> || <code>PPID</code> || Parent PID || Task/process identifier of the process's parent (ie the process that launched this one)
|-
| <code>c</code> || <code>RUSER</code> || Real User Name || Real username of task's owner
|-
| <code>d</code> || <code>UID</code> || User ID || User ID of task's owner
|-
| <code>e</code> || <code>USER</code> || User Name || Effective username of task's owner
|-
| <code>f</code> || <code>GROUP</code> || Group Name || Group name of task's owner
|-
| <code>g</code> || <code>TTY</code> || Controlling TTY || The controlling terminal (device) the process was started from
|-
| <code>h</code> || <code>PR</code> || Priority || The task's priority
|-
| <code>i</code> || <code>NI</code> || Nice value || Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
|-
| <code>j</code> || <code>P</code> || Last Used CPU || ID of the CPU last used by the task
|-
| <code>k</code> || <code>%CPU</code> || CPU Usage || Task's usage of CPU
|-
| <code>l</code> || <code>TIME</code> || CPU Time || Total CPU time used by the task
|-
| <code>m</code> || <code>TIME+</code> || CPU Time, hundredths || Total CPU time used by the task in sub-second accuracy
|-
| <code>n</code> || <code>%MEM</code> || Memory usage (RES) || Task's usage of available physical memory
|-
| <code>o</code> || <code>VIRT</code> || Virtual Image (kb) || Task's allocation of virtual memory
|-
| <code>p</code> || <code>SWAP</code> || Swapped size (kb) || Task's swapped memory (resident in swap-file)
|-
| <code>q</code> || <code>RES</code> || Resident size (kb) || Task's unswapped memory (resident in physical memory)
|-
| <code>r</code> || <code>CODE</code> || Code size (kb) || Task's virtual memory used for executable code
|-
| <code>s</code> || <code>DATA</code> || Data+Stack size (kb) || Task's virtual memory not used for executable code
|-
| <code>t</code> || <code>SHR</code> || Shared Mem size (kb) || Task's shared memory
|-
| <code>u</code> || <code>nFLT</code> || Page Fault count || Major/Hard page faults that have occurred for the task
|-
| <code>v</code> || <code>nDRT</code> || Dirty Pages count || Task's memory pages that have been modified since last written to disk, and so must be written out before that physical memory can be reused
|-
| <code>w</code> || <code>S</code> || Process Status ||
* D - Uninterruptible sleep
* R - Running
* S - Sleeping
* T - Traced or Stopped
* Z - Zombie
|-
| <code>x</code> || <code>Command</code> || Command Line || Command used to start task
|-
| <code>y</code> || <code>WCHAN</code> || Sleeping in Function || Name (or address) of function that the task is sleeping in
|-
| <code>z</code> || <code>Flags</code> || Task Flags || Task's scheduling flags
|}
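Within an interactive <code>top</code> session the process list can be re-sorted to bring the likely culprit to the top: <code>Shift+P</code> sorts by CPU usage and <code>Shift+M</code> by memory usage. Once a suspect PID has been spotted, <code>top</code> can also be pointed at just that task to watch its columns over time (the PID below is simply one from the earlier sample):
<pre>
user@server:~$ top -p 2610
</pre>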
=== Identify Process Causing High System Load ===
If the high load is constant, just fire up <code>top</code> and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
If the high load is transient but repetitive, then you'll need to capture the output of <code>top</code> at the right time. The following script will create a log of <code>top</code> output during periods of high load:
<source lang="bash">#!/bin/bash
#
# During high load, write output from top to file.
#
# Simon Strutt - July 2012
LOGFILE="load_log.txt"
MAXLOAD=100                    # Multiplied by 100, as the if comparison can only handle integers
LOAD=`cut -d ' ' -f 1 /proc/loadavg`
LOAD=`echo $LOAD '*100' | bc -l | awk -F '.' '{ print $1; exit; }'`    # Convert load to x100 integer
if [ $LOAD -gt $MAXLOAD ]; then
        echo `date '+%Y-%m-%d %H:%M:%S'`>> ${LOGFILE}
        top -b -n 1 >> ${LOGFILE}
fi</source>
Schedule with something like...
<pre>crontab -e
1 * * * * /bin/bash  /home/simons/load_log</pre>
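Note that <code>LOGFILE</code> is a relative path, so the log ends up in whatever directory the script is run from (normally the user's home directory when launched from cron). To sample every minute rather than once an hour, the schedule can simply be tightened to <code>* * * * *</code>, and the captured output reviewed with something like:
<pre>tail -n 40 load_log.txt</pre>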
== Network ==
=== No NIC ===
[[Category:Ubuntu]]
[[Category:Troubleshooting]]
[[Category:Bash]]
