Troubleshooting (Ubuntu)
Revision as of 15:28, 16 July 2012
High System Load
The system load is normally represented by the load average over the last 1, 5 and 15 minutes.
For example, the uptime command gives a single line summary of system uptime and recent load
user@server:~$ uptime
 14:28:49 up 9 days, 22:41,  1 user,  load average: 0.34, 0.36, 0.32
So in the above, as of 14:28:49 the server has been up for a little over 9 days and 22 hours, has 1 user logged in, and the system load averages for the past 1, 5, and 15 minutes are shown.
The load average for a given period indicates how many processes were, on average, running, waiting to run, or in an uninterruptible (typically waiting for IO) state. What counts as bad depends on your system. For a single CPU system, a load average greater than 1 could be considered bad, as there are more processes wanting to run than CPUs to service them. If you expect peaks in load, a high load over the last minute might not be a concern, whereas a high load over 15 minutes would be.
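As a rough way to put the load average in context, the 1 minute figure can be compared against the number of CPUs. The following is only a sketch: load_per_cpu_pct is a hypothetical helper name, and it assumes /proc/loadavg and nproc are available, as on a typical Ubuntu install.

```shell
#!/bin/bash
# Sketch: express a load average as a percentage of the available CPUs.
# load_per_cpu_pct is a hypothetical helper, not a standard tool.
load_per_cpu_pct() {
    # $1 = load average (eg 0.34), $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%d\n", (load * 100) / cpus }'
}

if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg      # first field is the 1 minute average
    cpus=$(nproc)
    echo "1 min load is $(load_per_cpu_pct "$load1" "$cpus")% of ${cpus} CPU(s)"
fi
```

A figure persistently above 100% suggests there is more work queued than the CPUs can service.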
top
The top command allows some basic insight into the system's performance, and is akin to the Task Manager in Windows. It probably won't tell you what the problem is, but it should let you home in on the process(es) that are causing grief.
user@server:~$ top
top - 14:32:09 up 9 days, 22:44,  1 user,  load average: 0.70, 0.44, 0.34
Tasks: 137 total,   1 running, 136 sleeping,   0 stopped,   0 zombie
Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1023360k total,   950520k used,    72840k free,    10836k buffers
Swap:  1757176k total,  1110228k used,   646948k free,   135524k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6608 zimbra    20   0  556m  69m  12m S 69.1  6.9   0:03.26 java
17284 zimbra    20   0  649m 101m 3604 S  4.6 10.1  31:34.74 java
 2610 zimbra    20   0  976m 181m 3700 S  0.7 18.1 133:06.68 java
    1 root      20   0 23580 1088  732 S  0.0  0.1   0:04.70 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.01 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
....
Note that CPU metrics are with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
Code | Name | Description |
---|---|---|
us | User CPU | % of CPU time spent servicing user processes (excluding nice) |
sy | System CPU | % of CPU time spent servicing kernel processes |
ni | Nice CPU | % of CPU time spent servicing user nice processes (nice reduces the priority of a process) |
id | Idle CPU | % of CPU time spent idling (doing nothing) |
wa | IO Wait | % of CPU time spent waiting for IO (a high value indicates a disk/network bottleneck) |
hi | Hardware Interrupts | % of CPU time spent servicing hardware interrupts |
si | Software Interrupts | % of CPU time spent servicing software interrupts |
st | Steal | % of CPU time stolen from this virtual machine by the hypervisor |
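If a single value from the CPU summary line is wanted in a script, it can be picked out by its code. This is a sketch only: cpu_state is a hypothetical helper, and it assumes the older "93.8%us" layout shown in the sample output above (newer versions of top lay the line out differently).

```shell
#!/bin/bash
# Sketch: pull one CPU state (us, sy, wa, ...) out of a top summary line.
# cpu_state is a hypothetical helper; it assumes the "93.8%us" layout
# used in the sample output in this article.
cpu_state() {
    # $1 = the "Cpu(s): ..." line, $2 = state code such as wa
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/.*[: ]\([0-9.]*\)%$2\$/\1/p"
}

sample='Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st'
echo "User CPU: $(cpu_state "$sample" us)%"
echo "IO wait:  $(cpu_state "$sample" wa)%"
```

On a live system the line could be captured from top -b -n 1 in place of the sample string.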
Key | Display | Name | Description |
---|---|---|---|
a | PID | Process ID | Task/process identifier |
b | PPID | Parent PID | Task/process identifier of the process's parent (ie the process that launched this process) |
c | RUSER | Real User Name | Real username of task's owner |
d | UID | User ID | User ID of task's owner |
e | USER | User Name | Username of task's owner |
f | GROUP | Group Name | Group name of task's owner |
g | TTY | Controlling TTY | Device that started the process |
h | PR | Priority | The task's priority |
i | NI | Nice value | Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority |
j | P | Last Used CPU | ID of the CPU last used by the task |
k | %CPU | CPU Usage | Task's usage of CPU |
l | TIME | CPU Time | Total CPU time used by the task |
m | TIME+ | CPU Time, hundredths | Total CPU time used by the task, to hundredths of a second |
n | %MEM | Memory usage (RES) | Task's usage of available physical memory |
o | VIRT | Virtual Image (kb) | Task's allocation of virtual memory |
p | SWAP | Swapped size (kb) | Task's swapped memory (resident in swap-file) |
q | RES | Resident size (kb) | Task's unswapped memory (resident in physical memory) |
r | CODE | Code size (kb) | Task's virtual memory used for executable code |
s | DATA | Data+Stack size (kb) | Task's virtual memory not used for executable code |
t | SHR | Shared Mem size (kb) | Task's shared memory |
u | nFLT | Page Fault count | Major/hard page faults that have occurred for the task |
v | nDRT | Dirty Pages count | Task's memory pages that have been modified since the last write to disk, and so must be written out before the physical memory can be reused |
w | S | Process Status | |
x | Command | Command Line | Command used to start task |
y | WCHAN | Sleeping in Function | Name (or address) of function that the task is sleeping in |
z | Flags | Task Flags | Task's scheduling flags |
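Because the load average counts tasks in an uninterruptible state as well as running ones, it can be useful to list just those tasks. The sketch below assumes the status code D means uninterruptible sleep, as documented in the ps and top man pages; uninterruptible_tasks is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list tasks whose status starts with D (uninterruptible sleep).
# uninterruptible_tasks is a hypothetical helper that reads
# "PID STAT COMMAND" lines on stdin and prints the PID and command name.
uninterruptible_tasks() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

if command -v ps >/dev/null 2>&1; then
    # The trailing = on each field suppresses the header line
    ps -eo pid=,stat=,comm= | uninterruptible_tasks
fi
```

Tasks that show up here repeatedly are usually the ones stuck waiting on disk or network IO.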
Identify Process Causing Occasional High System Load
If the high load is constant, just fire up top and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
If the high load is transient but repetitive, then you'll need to capture the output of top at the right time. The following script will create a log of top output during periods of high load
#!/bin/bash
#
# During high load, write output from top to file.
#
# Simon Strutt - July 2012
LOGFILE="/home/user/load_log.txt"   # Update to a valid path
MAXLOAD=100                         # Threshold multiplied by 100, as the 'if' comparison can only handle integers
LOAD=$(cut -d ' ' -f 1 /proc/loadavg)   # 1 minute load average
LOAD=$(echo "$LOAD * 100" | bc -l | awk -F '.' '{ print $1; exit; }')   # Convert load to x100 integer
if [ "$LOAD" -gt "$MAXLOAD" ]; then
    # Timestamp the entry, then append a single batch-mode snapshot of top
    date '+%Y-%m-%d %H:%M:%S' >> "${LOGFILE}"
    top -b -n 1 >> "${LOGFILE}"
fi
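Schedule with something like the following (update with the correct path to load_log)...
crontab -e
1 * * * * /bin/bash /home/user/load_log
Note that 1 * * * * runs the script once an hour, at one minute past the hour. Use * * * * * to check the load every minute.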
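Once the log has been collecting for a while, the times at which the threshold was exceeded can be listed on their own. This is a sketch under the assumption that the log only contains timestamps in the '+%Y-%m-%d %H:%M:%S' format written by the script above; high_load_times is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print the timestamps at which the load logging script fired.
# high_load_times is a hypothetical helper; it relies on the date format
# '+%Y-%m-%d %H:%M:%S' used by the logging script.
high_load_times() {
    # $1 = path to the log file
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$' "$1"
}

LOGFILE="/home/user/load_log.txt"   # same path as in the script; adjust to suit
if [ -r "$LOGFILE" ]; then
    high_load_times "$LOGFILE" || echo "No high load entries logged"
fi
```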
Other Tools
vmstat
- http://www.linuxcommand.org/man_pages/vmstat8.html
- Principally used for reporting on virtual memory statistics
mpstat
- http://www.linuxcommand.org/man_pages/mpstat1.html
- Reports basic processor stats
iostat
- http://sebastien.godard.pagesperso-orange.fr/man_iostat.html
- Provides disk IO statistics - part of sysstat package
When using the above tools you may be presented with disk/device names of dm-0, dm-1, etc., which won't mean much. These are LVM logical devices; to understand what they map to, use
lvdisplay|awk '/LV Name/{n=$3} /Block device/{d=$3; sub(".*:","dm-",d); print d,n;}'
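An alternative to parsing lvdisplay output is to read the device-mapper names from sysfs, which does not need root. This is a different technique from the command above, offered as a sketch: dm_names is a hypothetical helper, and it assumes the kernel exposes /sys/block/dm-N/dm/name.

```shell
#!/bin/bash
# Sketch: map dm-N device names to device-mapper names via sysfs.
# dm_names is a hypothetical helper; the optional argument exists so the
# function can be pointed at a directory other than /sys/block.
dm_names() {
    local root="${1:-/sys/block}" dev
    for dev in "$root"/dm-*; do
        [ -r "$dev/dm/name" ] || continue     # skip if no dm devices exist
        echo "$(basename "$dev") $(cat "$dev/dm/name")"
    done
}

dm_names
```

The names printed are device-mapper names, which for LVM are built from the volume group and logical volume names.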
Network
No NIC
Especially after hardware changes, it's possible the networking config no longer refers to the right interface.
- Use ifconfig to confirm the current network config
- Use dmesg | grep -i eth to ascertain what's been detected at boot time
- Assuming it states that, say, eth0 has been changed to eth1, then just update the /etc/network/interfaces file
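The check described in the steps above can be sketched as a comparison between the interfaces named in the config file and those the kernel currently exposes. This assumes the usual "iface NAME ..." stanza layout and that detected interfaces appear under /sys/class/net; configured_ifaces is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: compare interfaces named in the config with those the kernel sees.
# configured_ifaces is a hypothetical helper that reads "iface NAME ..." lines.
configured_ifaces() {
    # $1 = path to an interfaces file
    awk '$1 == "iface" { print $2 }' "$1"
}

CONF="/etc/network/interfaces"
if [ -r "$CONF" ]; then
    for nic in $(configured_ifaces "$CONF"); do
        if [ -e "/sys/class/net/$nic" ]; then
            echo "$nic: present"
        else
            echo "$nic: configured but not detected"
        fi
    done
fi
```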
File System
Unable to Mount CD-ROM
Mounting the drive with the following command fails
mount /dev/cdrom /media/cdrom/
If /media/cdrom/ doesn't exist
- Create the directory with mkdir /media/cdrom
If the /dev/cdrom special device doesn't exist
- Check for existing mappings and devices
  ls -l /dev/ | grep cdrom
- If a mapping exists but for a different drive number (eg cdrom2 -> sr0)
  - Then try mounting with that number
  - EG mount /dev/cdrom2 /media/cdrom/
- If no mapping exists
  - Then try creating one for one of the listed devices
  - EG ln -sf /dev/sg0 /dev/cdrom
Replacing a Software RAID 1 Disk
This procedure was written from the following starting point...
- A machine originally with two disks in RAID1 has failed, one disk has been replaced, and machine started again
...and adapted from this post http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array
- Back up whatever you can before proceeding; one mistake or system error could destroy your machine
- Confirm which disk is new, and which is old (if the new disk is blank this is easy as there will be no partition info!)
fdisk -l
- Partition the new disk the same as the original
sfdisk -d /dev/sda | sfdisk /dev/sdb
- Confirm that the layout of both disks is now the same
fdisk -l
- Add the newly created partitions to the RAID arrays
mdadm --manage /dev/md0 --add /dev/sdb1
- You may have more sd partitions than md partitions. The array size returned by mdadm -D /dev/md* should roughly match the number of blocks found from fdisk -l
- The arrays should now be syncing; check progress by monitoring /proc/mdstat
more /proc/mdstat
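To watch just the rebuild percentage rather than the whole file, the progress figure can be filtered out. This is a sketch that assumes the usual "recovery = N%" or "resync = N%" wording in /proc/mdstat; md_progress is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print resync/recovery progress figures from mdstat-style text.
# md_progress is a hypothetical helper; it reads stdin so it can be fed
# /proc/mdstat or a saved copy of it.
md_progress() {
    grep -oE '(recovery|resync) *= *[0-9.]+%'
}

if [ -r /proc/mdstat ]; then
    md_progress < /proc/mdstat || echo "No rebuild in progress"
fi
```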
SSH
Server Hostname Change
If the hostname (or IP) of the server you are SSH'ing to changes, the old entry needs to be removed from your SSH key known hosts file
ssh-keygen -R <name or IP>
Packages
Errors etc received from apt-get
- Error 400 Bad Request
  - Somewhat misleadingly, the problem is normally caused by being unable to contact the update server. Consider adding proxy server config to your machine
- The following packages have been kept back
  - The package manager can hold back updates because they will cause conflicts, or sometimes because they're major kernel updates. Running aptitude safe-upgrade normally seems to force kernel updates through.
Reboot Required?
If a package update/installation requires a reboot to complete, the following file will exist...
/var/run/reboot-required
To see which packages caused this to be set, inspect the contents of...
/var/run/reboot-required.pkgs
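The two checks above can be combined into one small script. This is a sketch only: reboot_status is a hypothetical helper name, and the optional directory argument is there purely so the function can be tried against a directory other than /var/run.

```shell
#!/bin/bash
# Sketch: report whether a reboot is pending and which packages asked for it.
# reboot_status is a hypothetical helper; the optional argument lets a
# directory other than /var/run be checked.
reboot_status() {
    local dir="${1:-/var/run}"
    if [ -f "$dir/reboot-required" ]; then
        echo "Reboot required"
        # The package list can contain repeats, so de-duplicate it
        [ -f "$dir/reboot-required.pkgs" ] && sort -u "$dir/reboot-required.pkgs"
        return 0
    fi
    echo "No reboot required"
}

reboot_status
```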