Troubleshooting (Ubuntu): Difference between revisions

(Added "High System Load")
(Removed GoogleAdLinkUnitBanner)
 
(11 intermediate revisions by the same user not shown)
Line 1: Line 1:
== High System Load ==
'''For performance problems related to load, see [[High_System_Load_(Ubuntu)|High System Load]]'''
The system load is normally represented by the load average over the last 1, 5 and 15 minutes.
 
For example, the <code>uptime</code> command gives a single-line summary of system uptime and recent load:
 
<pre>
user@server:~$ uptime
14:28:49 up 9 days, 22:41,  1 user,  load average: 0.34, 0.36, 0.32
</pre>
 
So in the above, as of 14:28:49 hrs the server has been up for 9 days 22 hours odd, has 1 user logged in, and the system load averages for the past 1, 5, and 15 minutes are shown.
 
The load average for a given period indicates how many processes were, on average, running, waiting to run, or in an uninterruptible (usually waiting for IO) state. What's bad depends on your system. For a single CPU system, a load average greater than 1 could be considered bad, as there are more processes wanting to run than CPUs to service them.
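
As a rough sketch of that rule of thumb, the following divides the 1-minute load average by the number of CPUs. The helper name <code>load_per_cpu</code> is made up for illustration, and <code>nproc</code> (part of coreutils) is assumed to be installed.

```bash
#!/bin/bash
# Sketch: express a load average relative to the number of CPUs.
# load_per_cpu is a hypothetical helper name, not a standard command.

load_per_cpu() {
    # $1 = load average (e.g. 0.34), $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

if [ -r /proc/loadavg ] && command -v nproc > /dev/null; then
    load1=$(cut -d ' ' -f 1 /proc/loadavg)    # 1-minute load average
    cpus=$(nproc)                             # CPUs available to this process
    echo "1-minute load per CPU: $(load_per_cpu "$load1" "$cpus")"
fi
```

A result above 1.00 suggests there were more processes wanting CPU time than CPUs to run them.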
 
=== <code>top</code> ===
The <code>top</code> command allows some basic insight into the system's performance, and is akin to the Task Manager in Windows.
 
<pre>
user@server:~$ top
top - 14:32:09 up 9 days, 22:44,  1 user,  load average: 0.70, 0.44, 0.34
Tasks: 137 total,  1 running, 136 sleeping,  0 stopped,  0 zombie
Cpu(s): 93.8%us,  6.2%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  1023360k total,  950520k used,    72840k free,    10836k buffers
Swap:  1757176k total,  1110228k used,  646948k free,  135524k cached
 
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
6608 zimbra    20  0  556m  69m  12m S 69.1  6.9  0:03.26 java
17284 zimbra    20  0  649m 101m 3604 S  4.6 10.1  31:34.74 java
2610 zimbra    20  0  976m 181m 3700 S  0.7 18.1 133:06.68 java
    1 root      20  0 23580 1088  732 S  0.0  0.1  0:04.70 init
    2 root      20  0    0    0    0 S  0.0  0.0  0:00.01 kthreadd
    3 root      RT  0    0    0    0 S  0.0  0.0  0:00.00 migration/0
....
</pre>
 
Note that a task's <code>%CPU</code> is measured with respect to 1 CPU, so on a multiple CPU system, seeing values > 100% is valid.
 
{|class="vwikitable"
|+ Overview of CPU Metrics, % over time
! Code  !! Name !! Description
|-
| <code>us</code> || User CPU || % of CPU time spent servicing user processes (excluding nice)
|-
| <code>sy</code> || System CPU || % of CPU time spent servicing kernel processes
|-
| <code>ni</code> || Nice CPU || % of CPU time spent servicing user nice processes (nice reduces the priority of process)
|-
| <code>id</code> || Idle CPU || % of CPU time spent idling (doing nothing)
|-
| <code>wa</code> || IO Wait || % of CPU time spent waiting for IO (high indicates disk/network bottleneck)
|-
| <code>hi</code> || Hardware Interrupts || % of CPU time spent servicing hardware interrupts
|-
| <code>si</code> || Software Interrupts || % of CPU time spent servicing software interrupts
|-
| <code>st</code> || Steal || % of CPU time stolen from this virtual machine by the hypervisor (to service other virtual machines)
|}
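
To pull a single metric out of the summary line in a script, something like the following might do. It assumes the older <code>Cpu(s): 93.8%us, ...</code> layout shown in the sample output above; other versions of <code>top</code> print this line differently, so the parsing would need adjusting. <code>cpu_metric</code> is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch: extract one value (e.g. wa) from a "Cpu(s):" summary line.
# Assumes the "93.8%us,  6.2%sy, ..." layout; cpu_metric is a made-up name.

cpu_metric() {
    # $1 = metric code (us, sy, ni, id, wa, hi, si, st)
    # $2 = the "Cpu(s):" summary line
    echo "$2" | tr ',' '\n' | awk -v code="$1" '
        {
            sub(/^.*:/, "")            # drop the "Cpu(s):" label if present
            gsub(/[ \t]/, "")          # drop padding
            n = split($0, part, "%")   # part[1] = value, part[2] = code
            if (n == 2 && part[2] == code) print part[1]
        }'
}

# Example use against live output (batch mode, single iteration):
#   cpu_metric wa "$(top -b -n 1 | grep 'Cpu(s)')"
```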
 
{|class="vwikitable"
|+ Task column heading descriptions (to change what columns are shown press <code>f</code>)
! Key !! Display  !! Name !! Description
|-
| <code>a</code> || <code>PID</code> || Process ID || Task/process identifier
|-
| <code>b</code> || <code>PPID</code> || Parent PID || Task/process identifier of the process's parent (i.e. the process that launched this process)
|-
| <code>c</code> || <code>RUSER</code> || Real User Name || Real username of task's owner
|-
| <code>d</code> || <code>UID</code> || User ID || User ID of task's owner
|-
| <code>e</code> || <code>USER</code> || User Name || Effective username of task's owner
|-
| <code>f</code> || <code>GROUP</code> || Group Name || Group name of task's owner
|-
| <code>g</code> || <code>TTY</code> || Controlling TTY || Device that started the process
|-
| <code>h</code> || <code>PR</code> || Priority || The task's priority
|-
| <code>i</code> || <code>NI</code> || Nice value || Adjusted task priority. From -20 meaning high priority, through 0 meaning unadjusted, to 19 meaning low priority
|-
| <code>j</code> || <code>P</code> || Last Used CPU || ID of the CPU last used by the task
|-
| <code>k</code> || <code>%CPU</code> || CPU Usage || Task's usage of CPU
|-
| <code>l</code> || <code>TIME</code> || CPU Time || Total CPU time used by the task
|-
| <code>m</code> || <code>TIME+</code> || CPU Time, hundredths || Total CPU time used by the task in sub-second accuracy
|-
| <code>n</code> || <code>%MEM</code> || Memory usage (RES) || Task's usage of available physical memory
|-
| <code>o</code> || <code>VIRT</code> || Virtual Image (kb) || Task's allocation of virtual memory
|-
| <code>p</code> || <code>SWAP</code> || Swapped size (kb) || Task's swapped memory (resident in swap-file)
|-
| <code>q</code> || <code>RES</code> || Resident size (kb) || Task's unswapped memory (resident in physical memory)
|-
| <code>r</code> || <code>CODE</code> || Code size (kb) || Task's virtual memory used for executable code
|-
| <code>s</code> || <code>DATA</code> || Data+Stack size (kb) || Task's virtual memory not used for executable code
|-
| <code>t</code> || <code>SHR</code> || Shared Mem size (kb) || Task's shared memory
|-
| <code>u</code> || <code>nFLT</code> || Page Fault count || Major/Hard page faults that have occurred for the task
|-
| <code>v</code> || <code>nDRT</code> || Dirty Pages count || Task's memory pages that have been modified since they were last written to disk, and so must be written out before the physical memory can be freed
|-
| <code>w</code> || <code>S</code> || Process Status ||
* D - Uninterruptible sleep
* R - Running
* S - Sleeping
* T - Traced or Stopped
* Z - Zombie
|-
| <code>x</code> || <code>Command</code> || Command Line || Command used to start task
|-
| <code>y</code> || <code>WCHAN</code> || Sleeping in Function || Name (or address) of function that the task is sleeping in
|-
| <code>z</code> || <code>Flags</code> || Task Flags || Task's scheduling flags
|}
 
 
=== Identify Process Causing High System Load ===
If the high load is constant, just fire up <code>top</code> and see if there is a specific process to blame, or if you're stuck waiting for disk or network IO.
 
If the high load is transient but repetitive, then you'll need to capture the output of <code>top</code> at the right time. The following script will create a log of <code>top</code> output during periods of high load.
 
<source lang="bash">#!/bin/bash
#
# During high load, write output from top to file.
#
# Simon Strutt - July 2012
 
LOGFILE="load_log.txt"
MAXLOAD=100                    # Threshold multiplied by 100, as the if comparison can only handle integers
 
# 1-minute load average, e.g. 0.34
LOAD=$(cut -d ' ' -f 1 /proc/loadavg)
# Convert load to x100 integer, e.g. 0.34 becomes 34
LOAD=$(echo "$LOAD * 100" | bc -l | awk -F '.' '{ print $1; exit; }')
 
if [ "$LOAD" -gt "$MAXLOAD" ]; then
        date '+%Y-%m-%d %H:%M:%S' >> "${LOGFILE}"
        top -b -n 1 >> "${LOGFILE}"
fi</source>
 
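Schedule with something like the following (this entry runs at one minute past each hour; use <code>* * * * *</code> in the time fields to check every minute)...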
<pre>crontab -e
1 * * * * /bin/bash  /home/simons/load_log</pre>
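
If <code>bc</code> isn't installed, the threshold test could be done in <code>awk</code> instead, which compares the decimal values directly and so avoids the x100 conversion. A minimal sketch, where <code>load_exceeds</code> is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: compare a decimal load average against a decimal threshold.
# load_exceeds is a made-up helper name.

load_exceeds() {
    # $1 = load average (e.g. 1.27), $2 = threshold (e.g. 1.00)
    # Exit status is 0 (true) if load > threshold, 1 otherwise
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load > max) }'
}

if [ -r /proc/loadavg ] && load_exceeds "$(cut -d ' ' -f 1 /proc/loadavg)" 1.00; then
    echo "1-minute load is above 1.00"
fi
```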


== Network ==
Line 153: Line 8:
# Use <code> dmesg | grep -i eth </code> to ascertain what's been detected at boot time
# Assuming it states that say <code>eth0</code> has been changed to <code>eth1</code> then just update the <code>/etc/network/interfaces</code> file
# Alternatively, force the ''new'' NIC to be <code>eth0</code> by editing the <code>/etc/udev/rules.d/70-persistent-net.rules</code> file
#* You'll need to reboot the server for changes to take effect


== File System ==
Line 188: Line 45:
# The arrays should now be being sync'ed, check progress by monitoring <code>/proc/mdstat</code>
#* <code> more /proc/mdstat </code>
=== Recover Deleted Files ===
Ideally you should recover files to a separate disk partition from the one you are attempting to recover from.  This procedure should help to recover lost or corrupted files from a filesystem using [http://manpages.ubuntu.com/manpages/lucid/man1/scalpel.1.html Scalpel], a data recovery utility built on the foundation of [http://foremost.sourceforge.net/ Foremost].
# Install Scalpel
#* <code> apt-get install scalpel </code>
# Update the config file to search for the lost files (uncomment/add as necessary)
#* <code> /etc/scalpel/scalpel.conf </code>
#* For PHP files (not embedded in HTML) use <code> php n  50000  <?php          ?> </code>
# Create a folder for the recovered files to go to
#* <code> mkdir /tmp/recov </code>
# Launch Scalpel to trawl the disk image (will take ages, and source disk will be under high load)
#* <code> scalpel /dev/mapper/svr-root -o /tmp/recov/ </code>
# Search through recovered files to find the data of interest
#* <code> grep -R "string you want to find" /tmp/recov/* </code>


== SSH ==
Line 200: Line 72:
* '''The following packages have been kept back'''
** Package manager can hold back updates because they will cause conflicts, or sometimes because they're major kernel updates.  Running <code>aptitude safe-upgrade</code> normally seems to force kernel updates through.
=== Add EOL Repository ===
Once a version of Ubuntu has gone End Of Life (EOL), you can't install software packages using the normal repository.  On trying you'll get an error similar to
* <code>Failed to fetch http://gb.archive.ubuntu.com/ubuntu/pool/main/s/<package>  404 Not Found</code>
The repository is still available, but via a different URL -  http://old-releases.ubuntu.com
Edit <code>/etc/apt/sources.list</code> and add the following (replace hardy with your release of Ubuntu).  Remove the existing Ubuntu repositories (they'll just cause errors as they're inaccessible)
<pre>
# Hardy EOL
# Required
deb http://old-releases.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ hardy-updates main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ hardy-security main restricted universe multiverse
# Optional
#deb http://old-releases.ubuntu.com/ubuntu/ hardy-backports main restricted universe multiverse
</pre>


== Reboot Required? ==
Line 207: Line 98:
To see which packages caused this to be set, inspect the contents of...
  /var/run/reboot-required.pkgs
== Firewall ==
=== ERROR: problem running ufw-init ===
If on starting or reloading <code>ufw</code> you receive this error, it's likely that you have a configuration problem.  This is especially likely if you've needed to edit <code>ufw</code>'s config files directly.
# Ensure that <code>ufw</code> is running
#* <code> ufw enable </code>
# Force the config to be reloaded
#* <code> /lib/ufw/ufw-init force-reload </code>
# Or if <code>ufw</code> failed to start use
#* <code> /lib/ufw/ufw-init start </code>
Doing the above should trigger the error, and present a better description of what the problem is.
See http://ubuntuforums.org/showthread.php?t=1660916 for further info.


[[Category:Ubuntu]]
[[Category:Troubleshooting]]
[[Category:Bash]]

Latest revision as of 13:34, 26 September 2016

For performance problems related to load, see High System Load

Network

No NIC

Especially after hardware changes, it's possible the networking config no longer refers to the right interface.

  1. Use ifconfig to confirm the current network config
  2. Use dmesg | grep -i eth to ascertain what's been detected at boot time
  3. Assuming it states that say eth0 has been changed to eth1 then just update the /etc/network/interfaces file
  4. Alternatively, force the new NIC to be eth0 by editing the /etc/udev/rules.d/70-persistent-net.rules file
    • You'll need to reboot the server for changes to take effect
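
The comparison in steps 1 to 3 could be sketched as a small check that lists any interface named in the config file that the kernel doesn't currently know about. <code>missing_ifaces</code> is a hypothetical helper name, and the check assumes interfaces are declared on plain <code>iface</code> lines.

```bash
#!/bin/bash
# Sketch: report interfaces that are configured but not present.
# missing_ifaces is a made-up helper name.

missing_ifaces() {
    # $1 = interfaces file (normally /etc/network/interfaces)
    # $2 = directory listing the kernel's interfaces (normally /sys/class/net)
    # Prints each interface named on an "iface" line that has no entry in $2
    awk '$1 == "iface" { print $2 }' "$1" | sort -u | while read -r ifname; do
        [ -e "$2/$ifname" ] || echo "$ifname"
    done
}

# Example use:
#   missing_ifaces /etc/network/interfaces /sys/class/net
```

If this prints eth0 while /sys/class/net contains eth1, the config is most likely still pointing at the old interface name.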

File System

Unable to Mount CD-ROM

Mounting the drive with the following command fails

  • mount /dev/cdrom /media/cdrom/

If /media/cdrom/ doesn't exist

  1. Create the directory with mkdir /media/cdrom

If /dev/cdrom special device doesn't exist

  1. Check for existing mappings and devices
    • ls -l /dev/ | grep cdrom
  2. If an existing mapping exists but for a different drive number (eg cdrom2 -> sr0)
    • Then try mounting with that number
    • EG mount /dev/cdrom2 /media/cdrom/
  3. If no existing mapping exists
    • Then try creating one for one of the listed devices
    • EG ln -sf /dev/sr0 /dev/cdrom
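
The check for existing mappings can also be done by resolving the symlinks directly, which shows each cdrom name alongside the device it points to. A sketch, with <code>list_cdrom_links</code> being a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: show every cdrom* symlink in a directory and its target.
# list_cdrom_links is a made-up helper name.

list_cdrom_links() {
    # $1 = device directory to search (normally /dev)
    for link in "$1"/cdrom*; do
        [ -L "$link" ] || continue    # skip non-symlinks and an unmatched glob
        echo "$(basename "$link") -> $(readlink "$link")"
    done
}

# Example use:
#   list_cdrom_links /dev
```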

Replacing a Software RAID 1 Disk

This procedure was written from the following starting point...

  • A machine originally with two disks in RAID1 has failed, one disk has been replaced, and machine started again

...and adapted from this post http://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

  1. Back up whatever you can before proceeding, as one mistake or system error could destroy your machine
  2. Confirm which disk is new, and which is old (if the new disk is blank this is easy as there will be no partition info!)
    • fdisk -l
  3. Partition the new disk the same as the original
    • sfdisk -d /dev/sda | sfdisk /dev/sdb
  4. Confirm that the layout of both disks is now the same
    • fdisk -l
  5. Add the newly created partitions to the RAID arrays
    • mdadm --manage /dev/md0 --add /dev/sdb1
    • You may have more sd partitions than md partitions, the array size returned by mdadm -D /dev/md* should roughly match the number of blocks found from fdisk -l
  6. The arrays should now be being sync'ed, check progress by monitoring /proc/mdstat
    • more /proc/mdstat
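
To spot arrays that are still missing a member or not yet in sync, the status field in /proc/mdstat can be scanned for an underscore (for example [U_] rather than [UU]). The following is a sketch only: <code>degraded_arrays</code> is a hypothetical helper name, and it assumes arrays are named md followed by a number.

```bash
#!/bin/bash
# Sketch: list md arrays whose member status contains "_".
# degraded_arrays is a made-up helper name.

degraded_arrays() {
    # $1 = mdstat file (normally /proc/mdstat)
    awk '
        /^md[0-9]+ :/ { array = $1 }                     # remember the current array name
        /blocks/ && /\[[U_]*_[U_]*\]/ { print array }    # "_" marks a member that is missing or not yet in sync
    ' "$1"
}

# Example use:
#   degraded_arrays /proc/mdstat
```

Once the sync has finished the function should print nothing.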

Recover Deleted Files

Ideally you should recover files to a separate disk partition from the one you are attempting to recover from. This procedure should help to recover lost or corrupted files from a filesystem using Scalpel, a data recovery utility built on the foundation of Foremost.

  1. Install Scalpel
    • apt-get install scalpel
  2. Update the config file to search for the lost files (uncomment/add as necessary)
    • /etc/scalpel/scalpel.conf
    • For PHP files (not embedded in HTML) use php n 50000 <?php  ?>
  3. Create a folder for the recovered files to go to
    • mkdir /tmp/recov
  4. Launch Scalpel to trawl the disk image (will take ages, and source disk will be under high load)
    • scalpel /dev/mapper/svr-root -o /tmp/recov/
  5. Search through recovered files to find the data of interest
    • grep -R "string you want to find" /tmp/recov/*
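
When the output directory holds a large number of carved files, printing only the names of the matching files can be easier to work through than every matching line. A sketch, where <code>find_recovered</code> is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: list recovered files that contain a given string.
# find_recovered is a made-up helper name.

find_recovered() {
    # $1 = string to look for, $2 = directory holding the recovered files
    # -R recurse, -l print file names only, -F treat the string literally
    grep -RlF -- "$1" "$2" | sort
}

# Example use:
#   find_recovered "string you want to find" /tmp/recov
```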

SSH

Server Hostname Change

If the hostname (or IP) of the server you are SSH'ing to changes, the old entry needs to be removed from your SSH key known hosts file

  • ssh-keygen -R <name or IP>

Packages

Errors etc received from apt-get

  • Error 400 Bad Request
    • Somewhat misleadingly, the problem is normally caused by being unable to contact the update server. Consider adding proxy server config to your machine
  • The following packages have been kept back
    • Package manager can hold back updates because they will cause conflicts, or sometimes because they're major kernel updates. Running aptitude safe-upgrade normally seems to force kernel updates through.

Add EOL Repository

Once a version of Ubuntu has gone End Of Life (EOL), you can't install software packages using the normal repository. On trying you'll get an error similar to

  • Failed to fetch http://gb.archive.ubuntu.com/ubuntu/pool/main/s/<package> 404 Not Found

The repository is still available, but via a different URL - http://old-releases.ubuntu.com

Edit /etc/apt/sources.list and add the following (replace hardy with your release of Ubuntu). Remove the existing Ubuntu repositories (they'll just cause errors as they're inaccessible)

# Hardy EOL
# Required
deb http://old-releases.ubuntu.com/ubuntu/ hardy main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ hardy-updates main restricted universe multiverse
deb http://old-releases.ubuntu.com/ubuntu/ hardy-security main restricted universe multiverse

# Optional
#deb http://old-releases.ubuntu.com/ubuntu/ hardy-backports main restricted universe multiverse
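
As an alternative to editing by hand, the existing entries could be rewritten with <code>sed</code>. This is a sketch under assumptions: it only recognises mirrors named like gb.archive.ubuntu.com, archive.ubuntu.com or security.ubuntu.com, and <code>to_old_releases</code> is a hypothetical helper name. It writes to standard output so the result can be reviewed before replacing the real file.

```bash
#!/bin/bash
# Sketch: point standard Ubuntu mirror URLs at old-releases.ubuntu.com.
# to_old_releases is a made-up helper name.

to_old_releases() {
    # $1 = sources.list-style file to read; rewritten lines go to stdout
    # Mirror names that don't match the pattern are left untouched
    sed -E 's#https?://([a-z]+\.)?(archive|security)\.ubuntu\.com/#http://old-releases.ubuntu.com/#' "$1"
}

# Example use:
#   to_old_releases /etc/apt/sources.list > /tmp/sources.list.new
```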

Reboot Required?

If a package update/installation requires a reboot to complete the following file will exist...

/var/run/reboot-required 

To see which packages caused this to be set, inspect the contents of...

/var/run/reboot-required.pkgs
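
Both checks can be combined into one small function, for example to run at login or from a monitoring script. A minimal sketch, where <code>reboot_status</code> is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: report whether a reboot is pending and which packages asked for it.
# reboot_status is a made-up helper name.

reboot_status() {
    # $1 = directory holding the flag files (normally /var/run)
    if [ -f "$1/reboot-required" ]; then
        echo "Reboot required"
        # The .pkgs file may be absent, so only print it when readable
        [ -r "$1/reboot-required.pkgs" ] && sort -u "$1/reboot-required.pkgs"
        return 0
    fi
    echo "No reboot required"
}

# Example use:
#   reboot_status /var/run
```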

Firewall

ERROR: problem running ufw-init

If on starting or reloading ufw you receive this error, it's likely that you have a configuration problem. This is especially likely if you've needed to edit ufw's config files directly.

  1. Ensure that ufw is running
    • ufw enable
  2. Force the config to be reloaded
    • /lib/ufw/ufw-init force-reload
  3. Or if ufw failed to start use
    • /lib/ufw/ufw-init start

Doing the above should trigger the error, and present a better description of what the problem is.

See http://ubuntuforums.org/showthread.php?t=1660916 for further info.