
List:       hadoop-commits
Subject:    [Hadoop Wiki] Update of "UnknownHost" by SteveLoughran
From:       Apache Wiki <wikidiffs@apache.org>
Date:       2011-06-27 12:01:35
Message-ID: <20110627120135.49517.49828@eos.apache.org>

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "UnknownHost" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/UnknownHost

Comment:
how to troubleshoot unknown host exceptions

New page:
= Unknown Host =

You get an Unknown Host error, often wrapped in a Java {{{IOException}}}, when one machine on the network cannot determine the IP address of a host that it is trying to connect to by way of its hostname. This can happen during file upload (in which case the client machine has the hostname problem), or inside the Hadoop cluster.
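
If you want to check resolution from code rather than with {{{nslookup}}}, here is a minimal Java sketch (not part of Hadoop; the hostname below is a placeholder) that performs the same JDK lookup Hadoop relies on:

{{{
import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal resolution check, using the same JDK lookup Hadoop relies on.
// Replace the placeholder hostname with the one from your error message.
public class ResolveCheck {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "namenode.example.com";
        try {
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " resolves to " + addr.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println("Cannot resolve " + host
                + " -- the same failure Hadoop is reporting");
        }
    }
}
}}}

If this program fails on the machine that shows the error, the problem lies in that machine's name resolution, not in Hadoop itself.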


Some possible causes (not an exhaustive list):
 * The site's DNS server does not have an entry for the node. Test: do an {{{nslookup <hostname>}}} from the client machine.
 * The calling machine's host table {{{/etc/hosts}}} lacks an entry for the host, and DNS isn't helping out.
 * There's some error in the configuration files and the hostname is actually wrong.
 * A worker node thinks it has a given name, which it reports to the NameNode and JobTracker, but that isn't the name that the network team expects, so it isn't resolvable.
 * The calling machine is on a different subnet from the target machine, and short names are being used instead of fully qualified domain names (FQDNs).
 * The client's network card is playing up (network timeouts, etc), the network is overloaded, or even the switch is dropping DNS packets.
 * The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs. A sketch of the relevant JVM settings follows this list.
 * The site's DNS server is overloaded. This can happen in large clusters. Either move to host table entries or use caching DNS servers in every worker node.
 * Your ARP cache is corrupt, either accidentally or maliciously. If you don't know what that means, you won't be in a position to verify this is the problem, or to fix it.
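
On the JVM caching point above, the cache behaviour is controlled by standard {{{java.security}}} properties. Here is a sketch of tuning them in code, assuming it runs before the JVM's first name lookup (editing the {{{java.security}}} file under {{{$JAVA_HOME}}} is the more usual approach):

{{{
import java.security.Security;

// Sketch: tune JVM DNS caching. Values are in seconds and must be set
// before the JVM performs its first name lookup to take effect.
public class DnsCacheTuning {
    public static void main(String[] args) {
        // How long successful lookups are cached; with a security manager
        // installed the default is to cache forever.
        Security.setProperty("networkaddress.cache.ttl", "60");
        // How long failed ("negative") lookups are remembered.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
}}}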

These are all network configuration/router issues. As it is your network, only you can find out and track down the problem. That said, any tooling to help Hadoop track down such problems in the cluster would be welcome, as would extra diagnostics. If you have to extend Hadoop to track down these issues, submit your patches!

Some tactics to help solve the problem:
 1. Look for configuration problems first (Hadoop XML files, hostnames, host tables), as these are easiest to fix and quite common.
 1. Try to identify which client machine is playing up. If it is out-of-cluster, try the FQDN instead, and consider that it may not have access to the worker node.
 1. If the client that does not work is one of the machines in the cluster, SSH to that machine and make sure it can resolve the hostname. The sketch after this list shows a quick in-JVM way to do that.
 1. As well as {{{nslookup}}}, the {{{dig}}} command is invaluable for tracking down DNS problems, though it does assume you understand DNS records. Now is a good time to learn.
 1. Restart the JVMs to see if that makes the problem go away.
 1. Restart the servers to see if that makes the problem go away.
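
For step 3, a small Java sketch (a hypothetical helper, not shipped with Hadoop) shows what name a node believes it has and whether that name resolves, which is essentially the name a worker reports to the NameNode and JobTracker:

{{{
import java.net.InetAddress;

// Run on the suspect node: prints the name the JVM thinks this host
// has and the address it resolves to. getLocalHost() throws
// UnknownHostException if the machine cannot resolve its own hostname.
public class SelfResolveCheck {
    public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("hostname:  " + local.getHostName());
        System.out.println("canonical: " + local.getCanonicalHostName());
        System.out.println("address:   " + local.getHostAddress());
    }
}
}}}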

Remember, unless the root cause has been identified, the problem may return.

