List: hadoop-commits
Subject: [Hadoop Wiki] Update of "UnknownHost" by SteveLoughran
From: Apache Wiki <wikidiffs@apache.org>
Date: 2011-06-27 12:01:35
Message-ID: 20110627120135.49517.49828@eos.apache.org
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The "UnknownHost" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/UnknownHost
Comment:
how to troubleshoot unknown host exceptions
New page:
= Unknown Host =
You get an Unknown Host error, often surfacing as a Java {{{UnknownHostException}}} (a subclass of {{{IOException}}}), when one machine on the network cannot determine the IP address of a host that it is trying to connect to by way of its hostname. This can happen during file upload (in which case the client machine has the hostname problem), or inside the Hadoop cluster.
Some possible causes (not an exclusive list):
* The site's DNS server does not have an entry for the node. Test: run {{{nslookup <hostname>}}} from the client machine.
* The calling machine's host table {{{/etc/hosts}}} lacks an entry for the host, and DNS isn't helping out.
* There's some error in the configuration files and the hostname is actually wrong.
* A worker node thinks it has a given name, which it reports to the NameNode and JobTracker, but that isn't the name that the network team expects, so it isn't resolvable.
* The calling machine is on a different subnet from the target machine, and short names are being used instead of fully qualified domain names (FQDNs).
* The client's network card is playing up (network timeouts, etc.), the network is overloaded, or even the switch is dropping DNS packets.
* The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs.
* The site's DNS server is overloaded. This can happen in large clusters. Either move to host table entries or use caching DNS servers in every worker node.
* Your ARP cache is corrupt, either accidentally or maliciously. If you don't know what that means, you won't be in a position to verify this is the problem, or fix it.
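Several of the causes above can be probed from the command line. A minimal sketch, demonstrated against {{{localhost}}} so the commands always have something to resolve; substitute the hostname that is actually failing:

```shell
# Resolution checks for the causes above. Demonstrated with "localhost";
# replace it with the hostname that is failing to resolve.
getent hosts localhost        # resolves via /etc/hosts and then DNS, as applications do
grep localhost /etc/hosts     # is there a static host-table entry?
nslookup localhost || true    # queries DNS directly, bypassing /etc/hosts
```

If {{{getent}}} succeeds but {{{nslookup}}} fails, the name is coming from the host table rather than DNS, which narrows down where the missing entry is.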
These are all network configuration/router issues. As it is your network, only you can track down the problem. That said, any tooling to help Hadoop track down such problems in the cluster would be welcome, as would extra diagnostics. If you have to extend Hadoop to track down these issues, submit your patches!
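For the JVM DNS-caching cause listed above, the cache lifetimes can also be tuned in the JRE's {{{java.security}}} file instead of restarting JVMs. The property names are standard; the values below are illustrative, not recommendations:

```properties
# In the JRE's java.security file (the exact path varies by JDK layout,
# e.g. $JAVA_HOME/jre/lib/security/java.security on JDK 6/7/8):

# cache successful lookups for 60 seconds instead of indefinitely
networkaddress.cache.ttl=60
# cache failed (negative) lookups for 10 seconds
networkaddress.cache.negative.ttl=10
```

Changes take effect only in JVMs started after the edit, so existing daemons still need a restart to pick them up.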
Some tactics to help solve the problem:
 1. Look for configuration problems first (Hadoop XML files, hostnames, host tables), as these are easiest to fix and quite common.
 1. Try and identify which client machine is playing up. If it is out-of-cluster, try the FQDN instead, and consider that it may not have access to the worker node.
 1. If the client that does not work is one of the machines in the cluster, SSH to that machine and make sure it can resolve the hostname.
 1. As well as {{{nslookup}}}, the {{{dig}}} command is invaluable for tracking down DNS problems, though it does assume you understand DNS records. Now is a good time to learn.
 1. Restart the JVMs to see if that makes it go away.
 1. Restart the servers to see if that makes it go away.
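When you SSH to a cluster machine as suggested above, one quick check is whether the name the node reports for itself actually resolves; a mismatch here is exactly the "worker node thinks it has a given name" cause described earlier. A sketch:

```shell
# On a cluster node: does the name this node reports for itself resolve?
hostname -s                   # the short name the node believes it has
hostname -f || true           # the FQDN (may fail if none is configured)
getent hosts "$(hostname)" \
  || echo "this node's own hostname does not resolve"
```

If the final command prints the warning, the NameNode and JobTracker are being given a name that nothing on the network can look up.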
Remember: unless the root cause has been identified, the problem may return.