List:       hadoop-commits
Subject:    [Hadoop Wiki] Update of "ServerNotAvailable" by SteveLo
From:       Apache Wiki <wikidiffs () apache ! org>
Date:       2011-06-30 10:50:40
Message-ID: 20110630105040.86796.69755 () eos ! apache ! org

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ServerNotAvailable" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/ServerNotAvailable

Comment:
new page on understanding the server not available error

New page:
= Server Not Available Yet =

These messages can appear in the logs of a DataNode:

{{{
2011-06-30 11:30:40,403 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 0 time(s).
2011-06-30 11:30:41,404 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 1 time(s).
2011-06-30 11:30:42,404 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 2 time(s).
2011-06-30 11:30:43,405 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 3 time(s).
2011-06-30 11:30:44,405 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 4 time(s).
2011-06-30 11:30:45,406 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 5 time(s).
2011-06-30 11:30:46,407 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 6 time(s).
2011-06-30 11:30:47,407 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 7 time(s).
2011-06-30 11:30:48,408 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 8 time(s).
2011-06-30 11:30:49,409 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 9 time(s).
2011-06-30 11:30:49,410 INFO org.apache.hadoop.ipc.RPC: Server at namenode/10.8.1.2:54310 not available yet, Zzzzz...
2011-06-30 11:30:51,411 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 0 time(s).
2011-06-30 11:30:52,412 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 1 time(s).
2011-06-30 11:30:53,412 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 2 time(s).
2011-06-30 11:30:54,413 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 3 time(s).
2011-06-30 11:30:55,414 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 4 time(s).
2011-06-30 11:30:56,414 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 5 time(s).
2011-06-30 11:30:57,415 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 6 time(s).
2011-06-30 11:30:58,416 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 7 time(s).
2011-06-30 11:30:59,416 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 8 time(s).
2011-06-30 11:31:00,417 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/10.8.1.2:54310. Already tried 9 time(s).
2011-06-30 11:31:00,418 INFO org.apache.hadoop.ipc.RPC: Server at namenode/10.8.1.2:54310 not available yet, Zzzzz...
}}}

What's happening here is that the DataNode cannot connect to the NameNode. Rather than fail, it assumes that the NameNode is temporarily offline: it hasn't started yet or is being restarted. The DataNodes will happily wait for the NameNode to come back up and, as soon as it does, report in. After retrying once a second for ten attempts, the client backs off for a couple of seconds, then starts trying again.
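
The retry-then-back-off cycle described above can be sketched roughly as follows. This is only an illustration in Python, not the Hadoop IPC client code: the function name `wait_for_server`, its parameters and the default timings are assumptions chosen to resemble the log timestamps.

```python
import time


def wait_for_server(connect, max_retries=10, retry_interval=1.0,
                    backoff=2.0, max_cycles=None, sleep=time.sleep):
    """Call connect() until it succeeds, retrying and backing off.

    connect is any callable that raises OSError while the server is down.
    All names and default timings here are illustrative assumptions.
    """
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for attempt in range(max_retries):
            try:
                return connect()
            except OSError:
                print(f"Retrying connect to server. Already tried {attempt} time(s).")
                sleep(retry_interval)
        # Every retry in this cycle failed: pause, then start a new cycle.
        print("Server not available yet, Zzzzz...")
        sleep(backoff)
        cycle += 1
    raise ConnectionError(f"server still unavailable after {max_cycles} cycle(s)")
```

With `max_cycles=None` the loop waits indefinitely, which mirrors the DataNode's willingness to wait for the NameNode to come back; passing a number makes the sketch give up eventually.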


This process of retrying and backing off is a key part of how an HDFS cluster handles the temporary outage of a NameNode. It works well provided the network is set up and running correctly. However, the same messages can also be triggered by cluster setup problems, which anyone setting up a Hadoop cluster is likely to encounter:

 1. The NameNode hasn't been started yet. Fix: start the NameNode.
 2. The `fs.default.name` property in `core-site.xml` doesn't point to the correct hostname for the NameNode, so the DataNodes are trying to connect to the wrong server. Look at the server name in the log and verify that it is valid.
 3. The port in the `fs.default.name` property is wrong. Verify that the NameNode is listening on that port; if not, correct the site settings.
 4. The client can't resolve the hostname, or it is resolving to the wrong address. Verify the IP address in the logs.
 5. Connection problems. Look at the network connectivity options in the TroubleShooting page.
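
Several of the checks above (the configured hostname, the port, and name resolution) can be run from a DataNode host with a small script. The following is a minimal sketch, assuming `fs.default.name` holds a URI such as `hdfs://namenode:54310`; the helper names `namenode_address` and `check_namenode` are hypothetical and not part of Hadoop.

```python
import socket
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def namenode_address(core_site_xml):
    """Return (host, port) from fs.default.name in a core-site.xml string."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "fs.default.name":
            uri = urlparse(prop.findtext("value", "").strip())
            return uri.hostname, uri.port
    return None, None


def check_namenode(host, port, timeout=3.0):
    """Return a one-line diagnosis for the given NameNode host and port."""
    if host is None or port is None:
        return "fs.default.name is missing or has no host:port"
    try:
        # Name resolution; compare the result with the address in the logs.
        address = socket.gethostbyname(host)
    except socket.gaierror as e:
        return f"cannot resolve {host}: {e}"
    try:
        # A plain TCP connect only shows that something is listening there.
        with socket.create_connection((address, port), timeout=timeout):
            return f"ok: {host} resolves to {address} and port {port} accepts connections"
    except OSError as e:
        return f"{host} resolves to {address} but port {port} is unreachable: {e}"
```

For example, `check_namenode(*namenode_address(open("core-site.xml").read()))` prints nothing by itself but returns a message that can be logged. An "ok" result only means the port is open; it does not prove that the process listening there is the NameNode.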

