
List:       hadoop-user
Subject:    Re: Many Checksum Errors
From:       bigjules <julian.neil () internode ! on ! net>
Date:       2007-05-24 11:38:32
Message-ID: 10782028.post () talk ! nabble ! com


I am also getting intermittent checksum errors during MapReduce jobs.
Mostly they go away when the map or reduce task is retried.  One error seems
to have made its way into the output of a job.

I cannot retrieve this file from HDFS because of the checksum error.  As you
suggest, a faulty memory stick may have corrupted my input file (or the
checksum file).
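
To illustrate why a single flipped bit is enough to trigger this, here is a
minimal sketch of per-chunk CRC32 verification.  It is not Hadoop's actual
code: the class and method names are made up for illustration, and the
512-byte chunk size is an assumption.

```java
import java.util.zip.CRC32;

// Hypothetical sketch, not the Hadoop implementation.
class ChunkChecksumSketch {

    // Compute one CRC32 value per chunk of bytesPerChunk bytes.
    static long[] checksums(byte[] data, int bytesPerChunk) {
        int chunks = (data.length + bytesPerChunk - 1) / bytesPerChunk;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChunk;
            int len = Math.min(bytesPerChunk, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Return the index of the first chunk whose recomputed CRC differs
    // from the stored one, or -1 if every chunk matches.
    static int firstBadChunk(byte[] data, long[] stored, int bytesPerChunk) {
        long[] actual = checksums(data, bytesPerChunk);
        for (int i = 0; i < actual.length; i++) {
            if (i >= stored.length || actual[i] != stored[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int bytesPerChunk = 512;          // assumed chunk size
        byte[] data = new byte[2048];
        long[] stored = checksums(data, bytesPerChunk);

        data[1500] ^= 0x01;               // simulate one bit flipped in memory
        int bad = firstBadChunk(data, stored, bytesPerChunk);
        System.out.println("first bad chunk: " + bad);
    }
}
```

The same mismatch shows up if the bit flips in a stored checksum rather than
in the data, so the check alone cannot tell which of the two was corrupted.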

Is this problem rare enough to put down to faulty memory?  You mention
that you have seen it reported before, but I'm wondering whether there have
been reports of checksum errors that weren't due to faulty memory (I couldn't
find any with a forum search).

I suppose that Dennis Kubes' problems did go away when he replaced his entire
cluster's memory with ECC sticks (not all of us have that luxury).

I am running hadoop-0.12.3 on a single Windows Server 2003 machine (using
Cygwin) without ECC memory.

Jules




Doug Cutting wrote:
> 
> Do you have ECC memory on your nodes?  Nodes without ECC have been known 
> to trigger high rates of checksum errors.
> 
> Doug
> 
> Dennis Kubes wrote:
>> All,
>> 
>> We are continually experiencing checksum errors when running some jobs 
>> under heavy load (specifically merging segments or crawldbs).  I am lost 
>> as to whether this is a hardware or software problem.  Two questions, 
>> one is anyone else experiencing a large number of checksum type errors 
>> on big clusters?  Two, does anyone know if this is hardware or software 
>> related?  Here are some examples.
>> 
>> Dennis Kubes
>> 
>> 
>> org.apache.hadoop.fs.ChecksumException: Checksum error: /d01/hadoop/mapred/local/task_0042_m_001905_0/spill0.out at 79597056
>>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
> 
> 

-- 
View this message in context: http://www.nabble.com/Many-Checksum-Errors-tf3678663.html#a10782028
Sent from the Hadoop Users mailing list archive at Nabble.com.

