
List:       hadoop-user
Subject:    Re: How to flush SequenceFile.Writer?
From:       Brian Long <brian@dotspots.com>
Date:       2009-01-30 7:31:33
Message-ID: 75caa0950901292331o24363e6ah40d54a8fe84e6897@mail.gmail.com


More information --
I'm running Hadoop 0.17.2. I create the sequence file with code that looks
like the following:

writer = SequenceFile.createWriter(fs, configuration,
    new Path(root + "/" + lastDateString + "/"
        + env.getHostname() + "#" + UUID.randomUUID()),
    keyClass, valueClass);

I also immediately call writer.sync(), but to no avail. The sequence file
I've created is 0 bytes in HDFS per fs -ls, and what's more disturbing is
that an fsck fails, with the error message "MISSING 1 blocks of total size
0 B" for this file.

It seems weird that creating a sequence file this way immediately puts HDFS
into a failure state (per fsck), and that I have to either close the file
to get it to flush or wait until it flushes on its own. As I've discovered,
this also breaks my map/reduce jobs: when they encounter these files on the
input path, they die with an EOFException.

Seems like I must be doing something very fundamental wrong... any ideas?
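For context, a common workaround on Hadoop versions without working flush
semantics is time-based file rolling: close the current writer on an interval
so its blocks are finalized and visible to HDFS, then open a fresh file. The
sketch below assumes this pattern; the class name, interval, and Text key/value
types are illustrative, not taken from the code above.

```java
import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: roll the SequenceFile on a time interval so each closed file
// is fully flushed to HDFS and safe to use as map/reduce input.
// Names and the interval are assumptions for illustration.
public class RollingSequenceWriter {
    private static final long ROLL_INTERVAL_MS = 15 * 60 * 1000; // e.g. 15 min

    private final FileSystem fs;
    private final Configuration conf;
    private final String root;
    private SequenceFile.Writer writer;
    private long openedAt;

    public RollingSequenceWriter(FileSystem fs, Configuration conf, String root)
            throws IOException {
        this.fs = fs;
        this.conf = conf;
        this.root = root;
        openNewFile();
    }

    private void openNewFile() throws IOException {
        // A fresh, uniquely named file per roll; once closed, a file is
        // immutable and safe to include in a job's input path.
        Path path = new Path(root + "/" + UUID.randomUUID());
        writer = SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        openedAt = System.currentTimeMillis();
    }

    public synchronized void append(Text key, Text value) throws IOException {
        if (System.currentTimeMillis() - openedAt >= ROLL_INTERVAL_MS) {
            writer.close(); // close() flushes everything to HDFS
            openNewFile();
        }
        writer.append(key, value);
    }

    public synchronized void close() throws IOException {
        writer.close();
    }
}
```

Jobs would then read only the closed files, e.g. by skipping the file
currently being written or by moving each finished file into the job's
input directory when it is closed.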

Thanks,
Brian


On Thu, Jan 29, 2009 at 4:17 PM, Brian Long <brian@dotspots.com> wrote:

> I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
> and write to using append(key, value). Because the writer volume is low,
> it's not uncommon for it to take over a day for my appends to finally be
> flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
> Because I am running map/reduce tasks on this data multiple times a day, I
> want to "flush" the sequence file so the mapred jobs can pick it up when
> they run.
> What's the right way to do this? I'm assuming it's a fairly common use
> case. Also -- are writes to the sequence files atomic? (e.g. if I am
> actively appending to a sequence file, is it always safe to read from that
> same file in a mapred job?)
>
> To be clear, I want the flushing to be time based (controlled explicitly by
> the app), not size based. Will this create waste in HDFS somehow?
>
> Thanks,
> Brian
>
>

