
List:       hadoop-user
Subject:    hadoop file format query
From:       Mapred Learn <mapred.learn () gmail ! com>
Date:       2011-02-24 22:51:07
Message-ID: AANLkTinUt069uB10R1dnwKZWdKZakON3A7xz3aLftQOA () mail ! gmail ! com


Hi,
I have a use case where I need to upload gzipped text files, ranging from
10-30 GB in size, to HDFS.
We have decided on SequenceFile as the storage format on HDFS.
I have a few doubts/questions about it:

i) What is the optimal size for a sequence file, given that the input text
files range from 10-30 GB? Can a sequence file be the same size as the
text file?

ii) Is there a tool that can convert a gzipped text file to a sequence
file?
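For reference, the kind of conversion we have in mind looks roughly like this. This is only a sketch against the old SequenceFile.Writer API (the key/value choice of line offset and line text, and the command-line paths, are our own assumptions, not a standard tool):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class GzipToSeqFile {
    public static void main(String[] args) throws Exception {
        // args[0] = gzipped text input, args[1] = output sequence file.
        // With no fs.default.name set, this runs against the local filesystem.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Block-compressed output; key = approximate line offset, value = line.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]),
                LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new GzipCodec());

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(args[0]))));
        try {
            long offset = 0;
            String line;
            LongWritable key = new LongWritable();
            Text value = new Text();
            while ((line = in.readLine()) != null) {
                key.set(offset);
                value.set(line);
                writer.append(key, value);
                offset += line.length() + 1; // approximate: assumes 1-byte chars + newline
            }
        } finally {
            in.close();
            writer.close();
        }
    }
}
```

A single-threaded driver like this is fine for a one-off load, but for 10-30 GB inputs a MapReduce job doing the same append loop per split would presumably be preferable.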

iii) What would be a good way to manage metadata for these files?
Currently, we have about 30-40 different schemas for these text files. We
have thought of 2 options:
    -  Uploading the metadata as a text file on HDFS alongside the data,
so users can view it with hadoop fs -cat <file>.
    -  Adding the metadata to the sequence file header. In this case, we
could not find how to fetch the metadata back out of the sequence file,
and we need to give our downstream users a way to see the metadata of the
data they are reading.
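To make the second option concrete, this is roughly the read side we are looking for, assuming SequenceFile.Reader.getMetadata() and SequenceFile.Metadata are the right API for header metadata (the class name and output format below are our own sketch):

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ShowSeqFileMetadata {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open the sequence file and read back the metadata stored in its header.
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
            SequenceFile.Metadata meta = reader.getMetadata();
            for (Map.Entry<Text, Text> e : meta.getMetadata().entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        } finally {
            reader.close();
        }
    }
}
```

On the write side, the metadata would presumably be attached at creation time via one of the SequenceFile.createWriter overloads that takes a SequenceFile.Metadata argument; downstream users could then run a small tool like the above instead of hadoop fs -cat.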

Thanks a lot!
-JJ


