[prev in list] [next in list] [prev in thread] [next in thread] 

List:       hadoop-commits
Subject:    [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
From:       Apache Wiki <wikidiffs () apache ! org>
Date:       2007-01-31 23:22:26
Message-ID: 20070131232226.14019.8628 () eos ! apache ! osuosl ! org
[Download RAW message or body]

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for \
change notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

The comment on the change is:
initial revision, more to come...

------------------------------------------------------------------------------
+ = Master Node =
- ## page was renamed from HbaseArchitecture
- Describe HbaseArchitecture here. To be supplied.
  
+  * Handles table creation and deletion
+  * Responsible for assigning tablets to tablet servers
+  * Detects the addition and expiration of tablet servers
+  * Balances tablet server load
+  * Garbage collects files (SSTables) in GFS by mark-and-sweep
+  * Handles schema changes, such as the addition of Column families
+  * Keeps track of the set of live tablet servers
+  * Keeps current assignment of tablets to tablet servers, including those that are \
unassigned +  * Assigns unassigned tablets to tablet servers with sufficient room
+  * Polls the /servers directory to discover new tablet servers
+  * Regularly pings tablet servers for the status of their lock
+  * If it can't contact a tablet server or it reports that it lost it's lock, the \
Master will acquire the tablet server's lock on it's /servers Chubby file, and if \
successful, will delete it. +  * Initiates tablet merges ''(when/how does it know to \
do this?)'' +  * Startup
+   * Acquires unique "master" Chubby lock
+   * Scans /servers directory in Chubby to find live tablet servers
+   * Communicates with all tablet servers to discover tablet assignment
+   * Scans METADATA table to find all tablets and adds those that have not been \
assigned to the set of unassigned tablets + 
+ 
+ = Chubby =
+ 
+  * Distributed Lock Service
+  * Provides a namespace of directories and small files
+  * Reads and writes to a file are atomic
+  * Locking
+   * Both directories and files can be used as locks
+  * Chubby client maintains a "session" with a Chubby service, which expires if it \
is unable to renew it's session lease with ''SessionTimeout'' seconds. +  * Chubby \
client provides consistent caching of Chubby files +  * Uses
+   1. To ensure that there is at most one active master at any time
+   1. To store bootstrap location of Bigtable data
+   1. To discover tablet servers and finalize tablet server death
+   1. To store Bigtable schema information
+   1. To store access control lists
+ 
+ 
+ = Tablet Server =
+ 
+  * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row \
range of the table sorted in lexographical order. Tablets comprise two types of data \
structures: One or more on-disk structure called an SSTable, and one or more \
in-memory data structures called memtable +  * Tablet size is 100-200MB by default
+  * Memtable rows are marked Copy-on-write during reads to allow writes to happen in \
parallel +  * Handles read/write requests to tablets
+  * Splits tablets that have grown too large
+   * Initiates the split by recording information for the new tablet in the METADATA \
table +   * Notifies the master of the split
+ 
+   ''So if a METADATA tablet splits, that would imply that the root tablet needs to \
be updated.'' + 
+  * Can be dynamically added or removed
+  * Clients communicate directly with tablet servers
+  * Announces it's existence by creating a uniquely named file in the /servers \
Chubby directory +  * Stops serving it's tablets and kills itself if it cannot renew \
the lease on it's /servers file +  * Commit Log
+   * Stores redo records
+   * Contains redo records for all tablets managed by tablet server
+   * Key consists of <table, row name, log sequence number>
+   * To speed recovery when a tablet server dies, the log is sorted by key. This \
sort is done by breaking the log into 64MB chunks and is done in parallel on \
different tablet servers. The sort is managed by the Master. +   * Two logs are kept, \
one active and one inactive. When writing to one log becomes slow, a log sequence \
number is incremented, and the other log is switched to. During recovery, both logs \
are sorted together and the sequence number is used to elided duplicated entries. +  \
* Moving tablets. When a tablet is moved from one server to another, the tablet \
server does a compaction prior to the move to speed up tablet recovery. +  * SSTables \
for a tablet are registered in the METADATA table +  * Tablet recovery
+   * Reads the METADATA table to find SSTable locations and the redo points
+   * Reads SSTable indices into memory
+   * Reconstructs the memtable by applying all of the updates that have committed \
since the redo points +  * Writes
+   * Checks that the request is well-formed
+   * Checks that the sender is authorized (by reading authorization info from Chubby \
file, usually a cache hit) +   * Writes mutation to commit log (has group commit \
feature to improve performance) +   * Updates memtable
+  * Reads
+   * Checks that the request is well-formed
+   * Checks that the sender is authorized (by reading authorization info from Chubby \
file, usually a cache hit) +   * Executes read on merged view of SSTables and \
memtable +  * Compactions
+   * minor: Writes memtable to SSTable when it reaches a certain size. Writes new \
redo point into METADATA table. +   * merging: Periodically merges a few SSTables and \
the memtable into one larger SSTable. This newly generated table may contain deletion \
entries that suppress deleted data in older tables. +   * major: merging compaction \
that rewrites all SSTables into one SSTable. Contains no deletion entries +  * \
Caching +   * Scan Cache: caches the key/value pairs returned by the SSTable \
interface +    * Block Cache: caches blocks read from the SSTables
+  * Bloom Filter
+   * Optional in-memory structure that reduces disk access by
+  * API
+ {{{
+    LoadTablet()
+ }}}
+ 
+ = SSTable =
+ 
+  * Immutable (write-once)
+  * Sorted list of key/value pairs.
+  * Sequence of 64KB blocks
+  * Block index (presumably consists of the start keys for each block)
+  * Compression
+   * Per block
+   * ''Column family compression?''
+  * Can be Memory-mapped
+  * Columns are organized into Locality Groups. Separate SSTable(s) are generated \
for each locality group in each tablet. +  * Can be shared by two tablets immediately \
after a split +  * API
+   * Lookup a given key ''(with or without timestamp?)''
+   * Iterate over key/value pairs
+ 
+ 
+ = METADATA Table =
+ 
+  * Three-level heirarchy
+   * Chubby file contains the location of the "root tablet"
+   * Root tablet stores the location of all tablets in a METADATA table
+   * The "root tablet" is just the first tablet in the METADATA table, it is never \
split + 
+   ''I really don't like this definition (and I know it is from the Bigtable paper). \
Aside from the fact that it is never split, the root tablet is special in another \
way: it is the metadata table for the METADATA table.'' + 
+   ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains \
about 5KB of data resulting in a total table size of 200TB.'' + 
+   ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and \
the same number of rows in the METADATA table.'' + 
+   ''If each metadata row is about 1KB, the METADATA table size required to map all \
the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 \
METADATA tablets to map the entire table, and consequently 20 root tablet rows to map \
the METADATA tablets.'' + 
+  * Stores the location of a tablet under a row key that is an encoding of the \
tablet's table ID and it's end row +  * The "location" column family is in it's own \
locality group and has the ''InMemory'' tuning parameter set +  * Each row stores \
approximately 1KB of data in memory +  * All events pertaining to each tablet are \
logged here (such as when a tablet server starts serving a tablet) +  * ["Schema"]
+ 
+ 
+ = Client Library =
+ 
+  * Caches tablet locations
+  * If can't find locaiton, recurses up the heirarchy
+  * Contacts Chubby directly to find root tablet
+  * Client library pre-fetches tablet locations by reading metadata for more than \
one tablet whenever it reads the METADATA table. + 
+ 
+ = Configuration / Schema Definition =
+ 
+  * Tablet Size
+  * Column Families
+   * Access Control
+   * Garbage Collection (last n, newer than time t)
+   * IntegerCounter
+  * Locality Groups
+   * In Memory tuning parameter
+   * Use Bloom Filter
+ 
+ = API =
+ 
+ {{{
+ CreateTable()
+ ChangeColumnFamilyMetadata(name=ACL, value=foo)
+ Scanner
+   FetchColumnFamily
+   Lookup
+   RowName
+ 
+ ScanStream
+   SetReturnAllVersions
+   Next
+   Done
+   ColumnName
+   MicroTimestamp
+   Value
+ 
+ WriteStream?
+   Batch interface
+   Increment IntegerCounter?
+ }}}
+ 
+ = Other =
+ 
+  * Map/Reduce connector
+  * Client Sawzall script execution in Tablet server space
+ 


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic