[prev in list] [next in list] [prev in thread] [next in thread]
List: hadoop-commits
Subject: [Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman
From: Apache Wiki <wikidiffs () apache ! org>
Date: 2007-01-31 23:22:26
Message-ID: 20070131232226.14019.8628 () eos ! apache ! osuosl ! org
[Download RAW message or body]
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for \
change notification.
The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture
The comment on the change is:
initial revision, more to come...
------------------------------------------------------------------------------
+ = Master Node =
- ## page was renamed from HbaseArchitecture
- Describe HbaseArchitecture here. To be supplied.
+ * Handles table creation and deletion
+ * Responsible for assigning tablets to tablet servers
+ * Detects the addition and expiration of tablet servers
+ * Balances tablet server load
+ * Garbage collects files (SSTables) in GFS by mark-and-sweep
+ * Handles schema changes, such as the addition of Column families
+ * Keeps track of the set of live tablet servers
+ * Keeps current assignment of tablets to tablet servers, including those that are \
unassigned + * Assigns unassigned tablets to tablet servers with sufficient room
+ * Polls the /servers directory to discover new tablet servers
+ * Regularly pings tablet servers for the status of their lock
+ * If it can't contact a tablet server or it reports that it lost it's lock, the \
Master will acquire the tablet server's lock on it's /servers Chubby file, and if \
successful, will delete it. + * Initiates tablet merges ''(when/how does it know to \
do this?)'' + * Startup
+ * Acquires unique "master" Chubby lock
+ * Scans /servers directory in Chubby to find live tablet servers
+ * Communicates with all tablet servers to discover tablet assignment
+ * Scans METADATA table to find all tablets and adds those that have not been \
assigned to the set of unassigned tablets +
+
+ = Chubby =
+
+ * Distributed Lock Service
+ * Provides a namespace of directories and small files
+ * Reads and writes to a file are atomic
+ * Locking
+ * Both directories and files can be used as locks
+ * Chubby client maintains a "session" with a Chubby service, which expires if it \
is unable to renew it's session lease with ''SessionTimeout'' seconds. + * Chubby \
client provides consistent caching of Chubby files + * Uses
+ 1. To ensure that there is at most one active master at any time
+ 1. To store bootstrap location of Bigtable data
+ 1. To discover tablet servers and finalize tablet server death
+ 1. To store Bigtable schema information
+ 1. To store access control lists
+
+
+ = Tablet Server =
+
+ * Manages and serves a set of Tablets (between 10 and 1000). A tablet is a row \
range of the table sorted in lexographical order. Tablets comprise two types of data \
structures: One or more on-disk structure called an SSTable, and one or more \
in-memory data structures called memtable + * Tablet size is 100-200MB by default
+ * Memtable rows are marked Copy-on-write during reads to allow writes to happen in \
parallel + * Handles read/write requests to tablets
+ * Splits tablets that have grown too large
+ * Initiates the split by recording information for the new tablet in the METADATA \
table + * Notifies the master of the split
+
+ ''So if a METADATA tablet splits, that would imply that the root tablet needs to \
be updated.'' +
+ * Can be dynamically added or removed
+ * Clients communicate directly with tablet servers
+ * Announces it's existence by creating a uniquely named file in the /servers \
Chubby directory + * Stops serving it's tablets and kills itself if it cannot renew \
the lease on it's /servers file + * Commit Log
+ * Stores redo records
+ * Contains redo records for all tablets managed by tablet server
+ * Key consists of <table, row name, log sequence number>
+ * To speed recovery when a tablet server dies, the log is sorted by key. This \
sort is done by breaking the log into 64MB chunks and is done in parallel on \
different tablet servers. The sort is managed by the Master. + * Two logs are kept, \
one active and one inactive. When writing to one log becomes slow, a log sequence \
number is incremented, and the other log is switched to. During recovery, both logs \
are sorted together and the sequence number is used to elided duplicated entries. + \
* Moving tablets. When a tablet is moved from one server to another, the tablet \
server does a compaction prior to the move to speed up tablet recovery. + * SSTables \
for a tablet are registered in the METADATA table + * Tablet recovery
+ * Reads the METADATA table to find SSTable locations and the redo points
+ * Reads SSTable indices into memory
+ * Reconstructs the memtable by applying all of the updates that have committed \
since the redo points + * Writes
+ * Checks that the request is well-formed
+ * Checks that the sender is authorized (by reading authorization info from Chubby \
file, usually a cache hit) + * Writes mutation to commit log (has group commit \
feature to improve performance) + * Updates memtable
+ * Reads
+ * Checks that the request is well-formed
+ * Checks that the sender is authorized (by reading authorization info from Chubby \
file, usually a cache hit) + * Executes read on merged view of SSTables and \
memtable + * Compactions
+ * minor: Writes memtable to SSTable when it reaches a certain size. Writes new \
redo point into METADATA table. + * merging: Periodically merges a few SSTables and \
the memtable into one larger SSTable. This newly generated table may contain deletion \
entries that suppress deleted data in older tables. + * major: merging compaction \
that rewrites all SSTables into one SSTable. Contains no deletion entries + * \
Caching + * Scan Cache: caches the key/value pairs returned by the SSTable \
interface + * Block Cache: caches blocks read from the SSTables
+ * Bloom Filter
+ * Optional in-memory structure that reduces disk access by
+ * API
+ {{{
+ LoadTablet()
+ }}}
+
+ = SSTable =
+
+ * Immutable (write-once)
+ * Sorted list of key/value pairs.
+ * Sequence of 64KB blocks
+ * Block index (presumably consists of the start keys for each block)
+ * Compression
+ * Per block
+ * ''Column family compression?''
+ * Can be Memory-mapped
+ * Columns are organized into Locality Groups. Separate SSTable(s) are generated \
for each locality group in each tablet. + * Can be shared by two tablets immediately \
after a split + * API
+ * Lookup a given key ''(with or without timestamp?)''
+ * Iterate over key/value pairs
+
+
+ = METADATA Table =
+
+ * Three-level heirarchy
+ * Chubby file contains the location of the "root tablet"
+ * Root tablet stores the location of all tablets in a METADATA table
+ * The "root tablet" is just the first tablet in the METADATA table, it is never \
split +
+ ''I really don't like this definition (and I know it is from the Bigtable paper). \
Aside from the fact that it is never split, the root tablet is special in another \
way: it is the metadata table for the METADATA table.'' +
+ ''Suppose you had a table with one column and 4x10^9^ rows. Each row contains \
about 5KB of data resulting in a total table size of 200TB.'' +
+ ''If each tablet holds about 100MB of data, this will require 2x10^6^ tablets and \
the same number of rows in the METADATA table.'' +
+ ''If each metadata row is about 1KB, the METADATA table size required to map all \
the tablets is 2x10^9^ bytes. If each METADATA tablet is 100MB that requires 20 \
METADATA tablets to map the entire table, and consequently 20 root tablet rows to map \
the METADATA tablets.'' +
+ * Stores the location of a tablet under a row key that is an encoding of the \
tablet's table ID and it's end row + * The "location" column family is in it's own \
locality group and has the ''InMemory'' tuning parameter set + * Each row stores \
approximately 1KB of data in memory + * All events pertaining to each tablet are \
logged here (such as when a tablet server starts serving a tablet) + * ["Schema"]
+
+
+ = Client Library =
+
+ * Caches tablet locations
+ * If can't find locaiton, recurses up the heirarchy
+ * Contacts Chubby directly to find root tablet
+ * Client library pre-fetches tablet locations by reading metadata for more than \
one tablet whenever it reads the METADATA table. +
+
+ = Configuration / Schema Definition =
+
+ * Tablet Size
+ * Column Families
+ * Access Control
+ * Garbage Collection (last n, newer than time t)
+ * IntegerCounter
+ * Locality Groups
+ * In Memory tuning parameter
+ * Use Bloom Filter
+
+ = API =
+
+ {{{
+ CreateTable()
+ ChangeColumnFamilyMetadata(name=ACL, value=foo)
+ Scanner
+ FetchColumnFamily
+ Lookup
+ RowName
+
+ ScanStream
+ SetReturnAllVersions
+ Next
+ Done
+ ColumnName
+ MicroTimestamp
+ Value
+
+ WriteStream?
+ Batch interface
+ Increment IntegerCounter?
+ }}}
+
+ = Other =
+
+ * Map/Reduce connector
+ * Client Sawzall script execution in Tablet server space
+
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic