@xtccc 2015-11-19T01:15:04.000000Z

HBase Architecture


HBase

Storage

HBase handles basically two kinds of files: one is used for the write-ahead log and the other for the actual data storage. Both kinds of files are handled by the HRegionServer.


How does a client access a row (by rowkey) in HBase?

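First, the client contacts the ZooKeeper (ZK) ensemble and asks which node holds the -ROOT- region. Next, it asks the RegionServer (RS) hosting -ROOT- which .META. region covers the target rowkey, and then asks the RS hosting that .META. region which node holds the user-table region containing the target rowkey. (The client caches the results of all these lookups.) Finally, the client talks to the target RS directly to perform the actual data query.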

Because the client caches this metadata, it accumulates more and more of it over time and needs to query .META. less and less often.
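Here is a toy Python model of the lookup-and-cache behaviour above. It is not the HBase client API. The region layout, the server names (rs1, rs2, rs3), and the assumption of a single .META. region are all hypothetical. The first cache miss costs three round trips (ZK, -ROOT-, .META.). Later misses only need .META., and cache hits need none. A real client also has to detect stale cache entries, for example after a region moves, and look the region up again. This sketch leaves that out.

```python
import bisect
from typing import List, Tuple

# (start_key, end_key, server); an empty end_key means "up to the end of the table".
Region = Tuple[str, str, str]


def _contains(region: Region, row: str) -> bool:
    start, end, _ = region
    return start <= row and (end == "" or row < end)


class CachingLocator:
    """Toy model of the client-side lookup described above (pre-0.96 scheme)."""

    def __init__(self, regions: List[Region]):
        # Hypothetical authoritative layout, sorted by start key; stands in for .META.
        self._regions = regions
        self._starts = [r[0] for r in regions]
        self._meta_location_cached = False      # result of the ZK and -ROOT- steps
        self._region_cache: List[Region] = []   # results of earlier .META. lookups
        self.remote_lookups = 0

    def locate(self, row: str) -> str:
        # 1. Cache hit: no round trips at all.
        for region in self._region_cache:
            if _contains(region, row):
                return region[2]
        # 2. First miss ever: ask ZK for -ROOT-, then -ROOT- for the .META. region.
        if not self._meta_location_cached:
            self.remote_lookups += 2
            self._meta_location_cached = True
        # 3. Ask .META. which region (and server) covers the row, and cache the answer.
        self.remote_lookups += 1
        region = self._regions[bisect.bisect_right(self._starts, row) - 1]
        self._region_cache.append(region)
        return region[2]
```

Take a table split at row-300. Locating row-100 and then row-200 costs three lookups in total, because the second call is answered from the cache.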

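Storing the actual data

The HRegionServer opens regions and creates a corresponding HRegion instance for each one.

When an HRegion is opened, the HRegionServer creates one Store instance for each column family of the table.

Each Store instance holds one or more StoreFile instances. A StoreFile is a lightweight wrapper around an HFile.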
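The object hierarchy just described can be sketched with plain Python dataclasses: one HRegion per opened region, one Store per column family, and several StoreFiles per Store. The classes only borrow the HBase names for readability. They are not the HBase Java classes, and any HFile path used with them is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoreFile:
    """Stand-in for a StoreFile: just remembers which HFile it wraps."""
    hfile_path: str


@dataclass
class Store:
    """One Store per column family, holding that family's StoreFiles."""
    family: str
    store_files: List[StoreFile] = field(default_factory=list)


@dataclass
class HRegion:
    region_name: str
    stores: Dict[str, Store] = field(default_factory=dict)


def open_region(region_name: str, families: List[str]) -> HRegion:
    """Mimic the open step: create the HRegion, then one Store per column family."""
    region = HRegion(region_name)
    for family in families:
        region.stores[family] = Store(family)
    return region
```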

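How is data written to HBase?

When a client writes data to HBase, the data is first written to the WAL (write-ahead log). (The WAL makes it possible to recover data that has not yet been persisted if the RS crashes.)

After the data has been written to the WAL, it is placed in the MemStore. At the same time, HBase checks whether the MemStore is full. The threshold is set by hbase.hregion.memstore.flush.size: 64MB by default in older releases, while newer releases, including 0.98, default to 128MB. If the MemStore is full, it is flushed to disk. The flush is carried out by a separate thread in the RS, which writes the data to a new HFile in HDFS.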
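Below is a minimal sketch of this write path. It assumes a single region with one MemStore and treats cells as plain (row, value) pairs. The flush_size parameter stands in for hbase.hregion.memstore.flush.size. The in-memory lists stand in for the WAL and for the HFiles written to HDFS. The class and attribute names are hypothetical.

Unlike HBase, the sketch flushes synchronously instead of on a separate thread. It also never truncates the WAL, since archiving old logs is handled separately, as described under HDFS Files.

```python
from typing import List, Tuple

Cell = Tuple[bytes, bytes]  # (row, value); real cells also carry family, qualifier, timestamp


class RegionWritePath:
    """Toy model of the write path: WAL first, then MemStore, flushing at a size threshold."""

    def __init__(self, flush_size: int):
        # flush_size plays the role of hbase.hregion.memstore.flush.size (in bytes).
        self.flush_size = flush_size
        self.wal: List[Cell] = []           # append-only log, used for crash recovery
        self.memstore: List[Cell] = []      # in-memory write buffer
        self.memstore_bytes = 0
        self.hfiles: List[List[Cell]] = []  # each flush produces one new, sorted "HFile"

    def put(self, row: bytes, value: bytes) -> None:
        self.wal.append((row, value))       # 1. record the edit in the WAL first
        self.memstore.append((row, value))  # 2. then buffer it in the MemStore
        self.memstore_bytes += len(row) + len(value)
        if self.memstore_bytes >= self.flush_size:  # 3. flush once the buffer is full
            self.flush()

    def flush(self) -> None:
        # HBase hands this to a separate flusher thread; it runs inline here for clarity.
        # sorted() is stable, so multiple edits to the same row keep their write order.
        self.hfiles.append(sorted(self.memstore, key=lambda cell: cell[0]))
        self.memstore = []
        self.memstore_bytes = 0
```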

HDFS Files

write-ahead log files

In HDFS, /hbase/WALs contains one directory per RS, and each of these directories holds a number of HLog files. All regions on a given RS share the same set of HLog files.
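All regions on one RS share the same log, so edits for different regions end up interleaved in the same files. If the RS dies and its regions are reopened elsewhere, the log has to be separated by region before it can be replayed. HBase calls this log splitting. The sketch below shows only the grouping step. It uses a hypothetical WALEntry record keyed by encoded region name and sequence ID, and it does not read real HLog files.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class WALEntry(NamedTuple):
    seq_id: int    # increasing sequence number assigned by the region server
    region: str    # encoded name of the region the edit belongs to
    row: bytes
    value: bytes


def split_by_region(wal: List[WALEntry]) -> Dict[str, List[WALEntry]]:
    """Group one RS's shared log into per-region edit lists, keeping log order,
    so that each region can be replayed wherever it is reopened."""
    per_region: Dict[str, List[WALEntry]] = defaultdict(list)
    for entry in sorted(wal, key=lambda e: e.seq_id):
        per_region[entry.region].append(entry)
    return dict(per_region)
```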

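Once all the edits in a log file have been persisted to store files, the log file is moved to /hbase/oldWALs. The master deletes files in that directory once they are older than 10 minutes (this time-to-live is configured by hbase.master.logcleaner.ttl).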
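hbase.master.logcleaner.ttl is a time-to-live. An archived log becomes eligible for deletion once it is older than the TTL, which defaults to 600000 ms (10 minutes). Here is a sketch of that check. It assumes the modification times are already available as a path-to-milliseconds map. That map is a hypothetical input; the real cleaner gets the times from HDFS.

```python
from typing import Dict, List

DEFAULT_LOG_TTL_MS = 600_000  # 10 minutes, the default of hbase.master.logcleaner.ttl


def expired_logs(archived: Dict[str, int], now_ms: int,
                 ttl_ms: int = DEFAULT_LOG_TTL_MS) -> List[str]:
    """Return the archived log paths that a TTL-based cleaner would delete.

    `archived` maps a path under /hbase/oldWALs to its modification time in epoch
    millis (hypothetical input standing in for HDFS file modification times)."""
    return sorted(path for path, mtime in archived.items() if now_ms - mtime > ttl_ms)
```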

The files /hbase/hbase.id and /hbase/hbase.version hold the HBase cluster ID and the file-format version, respectively.

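Per-table files

Each table has its own directory. For example, testtable's HDFS directory is /hbase/data/default/testtable (testtable is in the default namespace). Each table directory contains a .tabledesc subdirectory whose files hold metadata about the table and its column families.

The table directory also contains one subdirectory per region of the table. Each subdirectory is named after the MD5-hash part of the region name (the encoded region name). For example, the HBase Web UI shows that testtable has 5 regions, named: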

testtable,,1444656645598.f655a770d069d70bce5a3c85826c550a.
testtable,row-300,1444656645598.9a55b2955f0e98a79fceadef74331ebb.
testtable,row-500,1444656645598.c75ed551d1b7895505fbea08d82e137d.
testtable,row-700,1444656645598.3c6450f6bf407275a623ba9faa08fa5f.
testtable,row-900,1444656645598.af90f4069bc0a763bc424cdfee4dd2bc.

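A region name consists of the table name, the start key, and a timestamp (the region ID), followed by the MD5-hash part (the encoded name), in the form <table>,<startKey>,<regionId>.<encodedName>.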
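The naming scheme can be sketched in Python. This sketch assumes the encoded name is the hex MD5 digest of the <table>,<startKey>,<regionId> prefix, based on how HRegionInfo builds region names in the 0.9x code base. It is not guaranteed to reproduce the exact hashes listed above. region_name, encoded_name, and region_dir are hypothetical helpers, and region_dir assumes the /hbase/data/<namespace>/<table>/ layout shown in this note.

```python
import hashlib


def region_name(table: str, start_key: str, region_id: int) -> str:
    """Build '<table>,<startKey>,<regionId>.<encodedName>.' (assumed MD5-based encoding)."""
    prefix = f"{table},{start_key},{region_id}"
    encoded = hashlib.md5(prefix.encode("utf-8")).hexdigest()
    return f"{prefix}.{encoded}."


def encoded_name(full_name: str) -> str:
    """The encoded name sits between the last two '.' characters."""
    return full_name.rstrip(".").rsplit(".", 1)[1]


def region_dir(namespace: str, table: str, full_name: str) -> str:
    """HDFS directory of a region, assuming the /hbase/data/<ns>/<table>/ layout."""
    return f"/hbase/data/{namespace}/{table}/{encoded_name(full_name)}"
```

The encoded name is a fixed-length suffix of 32 hex characters, so it can be cut off a full region name from the right. That is how the region names in the Web UI map onto the directory names in the HDFS listing.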

Listing testtable's directories in HDFS shows the matching region directories:

# hdfs dfs -ls /hbase/data/default/testtable
/hbase/data/default/testtable/.tabledesc
/hbase/data/default/testtable/.tmp
/hbase/data/default/testtable/3c6450f6bf407275a623ba9faa08fa5f
/hbase/data/default/testtable/9a55b2955f0e98a79fceadef74331ebb
/hbase/data/default/testtable/af90f4069bc0a763bc424cdfee4dd2bc
/hbase/data/default/testtable/c75ed551d1b7895505fbea08d82e137d
/hbase/data/default/testtable/f655a770d069d70bce5a3c85826c550a

# hdfs dfs -ls /hbase/data/default/testtable/3c6450f6bf407275a623ba9faa08fa5f
/hbase/data/default/testtable/3c6450f6bf407275a623ba9faa08fa5f/.regioninfo
/hbase/data/default/testtable/3c6450f6bf407275a623ba9faa08fa5f/.tmp
/hbase/data/default/testtable/3c6450f6bf407275a623ba9faa08fa5f/colfam1

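Here, .regioninfo holds the region's metadata, .tmp holds temporary files, and colfam1 is the directory for the column family colfam1, containing that family's store files (HFiles).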

Catalog tables

HBase has two catalog tables: -ROOT- and .META.

The -ROOT- table is used to refer to all regions in the .META. table. The design assumes a single root region, which is never split. This guarantees a three-level, B+ tree-like lookup scheme. The first level is a node stored in ZooKeeper that contains the location of the root table's region, in other words, the name of the region server hosting that specific region. The second level is the lookup of a matching meta region from the -ROOT- table, and the third is the retrieval of the user table region from the .META. table.

Lookup Path

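In fact, the -ROOT- table no longer exists as of HBase 0.96, and therefore not in 0.98 either. The .META. table is gone as well; it has been replaced by the table hbase:meta.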

In hbase:meta, the rowkey of each row is a region name, for example testtable,row-300,1444656645598.9a55b2955f0e98a79fceadef74331ebb.
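The rowkeys of hbase:meta are region names sorted by table and start key. Finding the region for a given row is therefore a "closest row at or before" search. Here is a minimal sketch. It assumes plain string ordering and a search key of the form <table>,<row>,99999999999999. The HBase client uses a similar all-nines key for this lookup, but the real meta comparator handles the delimiters specially, so treat this as an approximation. locate_region is a hypothetical helper.

```python
import bisect
from typing import List, Optional

NINES = "99999999999999"  # sorts after any realistic region ID at the same table/row prefix


def locate_region(meta_rowkeys: List[str], table: str, row: str) -> Optional[str]:
    """Return the meta rowkey (region name) of the region that should hold `row`.

    `meta_rowkeys` must be sorted. The answer is the greatest rowkey that is
    <= '<table>,<row>,<NINES>', provided it belongs to the same table."""
    search_key = f"{table},{row},{NINES}"
    i = bisect.bisect_right(meta_rowkeys, search_key)
    if i == 0:
        return None
    candidate = meta_rowkeys[i - 1]
    return candidate if candidate.startswith(table + ",") else None
```

The comma after the row keeps row-30 from being matched to the region starting at row-300, because ',' sorts before any digit.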

ZooKeeper

Why does HBase use ZK?

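Because HBase is a distributed system, its state changes very frequently. For example, each region may pass through states such as Offline, Pending Open, Opening, Open, Pending Close, Closing, Closed, Splitting, and Split. HBase therefore uses ZK znodes to track these state changes.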
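The kind of bookkeeping these znodes support can be sketched as a small state machine. The transition table below is a hypothetical simplification built from the states listed above; for instance, it lets only an Open region start splitting. The real master's region state handling allows more transitions and also records which server each region is on.

```python
from enum import Enum
from typing import Dict, Iterable, Set


class RegionState(Enum):
    OFFLINE = "Offline"
    PENDING_OPEN = "Pending Open"
    OPENING = "Opening"
    OPEN = "Open"
    PENDING_CLOSE = "Pending Close"
    CLOSING = "Closing"
    CLOSED = "Closed"
    SPLITTING = "Splitting"
    SPLIT = "Split"


S = RegionState

# Hypothetical, simplified transition table (not HBase's actual rules).
ALLOWED: Dict[RegionState, Set[RegionState]] = {
    S.OFFLINE: {S.PENDING_OPEN},
    S.PENDING_OPEN: {S.OPENING},
    S.OPENING: {S.OPEN},
    S.OPEN: {S.PENDING_CLOSE, S.SPLITTING},
    S.PENDING_CLOSE: {S.CLOSING},
    S.CLOSING: {S.CLOSED},
    S.CLOSED: {S.OFFLINE},
    S.SPLITTING: {S.SPLIT},
    S.SPLIT: set(),  # a split parent is retired; its daughters start their own lifecycle
}


def apply_transitions(events: Iterable[RegionState],
                      start: RegionState = RegionState.OFFLINE) -> RegionState:
    """Walk a region through a sequence of states, rejecting illegal jumps."""
    state = start
    for nxt in events:
        if nxt not in ALLOWED[state]:
            raise ValueError(f"illegal transition: {state.value} -> {nxt.value}")
        state = nxt
    return state
```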

The following is excerpted from What are HBase znodes?:

In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase's ZooKeeper data is removed, only the transient operations are affected – data can continue to be written and read to/from HBase.

In fact, if you delete the /hbase znode in ZK, HBase still runs normally after a restart.

HBase uses ZK for the following:

tracking region servers
where the root region is hosted

HBase creates a set of znodes under its root znode (/hbase by default, configurable via zookeeper.znode.parent):

meta-region-server: the name of the server hosting the .META. region
backup-masters
table: Used by the master to track the table state during assignments (disabling/enabling states, for example).
draining: Used to decommission more than one RegionServer at a time by creating sub-znodes with the form serverName,port,startCode (for example, /hbase/draining/m1.host,60020,1338936306752). This lets you decommission multiple RegionServers without having the risk of regions temporarily moved to a RegionServer that will be decommissioned later.
region-in-transition
table-lock
running
balancer
master: the name of the server hosting the master
namespace
hbaseid: the cluster ID, identical to the contents of the HDFS file /hbase/hbase.id
online-snapshot
replication
splitWAL
recovering-regions
rs
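In the draining format quoted above, the sub-znode name is just serverName,port,startCode placed under <parent>/draining. Here <parent> is the value of zookeeper.znode.parent. The helpers below only build and parse these path strings. ServerName, draining_znode, and parse_server_name are hypothetical names, not HBase or ZooKeeper APIs. Actually creating the znode would be done with a ZooKeeper client, which is not shown.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    host: str
    port: int
    start_code: int  # the region server's start timestamp


def draining_znode(parent: str, server: ServerName) -> str:
    """Path of the sub-znode that marks `server` as draining."""
    return f"{parent.rstrip('/')}/draining/{server.host},{server.port},{server.start_code}"


def parse_server_name(node_name: str) -> ServerName:
    """Inverse of the naming above: 'host,port,startCode' -> ServerName."""
    host, port, start_code = node_name.rsplit(",", 2)
    return ServerName(host, int(port), int(start_code))
```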

If security (for example, Kerberos) is enabled, there may be additional znodes.
The Access Control List (ACL) and the Token Provider coprocessors add two more znodes: one to synchronize access to table ACLs and the other to synchronize the token encryption keys across the cluster nodes.

/hbase/acl: The acl znode is used for synchronizing the changes made to the acl table by the grant/revoke commands. Each table will have a sub-znode (/hbase/acl/tableName) containing the ACLs of the table.

/hbase/tokenauth: The token provider is usually used to allow a MapReduce job to access the HBase cluster. When a user asks for a new token, the information will be stored in a sub-znode created for the key (/hbase/tokenauth/keys/key-id).