@xmruibi 2015-03-03T15:01:17.000000Z · 6282 words · 814 views

Map Reduce & Data Storage and Processing

cloud_computing


Map Reduce

1 Procedure of MapReduce:

1.The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

  1. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

  2. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

  3. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning
    function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

  4. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate
    data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

  5. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

  6. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
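
The six steps above can be sketched in miniature. This is a single-process simulation with a hypothetical word-count Map/Reduce pair, not the distributed library itself: the "shuffle" is just a dictionary, and the sorted iteration stands in for the reduce worker's sort.

```python
from collections import defaultdict

# Hypothetical user-defined Map and Reduce functions (word count).
def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    yield key, sum(values)

def map_reduce(inputs):
    # Map phase: buffer intermediate key/value pairs (steps 2-3).
    intermediate = defaultdict(list)
    for doc_id, text in inputs:
        for k, v in map_fn(doc_id, text):
            intermediate[k].append(v)
    # Reduce phase: iterate over keys in sorted order, as a reduce
    # worker would after reading and sorting its partition (steps 4-5).
    output = {}
    for k in sorted(intermediate):
        for key, value in reduce_fn(k, intermediate[k]):
            output[key] = value
    return output

print(map_reduce([("doc1", "cat dog cat"), ("doc2", "dog")]))
```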

2 Summary of Idea

3 Combiner

Apply the reduce function to the intermediate results locally, on each map worker, after the map function generates its output. This shrinks the amount of data that must be shuffled across the network.
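
A minimal sketch of that idea, assuming the reduce function (summing counts) is associative so it can be applied to one mapper's output in isolation:

```python
from collections import defaultdict

# Combiner sketch: apply the (associative) reduce locally on one
# mapper's output before anything crosses the network.
def combine(mapper_output):
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count  # local reduce: sum the partial counts
    return list(local.items())

# One mapper emitted ('cat', 1) three times; the combiner collapses
# them into a single ('cat', 3) pair, shrinking shuffle traffic.
pairs = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1)]
print(combine(pairs))
```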

4 Partitioner (not part of Map)

If the map output generates N distinct keys (N > R, where R is the number of reducers):
1. By default, the N keys are distributed across the R reducers by hashing the key (effectively hash(key) mod R).
2. You can supply a custom partitioner to define how the keys are distributed to the reducers.
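
The two options can be sketched as follows. The "default" partitioner mirrors hash-mod-R behavior; the custom rule (routing by first letter) is purely hypothetical:

```python
# Partitioner sketch, assuming R reducers.
R = 3

def default_partition(key):
    # Default-style behavior: hash the key modulo R.
    return hash(key) % R

def custom_partition(key):
    # Hypothetical custom rule: route keys by first letter so that
    # alphabetic ranges land on predictable reducers.
    return (ord(key[0].lower()) - ord("a")) % R

for k in ["apple", "banana", "cherry"]:
    print(k, "->", custom_partition(k))
```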

5 Shuffle & Sort

The shuffle and sort phase has two jobs:
1. Decide which key/value pairs each reducer should process (this is called partitioning).
2. Ensure the data handed to a reducer is sorted. In the example from the figure, Mapper 1's output (cat, doc1) and Mapper 2's output (cat, doc2) are merged by the shuffle into (cat, list(doc1, doc2)). After the sort, the data reaching the reducer is ordered by key (cat, chipmunk, dog, hamster).
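
The example above can be replayed directly: shuffle groups values by key, and sorting the keys gives the reducer its ordered input.

```python
from collections import defaultdict

# Shuffle & sort on the section's example: two mappers both emit the
# key "cat"; shuffle merges values per key, sort orders the keys.
mapper1 = [("cat", "doc1"), ("dog", "doc1")]
mapper2 = [("cat", "doc2"), ("chipmunk", "doc2"), ("hamster", "doc2")]

groups = defaultdict(list)
for k, v in mapper1 + mapper2:   # shuffle: group values by key
    groups[k].append(v)

for k in sorted(groups):         # sort: reducer sees keys in order
    print(k, groups[k])
```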

http://langyu.iteye.com/blog/992916

Data Storage and Processing

1 Big Table

A sparse, distributed, persistent multi-dimensional sorted map.

1.1 Physical Organization

1.2 Data Model

1.3 Get Tablet

Chubby -> Root Tablet(Metadata) -> User Table
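
The lookup chain above can be sketched with plain dictionaries. Every server and value name here is made up; the point is only the three hops: a Chubby file names the root (METADATA) tablet's server, a METADATA row names the user tablet's server, and the final read hits the user tablet.

```python
# Hypothetical tablet-location lookup: Chubby -> root/METADATA -> user table.
chubby = {"root": "metadata-server"}
metadata = {"metadata-server": {("UserTable", "row500"): "tablet-server-B"}}
tablets = {"tablet-server-B": {"row500": "value"}}

def lookup(table, row):
    root_server = chubby["root"]                          # 1. ask Chubby
    tablet_server = metadata[root_server][(table, row)]   # 2. METADATA row
    return tablets[tablet_server][row]                    # 3. user tablet read

print(lookup("UserTable", "row500"))
```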

1.4 Work Procedure

2 Dynamo

2.1 Service Level Agreement

The application must deliver its functionality within a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds.

2.2 Design Consideration

2.3 Partition

Consistent Hashing + Virtual Nodes
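
A small sketch of that partition scheme, with hypothetical node names: each physical node is hashed onto the ring at several points (its virtual nodes), and a key is owned by the first node clockwise from the key's hash. More virtual nodes per machine gives a smoother load spread.

```python
import bisect
import hashlib

def h(s):
    # Stable hash onto a large integer ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Place vnodes points per physical node on the ring.
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        # First ring point clockwise from the key's hash (wrap at the end).
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.owner("user:42"))
```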

3 Cassandra

3.1 Partition

3.2 Replication

3.3 Gossip Protocol

3.4 Cluster management

Uses Scuttlebutt (a gossip protocol) to manage nodes.
Uses gossip for node membership and to transmit system control state.
A node's failure state is given by a variable 'phi' expressing how likely the node is to have failed (a suspicion level), instead of a simple binary value (up/down).
This type of system is known as an accrual failure detector.
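
The phi idea can be sketched under a simplifying assumption: if heartbeat inter-arrival times were exponentially distributed with the observed mean, phi is the negative log-probability that a heartbeat is merely late rather than the node being dead. (Cassandra's actual estimator is richer; this only shows how suspicion grows smoothly instead of flipping up/down.)

```python
import math

def phi(time_since_last_heartbeat, mean_interval):
    # P(heartbeat still pending this long) under an exponential model...
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    # ...turned into a suspicion level: higher phi = more likely dead.
    return -math.log10(p_later)

print(round(phi(1.0, 1.0), 3))   # heartbeat on schedule: low suspicion
print(round(phi(5.0, 1.0), 3))   # 5x overdue: much higher suspicion
```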

Infrastructure as a service

1. EC2

A typical example of utility computing: a platform for running VM instances.
Functionality:

1. Launch instances with a variety of operating systems (Windows/Linux)
2. Load them with your custom application environment (a customized AMI, Amazon Machine Image)
3. Get full root access to a blank Linux machine
4. Manage your network's access permissions
5. Run your image on as many or as few systems as you desire (scaling up/down)

1.1 Amazon Machine Images

Public AMIs: Use pre-configured, template AMIs to get up and running immediately. Choose from Fedora, Movable Type, Ubuntu configurations, and more

Private AMIs: Create an Amazon Machine Image (AMI) containing your applications, libraries, data and associated configuration settings

Paid AMIs: Set a price for your AMI and let others purchase and use it (single payment and/or per hour), e.g. AMIs bundling a commercial DBMS

2 S3 (Simple Storage Service)
