@xmruibi 2015-03-03T15:01:17.000000Z · 6282 words · 814 views

Map Reduce & Data Storage and Processing

cloud_computing


Map Reduce

1 Procedure of MapReduce:

1.The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.

  1. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

  2. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

  3. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning
    function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

  4. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate
    data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.

  5. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

  6. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
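
The six steps above can be sketched in miniature. This is a single-process simulation with a hypothetical word-count Map/Reduce pair, not the distributed library itself: the "shuffle" is just a dictionary, and the sorted iteration stands in for the reduce worker's sort.

```python
from collections import defaultdict

# Hypothetical user-defined Map and Reduce functions (word count).
def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    yield key, sum(values)

def map_reduce(inputs):
    # Map phase: buffer intermediate key/value pairs (steps 2-3).
    intermediate = defaultdict(list)
    for doc_id, text in inputs:
        for k, v in map_fn(doc_id, text):
            intermediate[k].append(v)
    # Reduce phase: iterate over keys in sorted order, as a reduce
    # worker would after reading and sorting its partition (steps 4-5).
    output = {}
    for k in sorted(intermediate):
        for key, value in reduce_fn(k, intermediate[k]):
            output[key] = value
    return output

print(map_reduce([("doc1", "cat dog cat"), ("doc2", "dog")]))
```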

2 Summary of Idea

3 Combiner

Apply the reduce function to the intermediate results locally, on each map worker, after the map function generates its output. This shrinks the amount of data that must be shuffled across the network.
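
A minimal sketch of that idea, assuming the reduce function (summing counts) is associative so it can be applied to one mapper's output in isolation:

```python
from collections import defaultdict

# Combiner sketch: apply the (associative) reduce locally on one
# mapper's output before anything crosses the network.
def combine(mapper_output):
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count  # local reduce: sum the partial counts
    return list(local.items())

# One mapper emitted ('cat', 1) three times; the combiner collapses
# them into a single ('cat', 3) pair, shrinking shuffle traffic.
pairs = [("cat", 1), ("cat", 1), ("dog", 1), ("cat", 1)]
print(combine(pairs))
```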

4 Partitioner (not part of Map)

If the map output generates N distinct keys (N > R, where R is the number of reducers):
1. By default, the N keys are distributed across the R reducers by hashing the key (effectively hash(key) mod R).
2. You can supply a custom partitioner to define how the keys are distributed to the reducers.
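
The two options can be sketched as follows. The "default" partitioner mirrors hash-mod-R behavior; the custom rule (routing by first letter) is purely hypothetical:

```python
# Partitioner sketch, assuming R reducers.
R = 3

def default_partition(key):
    # Default-style behavior: hash the key modulo R.
    return hash(key) % R

def custom_partition(key):
    # Hypothetical custom rule: route keys by first letter so that
    # alphabetic ranges land on predictable reducers.
    return (ord(key[0].lower()) - ord("a")) % R

for k in ["apple", "banana", "cherry"]:
    print(k, "->", custom_partition(k))
```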

5 Shuffle & Sort

The shuffle and sort phase has two jobs:
1. Decide which key/value pairs each reducer should process (this is called partitioning).
2. Ensure the data handed to a reducer is sorted. In the example from the figure, Mapper 1's output (cat, doc1) and Mapper 2's output (cat, doc2) are merged by the shuffle into (cat, list(doc1, doc2)). After the sort, the data reaching the reducer is ordered by key (cat, chipmunk, dog, hamster).
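
The example above can be replayed directly: shuffle groups values by key, and sorting the keys gives the reducer its ordered input.

```python
from collections import defaultdict

# Shuffle & sort on the section's example: two mappers both emit the
# key "cat"; shuffle merges values per key, sort orders the keys.
mapper1 = [("cat", "doc1"), ("dog", "doc1")]
mapper2 = [("cat", "doc2"), ("chipmunk", "doc2"), ("hamster", "doc2")]

groups = defaultdict(list)
for k, v in mapper1 + mapper2:   # shuffle: group values by key
    groups[k].append(v)

for k in sorted(groups):         # sort: reducer sees keys in order
    print(k, groups[k])
```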

http://langyu.iteye.com/blog/992916

Data Storage and Processing

1 Big Table

A sparse, distributed, persistent multi-dimensional sorted map.

1.1 Physical Organization

1.2 Data Model

1.3 Get Tablet

Chubby -> Root Tablet(Metadata) -> User Table
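
The lookup chain above can be sketched with plain dictionaries. Every server and value name here is made up; the point is only the three hops: a Chubby file names the root (METADATA) tablet's server, a METADATA row names the user tablet's server, and the final read hits the user tablet.

```python
# Hypothetical tablet-location lookup: Chubby -> root/METADATA -> user table.
chubby = {"root": "metadata-server"}
metadata = {"metadata-server": {("UserTable", "row500"): "tablet-server-B"}}
tablets = {"tablet-server-B": {"row500": "value"}}

def lookup(table, row):
    root_server = chubby["root"]                          # 1. ask Chubby
    tablet_server = metadata[root_server][(table, row)]   # 2. METADATA row
    return tablets[tablet_server][row]                    # 3. user tablet read

print(lookup("UserTable", "row500"))
```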

1.4 Work Procedure

2 Dynamo

2.1 Service Level Agreement

The application must deliver its functionality within a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds.

2.2 Design Consideration

2.3 Partition

Consistent Hashing + Virtual Nodes
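
A small sketch of that partition scheme, with hypothetical node names: each physical node is hashed onto the ring at several points (its virtual nodes), and a key is owned by the first node clockwise from the key's hash. More virtual nodes per machine gives a smoother load spread.

```python
import bisect
import hashlib

def h(s):
    # Stable hash onto a large integer ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Place vnodes points per physical node on the ring.
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        # First ring point clockwise from the key's hash (wrap at the end).
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.owner("user:42"))
```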

3 Cassandra

3.1 Partition

3.2 Replication

3.3 Gossip Protocol

3.4 Cluster management

Uses Scuttlebutt (a gossip protocol) to manage nodes.
Uses gossip for node membership and to transmit system control state.
A node's failure state is given by a variable 'phi' expressing how likely the node is to have failed (a suspicion level), instead of a simple binary value (up/down).
This type of system is known as an accrual failure detector.
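
The phi idea can be sketched under a simplifying assumption: if heartbeat inter-arrival times were exponentially distributed with the observed mean, phi is the negative log-probability that a heartbeat is merely late rather than the node being dead. (Cassandra's actual estimator is richer; this only shows how suspicion grows smoothly instead of flipping up/down.)

```python
import math

def phi(time_since_last_heartbeat, mean_interval):
    # P(heartbeat still pending this long) under an exponential model...
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    # ...turned into a suspicion level: higher phi = more likely dead.
    return -math.log10(p_later)

print(round(phi(1.0, 1.0), 3))   # heartbeat on schedule: low suspicion
print(round(phi(5.0, 1.0), 3))   # 5x overdue: much higher suspicion
```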

Infrastructure as a service

1. EC2

A typical example of utility computing: a platform for running VM instances.
Functionality:

1. Launch instances with a variety of operating systems (Windows/Linux)
2. Load them with your custom application environment (a customized AMI, Amazon Machine Image)
3. Get full root access to a blank Linux machine
4. Manage your network's access permissions
5. Run your image on as many or as few systems as you desire (scaling up/down)

1.1 Amazon Machine Images

Public AMIs: Use pre-configured, template AMIs to get up and running immediately. Choose from Fedora, Movable Type, Ubuntu configurations, and more

Private AMIs: Create an Amazon Machine Image (AMI) containing your applications, libraries, data and associated configuration settings

Paid AMIs: Set a price for your AMI and let others purchase and use it (single payment and/or per hour), e.g. AMIs bundling a commercial DBMS

2 S3 (Simple Storage Service)
