@xmruibi
2015-03-03T15:01:17.000000Z
字数 6282
阅读 814
cloud_computing
1.The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
One of the copies of the program is special the master. The rest are workers that are assigned work
by the master. There areM map tasks and R reduce tasks to assign. The master picks idle workers and
assigns each one a map task or a reduce task.
A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning
function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate
data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function.The output of the Reduce function is appended to a final output file for this reduce partition.
When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
Apply reduce function to the intermediate results locally after the map generates the result.
If map’s output will generate N keys (N>R, R:# of reduces)
1. By default, N keys are randomly distributed to R reduces
2. You can use partitioner to define how the keys are distributed to the reduces.
Shuffle和Sort阶段有两个任务:
1. 决定Reducer应该处理哪些key/value对(被称为partitioning)
2. 确定给Reducer的数据是被排过序的 从下图中的例子可以看到,Mapper 1的输出cat, doc1 以及Mapper 2的输出cat, doc2,在经过Shuffle之后被汇总成cat,list(doc1, doc2)。同时,经过Sort之后,传到Reducer里的数据是按照Key排过序的(cat, chipmunk, dog, haster)
http://langyu.iteye.com/blog/992916
A sparse, distributed, persistent multi-dimensional sorted map.
Objects are indexed by 3 dimensions
Row
Column
Timestamps
Chubby -> Root Tablet(Metadata) -> User Table
Application can deliver its functionality in abounded time: Every dependency in the platform needs to deliver its functionality with even tighter bounds
Consistent Hashing + Virtual Nodes
Each data item is replicated at N (replication factor) nodes.
Different Replication Policies
Uses Scuttleback (a Gossip protocol) to manage nodes.
Uses gossip for node membership and to transmit system control state.
Node Fail state is given by variable ‘phi’ which tells how likely a node might fail (suspicion level) instead of simple binary value (up/down).
This type of system is known as Accrual Failure Detector.
A typical example of utility computing, VM instance running platform.
functionality:
1. launch instances with a variety of operating systems (windows/linux)
2. load them with your custom application environment (customized AMI (Amazon Machine image))
3. Full root access to a blank Linux machine
4. Manage your network’s access permissions
5. run your image using as many or few systems as you desire (scaling up/down)
Public AMIs: Use pre-configured, template AMIs to get up and running immediately. Choose from Fedora, Movable Type, Ubuntu configurations, and more
Private AMIs: Create an Amazon Machine Image (AMI) containing your applications, libraries, data and associated configuration settings
Paid AMIs: Set a price for your AMI and let others purchase and use it (Single payment and/or per hour)
AMIs with commercial DBMS