@zqbinggong 2018-06-12T12:16:42.000000Z

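Developing a MapReduce Application

Hadoop: The Definitive Guide

The Configuration API

Hadoop components are configured through Hadoop's own configuration API. A Configuration instance reads its property values from resources, which are XML files that define name-value pairs in a simple structure.

Properties are accessed mainly through the get methods: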

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide")); // breadth is not defined in the XML, so the supplied default "wide" is returned

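Combining resources: properties defined in resources added later override earlier definitions, except that a property marked final cannot be overridden.

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml"); // overrides properties of the same name from configuration-1.xml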
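
The override rule can be modelled without Hadoop on the classpath. The sketch below is a plain-Java illustration of the semantics described above, not Hadoop's implementation: the class MergeSketch, its addResource method, and the sample values for size and weight are all hypothetical. Each "resource" is just a map of properties plus the set of names it marks final.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of resource merging; this is not Hadoop's Configuration class.
final class MergeSketch {
    private final Map<String, String> values = new LinkedHashMap<>();
    private final Set<String> finalNames = new HashSet<>();

    // Apply one resource: later values win unless an earlier resource marked the name final.
    void addResource(Map<String, String> properties, Set<String> markedFinal) {
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (finalNames.contains(e.getKey())) {
                continue; // already final: ignore the attempted override
            }
            values.put(e.getKey(), e.getValue());
            if (markedFinal.contains(e.getKey())) {
                finalNames.add(e.getKey());
            }
        }
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("size", "10");
        first.put("weight", "heavy");
        Map<String, String> second = new HashMap<>();
        second.put("size", "12");
        second.put("weight", "light");

        MergeSketch conf = new MergeSketch();
        // The first resource marks weight as final.
        conf.addResource(first, new HashSet<>(Arrays.asList("weight")));
        // The second resource tries to override both properties.
        conf.addResource(second, Collections.<String>emptySet());

        System.out.println(conf.get("size"));   // the later definition wins
        System.out.println(conf.get("weight")); // the final value is kept
    }
}
```

In this model, size ends up with the value from the second resource while weight keeps its first value, which is the behaviour the note describes for final properties.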

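Variable expansion: a configuration property can be defined in terms of other properties or system properties. This is especially useful for overriding properties from the command line with the JVM argument -Dproperty=value.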

<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
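
To make the lookup order concrete, here is a minimal sketch of ${name} expansion in plain Java. It assumes that a system property takes priority over a configuration property of the same name, and it expands in a single pass only. ExpansionSketch and expand are hypothetical names; Hadoop's real logic lives inside Configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of ${name} expansion; not Hadoop's actual code.
final class ExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name} with a system property if one is set, else a config property.
    static String expand(String raw, Map<String, String> conf) {
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value = System.getProperty(name);
            if (value == null) {
                value = conf.get(name);
            }
            if (value == null) {
                value = m.group(); // leave unknown variables untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("size", "12");
        conf.put("weight", "heavy");
        System.out.println(expand("${size},${weight}", conf));
        System.setProperty("size", "14"); // same effect as passing -Dsize=14 to the JVM
        System.out.println(expand("${size},${weight}", conf));
    }
}
```

The second call shows why the feature suits command-line overrides: setting the system property changes the expanded value without touching any resource file.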

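Setting Up the Development Environment

Managing Configuration

Problem: when developing Hadoop applications, you often need to switch between running locally and running on a cluster.
Solution: keep Hadoop configuration files that contain the connection settings for each cluster, and specify which one to use when you run a Hadoop application or tool.
Best practice: keep these files outside the Hadoop installation directory tree, which makes switching easy and avoids duplicating or losing settings.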

hadoop fs -conf conf/hadoop-localhost.xml -ls

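GenericOptionsParser, Tool, and ToolRunner

API conf
The API documentation gives an example of implementing the Tool interface, in the same form as the example in the book.
ToolRunner.run(new ConfigurationPrinter(), args)
- Equivalent to run(tool.getConf(), tool, args)
  
  public static int run(Configuration conf,Tool tool,String[] args)
  Runs the given Tool by Tool.run(String[]), after parsing with the given generic arguments. Uses the given Configuration, or builds one if null. Sets the Tool's configuration with the possibly modified version of the conf.
- Note that the args parameter is parsed by GenericOptionsParser
Usage examples. Note the space after -D in example 2. This differs from the JVM argument described under variable expansion, which has no space. The two also do different things: the former sets an individual configuration property through GenericOptionsParser, while the latter sets a JVM system property.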

1. hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml
2. hadoop ConfigurationPrinter -D color=yellow

import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {
    static {
        // By default Configuration loads only core-default.xml and core-site.xml
        Configuration.addDefaultResource("hdfs-default.xml");
    }

    @Override
    public int run(String[] args) throws Exception {
        // getConf() comes from the Configurable interface; Configured is its simplest implementation
        Configuration conf = getConf();
        for (Entry<String, String> entry : conf) {
            System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
        System.exit(exitCode);
    }
}
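
To see what "args is parsed by GenericOptionsParser" amounts to for the -D option, here is a rough plain-Java sketch. It is not the real parser, which handles several other generic options as well. GenericOptionsSketch and its fields are hypothetical names, and only the form with a space after -D is handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "-D property=value" handling; not the real GenericOptionsParser.
final class GenericOptionsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D" followed by a separate "name=value" token
                String[] kv = args[++i].split("=", 2);
                properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remainingArgs.add(args[i]); // left for the tool's own run() method
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(
            new String[] {"-D", "color=yellow", "input", "output"});
        System.out.println(parsed.properties);
        System.out.println(parsed.remainingArgs);
    }
}
```

The point of the split is that the properties end up in the configuration while the tool's run() method only sees the remaining arguments.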

Writing Unit Tests with MRUnit

MapDriver
ReduceDriver

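Running Locally on Test Data

Goal: write a job driver, then run it on test data on the development machine.

Running a Job in a Local Job Runner

Hadoop comes with a local job runner, a cut-down version of the MapReduce execution engine that runs MapReduce jobs in a single JVM. It is designed for testing.
The local job runner is used when mapreduce.framework.name is set to local, which is the default.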

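import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(getConf(), "Max temperature");
        job.setJarByClass(getClass());
        // args[0] is the input path, args[1] the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
        System.exit(exitCode);
    }
}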

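Testing the Driver

Besides giving the application flexible configuration options, implementing Tool also makes it more testable, because you can inject an arbitrary Configuration.

Using the local job runner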

@Test
public void test() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");
    conf.setInt("mapreduce.task.io.sort.mb", 1);
    Path input = new Path("input/ncdc/micro");
    Path output = new Path("output");
    FileSystem fs = FileSystem.getLocal(conf);
    fs.delete(output, true); // delete old output
    MaxTemperatureDriver driver = new MaxTemperatureDriver();
    driver.setConf(conf);
    int exitCode = driver.run(new String[] {
        input.toString(), output.toString() });
    assertThat(exitCode, is(0));
    checkOutput(conf, output); // compares the actual output with the expected output (where does the expected output come from?)
}
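
On the question in the comment above: one plausible answer, stated here as an assumption rather than something these notes confirm, is that the expected output is a small hand-verified file kept alongside the test, which the check then compares line by line with the job's output file. The sketch below shows only the comparison step in plain Java. OutputCheckSketch, firstMismatch, and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for checkOutput: compares actual lines with expected lines.
final class OutputCheckSketch {
    // Returns the 1-based number of the first differing line, or 0 if the outputs match.
    static int firstMismatch(List<String> expected, List<String> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i + 1;
            }
        }
        // One side has extra lines: report the first line that has no counterpart.
        return expected.size() == actual.size() ? 0 : common + 1;
    }

    public static void main(String[] args) {
        // Sample year/temperature lines, made up for illustration.
        List<String> expected = Arrays.asList("1949\t111", "1950\t22");
        List<String> actual = Arrays.asList("1949\t111", "1950\t22");
        System.out.println(firstMismatch(expected, actual));
    }
}
```

In a real test the two lists would be read from files, for example with java.nio.file.Files.readAllLines, and the result asserted to be 0.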

Using a mini cluster

Running on a Cluster

Packaging a Job

The local job runner runs a job in a single JVM, so as long as all the classes the job needs (for example, the mapper class you wrote) are on the classpath, the job runs normally.
In a distributed setting:
- First, the job's classes must be packaged into a job JAR file to send to the cluster.
- Hadoop will find the job JAR automatically by searching for the JAR on the driver’s classpath that contains the class set in the setJarByClass() method (on JobConf or Job). Alternatively, if you want to set an explicit JAR file by its file path, you can use the setJar() method. (The JAR file path may be local or an HDFS file path.)
If you have a single job per JAR, you can specify the main class to run in the JAR file's manifest.
Any dependent JAR files should be packaged in a lib subdirectory of the job JAR file.

The client classpath

The user's client-side classpath set by hadoop jar <jar> is made up of:

The job JAR file
Any JAR files in the lib directory of the job JAR file, and the classes directory (if present)
The classpath defined by HADOOP_CLASSPATH, if set
Incidentally, this explains why you have to set HADOOP_CLASSPATH to point to dependent classes and libraries if you are running using the local job runner without a job JAR (hadoop CLASSNAME).

The task classpath

On a cluster (and this includes pseudodistributed mode), map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH. HADOOP_CLASSPATH is a client-side setting and only sets the classpath for the driver JVM, which submits the job. The user's task classpath is instead made up of the following:

The job JAR file
Any JAR files contained in the lib directory of the job JAR file, and the classes directory (if present)
Any files added to the distributed cache using the -libjars option (see Table 6-1), or the addFileToClassPath() method on DistributedCache (old API), or Job (new API)

Packaging dependencies

The options for including library dependencies for a job:

Unpack the libraries and repackage them in the job JAR.
Package the libraries in the lib directory of the job JAR.
Keep the libraries separate from the job JAR, and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.

Task classpath precedence

User JAR files are added to the end of both the client classpath and the task classpath. This means that if Hadoop uses a different, incompatible version of a library that your code also uses, conflicts may occur. In that case you need to change the classpath order so that your classes are picked up first.

On the client: set the HADOOP_USER_CLASSPATH_FIRST environment variable to true
For the task classpath: set mapreduce.job.user.classpath.first to true
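
Why the order matters can be shown with a toy model: a class is loaded from the first classpath entry that provides it, so a library bundled with Hadoop shadows the copy in the job JAR unless the user entries come first. Everything in the sketch below is made up for illustration, including the entry names hadoop-libs and job.jar and the class org.example.JsonParser. It does not use a real class loader.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of classpath lookup order; entry and class names are made up.
final class ClasspathOrderSketch {
    // Return the name of the first classpath entry that provides the class, or null.
    static String resolve(Map<String, List<String>> classpath, String className) {
        for (Map.Entry<String, List<String>> entry : classpath.entrySet()) {
            if (entry.getValue().contains(className)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> hadoopClasses = Arrays.asList("org.example.JsonParser");
        List<String> jobClasses = Arrays.asList("org.example.JsonParser", "MaxTemperatureMapper");

        // Default order: user JARs come last.
        Map<String, List<String>> defaultOrder = new LinkedHashMap<>();
        defaultOrder.put("hadoop-libs", hadoopClasses);
        defaultOrder.put("job.jar", jobClasses);

        // Order with the user-classpath-first settings enabled.
        Map<String, List<String>> userFirst = new LinkedHashMap<>();
        userFirst.put("job.jar", jobClasses);
        userFirst.put("hadoop-libs", hadoopClasses);

        System.out.println(resolve(defaultOrder, "org.example.JsonParser"));
        System.out.println(resolve(userFirst, "org.example.JsonParser"));
    }
}
```

With the default order the conflicting class resolves to the Hadoop copy. With the user entries first it resolves to the copy in the job JAR.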

Launching a Job

We unset the HADOOP_CLASSPATH environment variable because we don’t have any third-party dependencies for this job. If it were left set to target/classes/ (from earlier in the chapter), Hadoop wouldn’t be able to find the job JAR; it would load the MaxTemperatureDriver class from target/classes rather than the JAR, and the job would fail.

unset HADOOP_CLASSPATH
hadoop jar hadoop-examples.jar v2.MaxTemperatureDriver \
    -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

The MapReduce Web UI

Retrieving the Results

Debugging a Job

Hadoop Logs

Remote Debugging

Tuning a Job


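MapReduce Workflows

Problem: how to turn a data processing problem into the MapReduce model.

Decomposing a Problem into MapReduce Jobs

ChainReducer

Hadoop's ChainMapper library class can chain several mappers into a single mapper, and ChainReducer lets a reducer be followed by a further chain of mappers, all within a single MapReduce job.

JobControl

Problem: when a workflow contains more than one job, how do you make the jobs run in order?
The main consideration is whether you have a linear chain of jobs or a more complex directed acyclic graph (DAG) of jobs.
1. For a linear chain, the simplest approach is to run each job one after another:

JobClient.runJob(conf1);
JobClient.runJob(conf2);

2. For anything more complex, use the JobControl class (which also works for linear chains). A JobControl instance represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. If a job fails, the jobs that depend on it are not run.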
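
The dependency rule can be sketched in plain Java. This is a toy model of running jobs in dependency order and skipping the dependents of a failed job. It is not Hadoop's JobControl, which runs in its own thread and tracks real job state. JobGraphSketch, addJob, and the job names are hypothetical, and the sketch assumes each job is added after the jobs it depends on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Hypothetical model of dependency-ordered job execution; not Hadoop's JobControl.
final class JobGraphSketch {
    private final Map<String, BooleanSupplier> jobs = new LinkedHashMap<>();
    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    // Register a job whose body returns true on success. Add dependencies before dependents.
    void addJob(String name, BooleanSupplier body, String... dependencies) {
        jobs.put(name, body);
        dependsOn.put(name, Arrays.asList(dependencies));
    }

    // Run jobs in insertion order, skipping any job with a dependency that did not succeed.
    // Returns the names of the jobs that completed successfully.
    List<String> run() {
        Set<String> succeeded = new HashSet<>();
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> job : jobs.entrySet()) {
            boolean ready = succeeded.containsAll(dependsOn.get(job.getKey()));
            if (ready && job.getValue().getAsBoolean()) {
                succeeded.add(job.getKey());
                completed.add(job.getKey());
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        JobGraphSketch graph = new JobGraphSketch();
        graph.addJob("extract", () -> true);
        graph.addJob("join", () -> false, "extract");   // simulated failure
        graph.addJob("report", () -> true, "join");     // skipped: depends on the failed job
        graph.addJob("archive", () -> true, "extract"); // still runs
        System.out.println(graph.run());
    }
}
```

Note that archive still runs even though join failed, because it depends only on extract. That is the difference between a DAG and a plain linear chain.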

Apache Oozie

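Apache Oozie is a system for running workflows of dependent jobs. It consists of two main parts:
- A workflow engine, which stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on)
- A coordinator engine, which runs workflow jobs based on predefined schedules and data availability
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as a service in the cluster, and clients submit workflow definitions to it for immediate or later execution.
In Oozie, a workflow is a DAG of action nodes and control-flow nodes:
- An action node performs a workflow task, such as moving files in HDFS, or running a MapReduce, Streaming, Pig, or Hive job
- A control-flow node governs the workflow execution between actions by allowing such constructs as conditional logic (so different execution branches may be followed depending on the result of an earlier action node) or parallel execution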
