Hive on Spark requires a Spark build that does not bundle Hive. The build command is below; Maven must be installed, and the Hadoop version in the command should be adjusted to match your environment.
```
# Spark 2.0.0 and later
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.6,parquet-provided"
```
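Once the build completes, it is worth confirming that no Hive jars were bundled. A minimal check, assuming Spark 2.0.0 so the archive is named spark-2.0.0-bin-hadoop2-without-hive.tgz:

```
# No output from grep means no Hive jars were packaged into the distribution
tar -tzf spark-2.0.0-bin-hadoop2-without-hive.tgz | grep -i hive
```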
Add the following environment variables to /etc/profile:
```
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```
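Then reload the profile and verify that both variables resolve:

```
source /etc/profile
echo $HADOOP_CONF_DIR   # should print the expanded $HADOOP_HOME/etc/hadoop path
echo $YARN_CONF_DIR
```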
Add the following configuration to hive-site.xml, and upload the jar files from Spark's jars directory to the corresponding HDFS directory:
```
# Create the HDFS directory
hadoop fs -mkdir /spark
# Upload the /application/spark/jars folder to /spark on HDFS
hadoop fs -put /application/spark/jars/ /spark/
```
```
<!-- For Hive 2.2.0 and later, this property must be set in hive-site.xml -->
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://xxxx:9000/spark/jars/*.jar</value>
</property>
```
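A quick way to confirm the upload matches what spark.yarn.jars points at:

```
# The listing should show the Spark jars referenced by spark.yarn.jars
hadoop fs -ls /spark/jars | head
```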
Add the following to spark-env.sh. Make sure the file contains no Spark standalone cluster settings, which would cause Hive on Spark to fail:
```
# Note: $(hadoop classpath) requires the hadoop command to be executable on the PATH;
# it can be replaced with an absolute path, e.g. $(/application/hadoop-2.6.4/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

> Add the following to spark-defaults.conf; this file also needs to be placed in $HIVE_HOME/conf:

```
spark.master yarn
spark.submit.deployMode client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://dashuju174:9000/spark/logs
spark.driver.memory 512m
spark.driver.cores 1
spark.executor.memory 512m
spark.executor.cores 1
spark.executor.instances 2
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.jars hdfs://dashuju174:9000/spark/jars/*.jar
```
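With the files in place, a quick end-to-end test from the Hive CLI. Setting hive.execution.engine=spark is what routes the query to Spark; your_table is a placeholder:

```
hive -e "set hive.execution.engine=spark; select count(*) from your_table;"
```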
<div class="md-section-divider"></div>#### [yarn调度模式配置][4]> 官方建议修改yarn的调度模式为FAIR,其他调度模式其实也可以使用> 修改yarn-site.xml<div class="md-section-divider"></div>
```
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```
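To confirm the scheduler change after restarting the ResourceManager, the RM REST API can be queried. A sketch assuming the ResourceManager runs on dashuju174 with the default web port 8088:

```
# "fairscheduler" should appear in the response once the FairScheduler is active
curl -s http://dashuju174:8088/ws/v1/cluster/scheduler | grep -io fairscheduler | head -1
```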
> When a Spark job starts, it requests memory and CPU from YARN according to the settings in spark-defaults.conf; for the formula, see [Spark 2.2 memory usage calculation][5].
> Taking the configuration above as an example:
> yarn.scheduler.minimum-allocation-mb=512M; spark.yarn.driver.memoryOverhead=384m (its minimum); spark.yarn.executor.memoryOverhead=384m (its minimum).
> The driver actually occupies spark.driver.memory + max(spark.yarn.driver.memoryOverhead, 0.1 * spark.driver.memory) = 512m + max(384m, 0.1 * 512m) = 896m, which YARN normalizes up to 1G. Executor memory is calculated the same way and is then multiplied by spark.executor.instances.
> The actual memory usage can be inspected through the ApplicationMaster UI; start the proxy server with: yarn-daemon.sh start proxyserver

#### Workaround when yarn.scheduler.minimum-allocation-mb is set too small
> [spark ERROR YarnClientSchedulerBackend: Yarn application has already exited][6]

### Exceptions
> [Common exceptions from the official documentation][7]

#### 1. java.lang.NoClassDefFoundError: scala/collection/Iterable
> Caused by Hive not loading Spark's jars.
> Modify the $HIVE_HOME/bin/hive script as follows:
```
# Append Spark's jars to Hive's CLASSPATH
# (Spark 2.x keeps its jars under $SPARK_HOME/jars; adjust the path to your layout)
for f in ${SPARK_HOME}/jars/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
```
When HBase is integrated, add the HBase jars to hive-site.xml; spark.driver.extraClassPath and spark.executor.extraClassPath take the same value:
```
<property>
  <name>spark.driver.extraClassPath</name>
  <value>$SPARK_HOME/lib/mysql-connector-java-5.1.34.jar:$SPARK_HOME/lib/hbase-annotations-1.1.4.jar:$SPARK_HOME/lib/hbase-client-1.1.4.jar:$SPARK_HOME/lib/hbase-common-1.1.4.jar:$SPARK_HOME/lib/hbase-hadoop2-compat-1.1.4.jar:$SPARK_HOME/lib/hbase-hadoop-compat-1.1.4.jar:$SPARK_HOME/lib/hbase-protocol-1.1.4.jar:$SPARK_HOME/lib/hbase-server-1.1.4.jar:$SPARK_HOME/lib/hive-hbase-handler-2.3.2.jar:$SPARK_HOME/lib/htrace-core-3.1.0-incubating.jar</value>
</property>
<property>
  <name>spark.executor.extraClassPath</name>
  <value>$SPARK_HOME/lib/mysql-connector-java-5.1.34.jar:$SPARK_HOME/lib/hbase-annotations-1.1.4.jar:$SPARK_HOME/lib/hbase-client-1.1.4.jar:$SPARK_HOME/lib/hbase-common-1.1.4.jar:$SPARK_HOME/lib/hbase-hadoop2-compat-1.1.4.jar:$SPARK_HOME/lib/hbase-hadoop-compat-1.1.4.jar:$SPARK_HOME/lib/hbase-protocol-1.1.4.jar:$SPARK_HOME/lib/hbase-server-1.1.4.jar:$SPARK_HOME/lib/hive-hbase-handler-2.3.2.jar:$SPARK_HOME/lib/htrace-core-3.1.0-incubating.jar</value>
</property>
```
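Typing those colon-separated lists by hand is error-prone. A minimal sketch that builds the value from a directory listing, assuming the jars live under $SPARK_HOME/lib as above:

```
# Join every jar under $SPARK_HOME/lib with ':' and paste the output into the <value> elements
# (this lists all jars in the directory, so trim any you do not need)
ls $SPARK_HOME/lib/*.jar | paste -sd: -
```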
In fact, before this integration Hive was not missing any jars when HBase alone was integrated; the exception appeared only after hive-hbase-handler-2.3.2.jar was added to spark.executor.extraClassPath and spark.driver.extraClassPath. It can be resolved by adding the jar to hive.aux.jars.path:
```
<property>
  <name>hive.aux.jars.path</name>
  <value>/home/hadoop/application/hive/lib/hive-hbase-handler-2.3.2.jar</value>
</property>
```
Alternatively, run the following command in the Hive CLI:
```
add jar /home/hadoop/application/hive/lib/hive-hbase-handler-2.3.2.jar
```
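Note that add jar only affects the current session. A quick non-interactive check that the jar registers, using the same path as above:

```
hive -e "add jar /home/hadoop/application/hive/lib/hive-hbase-handler-2.3.2.jar; list jars;"
```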
When the NameNode runs in HA mode, HDFS URIs must use the value of dfs.nameservices from hadoop/etc/hadoop/hdfs-site.xml, with no port, e.g. hdfs://hadoop-cluster/xxx.
Check that spark.yarn.jars is configured correctly in both hive-site.xml and spark-defaults.conf.
Check whether the jars under hdfs://xxx/spark/jars include the HBASE_HOME/lib/* jars required on the classpath.
After changing the configuration or replacing the jars on HDFS, restart the Hive service before testing again.
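A minimal sketch for these checks, assuming the hadoop-cluster nameservice and the config paths used earlier:

```
# The HA nameservice URI should resolve without a port
hadoop fs -ls hdfs://hadoop-cluster/spark/jars | head
# spark.yarn.jars should point at the same location in both files
grep -A 2 "spark.yarn.jars" $HIVE_HOME/conf/hive-site.xml
grep "spark.yarn.jars" $HIVE_HOME/conf/spark-defaults.conf
```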
Caused by a permissions problem; modify hive-site.xml:
```
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>
```
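With doAs disabled, queries run as the HiveServer2 process user instead of the connecting user. One way to observe this, assuming a default HiveServer2 on localhost:10000 (the connection URL and user name are placeholders):

```
beeline -u jdbc:hive2://localhost:10000 -n hadoop -e "select current_user();"
```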