@Arslan6and6 2016-10-10T06:58:09Z

Spark: Features, Source Compilation, Installation, Deployment and Testing CORE [殷杰]

Spark

--- Spark Features, Source Compilation, and Installation/Deployment Testing

Assignment description
Spark, an in-memory big-data computing framework, is popular with Internet companies and many other industries thanks to its distinctive data-processing capabilities and data-analytics stack. Beginners should pay attention to the following points to build a solid foundation for deeper use later:
1) Spark features and advantages, especially in comparison with MapReduce
2) Compiling Spark from source against different Hadoop versions
3) Basic use of the interactive tool spark-shell, and a first understanding of what RDDs do
4) Understanding the Spark cluster, and how to submit and run a Spark Application
5) Installing, deploying, starting and testing Spark Standalone; writing a WordCount program

1) Spark features and advantages, especially in comparison with MapReduce

2) Compiling Spark from source against different Hadoop versions

1. Download Spark from the official site:
http://spark.apache.org/downloads.html
Here we choose version 1.6.1.
2. Build according to the official documentation.
3. Configure the Maven and JDK versions required by the official documentation.
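The official build guide for Spark 1.6 also asks Maven to be given more memory than its defaults; the setting recommended there is roughly:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"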

4. Package the build with make-distribution.sh:
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
Parameter by parameter:
--name custom-spark: names the resulting distribution; usually not specified
--tgz: package the built files into a tarball
-Phadoop-2.4: select the Hadoop build profile
-Phive and -Phive-thriftserver: enable Hive support and the Hive Thrift (JDBC) server
-Pyarn: enable YARN support
The parameters in the command above are not complete for our build.
For a CDH build, the exact Hadoop version also has to be passed.
For this example, CDH 5.3.6 (Hadoop 2.5.0), the parameter is -Dhadoop.version=2.5.0-cdh5.3.6 (an MRv1 cluster would use -Dhadoop.version=2.5.0-mr1-cdh5.3.6 instead).
This property is then combined with the profiles above, as in the full commands below.

To build against Apache Hadoop:
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0 -Pyarn -Phive -Phive-thriftserver
or, equivalently, compile directly with Maven:
mvn clean package -DskipTests -Phadoop-2.4 -Dhadoop.version=2.5.0 -Pyarn -Phive -Phive-thriftserver

To build against CDH Hadoop:
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive -Phive-thriftserver

Edit the make-distribution.sh file and hard-code the following variables (the script normally detects them by calling Maven, which is slow):

VERSION=1.6.1
SCALA_VERSION=2.10.4
SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
SPARK_HIVE=1
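For orientation, these are the variables that make-distribution.sh would otherwise compute near the top of the script by invoking Maven. A rough sketch of the edit (the exact original lines may differ slightly in your copy):

# comment out the slow Maven-based detection, e.g.
# VERSION=$("$MVN" help:evaluate -Dexpression=project.version 2>/dev/null | grep -v "INFO" | tail -n 1)
# (and the corresponding lines for SCALA_VERSION, SPARK_HADOOP_VERSION and SPARK_HIVE)
# then hard-code the values instead:
VERSION=1.6.1
SCALA_VERSION=2.10.4
SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
SPARK_HIVE=1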

If building against a CDH version of Hadoop, remove the following <mirror> block from the $MAVEN_HOME/conf/settings.xml file (with mirrorOf set to *, every request is redirected to the oschina mirror, which may not serve the CDH artifacts):

<mirror>
  <id>nexus-osc</id>
  <mirrorOf>*</mirrorOf>
  <name>Nexus osc</name>
  <url>http://maven.oschina.net/content/groups/public/</url>
</mirror>

Configure DNS so the build machine can resolve the external repositories:
# vi /etc/resolv.conf
Content:
nameserver 8.8.8.8
nameserver 8.8.4.4

Go to $SPARK_HOME and run make-distribution.sh with the combined parameters:

./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive -Phive-thriftserver
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 6.923 s]
[INFO] Spark Project Test Tags ............................ SUCCESS [ 4.033 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 15.463 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 11.982 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 6.390 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 8.195 s]
[INFO] Spark Project Core ................................. SUCCESS [01:56 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 5.070 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 13.118 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 36.047 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [ 49.638 s]
[INFO] Spark Project SQL .................................. SUCCESS [01:08 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:07 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 2.873 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 44.103 s]
[INFO] Spark Project Docker Integration Tests ............. SUCCESS [ 2.803 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 8.310 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 8.097 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 12.728 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 8.788 s]
[INFO] Spark Project Assembly ............................. SUCCESS [02:02 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [ 9.430 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 11.433 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 18.304 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 4.526 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 28.436 s]
[INFO] Spark Project External MQTT Assembly ............... SUCCESS [ 4.794 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 8.104 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [ 11.387 s]
[INFO] Spark Project Examples ............................. SUCCESS [06:02 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 6.001 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:06 min
[INFO] Finished at: 2016-07-19T18:25:49+08:00
[INFO] Final Memory: 468M/1649M
[INFO] ------------------------------------------------------------------------
+ rm -rf /opt/modules/spark-1.6.1/dist
+ mkdir -p /opt/modules/spark-1.6.1/dist/lib
+ echo 'Spark 1.6.1 built for Hadoop 2.5.0-cdh5.3.6'
+ echo 'Build flags: -Phadoop-2.4' -Dhadoop.version=2.5.0-cdh5.3.6 -Pyarn -Phive -Phive-thriftserver
+ cp /opt/modules/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.1-hadoop2.5.0-cdh5.3.6.jar /ob/
+ cp /opt/modules/spark-1.6.1/examples/target/scala-2.10/spark-examples-1.6.1-hadoop2.5.0-cdh5.3.6.jar /ob/
+ cp /opt/modules/spark-1.6.1/network/yarn/target/scala-2.10/spark-1.6.1-yarn-shuffle.jar /opt/modules/sp
+ mkdir -p /opt/modules/spark-1.6.1/dist/examples/src/main
+ cp -r /opt/modules/spark-1.6.1/examples/src/main /opt/modules/spark-1.6.1/dist/examples/src/
+ '[' 1 == 1 ']'
+ cp /opt/modules/spark-1.6.1/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar /opt/modules/spark-1.6.1/lib-3.2.10.jar /opt/modules/spark-1.6.1/lib_managed/jars/datanucleus-rdbms-3.2.9.jar /opt/modules/spark-1.6.
+ cp /opt/modules/spark-1.6.1/LICENSE /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/licenses /opt/modules/spark-1.6.1/dist
+ cp /opt/modules/spark-1.6.1/NOTICE /opt/modules/spark-1.6.1/dist
+ '[' -e /opt/modules/spark-1.6.1/CHANGES.txt ']'
+ cp /opt/modules/spark-1.6.1/CHANGES.txt /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/data /opt/modules/spark-1.6.1/dist
+ mkdir /opt/modules/spark-1.6.1/dist/conf
+ cp /opt/modules/spark-1.6.1/conf/docker.properties.template /opt/modules/spark-1.6.1/conf/fairschedulerrk-1.6.1/conf/log4j.properties.template /opt/modules/spark-1.6.1/conf/metrics.properties.template /opt/motemplate /opt/modules/spark-1.6.1/conf/spark-defaults.conf.template /opt/modules/spark-1.6.1/conf/spark-eark-1.6.1/dist/conf
+ cp /opt/modules/spark-1.6.1/README.md /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/bin /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/python /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/sbin /opt/modules/spark-1.6.1/dist
+ cp -r /opt/modules/spark-1.6.1/ec2 /opt/modules/spark-1.6.1/dist
+ '[' -d /opt/modules/spark-1.6.1/R/lib/SparkR ']'
+ '[' false == true ']'
+ '[' true == true ']'
+ TARDIR_NAME=spark-1.6.1-bin-2.5.0-cdh5.3.6
+ TARDIR=/opt/modules/spark-1.6.1/spark-1.6.1-bin-2.5.0-cdh5.3.6
+ rm -rf /opt/modules/spark-1.6.1/spark-1.6.1-bin-2.5.0-cdh5.3.6
+ cp -r /opt/modules/spark-1.6.1/dist /opt/modules/spark-1.6.1/spark-1.6.1-bin-2.5.0-cdh5.3.6
+ tar czf spark-1.6.1-bin-2.5.0-cdh5.3.6.tgz -C /opt/modules/spark-1.6.1 spark-1.6.1-bin-2.5.0-cdh5.3.6
+ rm -rf /opt/modules/spark-1.6.1/spark-1.6.1-bin-2.5.0-cdh5.3.6
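
When the script finishes, the packaged distribution spark-1.6.1-bin-2.5.0-cdh5.3.6.tgz sits at the top of the source tree (see the tar czf line above). A typical next step, with the target directory here chosen only for illustration:

tar -zxf spark-1.6.1-bin-2.5.0-cdh5.3.6.tgz -C /opt/modules/
cd /opt/modules/spark-1.6.1-bin-2.5.0-cdh5.3.6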

4) Understanding the Spark cluster; how to submit and run a Spark Application

Standalone is a cluster framework similar to YARN.
It is distributed:
Master node: Master (the counterpart of YARN's ResourceManager)
Slave nodes: Workers (the counterpart of the NodeManagers)
As long as a machine has enough resources, it can run several Workers.
start-slaves.sh
starts all of the slave nodes, that is, the Workers.
Note: the machine that runs this command must have passwordless SSH login configured to the other machines, otherwise startup runs into problems such as password prompts; a minimal key setup is sketched below.
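A minimal passwordless-SSH setup for the single-node deployment used here (the beifeng user and hostname are taken from the session later in this section; adjust for your environment):

ssh-keygen -t rsa
ssh-copy-id beifeng@hadoop-senior.ibeifeng.com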
Configure Standalone mode: add the following to $SPARK_HOME/conf/spark-env.sh

JAVA_HOME=/usr/java/jdk1.7.0_67
SCALA_HOME=/opt/modules/scala-2.10.4
HADOOP_CONF_DIR=/opt/modules/hadoop-2.5.0-cdh5.3.6/etc/hadoop
SPARK_MASTER_IP=hadoop-senior.ibeifeng.com
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=1

In $SPARK_HOME/conf/slaves, change localhost to hadoop-senior.ibeifeng.com

$HADOOP_HOME $ sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME $ sbin/hadoop-daemon.sh start datanode
$SPARK_HOME $ sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/modules/spark-1.6.1/logs/spark-beifeng-org.apache.spark.deploy.master.Master-1-hadoop-senior.ibeifeng.com.out
$SPARK_HOME $ sbin/start-slaves.sh
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /opt/modules/spark-1.6.1/logs/spark-beifeng-org.apache.spark.deploy.worker.Worker-1-hadoop-senior.ibeifeng.com.out
$ jps
4115 NameNode
4527 Jps
4308 Master
4203 DataNode
4456 Worker

View the Master web UI at http://hadoop-senior.ibeifeng.com:8080/ (the port set by SPARK_MASTER_WEBUI_PORT=8080).
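With the cluster up, an application can be submitted to the standalone master with spark-submit. A sketch using the bundled SparkPi example, assuming it is run from the extracted distribution directory, where the CDH build places the examples jar under lib/ (adjust the path if yours differs):

bin/spark-submit --master spark://hadoop-senior.ibeifeng.com:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-1.6.1-hadoop2.5.0-cdh5.3.6.jar 100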

[beifeng@hadoop-senior spark-1.6.1]$ bin/spark-shell --help
Usage: ./bin/spark-shell [options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
SPARK_MASTER_PORT=7077
spark://hadoop-senior.ibeifeng.com:7077
$ bin/spark-shell --master spark://hadoop-senior.ibeifeng.com:7077
scala> val rdd = sc.textFile("hdfs://hadoop-senior.ibeifeng.com:8020/input")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://hadoop-senior.ibeifeng.com:8020/input MapPartitionsRDD[1] at textFile at <console>:27
scala> rdd.count
res0: Long = 100000
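
The same shell session can go one step further towards the WordCount program from assignment point 5; a minimal sketch, assuming the files under /input contain space-separated text:

scala> val wordCounts = rdd.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> wordCounts.take(10).foreach(println)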