@Arslan6and6 2016-05-25T16:29:02.000000Z

[Assignment 18] Yin Jie

Chapter 8: Big Data Collaboration Frameworks - Sqoop

--- Hive Log Analysis Case Study

Assignment description:
 Requirements
 Based on the log file and its reference fields, compute the pv and uv for each hour of each day.
 Implementation points
 Create a Hive table whose column delimiter matches that of the source file.
 Load the data into the Hive table.
 Write a Hive SQL query for the statistics and land the result in a second Hive table.
 Export the result from the second Hive table to a MySQL table.

Step 1: Create the Hive database and partitioned table

    hive> drop database if exists db_track ;
    OK
    Time taken: 0.115 seconds
    hive> create database db_track ;
    OK
    Time taken: 0.049 seconds
    hive> create table track_log(
        > id string,
        > url string,
        > referer string,
        > keyword string,
        > type string,
        > guid string,
        > pageId string,
        > moduleId string,
        > linkId string,
        > attachedInfo string,
        > sessionId string,
        > trackerU string,
        > trackerType string,
        > ip string,
        > trackerSrc string,
        > cookie string,
        > orderCode string,
        > trackTime string,
        > endUserId string,
        > firstLink string,
        > sessionViewNo string,
        > productId string,
        > curMerchantId string,
        > provinceId string,
        > cityId string,
        > fee string,
        > edmActivity string,
        > edmEmail string,
        > edmJobId string,
        > ieVersion string,
        > platform string,
        > internalKeyword string,
        > resultSum string,
        > currentPage string,
        > linkPosition string,
        > buttonPosition string
        > )
        > partitioned by (date string,hour string)
        > row format delimited fields terminated by '\t' ;
    OK
    Time taken: 0.09 seconds
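
Not part of the original transcript, but a quick sanity check after creating the table is to look at its definition and its partition list with standard Hive commands. A minimal sketch:

    -- confirm the column list, the '\t' field delimiter and the (date, hour) partition keys
    desc formatted track_log ;
    -- list existing partitions (empty until data is loaded in step 2)
    show partitions track_log ;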

Step 2: Load the partition data from the local file system

    load data local inpath '/home/beifeng/2015082818' into table track_log partition(date='20150828',hour='18') ;
    load data local inpath '/home/beifeng/Desktop/source/2015082819' into table track_log partition(date='20150828',hour='19') ;
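
The two load statements create one partition per hour under the 20150828 date. As a quick verification, not shown in the original log, the partitions and their row counts can be checked along these lines:

    -- should list date=20150828/hour=18 and date=20150828/hour=19
    show partitions track_log ;
    -- row counts per loaded partition
    select count(*) from track_log where date='20150828' and hour='18' ;
    select count(*) from track_log where date='20150828' and hour='19' ;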

Step 3: Compute the hourly pv and uv for each day

    select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1464189390335_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0001/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-25 23:21:48,458 Stage-1 map = 0%, reduce = 0%
    2016-05-25 23:22:10,869 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 2.26 sec
    2016-05-25 23:22:13,329 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec
    2016-05-25 23:22:32,297 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.03 sec
    MapReduce Total cumulative CPU time: 6 seconds 30 msec
    Ended Job = job_1464189390335_0001
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.03 sec  HDFS Read: 76517763  HDFS Write: 48  SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 30 msec
    OK
    date        hour   _c2     _c3
    20150828    18     64972   23938
    20150828    19     61162   22330
    Time taken: 69.163 seconds, Fetched: 2 row(s)
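
Here pv (page views) is the number of tracked URL hits per hour, count(url), and uv (unique visitors) is the number of distinct guid values, count(distinct guid). On this data set a single reducer is enough, but count(distinct ...) can put heavy load on the reducer handling each group, so a common rewrite on larger data is to deduplicate guid in a subquery first. A hedged sketch of that equivalent form (assuming guid is never NULL; this rewrite is not part of the original assignment):

    select t.date, t.hour, sum(t.pv) pv, count(*) uv
    from (
        select date, hour, guid, count(url) pv
        from track_log
        where date = '20150828'
        group by date, hour, guid
    ) t
    group by t.date, t.hour ;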

Step 4: Write the statistics into a temporary Hive table

    hive (db_track)> create table track_log_tmp(
        > date string,
        > hour string,
        > pv string,
        > uv string
        > )
        > row format delimited fields terminated by '\t';
    OK
    Time taken: 1.293 seconds
    hive (db_track)> insert into table track_log_tmp select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1464189390335_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0002/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-25 23:38:40,279 Stage-1 map = 0%, reduce = 0%
    2016-05-25 23:39:05,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec
    2016-05-25 23:39:27,353 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.43 sec
    MapReduce Total cumulative CPU time: 6 seconds 430 msec
    Ended Job = job_1464189390335_0002
    Loading data to table db_track.track_log_tmp
    Table db_track.track_log_tmp stats: [numFiles=1, numRows=2, totalSize=48, rawDataSize=46]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.43 sec  HDFS Read: 76517763  HDFS Write: 126  SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 430 msec
    OK
    date    hour    _c2     _c3
    Time taken: 60.805 seconds
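
The trailing "date hour _c2 _c3" line is only the column header Hive echoes for the INSERT's SELECT; the result rows now live in track_log_tmp. A quick check, not shown in the original transcript, might look like this:

    -- should return the same two rows as the ad-hoc query in step 3:
    --   20150828  18  64972  23938
    --   20150828  19  61162  22330
    select * from track_log_tmp ;
    -- the table is unpartitioned and tab-delimited, so its warehouse directory
    -- (used as --export-dir in step 5) can be handed to Sqoop as-is
    dfs -ls /user/hive/warehouse/db_track.db/track_log_tmp ;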

Step 5: Export the temporary Hive table to a permanent MySQL table

    mysql> use db_0521;
    Database changed
    mysql> create table track_log(
        -> date varchar(20),
        -> hour varchar(10),
        -> pv varchar(50),
        -> uv varchar(50),
        -> primary key (date,hour)
        -> ) ;
    Query OK, 0 rows affected (0.10 sec)

    bin/sqoop export \
    > --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
    > --username root \
    > --password 123 \
    > --table track_log \
    > --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \
    > --input-fields-terminated-by '\t'
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    16/05/25 23:57:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
    16/05/25 23:57:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    16/05/25 23:57:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    16/05/25 23:57:38 INFO tool.CodeGenTool: Beginning code generation
    16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
    16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
    16/05/25 23:57:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6
    Note: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    16/05/25 23:57:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.jar
    16/05/25 23:57:42 INFO mapreduce.ExportJobBase: Beginning export of track_log
    16/05/25 23:57:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/05/25 23:57:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    16/05/25 23:57:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:8032
    16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/25 23:57:47 INFO mapreduce.JobSubmitter: number of splits:4
    16/05/25 23:57:47 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/25 23:57:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464189390335_0003
    16/05/25 23:57:48 INFO impl.YarnClientImpl: Submitted application application_1464189390335_0003
    16/05/25 23:57:48 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0003/
    16/05/25 23:57:48 INFO mapreduce.Job: Running job: job_1464189390335_0003
    16/05/25 23:58:05 INFO mapreduce.Job: Job job_1464189390335_0003 running in uber mode : false
    16/05/25 23:58:05 INFO mapreduce.Job: map 0% reduce 0%
    16/05/25 23:58:43 INFO mapreduce.Job: map 100% reduce 0%
    16/05/25 23:58:49 INFO mapreduce.Job: Job job_1464189390335_0003 completed successfully
    16/05/25 23:58:51 INFO mapreduce.Job: Counters: 30
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=525380
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=828
            HDFS: Number of bytes written=0
            HDFS: Number of read operations=16
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=0
        Job Counters
            Launched map tasks=4
            Data-local map tasks=4
            Total time spent by all maps in occupied slots (ms)=147642
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=147642
            Total vcore-seconds taken by all map tasks=147642
            Total megabyte-seconds taken by all map tasks=151185408
        Map-Reduce Framework
            Map input records=2
            Map output records=2
            Input split bytes=696
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=315
            CPU time spent (ms)=2930
            Physical memory (bytes) snapshot=352866304
            Virtual memory (bytes) snapshot=3355598848
            Total committed heap usage (bytes)=62914560
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=0
    16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Transferred 828 bytes in 67.7294 seconds (12.2251 bytes/sec)
    16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Exported 2 records.

    mysql> show tables;
    +-------------------+
    | Tables_in_db_0521 |
    +-------------------+
    | hdfs2mysql        |
    | my_user           |
    | track_log         |
    +-------------------+
    3 rows in set (0.05 sec)
    mysql> select * from track_log;
    +----------+------+-------+-------+
    | date     | hour | pv    | uv    |
    +----------+------+-------+-------+
    | 20150828 | 18   | 64972 | 23938 |
    | 20150828 | 19   | 61162 | 22330 |
    +----------+------+-------+-------+
    2 rows in set (0.02 sec)
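
Two closing notes on the export, both hedged suggestions rather than part of the original session. First, Sqoop itself warns that --password on the command line is insecure; -P (interactive prompt) or --password-file are the usual replacements. Second, because the MySQL table declares primary key (date, hour), re-running the plain export with the same rows would fail on duplicate keys; Sqoop's --update-key together with --update-mode allowinsert turns the export into an upsert. A sketch of such a re-runnable export:

    bin/sqoop export \
        --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
        --username root \
        -P \
        --table track_log \
        --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \
        --input-fields-terminated-by '\t' \
        --update-key date,hour \
        --update-mode allowinsert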