@Arslan6and6 · 2016-05-30

[Assignment 19] Yin Jie

Chapter 9: Advanced Hive Data Warehouse Case Studies

--- Hive Log Analysis, Case 2

Assignment description:

Building on [Log Analysis Case 1], extend the business requirements and perform further data analysis; see the course notes for the detailed requirements.
The requirements table is as follows:
[Image: requirements table (02.jpg)]

The fields of the track_log table are described below:
[Image: track_log field descriptions (image_1ak16jic7u0t19n21q4k1mve6le1k.png)]

Requirement: extract the fields of track_log according to the requirements table and compute the statistics.

Approach:
1. Map each field of the requirements table to a track_log field and write a session-info table description that records the correspondence between the required keys and track_log columns. For example:
[Image: field-mapping example (01.jpg)]
2. Create the empty session info table in Hive from that description.
3. Extract the required fields from track_log into the session info table. Per-session aggregates (pv, stay_time) and per-hit attributes (the landing URL of the session's first hit) cannot both come out of a single GROUP BY, so the extraction goes through two temporary tables: tmp1 holds the aggregates, tmp2 the raw hits, joined on each session's earliest trackTime.
4. Extract the session info data into the new requirements table and compute the statistics.
Implementation steps:
1. Create the session info table session_info

    hive (db_track)> create table session_info(
    > session_id string,
    > guid string,
    > trackerU string,
    > landing_url string,
    > landing_url_ref string,
    > user_id string,
    > pv string,
    > stay_time string,
    > min_trackTime string,
    > ip string,
    > provinceId string
    > )
    > partitioned by (date string)
    > row format delimited fields terminated by '\t' ;
    OK
    Time taken: 0.859 seconds

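Optionally, the new table's layout can be confirmed with standard Hive commands (not part of the original transcript):

    desc formatted session_info;    -- columns, the date partition key, the '\t' delimiter
    show partitions session_info;   -- lists loaded partitions (empty until step 4)
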
2. Create temporary table session_info_tmp1 with the per-session aggregates taken from track_log

    hive (db_track)> create table session_info_tmp1 as select
    > sessionId session_id,
    > max(guid) guid,
    > max(endUserId) user_id,
    > count(url) pv,
    > (max(unix_timestamp(trackTime)) - min(unix_timestamp(trackTime))) stay_time,
    > min(trackTime) min_trackTime ,
    > max(ip) ip,
    > max(provinceId) provinceId
    > from track_log where date='20150828'
    > group by sessionId ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1464600726756_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0001/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-30 17:34:36,735 Stage-1 map = 0%, reduce = 0%
    2016-05-30 17:34:56,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.07 sec
    2016-05-30 17:35:16,548 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.69 sec
    MapReduce Total cumulative CPU time: 2 seconds 690 msec
    Ended Job = job_1464600726756_0001
    Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/session_info_tmp1
    Table db_track.session_info_tmp1 stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.69 sec HDFS Read: 429 HDFS Write: 52 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 690 msec
    OK
    session_id guid user_id pv stay_time min_tracktime ip provinceid
    Time taken: 69.668 seconds
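
A note on the max() calls above: Hive requires every selected column outside the GROUP BY to sit inside an aggregate, and guid, endUserId, ip and provinceId are constant within a session, so max() simply picks that constant value. A hypothetical check that guid really never varies inside a session:

    -- an empty result confirms one guid per session
    select sessionId, count(distinct guid) guids
    from track_log
    where date='20150828'
    group by sessionId
    having count(distinct guid) > 1;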

3. Create temporary table session_info_tmp2, a per-hit projection (sessionId, trackTime, trackerU, url, referer) from which each session's landing page can be recovered

    hive (db_track)> create table session_info_tmp2 as select
    > sessionId session_id,
    > trackTime trackTime,
    > trackeru trackerU,
    > url landing_url,
    > referer landing_url_ref
    > from track_log where date='20150828' ;
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1464600726756_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0002/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2016-05-30 18:16:10,956 Stage-1 map = 0%, reduce = 0%
    2016-05-30 18:16:33,069 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec
    MapReduce Total cumulative CPU time: 910 msec
    Ended Job = job_1464600726756_0002
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/.hive-staging_hive_2016-05-30_18-15-55_162_8334205732777417774-1/-ext-10001
    Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/session_info_tmp2
    Table db_track.session_info_tmp2 stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Cumulative CPU: 0.91 sec HDFS Read: 429 HDFS Write: 52 SUCCESS
    Total MapReduce CPU Time Spent: 910 msec
    OK
    session_id tracktime trackeru landing_url landing_url_ref
    Time taken: 41.865 seconds
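
One caveat before the join in the next step: it assumes min(trackTime) matches exactly one row per session. If a session ever logged two hits at the identical earliest timestamp, that session would be duplicated. A hypothetical check, runnable once both temp tables exist:

    -- sessions that would yield more than one joined row (ideally none)
    select a.session_id, count(*) joined_rows
    from session_info_tmp1 a join session_info_tmp2 b
    on a.session_id = b.session_id and a.min_trackTime = b.trackTime
    group by a.session_id
    having count(*) > 1;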

4. Load the session info table by joining the two temp tables: min_trackTime = trackTime picks out each session's first hit, whose url and referer become the landing page fields

    hive (db_track)> insert overwrite table session_info partition (date='20150828')
    > select
    > a.session_id,
    > a.guid,
    > b.trackerU,
    > b.landing_url,
    > b.landing_url_ref,
    > a.user_id,
    > a.pv,
    > a.stay_time,
    > a.min_trackTime,
    > a.ip,
    > a.provinceId
    > from session_info_tmp1 a join session_info_tmp2 b
    > on a.session_id=b.session_id
    > and a.min_trackTime=b.trackTime ;
    Total jobs = 1
    16/05/30 18:26:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/05/30 18:26:09 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
    Execution log at: /tmp/beifeng/beifeng_20160530182626_42ba565a-123f-477d-b626-0c1dfe3c4b88.log
    2016-05-30 06:26:11 Starting to launch local task to process map join; maximum memory = 518979584
    2016-05-30 06:26:13 Dump the side-table into file: file:/tmp/beifeng/hive_2016-05-30_18-26-00_624_1665288294433606603-1/-local-10002/HashTable-Stage-4/MapJoin-mapfile00--.hashtable
    2016-05-30 06:26:13 Uploaded 1 File to: file:/tmp/beifeng/hive_2016-05-30_18-26-00_624_1665288294433606603-1/-local-10002/HashTable-Stage-4/MapJoin-mapfile00--.hashtable (260 bytes)
    2016-05-30 06:26:13 End of local task; Time Taken: 1.67 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1464600726756_0003, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0003/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0003
    Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
    2016-05-30 18:26:28,466 Stage-4 map = 0%, reduce = 0%
    2016-05-30 18:26:44,721 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 0.96 sec
    MapReduce Total cumulative CPU time: 960 msec
    Ended Job = job_1464600726756_0003
    Loading data to table db_track.session_info partition (date=20150828)
    Partition db_track.session_info{date=20150828} stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
    MapReduce Jobs Launched:
    Stage-Stage-4: Map: 1 Cumulative CPU: 0.96 sec HDFS Read: 251 HDFS Write: 61 SUCCESS
    Total MapReduce CPU Time Spent: 960 msec
    OK
    a.session_id a.guid b.trackeru b.landing_url b.landing_url_ref a.user_id a.pv a.stay_time a.min_tracktime a.ip a.provinceid
    Time taken: 48.044 seconds
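
As an aside, on Hive 0.11 or later the two temp tables and the join could in principle be collapsed into a single pass with window functions. A minimal, untested sketch (same column names as above; not part of the original assignment):

    insert overwrite table session_info partition (date='20150828')
    select session_id, guid, trackerU, landing_url, landing_url_ref,
           user_id, pv, stay_time, min_trackTime, ip, provinceId
    from (
      select
        sessionId session_id,
        guid,
        trackeru trackerU,
        url landing_url,
        referer landing_url_ref,
        endUserId user_id,
        -- per-session aggregates computed as window functions
        count(url) over (partition by sessionId) pv,
        (max(unix_timestamp(trackTime)) over (partition by sessionId)
         - min(unix_timestamp(trackTime)) over (partition by sessionId)) stay_time,
        min(trackTime) over (partition by sessionId) min_trackTime,
        ip,
        provinceId,
        -- rn = 1 marks the session's landing hit
        row_number() over (partition by sessionId order by trackTime) rn
      from track_log
      where date='20150828'
    ) t
    where rn = 1;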

5. Compute the following metrics from session_info and load them into the requirements table vistor_users_info:
date | UV | PV | logged-in users | guest users | average visit duration | second-jump rate | distinct IPs
(UV counts distinct guids; login_users vs. vistor_users splits sessions on whether user_id is empty; avg_stay_time averages the per-session stay_time in seconds; sec_ratio, the second-jump rate, is the share of sessions with pv >= 2.)

    hive (db_track)> create table vistor_users_info as
    > select
    > date,
    > count(distinct guid) UV,
    > sum(pv) PV,
    > count(case when user_id != '' then user_id else null end) login_users,
    > count(case when user_id = '' then user_id else null end) vistor_users,
    > avg(stay_time) avg_stay_time,
    > count(case when pv>=2 then session_id else null end)/count(session_id) sec_ratio,
    > count(distinct ip) ip
    > from session_info where date='20150828'
    > group by date ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1464600726756_0004, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0004/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0004
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-30 18:44:05,804 Stage-1 map = 0%, reduce = 0%
    2016-05-30 18:44:26,410 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
    2016-05-30 18:44:41,263 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.24 sec
    MapReduce Total cumulative CPU time: 3 seconds 240 msec
    Ended Job = job_1464600726756_0004
    Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/vistor_users_info
    Table db_track.vistor_users_info stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.24 sec HDFS Read: 260 HDFS Write: 52 SUCCESS
    Total MapReduce CPU Time Spent: 3 seconds 240 msec
    OK
    date uv pv login_users vistor_users avg_stay_time sec_ratio ip
    Time taken: 52.767 seconds

6. Query the requirements table to view the analysis results

    hive (db_track)> select * from vistor_users_info;
    OK
    vistor_users_info.date vistor_users_info.uv vistor_users_info.pv vistor_users_info.login_users vistor_users_info.vistor_users vistor_users_info.avg_stay_time vistor_users_info.sec_ratio vistor_users_info.ip
    20150828 38985 131668.0 18548 21902 750.7895179233622 0.5238813349814586 29668
    Time taken: 0.523 seconds, Fetched: 1 row(s)

Note: when the results table vistor_users_info is created with create table <name> as select <fields> from session_info;, the delimiter declared for session_info does not carry over. A plain CTAS falls back to Hive's default field delimiter '\001' (Ctrl-A), which renders as boxes when the file is viewed as text, and Sqoop exporting with a tab delimiter cannot parse it.
[Image: box-delimited output in text view (03.jpg)]
Fix: drop the CTAS-created vistor_users_info, create it separately with an explicit tab delimiter, and load it with insert into table. The Sqoop export to MySQL then works normally.

    hive (db_track)> create table vistor_users_info(
    > date string,
    > UV string,
    > PV string,
    > login_users string,
    > vistor_users string,
    > avg_stay_time string,
    > sec_ratio string,
    > ip string
    > )
    > row format delimited fields terminated by '\t' ;
    OK
    Time taken: 0.147 seconds
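
Alternatively, Hive's CTAS syntax lets the row format be declared inline, which would avoid the drop-and-recreate step entirely. A sketch of the same statistics query under that approach:

    create table vistor_users_info
    row format delimited fields terminated by '\t'
    as select
      date,
      count(distinct guid) UV,
      sum(pv) PV,
      count(case when user_id != '' then user_id else null end) login_users,
      count(case when user_id = '' then user_id else null end) vistor_users,
      avg(stay_time) avg_stay_time,
      count(case when pv>=2 then session_id else null end)/count(session_id) sec_ratio,
      count(distinct ip) ip
    from session_info where date='20150828'
    group by date;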

Load the session info data into vistor_users_info and compute the statistics:

    hive (db_track)> insert into table vistor_users_info select
    > date,
    > count(distinct guid) UV,
    > sum(pv) PV,
    > count(case when user_id != '' then user_id else null end) login_users,
    > count(case when user_id = '' then user_id else null end) vistor_users,
    > avg(stay_time) avg_stay_time,
    > count(case when pv>=2 then session_id else null end)/count(session_id) sec_ratio,
    > count(distinct ip) ip
    > from session_info where date='20150828'
    > group by date ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1464600726756_0030, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0030/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0030
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-30 23:21:18,203 Stage-1 map = 0%, reduce = 0%
    2016-05-30 23:21:46,752 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.6 sec
    2016-05-30 23:22:17,900 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.68 sec
    MapReduce Total cumulative CPU time: 7 seconds 680 msec
    Ended Job = job_1464600726756_0030
    Loading data to table db_track.vistor_users_info
    Table db_track.vistor_users_info stats: [numFiles=1, numRows=1, totalSize=79, rawDataSize=78]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.68 sec HDFS Read: 9121608 HDFS Write: 161 SUCCESS
    Total MapReduce CPU Time Spent: 7 seconds 680 msec
    OK
    date uv pv login_users vistor_users avg_stay_time sec_ratio ip
    Time taken: 73.532 seconds

Check the results in Hive:

    hive (db_track)> select * from vistor_users_info;
    OK
    vistor_users_info.date vistor_users_info.uv vistor_users_info.pv vistor_users_info.login_users vistor_users_info.vistor_users vistor_users_info.avg_stay_time vistor_users_info.sec_ratio vistor_users_info.ip
    20150828 38985 131668.0 18548 21902 750.7895179233622 0.5238813349814586 29668
    Time taken: 0.183 seconds, Fetched: 1 row(s)

Create the receiving table vistor_users_info in MySQL:

    create table vistor_users_info(
    date varchar(20),
    uv varchar(50),
    pv varchar(50),
    login_users varchar(50),
    vistor_users varchar(50),
    avg_stay_time varchar(50),
    sec_ratio varchar(50),
    ip varchar(50)
    );

Export the results to MySQL with Sqoop. (Note the warning in the log below: passing --password on the command line is insecure; -P, which prompts for the password, is the safer choice.)

    bin/sqoop export \
    > --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
    > --username root \
    > --password 123 \
    > --table vistor_users_info \
    > --export-dir /user/hive/warehouse/db_track.db/vistor_users_info \
    > --input-fields-terminated-by '\t'
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    16/05/30 23:24:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
    16/05/30 23:24:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    16/05/30 23:24:51 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    16/05/30 23:24:51 INFO tool.CodeGenTool: Beginning code generation
    16/05/30 23:24:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vistor_users_info` AS t LIMIT 1
    16/05/30 23:24:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vistor_users_info` AS t LIMIT 1
    16/05/30 23:24:51 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6
    Note: /tmp/sqoop-beifeng/compile/d59d0a6ed5ece911a523a321474bffbb/vistor_users_info.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    16/05/30 23:24:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/d59d0a6ed5ece911a523a321474bffbb/vistor_users_info.jar
    16/05/30 23:24:55 INFO mapreduce.ExportJobBase: Beginning export of vistor_users_info
    16/05/30 23:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/05/30 23:24:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
    16/05/30 23:24:57 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
    16/05/30 23:24:57 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/30 23:24:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    16/05/30 23:24:57 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:8032
    16/05/30 23:25:02 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/30 23:25:02 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/30 23:25:03 INFO mapreduce.JobSubmitter: number of splits:4
    16/05/30 23:25:03 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/30 23:25:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464600726756_0031
    16/05/30 23:25:04 INFO impl.YarnClientImpl: Submitted application application_1464600726756_0031
    16/05/30 23:25:04 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0031/
    16/05/30 23:25:04 INFO mapreduce.Job: Running job: job_1464600726756_0031
    16/05/30 23:25:19 INFO mapreduce.Job: Job job_1464600726756_0031 running in uber mode : false
    16/05/30 23:25:19 INFO mapreduce.Job: map 0% reduce 0%
    16/05/30 23:26:04 INFO mapreduce.Job: map 100% reduce 0%
    16/05/30 23:26:12 INFO mapreduce.Job: Job job_1464600726756_0031 completed successfully
    16/05/30 23:26:15 INFO mapreduce.Job: Counters: 30
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=525728
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=1054
            HDFS: Number of bytes written=0
            HDFS: Number of read operations=19
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=0
        Job Counters
            Launched map tasks=4
            Data-local map tasks=4
            Total time spent by all maps in occupied slots (ms)=175923
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=175923
            Total vcore-seconds taken by all map tasks=175923
            Total megabyte-seconds taken by all map tasks=180145152
        Map-Reduce Framework
            Map input records=1
            Map output records=1
            Input split bytes=826
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=272
            CPU time spent (ms)=2810
            Physical memory (bytes) snapshot=358006784
            Virtual memory (bytes) snapshot=3356164096
            Total committed heap usage (bytes)=62914560
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=0
    16/05/30 23:26:15 INFO mapreduce.ExportJobBase: Transferred 1.0293 KB in 78.1854 seconds (13.4808 bytes/sec)
    16/05/30 23:26:15 INFO mapreduce.ExportJobBase: Exported 1 records.
Check the statistics received by the vistor_users_info table in MySQL.
[Image: MySQL query result (05.jpg)]
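
For completeness, the exported row can be verified from the MySQL client; it should match the Hive output above:

    -- run in MySQL, database db_0521
    select * from vistor_users_info;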
