@Arslan6and6
2016-05-30T16:26:40.000000Z
Chapter 9: Advanced Hive Data Warehouse Cases
--- Hive Log Analysis Case 2
Building on [Log Analysis Case 1], this case extends the business requirements for deeper data analysis; see the course notes for the detailed requirements.
The requirements table and the field descriptions of the track_log table are given in the course materials.
Requirement: extract each field called for by the requirements table from track_log and compute statistics over it.
Approach:
1. Write a session-info table description that maps each requirements-table field to its track_log counterpart, making the correspondence between the requirement field names and track_log fields explicit (see the mapping sketch after this list).
2. Create the empty session-info table in Hive from that description.
3. Extract the required fields from track_log into the session-info table.
4. Extract the session-info data into a newly created requirements table and compute the statistics.
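As an illustrative sketch (reconstructed from the queries in the steps below, since the original example table is not reproduced here), the field mapping looks like this:

session_id      <- sessionId
guid            <- max(guid) per session
trackerU        <- trackerU of the session's first record
landing_url     <- url of the session's first record
landing_url_ref <- referer of the session's first record
user_id         <- max(endUserId) per session
pv              <- count(url) per session
stay_time       <- max(unix_timestamp(trackTime)) - min(unix_timestamp(trackTime)), in seconds
min_trackTime   <- min(trackTime) per session
ip              <- max(ip) per session
provinceId      <- max(provinceId) per session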
Implementation steps:
1. Create the session information table session_info
hive (db_track)> create table session_info(
> session_id string,
> guid string,
> trackerU string,
> landing_url string,
> landing_url_ref string,
> user_id string,
> pv string,
> stay_time string,
> min_trackTime string,
> ip string,
> provinceId string
> )
> partitioned by (date string)
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.859 seconds
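Optionally, you can verify the table definition before loading any data (a standard Hive command; output omitted here):

hive (db_track)> desc formatted session_info;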
2. Create the temporary table session_info_tmp1 and populate it with the per-session fields from track_log. The query groups by sessionId, so each row is one session: pv counts the page views, and stay_time is the number of seconds between the session's first and last trackTime.
hive (db_track)> create table session_info_tmp1 as select
> sessionId session_id,
> max(guid) guid,
> max(endUserId) user_id,
> count(url) pv,
> (max(unix_timestamp(trackTime)) - min(unix_timestamp(trackTime))) stay_time,
> min(trackTime) min_trackTime ,
> max(ip) ip,
> max(provinceId) provinceId
> from track_log where date='20150828'
> group by sessionId ;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1464600726756_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0001/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-30 17:34:36,735 Stage-1 map = 0%, reduce = 0%
2016-05-30 17:34:56,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.07 sec
2016-05-30 17:35:16,548 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1464600726756_0001
Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/session_info_tmp1
Table db_track.session_info_tmp1 stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.69 sec HDFS Read: 429 HDFS Write: 52 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
session_id guid user_id pv stay_time min_tracktime ip provinceid
Time taken: 69.668 seconds
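An optional sanity check is to preview a few of the aggregated sessions:

hive (db_track)> select * from session_info_tmp1 limit 3;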
3. Create the temporary table session_info_tmp2. This table keeps one row per page view, carrying the trackerU, url, and referer values that will later identify each session's landing page.
hive (db_track)> create table session_info_tmp2 as select
> sessionId session_id,
> trackTime trackTime,
> trackeru trackerU,
> url landing_url,
> referer landing_url_ref
> from track_log where date='20150828' ;
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1464600726756_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0002/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-05-30 18:16:10,956 Stage-1 map = 0%, reduce = 0%
2016-05-30 18:16:33,069 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.91 sec
MapReduce Total cumulative CPU time: 910 msec
Ended Job = job_1464600726756_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/.hive-staging_hive_2016-05-30_18-15-55_162_8334205732777417774-1/-ext-10001
Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/session_info_tmp2
Table db_track.session_info_tmp2 stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.91 sec HDFS Read: 429 HDFS Write: 52 SUCCESS
Total MapReduce CPU Time Spent: 910 msec
OK
session_id tracktime trackeru landing_url landing_url_ref
Time taken: 41.865 seconds
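Note the different granularity: session_info_tmp1 has one row per session, while session_info_tmp2 has one row per page view. A quick optional check:

hive (db_track)> select count(*) from session_info_tmp1;
hive (db_track)> select count(*) from session_info_tmp2;

The second count should be much larger, since each session usually contains several page views.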
4. Load the session_info table. Joining tmp1 and tmp2 on session_id together with min_trackTime = trackTime matches each session to its first page view, so trackerU, landing_url, and landing_url_ref are taken from the landing record.
hive (db_track)> insert overwrite table session_info partition (date='20150828')
> select
> a.session_id,
> a.guid,
> b.trackerU,
> b.landing_url,
> b.landing_url_ref,
> a.user_id,
> a.pv,
> a.stay_time,
> a.min_trackTime,
> a.ip,
> a.provinceId
> from session_info_tmp1 a join session_info_tmp2 b
> on a.session_id=b.session_id
> and a.min_trackTime=b.trackTime ;
Total jobs = 1
16/05/30 18:26:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/30 18:26:09 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead
Execution log at: /tmp/beifeng/beifeng_20160530182626_42ba565a-123f-477d-b626-0c1dfe3c4b88.log
2016-05-30 06:26:11 Starting to launch local task to process map join; maximum memory = 518979584
2016-05-30 06:26:13 Dump the side-table into file: file:/tmp/beifeng/hive_2016-05-30_18-26-00_624_1665288294433606603-1/-local-10002/HashTable-Stage-4/MapJoin-mapfile00--.hashtable
2016-05-30 06:26:13 Uploaded 1 File to: file:/tmp/beifeng/hive_2016-05-30_18-26-00_624_1665288294433606603-1/-local-10002/HashTable-Stage-4/MapJoin-mapfile00--.hashtable (260 bytes)
2016-05-30 06:26:13 End of local task; Time Taken: 1.67 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1464600726756_0003, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0003/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0003
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2016-05-30 18:26:28,466 Stage-4 map = 0%, reduce = 0%
2016-05-30 18:26:44,721 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 0.96 sec
MapReduce Total cumulative CPU time: 960 msec
Ended Job = job_1464600726756_0003
Loading data to table db_track.session_info partition (date=20150828)
Partition db_track.session_info{date=20150828} stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-4: Map: 1 Cumulative CPU: 0.96 sec HDFS Read: 251 HDFS Write: 61 SUCCESS
Total MapReduce CPU Time Spent: 960 msec
OK
a.session_id a.guid b.trackeru b.landing_url b.landing_url_ref a.user_id a.pv a.stay_time a.min_tracktime a.ip a.provinceid
Time taken: 48.044 seconds
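To verify that the join result landed in the partition, an optional spot check:

hive (db_track)> select session_id, landing_url, pv, stay_time from session_info where date='20150828' limit 3;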
5. Compute the following metrics from session_info and load them into the requirements table vistor_users_info:
date, UV, PV, logged-in users, guest users, average visit duration, second-jump rate, unique IPs
(UV counts distinct guid values; the second-jump rate, sec_ratio below, is the share of sessions with at least two page views.)
hive (db_track)> create table vistor_users_info as
> select
> date,
> count(distinct guid) UV,
> sum(pv) PV,
> count(case when user_id != '' then user_id else null end) login_users,
> count(case when user_id = '' then user_id else null end) vistor_users,
> avg(stay_time) avg_stay_time,
> count(case when pv>=2 then session_id else null end)/count(session_id) sec_ratio,
> count(distinct ip) ip
> from session_info where date='20150828'
> group by date ;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1464600726756_0004, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0004/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-30 18:44:05,804 Stage-1 map = 0%, reduce = 0%
2016-05-30 18:44:26,410 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.04 sec
2016-05-30 18:44:41,263 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.24 sec
MapReduce Total cumulative CPU time: 3 seconds 240 msec
Ended Job = job_1464600726756_0004
Moving data to: hdfs://hadoop-senior.ibeifeng.com:8020/user/hive/warehouse/db_track.db/vistor_users_info
Table db_track.vistor_users_info stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.24 sec HDFS Read: 260 HDFS Write: 52 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 240 msec
OK
date uv pv login_users vistor_users avg_stay_time sec_ratio ip
Time taken: 52.767 seconds
6. Query the requirements table to view the analysis results
hive (db_track)> select * from vistor_users_info;
OK
vistor_users_info.date vistor_users_info.uv vistor_users_info.pv vistor_users_info.login_users vistor_users_info.vistor_users vistor_users_info.avg_stay_time vistor_users_info.sec_ratio vistor_users_info.ip
20150828 38985 131668.0 18548 21902 750.7895179233622 0.5238813349814586 29668
Time taken: 0.523 seconds, Fetched: 1 row(s)
Note: when the results table vistor_users_info is created with create table ... as select ..., Hive writes its data files with the default field delimiter \001 (displayed as boxes in a text viewer), even though the session_info table itself was created with an explicit delimiter. Sqoop therefore cannot parse the file when loading MySQL with '\t'.
Fix: drop the CTAS-created vistor_users_info, recreate it with an explicit field delimiter, then load it with insert into table .... The Sqoop export to MySQL then works normally.
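The first half of the fix is a plain drop table; the recreate with an explicit delimiter follows below:

hive (db_track)> drop table vistor_users_info;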
hive (db_track)> create table vistor_users_info(
> date string,
> UV string,
> PV string,
> login_users string,
> vistor_users string,
> avg_stay_time string,
> sec_ratio string,
> ip string
> )
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.147 seconds
Load the statistics computed from the session-info table into vistor_users_info:
hive (db_track)> insert into table vistor_users_info select
> date,
> count(distinct guid) UV,
> sum(pv) PV,
> count(case when user_id != '' then user_id else null end) login_users,
> count(case when user_id = '' then user_id else null end) vistor_users,
> avg(stay_time) avg_stay_time,
> count(case when pv>=2 then session_id else null end)/count(session_id) sec_ratio,
> count(distinct ip) ip
> from session_info where date='20150828'
> group by date ;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1464600726756_0030, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0030/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464600726756_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-30 23:21:18,203 Stage-1 map = 0%, reduce = 0%
2016-05-30 23:21:46,752 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.6 sec
2016-05-30 23:22:17,900 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.68 sec
MapReduce Total cumulative CPU time: 7 seconds 680 msec
Ended Job = job_1464600726756_0030
Loading data to table db_track.vistor_users_info
Table db_track.vistor_users_info stats: [numFiles=1, numRows=1, totalSize=79, rawDataSize=78]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.68 sec HDFS Read: 9121608 HDFS Write: 161 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 680 msec
OK
date uv pv login_users vistor_users avg_stay_time sec_ratio ip
Time taken: 73.532 seconds
View the statistics in Hive:
hive (db_track)> select * from vistor_users_info;
OK
vistor_users_info.date vistor_users_info.uv vistor_users_info.pv vistor_users_info.login_users vistor_users_info.vistor_users vistor_users_info.avg_stay_time vistor_users_info.sec_ratio vistor_users_info.ip
20150828 38985 131668.0 18548 21902 750.7895179233622 0.5238813349814586 29668
Time taken: 0.183 seconds, Fetched: 1 row(s)
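Before exporting, it is worth confirming from the Hive CLI that the data file is now tab-delimited; the path is the same one passed to --export-dir in the Sqoop command below:

hive (db_track)> dfs -cat /user/hive/warehouse/db_track.db/vistor_users_info/*;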
Create the receiving table vistor_users_info in MySQL:
create table vistor_users_info(
date varchar(20),
uv varchar(50),
pv varchar(50),
login_users varchar(50),
vistor_users varchar(50),
avg_stay_time varchar(50),
sec_ratio varchar(50),
ip varchar(50)
);
Use Sqoop to export the Hive table data to MySQL:
bin/sqoop export \
> --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
> --username root \
> --password 123 \
> --table vistor_users_info \
> --export-dir /user/hive/warehouse/db_track.db/vistor_users_info \
> --input-fields-terminated-by '\t'
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/05/30 23:24:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
16/05/30 23:24:50 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/05/30 23:24:51 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/05/30 23:24:51 INFO tool.CodeGenTool: Beginning code generation
16/05/30 23:24:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vistor_users_info` AS t LIMIT 1
16/05/30 23:24:51 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vistor_users_info` AS t LIMIT 1
16/05/30 23:24:51 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6
Note: /tmp/sqoop-beifeng/compile/d59d0a6ed5ece911a523a321474bffbb/vistor_users_info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/05/30 23:24:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/d59d0a6ed5ece911a523a321474bffbb/vistor_users_info.jar
16/05/30 23:24:55 INFO mapreduce.ExportJobBase: Beginning export of vistor_users_info
16/05/30 23:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/30 23:24:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/30 23:24:57 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/05/30 23:24:57 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/05/30 23:24:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/30 23:24:57 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:8032
16/05/30 23:25:02 INFO input.FileInputFormat: Total input paths to process : 1
16/05/30 23:25:02 INFO input.FileInputFormat: Total input paths to process : 1
16/05/30 23:25:03 INFO mapreduce.JobSubmitter: number of splits:4
16/05/30 23:25:03 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/05/30 23:25:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464600726756_0031
16/05/30 23:25:04 INFO impl.YarnClientImpl: Submitted application application_1464600726756_0031
16/05/30 23:25:04 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464600726756_0031/
16/05/30 23:25:04 INFO mapreduce.Job: Running job: job_1464600726756_0031
16/05/30 23:25:19 INFO mapreduce.Job: Job job_1464600726756_0031 running in uber mode : false
16/05/30 23:25:19 INFO mapreduce.Job: map 0% reduce 0%
16/05/30 23:26:04 INFO mapreduce.Job: map 100% reduce 0%
16/05/30 23:26:12 INFO mapreduce.Job: Job job_1464600726756_0031 completed successfully
16/05/30 23:26:15 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=525728
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1054
HDFS: Number of bytes written=0
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=175923
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=175923
Total vcore-seconds taken by all map tasks=175923
Total megabyte-seconds taken by all map tasks=180145152
Map-Reduce Framework
Map input records=1
Map output records=1
Input split bytes=826
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=272
CPU time spent (ms)=2810
Physical memory (bytes) snapshot=358006784
Virtual memory (bytes) snapshot=3356164096
Total committed heap usage (bytes)=62914560
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/30 23:26:15 INFO mapreduce.ExportJobBase: Transferred 1.0293 KB in 78.1854 seconds (13.4808 bytes/sec)
16/05/30 23:26:15 INFO mapreduce.ExportJobBase: Exported 1 records.
View the statistics received by the vistor_users_info table in MySQL.
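A minimal check from the MySQL client (db_0521 is the database named in the JDBC URL above); it should return the same single row that Hive showed:

mysql> use db_0521;
mysql> select * from vistor_users_info;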