@Arslan6and6
2016-05-25T16:29:02.000000Z
Chapter 8: Sqoop, a Big Data Collaboration Framework
--- Hive Log Analysis Case Study
Assignment description:
Requirements
Based on the log file and its fields, compute the page views (pv) and unique visitors (uv) for each hour of each day.
Implementation points
Create a Hive table whose column delimiter matches the source file.
Load the data into the Hive table.
Write Hive SQL to compute the statistics and land the results in a second Hive table.
Export the results from the second Hive table into a MySQL table.
Step 1: Create the Hive database and partitioned table
hive> drop database if exists db_track ;
OK
Time taken: 0.115 seconds
hive> create database db_track ;
OK
Time taken: 0.049 seconds
hive> use db_track ;
hive> create table track_log(
> id string,
> url string,
> referer string,
> keyword string,
> type string,
> guid string,
> pageId string,
> moduleId string,
> linkId string,
> attachedInfo string,
> sessionId string,
> trackerU string,
> trackerType string,
> ip string,
> trackerSrc string,
> cookie string,
> orderCode string,
> trackTime string,
> endUserId string,
> firstLink string,
> sessionViewNo string,
> productId string,
> curMerchantId string,
> provinceId string,
> cityId string,
> fee string,
> edmActivity string,
> edmEmail string,
> edmJobId string,
> ieVersion string,
> platform string,
> internalKeyword string,
> resultSum string,
> currentPage string,
> linkPosition string,
> buttonPosition string
> )
> partitioned by (date string,hour string)
> row format delimited fields terminated by '\t' ;
OK
Time taken: 0.09 seconds
Step 2: Load the partitioned table from local files
load data local inpath '/home/beifeng/2015082818' into table track_log partition(date='20150828',hour='18') ;
load data local inpath '/home/beifeng/Desktop/source/2015082819' into table track_log partition(date='20150828',hour='19') ;
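Each load creates a new partition directory under the warehouse (for example date=20150828/hour=18). Before running the statistics it is worth confirming that both partitions arrived; this quick check is not part of the original transcript, but uses standard Hive commands:
show partitions track_log ;
select id, url, guid, trackTime from track_log where date='20150828' and hour='18' limit 5 ;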
Step 3: Compute pv and uv for each hour of each day
select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1464189390335_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0001/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-25 23:21:48,458 Stage-1 map = 0%, reduce = 0%
2016-05-25 23:22:10,869 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 2.26 sec
2016-05-25 23:22:13,329 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec
2016-05-25 23:22:32,297 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.03 sec
MapReduce Total cumulative CPU time: 6 seconds 30 msec
Ended Job = job_1464189390335_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.03 sec HDFS Read: 76517763 HDFS Write: 48 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 30 msec
OK
date hour _c2 _c3
20150828 18 64972 23938
20150828 19 61162 22330
Time taken: 69.163 seconds, Fetched: 2 row(s)
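Here pv counts every logged request (count(url)) and uv counts distinct visitors by their guid. count(distinct guid) makes Hive gather all guids of a (date, hour) group in a single reducer, which is fine at this data size but can become a bottleneck on larger partitions. A common rewrite, shown here only as a sketch and not taken from the original tutorial, deduplicates guid in a first aggregation so the heavy work spreads over more reducers:
select date, hour, sum(pv) pv, count(guid) uv
from (
  select date, hour, guid, count(url) pv
  from track_log
  where date='20150828'
  group by date, hour, guid
) t
group by date, hour ;
Both versions return the same pv and uv figures for this data set.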
Step 4: Write the aggregated results into a Hive staging table
hive (db_track)> create table track_log_tmp(
> date string,
> hour string,
> pv string,
> uv string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 1.293 seconds
hive (db_track)> insert into table track_log_tmp select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1464189390335_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0002/
Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-25 23:38:40,279 Stage-1 map = 0%, reduce = 0%
2016-05-25 23:39:05,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec
2016-05-25 23:39:27,353 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.43 sec
MapReduce Total cumulative CPU time: 6 seconds 430 msec
Ended Job = job_1464189390335_0002
Loading data to table db_track.track_log_tmp
Table db_track.track_log_tmp stats: [numFiles=1, numRows=2, totalSize=48, rawDataSize=46]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 76517763 HDFS Write: 126 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
date hour _c2 _c3
Time taken: 60.805 seconds
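The staging table's data file now sits under /user/hive/warehouse/db_track.db/track_log_tmp, which is exactly the path handed to Sqoop's --export-dir in the next step. A quick look at the table (again a check added here, not part of the original run) confirms the two summary rows are in place:
select * from track_log_tmp ;
dfs -cat /user/hive/warehouse/db_track.db/track_log_tmp/* ;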
Step 5: Export the Hive staging table to a permanent MySQL table with Sqoop
mysql> use db_0521;
Database changed
mysql> create table track_log(
-> date varchar(20),
-> hour varchar(10),
-> pv varchar(50),
-> uv varchar(50),
-> primary key (date,hour)
-> ) ;
Query OK, 0 rows affected (0.10 sec)
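The MySQL columns are declared varchar to mirror the string columns of the Hive staging table, which keeps the Sqoop export straightforward. If downstream queries need arithmetic on the counts, pv and uv could instead be declared as integers; Sqoop will parse the text values into numeric columns. A possible alternative DDL (not used in this tutorial) would be:
create table track_log(
  date varchar(20),
  hour varchar(10),
  pv bigint,
  uv bigint,
  primary key (date,hour)
) ;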
bin/sqoop export \
> --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
> --username root \
> --password 123 \
> --table track_log \
> --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \
> --input-fields-terminated-by '\t'
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
16/05/25 23:57:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
16/05/25 23:57:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/05/25 23:57:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/05/25 23:57:38 INFO tool.CodeGenTool: Beginning code generation
16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
16/05/25 23:57:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6
Note: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/05/25 23:57:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.jar
16/05/25 23:57:42 INFO mapreduce.ExportJobBase: Beginning export of track_log
16/05/25 23:57:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/25 23:57:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/25 23:57:44 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/25 23:57:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:8032
16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
16/05/25 23:57:47 INFO mapreduce.JobSubmitter: number of splits:4
16/05/25 23:57:47 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
16/05/25 23:57:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464189390335_0003
16/05/25 23:57:48 INFO impl.YarnClientImpl: Submitted application application_1464189390335_0003
16/05/25 23:57:48 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0003/
16/05/25 23:57:48 INFO mapreduce.Job: Running job: job_1464189390335_0003
16/05/25 23:58:05 INFO mapreduce.Job: Job job_1464189390335_0003 running in uber mode : false
16/05/25 23:58:05 INFO mapreduce.Job: map 0% reduce 0%
16/05/25 23:58:43 INFO mapreduce.Job: map 100% reduce 0%
16/05/25 23:58:49 INFO mapreduce.Job: Job job_1464189390335_0003 completed successfully
16/05/25 23:58:51 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=525380
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=828
HDFS: Number of bytes written=0
HDFS: Number of read operations=16
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=147642
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=147642
Total vcore-seconds taken by all map tasks=147642
Total megabyte-seconds taken by all map tasks=151185408
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=696
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=315
CPU time spent (ms)=2930
Physical memory (bytes) snapshot=352866304
Virtual memory (bytes) snapshot=3355598848
Total committed heap usage (bytes)=62914560
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Transferred 828 bytes in 67.7294 seconds (12.2251 bytes/sec)
16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Exported 2 records.
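The export succeeded, but note that (date, hour) is the primary key of the MySQL table, so re-running the same sqoop export against track_log would abort on duplicate-key errors. Sqoop's --update-key and --update-mode options cover that case; another pattern, sketched here with a hypothetical staging table track_log_stage that is not part of this tutorial, is to export into the staging table and upsert on the MySQL side:
insert into track_log (date, hour, pv, uv)
select date, hour, pv, uv from track_log_stage
on duplicate key update pv = values(pv), uv = values(uv) ;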
mysql> show tables;
+-------------------+
| Tables_in_db_0521 |
+-------------------+
| hdfs2mysql |
| my_user |
| track_log |
+-------------------+
3 rows in set (0.05 sec)
mysql> select * from track_log;
+----------+------+-------+-------+
| date | hour | pv | uv |
+----------+------+-------+-------+
| 20150828 | 18 | 64972 | 23938 |
| 20150828 | 19 | 61162 | 22330 |
+----------+------+-------+-------+
2 rows in set (0.02 sec)