@Arslan6and6
2016-05-25T16:29:02.000000Z
字数 9202
阅读 688
第八章、大数据协作框架之Sqoop
---Hive日志分析案例
作业描述:
需求说明
依据日志文件,参考字段,统计分析每日各时段的 pv 和 uv
实现要点
建立 Hive 表,表列分隔符需要与原文件保持一致
Load 加载数据到 Hive 表
写 Hive sql 统计,结果落地到 Hive 表 2
从 Hive 表 2 导出结果到 mysql 表
hive> drop database if exists db_track ;OKTime taken: 0.115 secondshive> create database db_track ;OKTime taken: 0.049 secondshive> create table track_log(> id string,> url string,> referer string,> keyword string,> type string,> guid string,> pageId string,> moduleId string,> linkId string,> attachedInfo string,> sessionId string,> trackerU string,> trackerType string,> ip string,> trackerSrc string,> cookie string,> orderCode string,> trackTime string,> endUserId string,> firstLink string,> sessionViewNo string,> productId string,> curMerchantId string,> provinceId string,> cityId string,> fee string,> edmActivity string,> edmEmail string,> edmJobId string,> ieVersion string,> platform string,> internalKeyword string,> resultSum string,> currentPage string,> linkPosition string,> buttonPosition string> )> partitioned by (date string,hour string)> row format delimited fields terminated by '\t' ;OKTime taken: 0.09 seconds
步骤二:从本地加载分区表数据
load data local inpath '/home/beifeng/2015082818' into table track_log partition(date='20150828',hour='18') ;load data local inpath '/home/beifeng/Desktop/source/2015082819' into table track_log partition(date='20150828',hour='19') ;
步骤三:统计每日各时段的 pv 和 uv
select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;Total jobs = 1Launching Job 1 out of 1Number of reduce tasks not specified. Estimated from input data size: 1In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>In order to set a constant number of reducers:set mapreduce.job.reduces=<number>Starting Job = job_1464189390335_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0001/Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0001Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 12016-05-25 23:21:48,458 Stage-1 map = 0%, reduce = 0%2016-05-25 23:22:10,869 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 2.26 sec2016-05-25 23:22:13,329 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec2016-05-25 23:22:32,297 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.03 secMapReduce Total cumulative CPU time: 6 seconds 30 msecEnded Job = job_1464189390335_0001MapReduce Jobs Launched:Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.03 sec HDFS Read: 76517763 HDFS Write: 48 SUCCESSTotal MapReduce CPU Time Spent: 6 seconds 30 msecOKdate hour _c2 _c320150828 18 64972 2393820150828 19 61162 22330Time taken: 69.163 seconds, Fetched: 2 row(s)
步骤四:把统计结果导入Hive临时表
hive (db_track)> create table track_log_tmp(> date string,> hour string,> pv string,> uv string> )> row format delimited fields terminated by '\t';OKTime taken: 1.293 secondshive (db_track)> insert into table track_log_tmp select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;Total jobs = 1Launching Job 1 out of 1Number of reduce tasks not specified. Estimated from input data size: 1In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>In order to set a constant number of reducers:set mapreduce.job.reduces=<number>Starting Job = job_1464189390335_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0002/Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0002Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 12016-05-25 23:38:40,279 Stage-1 map = 0%, reduce = 0%2016-05-25 23:39:05,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec2016-05-25 23:39:27,353 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.43 secMapReduce Total cumulative CPU time: 6 seconds 430 msecEnded Job = job_1464189390335_0002Loading data to table db_track.track_log_tmpTable db_track.track_log_tmp stats: [numFiles=1, numRows=2, totalSize=48, rawDataSize=46]MapReduce Jobs Launched:Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 76517763 HDFS Write: 126 SUCCESSTotal MapReduce CPU Time Spent: 6 seconds 430 msecOKdate hour _c2 _c3Time taken: 60.805 seconds
步骤五:把Hive临时表数据导出到MySQL永久表
mysql> use db_0521;Database changedmysql> create table track_log(-> date varchar(20),-> hour varchar(10),-> pv varchar(50),-> uv varchar(50),-> primary key (date,hour)-> ) ;Query OK, 0 rows affected (0.10 sec)bin/sqoop export \> --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \> --username root \> --password 123 \> --table track_log \> --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \> --input-fields-terminated-by '\t'Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.Please set $HBASE_HOME to the root of your HBase installation.Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.Please set $HCAT_HOME to the root of your HCatalog installation.Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.Please set $ACCUMULO_HOME to the root of your Accumulo installation.Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.16/05/25 23:57:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.616/05/25 23:57:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.16/05/25 23:57:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.16/05/25 23:57:38 INFO tool.CodeGenTool: Beginning code generation16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 116/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 116/05/25 23:57:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6Note: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.java uses or overrides a deprecated API.Note: Recompile with -Xlint:deprecation for details.16/05/25 23:57:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.jar16/05/25 23:57:42 INFO mapreduce.ExportJobBase: Beginning export of track_log16/05/25 23:57:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable16/05/25 23:57:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar16/05/25 23:57:44 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps16/05/25 23:57:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:803216/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 116/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 116/05/25 23:57:47 INFO mapreduce.JobSubmitter: number of splits:416/05/25 23:57:47 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative16/05/25 23:57:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464189390335_000316/05/25 23:57:48 INFO impl.YarnClientImpl: Submitted application application_1464189390335_000316/05/25 23:57:48 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0003/16/05/25 23:57:48 INFO mapreduce.Job: Running job: job_1464189390335_000316/05/25 23:58:05 INFO mapreduce.Job: Job job_1464189390335_0003 running in uber mode : false16/05/25 23:58:05 INFO mapreduce.Job: map 0% reduce 0%16/05/25 23:58:43 INFO mapreduce.Job: map 100% reduce 0%16/05/25 23:58:49 INFO mapreduce.Job: Job job_1464189390335_0003 completed successfully16/05/25 23:58:51 INFO mapreduce.Job: Counters: 30File System CountersFILE: Number of bytes read=0FILE: Number of bytes written=525380FILE: Number of read operations=0FILE: Number of large read operations=0FILE: Number of write operations=0HDFS: Number of bytes read=828HDFS: Number of bytes written=0HDFS: Number of read operations=16HDFS: Number of large read operations=0HDFS: Number of write operations=0Job CountersLaunched map tasks=4Data-local map tasks=4Total time spent by all maps in occupied slots (ms)=147642Total time spent by all reduces in occupied slots (ms)=0Total time spent by all map tasks (ms)=147642Total vcore-seconds taken by all map tasks=147642Total megabyte-seconds taken by all map tasks=151185408Map-Reduce FrameworkMap input records=2Map output records=2Input split bytes=696Spilled Records=0Failed Shuffles=0Merged Map outputs=0GC time elapsed (ms)=315CPU time spent (ms)=2930Physical memory (bytes) snapshot=352866304Virtual memory (bytes) snapshot=3355598848Total committed heap usage (bytes)=62914560File Input Format CountersBytes Read=0File Output Format CountersBytes Written=016/05/25 23:58:52 INFO mapreduce.ExportJobBase: Transferred 828 bytes in 67.7294 seconds (12.2251 bytes/sec)16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Exported 2 records.mysql> show tables;+-------------------+| Tables_in_db_0521 |+-------------------+| hdfs2mysql || my_user || track_log |+-------------------+3 rows in set (0.05 sec)mysql> select * from track_log;+----------+------+-------+-------+| date | hour | pv | uv |+----------+------+-------+-------+| 20150828 | 18 | 64972 | 23938 || 20150828 | 19 | 61162 | 22330 |+----------+------+-------+-------+2 rows in set (0.02 sec)