@Arslan6and6 2016-05-25T16:29:02.000000Z

[Assignment 18] Yin Jie

Chapter 8: Big Data Collaboration Frameworks - Sqoop

--- Hive Log Analysis Case Study

Assignment description:
 Requirements
 Based on the log file and its reference fields, compute the pv and uv for each hour of each day.
 Implementation points
 Create a Hive table whose column delimiter matches that of the source file.
 Load the data into the Hive table.
 Write a Hive SQL query for the statistics and land the result in a second Hive table.
 Export the result from the second Hive table to a MySQL table.

Step 1: Create the Hive database and partitioned table

    hive> drop database if exists db_track ;
    OK
    Time taken: 0.115 seconds
    hive> create database db_track ;
    OK
    Time taken: 0.049 seconds
    hive> create table track_log(
        > id string,
        > url string,
        > referer string,
        > keyword string,
        > type string,
        > guid string,
        > pageId string,
        > moduleId string,
        > linkId string,
        > attachedInfo string,
        > sessionId string,
        > trackerU string,
        > trackerType string,
        > ip string,
        > trackerSrc string,
        > cookie string,
        > orderCode string,
        > trackTime string,
        > endUserId string,
        > firstLink string,
        > sessionViewNo string,
        > productId string,
        > curMerchantId string,
        > provinceId string,
        > cityId string,
        > fee string,
        > edmActivity string,
        > edmEmail string,
        > edmJobId string,
        > ieVersion string,
        > platform string,
        > internalKeyword string,
        > resultSum string,
        > currentPage string,
        > linkPosition string,
        > buttonPosition string
        > )
        > partitioned by (date string,hour string)
        > row format delimited fields terminated by '\t' ;
    OK
    Time taken: 0.09 seconds
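
Not part of the original transcript, but a quick sanity check after creating the table is to look at its definition and its partition list with standard Hive commands. A minimal sketch:

    -- confirm the column list, the '\t' field delimiter and the (date, hour) partition keys
    desc formatted track_log ;
    -- list existing partitions (empty until data is loaded in step 2)
    show partitions track_log ;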

Step 2: Load the partition data from the local file system

    load data local inpath '/home/beifeng/2015082818' into table track_log partition(date='20150828',hour='18') ;
    load data local inpath '/home/beifeng/Desktop/source/2015082819' into table track_log partition(date='20150828',hour='19') ;
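
The two load statements create one partition per hour under the 20150828 date. As a quick verification, not shown in the original log, the partitions and their row counts can be checked along these lines:

    -- should list date=20150828/hour=18 and date=20150828/hour=19
    show partitions track_log ;
    -- row counts per loaded partition
    select count(*) from track_log where date='20150828' and hour='18' ;
    select count(*) from track_log where date='20150828' and hour='19' ;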

Step 3: Compute the hourly pv and uv for each day

    select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1464189390335_0001, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0001/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-25 23:21:48,458 Stage-1 map = 0%, reduce = 0%
    2016-05-25 23:22:10,869 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 2.26 sec
    2016-05-25 23:22:13,329 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec
    2016-05-25 23:22:32,297 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.03 sec
    MapReduce Total cumulative CPU time: 6 seconds 30 msec
    Ended Job = job_1464189390335_0001
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.03 sec  HDFS Read: 76517763  HDFS Write: 48  SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 30 msec
    OK
    date        hour   _c2     _c3
    20150828    18     64972   23938
    20150828    19     61162   22330
    Time taken: 69.163 seconds, Fetched: 2 row(s)
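
Here pv (page views) is the number of tracked URL hits per hour, count(url), and uv (unique visitors) is the number of distinct guid values, count(distinct guid). On this data set a single reducer is enough, but count(distinct ...) can put heavy load on the reducer handling each group, so a common rewrite on larger data is to deduplicate guid in a subquery first. A hedged sketch of that equivalent form (assuming guid is never NULL; this rewrite is not part of the original assignment):

    select t.date, t.hour, sum(t.pv) pv, count(*) uv
    from (
        select date, hour, guid, count(url) pv
        from track_log
        where date = '20150828'
        group by date, hour, guid
    ) t
    group by t.date, t.hour ;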

Step 4: Write the statistics into a temporary Hive table

    hive (db_track)> create table track_log_tmp(
        > date string,
        > hour string,
        > pv string,
        > uv string
        > )
        > row format delimited fields terminated by '\t';
    OK
    Time taken: 1.293 seconds
    hive (db_track)> insert into table track_log_tmp select date,hour,count(url),count(distinct guid) from track_log where date='20150828' group by date,hour ;
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Starting Job = job_1464189390335_0002, Tracking URL = http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0002/
    Kill Command = /opt/modules/hadoop-2.5.0-cdh5.3.6/bin/hadoop job -kill job_1464189390335_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-25 23:38:40,279 Stage-1 map = 0%, reduce = 0%
    2016-05-25 23:39:05,766 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec
    2016-05-25 23:39:27,353 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.43 sec
    MapReduce Total cumulative CPU time: 6 seconds 430 msec
    Ended Job = job_1464189390335_0002
    Loading data to table db_track.track_log_tmp
    Table db_track.track_log_tmp stats: [numFiles=1, numRows=2, totalSize=48, rawDataSize=46]
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.43 sec  HDFS Read: 76517763  HDFS Write: 126  SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 430 msec
    OK
    date    hour    _c2     _c3
    Time taken: 60.805 seconds
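
The trailing "date hour _c2 _c3" line is only the column header Hive echoes for the INSERT's SELECT; the result rows now live in track_log_tmp. A quick check, not shown in the original transcript, might look like this:

    -- should return the same two rows as the ad-hoc query in step 3:
    --   20150828  18  64972  23938
    --   20150828  19  61162  22330
    select * from track_log_tmp ;
    -- the table is unpartitioned and tab-delimited, so its warehouse directory
    -- (used as --export-dir in step 5) can be handed to Sqoop as-is
    dfs -ls /user/hive/warehouse/db_track.db/track_log_tmp ;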

Step 5: Export the temporary Hive table to a permanent MySQL table

    mysql> use db_0521;
    Database changed
    mysql> create table track_log(
        -> date varchar(20),
        -> hour varchar(10),
        -> pv varchar(50),
        -> uv varchar(50),
        -> primary key (date,hour)
        -> ) ;
    Query OK, 0 rows affected (0.10 sec)

    bin/sqoop export \
    > --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
    > --username root \
    > --password 123 \
    > --table track_log \
    > --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \
    > --input-fields-terminated-by '\t'
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hbase does not exist! HBase imports will fail.
    Please set $HBASE_HOME to the root of your HBase installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../hcatalog does not exist! HCatalog jobs will fail.
    Please set $HCAT_HOME to the root of your HCatalog installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    Warning: /opt/modules/sqoop-1.4.5-cdh5.3.6/bin/../../zookeeper does not exist! Accumulo imports will fail.
    Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
    16/05/25 23:57:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.6
    16/05/25 23:57:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    16/05/25 23:57:38 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    16/05/25 23:57:38 INFO tool.CodeGenTool: Beginning code generation
    16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
    16/05/25 23:57:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `track_log` AS t LIMIT 1
    16/05/25 23:57:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/modules/hadoop-2.5.0-cdh5.3.6
    Note: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    16/05/25 23:57:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-beifeng/compile/c2aeb2ebcae155cfb2c4a61a8215a5b5/track_log.jar
    16/05/25 23:57:42 INFO mapreduce.ExportJobBase: Beginning export of track_log
    16/05/25 23:57:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/05/25 23:57:42 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/25 23:57:44 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    16/05/25 23:57:44 INFO client.RMProxy: Connecting to ResourceManager at hadoop-senior.ibeifeng.com/192.168.5.130:8032
    16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/25 23:57:46 INFO input.FileInputFormat: Total input paths to process : 1
    16/05/25 23:57:47 INFO mapreduce.JobSubmitter: number of splits:4
    16/05/25 23:57:47 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    16/05/25 23:57:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464189390335_0003
    16/05/25 23:57:48 INFO impl.YarnClientImpl: Submitted application application_1464189390335_0003
    16/05/25 23:57:48 INFO mapreduce.Job: The url to track the job: http://hadoop-senior.ibeifeng.com:8088/proxy/application_1464189390335_0003/
    16/05/25 23:57:48 INFO mapreduce.Job: Running job: job_1464189390335_0003
    16/05/25 23:58:05 INFO mapreduce.Job: Job job_1464189390335_0003 running in uber mode : false
    16/05/25 23:58:05 INFO mapreduce.Job: map 0% reduce 0%
    16/05/25 23:58:43 INFO mapreduce.Job: map 100% reduce 0%
    16/05/25 23:58:49 INFO mapreduce.Job: Job job_1464189390335_0003 completed successfully
    16/05/25 23:58:51 INFO mapreduce.Job: Counters: 30
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=525380
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=828
            HDFS: Number of bytes written=0
            HDFS: Number of read operations=16
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=0
        Job Counters
            Launched map tasks=4
            Data-local map tasks=4
            Total time spent by all maps in occupied slots (ms)=147642
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=147642
            Total vcore-seconds taken by all map tasks=147642
            Total megabyte-seconds taken by all map tasks=151185408
        Map-Reduce Framework
            Map input records=2
            Map output records=2
            Input split bytes=696
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=315
            CPU time spent (ms)=2930
            Physical memory (bytes) snapshot=352866304
            Virtual memory (bytes) snapshot=3355598848
            Total committed heap usage (bytes)=62914560
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=0
    16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Transferred 828 bytes in 67.7294 seconds (12.2251 bytes/sec)
    16/05/25 23:58:52 INFO mapreduce.ExportJobBase: Exported 2 records.

    mysql> show tables;
    +-------------------+
    | Tables_in_db_0521 |
    +-------------------+
    | hdfs2mysql        |
    | my_user           |
    | track_log         |
    +-------------------+
    3 rows in set (0.05 sec)
    mysql> select * from track_log;
    +----------+------+-------+-------+
    | date     | hour | pv    | uv    |
    +----------+------+-------+-------+
    | 20150828 | 18   | 64972 | 23938 |
    | 20150828 | 19   | 61162 | 22330 |
    +----------+------+-------+-------+
    2 rows in set (0.02 sec)
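
Two closing notes on the export, both hedged suggestions rather than part of the original session. First, Sqoop itself warns that --password on the command line is insecure; -P (interactive prompt) or --password-file are the usual replacements. Second, because the MySQL table declares primary key (date, hour), re-running the plain export with the same rows would fail on duplicate keys; Sqoop's --update-key together with --update-mode allowinsert turns the export into an upsert. A sketch of such a re-runnable export:

    bin/sqoop export \
        --connect jdbc:mysql://hadoop-senior.ibeifeng.com:3306/db_0521 \
        --username root \
        -P \
        --table track_log \
        --export-dir /user/hive/warehouse/db_track.db/track_log_tmp \
        --input-fields-terminated-by '\t' \
        --update-key date,hour \
        --update-mode allowinsert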