
Hive ETL


Log File Processing

1. Data Storage Design

1. Storage format: textfile (plain text, row-oriented; not a columnar format)
2. Compression format: Snappy (effective on the ORC table built later; the raw textfile table itself stays uncompressed)
3. Partitioned storage: logs partitioned by day
4. Raw data kept in an external table
5. Field extraction via regular expression (RegexSerDe)

2. Creation Code

use db_bf_log;
drop table IF EXISTS db_bf_log.bf_log_src;
create EXTERNAL table db_bf_log.bf_log_src (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
PARTITIONED BY (month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
stored as textfile;
-- Notes:
-- 1. Storage format: textfile (the orc.compress table property has no effect
--    on textfile tables, so it is not set here)
-- 2. Partition columns: month, day
-- 3. External table: EXTERNAL
-- 4. Field extraction: ROW FORMAT SERDE (RegexSerDe, one capture group per column)

3. Data Loading

use db_bf_log;
load data local inpath '/opt/datas/moodle.ibeifeng.access.log' overwrite into table db_bf_log.bf_log_src partition(month = '201512', day = '10');
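
To confirm the load registered the partition (the LOAD statement creates it implicitly), a quick check:

show partitions db_bf_log.bf_log_src;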

4. Data Query

select * from bf_log_src where month='201512' and day='10' limit 5;
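
RegexSerDe returns NULL for every column of a line that does not match input.regex, so a quick sanity check on the parse (a sketch) is to count all-NULL rows:

-- Rows where even remote_addr is NULL almost certainly failed the regex.
select count(*) from bf_log_src
where month='201512' and day='10' and remote_addr is null;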


Source Data Processing

1. Data Storage Design

1. Split the wide table (keep only the columns the business needs)
2. Partitioned storage
3. ORC columnar storage
4. Snappy compression

2. Creation Code

use db_bf_log;
drop table if exists db_bf_log.df_log_comm;
create table IF NOT EXISTS db_bf_log.df_log_comm (
  remote_addr string,
  time_local string,
  request string,
  http_referer string
)
PARTITIONED BY (month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
-- Notes:
-- 1. Storage format: ORC
-- 2. Partition columns: month, day
-- 3. Compression: SNAPPY (via orc.compress)

3. Data Loading

use db_bf_log;
INSERT into TABLE df_log_comm PARTITION(month='201512', day='10')
select remote_addr, time_local, request, http_referer from db_bf_log.bf_log_src;
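
The static PARTITION clause above loads one (month, day) at a time. If the source table spans many days, a dynamic-partition insert writes them all in one statement (a sketch; the two set commands are required for it to run):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The trailing month, day columns feed the dynamic partition keys.
INSERT into TABLE df_log_comm PARTITION(month, day)
select remote_addr, time_local, request, http_referer, month, day
from db_bf_log.bf_log_src;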

4. Data Query

select * from df_log_comm where month='201512' and day='10' limit 5;


Data Cleaning

1. Cleaning Plan

1. Transform data into the expected format (UDF, Python)
2. Remove invalid records (wrong number of fields)
3. Remove erroneous records (bad field values); see the SQL sketch after this list
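
For items 2 and 3, a plain SQL filter often suffices before reaching for a UDF. A sketch that drops unparsed and empty-request rows when populating df_log_comm (it relies on RegexSerDe emitting all-NULL rows for lines that fail the regex):

INSERT OVERWRITE TABLE df_log_comm PARTITION(month='201512', day='10')
select remote_addr, time_local, request, http_referer
from db_bf_log.bf_log_src
where remote_addr is not null   -- regex failed: the whole row is NULL
  and length(request) > 0;      -- value-level check on a required field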

2. UDF

Writing a UDF
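
The UDF write-up itself is not reproduced here. For context, wiring a compiled Java UDF into a session looks like the following (a sketch: the jar path, class name, and function name are placeholders, not from the source):

-- Register the jar and bind a SQL-callable function name to the class.
add jar /opt/datas/hiveudf.jar;
create temporary function my_clean as 'com.example.hive.udf.MyCleanUDF';
select my_clean(http_referer) from bf_log_src where month='201512' and day='10' limit 5;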

3. Python

Create a Python script (my.py)

import sys
import datetime

# Hive TRANSFORM streams rows to stdin as tab-separated text; convert the
# unix timestamp into a weekday number (1 = Monday .. 7 = Sunday) and emit.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))

Load the script

add FILE /opt/datas/my.py;

Clean the data with the script

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python my.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
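
u_data and u_data_new come from Hive's stock MovieLens TRANSFORM example and are not created above; a minimal sketch of a matching target table:

create table IF NOT EXISTS u_data_new (
  userid int,
  movieid int,
  rating int,
  weekday int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';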

Optimization Settings (hive-site.xml)

Fetch Task

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>
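
The effect is visible from the CLI: with conversion enabled, qualifying queries return directly instead of launching a MapReduce job. A session-level sketch:

set hive.fetch.task.conversion=more;
-- Filter on a partition column plus LIMIT: runs as a fetch-only task.
select remote_addr, request from bf_log_src where month='201512' limit 5;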

MR Compression Settings

1. Map output compression:
hive.exec.compress.intermediate=true
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec

2. Reduce output compression:
hive.exec.compress.output=true
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
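
The same switches can also be flipped per session instead of globally in hive-site.xml, for example:

set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;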

Parallelism

hive.exec.parallel=true
hive.exec.parallel.thread.number=8
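
Parallel execution only pays off when a plan contains independent stages; the classic case is UNION ALL, whose branches can run concurrently. An illustrative sketch against the table above:

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
-- The two branches below have no dependency on each other.
select count(*) from df_log_comm where month='201512' and day='10'
union all
select count(distinct remote_addr) from df_log_comm where month='201512' and day='10';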

JVM Reuse

mapreduce.job.jvm.numtasks=4

Number of Reducers

Determine the count by testing: benchmark a range of values and take the final number from where the (roughly normal-shaped) distribution of results peaks.
mapreduce.job.reduces=0
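
Besides pinning a fixed count, Hive can derive the number of reducers from input size; both approaches as a sketch (the values are examples, not recommendations from the source):

-- Let Hive size reducers by bytes of input per reducer:
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=32;
-- Or pin the count found by benchmarking:
set mapreduce.job.reduces=8;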

Speculative Execution

hive.mapred.reduce.tasks.speculative.execution=true
mapreduce.map.speculative=true
mapreduce.reduce.speculative=true

Data Skew [1]
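
The referenced article is not reproduced here. As an illustration (not from the source), Hive's built-in mitigations for group-by skew are map-side aggregation and the two-job skew rewrite:

set hive.map.aggr=true;           -- partial aggregation in the mapper
set hive.groupby.skewindata=true; -- first job spreads skewed keys randomly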

References

Other Helpful Documents

1. Online regex tool
2. Hive DDL: doc 1, doc 2
3. Hive DML: doc 1, doc 2
4. Hive UDF: doc 1, doc 2

