Outline of this article:
- Data import
  - Importing a local file
  - Importing a file from HDFS
- Data export
  - Exporting to a new table
  - Exporting to a local file
  - Exporting to an HDFS file

The tests in this article were run on Hadoop 2.2.0 + Hive 0.12.0.
[I]. Data Import
The basic syntax of LOAD:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
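If the target table is partitioned, the PARTITION clause names the partition to load into. As a hypothetical illustration only (the table myword_part and its partition column ds are assumed here and are not used elsewhere in this article):

-- assumes a partitioned table, e.g.:
-- CREATE TABLE myword_part(word STRING, counts INT)
--   PARTITIONED BY (ds STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/myword_local.txt' OVERWRITE INTO TABLE myword_part PARTITION (ds='local');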
1. Prepare the test data
Prepare the local data by creating /Users/micmiu/no_sync/testdata/hadoop/myword_local.txt with the following content:
bigdata 23 local
Hadoop 7 local
Spark 12 local
Storm 3 local
Mesos 2 local
YARN 8 local
Tez 2 local
MPI 5 local
Hive 7 local
Impala 9 local
Presto 2 local
micmiu.com 10 local
MichaelSun 8 local
ps: the fields are separated by a tab character ('\t').
To prepare the text data on HDFS, first create a local file /Users/micmiu/no_sync/testdata/hadoop/myword_hdfs.txt with the following content:
Java 15 hdfs
Scale 12 hdfs
Nodejs 7 hdfs
Groovy 3 hdfs
R 8 hdfs
micmiu.com 8 hdfs
MichaelSun 5 hdfs
ps: the fields are separated by a tab character ('\t').
Create the test directory on HDFS:
hdfs dfs -mkdir -p /user/micmiu/test/input
Then upload the local file into HDFS:
hdfs dfs -put /Users/micmiu/no_sync/testdata/hadoop/myword_hdfs.txt /user/micmiu/test/input
At this point the test data is ready.
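Optionally, a quick sanity check from the shell that the file actually landed in HDFS (these commands simply list and print the uploaded file):

hdfs dfs -ls /user/micmiu/test/input
hdfs dfs -cat /user/micmiu/test/input/myword_hdfs.txt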
2. Import local data
Create the test table in Hive:
hive> DROP TABLE IF EXISTS myword;
OK
Time taken: 0.096 seconds
hive> CREATE TABLE myword(word STRING, counts INT, ds STRING) row format delimited fields terminated by '\t';
OK
Time taken: 0.033 seconds
hive>
Command:
LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/myword_local.txt' OVERWRITE INTO TABLE myword;
Run the import and verify it with a query:
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/myword_local.txt' OVERWRITE INTO TABLE myword;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/myword_local.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/myword_local.txt
Loading data to table default.myword
Table default.myword stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 194, raw_data_size: 0]
OK
Time taken: 0.189 seconds
hive> select t.word,t.counts,t.ds from myword t;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1392168703335_0001, Tracking URL = http://micmiu-mbp.local:8088/proxy/application_1392168703335_0001/
Kill Command = /usr/local/share/hadoop/bin/hadoop job -kill job_1392168703335_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-02-12 10:57:09,753 Stage-1 map = 0%, reduce = 0%
2014-02-12 10:57:15,008 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1392168703335_0001
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 413 HDFS Write: 194 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
bigdata 23 local
Hadoop 7 local
Spark 12 local
Storm 3 local
Mesos 2 local
YARN 8 local
Tez 2 local
MPI 5 local
Hive 7 local
Impala 9 local
Presto 2 local
micmiu.com 10 local
MichaelSun 8 local
Time taken: 24.576 seconds, Fetched: 13 row(s)
hive>
The output matches the contents of the local file myword_local.txt, so the import succeeded.
3. Import data from HDFS
Re-create the table as in step 2 above, or empty it with: TRUNCATE TABLE myword
Check the file layout in HDFS before importing the data:
micmiu-mbp:~ micmiu$ hdfs dfs -ls /user/micmiu/test/input
Found 1 items
-rw-r--r--   1 micmiu supergroup         99 2014-02-12 13:48 /user/micmiu/test/input/myword_hdfs.txt
micmiu-mbp:~ micmiu$ hdfs dfs -ls /user/hive/warehouse/myword
The statement to import the file from HDFS:
LOAD DATA INPATH '/user/micmiu/test/input/myword_hdfs.txt' OVERWRITE INTO TABLE myword;
The import runs as follows:
hive> LOAD DATA INPATH '/user/micmiu/test/input/myword_hdfs.txt' OVERWRITE INTO TABLE myword;
Loading data to table default.myword
Table default.myword stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 99, raw_data_size: 0]
OK
Time taken: 0.16 seconds
hive>
Verify the imported data with a query:
hive> select * from myword;
OK
Java 15 hdfs
Scale 12 hdfs
Nodejs 7 hdfs
Groovy 3 hdfs
R 8 hdfs
micmiu.com 8 hdfs
MichaelSun 5 hdfs
Time taken: 0.047 seconds, Fetched: 7 row(s)
hive>
The output matches the contents of the previously prepared file myword_hdfs.txt.
Check how the file layout in HDFS has changed (a LOAD from HDFS moves the source file into the table's warehouse directory rather than copying it):
micmiu-mbp:~ micmiu$ hdfs dfs -ls /user/micmiu/test/input
micmiu-mbp:~ micmiu$ hdfs dfs -ls /user/hive/warehouse/myword
Found 1 items
-rw-r--r--   1 micmiu supergroup         99 2014-02-12 13:48 /user/hive/warehouse/myword/myword_hdfs.txt
ps: two issues came up when importing data from HDFS:
- the OVERWRITE option has no effect when loading data from HDFS
- loading a file with the same name from HDFS a second time makes Hive throw an error; this bug was reportedly fixed in 0.11, but I still ran into it in 0.12.0 (https://issues.apache.org/jira/browse/HIVE-3300). A possible workaround is sketched below.
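One workaround that should avoid the filename collision (a sketch only, reusing the test paths from this article; the new filename myword_hdfs_2.txt is made up) is to copy the source file under a fresh name in HDFS before loading it again:

# copy the file under a new, unique name first (hypothetical name)
hdfs dfs -cp /user/micmiu/test/input/myword_hdfs.txt /user/micmiu/test/input/myword_hdfs_2.txt

Then load the renamed copy from the Hive shell:

hive> LOAD DATA INPATH '/user/micmiu/test/input/myword_hdfs_2.txt' INTO TABLE myword;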
[II]. Data Export
With both of the text files above loaded into the table myword, its contents are as follows:
hive> select * from myword;
OK
Java 15 hdfs
Scale 12 hdfs
Nodejs 7 hdfs
Groovy 3 hdfs
R 8 hdfs
micmiu.com 8 hdfs
MichaelSun 5 hdfs
bigdata 23 local
Hadoop 7 local
Spark 12 local
Storm 3 local
Mesos 2 local
YARN 8 local
Tez 2 local
MPI 5 local
Hive 7 local
Impala 9 local
Presto 2 local
micmiu.com 10 local
MichaelSun 8 local
Time taken: 0.041 seconds, Fetched: 20 row(s)
hive>
1. Export query results to a new table
First create the new target table myword_stat:
hive> CREATE TABLE myword_stat(word STRING, counts INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
HQL:
INSERT OVERWRITE TABLE myword_stat SELECT t.word,sum(t.counts) AS counts FROM myword t group by t.word ORDER BY counts DESC;
The execution goes as follows:
hive> INSERT OVERWRITE TABLE myword_stat SELECT t.word,sum(t.counts) AS counts FROM myword t group by t.word ORDER BY counts DESC;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1392168703335_0014, Tracking URL = http://micmiu-mbp.local:8088/proxy/application_1392168703335_0014/
Kill Command = /usr/local/share/hadoop/bin/hadoop job -kill job_1392168703335_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-02-12 14:59:10,826 Stage-1 map = 0%, reduce = 0%
2014-02-12 14:59:16,023 Stage-1 map = 100%, reduce = 0%
2014-02-12 14:59:22,238 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1392168703335_0014
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1392168703335_0015, Tracking URL = http://micmiu-mbp.local:8088/proxy/application_1392168703335_0015/
Kill Command = /usr/local/share/hadoop/bin/hadoop job -kill job_1392168703335_0015
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-02-12 14:59:39,285 Stage-2 map = 0%, reduce = 0%
2014-02-12 14:59:45,506 Stage-2 map = 100%, reduce = 0%
2014-02-12 14:59:51,744 Stage-2 map = 100%, reduce = 100%
Ended Job = job_1392168703335_0015
Loading data to table default.myword_stat
Table default.myword_stat stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 156, raw_data_size: 0]
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 HDFS Read: 593 HDFS Write: 534 SUCCESS
Job 1: Map: 1 Reduce: 1 HDFS Read: 899 HDFS Write: 156 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 57.611 seconds
hive> select * from myword_stat;
OK
bigdata 23
micmiu.com 18
Java 15
MichaelSun 13
Spark 12
Scale 12
Impala 9
YARN 8
R 8
Nodejs 7
Hive 7
Hadoop 7
MPI 5
Storm 3
Groovy 3
Tez 2
Presto 2
Mesos 2
Time taken: 0.039 seconds, Fetched: 18 row(s)
hive>
2. Export query results to a local file
hive> INSERT OVERWRITE LOCAL DIRECTORY '/Users/micmiu/no_sync/testdata/hadoop/myword_stat_exp' SELECT t.word,t.counts FROM myword_stat t;
To customize the field delimiter of the export (for example '\t'):
hive> INSERT OVERWRITE LOCAL DIRECTORY '/Users/micmiu/no_sync/testdata/hadoop/myword_stat_exp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT t.word,t.counts FROM myword_stat t;
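Note that the target path is a directory rather than a single file; Hive writes the query result as one or more files inside it (typically named 000000_0, 000001_0, and so on). A quick way to inspect the export from the shell, assuming a single output file:

ls /Users/micmiu/no_sync/testdata/hadoop/myword_stat_exp
cat /Users/micmiu/no_sync/testdata/hadoop/myword_stat_exp/000000_0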
3. Export query results to HDFS
hive> INSERT OVERWRITE DIRECTORY '/user/micmiu/test/output/myword_stat_exp' SELECT t.word,t.counts FROM myword_stat t;
ps: specifying a custom delimiter when exporting to an HDFS directory raises an error; a possible workaround is sketched below.
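One way around this (a sketch only, reusing the paths from this article): first run the local export with the custom delimiter from section 2 above, then push the resulting directory up to HDFS from the shell:

# assumes the local export with FIELDS TERMINATED BY '\t' from section 2 has already run
hdfs dfs -put /Users/micmiu/no_sync/testdata/hadoop/myword_stat_exp /user/micmiu/test/output/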
—————– EOF @Michael Sun —————–
Original article; please credit the source when reposting: micmiu – software development and bits of life [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/bigdata/hive/hive-dml-load-insert/