By default, Hive separates table fields with \001 (Ctrl-A). A different character can be specified with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax supports only a single character. If your delimiter is a multi-character string, you have to implement a custom InputFormat. This post walks through a simple example of using a multi-character string as the field delimiter.
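For comparison, the single-character case needs no custom code at all; a hypothetical comma-delimited table (the table and column names below are purely illustrative) is declared like this:

hive> CREATE TABLE demo_csv (id INT, name STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';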
[1]. Development environment
- Hadoop 2.2.0
- Hive 0.12.0
- Java 1.6+
- Mac OS X 10.9.1
[2]. Example
1. Prepare the demo data: mydemosplit.txt
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael|@^_^@|groovy|@^_^@|http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael|@^_^@|sso|@^_^@|http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
The field delimiter is: "|@^_^@|"
2. Implement the custom InputFormat
MyDemoInputFormat.java
package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Hive InputFormat for a custom multi-character delimiter, e.g. |@^_^@|
 *
 * @author <a href="http://www.micmiu.com">Michael</a>
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the stock line-oriented reader; we only post-process each line.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (reader.next(key, text)) {
                // e.g. michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
                // Replace the multi-character delimiter |@^_^@| with Hive's
                // default separator \001; the pipe and caret are escaped
                // because they are regex metacharacters. Note toLowerCase()
                // also lowercases the whole line - a no-op on this demo data;
                // drop it if case matters.
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                Text txtReplace = new Text();
                txtReplace.set(strReplace);
                value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
                return true;
            }
            return false;
        }
    }
}
PS: this is a custom implementation of the InputFormat and RecordReader interfaces; for a reference implementation, see Base64TextInputFormat.java in the Hive source code.
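To sanity-check what next() does before deploying the jar, the core substitution can be exercised on its own; a minimal standalone sketch (the class name and sample line are chosen here just for illustration):

public class SplitReplaceDemo {
    public static void main(String[] args) {
        String line = "michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/";
        // Same regex as in MyDemoRecordReader.next()
        String replaced = line.replaceAll("\\|@\\^_\\^@\\|", "\001");
        // Splitting on \001 should print the three expected fields
        for (String field : replaced.split("\001")) {
            System.out.println(field);
        }
    }
}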
After compiling and packaging the class into a jar, copy the jar into the <HIVE_HOME>/lib/ directory, then exit and re-enter the Hive CLI so it is picked up.
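Alternatively, instead of copying into lib/, the jar can be registered for the current session only with ADD JAR (the path below is a placeholder for wherever you built the jar):

hive> ADD JAR /path/to/mydemo-inputformat.jar;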
3. Create the table and load the data
Create the table, specifying the custom InputFormat through the STORED AS INPUTFORMAT ... OUTPUTFORMAT ... clause:
hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING)
    > STORED AS
    > INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author      string      None
category    string      None
url         string      None
Time taken: 0.05 seconds, Fetched: 3 row(s)
Load the data file prepared above, and compare the table contents before and after the load:
hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael    j2ee      http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael    groovy    http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael    sso       http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael    j2ee      http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>
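Since the delimiter substitution happens at read time, the loaded table behaves like any other Hive table afterwards; for example, aggregating over the parsed columns should work as usual (query sketch only; output omitted):

hive> SELECT category, COUNT(*) FROM micmiu_blog GROUP BY category;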
The session above shows that the custom multi-character string is now handled correctly as a field delimiter.
—————– EOF @Michael Sun —————–
Original article; when reprinting, please credit: micmiu – software development + bits of life [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/
Reader comment: Hi, I see the custom delimiter is |@^_^@|, yet the pattern replaced in the code is \\|@\\^_\\^@\\|. Could you explain why? I would like to use a delimiter like $@_@$ and don't know how to write it; or could you recommend something for me to read? Thanks~ 😛
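The reason is that String.replaceAll() interprets its first argument as a Java regular expression, and | and ^ are regex metacharacters, so each must be escaped; since the escaping backslash itself has to be doubled in Java source, the pattern becomes \\|@\\^_\\^@\\|. A $@_@$ delimiter needs the same treatment, because $ is also a metacharacter. A minimal sketch (the sample line is made up):

import java.util.regex.Pattern;

public class DollarDelimiterDemo {
    public static void main(String[] args) {
        String line = "michael$@_@$hadoop$@_@$http://www.micmiu.com/";
        // Escape $ by hand in the pattern...
        System.out.println(line.replaceAll("\\$@_@\\$", "\001"));
        // ...or let Pattern.quote() escape the literal delimiter for you.
        System.out.println(line.replaceAll(Pattern.quote("$@_@$"), "\001"));
    }
}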