By default, Hive separates table fields with \001 (Ctrl-A). A different character can be specified with ROW FORMAT DELIMITED FIELDS TERMINATED BY, but that syntax supports only a single character. If your delimiter is a multi-character string, you have to implement a custom InputFormat. This post walks through a simple example of using a multi-character string as the field delimiter.
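For comparison, the single-character case needs no custom code at all; a hypothetical comma-delimited table (the table and column names below are purely illustrative) is declared like this:

hive> CREATE TABLE demo_csv (id INT, name STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';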
[1]. Development environment
- Hadoop 2.2.0
- Hive 0.12.0
- Java 1.6+
- Mac OS X 10.9.1
[2]. Example
1. Prepare the demo data: mydemosplit.txt
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael|@^_^@|groovy|@^_^@|http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael|@^_^@|sso|@^_^@|http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael|@^_^@|j2ee|@^_^@|http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
The field delimiter is: "|@^_^@|"
2. Implement the custom InputFormat
MyDemoInputFormat.java
package com.micmiu.hive;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Hive InputFormat for a custom multi-character delimiter, e.g. |@^_^@|
 *
 * @author <a href="http://www.micmiu.com">Michael</a>
 * @create Feb 24, 2014 3:11:16 PM
 * @version 1.0
 */
public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        // Wrap the stock line-oriented reader; we only post-process each line.
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader implements
            RecordReader<LongWritable, Text> {

        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            if (reader.next(key, text)) {
                // e.g. michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
                // Replace the multi-character delimiter |@^_^@| with Hive's
                // default separator \001; the pipe and caret are escaped
                // because they are regex metacharacters. Note toLowerCase()
                // also lowercases the whole line - a no-op on this demo data;
                // drop it if case matters.
                String strReplace = text.toString().toLowerCase()
                        .replaceAll("\\|@\\^_\\^@\\|", "\001");
                Text txtReplace = new Text();
                txtReplace.set(strReplace);
                value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
                return true;
            }
            return false;
        }
    }
}
PS: this is a custom implementation of the InputFormat and RecordReader interfaces; for a reference implementation, see Base64TextInputFormat.java in the Hive source code.
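To sanity-check what next() does before deploying the jar, the core substitution can be exercised on its own; a minimal standalone sketch (the class name and sample line are chosen here just for illustration):

public class SplitReplaceDemo {
    public static void main(String[] args) {
        String line = "michael|@^_^@|hadoop|@^_^@|http://www.micmiu.com/opensource/hadoop/hive-metastore-config/";
        // Same regex as in MyDemoRecordReader.next()
        String replaced = line.replaceAll("\\|@\\^_\\^@\\|", "\001");
        // Splitting on \001 should print the three expected fields
        for (String field : replaced.split("\001")) {
            System.out.println(field);
        }
    }
}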
After compiling and packaging the class into a jar, copy the jar into the <HIVE_HOME>/lib/ directory, then exit and re-enter the Hive CLI so it is picked up.
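Alternatively, instead of copying into lib/, the jar can be registered for the current session only with ADD JAR (the path below is a placeholder for wherever you built the jar):

hive> ADD JAR /path/to/mydemo-inputformat.jar;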
3. Create the table and load the data
Create the table, specifying the custom InputFormat through the STORED AS INPUTFORMAT ... OUTPUTFORMAT ... clause:
hive> CREATE TABLE micmiu_blog(author STRING, category STRING, url STRING)
    > STORED AS
    > INPUTFORMAT 'com.micmiu.hive.MyDemoInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
hive> desc micmiu_blog;
OK
author      string      None
category    string      None
url         string      None
Time taken: 0.05 seconds, Fetched: 3 row(s)
Load the data file prepared above, and compare the table contents before and after the load:
hive> select * from micmiu_blog;
OK
Time taken: 0.033 seconds
hive> LOAD DATA LOCAL INPATH '/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt' OVERWRITE INTO TABLE micmiu_blog;
Copying data from file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Copying file: file:/Users/micmiu/no_sync/testdata/hadoop/hive/mydemosplit.txt
Loading data to table default.micmiu_blog
Table default.micmiu_blog stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 601, raw_data_size: 0]
OK
Time taken: 0.197 seconds
hive> select * from micmiu_blog;
OK
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hive-metastore-config/
michael    j2ee      http://www.micmiu.com/j2ee/hibernate/hibernate-jpa-demo/
michael    groovy    http://www.micmiu.com/lang/groovy/groovy-running-ways/
michael    sso       http://www.micmiu.com/enterprise-app/sso/sso-cas-ldap-auth/
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hive-tutorial-ddl-dml/
michael    j2ee      http://www.micmiu.com/j2ee/spring/springmvc-binding-date/
michael    hadoop    http://www.micmiu.com/opensource/hadoop/hadoop2x-cluster-setup/
Time taken: 0.053 seconds, Fetched: 7 row(s)
hive>
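Since the delimiter substitution happens at read time, the loaded table behaves like any other Hive table afterwards; for example, aggregating over the parsed columns should work as usual (query sketch only; output omitted):

hive> SELECT category, COUNT(*) FROM micmiu_blog GROUP BY category;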
The session above shows that the custom multi-character string is now handled correctly as a field delimiter.
—————– EOF @Michael Sun —————–
Original article; when reprinting, please credit: micmiu – software development + bits of life [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/bigdata/hive/hive-inputformat-string/
Reader comment: Hi, I see the custom delimiter is |@^_^@|, yet the pattern replaced in the code is \\|@\\^_\\^@\\|. Could you explain why? I would like to use a delimiter like $@_@$ and don't know how to write it; or could you recommend something for me to read? Thanks~ 😛
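The reason is that String.replaceAll() interprets its first argument as a Java regular expression, and | and ^ are regex metacharacters, so each must be escaped; since the escaping backslash itself has to be doubled in Java source, the pattern becomes \\|@\\^_\\^@\\|. A $@_@$ delimiter needs the same treatment, because $ is also a metacharacter. A minimal sketch (the sample line is made up):

import java.util.regex.Pattern;

public class DollarDelimiterDemo {
    public static void main(String[] args) {
        String line = "michael$@_@$hadoop$@_@$http://www.micmiu.com/";
        // Escape $ by hand in the pattern...
        System.out.println(line.replaceAll("\\$@_@\\$", "\001"));
        // ...or let Pattern.quote() escape the literal delimiter for you.
        System.out.println(line.replaceAll(Pattern.quote("$@_@$"), "\001"));
    }
}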