Ever tried specifying Ctrl-G or Ctrl-A as a delimiter using TextOutputFormat? Well, with the current version of Hadoop, you can't.


11/04/11 18:39:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 3ebdc922a897735c130a12bb44fc8c0819077f9d]
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException:
org.xml.sax.SAXParseException: Character reference "&#7" is an invalid XML character.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1317)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1186)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1115)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:425)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1709)
at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:214)
at org.apache.hadoop.mapred.JobInProgress.&lt;init&gt;(JobInProgress.java:264)
at org.apache.hadoop.mapred.JobInProgress.&lt;init&gt;(JobInProgress.java:240)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

at org.apache.hadoop.ipc.Client.call(Client.java:818)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:443)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:467)
at com.ebay.ice.hadoop.mobius.srp.mapred.SRPImpressionCounter.run(SRPImpressionCounter.java:182)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.ebay.ice.hadoop.mobius.srp.mapred.SRPImpressionCounter.main(SRPImpressionCounter.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I faced the error shown above when I tried to use Ctrl-G (\u0007) as the delimiter between key-value pairs in the output. The issue here is that the client serializes the Configuration to XML, which is later unmarshalled by the JobTracker. This step fails on the JobTracker because it cannot deserialize "\u0007", or, for that matter, any of the control characters that XML 1.0 disallows.
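The failure is easy to reproduce without Hadoop at all: a plain JDK XML 1.0 parser rejects the numeric character reference &#7; (which is how the serialized Configuration carries \u0007) exactly the way the JobTracker does. A small sketch; the class and method names here are mine, not Hadoop's:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class CharRefDemo {
    // Returns true if the JDK's XML parser accepts the document.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // e.g. SAXParseException: Character reference "&#7" is an invalid XML character.
            return false;
        }
    }

    public static void main(String[] args) {
        // Same shape as a serialized Hadoop Configuration entry.
        String xml10 = "<?xml version=\"1.0\"?><configuration>"
            + "<property><name>mapred.textoutputformat.separator</name>"
            + "<value>&#7;</value></property></configuration>";
        System.out.println(parses(xml10)); // false
    }
}
```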

Workaround (a dirty one at that!): create a custom TextOutputFormat that hard-codes the unicode character as the default separator. This is a really dirty hack, and I am working on making the code generic enough to accept the special delimiter as an argument.
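One possible shape for that hack, assuming the old mapred API shown below; the class name is mine and this is an untested sketch. Because getRecordWriter runs on the task side, setting the separator there means the BEL character never travels through the client-to-JobTracker XML serialization:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;

public class CtrlGTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        // Hard-coded delimiter: set task-side, just before the parent reads it,
        // so it never has to survive the XML round trip.
        job.set("mapred.textoutputformat.separator", "\u0007");
        return super.getRecordWriter(ignored, job, name, progress);
    }
}
```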

Here is the code from TextOutputFormat:

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored,
                                              JobConf job,
                                              String name,
                                              Progressable progress)
        throws IOException {
      boolean isCompressed = getCompressOutput(job);
      String keyValueSeparator = job.get("mapred.textoutputformat.separator",
                                         "\t");
      if (!isCompressed) {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
      } else {
        Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
        // create the named codec
        CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
        // build the filename including the extension
        Path file =
          FileOutputFormat.getTaskOutputPath(job,
                                             name + codec.getDefaultExtension());
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new LineRecordWriter<K, V>(new DataOutputStream
                                          (codec.createOutputStream(fileOut)),
                                          keyValueSeparator);
      }
    }

The getRecordWriter implementation above uses "\t" as the default separator. Changing this default in your custom output format should work. But again, this is a "really dirty" hack! I will post again once I have a better implementation.
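To see what the separator actually does in the output, here is a tiny stdlib-only sketch of the byte layout that LineRecordWriter produces for each key-value pair: key bytes, separator bytes, value bytes, newline. The names are mine, not Hadoop's:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class SeparatorDemo {
    // Mimics one LineRecordWriter.write(key, value) call for text keys/values.
    static byte[] writeLine(String key, String value, String sep) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] s = sep.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        out.write(k, 0, k.length);
        out.write(s, 0, s.length);
        out.write(v, 0, v.length);
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] line = writeLine("k1", "v1", "\u0007");
        // The BEL byte (0x07) sits between the key and value bytes.
        System.out.println(line[2]); // 7
    }
}
```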

Another hack would be to provide the delimiter through an XML resource file. The XML version needs to be declared as 1.1, since 1.0 fails to recognize these control characters: the XML 1.0 spec explicitly omitted most of the non-printing characters in the range 0x00 to 0x1F. Note also that the value must be written as the numeric character reference &#7; rather than the literal string \u0007, since Configuration reads the value verbatim and does not unescape Java-style sequences. Here, mapred.textoutputformat.separator is set to the BEL character:

    <?xml version="1.1"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>

    <property>
    <name>hadoop.user</name>
    <value>${user.name}</value>
    </property>

    <property>
    <name>mapred.textoutputformat.separator</name>
    <value>&#7;</value>
    </property>

    </configuration>
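To convince yourself that the version="1.1" declaration is what makes the difference, here is a stdlib-only sketch that parses such a property back and recovers the BEL character; the class and method names are mine:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Xml11Demo {
    // Parses the document and returns the text of the first <value> element,
    // or null if the parser rejects the document.
    static String readSeparator(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("value").item(0).getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Same property as above, but in an XML 1.1 document.
        String xml11 = "<?xml version=\"1.1\"?><configuration>"
            + "<property><name>mapred.textoutputformat.separator</name>"
            + "<value>&#7;</value></property></configuration>";
        String sep = readSeparator(xml11);
        System.out.println(sep != null && sep.equals("\u0007")); // true
    }
}
```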
