Unicode characters/Ctrl G or Ctrl A as TextOutputFormat (Hadoop) delimiter

Ever tried specifying Ctrl-G or Ctrl-A as a delimiter using the TextOutputFormat? Well, you would not be able to with the current version of Hadoop.

11/04/11 18:39:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 3ebdc922a897735c130a12bb44fc8c0819077f9d]
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException:
org.xml.sax.SAXParseException: Character reference "&#7" is an invalid XML character.
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1317)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1186)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1115)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:425)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1709)
at org.apache.hadoop.mapred.JobConf.
at org.apache.hadoop.mapred.JobInProgress.
at org.apache.hadoop.mapred.JobInProgress.
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

at org.apache.hadoop.ipc.Client.call(Client.java:818)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:443)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:467)
at com.ebay.ice.hadoop.mobius.srp.mapred.SRPImpressionCounter.run(SRPImpressionCounter.java:182)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.ebay.ice.hadoop.mobius.srp.mapred.SRPImpressionCounter.main(SRPImpressionCounter.java:110)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I hit the error shown above when I tried to use Ctrl-G (\u0007) as the delimiter between key-value pairs in the output. The issue is that the client serializes the Configuration as XML, which the JobTracker later unmarshals. That step fails on the JobTracker because “\u0007” is written out as the character reference &#7;, which is not a legal character in XML 1.0. The same applies to most other control characters, Ctrl-A (\u0001) included.
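The failure can be reproduced outside Hadoop with the JDK’s own XML parser. The sketch below is not Hadoop code: the class name and the hand-written configuration string are assumptions made for illustration. It only shows that an XML 1.0 document carrying Ctrl-G as a character reference is rejected at parse time, which appears to be what happens when the JobTracker reloads the job configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Hadoop.
class Xml10ControlCharDemo {

    // Hand-written stand-in for a serialized job configuration (XML 1.0).
    static final String CONFIG_XML =
        "<?xml version=\"1.0\"?>"
        + "<configuration><property>"
        + "<name>mapred.textoutputformat.separator</name>"
        + "<value>&#7;</value>"
        + "</property></configuration>";

    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            // Parser setup or I/O problems are not what this demo is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Parse error: " + parseError(CONFIG_XML));
    }
}
```

Swapping &#7; for a tab (&#9;) in the string should make the same document parse, since tab is one of the few control characters XML 1.0 allows.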

Workaround (a dirty one at that!) : Create a custom TextOutputFormat that hard-codes the control character as the default separator. This is a really dirty hack, and I am working on making the code generic enough to accept the special delimiter as an argument.

Here is the code from TextOutputFormat :

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored,
                                              JobConf job,
                                              String name,
                                              Progressable progress)
        throws IOException {
      boolean isCompressed = getCompressOutput(job);
      // the separator is read from the job configuration, tab by default
      String keyValueSeparator = job.get("mapred.textoutputformat.separator",
                                         "\t");
      if (!isCompressed) {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
      } else {
        Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
        // create the named codec
        CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
        // build the filename including the extension
        Path file =
          FileOutputFormat.getTaskOutputPath(job,
                                             name + codec.getDefaultExtension());
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new LineRecordWriter<K, V>(new DataOutputStream
                                          (codec.createOutputStream(fileOut)),
                                          keyValueSeparator);
      }
    }


The RecordWriter implementation uses “\t” as the default separator. Changing this default in your custom output format should work. But again, this is a “really dirty” hack! I will post later once I have a better implementation.
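To make the hack concrete without pulling in Hadoop, here is a minimal JDK-only sketch of a line writer with the separator hard-coded. The class CtrlGLineWriter and its render helper are hypothetical stand-ins, not the Hadoop LineRecordWriter; the only point is that the delimiter lives in the code, so it never has to travel through the XML job configuration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Simplified, JDK-only stand-in for a line record writer; not the Hadoop class.
class CtrlGLineWriter {
    // Ctrl-G (BEL) hard-coded instead of being read from the job configuration.
    static final String SEPARATOR = "\u0007";

    private final DataOutputStream out;

    CtrlGLineWriter(DataOutputStream out) {
        this.out = out;
    }

    // Writes one record as: key, separator, value, newline.
    void write(String key, String value) throws IOException {
        out.write(key.getBytes(StandardCharsets.UTF_8));
        out.write(SEPARATOR.getBytes(StandardCharsets.UTF_8));
        out.write(value.getBytes(StandardCharsets.UTF_8));
        out.write('\n');
    }

    // Convenience helper for trying the writer out in memory.
    static byte[] render(String key, String value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(buffer)) {
            new CtrlGLineWriter(data).write(key, value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] line = render("item42", "17");
        // The byte right after the 6-byte key is the separator.
        System.out.println("Separator byte: " + line[6]);
    }
}
```

In a real job the same idea would go into a subclass of the output format, with the constant replacing the job.get(...) lookup.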

Another hack would be to provide the delimiter through an XML resource file. The XML version needs to be marked 1.1, since 1.0 does not accept these control characters: the XML 1.0 spec explicitly excludes most of the non-printing characters in the range 0x00 to 0x1F.

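Name: mapred.textoutputformat.separator
Value: \u0007

<?xml version="1.1"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.textoutputformat.separator</name>
    <value>&#7;</value>
  </property>
</configuration>
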
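A quick way to sanity-check such a resource file before handing it to Hadoop is to parse it with the JDK’s XML parser and read the value back. This is a sketch under assumptions: the class name is made up, the XML is inlined as a string rather than read from a file, and it relies on the JDK’s bundled parser accepting XML 1.1 documents.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of Hadoop.
class Xml11SeparatorDemo {

    // Inlined copy of a resource file marked as XML 1.1.
    static final String RESOURCE_XML =
        "<?xml version=\"1.1\"?>"
        + "<configuration><property>"
        + "<name>mapred.textoutputformat.separator</name>"
        + "<value>&#7;</value>"
        + "</property></configuration>";

    // Parses the document and returns the text of the first <value> element.
    static String readSeparator(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("value").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse resource", e);
        }
    }

    public static void main(String[] args) {
        String separator = readSeparator(RESOURCE_XML);
        System.out.println("Separator code point: " + (int) separator.charAt(0));
    }
}
```

If the version in the declaration is changed back to 1.0, the same call should fail with the SAXParseException from the stack trace at the top of this post.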

Create Filters – Excluding traffic from Google Analytics

To exclude traffic by Cookie Content

To exclude traffic from dynamic IP addresses, you can use a JavaScript function to set a cookie on your internal computers. You’ll then be able to filter all visitors with this cookie out of your Analytics reports.

How to exclude traffic by cookie:

1. Create a new page on your domain, containing the following code:

<body onLoad="javascript:pageTracker._setVar('test_value');">

(Please note that this code is in addition to the Google Analytics tracking code that you have on every page of your website.)

2. In order to set the cookie, visit your newly created page from all computers that you would like to exclude from your reports.

3. Create an Exclude filter to remove data from visitors with this cookie. Follow the instructions at http://www.google.com/support/googleanalytics/bin/answer.py?answer=55494 to create a filter with the following settings:

Filter Type: Custom filter > Exclude

Filter Field: User Defined

Filter Pattern: test_value

Case Sensitive: No
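To picture what these settings do, here is a small sketch of the matching logic. It is not Google Analytics code: the class is hypothetical, and it assumes the filter pattern is treated as a regular expression that may match anywhere in the User Defined value.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the exclude filter's effect; not Google Analytics code.
class UserDefinedExcludeFilter {
    // Filter Pattern: test_value, Case Sensitive: No.
    private static final Pattern PATTERN =
        Pattern.compile("test_value", Pattern.CASE_INSENSITIVE);

    // Returns true if a visit carrying this user-defined value would be dropped.
    static boolean excludes(String userDefinedValue) {
        return userDefinedValue != null && PATTERN.matcher(userDefinedValue).find();
    }

    public static void main(String[] args) {
        System.out.println(excludes("test_value")); // internal machine that visited the cookie page
        System.out.println(excludes("TEST_VALUE")); // same value, different case
        System.out.println(excludes(null));         // ordinary visitor without the cookie
    }
}
```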

Helpful links:

  1. http://www.google.com/support/googleanalytics/bin/answer.py?answer=55494

Hadoop Tutorials – a few good links

Here is a list of documents I found really useful when I first started working with Hadoop.

Below are the two papers from Google on the MapReduce paradigm and GFS :

  1. Map Reduce: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/mapreduce-osdi04.pdf
  2. Google File System: http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/gfs-sosp2003.pdf

Tips for improving Map-reduce:

  1. http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/

General tutorials:

  1. http://www.javaworld.com/javaworld/jw-09-2008/jw-09-hadoop.html
  2. http://developer.yahoo.com/hadoop/tutorial
  3. Troubleshooting – http://www.cs.brandeis.edu/~cs147a/lab/hadoop-troubleshooting/

Hadoop installation on Ubuntu

Below is an excellent article by Michael Noll on how to install Hadoop on a single-node cluster.


He goes on to cover a 2-node Linux cluster installation of Hadoop, which you can follow once the single node has been installed successfully.


Installing on Windows using a VM: (I haven’t personally tried it, but it seems a quick and easy way to set up Hadoop)

Ubuntu – installation

Here are a few links useful when installing Ubuntu:

  1. Video depicting dual-boot along with Windows on the system. http://www.youtube.com/watch?v=w8a-smrPlvE&feature=related
  2. Enable wireless WPA access point : http://www.debianadmin.com/enable-wpa-wireless-access-point-in-ubuntu-linux.html
  3. 13 things to do after Ubuntu installation : http://linuxondesktop.blogspot.com/2007/02/13-things-to-do-immediately-after.html
  4. Java installation : http://www.ubuntugeek.com/install-java-runtime-environment-jre-in-ubuntu-9-10-karmic.html
  5. Sample Java program : Follow these steps to test the installation with a Java program. Run vi Sample.java and type in this code snippet :
    class Sample {
      public static void main(String args[]) {
        System.out.println("Sample program");
      }
    }
  6. Compile and run it :
    prashant@xyz-laptop:~$ javac Sample.java
    prashant@xyz-laptop:~$ java Sample
    Sample program
  7. If you see the above output, you can be sure the Java installation is working fine. If you see errors, check that javac has been installed correctly by running whereis javac. If it cannot be located, install the Java SDK as described in the above article.
  8. How to set the environment variable JAVA_HOME – locate where Java is installed (mine was at /usr/lib/jvm/java-6-sun), then run sudo vi /etc/bash.bashrc
    and insert
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
  9. Now exit the shell and reopen it, type echo $JAVA_HOME and you should see the path set
  10. Skype installation : https://help.ubuntu.com/community/Skype
  11. chm reader : sudo apt-get install xchm
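The JAVA_HOME check from the steps above can also be done from Java itself. The sketch below is only an illustration: the class name and messages are made up, and it assumes a JDK layout where the compiler sits at bin/javac under JAVA_HOME.

```java
import java.io.File;

// Hypothetical helper for checking the JAVA_HOME setting.
class JavaHomeCheck {

    // Describes whether the given JAVA_HOME value looks usable.
    static String describe(String javaHome) {
        if (javaHome == null || javaHome.isEmpty()) {
            return "JAVA_HOME is not set";
        }
        File javac = new File(new File(javaHome, "bin"), "javac");
        if (!javac.isFile()) {
            return "JAVA_HOME is set to " + javaHome + " but bin/javac was not found there";
        }
        return "JAVA_HOME looks fine: " + javaHome;
    }

    public static void main(String[] args) {
        // Reads the variable exported in /etc/bash.bashrc (after reopening the shell).
        System.out.println(describe(System.getenv("JAVA_HOME")));
    }
}
```

Compile and run it the same way as Sample.java (javac JavaHomeCheck.java, then java JavaHomeCheck).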

I will continue to add helpful resources as I proceed with my installation…

CakePHP – Creating a dropdown list

It took me a while to figure out the dropdown equivalent in the CakePHP framework, and I finally realized it’s as simple as creating a text box or a button. Here is how it’s done:

“select” is the keyword used to create the dropdown, here is the syntax from CakePHP help docs :

select(string $fieldName, array $options, mixed $selected, array $attributes, boolean $showEmpty)

Creates a select element, populated with the items in $options, with the option specified by $selected shown as selected by default. Set $showEmpty to false if you do not want an empty select option to be displayed.

$options = array('M' => 'Male', 'F' => 'Female');

echo $form->select('gender', $options)

Will output:

<select name="data[User][gender]" id="UserGender">

<option value=""></option>

<option value="M">Male</option>

<option value="F">Female</option>

</select>

What you will observe in the dropdown list are the 2 options, Male and Female. However, after persisting this to the database, your table will contain 'M' or 'F' depending on your selection. This is because while declaring the parameter $options, you assigned 'M' => 'Male', 'F' => 'Female'.
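The key/label split is easier to see when the rendering is spelled out. The following is a language-neutral illustration written in Java, not CakePHP code: the class and method are hypothetical and simply mimic the shape of the output shown above, with each key becoming the value attribute (what gets saved) and each value becoming the visible label.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; this is not how CakePHP is implemented.
class SelectSketch {

    // Renders a select element: map key -> value attribute, map value -> visible label.
    // No HTML escaping is done, so this is only suitable for trusted sample data.
    static String render(String name, String id, Map<String, String> options) {
        StringBuilder html = new StringBuilder();
        html.append("<select name=\"").append(name)
            .append("\" id=\"").append(id).append("\">\n");
        html.append("<option value=\"\"></option>\n"); // the empty option shown by default
        for (Map.Entry<String, String> option : options.entrySet()) {
            html.append("<option value=\"").append(option.getKey()).append("\">")
                .append(option.getValue()).append("</option>\n");
        }
        return html.append("</select>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("M", "Male");
        options.put("F", "Female");
        System.out.println(render("data[User][gender]", "UserGender", options));
    }
}
```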

That’s it, you can modify the above snippet as per your requirements and you are good to go!!

Testing Apache, MySQL, PHP installation using Glassfish 1.5

There is not much documentation on how to test whether Apache, MySQL and PHP have been installed properly after building the package from Glassfish 1.5. I will go into some detail on how you can go about this.

I found the following guidelines extremely helpful while installing AMP on Solaris 10 : http://googlux.com/sunwebstack.html

This is the complete SUN documentation for the same : http://wikis.sun.com/download/attachments/111411306/GFWSICFG.pdf

To verify web server is running fine :

I will now provide the next step, that is testing whether PHP is working with your up and running web server, Apache. Before that, I am assuming that you have started the web server, you can verify the same by checking against the installed services :

bash-3.00$ svcs -a |grep apache
online Nov_16 svc:/network/http:sun-apache22

If not, start the server as described in the above documents.

To verify PHP is working and well configured with Apache :

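  1. Go to Apache’s DocumentRoot directory (by default /var/opt/webstack/apache2/2.2/htdocs, as shown in httpd.conf below) and confirm with : bash-3.00$ pwd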
  2. Create a test file with .php extension : vi test.php
  3. Insert the following code :
    <?php phpinfo(); ?>
  4. Now open up a browser and enter the URL : http://server-name/test.php. If you are running the webserver locally, you can enter http://localhost/test.php

What just happened ?

The webserver looks inside the DocumentRoot, “/var/opt/webstack/apache2/2.2/htdocs”, to find the source files. This is the default location defined in httpd.conf, a configuration file that sets several parameters, for example the port used to access the website and the DocumentRoot location. You can change the location as per your convenience/requirement. If you do so, restart Apache to make sure the webserver picks up the change.

bash-3.00$ pwd

bash-3.00$ ls -lrt httpd.conf

-rw-r--r-- 1 root bin 13534 Jul 14 22:51 httpd.conf

bash-3.00$ cat httpd.conf | grep DocumentRoot
# DocumentRoot: The directory out of which you will serve your
DocumentRoot "/var/opt/webstack/apache2/2.2/htdocs"
# This should be changed to whatever you set DocumentRoot to.
# access content that does not live under the DocumentRoot.
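The grep above also returns comment lines that merely mention DocumentRoot. If you want only the active setting, a few lines of code can pick it out. This is a rough sketch with a hypothetical class name; it assumes the directive is spelled exactly DocumentRoot and sits on one line, which is a simplification of Apache’s real parsing rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; a simplification of how httpd.conf is really parsed.
class DocumentRootFinder {

    // Returns the value of the first active (non-comment) DocumentRoot directive, or null.
    static String find(List<String> confLines) {
        for (String raw : confLines) {
            String line = raw.trim();
            if (line.startsWith("#")) {
                continue; // skip comments that only mention DocumentRoot
            }
            if (line.startsWith("DocumentRoot")) {
                String value = line.substring("DocumentRoot".length()).trim();
                // Strip the surrounding double quotes, if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Lines copied from the grep output above.
        List<String> lines = Arrays.asList(
            "# DocumentRoot: The directory out of which you will serve your",
            "DocumentRoot \"/var/opt/webstack/apache2/2.2/htdocs\"",
            "# This should be changed to whatever you set DocumentRoot to.");
        System.out.println(find(lines));
    }
}
```

To run it against the real file, read the lines with java.nio.file.Files.readAllLines and pass them to find.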

We then created a php file named test.php, and inserted 1 line of code. phpinfo() is a function which outputs information regarding the current state of php, that is the version number, configuration parameters, server properties to name a few.

Then we simply opened a browser and entered the corresponding URL. If PHP is working and recognized by the Apache web server, you should see a page full of PHP-related information.

If all goes well, you can be sure php and apache are working fine on your Solaris box!!
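The browser check can be scripted as well. The sketch below is an assumption-laden illustration: the class name is made up, the default URL is the http://localhost/test.php from the steps above, and it treats the text “PHP Version”, which the phpinfo() page prints in its heading, as the sign that PHP actually executed. If Apache serves the file without running PHP, the raw source comes back instead and the check returns false.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking the test.php page from the command line.
class PhpInfoCheck {

    // True if the response looks like an executed phpinfo() page.
    static boolean looksLikePhpInfo(int status, String body) {
        return status == 200 && body != null && body.contains("PHP Version");
    }

    // Fetches the URL and applies the check above.
    static boolean check(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        int status = conn.getResponseCode();
        if (status != 200) {
            return false;
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            String body = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            return looksLikePhpInfo(status, body);
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://localhost/test.php";
        try {
            System.out.println(check(url) ? "PHP is working" : "PHP does not seem to be working");
        } catch (IOException e) {
            System.out.println("Could not reach " + url + ": " + e.getMessage());
        }
    }
}
```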

I will add more details regarding testing MySQL soon..

How to add Google Analytics to your blog

I work for a web analytics team at an e-commerce company, so I was feeling rather guilty about having a blog but not tracking its visits, unique visitors, bounce rate, average time on site, user behavior.. Well, I guess you get it !!

Google analytics is an easy to use service which can be easily integrated with your blog ( or for that matter your website ). Here are the steps to do so, I have divided them into a 2 step process :

Obtain Google Analytics code block to be pasted on your blogs’ base HTML layout :

  1. Go to http://www.google.com/analytics
  2. Click on Add website
  3. Select Add a Profile and enter the URL of your blog
  4. Choose the appropriate Country and Time Zone. Google does a good job of identifying and pre-selecting it for you, but in case you need it be changed, go ahead and do so.
  5. You will now be provided a block of HTML code that Google analytics provides, Copy-Paste this chunk onto Notepad or somewhere else for now.
Adding the code block to your blog :
  1. Sign-in to your blogger account : www.blogger.com
  2. Go to the dashboard and click on Layout tab.
  3. Now click on “Edit HTML” and look at the bottom of this page, you will find something like
    <!-- end outer-wrapper -->

    (Google Analytics Code Block to be placed here)

    </body>
    </html>
  4. Paste the code block just above the closing </body> tag
    and click “Save Changes”
  5. To make sure this works, go back and login to www.google.com/analytics and select your blog’s URL from the drop-down on top left corner of the page. Here you will either see Receiving Data (analytics successfully installed) or Tracking Not Installed (we went wrong somewhere). For cases with Tracking Not Installed, click on Check Status. This tells Google to check for your code block, and you will receive a message suggesting whether Google was able to find it or not. If not, you should try pasting the code block again. https://www.google.com/analytics/settings/check_status_profile_handler
Hope this helps !!

Bake some PHP with Cake…!!

Once upon a time, it was painful to develop websites using different technologies, and even harder to configure them to work in sync. With PHP coming into the limelight, it’s become a lot easier to develop websites from scratch…

I started learning this language when I was given a project at work to design and implement a tool for scheduling some of our jobs, which until then had been done manually. After reading a bit about PHP, it was clear that Apache, MySQL and PHP would be the easiest way to cook up a website. Some of the advantages of using this bundle :

  1. Pre-configured packages such as WAMP ( Windows ), LAMP and XAMPP make it incredibly easy to install PHP, Apache HTTP server and MySQL.
  2. There is a bundle for Solaris too, it’s called SAMP. The latest version can be installed using Glassfish 1.5, provided by Sun Microsystems.
  3. Once installed, you are ready to go !! No need to configure any parameters manually, ain’t that awesome !! (well, you might want to edit httpd.conf to set the path to the webroot, the location where your source files are housed and picked up by the webserver, or to change the port on which your localhost is accessed)
You will find several tutorials on installing WAMP, LAMP or XAMPP and I won’t go into the details. Simply google it, and you will have a bunch of documents you could refer to…

I started off by developing login and registration ( authentication ) for my website. It was fairly straightforward: I created a USERS table with some clichéd fields such as username, password, firstname, lastname, email, created_date, modified_date….well, you get the idea…basically whatever you want to save about a user.

id integer auto_increment,
username char(50),
password char(100),
firstname char(30),
lastname char(30),
email char(30),

Getting such authentication up for my site did not take long, but I then realized there must be frameworks which make such tasks simpler for us. After googling for a bit, I found that CakePHP is one such framework. And how do you implement authentication using it? With probably 15-20 lines of code….I’m not kidding, I did mean 15-20 lines.

Stay tuned, I will go into more detail on installing CakePHP and getting it to work with your existing AMP. I’m using WAMP, so the steps I follow will most likely apply to the same kind of installation.

A few useful links I will leave you with :

  1. Download Eclipse – I use this as an editor for php development
  2. Install Wamp : http://www.wampserver.com/en/download.php
  3. Install UML tool : http://amateras.sourceforge.jp/cgi-bin/fswiki_en/wiki.cgi?page=AmaterasUML – I like drawing up some UML diagrams before actually getting to implementation. You can skip this step.