Pig is shipped with Hadoop bundled in. A user can either use the Pig jar bundled with Hadoop, or provide their own Hadoop version. So there are two ways to start up Pig:

  1. bundled (Pig shipped with Hadoop classes: pig.jar)
  2. non-bundled (Hadoop provided by the user: pig-withouthadoop.jar)

As the name suggests, pig-withouthadoop.jar does not contain any Hadoop libs/classes, whereas pig.jar bundles Hadoop 1.0.0 (as of release 0.11.1).
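A quick way to confirm this is to list each jar's contents (jar names as shipped in the Pig tarball; the grep check is just a sanity test):

jar tf pig.jar | grep -m1 'org/apache/hadoop'               # prints a Hadoop class name: Hadoop is bundled
jar tf pig-withouthadoop.jar | grep -c 'org/apache/hadoop'  # prints 0: no Hadoop classes inside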

How does pig know which hadoop to use?

For this, let’s understand how the startup script (bin/pig) works. Here’s a simple representation of what happens:

  1. Use non-bundled if hadoop is set on the classpath
  2. If not, use non-bundled if HADOOP_HOME is set
  3. If not, fall back on the bundled pig.jar
  4. Note that steps 1 and 2 must actually point to a Hadoop installation; if they don't, the Pig startup script will fall back on bundled Pig (see the sketch below).
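Here is a minimal sketch of that decision logic. This is not the verbatim bin/pig script; PIG_HOME and the exact invocation are illustrative:

HADOOP_BIN=""
if which hadoop >/dev/null 2>&1; then
    HADOOP_BIN=$(which hadoop)                    # step 1: hadoop executable found on PATH
elif [ -n "$HADOOP_HOME" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    HADOOP_BIN="$HADOOP_HOME/bin/hadoop"          # step 2: HADOOP_HOME points to an installation
fi

if [ -n "$HADOOP_BIN" ]; then
    # non-bundled: hand off to the Hadoop startup script, with pig-withouthadoop.jar on its classpath
    export HADOOP_CLASSPATH="$PIG_HOME/pig-withouthadoop.jar:$HADOOP_CLASSPATH"
    export HADOOP_OPTS="$HADOOP_OPTS $PIG_OPTS"
    exec "$HADOOP_BIN" org.apache.pig.Main "$@"
else
    # step 3: fall back on the bundled pig.jar
    exec java $PIG_OPTS -cp "$PIG_HOME/pig.jar" org.apache.pig.Main "$@"
fi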

If you were to use Pig with a Hadoop cluster running 0.20.2, it could be done either by setting hadoop on the classpath or by specifying it via HADOOP_HOME.

export HADOOP_HOME=<path_to_hadoop>

On my machine, I have hadoop on the classpath

$ which hadoop
/home/pkommireddi/dev/tools/Linux/hadoop/hadoop-0.20.2/bin/hadoop
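If hadoop is not on your PATH yet, you can put it there yourself (using the install location from above):

export PATH=/home/pkommireddi/dev/tools/Linux/hadoop/hadoop-0.20.2/bin:$PATH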

Specifying additional Java runtime options

Additional runtime options can be specified using PIG_OPTS

export PIG_OPTS="-Dmapred.job.queue.name=my-queue"

Pro-tip: Make sure you are not inadvertently overriding a PIG_OPTS that might have been set elsewhere. Instead, append to PIG_OPTS:

export PIG_OPTS="$PIG_OPTS -Dmapred.job.queue.name=my-queue"

Interesting behavior when using user-specified hadoop (non-bundled):
In non-bundled mode, the Pig startup script uses the Hadoop startup script ($HADOOP_HOME/bin/hadoop). It adds PIG_OPTS to HADOOP_OPTS and invokes the Hadoop startup script (let's call it HSS). HSS picks up environment variables from "hadoop-env.sh". It is entirely possible that the HADOOP_OPTS passed to HSS is not used at all. This happens if hadoop-env.sh overrides it. For example,

export HADOOP_OPTS="-server -Dlog4j.configuration=log4j.properties"

In this case, the HADOOP_OPTS set in hadoop-env.sh discards the user-passed options. Instead, similar to the PIG_OPTS tip above, hadoop-env.sh should append to the existing HADOOP_OPTS:

export HADOOP_OPTS="$HADOOP_OPTS -server -Dlog4j.configuration=log4j.properties"

Specifying additional classpath entries:

Use PIG_CLASSPATH to specify additional classpath entries. For example, to add Hadoop configuration files (hadoop-site.xml, core-site.xml) to the classpath:

export PIG_CLASSPATH=<path_to_hadoop_conf_dir>
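PIG_CLASSPATH is a regular classpath string, so multiple entries are separated by colons. For instance, to add a conf directory plus an extra jar (both paths illustrative):

export PIG_CLASSPATH=/etc/hadoop/conf:/opt/libs/extra-lib.jar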

Starting with version 0.12, you should be able to override default classpath entries (that is, have your own entries searched first) by setting PIG_USER_CLASSPATH_FIRST:

export PIG_USER_CLASSPATH_FIRST=true
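For example, to make Pig pick up your copy of a library before the version on its default classpath (jar path hypothetical):

export PIG_CLASSPATH=/opt/libs/my-newer-lib.jar
export PIG_USER_CLASSPATH_FIRST=true
pig script.pig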

Specifying additional jars:

At times you might need jars from libraries not included in the Pig distribution. These could be your custom UDFs, or 3rd-party libs. Pig lets you add these jars to the classpath using "pig.additional.jars":

 pig -Dpig.additional.jars=myjar.jar script.pig

Alternatively, you could use REGISTER within your script, as sketched below.
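Here is a minimal sketch of the REGISTER approach; myjar.jar and the UDF class com.example.UPPER are hypothetical stand-ins for your own jar and UDF:

-- register the jar so its classes are shipped with the job
REGISTER myjar.jar;
A = LOAD 'input.txt' AS (name:chararray);
B = FOREACH A GENERATE com.example.UPPER(name);
DUMP B;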

Debugging:

Pig provides a command for users to debug pig startup issues related to the classpath. It provides information on:

  1. Hadoop version
  2. Pig version
  3. classpath
  4. Java runtime options
  5. bundled vs non-bundled

This is the “-secretDebugCmd” option – it really should not be such a secret :)

$ pig -x local -secretDebugCmd

Cannot find local hadoop installation, using bundled Hadoop 1.0.0
dry run:
/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/bin/java -Xmx1000m -Dpig.log.dir=/Users/pkommireddi/work/pig/pig-latest/bin/../logs -Dpig.log.file=pig.log -Dpig.home.dir=/Users/pkommireddi/work/pig/pig-latest/bin/.. -classpath /Users/pkommireddi/work/pig/pig-latest/bin/../conf:/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home/lib/tools.jar:/Users/pkommireddi/work/pig/pig-latest/bin/../build/ivy/lib/Pig/jython-standalone-2.5.2.jar:/Users/pkommireddi/work/pig/pig-latest/bin/../build/ivy/lib/Pig/jython-standalone-2.5.3.jar:/Users/pkommireddi/work/pig/pig-latest/bin/../build/ivy/lib/Pig/jruby-complete-1.6.7.jar:/Users/pkommireddi/work/pig/pig-latest/bin/../pig.jar org.apache.pig.Main -x local

Conclusion:

If it’s under your control, always try to add to existing Hadoop or Pig options instead of resetting them. Of course, at times you might not have control over hadoop-env.sh, in which case you can pass Pig options when running a script:

"pig -Dmapred.job.queue.name=my-queue script.pig"