From Pig 0.8 onwards there is a configuration parameter “aggregate.warning”, which is set to true by default. This means that when your Pig script (or UDF) encounters an exception, Pig aggregates the exception messages and prints a summary at the end of the script, something like:

11/12/09 13:35:21 WARN mapReduceLayer.MapReduceLauncher: Encountered Warning DIVIDE_BY_ZERO 14 time(s).
11/12/09 13:35:21 WARN mapReduceLayer.MapReduceLauncher: Encountered Warning UDF_WARNING_3 19113867 time(s).
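
To make the idea of aggregation concrete, here is a minimal sketch in plain Java of counting warnings by kind and printing one summary line per kind, in the style of the output above. This is not Pig's actual implementation; the class WarningAggregator and the warning name MALFORMED_ROW are hypothetical and exist only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig: counts warnings by kind
// instead of writing one log entry per occurrence.
class WarningAggregator {
    private final Map<String, Long> counts = new TreeMap<>();

    // Record one occurrence of a warning kind, e.g. "DIVIDE_BY_ZERO".
    void warn(String kind) {
        counts.merge(kind, 1L, Long::sum);
    }

    // Number of times the given warning kind has been recorded so far.
    long count(String kind) {
        return counts.getOrDefault(kind, 0L);
    }

    // One summary line per warning kind, sorted by kind name.
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sb.append("Encountered Warning ").append(e.getKey())
              .append(' ').append(e.getValue()).append(" time(s).\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WarningAggregator agg = new WarningAggregator();
        String[] rows = {"10/2", "7/0", "oops", "3/0"};
        for (String row : rows) {
            try {
                String[] parts = row.split("/");
                // The result is discarded; we only care whether the row is usable.
                int unused = Integer.parseInt(parts[0]) / Integer.parseInt(parts[1]);
            } catch (ArithmeticException e) {
                agg.warn("DIVIDE_BY_ZERO");
            } catch (RuntimeException e) {
                agg.warn("MALFORMED_ROW"); // hypothetical warning name
            }
        }
        System.out.print(agg.summary());
    }
}
```

However many bad rows there are, the output stays at one line per warning kind, which is the whole point of aggregating.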

What happens if aggregate.warning is set to false?

Disaster! You should expect a huge number of exceptions to be thrown when you are processing web logs. You cannot really predict the correctness of log lines: they could be incorrectly formatted, or missing data or fields. To err is log!

What you can do is handle these exceptions in your code: catch ArrayIndexOutOfBoundsException, NullPointerException, ClassCastException, etc. But web logs are huge, maybe hundreds of gigabytes per day, and generating detailed exceptions in your Hadoop logs is not a good idea. You could potentially have all your datanodes run out of disk space.
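
As a rough sketch of handling these exceptions without writing a stack trace per bad line, the plain-Java helper below returns null for a malformed line instead of throwing. The class name SafeLogField, the space-separated layout, and the field index 8 for the status code are all assumptions made for illustration; adjust them to your own log format.

```java
// Hypothetical helper, not part of Pig: extracts one field from a log line
// and swallows the usual parsing exceptions instead of logging each one.
class SafeLogField {
    // Assumed position of the HTTP status code in a space-separated access log line.
    static final int STATUS_INDEX = 8;

    // Returns the status code, or null when the line is null, too short,
    // or carries a non-numeric status field.
    static Integer statusCode(String line) {
        try {
            String[] fields = line.split(" ");
            return Integer.valueOf(fields[STATUS_INDEX]);
        } catch (NullPointerException
                 | ArrayIndexOutOfBoundsException
                 | NumberFormatException e) {
            return null; // malformed record: skip it quietly
        }
    }

    public static void main(String[] args) {
        String good = "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET /a.gif HTTP/1.0\" 200 2326";
        System.out.println(statusCode(good));
        System.out.println(statusCode("truncated line"));
    }
}
```

Returning null keeps the bad record out of the result without adding anything to the Hadoop logs; counting how often that happens is what the aggregated warnings are for.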

Always make sure “aggregate.warning” is set to true.

EMBEDDED PIG IN JAVA – AGGREGATE WARNINGS

Although from 0.8 onwards this is set to true within Pig scripts, embedding Pig in Java behaves differently. To run a Pig script from within a Java program, you first need to create a PigServer object.

PigServer pigServer = new PigServer(ExecType.MAPREDUCE);
pigServer.registerQuery("A = LOAD '/logs' USING PigStorage();");

However, when you look at the output logs you will find that Pig generates detailed exception messages. This is not what you would expect. I have made a fix for this, but it has been patched only into version 0.10: http://issues.apache.org/jira/browse/PIG-2425

For version 0.9.1 (and lower), the workaround is:

Properties properties = PropertiesUtil.loadDefaultProperties();
properties.setProperty("aggregate.warning", "true");
PigServer pigServer = new PigServer(ExecType.MAPREDUCE, properties);