PigStorage options – Schema and Source tagging

PigStorage is probably the most frequently used Load/Store utility. It parses input records based on a delimiter, and the resulting fields can be referenced by position or by alias. Starting with version 0.10, Pig offers a couple of PigStorage options that can be very useful, and I will explain them here.

Options

Schema: Reads/Stores the schema of the relation using a hidden JSON file (.pig_schema). Consider the following example:
```
grunt> cat example;
1 pig apache
2 hadoop apache
grunt> A = LOAD 'example' using PigStorage('\t') as (id:int, project:chararray, org:chararray);
grunt> B = foreach A generate project, org;
grunt> describe B;
B: {project: chararray,org: chararray}
grunt> store B into 'output';
grunt> cat output;
pig apache
hadoop apache
```
Schema for alias B is {project: chararray,org: chararray}.

Now you might want to load the output file and perform further processing on it. Typically this is achieved by loading the dataset ‘output’ using PigStorage and redefining the schema. But this is redundant, and possibly error-prone.
```
grunt> ExplicitSchema = LOAD 'output' using PigStorage('\t') as (project:chararray, org:chararray);
```
In the above line, we have to explicitly define the schema for the dataset ‘output’.

With Pig 0.10, we now have an option to pass PigStorage the argument ‘-schema’ while storing data. This creates a ‘.pig_schema’ file in the output directory; it is a JSON file containing the schema.
```
store B into 'output' using PigStorage('\t', '-schema');
```
So the next time you load ‘output’, you only need to specify the location of output to LOAD.
```
grunt> WithSchema = LOAD 'output';
grunt> describe WithSchema;
WithSchema: {project: chararray,org: chararray}
```
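Because the schema comes back with the data, the fields of the reloaded relation can be referenced by alias right away. A minimal sketch continuing the session above, assuming ‘output’ was stored with ‘-schema’ (the relation names ApacheOnly and Projects are only illustrative):
```
-- The schema is read from the .pig_schema file in 'output', so no AS clause is needed.
WithSchema = LOAD 'output';
-- Fields can be referenced by the aliases recorded in the schema file.
ApacheOnly = FILTER WithSchema BY org == 'apache';
Projects = FOREACH ApacheOnly GENERATE project;
DUMP Projects;
```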
If you do not want the schema to be loaded, you can disable it with ‘-noschema’.
```
grunt> WithSchemaDisabled = LOAD 'output' using PigStorage('\t', '-noschema');
grunt> describe WithSchemaDisabled;
Schema for WithSchemaDisabled unknown.
```
Another useful property of the ‘-schema’ option is that it creates a header file containing the column names.

From the PigStorage Javadoc: This file “.pig_headers” is dropped in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data).

To summarize (thanks to Dmitriy for suggesting adding this section): storing with ‘-schema’ writes the ‘.pig_schema’ file and the header file to the output directory, a later LOAD of that directory picks up the schema automatically, and ‘-noschema’ tells PigStorage to ignore it.

Source tagging: Adds the source filename as the first field of each tuple. (Please refer to the UPDATE below before reading this.)

You may sometimes want to know the exact file that a record came from. For example, let’s say we have a log dataset that is partitioned by application server id. All log events from <app server 1> are contained in a file logs_app_server_1.gz, all events from <app server 2> are contained in logs_app_server_2.gz, and so on. When you read all these app log files at once, you may want to bring the app server id into your analysis.

PigStorage can now (as of Pig release 0.10) be used to accomplish this. If -tagsource is specified, PigStorage will prepend the input split path to each tuple/row. The user needs to ensure pig.splitCombination is set to false. This is because Pig by default can combine small files (based on the property pig.splitCombination) and pass them to a single mapper, in which case it would be difficult to determine the exact input filename from the combined records.
```
set pig.splitCombination false;
```
Usage:
```
A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate $0 as input_filename;
```
The first field in each tuple will contain the input filename.

For example, let’s say we have two files, ‘data’ and ‘data2’.
```
grunt> cat data
[open#apache,1#2,11#2]
[apache#hadoop,3#4,12#hadoop]

grunt> cat data2
1	10
4	11
5	10

grunt> set pig.splitCombination false;

grunt> A = load 'data*' using PigStorage('\t', '-tagsource');

grunt> dump A;
(data,[open#apache,1#2,11#2])
(data,[apache#hadoop,3#4,12#hadoop])
(data2,1,10)
(data2,4,11)
(data2,5,10)
```
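The app-server log scenario described earlier could be handled along the same lines. The following is only a sketch: it assumes the files are named logs_app_server_<id>.gz as in the example, that the tag is the bare file name (as in the dump above), and that the logs are tab-delimited; the relation names and the server_id alias are hypothetical.
```
-- Load every app server log at once, tagging each row with its source file name.
logs = LOAD 'logs_app_server_*.gz' USING PigStorage('\t', '-tagsource');

-- $0 is the tag added by -tagsource; pull the server id out of the file name.
tagged = FOREACH logs GENERATE
    REGEX_EXTRACT((chararray)$0, 'logs_app_server_([0-9]+)[.]gz', 1) AS server_id;

-- Count log events per app server.
by_server = GROUP tagged BY server_id;
counts = FOREACH by_server GENERATE group AS server_id, COUNT(tagged) AS events;
DUMP counts;
```
The tag is referenced positionally here, mirroring the usage snippet above, rather than through an AS clause.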
CAVEAT:
Disabling “pig.splitCombination” can have a negative effect on the performance of Pig jobs. Please note this property is turned on by default. It combines multiple small input files and passes them to a single mapper. Files are combined up to the size specified by the property “pig.maxCombinedSplitSize”. For example, if you have 10 files of 25 MB each and “pig.maxCombinedSplitSize” is set to 256 MB, PigStorage combines all 10 files and passes them to a single mapper.
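
For reference, the combined-split ceiling from this example can be set explicitly in a script. A minimal sketch, assuming the property value is given in bytes:
```
-- Leave split combination on (the default) and cap combined splits at 256 MB.
-- 256 * 1024 * 1024 = 268435456; the unit (bytes) is an assumption here.
set pig.maxCombinedSplitSize 268435456;
```
With ten 25 MB inputs (250 MB in total), everything fits under this cap, which is why a single mapper ends up handling all ten files.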

Please do consider this performance hit while using ‘-tagsource’.

Pig 0.10 has many more features and improvements to look forward to; these PigStorage options are just one example!

UPDATE 05/28/2012

A fix has since been made (https://issues.apache.org/jira/browse/PIG-2462), so we no longer need to disable “pig.splitCombination”. Please DO NOT disable this feature in your Pig scripts.

4 thoughts on “PigStorage options – Schema and Source tagging”

Isaac Hepworth (@isaach) said:

August 3, 2012 at 3:23 am

great info, thank you! it’s a shame that -tagsource doesn’t bring in the full path of the input data but it’s a great idea.

- prash1784 said:
  
  August 3, 2012 at 8:34 pm
  
  Please watch https://issues.apache.org/jira/browse/PIG-2857. That should be coming in soon.
  
Andy Skelton said:

February 5, 2014 at 3:27 pm

I have a Pig job with so many S3 inputs that it would take hours just to get the first jar started. Specifying -noschema reduced that time to minutes. Thanks!

Pingback: HBase tables from Hive | Kevin's Blog

Hadoopified

~ Almost everything Hadoop!
