PigStorage is probably the most frequently used load/store function. It parses input records based on a delimiter, and the resulting fields can be referenced positionally or by alias. Starting with version 0.10, Pig adds a couple of PigStorage options that can be very useful, and I will explain them here.

Options

  1. Schema: Reads/Stores the schema of the relation using a hidden JSON file (.pig_schema). Consider the following example:
    grunt> cat example;
    1 pig apache
    2 hadoop apache
    grunt> A = LOAD 'example' using PigStorage('\t') as (id:int, project:chararray, org:chararray);
    grunt> B = foreach A generate project, org;
    grunt> describe B;
    B: {project: chararray,org: chararray}
    grunt> store B into 'output';
    grunt> cat output;
    pig apache
    hadoop apache

    Schema for alias B is {project: chararray,org: chararray}.

    Now you might want to load the output file and perform further processing on it. Typically this is achieved by loading the dataset ‘output’ using PigStorage and redefining the schema. But this is redundant, and possibly error-prone.

    grunt> ExplicitSchema = LOAD 'output' using PigStorage('\t') as (project:chararray, org:chararray);

    In the line above, we have to explicitly define the schema for the dataset ‘output’.

    With Pig 0.10, we now have an option to pass PigStorage the argument ‘-schema’ while storing data. This will create a ‘.pig_schema’ file in the output directory which is a JSON file containing the schema.

    store B into 'output' using PigStorage('\t', '-schema');

    So the next time you load ‘output’, you only need to specify the location of output to LOAD.

    grunt> WithSchema = LOAD 'output';
    grunt> describe WithSchema;
    WithSchema: {project: chararray,org: chararray}

    If you do not want the schema to be loaded, you can disable it with ‘-noschema’:

    grunt> WithSchemaDisabled = LOAD 'output' using PigStorage('\t', '-noschema');
    grunt> describe WithSchemaDisabled;
    Schema for WithSchemaDisabled unknown.

    Another useful property of this option is that it creates a header file containing the column names.

    From the PigStorage Javadoc: this file “.pig_headers” is dropped in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data).
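    For instance, after storing with ‘-schema’ you could cat the header file and prepend it to the data for export. This is an illustrative session (run in local mode; the part-file name and exact header formatting may vary by Pig version):

    grunt> store B into 'output' using PigStorage('\t', '-schema');
    grunt> cat output/.pig_headers;
    project	org
    grunt> sh cat output/.pig_headers output/part-m-00000 > export.tsv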

    To summarize (thanks to Dmitriy for suggesting adding this section):

      • PigStorage always tries to load the .pig_schema file, unless you explicitly say -noschema.
      • If you don’t specify anything at all, PigStorage will try to load a schema, and silently fail (behave as before) if it’s not present or unreadable.
      • If you specify -schema during loading, PigStorage will fail if a schema is not present.
      • If you specify -noschema during loading, PigStorage will ignore the .pig_schema file.
      • PigStorage will only *store* the schema if you specify -schema.
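    For example, to fail fast if the schema file is missing, you can pass ‘-schema’ explicitly when loading. A sketch, assuming the ‘.pig_schema’ written by the earlier store is present:

    grunt> StrictSchema = LOAD 'output' using PigStorage('\t', '-schema');
    grunt> describe StrictSchema;
    StrictSchema: {project: chararray,org: chararray}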

     

  2. Source tagging: Adds the source filename as the first field of each tuple. (Please refer to the UPDATE below before reading this.)

    You may sometimes want to know the exact file a record came from. For example, say we have a log dataset partitioned by application server id: all log events from <app server 1> are contained in a file logs_app_server_1.gz, all events from <app server 2> are contained in logs_app_server_2.gz, and so on. When you read all of these log files at once, you may want to include the app server id in your analysis. PigStorage can now (as of Pig 0.10) be used to accomplish this: if ‘-tagsource’ is specified, PigStorage prepends the input split path to each tuple.

    You also need to ensure pig.splitCombination is set to false. By default, Pig can combine small input files (based on the property pig.splitCombination) and pass them to a single mapper, in which case it would be difficult to determine the exact input filename for each combined record.
    set pig.splitCombination false;

    Usage:

    A = LOAD 'input' using PigStorage(',','-tagsource'); 
    B = foreach A generate $0 as input_filename;

    The first field in each Tuple will contain input filename.

    For example, let’s say we have 2 files, ‘data’ and ‘data2’.

    grunt> cat data
    [open#apache,1#2,11#2]
    [apache#hadoop,3#4,12#hadoop]
    
    grunt> cat data2
    1	10
    4	11
    5	10
    
    grunt> set pig.splitCombination false;
    
    grunt> A = load 'data*' using PigStorage('\t', '-tagsource');
    
    grunt> dump A;
    (data,[open#apache,1#2,11#2])
    (data,[apache#hadoop,3#4,12#hadoop])
    (data2,1,10)
    (data2,4,11)
    (data2,5,10)

    CAVEAT:
    Disabling “pig.splitCombination” can have a negative effect on the performance of Pig jobs. Note that this property is enabled by default: it combines multiple small input files and passes them to a single mapper, up to the size specified by the property “pig.maxCombinedSplitSize”. For example, if you have 10 files of 25 MB each and “pig.maxCombinedSplitSize” is set to 256 MB, Pig combines all 10 files and passes them to a single mapper.
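    When split combination is left enabled, the combined-split size can be tuned with this property. The value is in bytes; the figure below is just for illustration:

    grunt> set pig.maxCombinedSplitSize 268435456; -- 256 MB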

    Please do consider this performance hit while using ‘-tagsource’.

There are many more features and improvements to look forward to in Pig 0.10; these PigStorage options are only a couple of them!

UPDATE 05/28/2012

A fix was made (https://issues.apache.org/jira/browse/PIG-2462) due to which we no longer need to disable “pig.splitCombination” when using ‘-tagsource’. Please DO NOT disable this feature in your Pig scripts.
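With that fix, the earlier ‘-tagsource’ example works without any set statement, and each tuple is still tagged with its source file:

    grunt> A = load 'data*' using PigStorage('\t', '-tagsource');
    grunt> dump A;
    (data,[open#apache,1#2,11#2])
    (data,[apache#hadoop,3#4,12#hadoop])
    (data2,1,10)
    (data2,4,11)
    (data2,5,10)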
