Pig in Big Data | IndianTechnoEra


Unit 4 PIG: Introduction, Execution Modes of Pig, Comparison of Pig with Databases, Grunt, Pig Latin, User Defined Functions, Data Processing Operators.




What is Pig?

Pig is a high-level programming language used for processing large data sets. 

It is designed to simplify the writing of MapReduce programs. 

Pig enables data workers to write complex data transformations without having to write Java MapReduce code. 

Pig programs are written in a language called Pig Latin, a data flow language with some similarity to SQL.

Pig programs can be run on Apache Hadoop clusters for distributed processing of large data sets.


Pig architecture 

The Apache Pig architecture consists of a high-level language, Pig Latin, and a run-time environment. The language is a data flow language that allows users to write complex data transformations, as well as a set of operators for reading, writing, and processing data. 

The run-time environment includes a compiler which translates the Pig Latin code into a set of MapReduce jobs which are executed on a Hadoop cluster. 

The Pig architecture also includes a data storage system which includes a distributed file system, such as HDFS, as well as a database, such as HBase. Finally, the Pig architecture includes a library of user-defined functions (UDFs) which can be used to extend the functionality of Pig Latin.

[Figure: Pig architecture]


What are the features of Pig?

1. Multi-Query Language Support: Pig supports a high-level language called Pig Latin. It is a blend of procedural and declarative programming languages and allows developers to write complex data transformations in a simpler way.

2. Powerful Optimization Engine: Pig provides a powerful optimization engine that enables developers to make efficient use of the data by optimizing data flows.

3. Schema Awareness: Pig can process both structured and semi-structured data from multiple sources. Declaring a schema is optional; when one is not declared, Pig can infer the structure of the data at load time.

4. Scalability: Pig is highly scalable and can process data in parallel on multiple machines. It can also be used to process large data sets efficiently.

5. Extensibility: Pig is extensible and can be used to develop custom functions and data processing operations.

6. Fault Tolerance: Because Pig runs on Hadoop, failed tasks are automatically re-executed, so jobs can recover from machine failures and transient errors.


Execution Modes of Pig

There are two execution modes of Pig:

1. Local Mode: In Local Mode, Pig runs in a single local JVM (Java Virtual Machine) and accesses the local file system; data is not distributed across a cluster. This mode (started with pig -x local) is useful for prototyping and debugging on small data sets.

2. MapReduce Mode: In MapReduce Mode, Pig runs on a Hadoop cluster and accesses HDFS. Pig Latin statements are compiled into MapReduce jobs that run in parallel across the cluster, with data distributed in HDFS. This is the default mode (started with pig or pig -x mapreduce).


Grunt in Pig

Grunt is Pig's interactive command-line shell. Starting Pig without a script (for example, pig -x local) drops you into the grunt> prompt, where Pig Latin statements and utility commands (such as fs, sh, history, and clear) can be run interactively.

Grunt is useful for exploring data, testing statements line by line, and debugging scripts before running them in batch mode.


What is Pig Latin?

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. 

It is a textual language that abstracts programming from the Java MapReduce idiom into a higher-level notation, in much the way SQL does for relational databases.


User Defined Functions in big data

In the big data world, user-defined functions are used to process large volumes of data in an efficient and cost-effective manner. 

Big data systems such as Apache Spark and Hadoop allow for the creation of user-defined functions (UDFs) that can be used to transform and analyze data. 

UDFs can be used to perform complex calculations on large datasets, such as statistical analysis, machine learning, and data mining. 

UDFs can also be used to perform data cleaning tasks, such as extracting relevant information from raw text or formatting data for use in other systems.
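Pig UDFs are most commonly written in Java, but Pig can also run Python UDFs via Jython. As a minimal sketch of the data-cleaning use case above, the core logic of a hypothetical UDF might look like this (the function name, registration line, and call site are illustrative, not from the original article):

```python
# Illustrative Python UDF logic for Pig. In a Pig script this file would
# typically be registered and called along these (hypothetical) lines:
#   REGISTER 'string_udfs.py' USING jython AS myfuncs;
#   cleaned = FOREACH college_students GENERATE myfuncs.normalize_city(city);

def normalize_city(city):
    """Trim whitespace and title-case a city name; return None for empty input."""
    if city is None:
        return None
    cleaned = city.strip()
    return cleaned.title() if cleaned else None
```

Keeping the function pure (input in, output out, no side effects) makes the same logic easy to unit-test outside of Pig before registering it in a script.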


Pig Data Processing operators in big data

1. Load: Load data from various sources like HDFS, HBase, relational databases, local file systems etc.

2. Filter: Filter data by applying conditions and selecting only the relevant data needed for further processing.

3. Group: Group the data according to common characteristics like age, gender, location etc.

4. Join: Join two or more datasets together to form a single dataset.

5. Sort: Sort data in ascending or descending order based on a particular field.

6. Union: Union two or more datasets together to form a single dataset with all the records of the datasets combined.

7. Distinct: Select unique records from a dataset.

8. Limit: Restrict the output of the data processing to a certain number of records.

9. Sample: Select a sample of records from a dataset.

10. Store: Save the output of the data processing in a file or data store.
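Several of these operators can be sketched in plain Python on a small in-memory dataset. The records and field positions below are made up for illustration; each step mirrors one Pig operator:

```python
# Simulate a few Pig operators (FILTER, DISTINCT, GROUP, LIMIT)
# on an in-memory list of (id, name, city) tuples.
from itertools import groupby

records = [
    (1, "Asha",  "Chennai"),
    (2, "Ravi",  "Mumbai"),
    (3, "Meena", "Chennai"),
    (2, "Ravi",  "Mumbai"),   # duplicate record
]

# FILTER records BY city == 'Chennai'
chennai = [r for r in records if r[2] == "Chennai"]

# DISTINCT: drop the duplicate tuple
distinct = sorted(set(records))

# GROUP records BY city  ->  {city: [rows]}
by_city = {city: list(rows)
           for city, rows in groupby(sorted(records, key=lambda r: r[2]),
                                     key=lambda r: r[2])}

# LIMIT 2: keep only the first two tuples
first_two = distinct[:2]
```

In real Pig these steps run as distributed MapReduce jobs over HDFS data rather than over a Python list; the point here is only the shape of each transformation.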



Difference between MapReduce and Pig

1. Abstraction level: MapReduce is a low-level processing model that requires writing Java code for each job, while Pig provides the high-level Pig Latin language on top of MapReduce.

2. Lines of code: A transformation that takes hundreds of lines of Java in MapReduce can often be expressed in a few lines of Pig Latin.

3. Skill set: MapReduce development needs Java programming experience; Pig Latin is accessible to anyone familiar with SQL-style data processing.

4. Built-in operations: Joins, groups, filters, and sorts are built-in operators in Pig, but must be hand-coded in MapReduce.

5. Compilation: MapReduce jobs run directly on Hadoop, while Pig scripts are first compiled by the Pig runtime into a series of MapReduce jobs.

Pig Commands

Basic Pig Commands

1. fs: This lists the files in HDFS.

grunt> fs -ls


2. Clear: This will clear the interactive Grunt shell.

grunt> clear


3. History:

This command shows the commands executed so far.

grunt> history


4. Reading Data: Assuming the data resides in HDFS, we need to read it into Pig.

grunt> college_students = LOAD 'hdfs://localhost:9000/pig_data/college_data.txt' 

USING PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );


PigStorage() is the function that loads and stores data as structured text files.


5. Storing Data: The STORE operator is used to store the processed/loaded data.

grunt> STORE college_students INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');


Here, "/pig_Output/" is the directory where the relation will be stored.


6. Dump Operator: This command is used to display the results on screen. It usually helps in debugging.

grunt> DUMP college_students;


7. Describe Operator: It helps the programmer to view the schema of the relation.

grunt> describe college_students;


8. Explain: This command helps to review the logical, physical and map-reduce execution plans.

grunt> explain college_students;


9. Illustrate operator: This shows the step-by-step execution of a sequence of statements on a small sample of the data.

grunt> illustrate college_students;


Intermediate Pig Commands

1. Group: This command works towards grouping data with the same key.

grunt> group_data = GROUP college_students BY firstname;


2. COGROUP: It works similarly to the GROUP operator. The main difference between them is that GROUP is used with a single relation, while COGROUP is used with two or more relations.


3. Join: This is used to combine two or more relations.

Example: To perform a self-join, say relation "customers" is loaded from HDFS into two relations, customers1 & customers2.

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

A join can be a self-join, inner join, or outer join.


4. Cross: This pig command calculates the cross product of two or more relations.

grunt> cross_data = CROSS customers, orders;


5. Union: It merges two relations. The condition for merging is that both relations must have the same number of columns with compatible domains.

grunt> student = UNION student1, student2;


Advanced Commands

1. Filter: This helps in filtering tuples out of a relation, based on certain conditions.

grunt> filter_data = FILTER college_students BY city == 'Chennai';


2. Distinct: This helps in the removal of redundant tuples from the relation.

grunt> distinct_data = DISTINCT college_students;

This filtering creates a new relation named "distinct_data".


3. Foreach: This helps in generating data transformation based on column data.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

This takes the id, age, and city values of each student from the relation student_details and stores them in another relation named foreach_data.


4. Order by: This command displays the result in a sorted order based on one or more fields.

grunt> order_by_data = ORDER college_students BY age DESC;

This will sort the relation “college_students” in descending order by age.


5. Limit: This command retrieves a limited number of tuples from a relation.

grunt> limit_data = LIMIT student_details 4;



Tips and Tricks

Below are a few useful tips and tricks:

1. Enable Compression on your input and output

set input.compression.enabled true;

set output.compression.enabled true;


The above lines must appear at the beginning of the script, so that Pig can read compressed input files and write compressed output files.


2. Join multiple relations

For a left join across, say, three relations (input1, input2, input3), a single statement will not work, because Pig does not support outer joins on more than two relations.

Instead, perform the left join in two steps:

data1 = JOIN input1 BY key LEFT, input2 BY key;

data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

This means two map-reduce jobs.

To perform the above task more efficiently, you can opt for COGROUP. COGROUP can join multiple relations at once, and by default it performs an outer join.
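The two-step left join above can be simulated in plain Python to see what each step produces. The left_join helper and the sample relations input1, input2, input3 below are illustrative, not part of Pig:

```python
def left_join(left, right):
    """Left-join two lists of dicts on 'key'; unmatched left rows pass through."""
    index = {}
    for row in right:
        index.setdefault(row["key"], []).append(row)
    out = []
    for l in left:
        matches = index.get(l["key"])
        if matches:
            for r in matches:
                # merge the matching right row's non-key fields into the left row
                out.append({**l, **{k: v for k, v in r.items() if k != "key"}})
        else:
            out.append(dict(l))  # no match: keep the left row as-is
    return out

input1 = [{"key": 1, "a": "x"}, {"key": 2, "a": "y"}]
input2 = [{"key": 1, "b": "p"}]
input3 = [{"key": 2, "c": "q"}]

# Step 1: data1 = JOIN input1 BY key LEFT, input2 BY key;
data1 = left_join(input1, input2)
# Step 2: data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
data2 = left_join(data1, input3)
```

Each call corresponds to one of the two map-reduce jobs mentioned above; a single COGROUP over all three relations would avoid the intermediate pass.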
