In this assignment we will experiment with Apache Pig and Pig Latin as a simpler way to write and run Hadoop MapReduce programs. We will redo all the questions from lab 2 using Pig and compare the effort with writing the equivalent MapReduce programs directly. We will also write some larger queries with Pig.
Sample outputs are below:
To use Pig, log in using SSH to gpu1.ddl.ok.ubc.ca with your Novell account. Type pig to open the interactive Pig shell (Grunt), then enter Pig commands.
Write a Pig script that will list all the game records. The data set is available at /user/rlawrenc/416/lab2/small/games.txt. There should be 100 records printed.
Write a Pig script that, given a game id (a hard-coded constant), returns the matching game record if it exists. The data set is available at /user/rlawrenc/416/lab2/small/games.txt.
Write a Pig script that lists only the players over 18. The output should be sorted by age in ascending order. The data set is available at /user/rlawrenc/416/lab2/small/players.txt. Note that this question is harder than the rest because it requires defining a user-defined function (UDF). It is best to leave this question until last.
Write a Pig script that will calculate the number of players per game. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
Write a Pig script that will output the top 10 scores in descending order for a given game id. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
Write a Pig script that will output pairs of game ids and the number of players they have in common. For instance, if 2,000 players play both game X and game Y, then output X, Y, 2000. The output does not have to be sorted. The data set is available at /user/rlawrenc/416/lab2/small/player_games.txt.
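To make the expected result concrete, the following Python sketch (not a Pig solution) computes the pair counts on a small made-up data set. The record layout (player id, game id) and all values are assumptions for illustration only; the real player_games.txt may contain other fields. Each unordered pair is reported once, with the smaller game id first, which is one reasonable reading of the question.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (player_id, game_id) records; not taken from player_games.txt.
player_games = [
    (1, "X"), (1, "Y"),
    (2, "X"), (2, "Y"), (2, "Z"),
    (3, "Y"), (3, "Z"),
]


def common_player_counts(records):
    """Count, for each pair of games, the players who play both."""
    games_by_player = defaultdict(set)
    for player_id, game_id in records:
        games_by_player[player_id].add(game_id)

    counts = defaultdict(int)
    for games in games_by_player.values():
        # Sorting fixes the order inside a pair, so (X, Y) and (Y, X)
        # are never counted as two different pairs.
        for pair in combinations(sorted(games), 2):
            counts[pair] += 1
    return dict(counts)


pair_counts = common_player_counts(player_games)
for (game_a, game_b), shared in sorted(pair_counts.items()):
    print(game_a, game_b, shared)
```

On this toy input, X and Y share players 1 and 2, Y and Z share players 2 and 3, and X and Z share only player 2.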
Write a Pig script that will list all games along with the count of the number of players in each game with a score over 98,000. If a game does not have a player with a score over 98,000, it should still appear in the output with a count of 0.
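The requirement that games with no qualifying score still appear is easy to miss. The Python sketch below (not a Pig solution) shows the intended result on made-up data. The game ids, the (player id, game id, score) layout, and the scores are all assumptions for illustration.

```python
# Hypothetical data: a list of game ids and (player_id, game_id, score) records.
games = ["G1", "G2", "G3"]
player_games = [
    (1, "G1", 99000),
    (2, "G1", 98500),
    (3, "G1", 12000),
    (1, "G2", 98000),  # exactly 98,000 is not "over 98,000"
]

THRESHOLD = 98000


def high_score_counts(game_ids, records, threshold=THRESHOLD):
    """Count players per game with a score strictly above the threshold."""
    # Start every game at 0 so games without a qualifying score still appear.
    counts = {game_id: 0 for game_id in game_ids}
    for _player_id, game_id, score in records:
        if score > threshold and game_id in counts:
            counts[game_id] += 1
    return counts


high_counts = high_score_counts(games, player_games)
for game_id in games:
    print(game_id, high_counts[game_id])
```

Filtering the scores first and then grouping would drop G2 and G3 entirely. In Pig, an outer join or COGROUP against the full list of games is one way to keep them.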
Write a Pig script that will list all players who either have a score over 90,000 in some game or play a game published by 'Electronic Arts'.
Write a Pig script that, for each publisher, will list two records. The first record will be the publisher id, "female", and the maximum number of women that play one of its games. The second record will be the publisher id, "male", and the maximum number of men that play one of its games. Note that the two maximums may come from different games.
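As an illustration of the expected output shape, the Python sketch below (not a Pig solution) works through a made-up example in which the female and male maximums for one publisher come from different games. The (publisher id, game id, gender) layout, the "F"/"M" gender codes, and the choice to report 0 when a publisher has no players of one gender are all assumptions.

```python
# Hypothetical (publisher_id, game_id, gender) rows, one per player of a game.
rows = [
    ("P1", "G1", "F"), ("P1", "G1", "F"), ("P1", "G1", "M"),
    ("P1", "G2", "M"), ("P1", "G2", "M"), ("P1", "G2", "M"),
    ("P2", "G3", "F"),
]

LABELS = {"F": "female", "M": "male"}


def max_players_by_gender(records):
    """Return {(publisher_id, label): largest per-game player count}."""
    # Step 1: count players per (publisher, game, gender).
    per_game = {}
    for publisher_id, game_id, gender in records:
        key = (publisher_id, game_id, gender)
        per_game[key] = per_game.get(key, 0) + 1

    # Step 2: keep the largest per-game count for each (publisher, label).
    best = {}
    for (publisher_id, _game_id, gender), count in per_game.items():
        # Ensure both records exist for every publisher (assumed default: 0).
        for label in LABELS.values():
            best.setdefault((publisher_id, label), 0)
        key = (publisher_id, LABELS[gender])
        best[key] = max(best[key], count)
    return best


gender_maximums = max_players_by_gender(rows)
for (publisher_id, label), count in sorted(gender_maximums.items()):
    print(publisher_id, label, count)
```

For publisher P1, the female maximum (2) comes from game G1 while the male maximum (3) comes from game G2.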
Write a Pig script that, for each game, shows the total number of players, the number of female players, the number of male players, and the percentage of players who are female and male.
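The arithmetic for the percentages can be sketched in Python on made-up data (this is not a Pig solution). The (game id, gender) layout and the "F"/"M" gender codes are assumptions for illustration.

```python
# Hypothetical (game_id, gender) rows, one per player of a game.
rows = [
    ("G1", "F"), ("G1", "M"), ("G1", "M"), ("G1", "F"),
    ("G2", "M"), ("G2", "M"), ("G2", "F"),
]


def gender_breakdown(records):
    """Return per-game totals, per-gender counts, and percentages."""
    counts_by_game = {}
    for game_id, gender in records:
        counts = counts_by_game.setdefault(game_id, {"F": 0, "M": 0})
        counts[gender] += 1

    result = {}
    for game_id, counts in counts_by_game.items():
        total = counts["F"] + counts["M"]
        result[game_id] = {
            "total": total,
            "female": counts["F"],
            "male": counts["M"],
            # Multiply by 100.0 so the division is done in floating point.
            "pct_female": 100.0 * counts["F"] / total,
            "pct_male": 100.0 * counts["M"] / total,
        }
    return result


breakdown = gender_breakdown(rows)
for game_id, stats in sorted(breakdown.items()):
    print(game_id, stats["total"], stats["female"], stats["male"],
          round(stats["pct_female"], 1), round(stats["pct_male"], 1))
```

When translating this to Pig, note that dividing two integer counts may truncate the result, so a cast to double before dividing is likely needed.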
Submit a single Pig script file with answers to all questions using Connect or by email. You can demonstrate your work at any time for feedback and marking.