Assignment: Simple Data Analysis with MapReduce and Spark

1 Introduction
This assignment tests your ability to implement a simple data analytics workload using basic features of the MapReduce and Spark frameworks. The data set you will work on is the Trending YouTube Video Statistics data from Kaggle. There are two workloads you should design and implement against this data set. You are required to implement one with MapReduce and the other with Spark. You can choose which framework to use for which workload.

2 Input Data Set Description
The dataset contains several months' records of daily top trending YouTube videos in the following five countries: Canada, France, Germany, UK and USA. There are up to 200 trending videos listed per day.

Each country's data is saved in a separate CSV file. Each row of the CSV file represents a trending video record. If a video is listed as trending on multiple days, each trending appearance has its own record. The record includes video id, title, trending date, publish time, number of views, and so on. The video record also includes a category id field. The categories are slightly different in each country, so a JSON file is provided for each country that defines the mapping between category ID and category name.
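As a rough sketch, one record and the category mapping might be read in plain Python as follows. The column names (`video_id`, `trending_date`, `category_id`, `views`) and the JSON layout (`items[].id`, `items[].snippet.title`) are assumptions based on the public Kaggle files, not something this description specifies, and the sample values are made up.

```python
import csv
import io
import json

# Assumed column names; check them against the header row of the real CSV files.
SAMPLE_CSV = (
    "video_id,trending_date,title,category_id,publish_time,views\n"
    'abc123,18.17.02,"A title, with a comma",10,2018-02-16T14:00:09.000Z,960453\n'
)

# Assumed JSON layout: a list of items, each carrying an id and a snippet title.
SAMPLE_JSON = '{"items": [{"id": "10", "snippet": {"title": "Music"}}]}'


def load_categories(json_text):
    """Return a dict mapping category id (as a string) to category name."""
    data = json.loads(json_text)
    return {item["id"]: item["snippet"]["title"] for item in data["items"]}


def parse_records(csv_text):
    """Yield one dict per trending record, with views converted to int."""
    # csv.DictReader handles quoted fields, so commas inside titles are safe.
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["views"] = int(row["views"])
        yield row


categories = load_categories(SAMPLE_JSON)
records = list(parse_records(SAMPLE_CSV))
print(records[0]["video_id"], categories[records[0]["category_id"]])
```

Splitting rows on raw commas would break on titles that contain commas, which is why the sketch goes through a CSV parser.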

3 Analysis Workload Description

Category and Trending Correlation
Some videos are trending in multiple countries. We are interested in whether there is any correlation between category and overlapping trending. For instance, if UK and CA users have common interests in music but very different interests in sports, we might see that 3% of trending music videos in the UK also appear in the trending list of CA, while only 0.5% of trending sports videos in the UK appear in CA's trending list.
In this workload, you are asked to find out, for a given pair of countries A and B, for each category in country A, the total number of videos trending in country A and the percentage of them that are also trending in country B. For any video with multiple trending appearances in a country, it should be counted as one video in that country.

The result would look like the following, supposing the country pair is GB and US:

Entertainment; total: 617; 31.6% in US
Sports; total: 163; 16.6% in US
...

This means that there are 617 videos from the Entertainment category in the UK's trending list, and 31.6% of those 617 videos also appear in the US's trending list. There are 163 videos from the Sports category in the UK's trending list, and 16.6% of those 163 videos also appear in the US's trending list.
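The counting rule can be illustrated with a minimal single-process sketch. It is not a MapReduce or Spark implementation; the function name `category_overlap` and the sample records are hypothetical, and the category of a video is assumed to be the one recorded in country A.

```python
from collections import defaultdict


def category_overlap(records_a, records_b):
    """records_*: iterables of (video_id, category_name) trending records.

    Returns {category: (total_in_a, percent_also_in_b)}.
    """
    # Repeated trending appearances collapse to a single video id per country.
    videos_b = {video_id for video_id, _ in records_b}
    by_category = defaultdict(set)
    for video_id, category in records_a:
        by_category[category].add(video_id)

    result = {}
    for category, videos in by_category.items():
        shared = len(videos & videos_b)
        result[category] = (len(videos), round(100.0 * shared / len(videos), 1))
    return result


# Made-up records for illustration only.
gb = [("v1", "Music"), ("v1", "Music"), ("v2", "Music"), ("v3", "Sports")]
us = [("v1", "Music"), ("v9", "Sports")]

for category, (total, pct) in sorted(category_overlap(gb, us).items()):
    print(f"{category}; total: {total}; {pct}% in US")
```

Using sets makes the "count each video once per country" rule explicit: `v1` appears twice in the GB records but contributes a single video to the Music total.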

Impact of Trending on View Number

Listing a video as trending would help it attract more views. The view number may quickly increase after a video is listed as trending for the first time. In fact it is not unusual for the view number to double between a video's first and second trending appearance.

Below are a few records of a particular video:

videoID       Trending Date   Publish Time               Views     Country
xYtsL9znopI   18.17.02        2018-02-16T14:00:09.000Z   960453    CA
xYtsL9znopI   18.18.02        2018-02-16T14:00:09.000Z   2109193   CA
xYtsL9znopI   18.19.02        2018-02-16T14:00:09.000Z   2768767   CA
xYtsL9znopI   18.20.02        2018-02-16T14:00:09.000Z   3213410   CA

The video has four trending appearances in CA between February 17, 2018 and February 20, 2018. The view number in its first appearance (2018/02/17) is 960,453; the view number in its second appearance (2018/02/18) is 2,109,193. That is a 119.6% increase between the first and second appearance. In contrast, the increase between the second and third appearance is only 31.3%.
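The percentages follow from (later views - earlier views) / earlier views × 100. A small sketch of that arithmetic on the four records listed above, assuming the trending date is formatted as yy.dd.mm (the helper name `percent_increase` is illustrative):

```python
from datetime import datetime


def percent_increase(before, after):
    """Percentage growth from `before` to `after`."""
    return (after - before) / before * 100.0


# The four CA records listed above: (trending date, views).
appearances = [
    ("18.17.02", 960453),
    ("18.18.02", 2109193),
    ("18.19.02", 2768767),
    ("18.20.02", 3213410),
]

# Trending dates are read here as yy.dd.mm, so 18.17.02 is 17 February 2018.
ordered = sorted(
    (datetime.strptime(date, "%y.%d.%m"), views) for date, views in appearances
)
views = [v for _, v in ordered]

first_to_second = percent_increase(views[0], views[1])
second_to_third = percent_increase(views[1], views[2])
print(round(first_to_second, 1), round(second_to_third, 1))
```

Parsing the date before sorting matters: ordering the raw strings happens to work within one month, but it would misorder appearances that cross a month boundary.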
In this workload, you are asked to find, for each country, all videos that have an increase of at least 1,000% in view number between their first and second trending appearances. The result should be grouped by country and sorted in descending order by percent increase.

The result would look like
DE; V1zTJIfGKaA, 19501.0
DE; RIgNyiGttog, 12346.6
...
CA; _I_D_8Z4sJE, 8438.1
CA; -K9ujx8vO_A, 8298.3
...
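One way to picture the whole workload is the single-process sketch below. It only illustrates the grouping, ordering, and filtering logic and would need to be re-expressed as MapReduce or Spark operations. The function name `fast_growing`, the yy.dd.mm date format, the alphabetical ordering of countries, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

THRESHOLD = 1000.0  # minimum percent increase to report


def fast_growing(records):
    """records: iterable of (country, video_id, trending_date, views).

    Returns (country, video_id, percent_increase) tuples, grouped by country
    and sorted by descending increase within each country.
    """
    grouped = defaultdict(list)
    for country, video_id, date, views in records:
        grouped[(country, video_id)].append(
            (datetime.strptime(date, "%y.%d.%m"), views)
        )

    rows = []
    for (country, video_id), appearances in grouped.items():
        if len(appearances) < 2:
            continue  # no second appearance to compare against
        appearances.sort()  # earliest trending date first
        first, second = appearances[0][1], appearances[1][1]
        if first == 0:
            continue  # percent increase is undefined for zero views
        increase = (second - first) / first * 100.0
        if increase >= THRESHOLD:
            rows.append((country, video_id, round(increase, 1)))

    return sorted(rows, key=lambda r: (r[0], -r[2]))


# Made-up records for illustration only.
sample = [
    ("CA", "a", "18.01.01", 100), ("CA", "a", "18.02.01", 1100),
    ("CA", "b", "18.01.01", 10), ("CA", "b", "18.02.01", 500),
    ("DE", "c", "18.03.01", 50), ("DE", "c", "18.02.01", 2),
    ("DE", "d", "18.01.01", 100), ("DE", "d", "18.02.01", 150),
    ("DE", "e", "18.01.01", 5),
]

for country, video_id, pct in fast_growing(sample):
    print(f"{country}; {video_id}, {pct}")
```

Only the first two appearances per (country, video) pair are needed, so a distributed version can discard later appearances early to reduce the data that is shuffled.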

4 Coding and Execution Requirement
Your implementation should utilize features provided by the respective framework. In particular, you should parallelize most of the operations. The Hadoop implementation should run in a pseudo-distributed mode. The Spark implementation should run in a standalone cluster or YARN cluster on a single machine.
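To make the parallelizable steps concrete, here is a hedged sketch that imitates the map, shuffle, and reduce phases in a single Python process. A real Hadoop or Spark job would distribute these steps across tasks; the names `mapper`, `reducer`, and `run_local` and the sample records are hypothetical and do not correspond to any framework API.

```python
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """Emit ((country, category), video_id) for one trending record."""
    country, category, video_id = record
    yield (country, category), video_id


def reducer(key, video_ids):
    """Count distinct videos for one (country, category) key."""
    return key, len(set(video_ids))


def run_local(records):
    # Map: each record is processed independently, so this step parallelizes.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle: bring all values for the same key together.
    pairs.sort(key=itemgetter(0))
    # Reduce: each key is handled independently, so this step parallelizes too.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Made-up records for illustration only.
sample_records = [
    ("GB", "Music", "v1"),
    ("GB", "Music", "v1"),
    ("GB", "Music", "v2"),
    ("US", "Music", "v1"),
]
print(run_local(sample_records))
```

The shuffle is the only step that needs data from every record at once, which is why it is the natural boundary between the parallel map and reduce stages.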

5 Deliverable

The report should describe the design of both workloads. In particular, you should describe the sequence of operations/actions taken to obtain the final result, and highlight the parts that can be executed in parallel. You can use diagrams to help explain the sequence.
