
Assignment -

For this assignment, the first two problems require Hadoop MapReduce jobs, although you need only solve one of them. Each of these problems should have its own folder. The folder for a problem must contain a .txt file that gives the command-line invocation for the job. For Java jobs, submit the project directory as well as a jar. The streaming job will require its own folder containing the files for the mapper and the reducer. Problems carried out in Spark require only the file that will be submitted through spark-submit; Spark jobs will be implemented in Python. For Spark jobs, key-value output may include parentheses. For problems that do not require MapReduce or Spark, follow the instructions given below and include all work in the main submission zip.

Solve one of problems 1 and 2.

1. The following is a MapReduce exercise. You may use either the Java or Streaming APIs. From the UCI Machine Learning Repository download the compressed files docword.nytimes.txt.gz and vocab.nytimes.txt.gz. These are part of the Bag of Words data set. Create a file named words_nytimes.txt which is the same as docword.nytimes.txt but with the first three lines removed. Using the distributed cache, translate the records of the nytimes data set into the form (docid, actual term, term count, max frequency for document). Parentheses should not be part of the output, and you may use different delimiters. The actual term is the mapping of a term id as given in the file vocab.nytimes.txt. The input file here is words_nytimes.txt, and the file that will be put in the distributed cache is vocab.nytimes.txt. The VM may have difficulty with the entire dataset; if you run into issues, use only a part of the file.
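If you choose the Streaming API, the join against the vocabulary can be done map-side by reading the cached file at mapper start-up. The sketch below is only an illustration under assumptions (space-separated "docid termid count" input lines, vocab.nytimes.txt shipped to the task directory with the -files option); it does not yet attach the per-document max frequency, which a reducer keyed on docid would add.

#!/usr/bin/env python3
# Hypothetical streaming mapper sketch: join term ids against the cached vocabulary.
# Assumes input lines of the form "docid termid count" and that vocab.nytimes.txt
# was shipped with -files so it sits in the task's working directory.
import sys

# Line N (1-based) of the vocabulary file is the term with id N.
vocab = {}
with open("vocab.nytimes.txt") as fh:
    for term_id, term in enumerate(fh, start=1):
        vocab[term_id] = term.strip()

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 3:
        continue  # skip malformed lines
    doc_id, term_id, count = parts
    term = vocab.get(int(term_id), "UNKNOWN")
    # Key on docid so a reducer can compute and append the per-document max frequency.
    print(f"{doc_id}\t{term}\t{count}")

A typical invocation would pass the vocabulary through the cache with -files vocab.nytimes.txt on the hadoop streaming command line; record the exact command you use in the required .txt file.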

2. In this exercise you will implement matrix multiplication as a streaming job using Python. You will do so by executing a secondary sort in such a way that no buffering is required in the reducer. Your reducer may use only O(1) additional memory; for example, you may use a small number of variables storing floats or ints only. A sketch of one possible reducer follows.
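One way to meet the memory bound is to build the composite key so that, for each output cell (i, j), the reducer sees the contributing A and B entries next to each other and only ever holds a running sum. The reducer sketch below assumes a mapper that emits lines of the form "i,j,k,tag<TAB>value", with tag 0 for A[i][k] and tag 1 for B[k][j], and a secondary sort that orders lines by i, j, k, tag; that key layout is an assumption for illustration, not a prescribed design.

#!/usr/bin/env python3
# Hypothetical streaming reducer sketch using O(1) additional memory.
# Assumes sorted input lines "i,j,k,tag<TAB>value" (tag 0 = A[i][k], tag 1 = B[k][j]).
import sys

current_cell = None  # the (i, j) cell currently being accumulated
running_sum = 0.0    # partial dot product for that cell
a_value = 0.0        # A[i][k] waiting for its matching B[k][j]

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    i, j, k, tag = key.split(",")
    cell = (i, j)
    if cell != current_cell:
        if current_cell is not None:
            print(f"{current_cell[0]},{current_cell[1]}\t{running_sum}")
        current_cell, running_sum = cell, 0.0
    if tag == "0":
        a_value = float(value)                 # remember A[i][k]
    else:
        running_sum += a_value * float(value)  # pair it with B[k][j]

if current_cell is not None:
    print(f"{current_cell[0]},{current_cell[1]}\t{running_sum}")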

3. In this problem you will build an inverted index for the nytimes data in the following sense. The output will be a term id together with a sorted list of the documents in which the term is found. To be precise, the output will be lines with tab-separated fields, where the first field is the term id and the subsequent fields are of the form docid:count, where the count is the number of times that the term appears in the document. Furthermore, the docid:count data needs to be sorted, highest to lowest, by count, so the document for which the count is greatest will appear first and that in which the count is least will appear last. You will implement this in Spark. Your submission will be a file whose lines contain the required data, together with a file giving the code/commands executed. Compress the submission data.
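A minimal PySpark sketch of the grouping and per-term sort is given below; it assumes words_nytimes.txt with space-separated docid, termid, count fields, and leaves the output formatting details (delimiters, merging part files) to you.

# Hypothetical PySpark sketch for the inverted index.
# Assumes space-separated "docid termid count" lines in words_nytimes.txt.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inverted_index").getOrCreate()
lines = spark.sparkContext.textFile("words_nytimes.txt")

def parse(line):
    doc_id, term_id, count = line.split()
    return int(term_id), (int(doc_id), int(count))

index = (lines.map(parse)
              .groupByKey()
              # sort each posting list by count, highest first
              .mapValues(lambda postings: sorted(postings, key=lambda p: -p[1])))

def to_line(record):
    term_id, postings = record
    return "\t".join([str(term_id)] + [f"{d}:{c}" for d, c in postings])

index.map(to_line).saveAsTextFile("inverted_index_out")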

4. For this problem you will need to read about the tf-idf transform in the book Mining of Massive Datasets. The file words_nytimes.txt will be the input. The output will be the same as the input, except that the third field, which gives the count of the term in the document, will be replaced by the tf-idf score for the term in the document.

You may solve this using any method you like; however, the tf-idf score must be as defined in the above-mentioned text. You need only submit the output. You must compress the output and include it with your zipped submission.
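For reference, the definition used in Mining of Massive Datasets is, in plain notation (verify against your edition before relying on it):

    TF_ij     = f_ij / max_k f_kj     (f_ij is the count of term i in document j)
    IDF_i     = log2(N / n_i)         (N documents in total, n_i of them contain term i)
    TF.IDF_ij = TF_ij * IDF_i

Note that the per-document max frequency from Problem 1 is exactly the denominator of the TF term.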

5. The following must be solved using Spark. You will submit your answers together with a file containing the commands you executed. It is recommended that you employ DataFrames for this problem. You may need to make use of AWS if your computer is unable to process the entire data set. When a question asks for a particular word, give its id only. Referring to the New York Times dataset mentioned above, answer the following questions; a minimal DataFrame sketch for part (a) follows the list.

(a) How many documents have at least 100 distinct words?

(b) Which document contains the most total words from the vocabulary?

(c) Which document contains the most distinct words from the vocabulary?

(d) Which document, with at least 100 words, has the greatest lexical richness with respect to the vocabulary? By lexical richness we mean the number of distinct words divided by the total number of words.

(e) Which document, with at least 100 words, has the least lexical richness?

(f) Which word from the vocabulary appears the most across all of the documents, in terms of total count?

(g) How many documents have fewer than 50 words from the vocabulary?

(h) What is the average number of total words per document?

(i) What is the average number of distinct words per document?
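As promised above, here is a minimal PySpark DataFrame sketch for part (a), assuming words_nytimes.txt is read as three space-separated integer columns docid, termid, count; the remaining parts follow the same pattern with different aggregations.

# Hypothetical DataFrame sketch for part (a) only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("nytimes_questions").getOrCreate()
df = (spark.read.csv("words_nytimes.txt", sep=" ")
           .toDF("docid", "termid", "count")
           .select(F.col("docid").cast("int"),
                   F.col("termid").cast("int"),
                   F.col("count").cast("int")))

# (a) number of documents having at least 100 distinct words
answer_a = (df.groupBy("docid")
              .agg(F.countDistinct("termid").alias("distinct_words"))
              .filter(F.col("distinct_words") >= 100)
              .count())
print(answer_a)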

6. Download the file movies.txt.gz and familiarize yourself with its structure. This is a large file, and the download may take some time depending on your internet connection. After this you will create a new file called reviews.csv which will have on each line the following:

review id, product id, score, helpfulness score

where the fields are separated by a comma. There should be one line per review in the file. You may solve this exercise in whatever manner you choose; a parsing sketch is given below. Now carry out the following parts; include the code and/or commands that were executed to answer these questions. You will also submit the compressed output. As in the previous problem, you may use any method you like.
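If you take the plain-Python route, a parser along these lines is one option. It is a sketch under assumptions: a SNAP-style record format of "key: value" lines separated by blank lines, the field names product/productId, review/score, and review/helpfulness, latin-1 encoding, and a sequential review id (the raw file carries no explicit review id). Adjust it to the structure you actually observe in the file.

# Hypothetical parser sketch: movies.txt blocks -> reviews.csv.
import csv

def records(path):
    # Yield one dict per blank-line-separated block of "key: value" lines.
    record = {}
    with open(path, encoding="latin-1") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                if record:
                    yield record
                    record = {}
            elif ": " in line:
                key, value = line.split(": ", 1)
                record[key] = value
        if record:
            yield record

with open("reviews.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for review_id, rec in enumerate(records("movies.txt"), start=1):
        writer.writerow([review_id,
                         rec.get("product/productId", ""),
                         rec.get("review/score", ""),
                         rec.get("review/helpfulness", "")])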

(a) Verify that you have the correct number of reviews in the file you created.

(b) Verify the number of distinct products.

(c) Verify the number of distinct users.

(d) Verify the number of users with 50 or more reviews.

(e) Create a file called mean_rating.csv which has one line per unique reviewer, where each line contains the user id and the mean score of all their ratings, separated by a comma. This file should also be compressed and submitted. A sketch for this part follows.
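For part (e), a minimal standard-library sketch that reads movies.txt directly (reviews.csv as specified above does not carry the user id); it assumes the review/userId line precedes the review/score line within each record, as in the SNAP format.

# Hypothetical sketch for part (e): mean score per user -> mean_rating.csv.
from collections import defaultdict

totals = defaultdict(float)  # sum of scores per user
counts = defaultdict(int)    # number of reviews per user
user = None

with open("movies.txt", encoding="latin-1") as fh:
    for line in fh:
        line = line.strip()
        if line.startswith("review/userId: "):
            user = line.split(": ", 1)[1]
        elif line.startswith("review/score: ") and user is not None:
            totals[user] += float(line.split(": ", 1)[1])
            counts[user] += 1

with open("mean_rating.csv", "w") as out:
    for u in totals:
        out.write(f"{u},{totals[u] / counts[u]}\n")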

Textbook - Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.

Attachment: Assignment File.rar
