Assignment -

For this assignment, the first two problems require Hadoop MapReduce jobs, although you need to solve only one of them. Each of these problems should have its own folder. The folder for a problem must contain a .txt file that gives the command-line invocation for the job. For Java jobs, submit the project directory as well as a jar. The streaming job requires its own folder containing the mapper and reducer files. Problems carried out in Spark require only the file that will be submitted through spark-submit. Spark jobs will be implemented in Python. For Spark jobs, key-value output may include parentheses. For problems that do not require MapReduce or Spark, follow the instructions given below, including all work in the main submission zip.

Solve one of problems 1 and 2.

1. The following is a MapReduce exercise. You may use either the Java or the Streaming API. From the UCI Machine Learning Repository, download the compressed files docwords.nytimes.txt.gz and vocab.nytimes.txt.gz. These are part of the Bag of Words data set. Create a file named words_nytimes.txt that is the same as docwords.nytimes.txt but with the first three lines removed. Using the distributed cache, translate the records of the nytimes data set into the form (docid, actual term, term count, max frequency for document). Parentheses should not be part of the output, and you may use different delimiters. The actual term is the term that a term id maps to in the file vocab.nytimes.txt. The input file is words_nytimes.txt, and the file to be placed in the distributed cache is vocab.nytimes.txt. The VM may have difficulty with the entire dataset. If you run into issues, run the job on only part of the file.
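The record translation in problem 1 can be prototyped locally before it is moved into a Hadoop job. The sketch below is plain Python, not a Hadoop job. It assumes that each record of words_nytimes.txt is a whitespace-separated "docid termid count" triple, that line n of vocab.nytimes.txt holds the term with id n, and that all records for one document are adjacent. The function names load_vocab and translate are illustrative only.

```python
from itertools import groupby


def load_vocab(lines):
    # Assumption: line n (1-based) of vocab.nytimes.txt is the term with id n.
    return {i: line.strip() for i, line in enumerate(lines, start=1)}


def translate(records, vocab):
    """Yield tab-separated 'docid, term, count, max count in document' lines."""
    parsed = (
        tuple(int(x) for x in line.split())
        for line in records
        if line.strip()
    )
    # Assumption: records for the same docid are adjacent in the input.
    for docid, group in groupby(parsed, key=lambda r: r[0]):
        rows = list(group)  # buffers a single document at a time
        max_freq = max(count for _, _, count in rows)
        for _, termid, count in rows:
            yield f"{docid}\t{vocab[termid]}\t{count}\t{max_freq}"


if __name__ == "__main__":
    vocab = load_vocab(["apple\n", "banana\n", "cherry\n"])
    for out in translate(["1 1 2", "1 3 5", "2 2 1"], vocab):
        print(out)
```

In a real job the vocabulary lookup would be loaded from the distributed-cache copy of vocab.nytimes.txt, and the per-document grouping would typically be done by keying on docid.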

2. In this exercise you will implement matrix multiplication as a streaming job using Python. You will do so by executing a secondary sort in such a way that no buffering is required in the reducer. Your reducer may use only O(1) additional memory. For example, you may use a small number of variables that store only floats or ints.
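One possible way to meet the O(1) memory requirement is sketched below; it is not necessarily the intended solution. It assumes input lines of the hypothetical form "A,i,j,value" or "B,j,k,value", and that the mapper knows the number of rows of A (n_rows) and the number of columns of B (n_cols). Each entry is replicated to every output cell (i, k) it contributes to, under the composite key (i, k, j, tag). If the job partitions on (i, k) and sorts on the full key, the A entry and the B entry for the same j arrive next to each other, so the reducer only needs a running total and one pending value.

```python
def mapper(lines, n_rows, n_cols):
    """Emit 'i<TAB>k<TAB>j<TAB>tag<TAB>value'; tag 0 marks A, tag 1 marks B."""
    for line in lines:
        name, r, c, v = line.strip().split(",")
        if name == "A":                      # A[i][j] contributes to every C[i][k]
            for k in range(n_cols):
                yield f"{r}\t{k}\t{c}\t0\t{v}"
        else:                                # B[j][k] contributes to every C[i][k]
            for i in range(n_rows):
                yield f"{i}\t{c}\t{r}\t1\t{v}"


def reducer(sorted_lines):
    """Sum A[i][j] * B[j][k] per (i, k) using a fixed number of variables."""
    cell, total = None, 0.0
    pending_j, pending_a = None, 0.0
    for line in sorted_lines:
        i, k, j, tag, v = line.rstrip("\n").split("\t")
        if (i, k) != cell:                   # a new output cell begins
            if cell is not None:
                yield f"{cell[0]}\t{cell[1]}\t{total}"
            cell, total, pending_j = (i, k), 0.0, None
        if tag == "0":                       # remember the A value for this j
            pending_j, pending_a = j, float(v)
        elif pending_j == j:                 # matching B value: accumulate
            total += pending_a * float(v)
    if cell is not None:
        yield f"{cell[0]}\t{cell[1]}\t{total}"


def simulate(lines, n_rows, n_cols):
    # Local stand-in for the shuffle: sort on the four key fields.
    shuffled = sorted(
        mapper(lines, n_rows, n_cols),
        key=lambda s: tuple(int(x) for x in s.split("\t")[:4]),
    )
    return list(reducer(shuffled))
```

The trade-off in this design is extra shuffle volume: every matrix entry is replicated once per output cell it touches, which is the price paid for avoiding any buffering in the reducer.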

3. In this problem you will build an inverted index for the nytimes data in the following sense. The output will be a term id together with a sorted list of the documents in which the term is found. To be precise, the output will be lines with tab-separated fields, where the first field is the term and the subsequent fields are of the form docid:count, where the count is the number of times that the term appears in the document. Furthermore, the docid:count data must be sorted by count, highest to lowest, so the document with the greatest count appears first and the document with the least count appears last. You will implement this in Spark. Your submission will be a file whose lines contain the required data, together with a file giving the code/commands executed. Compress the submission data.
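The per-record and per-term logic for the inverted index can be written as two small pure-Python functions and tested without a cluster. This is a sketch under the assumption that each input record is a whitespace-separated "docid termid count" triple; the names parse_record and format_postings are illustrative. Ties in count are broken by docid here, which the problem statement does not specify.

```python
def parse_record(line):
    """Turn 'docid termid count' into (termid, (docid, count))."""
    docid, termid, count = (int(x) for x in line.split())
    return termid, (docid, count)


def format_postings(termid, postings):
    """Build one output line: termid, then docid:count sorted by count, descending."""
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    return "\t".join([str(termid)] + [f"{d}:{c}" for d, c in ordered])


# With PySpark these could be combined roughly as follows
# (sc is an existing SparkContext; the paths are placeholders):
#
#   sc.textFile("words_nytimes.txt") \
#     .map(parse_record) \
#     .groupByKey() \
#     .map(lambda kv: format_postings(kv[0], kv[1])) \
#     .saveAsTextFile("inverted_index")
```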

4. For this problem you will need to read about the tf-idf transform in the book Mining of Massive Datasets. For this problem the file words_nytimes.txt will be the input. The output will be the same as the input except that the third field which gives the count of the term in the document will be replaced by the tf-idf score for the term in the document.

You may solve this using any method you like; however, the tf-idf score must be as defined in the above-mentioned text. You need only submit the output. You must compress the output and include it with your zipped submission.
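As a sketch of the computation, the function below assumes the definition given in Section 1.3.1 of the book: the term frequency is the count divided by the maximum count of any term in the same document, the inverse document frequency is log2(N / n_i), where N is the number of documents and n_i the number of documents containing term i, and the score is their product. Confirm this against the text before relying on it. It also assumes that each (docid, termid) pair appears once and that the records fit in memory, so for the full data set the same three steps would need a scalable tool.

```python
import math
from collections import defaultdict


def tf_idf(records):
    """records: iterable of (docid, termid, count); yields (docid, termid, score)."""
    records = list(records)
    max_freq = defaultdict(int)   # largest term count within each document
    doc_freq = defaultdict(int)   # number of documents containing each term
    for docid, termid, count in records:
        max_freq[docid] = max(max_freq[docid], count)
        doc_freq[termid] += 1     # assumes one record per (docid, termid)
    n_docs = len(max_freq)
    for docid, termid, count in records:
        tf = count / max_freq[docid]
        idf = math.log2(n_docs / doc_freq[termid])
        yield docid, termid, tf * idf
```

Note that under this definition a term that appears in every document gets a score of 0, because its IDF is log2(1).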

5. The following must be solved using Spark. You will submit your answers together with a file containing the commands you executed. It is recommended that you employ data frames for this problem. You may need to make use of AWS if your computer is unable to process the entire data set. When a question asks about a particular word, give its id only. Referring to the New York Times dataset mentioned above, answer the following questions.

(a) How many documents have at least 100 distinct words?

(b) Which document contains the most total words from the vocabulary?

(c) Which document contains the most distinct words from the vocabulary?

(d) Which document, with at least 100 words, has the greatest lexical richness with respect to the vocabulary? By lexical richness we mean the number of distinct words divided by the total number of words.

(e) Which document, with at least 100 words, has the least lexical richness?

(f) Which word from the vocabulary appears the most across all of the documents, in terms of total count?

(g) How many documents have fewer than 50 words from the vocabulary?

(h) What is the average number of total words per document?

(i) What is the average number of distinct words per document?
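The questions above all derive from two per-document quantities: the number of distinct words and the total number of words. A small pure-Python reference, run on a sample of the data, can be useful for checking the Spark results. The sketch below is only that check, not the Spark solution, and assumes records of the form (docid, termid, count) with one record per (docid, termid) pair. The name doc_stats is illustrative.

```python
from collections import defaultdict


def doc_stats(records):
    """Return {docid: (distinct words, total words, lexical richness)}."""
    total = defaultdict(int)
    distinct = defaultdict(int)
    for docid, _termid, count in records:
        total[docid] += count      # total words: sum of the counts
        distinct[docid] += 1       # distinct words: one per record
    return {d: (distinct[d], total[d], distinct[d] / total[d]) for d in total}


if __name__ == "__main__":
    stats = doc_stats([(1, 1, 2), (1, 2, 2), (2, 1, 5)])
    # Example for part (d), here without the 100-word threshold:
    richest = max(stats, key=lambda d: stats[d][2])
    print(stats, richest)
```

In Spark the same two aggregates would typically come from grouping a data frame by docid and aggregating a count and a sum.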

6. Download the file movies.txt.gz and familiarize yourself with its structure. This is a large file, and the download may take some time depending on your internet connection. After this you will create a new file called reviews.csv, which will have on each line the following:

review id, product id, score, helpfulness score

where the fields are separated by a comma. There should be one line per review in the file. You may solve this exercise in whatever manner you choose. Now carry out the following parts. You will include the code and/or commands that were executed to answer these questions, and you will also submit the compressed output. As in the previous problem, you may use any method you like.

(a) Verify that you have the correct number of reviews in the file you created.

(b) Verify the number of distinct products.

(c) Verify the number of distinct users.

(d) Verify the number of users with 50 or more reviews.

(e) Create a file called mean_rating.csv which has one line per unique reviewer such that each line has the user id and mean score of all their ratings separated by a comma. This file should also be compressed and submitted.
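The steps of problem 6 can be sketched in plain Python as follows. The layout of movies.txt is not described in the assignment, so everything about it here is an assumption to verify against the file: that each review is a block of "key: value" lines, that the blocks are separated by blank lines, and that the keys include product/productId, review/userId, review/helpfulness and review/score. The review id is taken to be a running number starting at 1, which is also an assumption. All function names are illustrative.

```python
from collections import defaultdict
from itertools import chain

# Assumed field names; check them against the actual file.
FIELDS = {
    "product/productId": "product",
    "review/userId": "user",
    "review/helpfulness": "helpfulness",
    "review/score": "score",
}


def parse_reviews(lines):
    """Yield one dict per review; a blank line is assumed to end a record."""
    record = {}
    for raw in chain(lines, [""]):          # the extra "" flushes the last record
        line = raw.strip()
        if line:
            key, sep, value = line.partition(": ")
            if sep and key in FIELDS:
                record[FIELDS[key]] = value
        elif len(record) == len(FIELDS):    # records with missing fields are not handled
            yield record
            record = {}


def review_rows(reviews):
    """Format 'review id,product id,score,helpfulness' lines for reviews.csv."""
    for review_id, r in enumerate(reviews, start=1):
        yield f"{review_id},{r['product']},{r['score']},{r['helpfulness']}"


def mean_ratings(reviews):
    """Return {user id: mean score}, the content of mean_rating.csv."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in reviews:
        sums[r["user"]] += float(r["score"])
        counts[r["user"]] += 1
    return {user: sums[user] / counts[user] for user in sums}
```

Because parse_reviews is a generator, the file can be streamed line by line (for example from gzip.open in text mode) without being loaded into memory; mean_ratings keeps only one running sum and count per user.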

Textbook - Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
