Assignment -

In this problem you will be working with data from a collection of Wikipedia edit logs. The file that you will be working with is enwiki-20080103.main.bz2. This file is bzip2 compressed and is about 8.5 GB. It decompresses to a little over 300 GB. You will want a part of the file to use while developing your program. Using bzcat you can decompress the file a little at a time. Once you have decompressed some of it you can recompress that portion using the bzip2 command. The output you create in this part will be used in the second part of the project. You should familiarize yourself with this file before planning out your code.
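As an alternative to the command-line tools, a development sample can also be cut from Python. The sketch below is only one possible approach: it assumes the file is line-oriented text, and the function name `write_sample` and the default line count are illustrative choices, not part of the assignment.

```python
import bz2
from itertools import islice


def write_sample(src_path, dst_path, max_lines=1_000_000):
    """Copy the first max_lines lines of a bzip2 file into a new bzip2 file.

    The source is streamed, so only the part that is actually read gets
    decompressed. Returns the number of lines written.
    """
    written = 0
    # errors="replace" is a defensive assumption in case of stray bytes.
    with bz2.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

Calling `write_sample("enwiki-20080103.main.bz2", "sample.bz2")` would then produce a small compressed file to develop against. Note that a cut by line count may end in the middle of a record.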

Make sure you get the correct data set, as there is more than one at this link. You may carry out this project using whatever method you like. I developed two solutions: one using Hadoop MapReduce with the Java API on AWS, and another using only Python on a computer with a large amount of main memory (64 GB). If you do not own such a computer you can obtain an AWS instance with the desired specifications.

WARNING: Although this data set contains only text, some of that text is offensive. You are likely to come across it once you begin to extract the link data. Given the size of the data set, I am not aware of everything that one might find in it. If you believe that this is likely to be an issue for you, please let me know.

1. Execute a job whose output is a file of lines with seven tab-separated fields: Article Name, Number of Edits of the Article, Number of Major Edits of the Article, Number of Out Links, Number of In Links, Number of Distinct Editors, and the time of the earliest edit. Here Article Name means anything that was edited, or anything that was linked to by an actual article, so it can include things like image files. Do not include external links. Because some objects are only linked to and never edited, some fields may not exist for them. Use 0 for every missing field except the earliest edit time, where you should use a value indicating absence if no edit time can be associated with the object.
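The aggregation for problem 1 might be sketched as follows. Since the raw record layout is not described here, the sketch assumes the log has already been parsed into dictionaries with the hypothetical keys `title`, `editor`, `timestamp`, `minor`, and `links`. It also assumes that a major edit is one not flagged as minor, that link counts are taken over distinct targets across all revisions, that timestamps are uniformly formatted ISO 8601 strings (so string comparison orders them), and that `NA` marks a missing time. All of these are interpretation choices rather than requirements stated above.

```python
from collections import defaultdict

MISSING_TIME = "NA"  # assumed marker for "no edit time known"


def build_article_table(edits):
    """Aggregate parsed edit records into seven-field, tab-separated rows.

    Each record is a dict with hypothetical keys: title, editor,
    timestamp (ISO 8601 string), minor (bool), links (internal targets).
    """
    stats = defaultdict(lambda: {"edits": 0, "major": 0, "out": set(),
                                 "in": set(), "editors": set(), "first": None})
    for e in edits:
        s = stats[e["title"]]
        s["edits"] += 1
        if not e["minor"]:
            s["major"] += 1
        s["editors"].add(e["editor"])
        if s["first"] is None or e["timestamp"] < s["first"]:
            s["first"] = e["timestamp"]
        for target in e["links"]:
            s["out"].add(target)
            # Creates an entry for objects that are linked to but never edited.
            stats[target]["in"].add(e["title"])
    rows = []
    for title in sorted(stats):
        s = stats[title]
        first = s["first"] if s["first"] is not None else MISSING_TIME
        rows.append("\t".join([title, str(s["edits"]), str(s["major"]),
                               str(len(s["out"])), str(len(s["in"])),
                               str(len(s["editors"])), first]))
    return rows
```

Holding every link set in memory is only practical on a sample or on a machine with a lot of main memory; for the full file the same counts would more likely be computed in a MapReduce-style job.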

2. Determine the directed graph which associates articles to the objects to which they link, excluding External Links. Each line of output will be an article followed by an object to which it links where the two fields are tab separated.
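A minimal sketch of the edge-list output, under the same assumptions as before: records are already parsed into dictionaries with the hypothetical keys `title` and `links`, external links have already been filtered out of `links`, and an edge is emitted once if any revision of the article contained the link.

```python
def write_edges(edits, out):
    """Write each distinct (article, target) pair as one tab-separated line.

    `edits` is an iterable of dicts with hypothetical keys title and links;
    `out` is any writable text file object. Returns the number of edges.
    """
    seen = set()
    for e in edits:
        for target in e["links"]:
            edge = (e["title"], target)
            if edge not in seen:
                seen.add(edge)
                out.write(f"{edge[0]}\t{edge[1]}\n")
    return len(seen)
```

Whether repeated links across revisions should be collapsed like this is a judgment call; the statement above does not say.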

In this part you will work with the data you generated in problems 1 and 2 above. For these problems you are free to use any method you like. In fact, you are encouraged to choose whatever tool you feel best fits the problem.

3. Using the data from problem 2 of the first part of the project, remove all edges which do not connect actual pages to one another. By actual page we mean something which is the subject of an edit, not something that is only linked to by pages. Then run PageRank on the topic graph. Use a β of 0.85. Submit a file which gives each page together with its PageRank. Fields should be tab separated and the data should be sorted by PageRank in descending order.
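One way to carry out problem 3 on a graph that fits in memory is plain power iteration. The sketch below is an illustration, not a prescribed method: the function name, the tolerance, and the iteration cap are arbitrary, and it assumes that rank lost to teleportation and to pages without out links is redistributed uniformly over all pages, which is one common convention.

```python
def pagerank(edges, pages, beta=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over edges whose endpoints are both pages.

    edges: iterable of (source, target) pairs; pages: iterable of actual pages.
    Returns (page, rank) pairs sorted by rank descending, then by name.
    """
    pages = set(pages)
    if not pages:
        return []
    out_links = {p: set() for p in pages}
    for src, dst in edges:
        if src in pages and dst in pages:  # drop edges to non-pages
            out_links[src].add(dst)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = beta * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
        # Teleport mass plus mass stuck at dead ends, spread uniformly.
        leaked = 1.0 - sum(new.values())
        for p in new:
            new[p] += leaked / n
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return sorted(rank.items(), key=lambda kv: (-kv[1], kv[0]))
```

Each returned pair can be written out as `f"{page}\t{score}"` to produce the required tab-separated file, already in descending order.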

4. In this problem you will think of, and execute, a series of queries on your table data from problem 1 of the first part. A page refers to something which was the subject of at least one edit. First perform the following:

(a) Determine the page which has been edited the most times.

(b) Determine the page which has the largest number of distinct editors.

(c) Determine the page which has the earliest edit time.

(d) Determine the object which has the largest number of in links.

(e) Determine the page which has the largest number of out links.

(f) Determine the number of pages which have no out links.

Now think of four more queries to perform. In your submission include the results of each of the ten queries you performed. Also include a description of the four additional queries you performed.
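Queries (a) through (f) could be run with any tool; the sketch below uses an in-memory SQLite database purely as an example. The table and column names are invented for illustration, `NA` is assumed to be the marker used for a missing earliest edit time, and timestamps are assumed to sort correctly as strings. Ties are broken by name so that results are deterministic.

```python
import sqlite3

MISSING_TIME = "NA"  # assumed marker for "no edit time known"

QUERIES = {
    "a_most_edited": "SELECT name, edits FROM articles WHERE edits > 0 "
                     "ORDER BY edits DESC, name LIMIT 1",
    "b_most_editors": "SELECT name, editors FROM articles WHERE edits > 0 "
                      "ORDER BY editors DESC, name LIMIT 1",
    "c_earliest_edit": "SELECT name, first_edit FROM articles "
                       "WHERE edits > 0 AND first_edit IS NOT NULL "
                       "ORDER BY first_edit, name LIMIT 1",
    "d_most_in_links": "SELECT name, in_links FROM articles "
                       "ORDER BY in_links DESC, name LIMIT 1",
    "e_most_out_links": "SELECT name, out_links FROM articles WHERE edits > 0 "
                        "ORDER BY out_links DESC, name LIMIT 1",
    "f_pages_without_out_links": "SELECT COUNT(*) FROM articles "
                                 "WHERE edits > 0 AND out_links = 0",
}


def load_table(lines):
    """Load seven-field, tab-separated lines into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (name TEXT, edits INTEGER, "
                 "major_edits INTEGER, out_links INTEGER, in_links INTEGER, "
                 "editors INTEGER, first_edit TEXT)")
    parsed = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        first = None if f[6] == MISSING_TIME else f[6]
        parsed.append((f[0], int(f[1]), int(f[2]), int(f[3]),
                       int(f[4]), int(f[5]), first))
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?)", parsed)
    return conn


def run_queries(conn):
    """Run every query and return a dict mapping query name to result rows."""
    return {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
```

The `edits > 0` condition restricts a query to pages, that is, objects that were the subject of at least one edit; query (d) omits it because it asks about objects in general. The four additional queries can be added to `QUERIES` in the same form.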

Attachment:- Assignment Files.rar
