
For this program, you'll be working with a set of text files to see if you can catch a plagiarist. This is a real-world problem that requires a software solution. Your program should (quickly) identify similarities between papers to identify cases where part or all of a paper was copied.
Here is a diagram showing similarities between documents; this is an actual set of physics lab assignments from a large university.

[Figure: Similarities between documents]

Each node (square) in the graph is a document. Each edge (line) connects two documents based on the number of 6-word phrases they have in common. To reduce noise, only documents sharing more than 200 6-word phrases are shown. The red square is the lab manual, and the brown squares are two sample lab reports that were distributed in class. (Apparently many students 'borrowed' heavily from these documents.) But if we look carefully, we notice that several papers share a large number of phrases with each other and with no other documents. For example, a pair at the top left share 718, a pair at the top right share 882, and a pair on the lower left share 863 6-word phrases. It's likely that those people turned in essentially the same lab report or copied from each other.

Your program will read through a directory full of files and identify pairs of files that share more than 200 6-word phrases, as a means of detecting copied work.

It's important to understand what we mean by “6-word phrases.” It's not just looking at 6 words, then the next 6, etc.; after all, even plagiarists are smarter than that. A 6-word phrase is a word and the following 5 words, for each word with at least 5 words after it. So, for example, the text:

Now is the time for all good men to come to the aid of their country.

contains the 6-word phrases:

Now is the time for all
is the time for all good
the time for all good men
time for all good men to
...and so on.
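
The sliding window described above can be sketched in a few lines. This is only an illustration: the function name `phrases` and the parameter `n` are assumptions, not part of the assignment, and the input is assumed to be already split into words.

```python
def phrases(words, n=6):
    """Return every run of n consecutive words as a tuple."""
    # One phrase starts at each word that has at least n - 1 words after it.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Now is the time for all good men to come to the aid of their country"
for phrase in phrases(sentence.split())[:4]:
    print(" ".join(phrase))
```

A text with fewer than `n` words simply produces an empty list, so short files need no special handling.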

Thus, a single extended passage can generate many duplicates. On the other hand, a sentence that happens to begin with “Now is the time...” will be less likely to generate more than a few 'hits' as duplicate phrases. For our purposes, upper- or lower-case doesn't matter, nor does punctuation. You're given a data file containing about 25 (mostly very bad) high school papers downloaded from www.freeessays.cc.

You're welcome to add a few more papers from the same site if you want to see how well your program handles larger data sets. You should NOT hard-code file names, path names, or the number of papers into your code; your program will be tested against a different data set. Inside the directory containing your program, make a subfolder to hold the papers, and unzip the data file into it. Your program will ask the user for the name of the folder, verify that the folder exists (reprompt the user if it doesn't), read the files in it, and find the number of shared 6-word phrases between all possible pairs of files in that folder, reporting all pairs of files which have more than 200 such phrases in common. Report the number of phrases in common, and both file names. When the program ends, the active (current or 'working') directory should be the same as it was when the program started.
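
One possible way to handle the folder prompt and the file reading is sketched below. The function names `ask_for_folder` and `read_files`, the wording of the retry message, and the choice of UTF-8 with replacement of undecodable bytes are all assumptions made for illustration. Building full paths with `os.path.join` means the program never has to call `os.chdir()`, so the working directory is left untouched.

```python
import os


def ask_for_folder(prompt="What's the name of the subfolder with the files? "):
    """Keep prompting until the user names an existing directory."""
    while True:
        name = input(prompt).strip()
        if os.path.isdir(name):
            return name
        print(f"Folder {name!r} not found; please try again.")


def read_files(folder):
    """Return {file name: file contents} for each regular file in folder."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)  # full path, so no os.chdir() is needed
        if os.path.isfile(path):  # skip any nested subfolders
            # The encoding is an assumption; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                texts[name] = handle.read()
    return texts
```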

Hints and development notes:

1) You can read an entire file into one big string, then split it on the spaces to produce a list of separate words. Likewise, you can convert strings to a consistent case, and either ignore punctuation or remove it.

2) A '6-word phrase' is six consecutive words found in the same order in both files. Spacing and capitalization are irrelevant.

3) A naïve approach would be to read in all files, break them up into phrases as needed, then use nested for loops to compare each file against every other. However, this ends up comparing each pair A and B twice: once when A is in the for loop, once when B is in it. Our measure is symmetric; if A shares 234 phrases with B, then B shares 234 phrases with A. Can you arrange things so each pair is only compared once?

4) You will make heavy use of the os and os.path modules in this program, to read directories, select files for reading, etc.
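
Hints 1 to 3 can be combined into a short sketch. It is one possible approach, not the required solution, and it rests on several assumptions: the names `words_of`, `phrase_set`, and `suspicious_pairs` are hypothetical; only ASCII punctuation is removed (so "don't" becomes "dont" and curly quotes survive); and a phrase that repeats within a file is counted once, because phrases are stored in sets. The input `texts` is assumed to be a dictionary mapping file names to file contents. `itertools.combinations` yields each unordered pair exactly once, which addresses hint 3.

```python
import string
from itertools import combinations


def words_of(text):
    """Lower-case the text, remove ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def phrase_set(text, n=6):
    """Return the set of distinct n-word phrases in text."""
    words = words_of(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def suspicious_pairs(texts, threshold=200, n=6):
    """Yield (count, name_a, name_b) for pairs sharing more than threshold phrases."""
    # Build each file's phrase set once, rather than once per comparison.
    sets = {name: phrase_set(text, n) for name, text in texts.items()}
    # combinations() produces each unordered pair exactly once.
    for name_a, name_b in combinations(sorted(sets), 2):
        shared = len(sets[name_a] & sets[name_b])
        if shared > threshold:
            yield shared, name_a, name_b
```

Set intersection keeps each comparison close to linear in the size of the smaller file, which tends to matter more than the loop structure once the files get long.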

Other thoughts:

1) The number of pairs between N elements goes up as the square of N. This means that as the number of papers increases, performance will drop. If your program finishes the small data file in a few seconds, it may take several minutes (or more) to process a group of a hundred files, and hours or even days to process a set of a thousand files (even if you could hold everything you need in memory at once). Practical “real-world size” solutions to this problem may involve generating and storing temporary data that can be re-loaded if needed, or developing some sort of 'profile' of the text and then comparing profiles.

2) The combination of 200 hits and 6-word phrases is fairly arbitrary. You may want to experiment with other ranges to see how that affects the sensitivity of your detector.

3) Our approach is also very mechanical; a student who knew it was in use could make minor modifications to the plagiarized text and reduce the chances of being detected. Again, you may want to try paraphrasing rather than directly copying and see if it makes a difference to your detector.

4) Solving the plagiarism-detection problem quickly is, in general, a VERY difficult problem, especially using only the data structures we've covered so far. By all means, try out different approaches, but don't freak out if your solution doesn't scale up to a large data set well, or doesn't give the results you expected.
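
The first three thoughts above can be explored with a small experiment. The sketch below is illustrative only: `shared_phrases` is a hypothetical helper, the one-word substitution stands in for a paraphrase, and repeated phrases are counted once. It first prints how many pairs must be compared for 25, 100, and 1000 files, then shows how a single changed word affects the count for two phrase lengths.

```python
from math import comb

# The number of unordered pairs among N files is N * (N - 1) / 2.
for n_files in (25, 100, 1000):
    print(n_files, "files ->", comb(n_files, 2), "pairs")


def shared_phrases(words_a, words_b, n):
    """Count distinct n-word phrases that occur in both word lists."""
    grams_a = {tuple(words_a[i:i + n]) for i in range(len(words_a) - n + 1)}
    grams_b = {tuple(words_b[i:i + n]) for i in range(len(words_b) - n + 1)}
    return len(grams_a & grams_b)


original = "now is the time for all good men to come to the aid of their country".split()
edited = list(original)
edited[7] = "people"  # hypothetical one-word paraphrase: "men" -> "people"

# For each phrase length, print hits against an exact copy and against the edit.
for n in (3, 6):
    print(n, shared_phrases(original, original, n), shared_phrases(original, edited, n))
```

A single changed word removes every phrase that overlaps it, so longer phrases lose more hits per edit; this is one reason the choice of phrase length and threshold affects sensitivity.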

Sample run:

What's the name of the subfolder with the files? Documents
You may want to examine the following files:
Files [filename1] and [filename2]
A total of 3 files are listed for this data set; names obscured to keep from making it too easy. --BKH

 
