Imperative Programming - Stylometrics

Our goal here is to use a rudimentary characterization of authors' uses of words to identify the authors of unknown works. We will use dictionaries and simple statistics (really just ratios) to categorize an author's work by the frequency with which they use their 50 most popular words.

For example, say you wanted to characterize the plays written by Shakespeare and the stories written by Melville. You might choose a large sample of each. For example,

1 For Shakespeare you might choose 3 plays: Macbeth, Othello and All's Well that Ends Well.
2 For Melville you might choose Moby Dick, Bartleby and Omoo.

You can find these texts on the Internet, for example at http://www.gutenberg.org; Melville's Moby Dick is at http://www.gutenberg.org/ebooks/2701. Because we want to work with plain text, we should use the Plain Text UTF-8 files, e.g. http://www.gutenberg.org/cache/epub/2701/pg2701.txt. You might then characterize these files by some simple statistics. For example, you might characterize the Shakespeare texts by the words that appear at least a certain number of times (as a percentage of the total number of unique words) in the Shakespeare plays but under some percentage in the Melville texts. You will have to experiment to determine these percentages.
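
One way to experiment with this kind of threshold-based characterization is sketched below. It is only a starting point, not a prescribed method: the function names word_shares and distinctive_words, the tokenizing rule, and the default thresholds are all assumptions made for illustration, and the shares here are measured against the total word count (duplicates included) rather than the number of unique words.

```python
import re
from collections import Counter


def word_shares(text):
    """Map every word in text to its share of all words (duplicates counted)."""
    # Assumption: a word is a run of letters, optionally with inner apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def distinctive_words(text_a, text_b, at_least=0.01, under=0.005):
    """Words with share >= at_least in text_a and share < under in text_b.

    The default thresholds are placeholders; finding useful values is
    part of the experiment.
    """
    shares_a = word_shares(text_a)
    shares_b = word_shares(text_b)
    return {
        word
        for word, share in shares_a.items()
        if share >= at_least and shares_b.get(word, 0.0) < under
    }
```

Calling distinctive_words(shakespeare_text, melville_text) with different thresholds shows how quickly the set of distinguishing words grows or shrinks.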

Then use these characterizations to decide which of, say, 10 different files contain works of Shakespeare and which contain works of Melville. These 10 works can be found on the Internet and saved as files, say file1.txt, …, file10.txt. See if you can use the characterizations (or vocabulary signatures) in this way to identify authors.

Feel free to modify the parameters of this project so long as you at least try this simple characterization.

You may try additional tasks. For example you might work with a larger set of authors. You might try categorizing scientific articles as to their field or sub-fields.

To characterize authors (at least Shakespeare and Melville) use 3 works. For Melville, use:
- http://www.gutenberg.org/cache/epub/2701/pg2701.txt
- http://www.gutenberg.org/cache/epub/11231/pg11231.txt
- http://www.gutenberg.org/cache/epub/4045/pg4045.txt
For Shakespeare, use:
- http://www.gutenberg.org/cache/epub/2264/pg2264.txt
- http://www.gutenberg.org/cache/epub/2267/pg2267.txt
- http://www.gutenberg.org/cache/epub/1125/pg1125.txt

To characterize an author we build a dictionary, one for each author.

1 We read in a large body of work by that author (e.g. 3 works). From this work, we build a dictionary of the work's 50 most frequently used words and their counts (as in wordfreq.py from our handout).

2 We go through the dictionary replacing each count by a ratio:

- We compute this ratio by dividing the count by the total number of words (we should count them as we process them in (1)). The total number of words includes duplicates; it is the total number of words in the entire body of text that we are characterizing.
- So corresponding to each of the 50 most popular words in the author's work is the ratio of the use of that word to the total number of words in the text.
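
The two steps above can be sketched as follows. This is a minimal illustration, not the handout's wordfreq.py: the function name word_ratios, the top_n parameter, and the rule for what counts as a word are assumptions, and you may well want to tokenize differently.

```python
import re
from collections import Counter


def word_ratios(text, top_n=50):
    """Map the top_n most frequent words in text to count / total words."""
    # Assumption: a word is a run of letters, optionally with inner
    # apostrophes, compared case-insensitively.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    total = len(words)  # total includes duplicates
    if total == 0:
        return {}
    counts = Counter(words)
    # most_common(top_n) yields (word, count) pairs, highest counts first.
    return {word: count / total for word, count in counts.most_common(top_n)}
```

For the eight-word text "the cat and the dog and the bird", word_ratios(text, top_n=2) gives "the" a ratio of 3/8 and "and" a ratio of 2/8.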

Then we define a function identifyAuthor(), such that identifyAuthor(filename), where filename is a string name of a file containing the text we want to identify (e.g. an unknown work by one of the authors), returns either the name of the author who we think wrote the work, or "unknown" if we think none of our authors wrote the work. The function identifyAuthor() should do the following:

1 Read in the work from the named file.
2 Build a dictionary, mapping the work's 50 most frequent words to the ratios, calculated in the same way we did for the authors' works.
3 We want to compute a difference, between this dictionary and those for each of the authors:

- For each word in the 50-word dictionary for this unknown work, look up the ratio in both this dictionary and that for the author; if the word is not in the author's dictionary, make it 0 (zero).

- Compute the absolute value of the difference between the two ratios.

- The difference between the dictionary for the unknown work and the dictionary characterization of the author's work is the sum of the differences for the 50 words.

4 We say that the author of the work is that whose dictionary is the least different from that for the unknown work.

5 We define some arbitrary cutoff x (difference) as indicating none of the authors wrote the unknown work: if the difference between the dictionary for the unknown work and the dictionary for every one of the authors is greater than x, we say the author is unknown.
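
Steps 3 to 5 can be sketched as below, assuming the ratio dictionaries have already been built. The names signature_difference and closest_author are hypothetical helpers (identifyAuthor would call something like them), and the cutoff is passed in as a parameter because its value has to come from your own experiments.

```python
def signature_difference(unknown, author):
    """Sum of |unknown ratio - author ratio| over the unknown work's words.

    A word missing from the author's dictionary counts as ratio 0.
    """
    return sum(
        abs(ratio - author.get(word, 0.0)) for word, ratio in unknown.items()
    )


def closest_author(unknown, author_signatures, cutoff):
    """Return the least-different author's name, or "unknown".

    author_signatures maps each author's name to that author's ratio
    dictionary. If even the smallest difference exceeds cutoff, every
    difference does, so the result is "unknown".
    """
    if not author_signatures:
        return "unknown"
    differences = {
        name: signature_difference(unknown, signature)
        for name, signature in author_signatures.items()
    }
    best = min(differences, key=differences.get)
    return best if differences[best] <= cutoff else "unknown"
```

Printing the differences for works whose authors you already know is a simple way to pick a sensible cutoff.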

So that I may test your identification method, make sure you name it identifyAuthor such that identifyAuthor("file") attempts to identify the author who wrote the work in the file named "file", and returns either the string containing the author's name or "unknown".

Experiment as much as possible. Write about your experiments and their results. Show results and discuss them.

You should submit two files to the vault for homework5:

1 memo.txt -- This will contain a (plain-text) narrative explaining the design of your solution, how you experimented in coming up with ratios and cut-offs for identifying authors, and the results of test runs.

2 stylometrics.py -- your Python program that implements your solution, defining identifyAuthor("file") and any helper functions you need. Don't forget your docstrings!

Important:

Your program should read moby.txt, bartleby.txt, and omoo.txt to build a characterization of Melville.

Your program should read macbeth.txt, othello.txt, and allswell.txt to build a characterization of Shakespeare.

These six files will be in my test directory, so all you need to submit is your memo.txt and stylometrics.py.

You must ensure the names are exactly correct; that's part of your assignment.
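
Reading each author's three files into one body of text might look like the sketch below. The constant names and the read_corpus helper are assumptions for illustration; only the six file names come from the assignment. The "utf-8-sig" encoding is chosen on the assumption that the files are UTF-8 and may begin with a byte-order mark, which it strips.

```python
MELVILLE_FILES = ["moby.txt", "bartleby.txt", "omoo.txt"]
SHAKESPEARE_FILES = ["macbeth.txt", "othello.txt", "allswell.txt"]


def read_corpus(filenames, encoding="utf-8-sig"):
    """Return the text of all the named files joined into one string."""
    parts = []
    for name in filenames:
        # The with-statement closes each file even if reading fails.
        with open(name, encoding=encoding) as handle:
            parts.append(handle.read())
    # A newline between files keeps the last word of one file from
    # running into the first word of the next.
    return "\n".join(parts)
```

Because the files sit in the test directory, the names are opened relative to the current working directory, with no path prefix.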

You may experiment with other files, and you should run tests, but be sure to comment all of that experimenting out.

I will simply execute identifyAuthor("some file name") a couple of times.
