
Quotations from the opposing candidates in the 2004 Presidential Debates:

I proposed a constitutional amendment. The reason I did so was because I was worried that activist judges are actually defining the definition of marriage, and the surest way to protect marriage between a man and woman is to amend the Constitution.

I believe marriage is between a man and a woman. But I also believe that because we are the United States of America, we're a country with a great, unbelievable Constitution, with rights that we afford people, that you can't discriminate in the workplace.

In this assignment, your task will be to write a computer program that can (usually) predict whether a quote such as one of these is more likely to have been said by George W. Bush or John F. Kerry. There are many reasonable approaches to this problem. In this assignment, we will consider one of these: we will begin by building a statistical model of the kind of language used by each of the candidates. Then, given a new piece of text, we will determine who said it by measuring how well it fits each of the statistical models, and selecting the one that gives the best fit.

Markov models. The probabilistic language models we will be using are called Markov models. Markov models are at the heart of modern speech recognition systems, and are used for a broad range of natural language processing tasks, as well as many other problems relevant to artificial intelligence. They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.

A Markov model defines a probability distribution over sequences of symbols. Said differently, a Markov model defines a probabilistic mechanism for randomly generating sequences over some alphabet of symbols. In a zero-th order Markov model, the symbols in the sequence are generated with a fixed probability that does not in any way depend on other symbols in the sequence. For instance, the alphabet might consist of just the letters a and b, and the zero-th order Markov model might specify that each symbol in the sequence is a with probability 2/3 and b with probability 1/3. Thus, such a zero-th order Markov model might generate a sequence like the following:

a b a b a a a b b b a a a a b b b a a b a a a a a b a a a a b a a a b b

In a first order Markov model, the probability of the current symbol that is being generated can depend on the preceding symbol. For instance, consider a sequence in which every occurrence of b is always followed by a, while an a is followed with equal probability by either a or b. Such a model might generate a sequence like the following:

b a b a b a b a a b a b a b a b a a b a a b a a a a a a b a b a a a a b

In this example, the sequence is likely to have about the same proportion of a's and b's as in the preceding example. However, the model is first order rather than zero-th order because whether or not the current symbol is a or b significantly affects the probability of what symbol will be generated next.
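The first order example above can be sketched in a few lines of Java. This is only an illustration: the class name FirstOrderDemo, the method generate, the fixed seed, and the choice to start as if the previous symbol were b are all assumptions made here, not part of the assignment.

```java
import java.util.Random;

// Hypothetical demo class; not part of the assignment's required API.
class FirstOrderDemo {
    // Generate n symbols from the first order model described above:
    // 'b' is always followed by 'a'; 'a' is followed by 'a' or 'b' with equal probability.
    static String generate(int n, long seed) {
        Random rng = new Random(seed);
        StringBuilder sb = new StringBuilder();
        char prev = 'b';  // assumed start state, so the first symbol emitted is 'a'
        for (int i = 0; i < n; i++) {
            char next;
            if (prev == 'b') {
                next = 'a';                           // b is always followed by a
            } else {
                next = rng.nextBoolean() ? 'a' : 'b'; // a is followed by a or b, 50/50
            }
            sb.append(next);
            prev = next;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(36, 42L));
    }
}
```

Whatever the seed, the generated string never contains two consecutive b's, which is exactly the dependence on the preceding symbol that a zero-th order model cannot express.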

This idea can be generalized to a k-th order Markov model in which the probability of the current symbol can depend on the preceding k symbols.

Markov models are frequently used to build a probabilistic model or statistical estimate of a natural language. In modeling ordinary English text, a zero-th order model is clearly inadequate for most purposes, capturing only the frequencies of each letter. On the other hand, a first order Markov model can capture the notion, for instance, that a q is nearly always followed by u. Higher order Markov models can capture the tendency of the letters proba to be followed by b or t. (In other language applications, entire words are regarded as the "letters" of the "alphabet" so that the Markov model provides a model of the sequences of words rather than letters that are likely to be observed. We will not consider such a framework in this assignment, but you are welcome to explore it on your own.)

Language modeling. The first part of the assignment is to implement a class called MarkovModel.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(p·c) / N(p), where N(p·c) denotes the number of times we observe the concatenated sequence p·c, and N(p) denotes the number of times that we observe p.

Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(p·c) + 1) / (N(p) + S), where S is the size of our alphabet.

This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.

For example, if the input text is "aabcabaacaac", and we are using a second-order Markov model (k = 2) and the three-letter alphabet is {a, b, c}, then we can estimate the probability that "aa" is followed by 'c' by (2 + 1) / (3 + 3) = 1/2 since N("aa") = 3, N("aac") = 2 and S = 3. Similarly, the probabilities that "aa" is followed by 'a' and 'b' are 1/6 and 1/3, respectively. Note that the probabilities sum to 1.

One thing to notice here: To handle the beginning and end of the string, we treat the string as circular. Thus, N("caa") = 2 instead of 1.
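The circular counting and the smoothed estimate described above might be sketched as follows. The class LaplaceSketch and its method names are hypothetical and chosen only for illustration; recomputing the counts on every call is a simplification, and a real MarkovModel would compute and store them once.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class for illustration; the assignment's own data type is MarkovModel.
class LaplaceSketch {
    // Count every substring of the given length, treating the text as circular.
    static Map<String, Integer> circularCounts(String text, int len) {
        Map<String, Integer> counts = new TreeMap<>();
        int n = text.length();
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < len; j++) {
                sb.append(text.charAt((i + j) % n));  // wrap around past the end
            }
            counts.merge(sb.toString(), 1, Integer::sum);
        }
        return counts;
    }

    // Number of distinct characters in the text (the alphabet size S).
    static int alphabetSize(String text) {
        Set<Character> seen = new TreeSet<>();
        for (char ch : text.toCharArray()) seen.add(ch);
        return seen.size();
    }

    // Laplace-smoothed estimate (N(p·c) + 1) / (N(p) + S) for a string s = p·c of length k + 1.
    static double laplace(String corpus, int k, String s) {
        Map<String, Integer> contexts = circularCounts(corpus, k);
        Map<String, Integer> extended = circularCounts(corpus, k + 1);
        int nP = contexts.getOrDefault(s.substring(0, k), 0);
        int nPC = extended.getOrDefault(s, 0);
        return (nPC + 1.0) / (nP + alphabetSize(corpus));
    }

    public static void main(String[] args) {
        String corpus = "aabcabaacaac";
        System.out.println(circularCounts(corpus, 3).get("caa")); // N("caa") with wrap-around
        System.out.println(laplace(corpus, 2, "aac"));            // estimate for 'c' after "aa"
    }
}
```

On the example corpus this reproduces N("caa") = 2 and the estimate 1/2 for "aa" followed by 'c'.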
So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N(·) by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use. You should also compute and record the alphabet size S. Organize your program by creating a data type MarkovModel with the following API:

    public class MarkovModel {
        public MarkovModel(int k, String corpus)  // build an order-k Markov model from corpus
        public double laplace(String s)           // return Laplace-smoothed probability estimate of s
        public String toString()                  // return a string representation of this model
    }

Now, the first line below computes the number of occurrences of each substring of corpus of length k and k+1; the second line prints the model to standard output; and the remaining lines compute and print Laplace-smoothed probability estimates.

    MarkovModel model = new MarkovModel(k, corpus);
    StdOut.println(model);
    StdOut.printf("%.4f\n", model.laplace("aac"));
    StdOut.printf("%.4f\n", model.laplace("aaa"));
    StdOut.printf("%.4f\n", model.laplace("aab"));

For example, when k is 2 and corpus is "aabcabaacaac", this code fragment should print the following:

    alphabet size S = 3
    "aa" 3
    "ab" 2
    "ac" 2
    "ba" 1
    "bc" 1
    "ca" 3
    "aab" 1
    "aac" 2
    "aba" 1
    "abc" 1
    "aca" 2
    "baa" 1
    "bca" 1
    "caa" 2
    "cab" 1
    0.5000
    0.1667
    0.3333

Determining the likelihood of a new sequence. Next, we will want to use the Markov model that you constructed from data to compute a measure of how well that model fits a new text sequence. Later, we will use this measure to determine which of two models (one for Bush and one for Kerry) better fits a test sequence. To compute a measure of fit, we will compute the probability of the model generating the new sequence, a quantity usually called the likelihood of the sequence under the model. For each symbol c in the sequence, we can compute the probability of observing c under the model, given its k-letter context p. In particular, this is just the Laplace-smoothed estimate given above. To compute the likelihood, or probability of the entire sequence, we can multiply these probabilities together, for all symbols in the sequence. Although correct mathematically, the resulting product might be indistinguishable from zero using floating point arithmetic. Therefore, rather than multiplying probabilities, you will need to work with log probabilities. That is, for each symbol c in the sequence, you will need to compute the log of its probability, and you will then add (rather than multiply) these log probabilities to arrive at the final log likelihood of the sequence. Keep in mind that all of these log probabilities will be negative (since probabilities are never more than one).
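The floating point problem mentioned above is easy to reproduce. In the following sketch the class UnderflowDemo and the values 0.1 and 400 are arbitrary choices made for illustration only.

```java
// Hypothetical demo class showing why log probabilities are used instead of raw products.
class UnderflowDemo {
    // Multiply the probability p by itself n times.
    static double product(double p, int n) {
        double result = 1.0;
        for (int i = 0; i < n; i++) result *= p;
        return result;
    }

    // Add log(p) n times; this is the log of the same product.
    static double logSum(double p, int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += Math.log(p);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(product(0.1, 400)); // underflows to 0.0 in double precision
        System.out.println(logSum(0.1, 400));  // about -921.03, still representable
    }
}
```

A product of 400 probabilities of 0.1 is already below the smallest positive double, while the corresponding sum of logs is an ordinary negative number that can still be compared between models.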

Your second job is to compute, for a given string, and for each position i in the string, an estimate of the log probability of the character at position i, given its k-character context. The log likelihood of this entire sequence can then be computed as the sum of these log probabilities.

For instance, continuing the example above, having constructed the second order (k = 2) Markov model, suppose we are given a new string "aabca". We can compute the log probability of each of the five symbols in this string as follows (where the first column gives the context of each of the symbols in the second column):

    context   c     log probability
    ------------------------------------------------
    "aa"      'b'   log((1 + 1) / (3 + 3)) = -1.0986
    "ab"      'c'   log((1 + 1) / (2 + 3)) = -0.9163
    "bc"      'a'   log((1 + 1) / (1 + 3)) = -0.6931
    "ca"      'a'   log((2 + 1) / (3 + 3)) = -0.6931
    "aa"      'a'   log((0 + 1) / (3 + 3)) = -1.7918

    TOTAL log likelihood   = -5.1930
    AVERAGE log likelihood = -1.0386

One thing to notice here: To handle the beginning and end of the string, we again treat the string as circular. Thus, in "aabca", the character following the context "ca" is 'a'.
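The worked example above can be checked with a short sketch. The class LogLikelihoodSketch and its methods are hypothetical, and the alphabet size is passed in as a parameter rather than computed. To stay self-contained it counts occurrences by rescanning the corpus, which is far slower than the precomputed counts a real MarkovModel would use. It indexes results by position in the string, so the rows come out in a different order than in the table, but the values and totals are the same.

```java
// Hypothetical sketch; a real MarkovModel would look up stored counts instead of rescanning.
class LogLikelihoodSketch {
    // Count circular occurrences of pattern in corpus by direct comparison.
    static int occurrences(String corpus, String pattern) {
        int n = corpus.length();
        int count = 0;
        for (int i = 0; i < n; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length() && match; j++) {
                match = corpus.charAt((i + j) % n) == pattern.charAt(j);
            }
            if (match) count++;
        }
        return count;
    }

    // Log probability of each character of text, given its k-character circular context.
    static double[] logProbs(String corpus, int k, int alphabetSize, String text) {
        int n = text.length();
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            StringBuilder context = new StringBuilder();
            for (int j = k; j >= 1; j--) {
                context.append(text.charAt(((i - j) % n + n) % n)); // wrap before the start
            }
            String p = context.toString();
            String pc = p + text.charAt(i);
            result[i] = Math.log((occurrences(corpus, pc) + 1.0)
                               / (occurrences(corpus, p) + alphabetSize));
        }
        return result;
    }

    // Sum of the per-character log probabilities (the total log likelihood).
    static double total(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        double[] lp = logProbs("aabcabaacaac", 2, 3, "aabca");
        System.out.println(total(lp));             // total log likelihood
        System.out.println(total(lp) / lp.length); // average log likelihood
    }
}
```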

Choosing the best model. Now that we have a method of measuring the fit of a model to a particular string, we can choose the best model to be the one that maximizes the likelihood of the string. In other words, we can simply measure the likelihood of the string under each model (Bush or Kerry) and choose the one that is greatest.

Write a program TopModel.java as follows: its first argument is the order parameter k; its next two arguments are the names of the files containing excerpts from the two candidates (or other text that you might choose to experiment with). The remaining arguments are the names of the files containing text sequences you wish to classify. Your program should build models for each of the two candidates, read in each text sequence, and classify it using the output format described below.

Input format. Each file is a sequence of Unicode characters. We provide data files from the three presidential debates. You should train using the first two debates and test on the third. Read the files using the In.java library that we provide.

    In in = new In("filename.txt");
    String corpus = in.readAll();

For amusement, you might also see how your Bush/Kerry models predict when attempting to classify quotes from John Edwards and Dick Cheney taken from the vice presidential debate. Or you might use the data from the Obama/McCain and Biden/Palin debates.

Output format. For each text sequence, your program should output the filename of the text block and its average log likelihood under each model (that is, the log likelihood as computed above, divided by the length of the sequence; this is simply to prevent unusually long sequences from seeming more significant than they are). Also, print the difference of these two numbers; the sign of the difference will indicate which model is more likely, and the absolute value of the difference can be interpreted as a measure of confidence in this prediction.
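One possible way to produce such a summary line is sketched below. It assumes the per-character log probabilities under each model are already available as arrays. The class ModelChoice, its methods, and the exact spacing of the line are illustrative assumptions, not a required format.

```java
import java.util.Locale;

// Hypothetical helper for the per-file summary line; names and spacing are assumptions.
class ModelChoice {
    // Average of the per-character log probabilities.
    static double average(double[] logProbs) {
        double sum = 0.0;
        for (double v : logProbs) sum += v;
        return sum / logProbs.length;
    }

    // Filename, average under model 1, average under model 2, and the signed difference.
    // A positive difference favors model 1; a negative one favors model 2.
    static String summaryLine(String filename, double[] l1, double[] l2) {
        double a1 = average(l1);
        double a2 = average(l2);
        return String.format(Locale.US, "%s %.4f %.4f %+.4f", filename, a1, a2, a1 - a2);
    }

    public static void main(String[] args) {
        double[] l1 = {-2.0, -2.5};
        double[] l2 = {-2.5, -3.0};
        System.out.println(summaryLine("example.txt", l1, l2));
    }
}
```

The %+.4f conversion always prints the sign, which makes the predicted model visible at a glance.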

To get an idea of why the algorithm is making its predictions, you should also find and print the ten positions in the test sequence where the difference in the log probabilities of the two models is greatest. That is, if the log probability of character i (given its context) under model j is lj[i], then you should find the 10 indices i for which |l1[i] - l2[i]| is greatest. For each of these, print i's context, character i, l1[i], l2[i] and l1[i] - l2[i]. For instance, you might get output like this:

    % java TopModel 2 bush1+2.txt kerry1+2.txt bush3-00.txt bush3-01.txt
    bush3-00.txt -2.1131 -2.1895 +0.0764
    "spr" -2.715 -4.963 +2.247
    "eek" -5.617 -3.457 -2.160
    "eek" -5.617 -3.457 -2.160
    "ari" -6.011 -3.901 -2.110
    "siv" -3.843 -5.900 +2.057
    "siv" -3.843 -5.900 +2.057
    "siv" -3.843 -5.900 +2.057
    "ban" -2.338 -4.174 +1.836
    "sig" -2.744 -4.514 +1.769
    "fea" -2.645 -4.401 +1.756

    bush3-01.txt -2.1467 -2.1607 +0.0140
    "Of " -2.351 -4.331 +1.979
    " Of" -2.543 -4.382 +1.839
    " My" -1.674 -3.308 +1.634
    "My " -1.376 -2.983 +1.607
    "sal" -4.956 -3.418 -1.538
    "sal" -4.956 -3.418 -1.538
    "nen" -2.275 -3.703 +1.428
    " Go" -2.970 -4.357 +1.386
    "k I" -5.380 -4.003 -1.377
    "dis" -1.946 -3.292 +1.346

The first line indicates that for the file bush3-00.txt, the average log likelihood under the model built using bush1+2.txt was -2.1131, and -2.1895 under the kerry1+2.txt model. The difference of these numbers is +0.0764, which, being positive, indicates an overall prediction that this quote was uttered by Bush. The next line indicates that the most significant difference in log probabilities between the two models came in predicting 'r' under context "sp", where the Bush log probability was -2.715 and the Kerry log probability was -4.963, the difference being +2.247.
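Selecting the positions with the largest differences can be done by sorting indices on |l1[i] - l2[i]|. The sketch below is one straightforward way to do it; the class TopDifferences and the method topIndices are hypothetical names, and sorting all positions is a simplicity choice rather than the only option.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper; sorts all indices, which is simple but not the only possible approach.
class TopDifferences {
    // Return up to `count` indices i with the largest |l1[i] - l2[i]|, largest first.
    static int[] topIndices(double[] l1, double[] l2, int count) {
        Integer[] order = new Integer[l1.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Sort by descending absolute difference (negated key gives descending order).
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -Math.abs(l1[i] - l2[i])));
        int m = Math.min(count, order.length);
        int[] top = new int[m];
        for (int r = 0; r < m; r++) top[r] = order[r];
        return top;
    }

    public static void main(String[] args) {
        double[] l1 = {-1.0, -5.0, -2.0, -3.0};
        double[] l2 = {-1.5, -2.0, -2.0, -7.0};
        // Differences are +0.5, -3.0, 0.0, +4.0, so positions 3 and 1 come first.
        System.out.println(Arrays.toString(topIndices(l1, l2, 2)));
    }
}
```

Because the sort used for object arrays is stable, positions with equal differences keep their original left-to-right order.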

When printing characters from the text, you should convert all white space to ordinary spaces.
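A minimal sketch of that conversion follows. The class WhitespaceSketch and the method normalize are hypothetical, and the sketch assumes that "white space" means whatever Character.isWhitespace reports, which the assignment does not specify.

```java
// Hypothetical helper; assumes Character.isWhitespace defines what counts as white space.
class WhitespaceSketch {
    // Replace every white space character (newline, tab, ...) with an ordinary space.
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            sb.append(Character.isWhitespace(ch) ? ' ' : ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("\"" + normalize("k\nI") + "\""); // context spanning a line break
    }
}
```

This keeps one output character per input character, so positions in the printed context still line up with positions in the text.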

Analysis. Analyze your approach to this problem giving estimates of its time and space requirements. Also, critically explore the effectiveness of this technique for this task.

Submission. Submit TopModel.java, MarkovModel.java, and any other files needed by your program (excluding those in stdlib.jar and adt.jar). Finally, submit a readme.txt file and answer the questions
