A. Introduction

Sentiment analysis is a subfield of NLP concerned with determining opinion and subjectivity in a text. It has applications in the analysis of online product reviews, recommendations, blogs, and other kinds of opinionated documents.

In this assignment you will develop classifiers for sentiment analysis of movie reviews using Support Vector Machines (SVMs), following the approach of the paper by Pang, Lee, and Vaithyanathan [1], which was the foremost research on this topic. The goal is to develop a classifier that assigns a movie review a label of "positive" or "negative", predicting whether the author of the review liked or disliked the movie.

You may use Java or Python, as you prefer, for this assignment, but for the machine learning you should use SVMlight (section D): http://svmlight.joachims.org/

B. Data

The data (accessible on the course web page) consists of 1,000 positive and 1,000 negative reviews. These have been divided into training, validation, and test sets of 800, 100, and 100 reviews, respectively. To discourage you from optimizing against the test set while building your classifiers, the test data will not be immediately available.
The reviews were obtained from Pang's website [2], and then part-of-speech tagged using a bidirectional Maximum Entropy Markov Model [3, 4].

Each document is formatted as one sentence per line. Each token is of the format word/POStag, where a "word" also includes punctuation. Each word is in lowercase. There is sometimes more than one slash in a token, such as in preparer/director/NN.
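The token format above can be parsed by splitting on the last slash only, which handles tokens such as preparer/director/NN. The following is a minimal sketch; the function names parse_token and parse_document are illustrative, and it assumes tokens are separated by whitespace.

```python
def parse_token(token):
    """Split 'word/POStag' on the LAST slash, so extra slashes stay in the word."""
    word, _, tag = token.rpartition("/")
    return word, tag


def parse_document(text):
    """Return a list of sentences, each a list of (word, tag) pairs.

    Assumes one sentence per line and whitespace-separated tokens.
    """
    return [[parse_token(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


sample = "the/DT preparer/director/NN was/VBD great/JJ ./.\n"
print(parse_document(sample))
```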

C. Baseline system

For a baseline system, think of 20 words that you think would be indicative of a positive movie review, and 20 words that you think would be indicative of a negative review.

To develop the baseline classifier, take this approach: given a movie review, count how many times it contains a positive word and how many times it contains a negative word (token occurrences). Assign the label POSITIVE if the review contains more positive words than negative words. Assign the label NEGATIVE if it contains more negative words than positive words. If there are an equal number of positive and negative words, it is a TIE.
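The counting rule above could be sketched as follows. The two word sets are short placeholders for illustration only, not a suggested lexicon; you would substitute your own 20 + 20 words. The input is assumed to be the list of lowercase words of one review, with tags already removed.

```python
# Placeholder lexicons (hypothetical); replace with your own 20 words each.
POSITIVE_WORDS = {"great", "excellent", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "awful"}


def baseline_label(words, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Label a review by comparing token occurrences of positive and negative words."""
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    if pos > neg:
        return "POSITIVE"
    if neg > pos:
        return "NEGATIVE"
    return "TIE"


print(baseline_label("a great and brilliant but boring film".split()))
```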

D. Machine learning

The machine learning software to be used is SVMlight [5], which learns Support Vector Machines for binary classification. It is available for UNIX systems, Windows, and Mac OS X.

You will need to read the documentation on the SVMlight website to figure out how to use the software. To test whether you know how to use it, it might be helpful to first create a small "toy" dataset by hand, and then train and test the SVM on it. When training the classifier, choose the option for classification:

-z {c,r,p} - select between classification (c), regression (r), and preference ranking (p)
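As a rough sketch, the training and classification runs could be driven from Python. This assumes the SVMlight executables are installed on your PATH under the names svm_learn and svm_classify; the file names are hypothetical, and mapping an output of exactly 0 to -1 is an arbitrary choice made here.

```python
def learn_command(train_file, model_file):
    """Argument list for training in classification mode (-z c)."""
    return ["svm_learn", "-z", "c", train_file, model_file]


def classify_command(test_file, model_file, predictions_file):
    """Argument list for classifying examples with a trained model."""
    return ["svm_classify", test_file, model_file, predictions_file]


def read_predictions(lines):
    """Map each SVM output value (one per line) to +1 or -1 by its sign."""
    return [1 if float(line) > 0 else -1 for line in lines if line.strip()]


# To actually run the tools (not executed here):
#   import subprocess
#   subprocess.run(learn_command("toy_train.dat", "toy.model"), check=True)
print(learn_command("toy_train.dat", "toy.model"))
```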

A training file is of the format:

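<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>

Since we are doing binary classification, the value of <target> should be +1 or -1.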

Every feature (which might be expressed as an integer or a string) is associated with a value, which is a floating-point number. If you want a feature to be binary-valued, you may use values of 0.0 and 1.0.

With binary features, it is not necessary to include an explicit representation of features that do not occur. For example, suppose a document contains 100 different words out of a vocabulary of 50,000 possible words. If you are using binary features, it suffices to include a feature with a value of 1.0 for each of the words that do occur. You do not have to include a feature with a value of 0.0 for each of the 49,900 words that do not appear in the document.
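A sparse binary example line might be produced as in the sketch below. It assumes a hypothetical dictionary feature_ids that maps each vocabulary word to an integer feature number starting at 1, and it emits the features in increasing order of feature number, which the SVMlight documentation requires.

```python
def svmlight_line(label, words, feature_ids):
    """Build one training line with a 1.0-valued feature per distinct known word.

    label       -- +1 or -1
    words       -- the words of one review
    feature_ids -- hypothetical mapping from word to integer feature number (>= 1)
    """
    # Only words in the vocabulary become features; absent words are simply omitted.
    ids = sorted({feature_ids[w] for w in words if w in feature_ids})
    features = " ".join(f"{i}:1.0" for i in ids)
    return f"{label:+d} {features}".rstrip()


feature_ids = {"good": 1, "bad": 2, "film": 3}
print(svmlight_line(1, ["film", "good", "good", "unknown"], feature_ids))
```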

You do not need to perform smoothing.

E. Feature sets

Use these feature sets for training and testing your classifier:

1. unigrams
2. bigrams
3. unigrams + POS
4. adjectives
5. top unigrams
6. optimized

Detailed explanation:

1. unigrams: use the word unigrams that occurred >= 4 times in the training data. Let this quantity be N.
2. bigrams: use the N most-frequent bigrams.
3. unigrams + POS: use all combinations of word/tag for each of the unigrams in (1). Since a word may occur with multiple tags, the quantity of this type of feature will be greater than N.
4. adjectives: use the adjectives that occurred >= 4 times. Let this quantity be M.
5. top unigrams: use the M most-frequent unigrams.
6. optimized: select any combination of features you would like, to try to produce the best classifier possible. For example, you might choose different cutoff values for the frequencies of different types of features. You could also create entirely new types of features, or try different settings for training the SVM. The optimized classifier should be produced through a process of repeatedly training the classifier and computing its performance on the validation set.
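The frequency-based selections in feature sets (1) and (2) could be sketched as follows. Each document is assumed to be a flat list of words from the training data; counting bigrams across sentence boundaries is a simplification made here, and how ties in frequency are broken is left to the counter. The function names are illustrative.

```python
from collections import Counter


def frequent_unigrams(documents, min_count=4):
    """Unigrams occurring at least min_count times in the training documents."""
    counts = Counter(w for doc in documents for w in doc)
    return {w for w, c in counts.items() if c >= min_count}


def top_bigrams(documents, n):
    """The n most frequent adjacent word pairs in the training documents."""
    counts = Counter(pair for doc in documents for pair in zip(doc, doc[1:]))
    return [pair for pair, _ in counts.most_common(n)]


docs = [["a", "b", "a", "b"], ["a", "b", "c"], ["a", "c"]]
unigrams = frequent_unigrams(docs)       # N is len(unigrams)
print(unigrams, top_bigrams(docs, len(unigrams)))
```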

F. Evaluation

Train the SVMs on the training data and perform preliminary tests on the validation data. To evaluate your classifiers, compute the accuracy on the test data, which is the percentage of movie reviews correctly classified. For the baseline classifier, also report the number of ties.
Evaluate your classifiers on the test data when it is released. Do not further optimize your system based on its performance on the test data.
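Accuracy and the tie count could be computed as in this sketch. It assumes gold labels and predictions are parallel lists, and it counts a TIE as an incorrect prediction, which is an assumption rather than something the assignment specifies.

```python
def evaluate(gold, predicted):
    """Return (accuracy in percent, number of ties).

    A prediction of "TIE" never matches a gold label, so ties count as errors here.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    ties = sum(1 for p in predicted if p == "TIE")
    accuracy = 100.0 * correct / len(gold)
    return accuracy, ties


gold = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
predicted = ["POSITIVE", "TIE", "NEGATIVE", "NEGATIVE"]
print(evaluate(gold, predicted))
```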

G. Turn in

Produce a document that states:

- Short descriptions of the attached files
- A list of the positive and negative words selected for your baseline system
- Performance of the baseline system on the test set
- A table listing the number of distinct features for each feature set. Since the split of the data into training and test sets is not exactly the same as Pang et al.'s, the number of distinct features will be similar to theirs, but not identical.
- A table of performance of classifiers on validation set and test set
- A written comparison of your results with Pang et al.'s (minimum 5 lines)
- A table listing the 50 most frequently misclassified reviews (across all 6 classifiers) in the validation set, and the number of classifiers by which each was misclassified. For instance, the review cv808_12635.txt may have been misclassified by 4 classifiers. Identify 5 different attributes of the frequently misclassified reviews, showing excerpts from 2 reviews for each attribute. For each of these attributes, describe a feature that could plausibly be added to improve performance.
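The misclassification table in the last item could be assembled along these lines. The data layout is an assumption: gold maps each review file name to its true label, and predictions_by_classifier maps a classifier name to a dictionary of its predicted labels. The file names in the example are made up.

```python
from collections import Counter


def misclassification_counts(gold, predictions_by_classifier, top=50):
    """Return (review, number of classifiers that got it wrong), most frequent first."""
    counts = Counter()
    for predictions in predictions_by_classifier.values():
        for review, label in predictions.items():
            if label != gold[review]:
                counts[review] += 1
    return counts.most_common(top)


gold = {"a.txt": 1, "b.txt": -1}
predictions_by_classifier = {
    "unigrams": {"a.txt": -1, "b.txt": -1},
    "bigrams": {"a.txt": -1, "b.txt": 1},
}
print(misclassification_counts(gold, predictions_by_classifier))
```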

H. Submission

A compressed directory, containing:

- All source code
- One example of a feature file that you produced
- Your written document
- Any additional files that you would like to attach
