Assignment 2: K-nearest neighbor for text classification.

The goal of text classification is to identify the topic of a piece of text (news article, blog post, etc.). Text classification has obvious utility in an age of information overload, and it has become a popular application area for machine learning algorithms. In this project, you will implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

1.       Download the dataset from my website. It was created from the original collection and contains a training file, a test file, the list of topics, and a description of the train/test file format.

2.       Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try the following distance or similarity measures with their corresponding representations.

a.        Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

b.       Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

c.         Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.
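The boolean and count representations in (a) and (b) can be sketched as follows. This is a minimal illustration, not the required solution: it assumes each document has already been tokenized into a list of words, and the function names (build_vocabulary, boolean_vector, count_vector, hamming_distance, euclidean_distance) and the toy documents are hypothetical, not part of the dataset.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """Return a sorted list of every distinct word in the tokenized documents."""
    return sorted({word for doc in documents for word in doc})


def boolean_vector(doc, vocabulary):
    """Representation (a): 1 if the word appears in the document, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocabulary]


def count_vector(doc, vocabulary):
    """Representation (b): number of occurrences of each word (possibly zero)."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


def hamming_distance(u, v):
    """Number of positions at which two boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy documents (made up for illustration only).
docs = [["oil", "price", "oil"], ["wheat", "price"]]
vocab = build_vocabulary(docs)

bool_vecs = [boolean_vector(d, vocab) for d in docs]
count_vecs = [count_vector(d, vocab) for d in docs]

print(hamming_distance(bool_vecs[0], bool_vecs[1]))
print(euclidean_distance(count_vecs[0], count_vecs[1]))
```

Dense lists keep the sketch short. With a realistic vocabulary most entries are zero, so a sparse representation such as a dictionary from word to count may be preferable.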

3.        Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the count stored in the vector in (b)). TF stands for term frequency: TF(d,w) = N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency: IDF(d,w) = log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w. The base of the logarithm is irrelevant; you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(d,w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details online or in a standard IR textbook.
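The TF-IDF weights and the cosine similarity from (c) could be computed along these lines. This is a sketch under stated assumptions: documents are lists of tokens, the function names are hypothetical, and the handling of edge cases (an empty document, a word that occurs in no counted document, a zero-norm vector) is an assumption of this sketch, since the assignment does not specify it.

```python
import math
from collections import Counter


def document_frequencies(documents):
    """C(w): the number of documents that contain each word."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each word once per document
    return df


def tfidf_vector(doc, vocabulary, df, num_docs):
    """TF(d,w) * IDF(d,w) for every word in the vocabulary."""
    counts = Counter(doc)
    total_words = len(doc)  # W(d)
    vector = []
    for word in vocabulary:
        if total_words == 0 or df[word] == 0:
            # Assumption: give weight 0 when TF or IDF is undefined.
            vector.append(0.0)
        else:
            tf = counts[word] / total_words
            idf = math.log(num_docs / df[word])
            vector.append(tf * idf)
    return vector


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Assumption: similarity with a zero vector is 0.
    return dot / (norm_u * norm_v)


# Toy documents (made up for illustration only).
docs = [["the", "oil"], ["the", "wheat"]]
vocab = sorted({w for d in docs for w in d})
df = document_frequencies(docs)
vectors = [tfidf_vector(d, vocab, df, len(docs)) for d in docs]

print(vectors[0])
print(cosine_similarity(vectors[0], vectors[1]))
```

In the toy example "the" appears in every document, so its IDF is log(2/2) = 0 and it contributes nothing to either vector, which is the discounting effect described above.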

4.       Try k = 1, k = 3, and k = 5 with each of the representations above. Note that with a distance measure, the k nearest neighbors are the training documents with the smallest distances from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
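The difference between ranking by distance and ranking by similarity can be handled with a single flag, as in the sketch below. The function knn_predict, its parameter higher_is_closer, the majority vote, and the tie-breaking rule (a tie goes to the label whose closest member is nearest) are all assumptions of this sketch; the assignment does not say how to combine the neighbors' topics. The vectors and topic labels are made up for illustration.

```python
from collections import Counter


def knn_predict(query, train_vectors, train_labels, k, score, higher_is_closer=False):
    """Predict a label by majority vote among the k closest training vectors.

    score(query, vector) is a distance when higher_is_closer is False
    and a similarity when it is True.
    """
    scored = [(score(query, vec), label)
              for vec, label in zip(train_vectors, train_labels)]
    # Smallest distance first, or highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_closer)
    votes = Counter(label for _, label in scored[:k])
    # max returns the first maximal key; keys were inserted nearest-first,
    # so ties are broken in favor of the closer neighbor.
    return max(votes, key=votes.get)


def hamming_distance(u, v):
    return sum(a != b for a, b in zip(u, v))


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy training set (made up for illustration only).
train_vectors = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
train_labels = ["crude", "crude", "grain", "grain"]

for k in (1, 3):
    by_distance = knn_predict([1, 0, 0], train_vectors, train_labels, k,
                              hamming_distance)
    by_similarity = knn_predict([0, 1, 1], train_vectors, train_labels, k,
                                dot_product, higher_is_closer=True)
    print(k, by_distance, by_similarity)
```

Sorting the whole training set is the simplest approach. For a large training file, selecting only the k best scores (for example with heapq.nsmallest or heapq.nlargest) avoids the full sort.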

 

 
