Data Mining Assignment: Clustering and Basic Classification

OBJECTIVES

Learn some of the clustering features of Scikit-learn to do partitioning and hierarchical clustering (k-Means and hierarchical clustering algorithms);

Learn about document clustering and document similarity scoring using TF-IDF;

Learn to use the built-in k-nearest neighbors implementation in Scikit-learn and interpret its output - this is an extension of the implementation you did last time;

Learn how to binarize categorical variables in Scikit-learn;

Learn how to use DecisionTreeClassifier to build a basic classifier.

DATA FOR PART 1

We will be working with movie data from IMDB, a movie database that is widely used to learn about (and rate) movies. Much of the work around movies focuses on predicting ratings - for example, the Netflix Prize contest was designed to encourage developers to explore better algorithms for rating movies. Instead of predicting ratings, we will cluster the plots of movies. The data comes from the OMDB API, which allows a developer to extract information from IMDB programmatically, since IMDB does not publish an open public API itself.

You can view the notebook here to see how the data was extracted, but you can skip that step and look directly at the file that is the output of that process. You can find the data for this assignment in the data directory, where you will see a TSV file called

data/top1000_movie_summaries.tsv.

BACKGROUND FOR PART 1

Document clustering is a common task in text mining and has broad applications in a variety of contexts. In the unsupervised context, such clustering provides insights into a set of documents and the common features they share. In the supervised context, the same document features allow one to train a model and subsequently classify documents. For example, to determine whether a document is of a certain kind (e.g. legal, academic), one can use labeled instances to learn the features that discriminate between unlabeled/unseen instances.
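As a rough illustration of the unsupervised case, the sketch below clusters a few documents with both k-Means and hierarchical (agglomerative) clustering in Scikit-learn. The four summaries are made-up stand-ins for the movie plots, and the choice of two clusters, the TF-IDF representation (explained under the term-frequency heading below), and the default Ward linkage are assumptions made for the example, not requirements of the assignment.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy summaries; the real data lives in the TSV file.
plots = [
    "a detective hunts a killer in the city",
    "a detective chases a killer across the city",
    "a robot explores a distant planet",
    "a lonely robot repairs a ship on a distant planet",
]

# Turn each summary into a TF-IDF weighted term vector (sparse matrix).
X = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Partitioning: k-Means with k=2 (k is an assumption for this toy data).
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: agglomerative clustering needs a dense array.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

print("k-Means labels:     ", kmeans_labels)
print("hierarchical labels:", hier_labels)
```

The cluster numbers themselves are arbitrary; what matters is which documents share a label. Here the two detective summaries should land together, as should the two robot summaries.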

There are several good resources in information retrieval that you may want to bookmark for future reference in text mining and information retrieval generally:

Manning, C.D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. doi: http://dx.doi.org/10.1017/CBO9780511809071; Available at: Stanford NLP - Information Retrieval.

DOCUMENT ANALYSIS: TERM FREQUENCY (TF) AND INVERSE DOCUMENT FREQUENCY (IDF)

The intuition behind analyzing words in documents hinges on the following:

  • terms that are frequent within a document are given higher importance than those that are infrequent
  • terms that are frequent across many documents are not considered as important

That is, words that are common across the entire corpus are discounted, while those that are common within a document are boosted.
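The following minimal sketch shows this weighting and the resulting document similarity scores using Scikit-learn's TfidfVectorizer and cosine_similarity. The three summaries are invented for illustration, and removing English stop words is an assumption rather than something the assignment prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy summaries, standing in for the movie plots.
plots = [
    "a detective hunts a killer in the city",
    "a detective chases a killer across the city",
    "a robot learns to love in the future",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(plots)  # sparse matrix: documents x terms

# Terms found in several documents get a lower IDF than rare terms.
vocab = vectorizer.vocabulary_           # maps term -> column index
idf_common = vectorizer.idf_[vocab["detective"]]  # appears in two documents
idf_rare = vectorizer.idf_[vocab["hunts"]]        # appears in one document
print("idf(detective) =", idf_common, " idf(hunts) =", idf_rare)

# Pairwise cosine similarity between the TF-IDF document vectors.
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

The first two summaries share several terms and therefore score well above zero against each other, while the third shares no non-stop-word terms with the first and scores zero.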

Part 2 - Classification With k-Nearest Neighbors

In HW1 you learned about and used the k-NN algorithm. You computed k neighbors based on actual data. k-NN is what is called a lazy learner: it does not build a model during the training phase, but defers the work to the testing (prediction) phase. This has performance costs of its own, since all the training data has to be kept and examined for each query. It can, nonetheless, be used for supervised classification, since it has stored the class labels of all the training instances.

You will first start with a warm-up using the Nearest Neighbors algorithm already implemented in Scikit-learn.
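A warm-up of this kind might look like the sketch below, which first retrieves neighbors with NearestNeighbors and then classifies with KNeighborsClassifier. The points, labels, and values of k are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Hypothetical 2-D training points: three near the origin, two far away.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1])  # class label of each training point

# Unsupervised neighbor search: "fitting" only stores/indexes the data.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = nn.kneighbors(np.array([[0.2, 0.1]]))
print(indices)    # row numbers of the 2 closest training points
print(distances)  # their Euclidean distances, smallest first

# Supervised use: predict by majority vote among the k nearest labels.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predictions = clf.predict(np.array([[0.2, 0.1], [5.0, 5.5]]))
print(predictions)
```

kneighbors returns two arrays with one row per query point: the distances to the neighbors and the row indices of those neighbors in the training data, both ordered from nearest to farthest.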

Part 3 - Classification With Decision Trees

As we learned, decision trees are a powerful way to build classifiers, especially since the output is interpretable. By using impurity measures such as entropy (the basis of information gain) and the Gini index, splits can be chosen that divide the data in meaningful ways, so that following the attribute test at each decision point leads to a leaf node that provides the label.
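The sketch below binarizes two categorical attributes with OneHotEncoder and then fits a DecisionTreeClassifier using the entropy criterion. The weather-style rows, column names, and labels are hypothetical, chosen only to keep the example small.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical categorical data: (outlook, windy) -> play outside?
columns = ["outlook", "windy"]
rows = [
    ["sunny", "yes"], ["sunny", "no"],
    ["rainy", "yes"], ["rainy", "no"],
    ["overcast", "yes"], ["overcast", "no"],
]
labels = ["no", "yes", "no", "no", "yes", "yes"]

# Binarize: each category value becomes its own 0/1 column.
encoder = OneHotEncoder()
X = encoder.fit_transform(rows).toarray()

# Build readable names such as "outlook=sunny" for the binary columns.
feature_names = [
    f"{col}={cat}"
    for col, cats in zip(columns, encoder.categories_)
    for cat in cats
]

# criterion="entropy" picks splits by information gain; "gini" is the default.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, labels)
print(export_text(tree, feature_names=feature_names))

# New rows must go through the same fitted encoder before prediction.
new_row = encoder.transform([["overcast", "no"]]).toarray()
prediction = tree.predict(new_row)
print(prediction)
```

Printing the tree with export_text is one way to see the interpretability mentioned above: each line is a test on a binary column, and each leaf shows the predicted class.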

Attachment:- Assignment.rar
