Problem 1

Letter Recognition exercise

Letter Recognition

One of the earliest applications of predictive analytics was automatic recognition of letters, which is used in applications like sorting mail at post offices. In this problem, we will build a model that uses attributes of images of four letters in the Roman alphabet - A, B, P, and R - to predict which letter a particular image corresponds to.

In this problem, we have more than two classifications that are possible for each observation, like the situation in Chapter 8, when D2Hawkeye built a model to classify expected healthcare cost. Such problems are called multi-class classification problems.

The file Letters.csv (available in the Online Companion) contains 3116 observations, each of which corresponds to a certain image of one of the four letters A, B, P and R. The images came from 20 different fonts, which were then randomly distorted to produce the final images; each such distorted image is represented as a collection of pixels, each of which is "on" or "off." For each such distorted image, we have available certain attributes of the image in terms of these pixels, as well as which of the four letters the image is. These variables are described in Table 22.7.

a) To warm up, start by predicting whether or not the letter is "B." First, create a new variable called IsB in your dataset, which takes value "Yes" if the letter is B, and "No" if the letter is not B. Then randomly split your dataset into a training set and a testing set, putting 50% of the data in each set.
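The labelling and splitting step can be sketched in plain Python, since the assignment does not prescribe a language. The rows below are made-up stand-ins for Letters.csv, and the helper names `add_is_b` and `split_half` and the seed value are assumptions for illustration only.

```python
import random

def add_is_b(rows):
    """Return copies of the rows with an added IsB field ("Yes" for B, else "No")."""
    return [dict(row, IsB="Yes" if row["Letter"] == "B" else "No") for row in rows]

def split_half(rows, seed=1):
    """Randomly split rows into two halves (training, testing)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for Letters.csv: only the Letter column is shown.
letters = [{"Letter": ch} for ch in "ABPRBBAPRA"]
labelled = add_is_b(letters)
train, test = split_half(labelled)
print(len(train), len(test))  # 5 5
```

This simple shuffle does not guarantee the same share of "Yes" rows in both halves; a stratified split is a common refinement when that matters.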

i) Before building models, let us consider a baseline method that always predicts the most frequent outcome, which is "not B." What is the accuracy of this baseline method on the test set?
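A baseline of this kind can be sketched as follows. The labels are invented for illustration, the function name `baseline_accuracy` is hypothetical, and taking the most frequent label from the training set (rather than the full dataset) is an assumption.

```python
from collections import Counter

def baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test observation."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)

# Made-up labels, not taken from Letters.csv.
train_labels = ["No", "No", "Yes", "No", "Yes", "No"]
test_labels = ["No", "Yes", "No", "No"]
prediction, accuracy = baseline_accuracy(train_labels, test_labels)
print(prediction, accuracy)  # No 0.75
```

The same function works unchanged when the labels have more than two classes.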

ii) Build a CART tree to predict whether or not a letter is a B, using the training set to build your model. Remember to not use the variable Letter as one of the independent variables in the model, as this is related to what we are trying to predict! Select reasonable parameter values for the model, and justify your parameter choices. What is the accuracy of this CART model on the test set?

iii) Now, build a random forest model to predict whether or not the letter is a B. Again, select reasonable parameter values for the model, and justify your parameter choices. What is the accuracy of the model on the test set?

Table 22.7: Variables in the dataset Letters.csv.

Variable Description
Letter The letter that the image corresponds to (A, B, P or R).
Xbox The horizontal position of where the smallest box covering the letter shape begins.
Ybox The vertical position of where the smallest box covering the letter shape begins.
Width The width of this smallest box.
Height The height of this smallest box.
Onpix The total number of "on" pixels in the character image.
Xbar The mean horizontal position of all of the "on" pixels.
Ybar The mean vertical position of all of the "on" pixels.
X2bar The mean squared horizontal position of all of the "on" pixels in the image.
Y2bar The mean squared vertical position of all of the "on" pixels in the image.
XYbar The mean of the product of the horizontal and vertical position of all of the "on" pixels in the image.
X2Ybar The mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels.
XY2bar The mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels.
Xedge The mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image.
XedgeYcor The mean of the product of the number of horizontal edges at each vertical position and the vertical position.
Yedge The mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image.
YedgeXcor The mean of the product of the number of vertical edges at each horizontal position and the horizontal position.

iv) Compare the accuracy of your CART and Random Forest models. Which one performs better? For this application, do you think interpretability or accuracy is more important?

b) Let us now move on to the problem that we were originally interested in, which is to predict which of the four letters A, B, P or R an image shows. The variable in our dataset which we will be trying to predict is Letter.

i) In a multi-class classification problem, a simple baseline model is to predict the most frequent class of all of the options for every observation. For this problem, what does the baseline method predict, and what is the baseline accuracy on the testing set? Do you think this simple baseline method is a useful benchmark for this problem? Why or why not?

ii) Now build a classification tree to predict Letter, using the training set to build your model. (Remember not to use the variable IsB in the model, as this is related to what we are trying to predict!) Select reasonable parameter values and justify your parameter choices. What is the test set accuracy of your CART model?

(HINT: When you are computing the test set accuracy using a classification matrix, you want to add everything on the main diagonal and divide by the total number of observations in the test set.)
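The calculation in the hint can be sketched as below. The 4x4 matrix is hypothetical and does not come from any fitted model; the function name `accuracy_from_matrix` is likewise an assumption.

```python
def accuracy_from_matrix(matrix):
    """Sum of the main diagonal divided by the total number of observations."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical classification matrix (rows: actual A, B, P, R; columns: predicted).
confusion = [
    [90, 3, 2, 5],
    [4, 80, 6, 10],
    [1, 5, 92, 2],
    [6, 9, 3, 82],
]
print(accuracy_from_matrix(confusion))  # 0.86
```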

iii) Now, build a random forest model to predict Letter, using the training data - again, do not forget to remove the IsB variable. What is the test set accuracy of your random forest model?

iv) Compare the accuracy of your CART and Random Forest models for this problem. Which one would you recommend for this problem? Is your choice different from the model you recommended in part (a)?

Problem 2

Document Clustering exercise

Document Clustering

Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term "jaguar" into Google, over 400 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings, like this one. If we search for "jaguar," we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov, the official Web portal for the U.S. government. The two most common clustering algorithms used for document clustering are Hierarchical and K-means.

In this exercise, we will be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. The file DailyKos.csv can be found in the Online Companion for this book, and contains data on 3,430 news articles or blogs that have been posted on Daily Kos. These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican candidate) and Senator John Kerry (Democratic candidate). Foreign policy was a dominant topic of the election, specifically, the 2003 invasion of Iraq.

There are 1,545 variables in this dataset -- each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). For each document, or observation, the variable values are the number of times that word appeared in the document. (If you are familiar with text analytics, this approach is called bag of words.)
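To make the bag-of-words representation concrete, here is a minimal sketch on three invented documents. The function name `bag_of_words` and the whitespace tokenization are assumptions; the `min_docs` parameter plays the role of the "at least 50 different articles" threshold, set to 2 here so the toy example has something to filter.

```python
from collections import Counter

def bag_of_words(documents, min_docs=1):
    """Build word-count rows, keeping words that appear in at least min_docs documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # Count each word once per document to get its document frequency.
    doc_frequency = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(w for w, n in doc_frequency.items() if n >= min_docs)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["bush kerry debate", "kerry kerry poll", "iraq war bush"]
vocabulary, rows = bag_of_words(docs, min_docs=2)
print(vocabulary)  # ['bush', 'kerry']
print(rows)        # [[1, 1], [0, 2], [1, 0]]
```

Each row is one document and each column one retained word, which mirrors the layout described for DailyKos.csv.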

a) Start by building a Hierarchical Clustering model to cluster documents using all of the variables in the dataset. Indicate which distance metrics you used for distances between the observations and distances between the clusters.
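The two kinds of distance can be illustrated with a short sketch. Euclidean distance between observations and complete linkage between clusters are only one possible pairing, chosen here for brevity; the assignment leaves the choice open, and other linkages (such as Ward's method or centroid distance) are equally valid candidates.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two observations (equal-length count vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(cluster_a, cluster_b):
    """Distance between clusters: the largest distance over all cross-cluster pairs."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)

# Made-up two-dimensional observations.
cluster_a = [[0, 0], [0, 1]]
cluster_b = [[3, 4], [6, 8]]
print(euclidean([0, 0], [3, 4]))               # 5.0
print(complete_linkage(cluster_a, cluster_b))  # 10.0
```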

i) Building a hierarchical clustering model will probably take a significant amount of time on this dataset. Why?

ii) Plot the dendrogram of your hierarchical clustering model. Using the dendrogram and thinking about this particular application, which number of clusters would you recommend? Keep in mind that document clustering would most likely be used by Daily Kos to show readers categories to choose from when trying to decide which articles to read.

iii) Assign each observation to a cluster, using the number of clusters you recommended in the previous subproblem. How many observations are in each cluster?

iv) In the previous chapter, we analyzed the centroids of the clusters by looking at the average values of all of the variables in each cluster. We do not want to do that here though, since we have over 1,000 variables! Instead, split your dataset into a dataset for each cluster, using your cluster assignments.

Then, find the six most frequent words in each cluster. If you are using R, and your dataset for the first cluster is called HierCluster1, this can be done with the command:
tail(sort(colMeans(HierCluster1)))

Describe each cluster. Is there a cluster that is mostly about the Iraq war? Is there a cluster that is mostly about the Democratic Party? It might be helpful to know that in 2004, Howard Dean was one of the candidates for the Democratic nomination for President of the United States, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Democratic Vice Presidential nominee).
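For readers not working in R, the split-then-rank step can be sketched in Python. The counts, the vocabulary, and the cluster labels below are invented; `split_by_cluster` and `top_words` are hypothetical helper names. Like `tail(sort(colMeans(...)))`, the result is ordered from lowest to highest mean.

```python
def split_by_cluster(rows, assignments):
    """Group data rows by their cluster label."""
    groups = {}
    for row, label in zip(rows, assignments):
        groups.setdefault(label, []).append(row)
    return groups

def top_words(rows, vocabulary, n=6):
    """Return the n words with the highest mean count, in ascending order of mean."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    ranked = sorted(zip(means, vocabulary))
    return [word for _, word in ranked[-n:]]

# Made-up counts for four documents over a four-word vocabulary.
vocabulary = ["bush", "dean", "iraq", "kerry"]
rows = [[3, 0, 5, 1], [2, 0, 4, 0], [0, 6, 0, 2], [1, 4, 0, 3]]
assignments = [1, 1, 2, 2]  # hypothetical cluster labels

clusters = split_by_cluster(rows, assignments)
for label, members in sorted(clusters.items()):
    # Prints the label, the cluster size, and the two highest-mean words.
    print(label, len(members), top_words(members, vocabulary, n=2))
```

With the real data, `n=6` reproduces the six-word summary the exercise asks for, and `len(members)` gives the number of observations in each cluster.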

b) Now cluster the documents using K-means clustering. Choose the same number of clusters that you recommended for Hierarchical clustering.
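The K-means procedure itself can be sketched in a few lines. This is a deliberately simplified version under stated assumptions: it starts from the first k rows instead of random starting points, runs a fixed number of iterations, and uses squared Euclidean distance. Library implementations typically add random initialization and convergence checks. The data and the function names are invented for illustration.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=20):
    """Simplified K-means: alternate between assigning rows and updating centroids."""
    centroids = [list(row) for row in rows[:k]]  # assumption: first k rows as starts
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Made-up observations forming two obvious groups.
rows = [[1, 1], [9, 9], [1, 2], [8, 9], [2, 1], [9, 8]]
assignments, centroids = kmeans(rows, k=2)
print(assignments)           # [0, 1, 0, 1, 0, 1]
print(Counter(assignments))  # three observations in each cluster
```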

i) How many observations are in each cluster? Is your answer the same as it was with Hierarchical clustering? Why or why not?

ii) Just like you did for Hierarchical clustering, split your dataset into a dataset for each K-means cluster, and analyze the most frequent words in each cluster. Are the clusters similar to the Hierarchical clusters? Can you find a similar Hierarchical cluster for each K-means cluster? Keep in mind that the order of the clusters (which cluster is labeled as 1, which cluster is labeled as 2, etc.) is meaningless - for example, Hierarchical cluster 3 might be very similar to K-means cluster 1.

c) Try repeating this problem with a different number of clusters than you originally selected. How do your results compare between the two selections? Do you prefer one number of clusters over the other? Are you able to make different observations about the data when the number of clusters changes?

Problem 3 (Real-life applications)

1. Make a summary in Word of at least 400 words and not more than 800 words of the paper "Analyzing user preferences using Facebook fan pages" posted on Canvas, explaining the clustering method used and describing the resulting clusters. Don't read the appendix. (Note: SPSS is a statistical software package like R, except that it is not open source.)

2. Make a summary in Word of at least 500 words of Chapter 14 of the Analytics Edge textbook, making sure to include a brief description of each section. Take particular care to describe (among other things) the clustering approach in 14.3 Defining Peer Groups and the Condorcet clustering method, except the "optimal clustering" section, which is starred and is therefore more advanced than the other sections. Also answer the question: how can analytics be used to detect Medicaid fraud?

Article - Analyzing User Preferences Using Facebook Fan Pages by Pin Luarn, Hsien-Chih Kuo, Hong-Wen Lin, Yu-Ping Chiu, Ya-Cing Jhan

Attachment:- data sets.rar
