
Part 1 - Description, Visualisation and Pre-processing [R Only]

a) Explore the data

i. Use as many functions/techniques in R as necessary to adequately describe and visualise the data. Provide a table for all the attributes of the dataset including the measures of centrality (mean, median etc.), dispersion and how many missing values each attribute has. Use the table to make comments about the data.

ii. Produce histograms for each attribute. Provide details of how you created the histograms and comment on the distribution of the data. Also use the descriptive statistics you produced above to help you characterise the shape of each distribution.
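As a rough starting point for part a), the sketch below builds a per-attribute summary table and draws one histogram per attribute. The data frame is synthetic, because the assignment file is not reproduced here; the column names `er` and `h1` are borrowed from the brief, while `describe` and `summary_table` are hypothetical names chosen for illustration.

```r
# Synthetic stand-in for the assignment data (values are made up).
set.seed(5)
dat <- data.frame(
  er = c(rnorm(48), NA, NA),  # two missing values
  h1 = c(rexp(49), NA)        # one missing value
)

# Centrality, dispersion and missing-value count for one numeric attribute.
describe <- function(v) {
  c(mean    = mean(v, na.rm = TRUE),
    median  = median(v, na.rm = TRUE),
    sd      = sd(v, na.rm = TRUE),
    iqr     = IQR(v, na.rm = TRUE),
    missing = sum(is.na(v)))
}

# One row per attribute, one column per statistic.
summary_table <- t(sapply(dat, describe))
print(round(summary_table, 3))

# One histogram per attribute; hist() ignores missing values.
for (nm in names(dat)) {
  hist(dat[[nm]], main = paste("Histogram of", nm), xlab = nm)
}
```

Comparing the mean with the median in each row gives a quick hint about skew, which can then be checked against the matching histogram.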

b) Explore the relationships between the attributes, and between the class and the attributes

i. Calculate the correlations between er and pgr, b1 and b2, and p1 and p2 (three correlations). What do these tell you about the relationships between these variables?

ii. Produce scatterplots between the class variable and er, pgr and h1 variables (note: you may have to recode the class variable as numeric to produce scatterplots). What do these tell you about the relationships between these three variables and the class?
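One possible way to approach part b) is sketched below. The data are synthetic and the class labels `A`, `B`, `C` are invented; only the attribute names `er` and `pgr` come from the brief. The sketch assumes the class is stored as a factor, so `as.numeric()` recodes it to its integer level codes.

```r
# Synthetic stand-in: pgr is built to correlate with er on purpose.
set.seed(7)
n  <- 100
er <- rnorm(n)
dat <- data.frame(
  er    = er,
  pgr   = 0.8 * er + rnorm(n, sd = 0.5),
  class = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Pearson correlation, using only rows where both values are present.
r_er_pgr <- cor(dat$er, dat$pgr, use = "complete.obs")
print(r_er_pgr)

# Recode the class as numeric (level codes 1, 2, 3) for plotting.
class_num <- as.numeric(dat$class)

# Scatterplot of an attribute against the recoded class.
plot(class_num, dat$er,
     xlab = "class (numeric code)", ylab = "er",
     main = "er by class")
```

The same two calls can be repeated for the other pairs (`b1` and `b2`, `p1` and `p2`) and for `pgr` and `h1` against the class.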

c) General Conclusions

Take into consideration all the descriptive statistics, the visualisations and the correlations you produced, together with the missing values, and comment on the importance of the attributes. Which of the attributes seem to hold significant information, and which can you regard as insignificant? Provide an explanation for your choice.

d) Dealing with missing values in R

i. Write a script in R to find missing values and replace them using three strategies: replace missing values with 0, with the mean, and with the median.

ii. Compare and contrast these approaches
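The three replacement strategies in part d) could be sketched as follows. This is a minimal illustration on a tiny made-up data frame; `impute` is a hypothetical helper, and the names `df_zero`, `df_mean` and `df_median` are just one way of keeping the resulting data sets distinct.

```r
# Tiny made-up data frame with missing values in both columns.
df <- data.frame(
  er  = c(1.2, NA, 3.4, 4.1, NA),
  pgr = c(10, 20, NA, 40, 50)
)

# Find the missing values: count of NAs per attribute.
na_counts <- colSums(is.na(df))
print(na_counts)

# Replace NAs in every numeric column with the value returned by fill(col).
impute <- function(data, fill) {
  data[] <- lapply(data, function(col) {
    if (is.numeric(col)) col[is.na(col)] <- fill(col)
    col
  })
  data
}

df_zero   <- impute(df, function(col) 0)
df_mean   <- impute(df, function(col) mean(col, na.rm = TRUE))
df_median <- impute(df, function(col) median(col, na.rm = TRUE))
```

Comparing `summary(df_zero)`, `summary(df_mean)` and `summary(df_median)` against the original shows how each strategy shifts the mean and spread of an attribute.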

f) Attribute transformation  

Explore the use of three transformation techniques (mean centering, normalisation and standardisation) to scale the attributes, and compare their various effects.
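A minimal sketch of the three transformations, on synthetic data, is given below. It assumes "normalisation" means min-max rescaling to the range [0, 1]; the brief does not define the term, so check this against the course material. The attribute names `b1` and `b2` come from the brief, while the values and the helper `min_max` are invented.

```r
# Synthetic attributes on very different scales.
set.seed(42)
x <- data.frame(
  b1 = rnorm(50, mean = 5, sd = 2),
  b2 = runif(50, min = 0, max = 100)
)

# Mean centring: subtract the column mean, leave the spread unchanged.
centred <- as.data.frame(scale(x, center = TRUE, scale = FALSE))

# Standardisation: subtract the mean and divide by the standard deviation.
standardised <- as.data.frame(scale(x, center = TRUE, scale = TRUE))

# Normalisation (assumed here to be min-max rescaling to [0, 1]).
min_max    <- function(v) (v - min(v)) / (max(v) - min(v))
normalised <- as.data.frame(lapply(x, min_max))

# Compare the effect on the mean and standard deviation of each attribute.
print(sapply(list(raw = x, centred = centred,
                  standardised = standardised, normalised = normalised),
             function(d) c(mean = colMeans(d), sd = sapply(d, sd))))
```

Note that `min_max` as written does not handle missing values; it would need `na.rm = TRUE` in `min()` and `max()` if applied before imputation.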

g) Attribute / instance selection

i. Starting again from the raw data, consider attribute and instance deletion strategies to deal with missing values. Choose a number of missing values per instance or per attribute and delete instances/attributes accordingly. Explain your choice.

ii. Consider using correlations between attributes to reduce the number of attributes. Try to reduce the dataset to contain only uncorrelated attributes.

iii. Use principal component analysis in R to create a data set with ten attributes.

As a result, you will end up with several different sets of data to be used in Parts 2 and 3. Give each set of data a clear and distinct name, so that you can easily refer to it again in the later stages.
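For step g) iii, one possible sketch uses `prcomp` from base R. It assumes the missing values have already been dealt with, because `prcomp` does not accept `NA` values in a plain matrix or data frame, and it scales the attributes first, which is a choice rather than a requirement. The 20 synthetic columns named `attr1` to `attr20` and the result name `data_pca10` are placeholders.

```r
# Synthetic stand-in: 200 instances, 20 numeric attributes, no missing values.
set.seed(3)
raw <- as.data.frame(matrix(rnorm(200 * 20), ncol = 20))
names(raw) <- paste0("attr", 1:20)

# PCA on centred and scaled attributes.
pca <- prcomp(raw, center = TRUE, scale. = TRUE)

# Keep the scores on the first ten principal components as a new data set.
data_pca10 <- as.data.frame(pca$x[, 1:10])

# Share of the total variance retained by those ten components.
var_explained <- summary(pca)$importance["Cumulative Proportion", 10]
print(var_explained)
```

The class variable is left out of the PCA input here and can be bound back onto `data_pca10` afterwards with `cbind()` if it is needed for evaluation.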

Part 2 - Clustering [R Only]

Using R (only), explore the use of clustering to find natural groupings in the data, without using the class variable - i.e. use only the 20 numeric (input) attributes to perform the clustering. Once the data is clustered, you may use the class variable to evaluate or interpret the results (how do the new clusters compare to the original classes?).

a) Use hierarchical, k-means, PAM as clustering algorithms to create classifications of seven clusters and write the results. Which algorithm produces better results when compared to the class attribute? [10]

b) As each of these algorithms has adjustable parameters, you may explore the 'optimisation' or 'tuning' of these parameters, either manually or (preferably) automatically. Which parameters produce the best results for each clustering algorithm? Provide the reasoning of the techniques you used to find the optimal parameters.

c) Choose one of the clustering algorithms above and apply it to the alternative data sets that you produced in Part 1.

i. The reduced data set featuring only the first 10 Principal Components.

ii. The dataset after deletion of instances and attributes.

iii. The three datasets after you replaced missing values with the three techniques.

iv. Which of these datasets had a positive impact on the quality of the clustering? Provide explanations using the clustering results for each of the alternative data sets.
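The comparison asked for in Part 2 a) might be sketched along these lines. The data and the seven class labels are synthetic, Ward linkage and `nstart = 25` are arbitrary starting choices, and `purity` is a hypothetical helper: it is only one of several ways to compare clusters with classes and is not prescribed by the brief. `pam` comes from the `cluster` package, which ships with standard R installations.

```r
library(cluster)  # provides pam()

# Synthetic stand-in: 210 instances, 20 scaled attributes, 7 made-up classes.
set.seed(11)
x       <- scale(matrix(rnorm(210 * 20), ncol = 20))
classes <- factor(rep(1:7, each = 30))
k       <- 7

# Hierarchical clustering, cut into k groups.
hc <- cutree(hclust(dist(x), method = "ward.D2"), k = k)

# k-means with several random starts.
km <- kmeans(x, centers = k, nstart = 25)$cluster

# Partitioning around medoids.
pm <- pam(x, k = k)$clustering

# Cross-tabulate one clustering against the classes.
print(table(cluster = km, class = classes))

# Purity: share of instances that fall in their cluster's majority class.
purity <- function(cl, truth) {
  sum(apply(table(cl, truth), 1, max)) / length(truth)
}
scores <- c(hierarchical = purity(hc, classes),
            kmeans       = purity(km, classes),
            pam          = purity(pm, classes))
print(scores)
```

Because the synthetic data here are pure noise, the scores will be low; on the real data the same code can be rerun on each named data set from Part 1 to compare them.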

Part 3 - Classification [Weka and R]

You must use Weka to perform the classification, but you may choose to use R to present results. Use Weka to explore the use of various classification techniques to create models that predict the given class from the input attributes. Split the data (randomly) into a training set (2/3 of the data) and a test set (1/3 of the data).

a) Try the following classification algorithms: ZeroR, OneR, NaïveBayes, IBk (k-NN) and J48 (C4.5). Which algorithm produces the best results?

b) Choose one classification algorithm of the above and explore various parameter settings for each of the different splits of data. Which parameters improve the predictive ability of the algorithm?

c) Choose one of the classification algorithms above and use the data sets you created in Part 1:

i. The reduced data set featuring only the first 10 Principal Components.

ii. The dataset after deletion of instances and attributes.

iii. The three datasets after you replaced missing values with the three techniques.

iv. Which of the datasets had a positive impact on the predictive ability of the algorithm? Provide explanations using the classification results for each of the alternative data sets.

Attachment:- Assignment.rar
