Given the data set called County Demographic Information, construct a predictive model for the variable "Total Serious Crime" using some or all of the other variables in the set of data.

The model should be mathematically valid, accurate and reliable.

Total Serious Crime is Variable #8

Other Variables:

#2 Land Area
#3 Total Population
#4 Percent of Population aged 18-34
#5 Percent of Population 65 or over
#6 Number of Active Physicians
#7 Number of Hospital Beds
#9 Percent of High School Graduates
#10 Percent of Population with College Degrees
#11 Percent of Population below poverty level
#12 Unemployment Percent
#13 Per Capita Income
#14 Total Personal Income
#15 Geographic Region

Note: I am omitting the data set to simplify this problem; the following analyses use the data set described above, and you can assume the math is calculated correctly. I am testing whether you can identify which analytical techniques may be validly employed and how effective they are in building a model.

Variables 2 to 14 are numeric variables and variable 15 is categoric.

Question 1:

In the given data set, we were asked to determine if an accurate predictive model for Variable #8, Serious Crime could be found using the attached data.

Since Variable 15 was determined to be categoric, regression was not appropriate to use; so I used Analysis of Variance (ANOVA) to examine if there was a significant relationship between Variable 8 and 15. The results (using Systat 13.0) are printed below.

Variables             Levels
VAR(15) (4 levels)    1.000  2.000  3.000  4.000

Dependent Variable    VAR(8)
N                     440
Multiple R            0.110
Squared Multiple R    0.012

Estimates of Effects B = (X'X)^(-1) X'Y
Factor     Level        VAR(8)
CONSTANT            28,017.368
VAR(15)    1        -4,931.339
VAR(15)    2        -6,236.627
VAR(15)    3        -1,026.394

Analysis of Variance
Source      Type III SS    df  Mean Squares   F-Ratio  p-Value
VAR(15)      1.795E+010     3    5.985E+009     1.774    0.151
Error        1.471E+012   436    3.374E+009

ANOVA results suggest that Variable 15 is significantly related to Variable 8, but Variable 15 can only explain approximately 15.1% of the variation in Variable 8.
Therefore, I conclude that variable 15 is significantly related to variable 8 although variable 15 is only a minor factor in predicting variable 8.
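
As a cross-check, the summary statistics in the ANOVA output can be recomputed from its sums of squares. The Python sketch below uses only the rounded values printed in the table, so it should agree with Systat only to about three significant figures. The variable names are illustrative and are not part of the Systat output.

```python
# Rounded values copied from the ANOVA table for VAR(8) by VAR(15).
ss_effect = 1.795e10   # Type III SS for VAR(15)
ss_error = 1.471e12    # Error SS
df_effect = 3          # 4 region levels - 1
df_error = 436         # N - number of levels = 440 - 4

# Mean squares and the F-ratio
ms_effect = ss_effect / df_effect
ms_error = ss_error / df_error
f_ratio = ms_effect / ms_error

# Proportion of the variation in VAR(8) accounted for by VAR(15)
r_squared = ss_effect / (ss_effect + ss_error)

print(f"F-ratio            = {f_ratio:.3f}")
print(f"Squared Multiple R = {r_squared:.3f}")
```

The ratio of sums of squares reproduces the Squared Multiple R reported in the output (about 0.012). The p-value is a separate quantity, obtained from the F distribution with 3 and 436 degrees of freedom.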

Question 2:

Using Systat, I employed Multiple Linear Regression to attempt to create a predictive model, using all of the available variables as independent variables.

The results are shown below.

Dependent Variable              VAR(8)
N                               440
Multiple R                      0.919
Squared Multiple R              0.844
Adjusted Squared Multiple R     0.839
Standard Error of Estimate      23,367.069

Regression Coefficients B = (X'X)^(-1) X'Y
Effect     Coefficient  Standard Error  Std. Coefficient  Tolerance        t  p-Value
CONSTANT   -50,925.731      35,344.226             0.000          .   -1.441    0.150
VAR(2)          -3.054           0.849            -0.081      0.719   -3.599    0.000
VAR(3)           0.234           0.020             2.422      0.008   11.560    0.000
VAR(4)         221.063         424.685             0.016      0.393    0.521    0.603
VAR(5)          32.120         380.640             0.002      0.539    0.084    0.933
VAR(6)          -5.189           3.150            -0.159      0.039   -1.647    0.100
VAR(7)           3.404           2.280             0.134      0.046    1.493    0.136
VAR(9)        -265.566         321.799            -0.032      0.244   -0.825    0.410
VAR(10)        140.915         373.505             0.019      0.152    0.377    0.706
VAR(11)      1,142.711         488.132             0.091      0.241    2.341    0.020
VAR(12)       -159.661         658.025            -0.006      0.526   -0.243    0.808
VAR(13)          2.335           0.699             0.163      0.154    3.339    0.001
VAR(14)         -7.070           0.946            -1.564      0.008   -7.475    0.000
VAR(15)      1,456.610       1,319.387             0.026      0.668    1.104    0.270

Analysis of Variance
Source               SS    df  Mean Squares   F-Ratio  p-Value
Regression   1.256E+012    13    9.664E+010   176.989    0.000
Residual     2.326E+011   426    5.460E+008

Since the combined model had a p-value of 0.000, I concluded that this model could accurately predict variable 8, Total Serious Crime. The R-Squared value of approximately .84 suggests that the model explains about 84% of the variation in Serious Crime. Therefore, I conclude that this is a fairly accurate, valid, predictive model of Total Serious Crime.
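
The Tolerance column in the output above can be turned into variance inflation factors, a common way to read it. The Python sketch below assumes the standard definitions (tolerance = 1 - R_j^2, where R_j^2 comes from regressing predictor j on the other predictors, and VIF = 1 / tolerance). The cutoff of 10 is a widely used rule of thumb, not something stated in the original analysis.

```python
# Tolerance values copied from the regression output for analysis #2.
tolerance = {
    "VAR(2)": 0.719, "VAR(3)": 0.008, "VAR(4)": 0.393, "VAR(5)": 0.539,
    "VAR(6)": 0.039, "VAR(7)": 0.046, "VAR(9)": 0.244, "VAR(10)": 0.152,
    "VAR(11)": 0.241, "VAR(12)": 0.526, "VAR(13)": 0.154, "VAR(14)": 0.008,
    "VAR(15)": 0.668,
}

VIF_CUTOFF = 10.0  # assumed rule-of-thumb threshold, not from the source

# VIF_j = 1 / tolerance_j
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Predictors whose VIF exceeds the assumed cutoff
high = sorted(name for name, value in vif.items() if value > VIF_CUTOFF)

for name, value in sorted(vif.items(), key=lambda kv: kv[1], reverse=True):
    flag = "  <-- above cutoff" if value > VIF_CUTOFF else ""
    print(f"{name:8s} VIF = {value:7.1f}{flag}")
```

Large VIF values indicate that a predictor is nearly a linear combination of the others, which inflates the standard errors of the affected coefficients.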

Question 3:

Since many individual, independent variables of the previous regression model had p-values above .05, they were not significant factors. I discarded them, redid the regression analysis, and got the results listed below.

Dependent Variable              VAR(8)
N                               440
Multiple R                      0.918
Squared Multiple R              0.842
Adjusted Squared Multiple R     0.840
Standard Error of Estimate      23,274.901

Regression Coefficients B = (X'X)^(-1) X'Y
Effect     Coefficient  Standard Error  Std. Coefficient  Tolerance        t  p-Value
CONSTANT   -63,890.789      10,233.100             0.000          .   -6.244    0.000
VAR(2)          -3.109           0.758            -0.083      0.894   -4.101    0.000
VAR(3)           0.250           0.016             2.580      0.013   15.282    0.000
VAR(11)      1,449.915         307.144             0.116      0.603    4.721    0.000
VAR(13)          2.460           0.469             0.171      0.341    5.250    0.000
VAR(14)         -7.899           0.787            -1.748      0.012  -10.037    0.000

Analysis of Variance
Source               SS    df  Mean Squares   F-Ratio  p-Value
Regression   1.254E+012     5    2.508E+011   462.898    0.000
Residual     2.351E+011   434    5.417E+008

This model is a better predictive model than analysis #2 since it has a higher F-value, and therefore a smaller p-value. Also, each factor of the model has a p-value smaller than .05; this indicates that each component is significant in itself. The R-Squared value of .84 indicates that I can predict Variable 8 with approximately 84% accuracy, using only five variables and a constant.
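
Because the five-variable model is nested inside the full model of analysis #2, the two can also be compared with a partial F statistic. The Python sketch below applies the standard nested-model formula to the residual sums of squares printed above. Those values are rounded to four significant figures, so the result is only approximate, and the variable names are illustrative.

```python
# Residual SS and residual df copied from the two ANOVA tables.
sse_full, df_full = 2.326e11, 426        # analysis #2: 13 predictors + constant
sse_reduced, df_reduced = 2.351e11, 434  # analysis #3: 5 predictors + constant

# Number of predictors dropped when going from the full to the reduced model
dropped = df_reduced - df_full

# Partial F: extra residual SS per dropped predictor, relative to the
# mean squared error of the full model
partial_f = ((sse_reduced - sse_full) / dropped) / (sse_full / df_full)

print(f"predictors dropped = {dropped}")
print(f"partial F          = {partial_f:.3f}")
```

A partial F near or below 1 suggests that the dropped variables add little explanatory power as a group. A formal p-value would come from the F distribution with 8 and 426 degrees of freedom.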

Question 4:

Repeating the previous analysis, but deleting the constant allowed me to raise the R-Squared value to almost .87.

Dependent Variable              VAR(8)
N                               440
Multiple R                      0.932
Squared Multiple R              0.869
Adjusted Squared Multiple R     0.868
Standard Error of Estimate      23,381.775

Regression Coefficients B = (X'X)^(-1) X'Y
Effect     Coefficient  Standard Error  Std. Coefficient  Tolerance        t  p-Value
VAR(2)          -3.010           0.763            -0.088      0.612   -3.942    0.000
VAR(3)           0.245           0.016             2.739      0.009   15.107    0.000
VAR(9)        -697.218         118.026            -0.846      0.015   -5.907    0.000
VAR(10)        496.913         209.212             0.174      0.056    2.375    0.018
VAR(11)        683.363         248.743             0.105      0.206    2.747    0.006
VAR(13)          1.727           0.472             0.511      0.015    3.657    0.000
VAR(14)         -7.658           0.780            -1.800      0.009   -9.818    0.000

Analysis of Variance
Source               SS    df  Mean Squares   F-Ratio  p-Value
Regression   1.576E+012     7    2.251E+011   411.714    0.000
Residual     2.367E+011   433    5.467E+008

Using seven variables and no constant, I found a model in which each component had a low p-value (under .05) and the overall p-value was 0.000. I would draw a conclusion similar to the one in analysis #3, but I would prefer this model because of its higher R-Squared value.
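
When comparing R-Squared between models with and without a constant, it helps to check which total sum of squares each one is measured against. The Python sketch below uses only the rounded sums of squares from the ANOVA tables of analyses #3 and #4. It assumes that the no-constant output measures R-Squared against the uncentered total, sum(y^2), an assumption that is consistent with the printed figures.

```python
# Analysis #4 (no constant): values copied from its ANOVA table.
ss_reg_no_const = 1.576e12
ss_res_no_const = 2.367e11

# Without a constant, Regression SS + Residual SS is the uncentered total, sum(y^2)
ss_total_uncentered = ss_reg_no_const + ss_res_no_const
r2_uncentered = ss_reg_no_const / ss_total_uncentered

# Analysis #3 (with constant): Regression SS + Residual SS is the
# centered total, sum((y - mean(y))^2), for the same dependent variable
ss_total_centered = 1.254e12 + 2.351e11

# R-squared of the no-constant model measured against the centered total,
# which puts it on the same footing as analyses #2 and #3
r2_centered_equiv = 1 - ss_res_no_const / ss_total_centered

print(f"R^2 vs uncentered total = {r2_uncentered:.3f}")
print(f"R^2 vs centered total   = {r2_centered_equiv:.3f}")
```

The first figure reproduces the 0.869 reported for analysis #4. The second is the value to set beside the 0.842 of analysis #3, since both then use the same baseline.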

Question 5:

Trying to optimize the model, I repeated the earlier analytical methods. I discarded the constant and tried to lower the number of variables. I was able to find a model (see results listed below, and compare to analyses #3 and #4) that used only four variables. Each variable had a p-value under .05, the F-value was higher than in earlier models (therefore, the p-value for the overall model was lower), and the R-Squared value was still approximately .84.

Dependent Variable              VAR(8)
N                               440
Multiple R                      0.916
Squared Multiple R              0.840
Adjusted Squared Multiple R     0.839
Standard Error of Estimate      25,805.795

Regression Coefficients B = (X'X)^(-1) X'Y
Effect     Coefficient  Standard Error  Std. Coefficient  Tolerance        t  p-Value
VAR(2)          -2.141           0.814            -0.062      0.656   -2.629    0.009
VAR(3)           0.088           0.002             0.979      0.644   41.013    0.000
VAR(11)      1,240.562         217.578             0.191      0.327    5.702    0.000
VAR(13)         -0.846           0.116            -0.251      0.314   -7.328    0.000

Analysis of Variance
Source               SS    df  Mean Squares   F-Ratio  p-Value
Regression   1.522E+012     4    3.805E+011   571.367    0.000
Residual     2.903E+011   436    6.659E+008

Therefore, I concluded that Model #5 was the preferred model since it only had four input variables and achieved approximately the same predictive accuracy. Thus I needed only four independent variables to predict variable #8 with accuracy of approximately 84%.
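
One way to put the four regression models on a common scale is the standard error of estimate, the square root of the residual mean square, which is in the units of VAR(8) and does not depend on how the total sum of squares is defined. The Python sketch below recomputes it from the rounded residual sums of squares in each ANOVA table. The dictionary keys are illustrative labels.

```python
import math

# (Residual SS, residual df) copied from the ANOVA table of each analysis.
models = {
    "#2": (2.326e11, 426),  # 13 predictors + constant
    "#3": (2.351e11, 434),  # 5 predictors + constant
    "#4": (2.367e11, 433),  # 7 predictors, no constant
    "#5": (2.903e11, 436),  # 4 predictors, no constant
}

# Standard error of estimate = sqrt(Residual SS / residual df)
see = {name: math.sqrt(sse / df) for name, (sse, df) in models.items()}

# Analysis with the smallest typical prediction error on this data
best = min(see, key=see.get)

for name, value in see.items():
    print(f"analysis {name}: standard error of estimate = {value:,.0f}")
print(f"smallest: analysis {best}")
```

The recomputed values match the "Standard Error of Estimate" lines in the Systat output to within rounding, so the comparison can also be read directly from the tables.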

A) Is each of the five analyses valid? (If not, why not?)
B) Is each of the five analyses significant? (Why?)
C) Is each of the five analyses accurate? (Why?)
D) Which is the best predictive model, and why?
