Ask Question, Ask an Expert

+61-413 786 465

info@mywordsolution.com

Ask Statistics and Probability Expert

Assignment: EDA and Bootstrapping

Objective

Statistics play a vital role in data science for (at least) two reasons. First, it can be used to gain a deep understanding of data. This process is known as Exploratory Data Analysis (EDA).

Second, it can be used to infer the relationship between a sample and the population. This process is known as inference. In this assignment, you will learn about EDA and statistical inference through the analysis of a very interesting dataset - property tax report data. Specifically, you will learn the followings:

1. Be able to perform EDA on a single column (i.e., univariate analysis)
2. Be able to perform EDA on multiple columns (i.e., multivariate analysis)
3. Be able to extract insights from visualizations
4. Be able to ask critical questions about data
5. Be able to estimate a population parameter based on a sample
6. Be able to use the bootstrap to quantify the uncertainty of an estimated value

In this assignment, you can use pandas or PySpark to manipulate data, and use matplotlib or seaborn to make plots.

Part 1. EDA

Imagine you are a data scientist working at a real-estate company. In this week, your job is to analyze the Vancouver's housing price. You first download a dataset from property_tax_report_2018.zip (property_tax_report_2018.zip). The dataset contains information on properties from BC Assessment (BCA) and City sources in 2018. You can find the schema information of the dataset from attached file. But this is not enough. You still know little about the data. That's why you need to do EDA in order to get a better and deeper understanding of the data.

We first load the data as a DataFrame. To make this analysis more interesting, I added two new columns to the data: CURRENT_PRICE represents the property price in 2018; PREVIOUS_PRICE represents the property price in 2017.

In [ ]:import pandas as pd

df = pd.read_csv("property_tax_report_2018.csv")

df['CURRENT_PRICE'] = df.apply(lambda x: x['CURRENT_LAND_VALUE ']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)

df['PREVIOUS_PRICE'] = df.apply(lambda x: x['PREVIOUS_LAND_VAL UE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)

Now let's start the EDA process.

Question 1. Look at some example rows

Print the first five rows of the data:

Question 2. Get summary statistics

From the above output, you will know that the data has 28 columns. Please use the describe() function to get the summary statistics of each column.

Please look at the above output carefully, and make sure that you understand the meanings of each row (e.g., std, 25% percentile).

Question 3. Examine missing values

Now we are going to perform EDA on a single column (i.e., univariate analysis). We chose YEAR_BUILT, which represents in which year a property was built. We first check whether the column has any missing value.

Missing values are very common in real-world datasets. In practice, you should always be aware of the impact of the missing values on your downstream analysis results.

Question 4. Plot a line chart

We now start investigating the values in the YEAR_BUILT column. Suppose we want to know: "How many properties were built in each year (from 1900 to 2018)?" Please plot a line chart to answer the question.

Please write down the two most interesting findings that you draw from the plot. For example, you can say: "Vancouver has about 6300 properties built in 1996 alone, which is more than any other year". For each finding, please write no more than 2 sentences.
Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 5. Plot a bar chart

Next, we want to find that, between 1900 and 2018, which years have the most number of properties been built? Plot a bar chart to show the top 20 years.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 6. Plot a histogram

What's the distribution of the number of properties built between 1900 and 2018? Please plot a histogram to answer this question.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 7. Make a scatter plot

Suppose we are interested in those years which built more than 2000 properties. Make a scatter plot to examine whether there is a relationship between the number of built properties and the year?

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 8. PDF and CDF

Can you believe that you have already drawn 8 interesting findings by exploring a single column! This is the power of EDA combined with critical thinking. Now we are moving to multivariate analysis.
Suppose you want to compare the housing price between this year and last year, i.e., CURRENT_PRICE vs. PREVIOUS_PRICE. You can plot their distributions, and make the comparison. There are two ways to define a distribution: Probabilistic Distribution Function and Cumulative Distribution Function (CDF).

In the following, please make two plots and put them side-by-side.

In the first plot, use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE.
In the second plot, use histograms to plot the cumulative distributions of CURRENT_PRICE and PREVIOUS_PRICE.
There are a few properties which are way more expensive than the others. For both plots, please exclude those properties by setting xlim = (0, 5Million).

In [ ]:

Please write down the two most interesting findings that you draw from the plots.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 9. Use EDA to answer an interesting question (1)

In the above plots, we found that the overall housing price has increased, but we do not which type of property has increased more.
Now we add another variable LEGAL_TYPE (e.g., STRATA, LAND) to the analysis, and consider three variables (LEGAL_TYPE, CURRENT_PRICE, PREVIOUS_PRICE) in total.

In the following, please make two plots and put them side-by-side.

In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "STRATA".

In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "LAND".

In [ ]:

Please write down the two most interesting findings that you draw from the plots.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 10. Use EDA to answer interesting questions

Although the housing price of the entire Vancouver area is increasing, there might be some areas whose housing price is decreasing. To answer this question, we need to consider another column
-- PROPERTY_POSTAL_CODE.

PROPERTY_POSTAL_CODE (e.g., "V5A 1S6") is a six-character string with a space separating the third and fourth characters. We use the first three characters to represent an area.

We first filter out the areas which have less than 10 properties. For each of the remaining areas, we calculate the percentage of the properties whose price has decreased compared to the last year. For example, if an area "V5A" has 50 properties, and 30 of them have decreased, then the percentage is 60%.

Please write code to find the top-10 areas with the highest percentages. Create a bar chart to visualize them.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 11. Come up with your own question.

You need to complete the following three tasks.

Firstly, please come up with an interesting question on your own (like Q9 and Q10).

A short description of the question: [ADD TEXT]

Secondly, please write code so that the output of your code can answer the question.

In [ ]:

Thirdly, please write the two most important findings.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Part 2. Bootstrapping

In Part 1, we run our analysis over the full dataset. In reality, however, you may not be that lucky. It is more often than not that you can only collect a sample of the data. Whenever you derive a conclusion from a sample (e.g., The Vancouver's housing price has increased by 10\% since last year), you should ALWAYS ask yourself: "CAN I TRUST IT?". In other words, you want to know that if the same analysis was conducted on the full data, would the same conclusion be derived? In Part 2, you will learn how to use bootstrapping to answer this question.

Please download the sample dataset property_tax_report_2018_sample.zip (property_tax_report_2018_sample.zip), and load it as a DataFrame.

In [ ]:df_sample = pd.read_csv("property_tax_report_sample.csv")

df_sample['CURRENT_PRICE'] = df_sample.apply(lambda x: x['CURR ENT_LAND_VALUE']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)

df_sample['PREVIOUS_PRICE'] = df_sample.apply(lambda x: x['PRE VIOUS_LAND_VALUE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)

df_sample = df_sample[df_sample['LEGAL_TYPE'] == 'STRATA']

Task 1. Analysis Result Without Bootstrapping

Please compute the median of PREVIOUS_PRICE and CURRENT_PRICE, respectively, and compare them in a bar chart.

In [ ]:

Task 2. Analysis Result With Bootstrapping

From the above chart, we find that the median of PREVIOUS_PRICE is about 0.6 M, and the median of CURRENT_PRICE is about 0.7 M. Since the numbers were obtained from the sample, "CAN WE TRUST THESE NUMBERS?"

In the following, please implement the bootstrap by yourself, and compute a 95%-confidence interval for each number. You can find the description of the algorithm in Section 7.

Attachment:- Assignment.rar

Statistics and Probability, Statistics

  • Category:- Statistics and Probability
  • Reference No.:- M92679291
  • Price:- $110

Guranteed 48 Hours Delivery, In Price:- $110

Have any Question?


Related Questions in Statistics and Probability

Your company is considering a new project that will require

Your Company is considering a new project that will require $18,000 of new equipment at the start of the project. The equipment will have a depreciable life of 5 years and will be depreciated to a book value of $3,000 us ...

Anbspgovernments department of transportation reported that

A? government's department of transportation reported that in? 2009, airline A led all domestic airlines in? on-time arrivals for domestic? flights, with a rate of 82.9?%. Complete parts a through e below. a.  What is th ...

Determine the minimum sample size required when you want to

Determine the minimum sample size required when you want to be 98% confident that the sample mean is within two units of the population mean. Assume a standard deviation of 4.82 in a normally distributed population.

Most analysts focus on the cost of tuition as the way to

Most analysts focus on the cost of tuition as the way to measure the cost of a college education. But incidentals, such as textbook costs, are rarely considered. A researcher at Drummand University wishes to estimate the ...

Nbsp1 nine percent of men and 025 of women cannot

1. Nine percent of men and 0.25% of women cannot distinguish between the colors red and green. This is the type of color blindness that causes problems with traffic signals. a. If six men are randomly selected for a stud ...

What is the 99 confidence interval for a sample of 52 seat

What is the 99% confidence interval for a sample of 52 seat belts that have a mean length of 85.6 inches long and a standard deviation of 3.8 inches?

Use the information below to answer the questions that

Use the information below to answer the questions that follow : While the Mars candy company no longer reports the percentages of each color M&M that are produced, they had previously reported the following: 13% brown  1 ...

Question suppose x is a normally distributed random

Question: Suppose X is a normally distributed random variable with mean µ = 200 and standard deviation σ = 20 a. Sketch the probability density function of X¯. On the same graph, sketch the probability density function o ...

Based on labor statistics17nbspof workers in a particular

Based on labor? statistics,17% of workers in a particular profession are female. Complete parts a through e below based on a random sample of 8 workers in this profession. a. What is the probability that exactly one work ...

A sample of 100 people is classified by gender malefemale

A sample of 100 people is classified by gender (male/female) and by whether they are registered voters. The sample consists of 80 females and 20 males, and has a total of 60 registered voters. If these data are used for ...

  • 4,153,160 Questions Asked
  • 13,132 Experts
  • 2,558,936 Questions Answered

Ask Experts for help!!

Looking for Assignment Help?

Start excelling in your Courses, Get help with Assignment

Write us your full requirement for evaluation and you will receive response within 20 minutes turnaround time.

Ask Now Help with Problems, Get a Best Answer

Why might a bank avoid the use of interest rate swaps even

Why might a bank avoid the use of interest rate swaps, even when the institution is exposed to significant interest rate

Describe the difference between zero coupon bonds and

Describe the difference between zero coupon bonds and coupon bonds. Under what conditions will a coupon bond sell at a p

Compute the present value of an annuity of 880 per year

Compute the present value of an annuity of $ 880 per year for 16 years, given a discount rate of 6 percent per annum. As

Compute the present value of an 1150 payment made in ten

Compute the present value of an $1,150 payment made in ten years when the discount rate is 12 percent. (Do not round int

Compute the present value of an annuity of 699 per year

Compute the present value of an annuity of $ 699 per year for 19 years, given a discount rate of 6 percent per annum. As