Assignment - Big Data Management

Domain: Airline Industry

Project: U.S. Airline/Air Traffic Analysis

Analysis and Data Synopsis:

The data for this project is the 2016 U.S. air-traffic dataset from the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS). It covers 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carriers that report voluntarily. The data cover nonstop scheduled-service flights between points within the United States (including territories).

The analysis aims to identify air-traffic patterns in 2016 across airports, airlines, and sectors in the U.S. The study also aims to analyze the performance of U.S. airlines across various parameters, the factors affecting that performance, and how performance varies across time periods. The dataset is around 465 MB and contains around 465k records.

Analysis Aimed at Covering:
1. Comprehending/Overall Summarization of the Air-Traffic Dataset

2. U.S. Air-Traffic/Airport Operation Analysis 2016 -
a. Airports with maximum air-traffic flows - Volume of air-traffic
b. Airports serving maximum no. of airlines
c. Airports service pattern across various time periods

3. U.S. Airline Functionality Analysis 2016 -
a. Aircraft with maximum flight services
b. Airport coverage density
c. Weekday air-travel density analysis

4. Aircraft flight performance analysis -
a. Aircraft delays - causes, volume, frequency
b. Aircraft Cancellation - count, airports, airlines, period, causal effects

5. Diverted Air-lines analysis -
a. Summary
b. Cause and density

The assignment contains four questions that ask you to get familiar with aspects of Apache Spark. The first three questions require Spark programming; the last question asks you to understand existing code and explain it in simple terms.

Q1. Consider the two data files (users.csv, transactions.csv). Users file has the following fields:
a) UserID
b) EmailID
c) NativeLanguage
d) Location

Transactions file has the following fields:
a) Transaction_ID
b) Product_ID
c) UserID
d) Price
e) Product_Description

By making use of Spark Core (i.e. without using Spark SQL) find out:
a) Count of unique locations where each product is sold.
b) Find out products bought by each user.
c) Total spending done by each user on each product.

You cannot make use of Spark SQL for this.
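The three results in Q1 can be checked locally before writing the Spark job. The following is a plain-Python sketch (not Spark code) that mirrors the usual pair-RDD steps: key the users by UserID, join with transactions on UserID, then aggregate per key. The sample rows and the function name `analyze` are hypothetical; only the field order comes from the question.

```python
from collections import defaultdict

# Hypothetical sample rows; the real inputs are users.csv and transactions.csv.
users = [
    # (UserID, EmailID, NativeLanguage, Location)
    ("u1", "a@example.com", "English", "US"),
    ("u2", "b@example.com", "French", "FR"),
    ("u3", "c@example.com", "English", "US"),
]
transactions = [
    # (Transaction_ID, Product_ID, UserID, Price, Product_Description)
    ("t1", "p1", "u1", 10.0, "pen"),
    ("t2", "p1", "u2", 10.0, "pen"),
    ("t3", "p2", "u1", 25.0, "book"),
    ("t4", "p1", "u3", 10.0, "pen"),
    ("t5", "p2", "u1", 25.0, "book"),
]


def analyze(users, transactions):
    # Map step: build UserID -> Location, the equivalent of a keyed pair RDD.
    location_of = {user_id: location for user_id, _, _, location in users}

    locations_per_product = defaultdict(set)      # input for (a)
    products_per_user = defaultdict(set)          # answer to (b)
    spend_per_user_product = defaultdict(float)   # answer to (c)

    for _, product_id, user_id, price, _ in transactions:
        # Join step: look up the buyer's location by UserID.
        if user_id in location_of:
            locations_per_product[product_id].add(location_of[user_id])
        products_per_user[user_id].add(product_id)
        # Reduce step: sum prices on the composite key (UserID, Product_ID).
        spend_per_user_product[(user_id, product_id)] += price

    # (a) count the distinct locations collected for each product.
    location_counts = {p: len(locs) for p, locs in locations_per_product.items()}
    return location_counts, dict(products_per_user), dict(spend_per_user_product)


location_counts, products_per_user, spend = analyze(users, transactions)
print(location_counts)
print(products_per_user)
print(spend)
```

In a Spark Core solution each loop step would typically become an RDD transformation such as `map`, `join`, `distinct`, or `reduceByKey`; this sketch only pins down what the output should look like.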

Q2. For this question, please make use of the attached JSON file (tweets.json). Make use of Spark SQL library to answer the following questions:
a) Save the dataset as a DataFrame, and print the schema.
b) Get all of the tweets made by a user (any user would work. We should be able to replace user names to get tweets by that particular user).
c) Find the count of all tweets by each user.
d) Get a list of all of the people who are mentioned in tweets.
e) Count the number of times each person is mentioned in the entire dataset of tweets.
f) Give top 50 users who are mentioned the most.
g) Get a list of all hashtags mentioned in the dataset.
h) Find how many times each hashtag is mentioned in the dataset.
i) Get a list of all of the people who are located in a particular city (e.g. Paris)
j) Get country wise distribution of users, and find out which country ranks highest in terms of number of tweets, and number of users.
k) Find out number of tweets where a user is from France and mentions Paris in their tweets.
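Parts (c) to (h) of Q2 rest on extracting mentions and hashtags from tweet text and counting them. Below is a plain-Python sketch of that extraction logic, useful for validating Spark SQL results on a small sample. It assumes each record is one JSON object per line with `user` and `text` keys; the actual schema of tweets.json is not given in the question, so these field names, the sample tweets, and the function name `summarize` are hypothetical.

```python
import json
import re
from collections import Counter

# A mention is "@" followed by word characters; a hashtag is "#" followed by them.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")

# Hypothetical line-delimited JSON; the field names are assumptions.
sample_lines = [
    '{"user": "alice", "text": "Hello @bob and @carol #spark"}',
    '{"user": "bob", "text": "@alice thanks! #spark #bigdata"}',
    '{"user": "alice", "text": "Reading about #bigdata with @bob"}',
]


def summarize(lines, top_n=50):
    tweets = [json.loads(line) for line in lines]
    # (c) number of tweets per user.
    tweets_per_user = Counter(t["user"] for t in tweets)
    # (d)/(e) every mention in every tweet, counted across the dataset.
    mentions = Counter(m for t in tweets for m in MENTION.findall(t["text"]))
    # (g)/(h) every hashtag in every tweet, counted across the dataset.
    hashtags = Counter(h for t in tweets for h in HASHTAG.findall(t["text"]))
    # (f) the top_n most-mentioned users, highest count first.
    return tweets_per_user, mentions.most_common(top_n), hashtags


tweets_per_user, top_mentions, hashtags = summarize(sample_lines)
print(tweets_per_user)
print(top_mentions)
print(hashtags)
```

If the real file already carries structured mention and hashtag fields, reading those directly would be more reliable than matching on the text.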

Q3. For this question, you would need to use the concepts learnt in the Graph Analytics session, and use the datasets trip.csv and station.csv. The two files contain bike sharing data provided by the SF Bay Area Portal. The Trip.csv file contains the following fields:
a) tripId
b) Duration
c) StartDate
d) EndDate
e) StartStation
f) StartTerminal
g) EndDate
h) EndStation
i) EndTerminal
j) BikeID
k) SubscriberType
l) ZipCode
The Station.csv file contains the following fields:

a) stationId
b) Name
c) Lat (Latitude)
d) Long (Longitude)
e) Dockcount
f) Landmark
g) Installation

Using the two files, please perform the following:

a) Import the data and create a graph using GraphFrames (Hint: Your graph will have nodes and edges. Nodes are individual stations, so the id field would be the Name field in the station.csv file. Edges have src and dst, which would be the Start Station and End Station fields in the trip.csv file respectively. You can use the other fields as properties of nodes and edges).
b) Find out number of incoming connections and outgoing connections for each node and print the top 10 nodes.
c) Find out which are the most common direct routes that people take and print top 10.
d) From the analysis in b, see which are the stations where people most frequently start their trips but do not come back. (Hint: You might have to think of incoming connections as a ratio of outgoing connections). Print top 10 such stations.
e) Find all such patterns where any station a is connected to station b, b is connected to c, but c is not directly connected to a.
f) Run a PageRank algorithm to figure out which is the most important station in the entire graph.
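Parts (b) to (e) of Q3 can be sanity-checked on a toy edge list before running them on the full graph. The sketch below is plain Python, not GraphFrames: it computes in-degree and out-degree, the most common routes, the in/out ratio suggested in the hint for (d), and the "a to b, b to c, but not c to a" pattern from (e). The station names and the function name `graph_summary` are hypothetical.

```python
from collections import Counter

# Hypothetical trips as (StartStation, EndStation) pairs, one per trip.
trips = [
    ("A", "B"), ("A", "B"), ("A", "C"),
    ("B", "C"), ("C", "B"), ("B", "A"),
]


def graph_summary(trips):
    # (b) outgoing and incoming connection counts per station.
    out_degree = Counter(src for src, _ in trips)
    in_degree = Counter(dst for _, dst in trips)
    # (c) how often each direct route (src, dst) is taken.
    route_counts = Counter(trips)
    # (d) incoming as a ratio of outgoing; a low ratio marks stations
    # where trips often start but seldom end.
    ratio = {s: in_degree[s] / out_degree[s] for s in out_degree}
    # (e) distinct edges, then chains a -> b -> c with no edge c -> a.
    edges = set(trips)
    open_triples = sorted(
        (a, b, c)
        for a, b in edges
        for b2, c in edges
        if b2 == b and a != b and b != c and c != a and (c, a) not in edges
    )
    return out_degree, in_degree, route_counts, ratio, open_triples


out_degree, in_degree, route_counts, ratio, open_triples = graph_summary(trips)
print(route_counts.most_common(10))
print(sorted(ratio.items(), key=lambda item: item[1])[:10])
print(open_triples)
```

In GraphFrames the same questions are usually answered with the graph's degree DataFrames and motif finding, but the exact calls should be checked against the version used in class.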

Q4. Consider the Movie Similarities code and problem that was discussed during the class (Session 4). Please provide a brief write-up on the problem, steps needed to arrive at the solution (recommendation system), and how exactly those steps are implemented in the code. While you are doing so, please also mention what each line of code does (It is not sufficient to mention what each block of code does, you would have to provide explanation for each line).
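The class code for Q4 is not reproduced here, so the following is only a sketch of one common approach to movie similarity: item-to-item cosine similarity over pairs of movies rated by the same user. Whether the Session 4 code uses exactly this measure is an assumption. The sample ratings and the function name `movie_similarities` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical (user, movie, rating) triples.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 5.0), ("u1", "m3", 1.0),
    ("u2", "m1", 4.0), ("u2", "m2", 4.0), ("u2", "m3", 5.0),
]


def movie_similarities(ratings):
    # Step 1: group ratings by user, i.e. key by user and collect (movie, rating).
    by_user = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append((movie, rating))

    # Step 2: for each user, emit every pair of movies that user rated.
    # Sorting keeps each pair in one canonical order, which avoids duplicates.
    pair_ratings = defaultdict(list)
    for rated in by_user.values():
        for (m1, r1), (m2, r2) in combinations(sorted(rated), 2):
            pair_ratings[(m1, m2)].append((r1, r2))

    # Step 3: cosine similarity of the two rating vectors for each movie pair,
    # stored together with the number of users who rated both movies.
    similarities = {}
    for pair, rating_pairs in pair_ratings.items():
        dot = sum(x * y for x, y in rating_pairs)
        norm_x = sqrt(sum(x * x for x, _ in rating_pairs))
        norm_y = sqrt(sum(y * y for _, y in rating_pairs))
        score = dot / (norm_x * norm_y) if norm_x and norm_y else 0.0
        similarities[pair] = (score, len(rating_pairs))
    return similarities


similarities = movie_similarities(ratings)
for pair, (score, co_ratings) in sorted(similarities.items()):
    print(pair, round(score, 4), co_ratings)
```

A recommendation step would then filter pairs by a minimum score and a minimum number of co-ratings, and rank the remaining movies for a given title.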
