About Dataset
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
-
Create a DataFrame (airlines_1987_to_2008) from this path:
%fs ls dbfs:/databricks-datasets/asa/airlines/ (The airlines folder contains CSV files named 1987.csv through 2008.csv. Create a single DataFrame from all of the files.)
-
Create a PySpark Datatypes schema for the above DF
-
View the dataframe
-
Return count of records in dataframe
-
Select the columns - Origin, Dest and Distance
-
Filter the data with the 'where' method, keeping rows where Year = 2001
-
Create a new DataFrame (airlines_1987_to_2008_drop_DayofMonth) that excludes the “DayofMonth” column
-
Display new DataFrame
-
Create a new DataFrame (AddNewColumn) with an added 'Weekend' column and display it
-
Cast the ActualElapsedTime column to integer and use printSchema() to verify
-
Rename 'DepTime' to 'DepartureTime'
-
Drop duplicate rows based on Year and Month and create a new DataFrame (Drop Rows)
-
Sort by the Year column in descending order using sort() and display the result
-
Group the data by Origin and return the count for each group
-
Group the data by 'Dest' and find the maximum value for each 'Dest'
-
Write data in Delta format