SQL Queries Demonstration: Airline Operations Analysis¶

In this notebook, we'll explore various aspects of airline operations using SQL queries. Our dataset comprises three tables: airlines, airports, and flights. By leveraging SQL queries, we aim to extract valuable insights to understand airline performance, identify trends, and address operational challenges.

Key Questions Addressed:¶

  1. Distinct Airlines:

    • Understanding the variety of airlines operating in our dataset.
  2. Maximum Distance Analysis by Airline:

    • Determining which airlines operate flights covering the longest distances.
  3. Busiest Departure Days from Top 10 Airports:

    • Identifying the busiest days for departures from the top 10 airports.
  4. Cancellation Ratio Analysis by Airline:

    • Assessing the cancellation ratios of different airlines to gauge their reliability.
  5. Top 10 Airports with the Highest Average Arrival Delays in January:

    • Identifying airports with significant average arrival delays in a specific month.
  6. Analysis of On-time Landings Despite Departure Delays Exceeding 15 Minutes by Airlines:

    • Evaluating which airlines maintain on-time arrivals despite departure delays.

Dataset Overview:¶

  • Airlines Table:

    • Contains information about various airlines, including their names and unique identifiers.
  • Airports Table:

    • Provides details about airports, such as their names, locations, and unique codes.
  • Flights Table:

    • Includes comprehensive flight data, such as departure/arrival times, distances, cancellations, delays, and associated airline and airport codes.

Methodology:¶

  • We'll utilize SQL queries to extract relevant information from the dataset.
  • Each query will be accompanied by a brief explanation of its purpose and the insights it provides.
  • Visualizations may complement some analyses for clearer interpretation.

Let's Dive In!¶

Without further ado, let's delve into our SQL queries to uncover valuable insights into airline operations and performance.

In [1]:
# @hidden_cell
%load_ext sql
%sql sqlite:///jupyter_sql_tutorial.db

Distinct airlines¶

This query selects unique values from the "AIRLINE" column of the airlines table. It retrieves only distinct airline names, meaning each airline name appears only once in the result set, even if it occurs multiple times in the original table.

In [84]:
%%sql
select distinct AIRLINE
from airlines
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[84]:
AIRLINE
United Air Lines Inc.
American Airlines Inc.
US Airways Inc.
Frontier Airlines Inc.
JetBlue Airways
Skywest Airlines Inc.
Alaska Airlines Inc.
Spirit Air Lines
Southwest Airlines Co.
Delta Air Lines Inc.
Atlantic Southeast Airlines
Hawaiian Airlines Inc.
American Eagle Airlines Inc.
Virgin America

Maximum Distance Analysis by Airline¶

This query retrieves the maximum distance traveled for each airline from the flights table. It groups the data by the airline name (AIRLINE) and selects the maximum distance (max(DISTANCE)) for each group. So, the result contains one row per airline, with the airline name and the maximum distance traveled by that airline.

In [86]:
%%sql
select AIRLINE ,max(DISTANCE)
from flights
group by 1
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[86]:
AIRLINE max(DISTANCE)
AA 3784
AS 2874
B6 2704
DL 4983
EV 1330
F9 2065
HA 4983
MQ 1236
NK 2342
OO 1735
UA 4962
US 2979
VX 2704
WN 2447

Busiest Departure Days from Top 10 Airports¶

This query identifies the busiest departure days for flights originating from the top 10 airports. Here's a summary:

  1. Identifying Top Airports:

    • Selects the top 10 airports based on the number of departing flights from each airport.
  2. Ranking Departure Days:

    • Calculates the number of departures for each airport on each day and ranks them based on departure count.
  3. Selecting Relevant Information:

    • Retrieves the airport name, city, departure date, and departure number for the busiest departure day of each airport.
  4. Joining with Airport Data:

    • Joins the departure data with airport information to enrich the results.
  5. Filtering and Ordering:

    • Filters the results to include only the busiest departure day for each airport.
    • Orders the results by the number of departures on the busiest day in descending order.

Overall, the query provides insights into the busiest departure days for flights from the top 10 airports, offering details such as airport name, city, departure date, and the number of departures on those days.

In [50]:
%%sql

with top_airport as (select ORIGIN_AIRPORT, count(*)
from flights
group by 1
order by 2 desc
limit 10),

Ranking_days as (
select F.ORIGIN_AIRPORT,
F.YEAR*10000+F.Month*100+F.DAY as date, 
count(*)  departure_number,
RANK() OVER (PARTITION BY F.ORIGIN_AIRPORT ORDER BY count(*) DESC) as ranking
from flights F
    join top_airport T
    on F.ORIGIN_AIRPORT = T.ORIGIN_AIRPORT
group by 1,2 
order by 1,2)

select A.AIRPORT,
CITY,
R.date,
R.departure_number
from Ranking_days R
join airport A
on R.ORIGIN_AIRPORT = A.IATA_CODE
where ranking=1
order by departure_number desc
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[50]:
AIRPORT CITY date departure_number
Hartsfield-Jackson Atlanta International Airport Atlanta 20150213 1069
Chicago O'Hare International Airport Chicago 20150220 847
Dallas/Fort Worth International Airport Dallas-Fort Worth 20150102 809
Denver International Airport Denver 20150102 658
Los Angeles International Airport Los Angeles 20150102 640
George Bush Intercontinental Airport Houston 20150102 508
San Francisco International Airport San Francisco 20150102 493
Phoenix Sky Harbor International Airport Phoenix 20150102 487
McCarran International Airport Las Vegas 20150220 399
Orlando International Airport Orlando 20150103 360

Cancellation Ratio Analysis by Airline¶

The query calculates the cancellation ratio for each airline, indicating the proportion of cancelled flights relative to the total number of flights operated by each airline. It achieves this through the following steps:

  1. Selection of Relevant Data:

    • It selects the airline name (A.AIRLINE) and calculates the cancellation ratio.
  2. Calculation of Cancellation Ratio:

    • The cancellation ratio is computed by dividing the total number of cancelled flights by the total count of flights operated by each airline. This ratio is rounded to four decimal places.
  3. Joining Tables:

    • It joins the flights table (F) with the airlines table (A) using the airline IATA code to access additional airline information.
  4. Grouping by Airline:

    • The results are grouped by airline to aggregate flight data for each carrier.
  5. Ordering by Cancellation Ratio:

    • Finally, the results are ordered by the cancellation ratio in descending order, showcasing which airlines have the highest proportion of cancelled flights relative to the total number of flights they operate.
In [85]:
%%sql
select A.AIRLINE,
round(sum(F.CANCELLED) * 1.00/ count(F.CANCELLED),4) as ratio_cancelled
from flights F
join airlines A
on F.AIRLINE = A.IATA_CODE
group by 1
order by 2 desc
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[85]:
AIRLINE ratio_cancelled
American Eagle Airlines Inc. 0.0989
JetBlue Airways 0.0606
Atlantic Southeast Airlines 0.0489
US Airways Inc. 0.0383
American Airlines Inc. 0.0323
Skywest Airlines Inc. 0.0292
Southwest Airlines Co. 0.0279
United Air Lines Inc. 0.0277
Virgin America 0.0273
Spirit Air Lines 0.0199
Delta Air Lines Inc. 0.0191
Frontier Airlines Inc. 0.0159
Alaska Airlines Inc. 0.0062
Hawaiian Airlines Inc. 0.0025

Top 10 Airports with the Highest Average Arrival Delays in January¶

The query calculates the average arrival delay for flights arriving at each airport during the month of January. It achieves this through the following steps:

  1. Selection of Relevant Data:

    • It selects the airport name (A.AIRPORT), city (A.CITY), and calculates the average delay.
  2. Calculation of Average Delay:

    • The average delay is computed by taking the average of arrival delays (ARRIVAL_DELAY) for flights arriving at each airport. If the arrival delay is negative (i.e., the flight arrives early), it is treated as 0.
  3. Joining Tables:

    • It joins the flights table (F) with the airport table (A) using the destination airport IATA code to access airport information.
  4. Filtering by Month:

    • The query restricts the data to flights that occurred in January (month = 1).
  5. Grouping by Airport:

    • The results are grouped by airport name and city to aggregate flight data for each destination airport.
  6. Ordering and Limiting Results:

    • The results are ordered by the average delay in descending order.
    • Only the top 10 airports with the highest average delays are included in the output.

In summary, the query provides insights into the airports with the highest average arrival delays for flights arriving in January, showing the airport name, city, and the average delay for each of the top 10 airports.

In [65]:
%%sql

select A.AIRPORT,
A.CITY,
round(avg(case when ARRIVAL_DELAY<0 then 0
    else ARRIVAL_DELAY end),2) as avg_delay 
from flights F
join airport A
on F.DESTINATION_AIRPORT = A.IATA_CODE
where month = 1
group by A.AIRPORT,A.CITY
order by 3 desc
limit 10
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[65]:
AIRPORT CITY avg_delay
Sawyer International Airport Marquette 47.46
Waterloo Regional Airport Waterloo 36.31
Jamestown Regional Airport Jamestown 33.15
University of Illinois - Willard Airport Champaign/Urbana 32.98
Southwest Oregon Regional Airport (North Bend Municipal) North Bend 32.44
Central Wisconsin Airport Mosinee 32.06
Devils Lake Regional Airport Devils Lake 31.5
Alpena County Regional Airport Alpena 31.12
Toledo Express Airport Toledo 30.49
Sioux Gateway Airport Sioux City 30.13

Analysis of On-Time Landings Despite Departure Delays Exceeding 15 Minutes by Airlines¶

The query aims to analyze the proportion of flights that depart with delays greater than 15 minutes and subsequently land on time, for each airline. It accomplishes this through the following steps:

  1. Selection of Delayed Departures:

    • The query begins by selecting only those flights where the departure delay exceeds 15 minutes. These flights are stored in a CTE (Common Table Expression) called delay_departure.
  2. Pre-processing Data:

    • The pre_resu CTE calculates the count of delayed take-offs (take_off_delayed) and the count of flights that land on time (landed_in_time) for each airline. It aggregates the data by airline.
  3. Calculation of Ratios:

    • It then calculates the ratio of flights that land on time to the total number of delayed take-offs. This ratio is rounded to three decimal places.
  4. Final Selection and Ordering:

    • The final select statement retrieves the airline name, count of delayed take-offs, count of flights that landed on time, and the calculated ratio. The results are ordered by the ratio of on-time landings to delayed take-offs in descending order.

Overall, the query provides insights into the performance of airlines in managing delays by analyzing the proportion of flights that depart late but arrive on time.

In [83]:
%%sql

-- select only delay flight by more than 15min at departure
with delay_departure as (select *
from flights
where DEPARTURE_DELAY > 15),

pre_resu as (
select 
A.airline ,
count(*) as take_off_delayed ,
sum( case when D.ARRIVAL_DELAY<= 0 then 1
    else 0
    end) as landed_in_time
from delay_departure D
join airlines A 
    on D.AIRLINE = A.IATA_CODE
group by 1)

select 
airline,
take_off_delayed,
landed_in_time,
round(landed_in_time*1.00 / take_off_delayed,3) as ratio_delay_dep_landed_in_time
from pre_resu
order by 4 desc
 * sqlite:///jupyter_sql_tutorial.db
Done.
Out[83]:
airline take_off_delayed landed_in_time ratio_delay_dep_landed_in_time
United Air Lines Inc. 17005 1380 0.081
Virgin America 1370 110 0.08
Delta Air Lines Inc. 18347 1168 0.064
Alaska Airlines Inc. 2842 134 0.047
American Airlines Inc. 14034 517 0.037
Southwest Airlines Co. 34998 1243 0.036
JetBlue Airways 9557 290 0.03
US Airways Inc. 8997 251 0.028
Spirit Air Lines 3972 89 0.022
Hawaiian Airlines Inc. 1042 18 0.017
Skywest Airlines Inc. 17316 259 0.015
Atlantic Southeast Airlines 16767 231 0.014
Frontier Airlines Inc. 3484 33 0.009
American Eagle Airlines Inc. 14273 116 0.008