Aviation Industry EDA with SQL and Python

 
Objective

The objective of this analysis is to identify opportunities for improving the occupancy rate of low-performing flights, which can ultimately result in increased revenue for the airline. By identifying factors that contribute to low occupancy, the airline can implement strategies to attract more passengers and fill up empty seats. This can include adjustments to pricing, marketing efforts, scheduling, or service enhancements. Increasing the occupancy rate on these flights will not only optimize resource utilization but also generate additional revenue for the airline, leading to improved financial performance and overall profitability.

 

Import Libraries

In [1]:

import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json

import warnings
warnings.filterwarnings('ignore')

 

Make Database Connection

In [2]:

conn = sqlite3.connect('travel.sqlite')
cursor = conn.cursor()
 

ETL Data for Analysis

In [3]:

cursor.execute(""" select name 
               FROM sqlite_master 
               WHERE type = 'table';
               """)

table_list = [table[0] for table in cursor.fetchall()]
print(table_list)
 
Out[3]:

['aircrafts_data', 'airports_data', 'boarding_passes', 'bookings', 'flights', 'seats', 'ticket_flights', 'tickets']

In [4]:

tables = pd.read_sql("""SELECT *
                        FROM sqlite_master
                        WHERE type='table';""", conn)
tables

Out[4]:

 

 
  type name tbl_name rootpage sql
0 table aircrafts_data aircrafts_data 2 CREATE TABLE aircrafts_data (\r\n aircraft_...
1 table airports_data airports_data 3 CREATE TABLE airports_data (\r\n airport_co...
2 table boarding_passes boarding_passes 4 CREATE TABLE boarding_passes (\r\n ticket_n...
3 table bookings bookings 5 CREATE TABLE bookings (\r\n book_ref charac...
4 table flights flights 6 CREATE TABLE flights (\r\n flight_id intege...
5 table seats seats 7 CREATE TABLE seats (\r\n aircraft_code char...
6 table ticket_flights ticket_flights 8 CREATE TABLE ticket_flights (\r\n ticket_no...
7 table tickets tickets 9 CREATE TABLE tickets (\r\n ticket_no charac...
 
In [5]:
aircrafts_data = pd.read_sql_query("select * from aircrafts_data", conn)
airports_data = pd.read_sql_query("select * from airports_data", conn)
boarding_passes = pd.read_sql_query("select * from boarding_passes", conn)
bookings = pd.read_sql_query("select * from bookings", conn)
flights = pd.read_sql_query("select * from flights", conn)
seats = pd.read_sql_query("select * from seats", conn)
ticket_flights = pd.read_sql_query("select * from ticket_flights", conn)
tickets = pd.read_sql_query("select * from tickets", conn)

In [6]:
aircrafts_data['model'] = aircrafts_data['model'].apply(lambda x: json.loads(x)['en'])

In [7]:
airports_data['airport_name'] = airports_data['airport_name'].apply(lambda x: json.loads(x)['en'])
airports_data['city'] = airports_data['city'].apply(lambda x: json.loads(x)['en'])

In [8]:

aircrafts_data

Out[8]:

 

 
  aircraft_code model range
0 773 Boeing 777-300 11100
1 763 Boeing 767-300 7900
2 SU9 Sukhoi Superjet-100 3000
3 320 Airbus A320-200 5700
4 321 Airbus A321-200 5600
5 319 Airbus A319-100 6700
6 733 Boeing 737-300 4200
7 CN1 Cessna 208 Caravan 1200
8 CR2 Bombardier CRJ-200 2700
 
In [9]:

for table in table_list:
    print('\ntable:', table)
    column_info = cursor.execute("PRAGMA table_info({})".format(table))
    for col in column_info.fetchall():
        print(col[1:3])

 
Out[9]:
table: aircrafts_data
('aircraft_code', 'character(3)')
('model', 'jsonb')
('range', 'INTEGER')

table: airports_data
('airport_code', 'character(3)')
('airport_name', 'jsonb')
('city', 'jsonb')
('coordinates', 'point')
('timezone', 'TEXT')

table: boarding_passes
('ticket_no', 'character(13)')
('flight_id', 'INTEGER')
('boarding_no', 'INTEGER')
('seat_no', 'character varying(4)')

table: bookings
('book_ref', 'character(6)')
('book_date', 'timestamp with time zone')
('total_amount', 'numeric(10,2)')

table: flights
('flight_id', 'INTEGER')
('flight_no', 'character(6)')
('scheduled_departure', 'timestamp with time zone')
('scheduled_arrival', 'timestamp with time zone')
('departure_airport', 'character(3)')
('arrival_airport', 'character(3)')
('status', 'character varying(20)')
('aircraft_code', 'character(3)')
('actual_departure', 'timestamp with time zone')
('actual_arrival', 'timestamp with time zone')

table: seats
('aircraft_code', 'character(3)')
('seat_no', 'character varying(4)')
('fare_conditions', 'character varying(10)')

table: ticket_flights
('ticket_no', 'character(13)')
('flight_id', 'INTEGER')
('fare_conditions', 'character varying(10)')
('amount', 'numeric(10,2)')

table: tickets
('ticket_no', 'character(13)')
('book_ref', 'character(6)')
('passenger_id', 'character varying(20)')

In [10]:
for table in table_list:
    print('\ntable:', table)
    df_table = pd.read_sql_query(f'select * from {table}', conn)
    print(df_table.isnull().sum())

 
Out[10]:
table: aircrafts_data
aircraft_code    0
model            0
range            0
dtype: int64

table: airports_data
airport_code    0
airport_name    0
city            0
coordinates     0
timezone        0
dtype: int64

table: boarding_passes
ticket_no      0
flight_id      0
boarding_no    0
seat_no        0
dtype: int64

table: bookings
book_ref        0
book_date       0
total_amount    0
dtype: int64

table: flights
flight_id              0
flight_no              0
scheduled_departure    0
scheduled_arrival      0
departure_airport      0
arrival_airport        0
status                 0
aircraft_code          0
actual_departure       0
actual_arrival         0
dtype: int64

table: seats
aircraft_code      0
seat_no            0
fare_conditions    0
dtype: int64

table: ticket_flights
ticket_no          0
flight_id          0
fare_conditions    0
amount             0
dtype: int64

table: tickets
ticket_no       0
book_ref        0
passenger_id    0
dtype: int64

In [11]:

sns.set_style('whitegrid')
fig,axes = plt.subplots(figsize=(12,8))
ax = sns.barplot(x='model',y='range', data=aircrafts_data, palette = 'Paired')
for container in ax.containers:
    ax.bar_label(container)
plt.title('AirPlane Models with their ranges')
plt.xticks(rotation=45)
plt.show()

 
Out[11]:
 
In [12]:

df = pd.read_sql_query("""select aircraft_code, count(*) as num_seats from seats
                        group by aircraft_code having num_seats >100""", conn)

# df.to_csv('aircraft_seats.csv')

In [13]:

sns.set_style('whitegrid')
fig,axes = plt.subplots(figsize=(12,8))
ax = sns.barplot(x='aircraft_code',y='num_seats', data=df, palette = 'flare')
for container in ax.containers:
    ax.bar_label(container)
plt.title('AirCraft codes Vs Number of Seats')
plt.xticks(rotation=45)
plt.show()

 
Out[13]:
 
In [14]:

crafts = pd.read_sql("""SELECT aircraft_code, model->'en'
                        FROM aircrafts_data
                        where aircraft_code IN (319, 320, 321, 733, 763, 773);""", conn)
crafts

Out[14]:

 

 
  aircraft_code model->'en'
0 773 "Boeing 777-300"
1 763 "Boeing 767-300"
2 320 "Airbus A320-200"
3 321 "Airbus A321-200"
4 319 "Airbus A319-100"
5 733 "Boeing 737-300"
 

Trends in ticket sales

In [15]:

tickets = pd.read_sql_query("""select * from tickets inner join bookings
                    on tickets.book_ref = bookings.book_ref""", conn)

tickets['book_date'] = pd.to_datetime(tickets['book_date'])
tickets['date'] = tickets['book_date'].dt.date
tickets_count = tickets.groupby('date')[['date']].count()
plt.figure(figsize=(18,6))
plt.plot(tickets_count.index, tickets_count['date'], color='green', scalex=True, marker = "*")
plt.title('Total amount sold on Each Date', fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Amount', fontsize=20)
plt.grid('b')
plt.show()

 
Out[15]:
 

Trends In Earnings over time

In [16]:

bookings = pd.read_sql_query("select * from bookings", conn)

bookings['book_date'] = pd.to_datetime(bookings['book_date'])
bookings['date'] = bookings['book_date'].dt.date
booking_amount = bookings.groupby('date')[['total_amount']].sum()

plt.figure(figsize=(18,6))
plt.plot(booking_amount.index, booking_amount['total_amount'],color='orange',scalex=True, marker = '*')
plt.title('Number of Tickets Booked on Each Date', fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Total Amount Earned', fontsize=20)
plt.grid('b')
plt.show()

 
Out[16]:
 

Average fares in diffrent aircrafts based on seat class.

In [17]:

df = pd.read_sql_query("""select fare_conditions, aircraft_code,avg(amount) 
                        from ticket_flights join flights 
                        on ticket_flights.flight_id = flights.flight_id
                        group by aircraft_code, fare_conditions""", conn)
df.to_csv('fare_avg_amount.csv')

sns.set_style('whitegrid')
fig,axes = plt.subplots(figsize=(12,8))
ax = sns.barplot(x='aircraft_code',y='avg(amount)',hue='fare_conditions', data=df, palette = 'flare')
for container in ax.containers:
    ax.bar_label(container)
plt.title('Class wise Average Flight Prices')
plt.xticks(rotation=45)
plt.show()

 
Out[17]:
 
In [18]:

crafts = pd.read_sql("""SELECT aircraft_code, model->'en'
                        FROM aircrafts_data
                        where aircraft_code IN (319, 321, 733, 763, 773, 'CN1', 'CR2', 'SU9');""", conn)
crafts

Out[18]:

 

 
  aircraft_code model->'en'
0 773 "Boeing 777-300"
1 763 "Boeing 767-300"
2 SU9 "Sukhoi Superjet-100"
3 321 "Airbus A321-200"
4 319 "Airbus A319-100"
5 733 "Boeing 737-300"
6 CN1 "Cessna 208 Caravan"
7 CR2 "Bombardier CRJ-200"
 
In [19]:

revenue = pd.read_sql_query("""select aircraft_code,ticket_count,total_revenue,total_revenue/ticket_count as avg_revenue_per_ticket from
                    (select aircraft_code, count(*) as ticket_count, sum(amount) as total_revenue from ticket_flights
                        join flights on ticket_flights.flight_id = flights.flight_id
                        group by aircraft_code)""", conn)
revenue.to_csv('revenue.csv')
revenue

Out[19]:

 

 
  aircraft_code ticket_count total_revenue avg_revenue_per_ticket
0 319 52853 2706163100 51201
1 321 107129 1638164100 15291
2 733 86102 1426552100 16568
3 763 124774 4371277100 35033
4 773 144376 3431205500 23765
5 CN1 14672 96373800 6568
6 CR2 150122 1982760500 13207
7 SU9 365698 5114484700 13985
 
In [20]:

df = pd.DataFrame(revenue)

# Set the style for the plot
sns.set(style='whitegrid')

# Set the figure size
plt.figure(figsize=(8, 6))

# Create the bar plot
sns.barplot(x='aircraft_code', y='value', hue='variable', data=pd.melt(df, id_vars=['aircraft_code']))

# Add labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Plot with Multiple Columns')

# Show the plot
plt.show()

 
Out[20]:
 
In [21]:

df

Out[21]:

 

 
  aircraft_code ticket_count total_revenue avg_revenue_per_ticket
0 319 52853 2706163100 51201
1 321 107129 1638164100 15291
2 733 86102 1426552100 16568
3 763 124774 4371277100 35033
4 773 144376 3431205500 23765
5 CN1 14672 96373800 6568
6 CR2 150122 1982760500 13207
7 SU9 365698 5114484700 13985
 
In [22]:

columns_to_plot = ['ticket_count', 'avg_revenue_per_ticket']

# Subset the DataFrame with the selected columns
df_subset = df[['aircraft_code'] + columns_to_plot]
ax = sns.barplot(x='aircraft_code', y='value', hue='variable', data=pd.melt(df_subset, 'aircraft_code'), palette='Set2')

# Add labels and title
plt.xlabel('Aircraft')
plt.ylabel('Value')
plt.title('Ticket sold and average revenue per ticket')

 
Out[22]:
 
In [23]:

pd.read_sql_query("""select aircraft_code, count(*) as seats_count from seats 
                    group by aircraft_code""", conn
                  )

Out[23]:

 

 
  aircraft_code seats_count
0 319 116
1 320 140
2 321 170
3 733 130
4 763 222
5 773 402
6 CN1 12
7 CR2 50
8 SU9 97
 

Aircrafts having more than 100 seats?

In [24]:

pd.read_sql_query("""select aircraft_code, count(*) as seats_count from seats 
                    group by aircraft_code having seats_count >=100""", conn
                  )

Out[24]:

 

 
  aircraft_code seats_count
0 319 116
1 320 140
2 321 170
3 733 130
4 763 222
5 773 402
 
In [25]:

tickets = pd.read_sql_query("""select * from tickets
                  inner join bookings
                  on tickets.book_ref = bookings.book_ref  
                  """, conn
                  )

tickets.dtypes

Out[25]:
ticket_no       object
book_ref        object
passenger_id    object
book_ref        object
book_date       object
total_amount     int64
dtype: object


In [26]:

tickets['book_date'] = pd.to_datetime(tickets['book_date'])

tickets['date'] = tickets['book_date'].dt.date

In [27]:

x = tickets.groupby('date')[['date']].count()
plt.figure(figsize=(18,6))
plt.plot(x.index, x['date'],color='orange',scalex=True, marker = 'x')
plt.title('Number of Tickets Sold on Each Date', fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Number of Tickets', fontsize=20)
plt.grid('b')
plt.show()

Out[27]:
 
In [28]:

df = pd.read_sql_query("""select fare_conditions, aircraft_code,ROUND(AVG(amount), 2) as avg_amount 
                        from ticket_flights join flights 
                        on ticket_flights.flight_id = flights.flight_id
                        group by aircraft_code, fare_conditions""", conn)
df

Out[28]:

 

 
  fare_conditions aircraft_code avg_amount
0 Business 319 113550.56
1 Economy 319 38311.40
2 Business 321 34435.66
3 Economy 321 11534.97
4 Business 733 41865.63
5 Economy 733 13985.15
6 Business 763 82839.84
7 Economy 763 27594.72
8 Business 773 57779.91
9 Comfort 773 32740.55
10 Economy 773 19265.23
11 Economy CN1 6568.55
12 Economy CR2 13207.66
13 Business SU9 33487.85
14 Economy SU9 11220.18
 
In [29]:

sns.barplot(data=df, x=df['aircraft_code'],y=df['avg_amount'] ,hue=df['fare_conditions'])

Out[29]:
 

Average Occupancy Rate

In [30]:

occupancy_rate = pd.read_sql_query("""select a.aircraft_code, round(avg(a.seats_count), 2) as avg_booked_seats, 
                b.num_seats, round(avg(a.seats_count)/b.num_seats, 2) as occupancy_rate  from
                (select aircraft_code,flights.flight_id,count(*) as seats_count from boarding_passes
                    inner join flights
                    on boarding_passes.flight_id = flights.flight_id
                    group by aircraft_code,flights.flight_id) as a
                    inner join 
                    (select aircraft_code,count(*) as num_seats from seats
                    group by aircraft_code) as b
                    on a.aircraft_code = b.aircraft_code group by a.aircraft_code""", conn
                  )
occupancy_rate

Out[30]:

 

 
  aircraft_code avg_booked_seats num_seats occupancy_rate
0 319 53.58 116 0.46
1 321 88.81 170 0.52
2 733 80.26 130 0.62
3 763 113.94 222 0.51
4 773 264.93 402 0.66
5 CN1 6.00 12 0.50
6 CR2 21.48 50 0.43
7 SU9 56.81 97 0.59
 

Increase occupancy rate by 10% of all aircraft

In [31]:

occupancy_rate['inc occupancy rate'] = occupancy_rate['occupancy_rate']\
    +round(occupancy_rate['occupancy_rate']*0.1, 2)
occupancy_rate

Out[31]:

 

 
  aircraft_code avg_booked_seats num_seats occupancy_rate inc occupancy rate
0 319 53.58 116 0.46 0.51
1 321 88.81 170 0.52 0.57
2 733 80.26 130 0.62 0.68
3 763 113.94 222 0.51 0.56
4 773 264.93 402 0.66 0.73
5 CN1 6.00 12 0.50 0.55
6 CR2 21.48 50 0.43 0.47
7 SU9 56.81 97 0.59 0.65
 
In [32]:

occupancy_rate

Out[32]:

 

 
  aircraft_code avg_booked_seats num_seats occupancy_rate inc occupancy rate
0 319 53.58 116 0.46 0.51
1 321 88.81 170 0.52 0.57
2 733 80.26 130 0.62 0.68
3 763 113.94 222 0.51 0.56
4 773 264.93 402 0.66 0.73
5 CN1 6.00 12 0.50 0.55
6 CR2 21.48 50 0.43 0.47
7 SU9 56.81 97 0.59 0.65
 
In [33]:

revenue

Out[33]:

 

 
  aircraft_code ticket_count total_revenue avg_revenue_per_ticket
0 319 52853 2706163100 51201
1 321 107129 1638164100 15291
2 733 86102 1426552100 16568
3 763 124774 4371277100 35033
4 773 144376 3431205500 23765
5 CN1 14672 96373800 6568
6 CR2 150122 1982760500 13207
7 SU9 365698 5114484700 13985
 
In [34]:

revenue_kpi = pd.merge(revenue, occupancy_rate, on='aircraft_code', how='inner')
revenue_kpi

Out[34]:

 

 
  aircraft_code ticket_count total_revenue avg_revenue_per_ticket avg_booked_seats num_seats occupancy_rate inc occupancy rate
0 319 52853 2706163100 51201 53.58 116 0.46 0.51
1 321 107129 1638164100 15291 88.81 170 0.52 0.57
2 733 86102 1426552100 16568 80.26 130 0.62 0.68
3 763 124774 4371277100 35033 113.94 222 0.51 0.56
4 773 144376 3431205500 23765 264.93 402 0.66 0.73
5 CN1 14672 96373800 6568 6.00 12 0.50 0.55
6 CR2 150122 1982760500 13207 21.48 50 0.43 0.47
7 SU9 365698 5114484700 13985 56.81 97 0.59 0.65
 
In [35]:

revenue_kpi = revenue_kpi.drop('inc occupancy rate', axis=1)

In [36]:

revenue_kpi

Out[36]:

 

 
  aircraft_code ticket_count total_revenue avg_revenue_per_ticket avg_booked_seats num_seats occupancy_rate
0 319 52853 2706163100 51201 53.58 116 0.46
1 321 107129 1638164100 15291 88.81 170 0.52
2 733 86102 1426552100 16568 80.26 130 0.62
3 763 124774 4371277100 35033 113.94 222 0.51
4 773 144376 3431205500 23765 264.93 402 0.66
5 CN1 14672 96373800 6568 6.00 12 0.50
6 CR2 150122 1982760500 13207 21.48 50 0.43
7 SU9 365698 5114484700 13985 56.81 97 0.59
 
In [37]:

conn.close()

 

By examining the financial impact of raising occupancy rates and implementing data-driven pricing strategies, airlines can enhance occupancy rates, revenue, and profitability. However, it is crucial to strike a balance between increasing occupancy rates and delivering high-quality service while upholding safety regulations.

To summarize, analyzing revenue data and occupancy rates is crucial for airlines to maximize profitability. By understanding the correlation between ticket bookings and revenue, optimizing pricing strategies, and increasing occupancy rates, airlines can drive business growth and success in a highly competitive industry.