Features from destinations

  • Takes the flights data and airports data
  • destinations are in a list in the flights data
  • Clean the ‘destination’ of flights so that we list all intermediate destinations and first/last clearly
  • Merge with airports data to include airport data in our model input

TODO:

  • Clearly a waste of space with all the copied values for each airport ID, restructure pipeline so that isn’t required
  • I imagine that the airports data is pretty static and we could expect a table on DataBricks

Due to time-constraints I use this solution so that we can keep merging/indexing only by id. In a more sophisticated approach we could merge downstream by other columns, such as destination on airport - therefore not having to include all duplicate features and save them to disk at this stage.

Parameters

  • flights_file: Filepath of flights data in format received from Schiphol - airports_file: Filepath of airports data from

Schiphol - output_file: Filepath to write output csv file with minimal modelling input

### Returns

Output CSV file with minimal model input

id                   |  aircraftRegistration  |  airlineCode  |  terminal  |  serviceType  |      scheduleDateTime        |     actualOffBlockTime      |  scheduleDelaySeconds
124257473326719795   |    PHEXI               |     80.0      |     2.0    |       J       |  2018-05-01 16:35:00+02:00   |  2018-05-01 16:58:16+02:00  |         1396.0
124538476600837715   |    PHEXL               |     2481.0    |     1.0    |       J       |  2018-06-10 13:00:00+02:00   |  2018-06-10 13:11:25+02:00  |         685.0
123512829091050355   |    PHBQO               |     100.0     |     2.0    |       J       |  2018-01-15 10:15:00+01:00   |  2018-01-15 10:35:10+01:00  |         1210.0
123786805997701057   |    PHEXG               |     2481.0    |     1.0    |       J       |  2018-02-23 17:45:00+01:00   |  2018-02-23 17:55:52+01:00  |         652.0
124664922607744671   |    PHBXP               |     1551.0    |     2.0    |       J       |  2018-06-28 20:50:00+02:00   |  2018-06-28 22:09:23+02:00  |         4763.0

id | final_destination | Country | City | Latitude | Longitude | Altitude | DST | destination_distance 123414478192901837 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790516475 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414478192901991 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790510775 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414479288269149 | SPC | Spain | Santa Cruz … | 28.626499 | -17.75560 | 107.0 | E | 3.267980e+01

File parameters

[3]:
# input parameters cell
flights_file = "../lvt-schiphol-assignment-snakemake/data/raw/flights.csv"
airports_file = "../lvt-schiphol-assignment-snakemake/data/raw/airports.csv"
output_file = "processed_flights.csv"

Libraries

[4]:
import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Read data

[5]:
%%time
df_flights = read_csv_data(flights_file)
df_airports = read_csv_data(airports_file)
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/raw/flights.csv

Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/raw/airports.csv

Wall time: 1.95 s

Destination features

  • Route to destination in a column of lists
  • Get final destination and number of destination
[15]:
df_flights
[15]:
actualOffBlockTime aircraftRegistration aircraftType.iatamain aircraftType.iatasub airlineCode baggageClaim estimatedLandingTime expectedTimeBoarding expectedTimeGateClosing expectedTimeGateOpen ... prefixICAO publicEstimatedOffBlockTime publicFlightState.flightStates route.destinations scheduleDate scheduleTime serviceType terminal transferPositions transferPositions.transferPositions
0 NaN NaN NaN NaN 148.0 NaN NaN NaN NaN NaN ... ZXP NaN ['SCH'] ['AMS'] 2018-01-01 03:02:07 P NaN NaN NaN
1 NaN PHPXY AW1 NaN 148.0 NaN NaN NaN NaN NaN ... ZXP NaN ['SCH'] ['AMS'] 2018-01-01 03:16:00 NaN NaN NaN NaN
2 NaN NaN AW1 NaN 148.0 NaN NaN NaN NaN NaN ... ZXP NaN ['SCH'] ['AMS'] 2018-01-01 03:16:29 P NaN NaN NaN
3 2018-01-01T03:22:00.000+01:00 PHPXB NaN NaN 148.0 NaN NaN NaN NaN NaN ... ZXP NaN ['DEP'] ['AMS'] 2018-01-01 03:30:00 NaN NaN NaN NaN
4 2018-01-01T05:58:22.000+01:00 PHHSJ 73H 73H 164.0 NaN NaN NaN NaN NaN ... TRA NaN ['DEP'] ['SPC'] 2018-01-01 06:00:00 J 1.0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
523270 NaN NaN 320 320 64.0 NaN NaN 2018-07-31T21:20:00.000+02:00 2018-07-31T21:30:00.000+02:00 NaN ... EZY NaN ['SCH'] ['MAN'] 2018-07-31 21:50:00 J NaN NaN NaN
523271 NaN NaN 319 319 64.0 NaN NaN 2018-07-31T21:25:00.000+02:00 2018-07-31T21:35:00.000+02:00 NaN ... EZY NaN ['SCH'] ['SEN'] 2018-07-31 21:55:00 J NaN NaN NaN
523272 NaN NaN 73H 73H 234.0 NaN NaN NaN NaN NaN ... RYR NaN ['SCH'] ['DUB'] 2018-07-31 22:05:00 J NaN NaN NaN
523273 NaN NaN 73W 73W 100.0 NaN NaN 2018-07-31T21:40:00.000+02:00 2018-07-31T21:55:00.000+02:00 2018-07-31T21:10:00.000+02:00 ... KLM NaN ['SCH'] ['NCL'] 2018-07-31 22:10:00 J 2.0 NaN NaN
523274 NaN NaN 787 788 794.0 NaN NaN 2018-07-31T21:35:00.000+02:00 2018-07-31T22:05:00.000+02:00 2018-07-31T20:30:00.000+02:00 ... AMX NaN ['SCH'] ['MEX'] 2018-07-31 22:25:00 J NaN NaN NaN

523275 rows × 28 columns

[6]:
%%time

# route destinations parsed as a list then calculate length and expand list to columns
df_routes = df_flights[["id", "route.destinations"]] \
    .assign(route_list = lambda d: d["route.destinations"].apply(eval)) \
    .assign(route_length = lambda d: d["route_list"].apply(len),
            first_destination = lambda d: d["route_list"].apply(lambda x: x[0]),
            final_destination = lambda d: d["route_list"].apply(lambda x: x[-1]))

# determine separate route output columns
max_route_length = df_routes["route_length"].max()
destination_columns = [f"destination_{i}" for i in range(max_route_length)]

# unlist routes into multiple columns
df_routes[destination_columns] =  pd.DataFrame(df_routes["route_list"] \
                                               .apply(lambda x: (x + [np.nan] * max_route_length)[:max_route_length]).tolist(),
                                               index= df_routes.index)
df_routes
Wall time: 4.58 s
[6]:
id route.destinations route_list route_length first_destination final_destination destination_0 destination_1 destination_2 destination_3 destination_4
0 123414478192901837 ['AMS'] [AMS] 1 AMS AMS AMS NaN NaN NaN NaN
1 123414481790516475 ['AMS'] [AMS] 1 AMS AMS AMS NaN NaN NaN NaN
2 123414478192901991 ['AMS'] [AMS] 1 AMS AMS AMS NaN NaN NaN NaN
3 123414481790510775 ['AMS'] [AMS] 1 AMS AMS AMS NaN NaN NaN NaN
4 123414479288269149 ['SPC'] [SPC] 1 SPC SPC SPC NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
523270 124896773782315507 ['MAN'] [MAN] 1 MAN MAN MAN NaN NaN NaN NaN
523271 124896773782912169 ['SEN'] [SEN] 1 SEN SEN SEN NaN NaN NaN NaN
523272 124896745325235371 ['DUB'] [DUB] 1 DUB DUB DUB NaN NaN NaN NaN
523273 124896745995906173 ['NCL'] [NCL] 1 NCL NCL NCL NaN NaN NaN NaN
523274 124896744612750419 ['MEX'] [MEX] 1 MEX MEX MEX NaN NaN NaN NaN

523275 rows × 11 columns

[7]:
df_airports.head()
[7]:
Airport Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Tz Type Source
0 1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081690 145.391998 5282 10 U Pacific/Port_Moresby airport OurAirports
1 2 Madang Airport Madang Papua New Guinea MAG AYMD -5.207080 145.789001 20 10 U Pacific/Port_Moresby airport OurAirports
2 3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826790 144.296005 5388 10 U Pacific/Port_Moresby airport OurAirports
3 4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10 U Pacific/Port_Moresby airport OurAirports
4 5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380 147.220001 146 10 U Pacific/Port_Moresby airport OurAirports
[8]:
df_airports.query("IATA == 'AMS'")
[8]:
Airport Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST Tz Type Source
574 580 Amsterdam Airport Schiphol Amsterdam Netherlands AMS EHAM 52.308601 4.76389 -11 1 E Europe/Amsterdam airport OurAirports
[9]:
def distance_to_schiphol(lat, lon):
    """"euclidean distance to hard-coded coords of Schiphol"""
    schiphol_coords = np.array([52.308601, 4.76389])
    dist = np.linalg.norm(np.array([lat, lon]) -  schiphol_coords)
    return dist
[10]:
%%time

df_final_destination_features = pd.merge(
    df_routes[["id", "final_destination"]],
    df_airports[["IATA", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "Type"]],
    how = "left",
    left_on = ["final_destination"],
    right_on = ["IATA"])

df_final_destination_features = df_final_destination_features \
    .assign(destination_distance = lambda d: d[["Latitude", "Longitude"]] \
                                .apply(lambda x: distance_to_schiphol(lat=x[0], lon=x[1]), axis=1)
           )
df_final_destination_features
Wall time: 26.3 s
[10]:
id final_destination IATA Country City Latitude Longitude Altitude DST Type destination_distance
0 123414478192901837 AMS AMS Netherlands Amsterdam 52.308601 4.763890 -11.0 E airport 4.338444e-07
1 123414481790516475 AMS AMS Netherlands Amsterdam 52.308601 4.763890 -11.0 E airport 4.338444e-07
2 123414478192901991 AMS AMS Netherlands Amsterdam 52.308601 4.763890 -11.0 E airport 4.338444e-07
3 123414481790510775 AMS AMS Netherlands Amsterdam 52.308601 4.763890 -11.0 E airport 4.338444e-07
4 123414479288269149 SPC SPC Spain Santa Cruz De La Palma 28.626499 -17.755600 107.0 E airport 3.267980e+01
... ... ... ... ... ... ... ... ... ... ... ...
523270 124896773782315507 MAN MAN United Kingdom Manchester 53.353699 -2.274950 257.0 E airport 7.116003e+00
523271 124896773782912169 SEN SEN United Kingdom Southend 51.571400 0.695556 49.0 E airport 4.134587e+00
523272 124896745325235371 DUB DUB Ireland Dublin 53.421299 -6.270070 242.0 E airport 1.108992e+01
523273 124896745995906173 NCL NCL United Kingdom Newcastle 55.037498 -1.691670 266.0 E airport 7.008647e+00
523274 124896744612750419 MEX MEX Mexico Mexico City 19.436300 -99.072098 7316.0 S airport 1.089151e+02

523275 rows × 11 columns

[11]:
# meta columns for utility for columns we will often merge by
output_columns = ["id", "final_destination", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "destination_distance"]

# DataFrame with id + merging columns + prediction target
df_output = df_final_destination_features[output_columns]
df_output.head()
[11]:
id final_destination Country City Latitude Longitude Altitude DST destination_distance
0 123414478192901837 AMS Netherlands Amsterdam 52.308601 4.76389 -11.0 E 4.338444e-07
1 123414481790516475 AMS Netherlands Amsterdam 52.308601 4.76389 -11.0 E 4.338444e-07
2 123414478192901991 AMS Netherlands Amsterdam 52.308601 4.76389 -11.0 E 4.338444e-07
3 123414481790510775 AMS Netherlands Amsterdam 52.308601 4.76389 -11.0 E 4.338444e-07
4 123414479288269149 SPC Spain Santa Cruz De La Palma 28.626499 -17.75560 107.0 E 3.267980e+01

Write output to CSV

Local or Google Storage is both handled.ipynb_checkpoints/

[12]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File:   processed_flights.csv

Overview of the output data

[13]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 523275 entries, 0 to 523274
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   id                    523275 non-null  int64
 1   final_destination     523275 non-null  object
 2   Country               522336 non-null  object
 3   City                  522336 non-null  object
 4   Latitude              522336 non-null  float64
 5   Longitude             522336 non-null  float64
 6   Altitude              522336 non-null  float64
 7   DST                   522336 non-null  object
 8   destination_distance  522336 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 39.9+ MB
[ ]: