Features from destinations¶
- Takes the flights data and airports data
- destinations are in a list in the flights data
- Clean the ‘destination’ of flights so that we list all intermediate destinations and first/last clearly
- Merge with airports data to include airport data in our model input
TODO:
- Clearly a waste of space with all the copied values for each airport ID, restructure pipeline so that isn’t required
- I imagine that the airports data is pretty static and we could expect a table on DataBricks
Due to time-constraints I use this solution so that we can keep merging/indexing only by id. In a more sophisticated approach we could merge downstream by other columns, such as destination on airport - therefore not having to include all duplicate features and save them to disk at this stage.
Parameters¶
Schiphol - output_file: Filepath to write output csv file with minimal modelling input |
| ### Returns |
Output CSV file with minimal model input
id | aircraftRegistration | airlineCode | terminal | serviceType | scheduleDateTime | actualOffBlockTime | scheduleDelaySeconds
124257473326719795 | PHEXI | 80.0 | 2.0 | J | 2018-05-01 16:35:00+02:00 | 2018-05-01 16:58:16+02:00 | 1396.0
124538476600837715 | PHEXL | 2481.0 | 1.0 | J | 2018-06-10 13:00:00+02:00 | 2018-06-10 13:11:25+02:00 | 685.0
123512829091050355 | PHBQO | 100.0 | 2.0 | J | 2018-01-15 10:15:00+01:00 | 2018-01-15 10:35:10+01:00 | 1210.0
123786805997701057 | PHEXG | 2481.0 | 1.0 | J | 2018-02-23 17:45:00+01:00 | 2018-02-23 17:55:52+01:00 | 652.0
124664922607744671 | PHBXP | 1551.0 | 2.0 | J | 2018-06-28 20:50:00+02:00 | 2018-06-28 22:09:23+02:00 | 4763.0
id | final_destination | Country | City | Latitude | Longitude | Altitude | DST | destination_distance 123414478192901837 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790516475 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414478192901991 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790510775 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414479288269149 | SPC | Spain | Santa Cruz … | 28.626499 | -17.75560 | 107.0 | E | 3.267980e+01
File parameters¶
[3]:
# input parameters cell
flights_file = "../lvt-schiphol-assignment-snakemake/data/raw/flights.csv"
airports_file = "../lvt-schiphol-assignment-snakemake/data/raw/airports.csv"
output_file = "processed_flights.csv"
Libraries¶
[4]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../")
from src.data.google_storage_io import read_csv_data, write_csv_data
Read data¶
[5]:
%%time
df_flights = read_csv_data(flights_file)
df_airports = read_csv_data(airports_file)
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/raw/flights.csv
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/raw/airports.csv
Wall time: 1.95 s
Destination features¶
- Route to destination in a column of lists
- Get final destination and number of destination
[15]:
df_flights
[15]:
| actualOffBlockTime | aircraftRegistration | aircraftType.iatamain | aircraftType.iatasub | airlineCode | baggageClaim | estimatedLandingTime | expectedTimeBoarding | expectedTimeGateClosing | expectedTimeGateOpen | ... | prefixICAO | publicEstimatedOffBlockTime | publicFlightState.flightStates | route.destinations | scheduleDate | scheduleTime | serviceType | terminal | transferPositions | transferPositions.transferPositions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | 148.0 | NaN | NaN | NaN | NaN | NaN | ... | ZXP | NaN | ['SCH'] | ['AMS'] | 2018-01-01 | 03:02:07 | P | NaN | NaN | NaN |
| 1 | NaN | PHPXY | AW1 | NaN | 148.0 | NaN | NaN | NaN | NaN | NaN | ... | ZXP | NaN | ['SCH'] | ['AMS'] | 2018-01-01 | 03:16:00 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | AW1 | NaN | 148.0 | NaN | NaN | NaN | NaN | NaN | ... | ZXP | NaN | ['SCH'] | ['AMS'] | 2018-01-01 | 03:16:29 | P | NaN | NaN | NaN |
| 3 | 2018-01-01T03:22:00.000+01:00 | PHPXB | NaN | NaN | 148.0 | NaN | NaN | NaN | NaN | NaN | ... | ZXP | NaN | ['DEP'] | ['AMS'] | 2018-01-01 | 03:30:00 | NaN | NaN | NaN | NaN |
| 4 | 2018-01-01T05:58:22.000+01:00 | PHHSJ | 73H | 73H | 164.0 | NaN | NaN | NaN | NaN | NaN | ... | TRA | NaN | ['DEP'] | ['SPC'] | 2018-01-01 | 06:00:00 | J | 1.0 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 523270 | NaN | NaN | 320 | 320 | 64.0 | NaN | NaN | 2018-07-31T21:20:00.000+02:00 | 2018-07-31T21:30:00.000+02:00 | NaN | ... | EZY | NaN | ['SCH'] | ['MAN'] | 2018-07-31 | 21:50:00 | J | NaN | NaN | NaN |
| 523271 | NaN | NaN | 319 | 319 | 64.0 | NaN | NaN | 2018-07-31T21:25:00.000+02:00 | 2018-07-31T21:35:00.000+02:00 | NaN | ... | EZY | NaN | ['SCH'] | ['SEN'] | 2018-07-31 | 21:55:00 | J | NaN | NaN | NaN |
| 523272 | NaN | NaN | 73H | 73H | 234.0 | NaN | NaN | NaN | NaN | NaN | ... | RYR | NaN | ['SCH'] | ['DUB'] | 2018-07-31 | 22:05:00 | J | NaN | NaN | NaN |
| 523273 | NaN | NaN | 73W | 73W | 100.0 | NaN | NaN | 2018-07-31T21:40:00.000+02:00 | 2018-07-31T21:55:00.000+02:00 | 2018-07-31T21:10:00.000+02:00 | ... | KLM | NaN | ['SCH'] | ['NCL'] | 2018-07-31 | 22:10:00 | J | 2.0 | NaN | NaN |
| 523274 | NaN | NaN | 787 | 788 | 794.0 | NaN | NaN | 2018-07-31T21:35:00.000+02:00 | 2018-07-31T22:05:00.000+02:00 | 2018-07-31T20:30:00.000+02:00 | ... | AMX | NaN | ['SCH'] | ['MEX'] | 2018-07-31 | 22:25:00 | J | NaN | NaN | NaN |
523275 rows × 28 columns
[6]:
%%time
# route destinations parsed as a list then calculate length and expand list to columns
df_routes = df_flights[["id", "route.destinations"]] \
.assign(route_list = lambda d: d["route.destinations"].apply(eval)) \
.assign(route_length = lambda d: d["route_list"].apply(len),
first_destination = lambda d: d["route_list"].apply(lambda x: x[0]),
final_destination = lambda d: d["route_list"].apply(lambda x: x[-1]))
# determine separate route output columns
max_route_length = df_routes["route_length"].max()
destination_columns = [f"destination_{i}" for i in range(max_route_length)]
# unlist routes into multiple columns
df_routes[destination_columns] = pd.DataFrame(df_routes["route_list"] \
.apply(lambda x: (x + [np.nan] * max_route_length)[:max_route_length]).tolist(),
index= df_routes.index)
df_routes
Wall time: 4.58 s
[6]:
| id | route.destinations | route_list | route_length | first_destination | final_destination | destination_0 | destination_1 | destination_2 | destination_3 | destination_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414478192901837 | ['AMS'] | [AMS] | 1 | AMS | AMS | AMS | NaN | NaN | NaN | NaN |
| 1 | 123414481790516475 | ['AMS'] | [AMS] | 1 | AMS | AMS | AMS | NaN | NaN | NaN | NaN |
| 2 | 123414478192901991 | ['AMS'] | [AMS] | 1 | AMS | AMS | AMS | NaN | NaN | NaN | NaN |
| 3 | 123414481790510775 | ['AMS'] | [AMS] | 1 | AMS | AMS | AMS | NaN | NaN | NaN | NaN |
| 4 | 123414479288269149 | ['SPC'] | [SPC] | 1 | SPC | SPC | SPC | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 523270 | 124896773782315507 | ['MAN'] | [MAN] | 1 | MAN | MAN | MAN | NaN | NaN | NaN | NaN |
| 523271 | 124896773782912169 | ['SEN'] | [SEN] | 1 | SEN | SEN | SEN | NaN | NaN | NaN | NaN |
| 523272 | 124896745325235371 | ['DUB'] | [DUB] | 1 | DUB | DUB | DUB | NaN | NaN | NaN | NaN |
| 523273 | 124896745995906173 | ['NCL'] | [NCL] | 1 | NCL | NCL | NCL | NaN | NaN | NaN | NaN |
| 523274 | 124896744612750419 | ['MEX'] | [MEX] | 1 | MEX | MEX | MEX | NaN | NaN | NaN | NaN |
523275 rows × 11 columns
[7]:
df_airports.head()
[7]:
| Airport | Name | City | Country | IATA | ICAO | Latitude | Longitude | Altitude | Timezone | DST | Tz | Type | Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Goroka Airport | Goroka | Papua New Guinea | GKA | AYGA | -6.081690 | 145.391998 | 5282 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
| 1 | 2 | Madang Airport | Madang | Papua New Guinea | MAG | AYMD | -5.207080 | 145.789001 | 20 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
| 2 | 3 | Mount Hagen Kagamuga Airport | Mount Hagen | Papua New Guinea | HGU | AYMH | -5.826790 | 144.296005 | 5388 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
| 3 | 4 | Nadzab Airport | Nadzab | Papua New Guinea | LAE | AYNZ | -6.569803 | 146.725977 | 239 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
| 4 | 5 | Port Moresby Jacksons International Airport | Port Moresby | Papua New Guinea | POM | AYPY | -9.443380 | 147.220001 | 146 | 10 | U | Pacific/Port_Moresby | airport | OurAirports |
[8]:
df_airports.query("IATA == 'AMS'")
[8]:
| Airport | Name | City | Country | IATA | ICAO | Latitude | Longitude | Altitude | Timezone | DST | Tz | Type | Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 574 | 580 | Amsterdam Airport Schiphol | Amsterdam | Netherlands | AMS | EHAM | 52.308601 | 4.76389 | -11 | 1 | E | Europe/Amsterdam | airport | OurAirports |
[9]:
def distance_to_schiphol(lat, lon):
""""euclidean distance to hard-coded coords of Schiphol"""
schiphol_coords = np.array([52.308601, 4.76389])
dist = np.linalg.norm(np.array([lat, lon]) - schiphol_coords)
return dist
[10]:
%%time
df_final_destination_features = pd.merge(
df_routes[["id", "final_destination"]],
df_airports[["IATA", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "Type"]],
how = "left",
left_on = ["final_destination"],
right_on = ["IATA"])
df_final_destination_features = df_final_destination_features \
.assign(destination_distance = lambda d: d[["Latitude", "Longitude"]] \
.apply(lambda x: distance_to_schiphol(lat=x[0], lon=x[1]), axis=1)
)
df_final_destination_features
Wall time: 26.3 s
[10]:
| id | final_destination | IATA | Country | City | Latitude | Longitude | Altitude | DST | Type | destination_distance | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414478192901837 | AMS | AMS | Netherlands | Amsterdam | 52.308601 | 4.763890 | -11.0 | E | airport | 4.338444e-07 |
| 1 | 123414481790516475 | AMS | AMS | Netherlands | Amsterdam | 52.308601 | 4.763890 | -11.0 | E | airport | 4.338444e-07 |
| 2 | 123414478192901991 | AMS | AMS | Netherlands | Amsterdam | 52.308601 | 4.763890 | -11.0 | E | airport | 4.338444e-07 |
| 3 | 123414481790510775 | AMS | AMS | Netherlands | Amsterdam | 52.308601 | 4.763890 | -11.0 | E | airport | 4.338444e-07 |
| 4 | 123414479288269149 | SPC | SPC | Spain | Santa Cruz De La Palma | 28.626499 | -17.755600 | 107.0 | E | airport | 3.267980e+01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 523270 | 124896773782315507 | MAN | MAN | United Kingdom | Manchester | 53.353699 | -2.274950 | 257.0 | E | airport | 7.116003e+00 |
| 523271 | 124896773782912169 | SEN | SEN | United Kingdom | Southend | 51.571400 | 0.695556 | 49.0 | E | airport | 4.134587e+00 |
| 523272 | 124896745325235371 | DUB | DUB | Ireland | Dublin | 53.421299 | -6.270070 | 242.0 | E | airport | 1.108992e+01 |
| 523273 | 124896745995906173 | NCL | NCL | United Kingdom | Newcastle | 55.037498 | -1.691670 | 266.0 | E | airport | 7.008647e+00 |
| 523274 | 124896744612750419 | MEX | MEX | Mexico | Mexico City | 19.436300 | -99.072098 | 7316.0 | S | airport | 1.089151e+02 |
523275 rows × 11 columns
[11]:
# meta columns for utility for columns we will often merge by
output_columns = ["id", "final_destination", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "destination_distance"]
# DataFrame with id + merging columns + prediction target
df_output = df_final_destination_features[output_columns]
df_output.head()
[11]:
| id | final_destination | Country | City | Latitude | Longitude | Altitude | DST | destination_distance | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414478192901837 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 |
| 1 | 123414481790516475 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 |
| 2 | 123414478192901991 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 |
| 3 | 123414481790510775 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 |
| 4 | 123414479288269149 | SPC | Spain | Santa Cruz De La Palma | 28.626499 | -17.75560 | 107.0 | E | 3.267980e+01 |
Write output to CSV¶
Local or Google Storage is both handled.ipynb_checkpoints/
[12]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File: processed_flights.csv
Overview of the output data¶
[13]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 523275 entries, 0 to 523274
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 523275 non-null int64
1 final_destination 523275 non-null object
2 Country 522336 non-null object
3 City 522336 non-null object
4 Latitude 522336 non-null float64
5 Longitude 522336 non-null float64
6 Altitude 522336 non-null float64
7 DST 522336 non-null object
8 destination_distance 522336 non-null float64
dtypes: float64(4), int64(1), object(4)
memory usage: 39.9+ MB
[ ]: