Features from destinations¶

Takes the flights data and airports data
destinations are in a list in the flights data
Clean the ‘destination’ of flights so that we list all intermediate destinations and first/last clearly
Merge with airports data to include airport data in our model input

TODO:

Clearly a waste of space with all the copied values for each airport ID, restructure pipeline so that isn’t required
I imagine that the airports data is pretty static and we could expect a table on DataBricks

Due to time-constraints I use this solution so that we can keep merging/indexing only by id. In a more sophisticated approach we could merge downstream by other columns, such as destination on airport - therefore not having to include all duplicate features and save them to disk at this stage.

Parameters¶

flights_file: Filepath of flights data in format received from Schiphol - airports_file: Filepath of airports data from

Schiphol - output_file: Filepath to write output csv file with minimal modelling input

### Returns

Output CSV file with minimal model input

id                   |  aircraftRegistration  |  airlineCode  |  terminal  |  serviceType  |      scheduleDateTime        |     actualOffBlockTime      |  scheduleDelaySeconds
124257473326719795   |    PHEXI               |     80.0      |     2.0    |       J       |  2018-05-01 16:35:00+02:00   |  2018-05-01 16:58:16+02:00  |         1396.0
124538476600837715   |    PHEXL               |     2481.0    |     1.0    |       J       |  2018-06-10 13:00:00+02:00   |  2018-06-10 13:11:25+02:00  |         685.0
123512829091050355   |    PHBQO               |     100.0     |     2.0    |       J       |  2018-01-15 10:15:00+01:00   |  2018-01-15 10:35:10+01:00  |         1210.0
123786805997701057   |    PHEXG               |     2481.0    |     1.0    |       J       |  2018-02-23 17:45:00+01:00   |  2018-02-23 17:55:52+01:00  |         652.0
124664922607744671   |    PHBXP               |     1551.0    |     2.0    |       J       |  2018-06-28 20:50:00+02:00   |  2018-06-28 22:09:23+02:00  |         4763.0

id | final_destination | Country | City | Latitude | Longitude | Altitude | DST | destination_distance 123414478192901837 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790516475 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414478192901991 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414481790510775 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 123414479288269149 | SPC | Spain | Santa Cruz … | 28.626499 | -17.75560 | 107.0 | E | 3.267980e+01

File parameters¶

[3]:

# input parameters cell
flights_file = "../lvt-schiphol-assignment-snakemake/data/raw/flights.csv"
airports_file = "../lvt-schiphol-assignment-snakemake/data/raw/airports.csv"
output_file = "processed_flights.csv"

Libraries¶

[4]:

import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Read data¶

[5]:

%%time
df_flights = read_csv_data(flights_file)
df_airports = read_csv_data(airports_file)

Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/raw/flights.csv

Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/raw/airports.csv

Wall time: 1.95 s

Destination features¶

Route to destination in a column of lists
Get final destination and number of destination

[15]:

df_flights

[15]:

	actualOffBlockTime	aircraftRegistration	aircraftType.iatamain	aircraftType.iatasub	airlineCode	baggageClaim	estimatedLandingTime	expectedTimeBoarding	expectedTimeGateClosing	expectedTimeGateOpen	...	prefixICAO	publicEstimatedOffBlockTime	publicFlightState.flightStates	route.destinations	scheduleDate	scheduleTime	serviceType	terminal	transferPositions	transferPositions.transferPositions
0	NaN	NaN	NaN	NaN	148.0	NaN	NaN	NaN	NaN	NaN	...	ZXP	NaN	['SCH']	['AMS']	2018-01-01	03:02:07	P	NaN	NaN	NaN
1	NaN	PHPXY	AW1	NaN	148.0	NaN	NaN	NaN	NaN	NaN	...	ZXP	NaN	['SCH']	['AMS']	2018-01-01	03:16:00	NaN	NaN	NaN	NaN
2	NaN	NaN	AW1	NaN	148.0	NaN	NaN	NaN	NaN	NaN	...	ZXP	NaN	['SCH']	['AMS']	2018-01-01	03:16:29	P	NaN	NaN	NaN
3	2018-01-01T03:22:00.000+01:00	PHPXB	NaN	NaN	148.0	NaN	NaN	NaN	NaN	NaN	...	ZXP	NaN	['DEP']	['AMS']	2018-01-01	03:30:00	NaN	NaN	NaN	NaN
4	2018-01-01T05:58:22.000+01:00	PHHSJ	73H	73H	164.0	NaN	NaN	NaN	NaN	NaN	...	TRA	NaN	['DEP']	['SPC']	2018-01-01	06:00:00	J	1.0	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
523270	NaN	NaN	320	320	64.0	NaN	NaN	2018-07-31T21:20:00.000+02:00	2018-07-31T21:30:00.000+02:00	NaN	...	EZY	NaN	['SCH']	['MAN']	2018-07-31	21:50:00	J	NaN	NaN	NaN
523271	NaN	NaN	319	319	64.0	NaN	NaN	2018-07-31T21:25:00.000+02:00	2018-07-31T21:35:00.000+02:00	NaN	...	EZY	NaN	['SCH']	['SEN']	2018-07-31	21:55:00	J	NaN	NaN	NaN
523272	NaN	NaN	73H	73H	234.0	NaN	NaN	NaN	NaN	NaN	...	RYR	NaN	['SCH']	['DUB']	2018-07-31	22:05:00	J	NaN	NaN	NaN
523273	NaN	NaN	73W	73W	100.0	NaN	NaN	2018-07-31T21:40:00.000+02:00	2018-07-31T21:55:00.000+02:00	2018-07-31T21:10:00.000+02:00	...	KLM	NaN	['SCH']	['NCL']	2018-07-31	22:10:00	J	2.0	NaN	NaN
523274	NaN	NaN	787	788	794.0	NaN	NaN	2018-07-31T21:35:00.000+02:00	2018-07-31T22:05:00.000+02:00	2018-07-31T20:30:00.000+02:00	...	AMX	NaN	['SCH']	['MEX']	2018-07-31	22:25:00	J	NaN	NaN	NaN

523275 rows × 28 columns

[6]:

%%time

# route destinations parsed as a list then calculate length and expand list to columns
df_routes = df_flights[["id", "route.destinations"]] \
    .assign(route_list = lambda d: d["route.destinations"].apply(eval)) \
    .assign(route_length = lambda d: d["route_list"].apply(len),
            first_destination = lambda d: d["route_list"].apply(lambda x: x[0]),
            final_destination = lambda d: d["route_list"].apply(lambda x: x[-1]))

# determine separate route output columns
max_route_length = df_routes["route_length"].max()
destination_columns = [f"destination_{i}" for i in range(max_route_length)]

# unlist routes into multiple columns
df_routes[destination_columns] =  pd.DataFrame(df_routes["route_list"] \
                                               .apply(lambda x: (x + [np.nan] * max_route_length)[:max_route_length]).tolist(),
                                               index= df_routes.index)
df_routes

Wall time: 4.58 s

[6]:

	id	route.destinations	route_list	route_length	first_destination	final_destination	destination_0	destination_1	destination_2	destination_3	destination_4
0	123414478192901837	['AMS']	[AMS]	1	AMS	AMS	AMS	NaN	NaN	NaN	NaN
1	123414481790516475	['AMS']	[AMS]	1	AMS	AMS	AMS	NaN	NaN	NaN	NaN
2	123414478192901991	['AMS']	[AMS]	1	AMS	AMS	AMS	NaN	NaN	NaN	NaN
3	123414481790510775	['AMS']	[AMS]	1	AMS	AMS	AMS	NaN	NaN	NaN	NaN
4	123414479288269149	['SPC']	[SPC]	1	SPC	SPC	SPC	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...
523270	124896773782315507	['MAN']	[MAN]	1	MAN	MAN	MAN	NaN	NaN	NaN	NaN
523271	124896773782912169	['SEN']	[SEN]	1	SEN	SEN	SEN	NaN	NaN	NaN	NaN
523272	124896745325235371	['DUB']	[DUB]	1	DUB	DUB	DUB	NaN	NaN	NaN	NaN
523273	124896745995906173	['NCL']	[NCL]	1	NCL	NCL	NCL	NaN	NaN	NaN	NaN
523274	124896744612750419	['MEX']	[MEX]	1	MEX	MEX	MEX	NaN	NaN	NaN	NaN

523275 rows × 11 columns

[7]:

df_airports.head()

[7]:

	Airport	Name	City	Country	IATA	ICAO	Latitude	Longitude	Altitude	Timezone	DST	Tz	Type	Source
0	1	Goroka Airport	Goroka	Papua New Guinea	GKA	AYGA	-6.081690	145.391998	5282	10	U	Pacific/Port_Moresby	airport	OurAirports
1	2	Madang Airport	Madang	Papua New Guinea	MAG	AYMD	-5.207080	145.789001	20	10	U	Pacific/Port_Moresby	airport	OurAirports
2	3	Mount Hagen Kagamuga Airport	Mount Hagen	Papua New Guinea	HGU	AYMH	-5.826790	144.296005	5388	10	U	Pacific/Port_Moresby	airport	OurAirports
3	4	Nadzab Airport	Nadzab	Papua New Guinea	LAE	AYNZ	-6.569803	146.725977	239	10	U	Pacific/Port_Moresby	airport	OurAirports
4	5	Port Moresby Jacksons International Airport	Port Moresby	Papua New Guinea	POM	AYPY	-9.443380	147.220001	146	10	U	Pacific/Port_Moresby	airport	OurAirports

[8]:

df_airports.query("IATA == 'AMS'")

[8]:

	Airport	Name	City	Country	IATA	ICAO	Latitude	Longitude	Altitude	Timezone	DST	Tz	Type	Source
574	580	Amsterdam Airport Schiphol	Amsterdam	Netherlands	AMS	EHAM	52.308601	4.76389	-11	1	E	Europe/Amsterdam	airport	OurAirports

[9]:

def distance_to_schiphol(lat, lon):
    """"euclidean distance to hard-coded coords of Schiphol"""
    schiphol_coords = np.array([52.308601, 4.76389])
    dist = np.linalg.norm(np.array([lat, lon]) -  schiphol_coords)
    return dist

[10]:

%%time

df_final_destination_features = pd.merge(
    df_routes[["id", "final_destination"]],
    df_airports[["IATA", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "Type"]],
    how = "left",
    left_on = ["final_destination"],
    right_on = ["IATA"])

df_final_destination_features = df_final_destination_features \
    .assign(destination_distance = lambda d: d[["Latitude", "Longitude"]] \
                                .apply(lambda x: distance_to_schiphol(lat=x[0], lon=x[1]), axis=1)
           )
df_final_destination_features

Wall time: 26.3 s

[10]:

	id	final_destination	IATA	Country	City	Latitude	Longitude	Altitude	DST	Type	destination_distance
0	123414478192901837	AMS	AMS	Netherlands	Amsterdam	52.308601	4.763890	-11.0	E	airport	4.338444e-07
1	123414481790516475	AMS	AMS	Netherlands	Amsterdam	52.308601	4.763890	-11.0	E	airport	4.338444e-07
2	123414478192901991	AMS	AMS	Netherlands	Amsterdam	52.308601	4.763890	-11.0	E	airport	4.338444e-07
3	123414481790510775	AMS	AMS	Netherlands	Amsterdam	52.308601	4.763890	-11.0	E	airport	4.338444e-07
4	123414479288269149	SPC	SPC	Spain	Santa Cruz De La Palma	28.626499	-17.755600	107.0	E	airport	3.267980e+01
...	...	...	...	...	...	...	...	...	...	...	...
523270	124896773782315507	MAN	MAN	United Kingdom	Manchester	53.353699	-2.274950	257.0	E	airport	7.116003e+00
523271	124896773782912169	SEN	SEN	United Kingdom	Southend	51.571400	0.695556	49.0	E	airport	4.134587e+00
523272	124896745325235371	DUB	DUB	Ireland	Dublin	53.421299	-6.270070	242.0	E	airport	1.108992e+01
523273	124896745995906173	NCL	NCL	United Kingdom	Newcastle	55.037498	-1.691670	266.0	E	airport	7.008647e+00
523274	124896744612750419	MEX	MEX	Mexico	Mexico City	19.436300	-99.072098	7316.0	S	airport	1.089151e+02

523275 rows × 11 columns

[11]:

# meta columns for utility for columns we will often merge by
output_columns = ["id", "final_destination", "Country", "City", "Latitude", "Longitude", "Altitude", "DST", "destination_distance"]

# DataFrame with id + merging columns + prediction target
df_output = df_final_destination_features[output_columns]
df_output.head()

[11]:

	id	final_destination	Country	City	Latitude	Longitude	Altitude	DST	destination_distance
0	123414478192901837	AMS	Netherlands	Amsterdam	52.308601	4.76389	-11.0	E	4.338444e-07
1	123414481790516475	AMS	Netherlands	Amsterdam	52.308601	4.76389	-11.0	E	4.338444e-07
2	123414478192901991	AMS	Netherlands	Amsterdam	52.308601	4.76389	-11.0	E	4.338444e-07
3	123414481790510775	AMS	Netherlands	Amsterdam	52.308601	4.76389	-11.0	E	4.338444e-07
4	123414479288269149	SPC	Spain	Santa Cruz De La Palma	28.626499	-17.75560	107.0	E	3.267980e+01

Write output to CSV¶

Local or Google Storage is both handled.ipynb_checkpoints/

[12]:

# # write output file
write_csv_data(df_output, output_file, index=False)

Writing file to local directory
File:   processed_flights.csv

Overview of the output data¶

[13]:

df_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 523275 entries, 0 to 523274
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   id                    523275 non-null  int64
 1   final_destination     523275 non-null  object
 2   Country               522336 non-null  object
 3   City                  522336 non-null  object
 4   Latitude              522336 non-null  float64
 5   Longitude             522336 non-null  float64
 6   Altitude              522336 non-null  float64
 7   DST                   522336 non-null  object
 8   destination_distance  522336 non-null  float64
dtypes: float64(4), int64(1), object(4)
memory usage: 39.9+ MB

[ ]: