[1]:

%matplotlib inline

Merge base input and feature data¶

Takes the flights data
Processes the schedule/realized datetimes and computes the delay in seconds
Remove observations with unknown prediction targets
Write prediction target with minimal feature set to CSV

Parameters¶

base_file: Filepath of base model input with at least column ‘id’
features: List of feature files or a string of feature files separated by a ‘+’

Returns¶

Output CSV file with minimal model input

      id             |  aircraftRegistration   |  airlineCode   |  terminal   |  ...   |  year   | ...
123414481790510775   |         PHPXB           |     148.0     |     NaN      | ...    |  2018   | ...
123414479288269149   |         PHHSJ           |     164.0     |     1.0      | ...    |  2018   | ...
123414479666542945   |         PHHSG           |     100.0     |     1.0      | ...    |  2018   | ...
123414479288365061   |         PHHSG           |     164.0     |     1.0      | ...    |  2018   | ...
123414479288274329   |         PHHXB           |     164.0     |     1.0      | ...    |  2018   | ...

File parameters¶

[2]:

# input parameters cell
base_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
features = [
    "../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv",
    "../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv"
]


output_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv"

[3]:

if isinstance(features, str):
    features = features.split('+')
    print("Parsed features from string instead of List object")
    print(features)

[4]:

columns_to_ignore = [
    "scheduleDateTime", "scheduleDate", "scheduleTime", "actualOffBlockTime"
]

Libraries¶

[5]:

import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Read data¶

[6]:

%%time
df_base = read_csv_data(base_file)
df_base.head()

Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv

Wall time: 533 ms

[6]:

	id	aircraftRegistration	airlineCode	terminal	serviceType	scheduleDateTime	actualOffBlockTime	scheduleDelaySeconds
0	123414481790510775	PHPXB	148.0	NaN	NaN	2018-01-01 03:30:00+01:00	2018-01-01 03:22:00+01:00	-480.0
1	123414479288269149	PHHSJ	164.0	1.0	J	2018-01-01 06:00:00+01:00	2018-01-01 05:58:22+01:00	-98.0
2	123414479666542945	PHHSG	100.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0
3	123414479288365061	PHHSG	164.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0
4	123414479288274329	PHHXB	164.0	1.0	J	2018-01-01 06:15:00+01:00	2018-01-01 06:26:34+01:00	694.0

[7]:

%%time

# read feature data from multiple files and merge by 'id'
print(f"Reading features from first file: {features[0]}")
df_features = read_csv_data(features[0])

if len(features) > 0:
    for feature_file in features[1:]:
        print(f"Merging features from file: {feature_file}")
        old_shape = df_features.shape
        tmp_features = read_csv_data(feature_file)
        df_features = pd.merge(
            df_features,
            tmp_features,
            on="id",
            how="inner"
        )
        print(f"Merged features. Shape {old_shape} -> {df_features.shape}")
df_features.head()

Reading features from first file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv

Merging features from file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv

Merged features. Shape (523275, 9) -> (487716, 17)
Wall time: 1.23 s

[7]:

	id	final_destination	Country	City	Latitude	Longitude	Altitude	DST	destination_distance	quarter	month	year	dayofmonth	weekofyear	hour	minutes
0	123414481790510775	AMS	Netherlands	Amsterdam	52.308601	4.76389	-11.0	E	4.338444e-07	1	1	2018	1	1	3	30
1	123414479288269149	SPC	Spain	Santa Cruz De La Palma	28.626499	-17.75560	107.0	E	3.267980e+01	1	1	2018	1	1	6	0
2	123414479666542945	LPA	Spain	Gran Canaria	27.931900	-15.38660	78.0	E	3.162698e+01	1	1	2018	1	1	6	5
3	123414479288365061	LPA	Spain	Gran Canaria	27.931900	-15.38660	78.0	E	3.162698e+01	1	1	2018	1	1	6	5
4	123414479288274329	TLV	Israel	Tel-aviv	32.011398	34.88670	135.0	E	3.632300e+01	1	1	2018	1	1	6	15

Merge base model input with features¶

One large file to pass onto model notebooks

Downside: One large file with a lot of copied values

Upside: Easier to verify downstream model notebooks use the same data

[8]:

df_output = pd.merge(
    df_base,
    df_features,
    on="id",
    how="inner")
print(f"Data shape: {df_output.shape}")
df_output.head()

Data shape: (487714, 24)

[8]:

	id	aircraftRegistration	airlineCode	terminal	serviceType	scheduleDateTime	actualOffBlockTime	scheduleDelaySeconds	final_destination	Country	...	DST	destination_distance	quarter	month	year	dayofmonth	weekofyear	hour	minutes
0	123414481790510775	PHPXB	148.0	NaN	NaN	2018-01-01 03:30:00+01:00	2018-01-01 03:22:00+01:00	-480.0	AMS	Netherlands	...	E	4.338444e-07	1	1	2018	1	1	3	30
1	123414479288269149	PHHSJ	164.0	1.0	J	2018-01-01 06:00:00+01:00	2018-01-01 05:58:22+01:00	-98.0	SPC	Spain	...	E	3.267980e+01	1	1	2018	1	1	6	0
2	123414479666542945	PHHSG	100.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0	LPA	Spain	...	E	3.162698e+01	1	1	2018	1	1	6	5
3	123414479288365061	PHHSG	164.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0	LPA	Spain	...	E	3.162698e+01	1	1	2018	1	1	6	5
4	123414479288274329	PHHXB	164.0	1.0	J	2018-01-01 06:15:00+01:00	2018-01-01 06:26:34+01:00	694.0	TLV	Israel	...	E	3.632300e+01	1	1	2018	1	1	6	15

5 rows × 24 columns

Write output to CSV¶

Local or Google Storage is both handled

[9]:

# # write output file
write_csv_data(df_output, output_file, index=False)

Writing file to local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv

Overview of the output data¶

[10]:

df_output.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 487714 entries, 0 to 487713
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   id                    487714 non-null  int64
 1   aircraftRegistration  487711 non-null  object
 2   airlineCode           486501 non-null  float64
 3   terminal              477391 non-null  float64
 4   serviceType           482935 non-null  object
 5   scheduleDateTime      487714 non-null  object
 6   actualOffBlockTime    487714 non-null  object
 7   scheduleDelaySeconds  487714 non-null  float64
 8   final_destination     487701 non-null  object
 9   Country               486955 non-null  object
 10  City                  486955 non-null  object
 11  Latitude              486955 non-null  float64
 12  Longitude             486955 non-null  float64
 13  Altitude              486955 non-null  float64
 14  DST                   486955 non-null  object
 15  destination_distance  486955 non-null  float64
 16  dayofweek             487714 non-null  int64
 17  quarter               487714 non-null  int64
 18  month                 487714 non-null  int64
 19  year                  487714 non-null  int64
 20  dayofmonth            487714 non-null  int64
 21  weekofyear            487714 non-null  int64
 22  hour                  487714 non-null  int64
 23  minutes               487714 non-null  int64
dtypes: float64(7), int64(9), object(8)
memory usage: 93.0+ MB