[1]:
%matplotlib inline
Merge base input and feature data¶
- Takes the flights data
- Processes the schedule/realized datetimes and computes the delay in seconds
- Remove observations with unknown prediction targets
- Write prediction target with minimal feature set to CSV
Parameters¶
- base_file: Filepath of base model input with at least column ‘id’
- features: List of feature files or a string of feature files separated by a ‘+’
Returns¶
Output CSV file with minimal model input
id | aircraftRegistration | airlineCode | terminal | ... | year | ...
123414481790510775 | PHPXB | 148.0 | NaN | ... | 2018 | ...
123414479288269149 | PHHSJ | 164.0 | 1.0 | ... | 2018 | ...
123414479666542945 | PHHSG | 100.0 | 1.0 | ... | 2018 | ...
123414479288365061 | PHHSG | 164.0 | 1.0 | ... | 2018 | ...
123414479288274329 | PHHXB | 164.0 | 1.0 | ... | 2018 | ...
File parameters¶
[2]:
# input parameters cell
base_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
features = [
"../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv",
"../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv"
]
output_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv"
[3]:
if isinstance(features, str):
features = features.split('+')
print("Parsed features from string instead of List object")
print(features)
[4]:
columns_to_ignore = [
"scheduleDateTime", "scheduleDate", "scheduleTime", "actualOffBlockTime"
]
Libraries¶
[5]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../")
from src.data.google_storage_io import read_csv_data, write_csv_data
Read data¶
[6]:
%%time
df_base = read_csv_data(base_file)
df_base.head()
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv
Wall time: 533 ms
[6]:
| id | aircraftRegistration | airlineCode | terminal | serviceType | scheduleDateTime | actualOffBlockTime | scheduleDelaySeconds | |
|---|---|---|---|---|---|---|---|---|
| 0 | 123414481790510775 | PHPXB | 148.0 | NaN | NaN | 2018-01-01 03:30:00+01:00 | 2018-01-01 03:22:00+01:00 | -480.0 |
| 1 | 123414479288269149 | PHHSJ | 164.0 | 1.0 | J | 2018-01-01 06:00:00+01:00 | 2018-01-01 05:58:22+01:00 | -98.0 |
| 2 | 123414479666542945 | PHHSG | 100.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 |
| 3 | 123414479288365061 | PHHSG | 164.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 |
| 4 | 123414479288274329 | PHHXB | 164.0 | 1.0 | J | 2018-01-01 06:15:00+01:00 | 2018-01-01 06:26:34+01:00 | 694.0 |
[7]:
%%time
# read feature data from multiple files and merge by 'id'
print(f"Reading features from first file: {features[0]}")
df_features = read_csv_data(features[0])
if len(features) > 0:
for feature_file in features[1:]:
print(f"Merging features from file: {feature_file}")
old_shape = df_features.shape
tmp_features = read_csv_data(feature_file)
df_features = pd.merge(
df_features,
tmp_features,
on="id",
how="inner"
)
print(f"Merged features. Shape {old_shape} -> {df_features.shape}")
df_features.head()
Reading features from first file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv
Merging features from file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv
Merged features. Shape (523275, 9) -> (487716, 17)
Wall time: 1.23 s
[7]:
| id | final_destination | Country | City | Latitude | Longitude | Altitude | DST | destination_distance | dayofweek | quarter | month | year | dayofmonth | weekofyear | hour | minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414481790510775 | AMS | Netherlands | Amsterdam | 52.308601 | 4.76389 | -11.0 | E | 4.338444e-07 | 0 | 1 | 1 | 2018 | 1 | 1 | 3 | 30 |
| 1 | 123414479288269149 | SPC | Spain | Santa Cruz De La Palma | 28.626499 | -17.75560 | 107.0 | E | 3.267980e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 0 |
| 2 | 123414479666542945 | LPA | Spain | Gran Canaria | 27.931900 | -15.38660 | 78.0 | E | 3.162698e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 3 | 123414479288365061 | LPA | Spain | Gran Canaria | 27.931900 | -15.38660 | 78.0 | E | 3.162698e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 4 | 123414479288274329 | TLV | Israel | Tel-aviv | 32.011398 | 34.88670 | 135.0 | E | 3.632300e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 15 |
Merge base model input with features¶
- One large file to pass onto model notebooks
Downside: One large file with a lot of copied values
Upside: Easier to verify downstream model notebooks use the same data
[8]:
df_output = pd.merge(
df_base,
df_features,
on="id",
how="inner")
print(f"Data shape: {df_output.shape}")
df_output.head()
Data shape: (487714, 24)
[8]:
| id | aircraftRegistration | airlineCode | terminal | serviceType | scheduleDateTime | actualOffBlockTime | scheduleDelaySeconds | final_destination | Country | ... | DST | destination_distance | dayofweek | quarter | month | year | dayofmonth | weekofyear | hour | minutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414481790510775 | PHPXB | 148.0 | NaN | NaN | 2018-01-01 03:30:00+01:00 | 2018-01-01 03:22:00+01:00 | -480.0 | AMS | Netherlands | ... | E | 4.338444e-07 | 0 | 1 | 1 | 2018 | 1 | 1 | 3 | 30 |
| 1 | 123414479288269149 | PHHSJ | 164.0 | 1.0 | J | 2018-01-01 06:00:00+01:00 | 2018-01-01 05:58:22+01:00 | -98.0 | SPC | Spain | ... | E | 3.267980e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 0 |
| 2 | 123414479666542945 | PHHSG | 100.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 | LPA | Spain | ... | E | 3.162698e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 3 | 123414479288365061 | PHHSG | 164.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 | LPA | Spain | ... | E | 3.162698e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 4 | 123414479288274329 | PHHXB | 164.0 | 1.0 | J | 2018-01-01 06:15:00+01:00 | 2018-01-01 06:26:34+01:00 | 694.0 | TLV | Israel | ... | E | 3.632300e+01 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 15 |
5 rows × 24 columns
Write output to CSV¶
Local or Google Storage is both handled
[9]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File: ../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv
Overview of the output data¶
[10]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 487714 entries, 0 to 487713
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 487714 non-null int64
1 aircraftRegistration 487711 non-null object
2 airlineCode 486501 non-null float64
3 terminal 477391 non-null float64
4 serviceType 482935 non-null object
5 scheduleDateTime 487714 non-null object
6 actualOffBlockTime 487714 non-null object
7 scheduleDelaySeconds 487714 non-null float64
8 final_destination 487701 non-null object
9 Country 486955 non-null object
10 City 486955 non-null object
11 Latitude 486955 non-null float64
12 Longitude 486955 non-null float64
13 Altitude 486955 non-null float64
14 DST 486955 non-null object
15 destination_distance 486955 non-null float64
16 dayofweek 487714 non-null int64
17 quarter 487714 non-null int64
18 month 487714 non-null int64
19 year 487714 non-null int64
20 dayofmonth 487714 non-null int64
21 weekofyear 487714 non-null int64
22 hour 487714 non-null int64
23 minutes 487714 non-null int64
dtypes: float64(7), int64(9), object(8)
memory usage: 93.0+ MB