[1]:
%matplotlib inline

Merge base input and feature data

  • Takes the flights data
  • Processes the schedule/realized datetimes and computes the delay in seconds
  • Remove observations with unknown prediction targets
  • Write prediction target with minimal feature set to CSV

Parameters


  • base_file: Filepath of base model input with at least column ‘id’
  • features: List of feature files or a string of feature files separated by a ‘+’

Returns


Output CSV file with minimal model input

      id             |  aircraftRegistration   |  airlineCode   |  terminal   |  ...   |  year   | ...
123414481790510775   |         PHPXB           |     148.0     |     NaN      | ...    |  2018   | ...
123414479288269149   |         PHHSJ           |     164.0     |     1.0      | ...    |  2018   | ...
123414479666542945   |         PHHSG           |     100.0     |     1.0      | ...    |  2018   | ...
123414479288365061   |         PHHSG           |     164.0     |     1.0      | ...    |  2018   | ...
123414479288274329   |         PHHXB           |     164.0     |     1.0      | ...    |  2018   | ...

File parameters

[2]:
# input parameters cell
base_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
features = [
    "../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv",
    "../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv"
]


output_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv"
[3]:
if isinstance(features, str):
    features = features.split('+')
    print("Parsed features from string instead of List object")
    print(features)
[4]:
columns_to_ignore = [
    "scheduleDateTime", "scheduleDate", "scheduleTime", "actualOffBlockTime"
]

Libraries

[5]:
import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Read data

[6]:
%%time
df_base = read_csv_data(base_file)
df_base.head()
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv

Wall time: 533 ms
[6]:
id aircraftRegistration airlineCode terminal serviceType scheduleDateTime actualOffBlockTime scheduleDelaySeconds
0 123414481790510775 PHPXB 148.0 NaN NaN 2018-01-01 03:30:00+01:00 2018-01-01 03:22:00+01:00 -480.0
1 123414479288269149 PHHSJ 164.0 1.0 J 2018-01-01 06:00:00+01:00 2018-01-01 05:58:22+01:00 -98.0
2 123414479666542945 PHHSG 100.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0
3 123414479288365061 PHHSG 164.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0
4 123414479288274329 PHHXB 164.0 1.0 J 2018-01-01 06:15:00+01:00 2018-01-01 06:26:34+01:00 694.0
[7]:
%%time

# read feature data from multiple files and merge by 'id'
print(f"Reading features from first file: {features[0]}")
df_features = read_csv_data(features[0])

if len(features) > 0:
    for feature_file in features[1:]:
        print(f"Merging features from file: {feature_file}")
        old_shape = df_features.shape
        tmp_features = read_csv_data(feature_file)
        df_features = pd.merge(
            df_features,
            tmp_features,
            on="id",
            how="inner"
        )
        print(f"Merged features. Shape {old_shape} -> {df_features.shape}")
df_features.head()
Reading features from first file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/features/route_destinations.csv

Merging features from file: ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/features/schedule_time_features.csv

Merged features. Shape (523275, 9) -> (487716, 17)
Wall time: 1.23 s
[7]:
id final_destination Country City Latitude Longitude Altitude DST destination_distance dayofweek quarter month year dayofmonth weekofyear hour minutes
0 123414481790510775 AMS Netherlands Amsterdam 52.308601 4.76389 -11.0 E 4.338444e-07 0 1 1 2018 1 1 3 30
1 123414479288269149 SPC Spain Santa Cruz De La Palma 28.626499 -17.75560 107.0 E 3.267980e+01 0 1 1 2018 1 1 6 0
2 123414479666542945 LPA Spain Gran Canaria 27.931900 -15.38660 78.0 E 3.162698e+01 0 1 1 2018 1 1 6 5
3 123414479288365061 LPA Spain Gran Canaria 27.931900 -15.38660 78.0 E 3.162698e+01 0 1 1 2018 1 1 6 5
4 123414479288274329 TLV Israel Tel-aviv 32.011398 34.88670 135.0 E 3.632300e+01 0 1 1 2018 1 1 6 15

Merge base model input with features

  • One large file to pass onto model notebooks

Downside: One large file with a lot of copied values

Upside: Easier to verify downstream model notebooks use the same data

[8]:
df_output = pd.merge(
    df_base,
    df_features,
    on="id",
    how="inner")
print(f"Data shape: {df_output.shape}")
df_output.head()
Data shape: (487714, 24)
[8]:
id aircraftRegistration airlineCode terminal serviceType scheduleDateTime actualOffBlockTime scheduleDelaySeconds final_destination Country ... DST destination_distance dayofweek quarter month year dayofmonth weekofyear hour minutes
0 123414481790510775 PHPXB 148.0 NaN NaN 2018-01-01 03:30:00+01:00 2018-01-01 03:22:00+01:00 -480.0 AMS Netherlands ... E 4.338444e-07 0 1 1 2018 1 1 3 30
1 123414479288269149 PHHSJ 164.0 1.0 J 2018-01-01 06:00:00+01:00 2018-01-01 05:58:22+01:00 -98.0 SPC Spain ... E 3.267980e+01 0 1 1 2018 1 1 6 0
2 123414479666542945 PHHSG 100.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0 LPA Spain ... E 3.162698e+01 0 1 1 2018 1 1 6 5
3 123414479288365061 PHHSG 164.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0 LPA Spain ... E 3.162698e+01 0 1 1 2018 1 1 6 5
4 123414479288274329 PHHXB 164.0 1.0 J 2018-01-01 06:15:00+01:00 2018-01-01 06:26:34+01:00 694.0 TLV Israel ... E 3.632300e+01 0 1 1 2018 1 1 6 15

5 rows × 24 columns

Write output to CSV

Local or Google Storage is both handled

[9]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_extended_input.csv

Overview of the output data

[10]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 487714 entries, 0 to 487713
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   id                    487714 non-null  int64
 1   aircraftRegistration  487711 non-null  object
 2   airlineCode           486501 non-null  float64
 3   terminal              477391 non-null  float64
 4   serviceType           482935 non-null  object
 5   scheduleDateTime      487714 non-null  object
 6   actualOffBlockTime    487714 non-null  object
 7   scheduleDelaySeconds  487714 non-null  float64
 8   final_destination     487701 non-null  object
 9   Country               486955 non-null  object
 10  City                  486955 non-null  object
 11  Latitude              486955 non-null  float64
 12  Longitude             486955 non-null  float64
 13  Altitude              486955 non-null  float64
 14  DST                   486955 non-null  object
 15  destination_distance  486955 non-null  float64
 16  dayofweek             487714 non-null  int64
 17  quarter               487714 non-null  int64
 18  month                 487714 non-null  int64
 19  year                  487714 non-null  int64
 20  dayofmonth            487714 non-null  int64
 21  weekofyear            487714 non-null  int64
 22  hour                  487714 non-null  int64
 23  minutes               487714 non-null  int64
dtypes: float64(7), int64(9), object(8)
memory usage: 93.0+ MB