Time features from DateTime data

year, week_number, day_of_week, etc..

This notebook will take a DataFrame with scheduleDateTime or optionally a different column

  1. Calculate time features from DateTime values
  2. Output pd.DataFrame with id columns + time feature columns
  3. Write output to CSV file

Parameters


  • input_file: Filepath of flights data in format received from Schiphol
  • output_file: Filepath to write output csv file with minimal modelling input
  • dt_column: (Optional) Column with datetime values to create features from. Default scheduleDateTime
  • id_column: (Optional) ID column to keep from input file. Default id

Returns


Output format

id                   |   scheduleDateTime           |  dayofweek   |  quarter   |  month   |  year  |   dayofmonth   |  weekofyear   |  hour   |  minutes
123414481790510775   |  2018-01-01 03:30:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   3     |  30
123414479288269149   |  2018-01-01 06:00:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  0
123414479666542945   |  2018-01-01 06:05:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  5
123414479288365061   |  2018-01-01 06:05:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  5
123414479288274329   |  2018-01-01 06:15:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  15
[19]:
# parameters
input_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
output_file = "schedule_time_features.csv"
dt_column = "scheduleDateTime"
id_column = "id"

Imports

[1]:
import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Load data

[31]:
%%time

df = read_csv_data(input_file)
df.head()
Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv

Wall time: 535 ms
[31]:
id aircraftRegistration airlineCode terminal serviceType scheduleDateTime actualOffBlockTime scheduleDelaySeconds
0 123414481790510775 PHPXB 148.0 NaN NaN 2018-01-01 03:30:00+01:00 2018-01-01 03:22:00+01:00 -480.0
1 123414479288269149 PHHSJ 164.0 1.0 J 2018-01-01 06:00:00+01:00 2018-01-01 05:58:22+01:00 -98.0
2 123414479666542945 PHHSG 100.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0
3 123414479288365061 PHHSG 164.0 1.0 J 2018-01-01 06:05:00+01:00 2018-01-01 06:00:00+01:00 -300.0
4 123414479288274329 PHHXB 164.0 1.0 J 2018-01-01 06:15:00+01:00 2018-01-01 06:26:34+01:00 694.0

Create DataFrame with daily time features

[44]:
%%time

keep_columns = [
        'dayofweek', 'quarter',
        'month', 'year', 'dayofmonth',
        'weekofyear'
    ]

def add_time_features(df,
                      date_column,
                      id_column = None,
                      keep_columns = [
                          'dayofweek', 'quarter',
                          'month', 'year', 'dayofmonth',
                          'weekofyear', 'hour', 'minutes']):
    """
    Creates time series features from date column

    Pass a dataframe with dates in `date_column`
    The input dataframe is returned with an id column and the  extracted
    basic time features.

    """
    original_columns = list(df.columns)

    df[date_column] = pd.to_datetime(df[date_column], utc=True).dt.tz_convert("Europe/Amsterdam")

    df = df.assign(
            dayofweek = df[date_column].dt.dayofweek,
            quarter = df[date_column].dt.quarter,
            month = df[date_column].dt.month,
            year = df[date_column].dt.year,
            dayofyear = df[date_column].dt.dayofyear, # ignored
            dayofmonth = df[date_column].dt.day,
            weekofyear = df[date_column].dt.weekofyear,
            hour = df[date_column].dt.hour,
            minutes = df[date_column].dt.minute)

    if id_column is None:
        df = df.reset_index()
        id_column = "index"

    output_columns = [id_column] + keep_columns

    return df[output_columns]


df_output = add_time_features(df, date_column = dt_column, id_column = id_column)
df_output.head()

Wall time: 448 ms
[44]:
id dayofweek quarter month year dayofmonth weekofyear hour minutes
0 123414481790510775 0 1 1 2018 1 1 3 30
1 123414479288269149 0 1 1 2018 1 1 6 0
2 123414479666542945 0 1 1 2018 1 1 6 5
3 123414479288365061 0 1 1 2018 1 1 6 5
4 123414479288274329 0 1 1 2018 1 1 6 15
[45]:
df_output.describe(include='all')
[45]:
id dayofweek quarter month year dayofmonth weekofyear hour minutes
count 4.877160e+05 487716.000000 487716.000000 487716.000000 487716.000000 487716.000000 487716.000000 487716.000000 487716.000000
mean 1.241177e+17 2.940383 1.637703 3.869715 2017.999996 14.799594 14.881474 13.650389 27.531055
std 3.891198e+14 1.994474 0.610135 1.855526 0.002025 8.904548 7.913090 4.649598 17.187699
min 1.234004e+17 0.000000 1.000000 1.000000 2017.000000 1.000000 1.000000 0.000000 0.000000
25% 1.237868e+17 1.000000 1.000000 2.000000 2018.000000 7.000000 8.000000 9.000000 15.000000
50% 1.241310e+17 3.000000 2.000000 4.000000 2018.000000 14.000000 15.000000 13.000000 30.000000
75% 1.244612e+17 5.000000 2.000000 5.000000 2018.000000 23.000000 22.000000 17.000000 40.000000
max 1.247644e+17 6.000000 4.000000 12.000000 2018.000000 31.000000 52.000000 23.000000 59.000000

Write output to CSV

Local or Google Storage is both handled

[46]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File:   schedule_time_features.csv

Overview of the output data

[47]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487716 entries, 0 to 487715
Data columns (total 9 columns):
id            487716 non-null int64
dayofweek     487716 non-null int64
quarter       487716 non-null int64
month         487716 non-null int64
year          487716 non-null int64
dayofmonth    487716 non-null int64
weekofyear    487716 non-null int64
hour          487716 non-null int64
minutes       487716 non-null int64
dtypes: int64(9)
memory usage: 33.5 MB
[ ]: