Time features from DateTime data¶

year, week_number, day_of_week, etc..¶

This notebook will take a DataFrame with scheduleDateTime or optionally a different column

Calculate time features from DateTime values
Output pd.DataFrame with id columns + time feature columns
Write output to CSV file

Parameters¶

input_file: Filepath of flights data in format received from Schiphol
output_file: Filepath to write output csv file with minimal modelling input
dt_column: (Optional) Column with datetime values to create features from. Default scheduleDateTime
id_column: (Optional) ID column to keep from input file. Default id

Returns¶

Output format

id                   |   scheduleDateTime           |  dayofweek   |  quarter   |  month   |  year  |   dayofmonth   |  weekofyear   |  hour   |  minutes
123414481790510775   |  2018-01-01 03:30:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   3     |  30
123414479288269149   |  2018-01-01 06:00:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  0
123414479666542945   |  2018-01-01 06:05:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  5
123414479288365061   |  2018-01-01 06:05:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  5
123414479288274329   |  2018-01-01 06:15:00+01:00   |      0       |      1     |    1     |  2018  |        1       |       1       |   6     |  15

[19]:

# parameters
input_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
output_file = "schedule_time_features.csv"
dt_column = "scheduleDateTime"
id_column = "id"

Imports¶

[1]:

import pandas as pd
import numpy as np

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

Load data¶

[31]:

%%time

df = read_csv_data(input_file)
df.head()

Reading file from local directory
File:   ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv

Wall time: 535 ms

[31]:

	id	aircraftRegistration	airlineCode	terminal	serviceType	scheduleDateTime	actualOffBlockTime	scheduleDelaySeconds
0	123414481790510775	PHPXB	148.0	NaN	NaN	2018-01-01 03:30:00+01:00	2018-01-01 03:22:00+01:00	-480.0
1	123414479288269149	PHHSJ	164.0	1.0	J	2018-01-01 06:00:00+01:00	2018-01-01 05:58:22+01:00	-98.0
2	123414479666542945	PHHSG	100.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0
3	123414479288365061	PHHSG	164.0	1.0	J	2018-01-01 06:05:00+01:00	2018-01-01 06:00:00+01:00	-300.0
4	123414479288274329	PHHXB	164.0	1.0	J	2018-01-01 06:15:00+01:00	2018-01-01 06:26:34+01:00	694.0

Create DataFrame with daily time features¶

[44]:

%%time

keep_columns = [
        'dayofweek', 'quarter',
        'month', 'year', 'dayofmonth',
        'weekofyear'
    ]

def add_time_features(df,
                      date_column,
                      id_column = None,
                      keep_columns = [
                          'dayofweek', 'quarter',
                          'month', 'year', 'dayofmonth',
                          'weekofyear', 'hour', 'minutes']):
    """
    Creates time series features from date column

    Pass a dataframe with dates in `date_column`
    The input dataframe is returned with an id column and the  extracted
    basic time features.

    """
    original_columns = list(df.columns)

    df[date_column] = pd.to_datetime(df[date_column], utc=True).dt.tz_convert("Europe/Amsterdam")

    df = df.assign(
            dayofweek = df[date_column].dt.dayofweek,
            quarter = df[date_column].dt.quarter,
            month = df[date_column].dt.month,
            year = df[date_column].dt.year,
            dayofyear = df[date_column].dt.dayofyear, # ignored
            dayofmonth = df[date_column].dt.day,
            weekofyear = df[date_column].dt.weekofyear,
            hour = df[date_column].dt.hour,
            minutes = df[date_column].dt.minute)

    if id_column is None:
        df = df.reset_index()
        id_column = "index"

    output_columns = [id_column] + keep_columns

    return df[output_columns]


df_output = add_time_features(df, date_column = dt_column, id_column = id_column)
df_output.head()

Wall time: 448 ms

[44]:

	id	quarter	month	year	dayofmonth	weekofyear	hour	minutes
0	123414481790510775	1	1	2018	1	1	3	30
1	123414479288269149	1	1	2018	1	1	6	0
2	123414479666542945	1	1	2018	1	1	6	5
3	123414479288365061	1	1	2018	1	1	6	5
4	123414479288274329	1	1	2018	1	1	6	15

[45]:

df_output.describe(include='all')

[45]:

	id	dayofweek	quarter	month	year	dayofmonth	weekofyear	hour	minutes
count	4.877160e+05	487716.000000	487716.000000	487716.000000	487716.000000	487716.000000	487716.000000	487716.000000	487716.000000
mean	1.241177e+17	2.940383	1.637703	3.869715	2017.999996	14.799594	14.881474	13.650389	27.531055
std	3.891198e+14	1.994474	0.610135	1.855526	0.002025	8.904548	7.913090	4.649598	17.187699
min	1.234004e+17	0.000000	1.000000	1.000000	2017.000000	1.000000	1.000000	0.000000	0.000000
25%	1.237868e+17	1.000000	1.000000	2.000000	2018.000000	7.000000	8.000000	9.000000	15.000000
50%	1.241310e+17	3.000000	2.000000	4.000000	2018.000000	14.000000	15.000000	13.000000	30.000000
75%	1.244612e+17	5.000000	2.000000	5.000000	2018.000000	23.000000	22.000000	17.000000	40.000000
max	1.247644e+17	6.000000	4.000000	12.000000	2018.000000	31.000000	52.000000	23.000000	59.000000

Write output to CSV¶

Local or Google Storage is both handled

[46]:

# # write output file
write_csv_data(df_output, output_file, index=False)

Writing file to local directory
File:   schedule_time_features.csv

Overview of the output data¶

[47]:

df_output.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487716 entries, 0 to 487715
Data columns (total 9 columns):
id            487716 non-null int64
dayofweek     487716 non-null int64
quarter       487716 non-null int64
month         487716 non-null int64
year          487716 non-null int64
dayofmonth    487716 non-null int64
weekofyear    487716 non-null int64
hour          487716 non-null int64
minutes       487716 non-null int64
dtypes: int64(9)
memory usage: 33.5 MB

[ ]: