Time features from DateTime data¶
year, week_number, day_of_week, etc..¶
This notebook will take a DataFrame with scheduleDateTime or optionally a different column
- Calculate time features from DateTime values
- Output pd.DataFrame with id columns + time feature columns
- Write output to CSV file
Parameters¶
- input_file: Filepath of flights data in format received from Schiphol
- output_file: Filepath to write output csv file with minimal modelling input
- dt_column: (Optional) Column with datetime values to create features from. Default
scheduleDateTime - id_column: (Optional) ID column to keep from input file. Default
id
Returns¶
Output format
id | scheduleDateTime | dayofweek | quarter | month | year | dayofmonth | weekofyear | hour | minutes
123414481790510775 | 2018-01-01 03:30:00+01:00 | 0 | 1 | 1 | 2018 | 1 | 1 | 3 | 30
123414479288269149 | 2018-01-01 06:00:00+01:00 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 0
123414479666542945 | 2018-01-01 06:05:00+01:00 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5
123414479288365061 | 2018-01-01 06:05:00+01:00 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5
123414479288274329 | 2018-01-01 06:15:00+01:00 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 15
[19]:
# parameters
input_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
output_file = "schedule_time_features.csv"
dt_column = "scheduleDateTime"
id_column = "id"
Imports¶
[1]:
import pandas as pd
import numpy as np
import sys
sys.path.append("../")
from src.data.google_storage_io import read_csv_data, write_csv_data
Load data¶
[31]:
%%time
df = read_csv_data(input_file)
df.head()
Reading file from local directory
File: ../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv
Wall time: 535 ms
[31]:
| id | aircraftRegistration | airlineCode | terminal | serviceType | scheduleDateTime | actualOffBlockTime | scheduleDelaySeconds | |
|---|---|---|---|---|---|---|---|---|
| 0 | 123414481790510775 | PHPXB | 148.0 | NaN | NaN | 2018-01-01 03:30:00+01:00 | 2018-01-01 03:22:00+01:00 | -480.0 |
| 1 | 123414479288269149 | PHHSJ | 164.0 | 1.0 | J | 2018-01-01 06:00:00+01:00 | 2018-01-01 05:58:22+01:00 | -98.0 |
| 2 | 123414479666542945 | PHHSG | 100.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 |
| 3 | 123414479288365061 | PHHSG | 164.0 | 1.0 | J | 2018-01-01 06:05:00+01:00 | 2018-01-01 06:00:00+01:00 | -300.0 |
| 4 | 123414479288274329 | PHHXB | 164.0 | 1.0 | J | 2018-01-01 06:15:00+01:00 | 2018-01-01 06:26:34+01:00 | 694.0 |
Create DataFrame with daily time features¶
[44]:
%%time
keep_columns = [
'dayofweek', 'quarter',
'month', 'year', 'dayofmonth',
'weekofyear'
]
def add_time_features(df,
date_column,
id_column = None,
keep_columns = [
'dayofweek', 'quarter',
'month', 'year', 'dayofmonth',
'weekofyear', 'hour', 'minutes']):
"""
Creates time series features from date column
Pass a dataframe with dates in `date_column`
The input dataframe is returned with an id column and the extracted
basic time features.
"""
original_columns = list(df.columns)
df[date_column] = pd.to_datetime(df[date_column], utc=True).dt.tz_convert("Europe/Amsterdam")
df = df.assign(
dayofweek = df[date_column].dt.dayofweek,
quarter = df[date_column].dt.quarter,
month = df[date_column].dt.month,
year = df[date_column].dt.year,
dayofyear = df[date_column].dt.dayofyear, # ignored
dayofmonth = df[date_column].dt.day,
weekofyear = df[date_column].dt.weekofyear,
hour = df[date_column].dt.hour,
minutes = df[date_column].dt.minute)
if id_column is None:
df = df.reset_index()
id_column = "index"
output_columns = [id_column] + keep_columns
return df[output_columns]
df_output = add_time_features(df, date_column = dt_column, id_column = id_column)
df_output.head()
Wall time: 448 ms
[44]:
| id | dayofweek | quarter | month | year | dayofmonth | weekofyear | hour | minutes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 123414481790510775 | 0 | 1 | 1 | 2018 | 1 | 1 | 3 | 30 |
| 1 | 123414479288269149 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 0 |
| 2 | 123414479666542945 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 3 | 123414479288365061 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 |
| 4 | 123414479288274329 | 0 | 1 | 1 | 2018 | 1 | 1 | 6 | 15 |
[45]:
df_output.describe(include='all')
[45]:
| id | dayofweek | quarter | month | year | dayofmonth | weekofyear | hour | minutes | |
|---|---|---|---|---|---|---|---|---|---|
| count | 4.877160e+05 | 487716.000000 | 487716.000000 | 487716.000000 | 487716.000000 | 487716.000000 | 487716.000000 | 487716.000000 | 487716.000000 |
| mean | 1.241177e+17 | 2.940383 | 1.637703 | 3.869715 | 2017.999996 | 14.799594 | 14.881474 | 13.650389 | 27.531055 |
| std | 3.891198e+14 | 1.994474 | 0.610135 | 1.855526 | 0.002025 | 8.904548 | 7.913090 | 4.649598 | 17.187699 |
| min | 1.234004e+17 | 0.000000 | 1.000000 | 1.000000 | 2017.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 1.237868e+17 | 1.000000 | 1.000000 | 2.000000 | 2018.000000 | 7.000000 | 8.000000 | 9.000000 | 15.000000 |
| 50% | 1.241310e+17 | 3.000000 | 2.000000 | 4.000000 | 2018.000000 | 14.000000 | 15.000000 | 13.000000 | 30.000000 |
| 75% | 1.244612e+17 | 5.000000 | 2.000000 | 5.000000 | 2018.000000 | 23.000000 | 22.000000 | 17.000000 | 40.000000 |
| max | 1.247644e+17 | 6.000000 | 4.000000 | 12.000000 | 2018.000000 | 31.000000 | 52.000000 | 23.000000 | 59.000000 |
Write output to CSV¶
Local or Google Storage is both handled
[46]:
# # write output file
write_csv_data(df_output, output_file, index=False)
Writing file to local directory
File: schedule_time_features.csv
Overview of the output data¶
[47]:
df_output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487716 entries, 0 to 487715
Data columns (total 9 columns):
id 487716 non-null int64
dayofweek 487716 non-null int64
quarter 487716 non-null int64
month 487716 non-null int64
year 487716 non-null int64
dayofmonth 487716 non-null int64
weekofyear 487716 non-null int64
hour 487716 non-null int64
minutes 487716 non-null int64
dtypes: int64(9)
memory usage: 33.5 MB
[ ]: