Load Data
We loaded three default data (1)-(3) of NMBS trains, stations, and
lines from Kaggle, and four additional data (4)-(7) of Belgium
statistical tracts, Belgium population data, zone station data, and
weather data from RIEM package. [1]
- Trains dataset (
trains
, trains_test
,
trains_train
, trains_trainTest
,
trains_trarinTrain
)
The training dataset of trains provides date, time, origin station
code (from
), destination station code (to
),
vehicle ID, and occupancy information shown at three different levels
(low, medium, high), from July 27, 2016 to October 29, 2016. The test
dataset of trains have the same columns as the training dataset, but
from October 29, 2016 to December 19, 2016. The test dataset does not
have occupancy level information.
- Stations dataset (
stations
)
The stations dataset contains names, locations (longitude and
latitude), and average stop times (avg_stop_times
). Since
NMBS operates trains in Belgium as well as France and the Netherlands,
we filtered the dataset to exclude non-Belgian stations (e.g. those in
the Netherlands, France, Luxembourg etc.). Additionally, we created
columns to calculate the distance between origin and destination
stations, the distance from Belgium to the origin and destination
stations, and the bearing between origin and destination stations.
- Lines dataset (
line_info
)
The lines dataset includes information about the vehicle ID, vehicle
type, the number of stops, and stopping station IDs. Since the
information in the train dataset is insufficient, we did not use
stopping station IDs in this analysis.
- Belgium statistical tracts dataset (
statBel
and
statBel_muni
)
The Belgium statistical tracts dataset contains the geometries of
each tract. We gained this dataset from Statbel, the Belgian statistical
office. We joined this dataset to calculate the population of Belgium by
municipality and intersect this dataset to the stations and the weather
stations.
- Belgium population dataset (
popBel
)
We assumed that the population of the origin and destination regions
would affect the occupancy level of the trains. We used the Belgium
population dataset to calculate the population of Belgium by
municipality and joined this dataset to the statistical tracts
dataset.
- Zone station dataset (
stationZone
)
The zone station dataset contains the zone information of each train
station. At first, we assumed the fare information may be one of the
independent variable so we added this information to the stations
dataset to calculate the fare. However, we could not find the opened
fare information of NMBS, so we did not use this information in the
analysis.
- Weather dataset (
weather_belgium
,
weatherstations_with_muni
)
We assumed that the weather conditions of the origin and destination
regions at specific dates and times would affect the occupancy level of
the trains. The dataset is from RIEM, which is developed by Iowa
Environment Mesonet. Since RIEM provides global coverage, we were able
to get temperature, wind, and relative humidity data from 12 weather
stations located in Belgium airports. Since the percipitation data of
Belgium was unavailable in RIEM, we used relative humidity as a
substitute variable to reflect general patterns related to
precipitation.
# Load (1) trains, (2) stations, and (3) SNCB lines data
line_info <- read.csv("./Data/line_info.csv")
trains_test <- read.csv("./Data/trains_test.csv")
trains_train <- read.csv("./Data/trains_train.csv")
stations <- read.csv("./Data/stations.csv")
#removing unknown stations and changing null vehicle to undefined
trains_test$vehicle[trains_test$vehicle == "(null)"] <- "undefined"
trains_train <- subset(trains_train, to != "000000000")
# Load (4) Belgium statistical tracts and (5) population data
statBel <- st_read("./Data/sh_statbel_statistical_sectors_3812_20240101.geojson") %>%
st_transform(4326)
popBel <- read_excel("./Data/TF_POP_STRUCT_SECTORS_2016.xlsx") # 2016 Belgium population data by tracts
# Load Zone station data to calculate fare
stationZone <- read_excel("./Data/station-zone.xlsx")
# Filter stations data to exclude non-Belgian stations (Netherlands, France, etc.)
stations <- stations %>%
filter(country.code=="be")
stations <- left_join(stations, stationZone, by = c("name" = "Station")) # Add zone information to stations)
# Stat tracts and population of Belgium by municipality
statBel_muni <- statBel %>%
group_by(cd_dstr_refnis) %>%
summarize(geometry = st_union(geometry)) %>%
ungroup()
statBel_muni <- statBel_muni %>%
left_join(
statBel %>% st_drop_geometry() %>% select(cd_dstr_refnis, tx_adm_dstr_descr_fr) %>% distinct(),
by = "cd_dstr_refnis"
)
popBel_muni <- popBel %>%
mutate(CD_REFNIS_N = as.character(as.numeric(substr(CD_REFNIS, 1, 2)) * 1000)) %>%
select(CD_REFNIS_N, POPULATION, TX_DESCR_FR) %>%
group_by(CD_REFNIS_N) %>%
summarize(POPULATION = sum(POPULATION))
statBel_muni <- left_join(statBel_muni, popBel_muni, by = c("cd_dstr_refnis" = "CD_REFNIS_N"))
# Clean up the URI column to extract the station IDs
stations <- stations %>%
mutate(station_id = gsub("http://irail.be/stations/NMBS/", "", URI))
# Convert stations dataset to sf object
stations_sf <- stations %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
Data Wrangling
We transformed raw data into a more usable format for analysis
through the following processs. First, we combined the training and test
datasets, which were initially separated, to manipulate the entire
dataset at once. After we finished the data wrangling, we split them
back into training and test sets.
Second, we recoded occupancy data from three levels (low, medium,
high) to two levels (low, high). This was done to focus more on
analyzing trends in low occupancy data, and to simplify regression
analysis process by using a binomial model instead of a multinomial
one.
[2]
Low: There are plenty of seats left.
Medium: It is hard to find a seat and it is difficult to sit
together.
High: There are no seats left and people have to stand
up
Low: There are plenty of seats left.
High: It is hard to find a seat and it is difficult to sit
together or there are no seats left and people have to stand
up.
Lastly, we created a panel dataset called trains
by
performing a left_join of internal and external datasets. This
incorporated both dependent and independent variables of the following
into one table. The detailed is as shown in Table 1.
Primary internal data: datetime
(date and time,
dbl), week
(week number, chr), dotw
(day of
week), interval60
(hourly interval), from_x
(origin station information, e.g. name, latitude, longitude etc.),
to_x
(destination station information, e.g. name, latitude,
longitude etc.), vehicle
(train number),
occupancy_original
(occupancy of three levels),
occupancy
(occupancy of two levels),
avg_stop_times
(average stop times of each line),
n_stops
(number of stops of each line)
Primary external data:
Administration sector: cd_dstr_refnis
(borough
number), tx_adm_dstr_descr_fr
(name of borough that origin
and destination stations are located)
Population: POPULATION
(population of borough that
origin and destination stations are located)
Weather: temp
, humid
, wind
(temperature, humidity, and wind speed of each hour of day at origin and
destination stations), nearest_weather_station
(the nearest
weather station location of origin and destination stations)
Event: event
(weekend and holiday information)
Event: 0 = Weekday, 1 = Weekend or holiday
Table 1. Variable Type and Description
Variable
|
Type
|
Description
|
from/to
|
<chr>
|
9-digit ID of each origin and destination station
|
vehicle
|
<chr>
|
ID of train
|
occupancy_original
|
<chr>
|
Train occupancy of three levels (low, medium, high)
|
occupancy
|
<chr>
|
Train occupancy of two levels (low, high)
|
interval60
|
<dttm>
|
Hourly interval
|
week
|
<dbl>
|
Week number
|
datetime
|
<dttm>
|
Date and time
|
dotw
|
<ord>
|
Day of week
|
from/to_station
|
<chr>
|
Name of each origin and destination station
|
from/to_avg_stop_times
|
<dbl>
|
The average number of vehicles stopping each day in this station
|
from/to_lng
|
<dbl>
|
Longitude of each origin and destination station
|
from/to_lat
|
<dbl>
|
Latitude of each origin and destination station
|
from/to_pop
|
<dbl>
|
Population of each origin and destination station
|
from/to_cd_dstr_refnis
|
<chr>
|
5-digit borough number that each origin and destination station is
located
|
from/to_tx_adm_dstr_descr_fr
|
<chr>
|
Name of a borough that each origin and destination station is located
|
from/to_nearest_weatherstation
|
<chr>
|
4-letter code of the nearest weather station from each origin and
destination train station
|
from/to_temp
|
<dbl>
|
Temperature of each origin and destination train station in Celsius
|
from/to_humid
|
<dbl>
|
Relative humidity of each origin and destination train station in
percentage
|
from/to_wind
|
<dbl>
|
Wind of each origin and destination train station in meters per second
|
vehicle_type
|
<chr>
|
4-letter code of vehicle type
|
nr_of_stops
|
<int>
|
Number of stops of this O-D pair
|
event
|
<dbl>
|
Weekend and holiday information (0 = Weekday, 1 = Weekend or holiday)
|
n_high
|
<int>
|
Frequency of high occupancy of this O-D pair during Jul-Oct 2016
|
n_low
|
<int>
|
Frequency of low occupancy of this O-D pair during Jul-Oct 2016
|
dist_from_O_to_Brussels
|
<dbl>
|
Distance from an origin station to Brussels in kilometers
|
dist_from_D_to_Brussels
|
<dbl>
|
Distance from a destination station to Brussels in kilometers
|
dist_from_O_to_D
|
<dbl>
|
Distance between an origin and a destination station in kilometers
|
bearing_from_O_to_D
|
<dbl>
|
Bearing angle from an origin to a destination station
|
bearing_from_O_to_D_cat
|
<chr>
|
Categorized bearing angle (N, S, E, W, NE, NW, SE, SW)
|
time_cat
|
<fct>
|
Categorized time by commute, morning, noon, evening, midnight
|
date_numeric
|
<dbl>
|
Numerized date
|
time_numeric
|
<dbl>
|
Numerized time
|
occupancy_numeric
|
<dbl>
|
Numerized occupancy (0 = high, 1 = low)
|
# Change station code from numeric to character and rbind train and test sets
trains_test$from <- as.character(trains_test$from)
trains_test$from <- paste0("00", trains_test$from)
trains_test$to <- as.character(trains_test$to)
trains_test$to <- paste0("00", trains_test$to)
trains_test$occupancy <- NA
# Rbind train and test sets
trains <- rbind(trains_train, trains_test)
#trains$occupancy <- factor(trains$occupancy, levels = c("low", "medium", "high"))
# Believe this causes regression error
# Convert date and time to POSIX datetime
trains$datetime <- as.POSIXct(paste(trains$date, trains$time), format = "%Y-%m-%d %I:%M:%S %p")
trains$week <- week(trains$datetime)
trains$dotw <- wday(trains$datetime, label = TRUE)
# Update vehicle ID
trains <- trains %>%
mutate(vehicle = case_when(
grepl("^\\d+$", vehicle) & vehicle %in% c("7006", "7966", "8011", "8015") ~ paste0("P", vehicle),
grepl("^\\d+$", vehicle) ~ paste0("IC", vehicle),
TRUE ~ vehicle
))
# Recode Low-medium-high to Low-high
trains$occupancy_original <- trains$occupancy
trains <- trains %>%
mutate(occupancy = recode(occupancy, "medium" = "high")) %>% # (1) Recode medium to high
# mutate(occupancy = ifelse(occupancy == "medium", NA_character_, occupancy)) %>%
# (2) Remove medium to high
mutate(interval60=ymd_h(substr(datetime, 1, 13))) %>%
mutate(week=week(interval60))
(1) Population and Statistical tract data
As of 2016, the population of Belgium is 11,267,910. In Belgium, 43
municipalities (CD_DSTR_REFNIS
), 589 Reference from
National Institute of Statistics Code (CD_REFNIS
), and
2,882 statistical sectors (CD_SECTOR
). We performed a left
join to add the municipality where the origin and destination stations
are located and their population, to the trains
dataset.
statBel_muni <- statBel_muni %>%
mutate(tx_adm_dstr_descr_fr_appr = str_extract(tx_adm_dstr_descr_fr,
"(?<=Arrondissement d’|Arrondissement de )\\b[\\wÀ-ÿ\\s-]+\\b"))
ggplot()+
geom_sf(data = statBel_muni, aes(fill = POPULATION), color = "white")+
geom_sf_text(
data = statBel_muni,
aes(label = tx_adm_dstr_descr_fr_appr),
size = 3,
color = "white"
) +
labs(
title = "Population of Belgium by Municipality",
subtitle = "2016",
caption="Source: StatBel"
) +
theme_void()+plotTheme+
theme(legend.position = "bottom")

stations_with_pop <- st_join(stations_sf, statBel_muni) %>%
st_drop_geometry()
stations_with_pop <- stations_with_pop %>%
bind_cols(as.data.frame(st_coordinates(stations_sf))) %>%
rename(longitude = X, latitude = Y)
# Add from and to station names and left join population into trains data
trains <- trains %>%
left_join(stations_with_pop, by = c("from" = "station_id"))
trains <- trains %>%
select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, name, avg_stop_times, longitude, latitude, cd_dstr_refnis, tx_adm_dstr_descr_fr, POPULATION) %>%
rename(from_station = name) %>%
rename(from_avg_stop_times = avg_stop_times) %>%
rename(from_lng = longitude) %>%
rename(from_lat = latitude) %>%
rename(from_pop = POPULATION) %>%
rename(from_cd_dstr_refnis = cd_dstr_refnis) %>%
rename(from_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)
trains <- trains %>%
left_join(stations_with_pop, by = c("to" = "station_id"))
trains <- trains %>%
select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, from_station, from_avg_stop_times, from_lng, from_lat, from_pop, from_cd_dstr_refnis, from_tx_adm_dstr_descr_fr, name, avg_stop_times, longitude, latitude, cd_dstr_refnis,tx_adm_dstr_descr_fr, POPULATION) %>%
rename(to_station = name) %>%
rename(to_avg_stop_times = avg_stop_times) %>%
rename(to_lng = longitude) %>%
rename(to_lat = latitude) %>%
rename(to_pop = POPULATION) %>%
rename(to_cd_dstr_refnis = cd_dstr_refnis)%>%
rename(to_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)
(2) Weather data
We assumed that the weather conditions of the origin and destination
regions would affect the occupancy level of the trains. We used the
weather data from RIEM package to calculate the temperature, relative
humidity, and wind speed of each hour of the day at the origin and
destination regions. Relative humidity was used as a substitute variable
and can reflect general patterns related to precipitation. Also, by
spatially intersection, we added a column that shows the nearest weather
station from each municipality. Finally, the nearest weather station and
the weather data from the nearest weather station was left joined to the
trains
dataset.
# Load weather data
weatherstations_sf <- riem_stations(network = "BE__ASOS") %>% # BE__ASOS = Belgium ASOS (Weather Stations)
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
weatherstations <- weatherstations_sf %>%
pull(id)
weather_belgium <- data.frame()
for (station in weatherstations) { # Loop through each station
weather_data <- riem_measures( # Fetch data for the current station
station = station,
date_start = "2016-07-27",
date_end = "2016-12-19"
) %>%
select(valid, tmpf, relh, sknt) %>% # Select relevant columns (timestamp, temperature by Fahrenheit, relative humidity in percentage, wind speed in knots)
mutate(station_id = station) # Add a column to identify the station
weather_belgium <- bind_rows(weather_belgium, weather_data) # Append the fetched data to the main data frame
}
weather_belgium <- weather_belgium %>%
mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>% # Extract the timestamp and convert it to a POSIXct object
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
group_by(interval60, station_id) %>%
summarize(Temp = (max(tmpf, na.rm = TRUE)-32)*5/9, # Temperature by Celsius
Humid = max(relh, na.rm = TRUE), # relative humidity in percentage
Wind = max(sknt, na.rm = TRUE)*0.51444) # Wind speed in meters per second
# Convert weather dataset to sf object
weatherstations_with_muni <- st_join(weatherstations_sf, statBel_muni) %>%
st_drop_geometry()
# Find nearest weather station for municipalities without a station
statBel_muni <- statBel_muni %>%
mutate(
weatherstation_within = st_intersects(geometry, weatherstations_sf, sparse = FALSE) %>%
apply(1, function(x) if (any(x)) weatherstations_sf$id[which(x)[1]] else NA) # Replace `id` with the column identifying stations
)
statBel_muni <- statBel_muni %>%
mutate(
nearest_weatherstation = ifelse(
is.na(weatherstation_within),
weatherstations_sf$id[st_nearest_feature(geometry, weatherstations_sf)], # Replace `id` with station identifier
weatherstation_within
)
)
statBel_muni <- statBel_muni %>% select(-weatherstation_within)
# left join weather station data into trains data
trains <- trains %>%
left_join(statBel_muni, by=c("from_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
rename(from_nearest_weatherstation = nearest_weatherstation) %>%
select(-geometry, -POPULATION)
trains <- trains %>%
left_join(statBel_muni, by=c("to_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
rename(to_nearest_weatherstation = nearest_weatherstation) %>%
select(-geometry, -POPULATION)
# left join weather data into trains data
trains <- trains %>%
left_join(weather_belgium, by = c("interval60" = "interval60", "from_nearest_weatherstation" = "station_id")) %>%
mutate(from_temp = Temp) %>%
mutate(from_humid = Humid) %>%
mutate(from_wind = Wind) %>%
select(-Temp, -Humid, -Wind)
trains <- trains %>%
left_join(weather_belgium, by = c("interval60" = "interval60", "to_nearest_weatherstation" = "station_id")) %>%
mutate(to_temp = Temp) %>%
mutate(to_humid = Humid) %>%
mutate(to_wind = Wind) %>%
select(-Temp, -Humid, -Wind)
trains <- trains %>%
mutate(from_humid = case_when(is.infinite(from_humid) ~ NA_real_,
TRUE ~ from_humid)) %>%
mutate(to_humid = case_when(is.infinite(to_humid) ~ NA_real_,
TRUE ~ to_humid))
The following map shows the administrative districts of Belgium and
the locations of weather stations.
ggplot()+
geom_sf(data=statBel_muni , fill="#a1b7b8", color="#333")+
geom_sf(data=weatherstations_sf, color="#df543b", size=1)+
geom_sf_text(data = statBel_muni ,
aes(label = tx_adm_dstr_descr_fr_appr),
size = 3, color = "black") +
labs(
title = "Weather Station Location in Belgium",
subtitle = "Weather data from RIEM package",
caption = "Source: StatBel, RIEM package by Maelle Salmon"
) +
theme_void()+plotTheme

The following plots show humidity, wind speed, and temperature data
by 12 weather stations in Belgium.
grid.arrange(top = "Weather DATA - Belgium - Jul - Dec 2016",
ggplot(weather_belgium, aes(interval60, Humid)) + geom_line(aes(color=station_id))+plotTheme,
ggplot(weather_belgium, aes(interval60, Wind)) + geom_line(aes(color=station_id))+plotTheme,
ggplot(weather_belgium, aes(interval60, Temp)) + geom_line(aes(color=station_id))+plotTheme)

(3) Vehicle data
Train types in Belgium is as the following.[3] In this analysis, we
removed “EUR”, “ICE”, “ICT”, “TGV”, “TRN”, and other undefined vehicle
code.
InterCity (IC): IC trains connect Belgium’s large cities. These
trains only stop at the biggest train stations and sometimes cross
international borders.
Peak (P): P trains run during peak travel times. They provide
additional alternatives when you’re travelling during busy periods. Most
of these trains run in the mornings and in late afternoon.
Local trains (L): L trains generally connect cities, but they also
stop at every station along the route.
S-trains (S): S trains are suburban trains that connect the city
centre with the surrounding communes. S trains stop in most stations
along the route.
EXTRA: Additional train services, used at exceptionally busy periods.
For example, these are the trains that travel towards the Belgian coast
on very sunny days.
T (Touristic): Additional train service to certain tourist
destinations.
EXP (Coast Express): Extra train service to the Belgian coast during
the summer period. International trains (INT = EC, THA, TGV, ICE, EST):
Regular international trains, namely Eurocity, Thalys, TGV, ICE and
Eurostar.
# Left join line_data into trains data
trains <- trains %>%
left_join(line_info, by = c("vehicle" = "vehicle_id"))
# Clean up vehicle type
trains <- trains %>%
mutate(
vehicle_type = ifelse(
is.na(vehicle_type) | vehicle_type == "",
gsub("[0-9]", "", vehicle),
vehicle_type
)
)
trains <- trains %>%
mutate(vehicle_type = case_when(
vehicle_type == "ic" & row_number() == min(which(vehicle_type == "ic")) ~ "IC",
TRUE ~ vehicle_type
))
trains <- trains %>%
mutate(vehicle = case_when(
vehicle == "ic2029" & row_number() == min(which(vehicle == "ic2029")) ~ "IC2029",
TRUE ~ vehicle
))
# Filter few vehicle_type
trains <- trains %>%
filter(!(vehicle_type %in% c("EUR", "ICE", "ICT", "TGV", "TRN", "THA", "undefined", "(null)")))
(4) Event data
# Create event data for weekends and holidays
trains <- trains %>%
mutate(event = case_when(dotw == "Sat" ~ 1,
dotw == "Sun" ~ 1,
ymd(interval60) >= "2016-07-28" & ymd(interval60) <= "2016-09-04" ~ 1,
ymd(interval60) == "2016-11-03" | ymd(interval60) == "2016-11-04" | ymd(interval60) == "2016-11-05" ~ 1,
ymd(interval60) == "2016-11-13" | ymd(interval60) == "2016-11-14" | ymd(interval60) == "2016-11-15" ~ 1,
TRUE ~ 0))
(5) Distance and Bearing Angle from Origin to Destination
# Calculate the distance from O to Brussels, from D to Brussels, and from O to D in kilometers
trains <- trains %>%
mutate(
dist_from_O_to_Brussels = pmap_dbl(
list(from_lng, from_lat),
~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
),
dist_from_D_to_Brussels = pmap_dbl(
list(to_lng, to_lat),
~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
),
dist_from_O_to_D = pmap_dbl(
list(from_lng, from_lat, to_lng, to_lat),
~ distHaversine(c(..1, ..2), c(..3, ..4)) / 1000
)
)
# Calculate the bearing angle from O to D
deg2rad <- function(deg) {
return(deg * pi / 180)
}
rad2deg <- function(rad) {
return(rad * 180 / pi)
}
calculate_bearing <- function(lat1, lon1, lat2, lon2) {
phi1 <- deg2rad(lat1)
phi2 <- deg2rad(lat2)
delta_lambda <- deg2rad(lon2 - lon1)
theta <- atan2(
sin(delta_lambda) * cos(phi2),
cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(delta_lambda)
)
bearing <- (rad2deg(theta) + 360) %% 360
return(bearing)
}
trains$bearing_from_O_to_D <- calculate_bearing(trains$from_lat, trains$from_lng, trains$to_lat, trains$to_lng)
trains$bearing_from_O_to_D_cat <- case_when(
trains$bearing_from_O_to_D >= 337.5 | trains$bearing_from_O_to_D < 22.5 ~ "N",
trains$bearing_from_O_to_D >= 22.5 & trains$bearing_from_O_to_D < 67.5 ~ "NE",
trains$bearing_from_O_to_D >= 67.5 & trains$bearing_from_O_to_D < 112.5 ~ "E",
trains$bearing_from_O_to_D >= 112.5 & trains$bearing_from_O_to_D < 157.5 ~ "SE",
trains$bearing_from_O_to_D >= 157.5 & trains$bearing_from_O_to_D < 202.5 ~ "S",
trains$bearing_from_O_to_D >= 202.5 & trains$bearing_from_O_to_D < 247.5 ~ "SW",
trains$bearing_from_O_to_D >= 247.5 & trains$bearing_from_O_to_D < 292.5 ~ "W",
trains$bearing_from_O_to_D >= 292.5 & trains$bearing_from_O_to_D < 337.5 ~ "NW"
)
(6) Time categorization
The following plot shows a distribution between high and low
occupancy levels by hour of the day. The train schedule was concentrated
at 06:00-08:59 and 16:00-19:59. This result corresponds to the typical
rush hour in Belgium, which is 08:00-17:00. Despite the large number of
trains in this time zone, the low occupancy rate was low at around 33%.
However, the low occupancy rate was about 62% at 10:00-14:59.
# Categorize time
time_category <- function(time) {
hour <- as.numeric(format(time, "%H"))
if ((hour >= 6 & hour < 9) | (hour >= 16 & hour < 20)) {
return("commute time")
} else if (hour >= 0 & hour < 6) {
return("morning")
} else if (hour >= 9 & hour < 16) {
return("noon")
} else {
return("evening")
}
}
trains <- trains %>%
mutate(time_cat = sapply(datetime, time_category))
trains$time_cat <- factor(trains$time_cat, levels = c( "morning", "commute time", "noon", "evening"))