In-Class Exercise 2
In-class Exercise
In-class exercise 2 description
1.0 Overview
2.0 Setup
2.1 Dependencies
- arrow
- lubridate
- tidyverse
- tmap
- sf
pacman::p_load(arrow, lubridate, tidyverse, sf, tmap)
2.2 Importing Data
Read the first parquet file
# eval: FALSE - Show the code without evaluating
df <- read_parquet("data/GrabPosisi/part-00000-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet")
Import Master Plan 2019 Planning Subzone Boundary
mpsz2019 <- st_read("data/dataGov/MPSZ2019.kml") %>%
  st_transform(crs = 3414)
Reading layer `URA_MP19_SUBZONE_NO_SEA_PL' from data source
`/Users/matthewho/Work/Y3S2/IS415/Website/IS415/InClassEx/ICE2/data/dataGov/MPSZ2019.kml'
using driver `KML'
Simple feature collection with 332 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY, XYZ
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
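The message above is printed by st_read before the pipe reaches st_transform, which is why it still reports WGS 84. EPSG:3414 is SVY21 / Singapore TM, a projected system measured in metres. A minimal sketch of how the reprojection could be checked, using a made-up point rather than the subzone layer (the coordinates and object names below are illustrative assumptions):

```r
library(sf)

# Hypothetical point near central Singapore, given as longitude/latitude (WGS 84)
pt_wgs84 <- st_sfc(st_point(c(103.85, 1.29)), crs = 4326)

# Reproject to SVY21 / Singapore TM, the same EPSG code used above
pt_svy21 <- st_transform(pt_wgs84, crs = 3414)

st_crs(pt_svy21)$epsg      # EPSG code of the reprojected object
st_is_longlat(pt_svy21)    # FALSE once the coordinates are projected
st_coordinates(pt_svy21)   # easting/northing in metres instead of lng/lat
```

The same two checks, st_crs() and st_is_longlat(), should work on mpsz2019 after it has been read and transformed.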
2.3 Data Wrangling
Convert the ping timestamp to datetime
as_datetime() is a lubridate function; lubridate is a package for handling date-time data. Use it to replace the column with its converted version.
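As a quick illustration of what the conversion does, here is a sketch on a single value. It assumes the raw column stores Unix epoch seconds, which the text above does not state; raw_ts is a made-up value, not one read from the data.

```r
library(lubridate)

# Hypothetical raw value: seconds since 1970-01-01 00:00:00 UTC (an assumption)
raw_ts <- 1554943236

# Numeric input is interpreted as epoch seconds; the default time zone is UTC
ts <- as_datetime(raw_ts)
ts

# The same instant shown in Singapore local time, if the analysis needs it
with_tz(ts, "Asia/Singapore")
```

Because as_datetime defaults to UTC, any hour extracted later with hour() is a UTC hour unless the time zone is changed first.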
df$pingtimestamp <- as_datetime(df$pingtimestamp)
Write the parquet data to rds
write_rds(df, "data/rds/part0.rds")
Check the dataframe
head(df)
# A tibble: 6 × 9
trj_id driving_mode osname pingtimestamp rawlat rawlng speed bearing
<chr> <chr> <chr> <dttm> <dbl> <dbl> <dbl> <int>
1 70014 car android 2019-04-11 00:40:36 1.34 104. 18.9 248
2 73573 car android 2019-04-18 10:17:03 1.32 104. 17.7 44
3 75567 car android 2019-04-13 07:37:06 1.33 104. 14.0 34
4 1410 car android 2019-04-20 03:41:33 1.26 104. 13.0 181
5 4354 car android 2019-04-18 10:48:17 1.28 104. 14.8 93
6 32630 car android 2019-04-16 06:14:18 1.30 104. 23.2 73
# ℹ 1 more variable: accuracy <dbl>
Extract trip-start locations (Origins)
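trj_id identifies a trip, so it repeats across every ping of that trip. To extract each trip's origin:
- Group by trip (trj_id).
- Sort by timestamp.
- Keep the first row of each group, which is the origin.
- Use mutate to create new columns for the weekday, start hour, and day of month.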
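To see the pattern on a small scale before running it on the full data, the sketch below applies the same group, sort, and filter steps to a toy table. The trip IDs and timestamps are made up, and the final ungroup() is an optional addition that the exercise code does not use.

```r
library(dplyr)
library(lubridate)

# Toy pings: two hypothetical trips, deliberately out of time order
toy <- tibble(
  trj_id = c("a", "b", "a", "b"),
  pingtimestamp = as_datetime(c(
    "2019-04-11 08:05:00", "2019-04-12 23:10:00",
    "2019-04-11 08:00:00", "2019-04-12 23:30:00"
  ))
)

toy_origin <- toy %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>%        # earliest ping first
  filter(row_number() == 1) %>%     # keep the first row of each trip
  mutate(start_hr = factor(hour(pingtimestamp))) %>%
  ungroup()                         # optional: drop the grouping afterwards

toy_origin
```

Each trip keeps only its earliest ping, so the result has one row per trj_id.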
origin_df <- df %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>%        # earliest ping first
  filter(row_number() == 1) %>%     # first row of each trip is the origin
  mutate(weekday = wday(pingtimestamp,
                        label = TRUE,
                        abbr = TRUE),
         start_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))
Extract trip-end locations (Destinations)
Same as above, but sort with desc(pingtimestamp) so that the first row of each trip is its latest ping (the destination).
destination_df <- df %>%
  group_by(trj_id) %>%
  arrange(desc(pingtimestamp)) %>%  # latest ping first
  filter(row_number() == 1) %>%     # first row of each trip is the destination
  mutate(weekday = wday(pingtimestamp,
                        label = TRUE,
                        abbr = TRUE),
         end_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))
Important: Saving data, saving memory
The raw data frame is large and has many rows. Save the intermediate data frames to rds so that later steps can reload them without repeating the wrangling.
# #|eval: FALSE - Show the code without evaluating
# #|echo: false - Code will not appear to the reader
write_rds(destination_df, "data/rds/destination_df.rds")
write_rds(origin_df, "data/rds/origin_df.rds")
Reimporting data
The old files are no longer needed, but we can keep them to show our process.
destination_df <- read_rds("data/rds/destination_df.rds")
origin_df <- read_rds("data/rds/origin_df.rds")
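One reason to use rds for the intermediate files is that it keeps R column types such as factors and date-times intact. The sketch below checks this round trip on a toy table written to a temporary file; the table and path are illustrative and are not the files from this exercise.

```r
library(readr)
library(tibble)
library(lubridate)

# Toy table with the same kinds of columns as origin_df (values are made up)
toy <- tibble(
  trj_id = c("a", "b"),
  pingtimestamp = as_datetime(c("2019-04-11 08:00:00", "2019-04-12 23:10:00")),
  start_hr = factor(hour(pingtimestamp))
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
write_rds(toy, path)
toy_back <- read_rds(path)

all.equal(toy, toy_back)         # compares values and column types
class(toy_back$start_hr)         # the factor column is still a factor
class(toy_back$pingtimestamp)    # the date-time column is still POSIXct
```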