In-Class Exercise 2

Published

January 15, 2024

1.0 Overview

In this exercise, we import a sample of the Grab Posisi GPS trajectory dataset from parquet format, convert the ping timestamps into datetimes with lubridate, and extract the origin and destination of each trip for later geospatial analysis.

2.0 Setup

2.1 Dependencies

  • arrow: reading and writing parquet files
  • lubridate: working with dates and datetimes
  • tidyverse: data import, wrangling, and visualisation
  • tmap: thematic mapping
  • sf: handling vector geospatial data

pacman::p_load() checks whether each package is installed, installs any that are missing, and loads them all:

pacman::p_load(arrow, lubridate, tidyverse, sf, tmap)

2.2 Importing Data

Read the first parquet file of the Grab Posisi dataset

# eval: false (show the code without evaluating it)
df <- read_parquet("data/GrabPosisi/part-00000-8bbff892-97d2-4011-9961-703e38972569.c000.snappy.parquet")

Import the Master Plan 2019 Planning Subzone boundary and project it to SVY21 (EPSG:3414), Singapore's projected coordinate system

mpsz2019 <- st_read("data/dataGov/MPSZ2019.kml") %>%
  st_transform(crs = 3414)
Reading layer `URA_MP19_SUBZONE_NO_SEA_PL' from data source 
  `/Users/matthewho/Work/Y3S2/IS415/Website/IS415/InClassEx/ICE2/data/dataGov/MPSZ2019.kml' 
  using driver `KML'
Simple feature collection with 332 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY, XYZ
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

2.3 Data Wrangling

Convert the ping timestamp to datetime

as_datetime() is a lubridate function that converts the raw ping timestamps, which appear to be stored as Unix epoch seconds, into proper datetimes. Replace the column with the converted version; lubridate makes this kind of datetime handling straightforward.

df$pingtimestamp <- as_datetime(df$pingtimestamp)
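To see what as_datetime() does on its own, here is a minimal sketch with hypothetical values (not the Grab data): a numeric input is interpreted as seconds since 1970-01-01 UTC, the Unix epoch.

```r
library(lubridate)

# as_datetime() treats numeric input as seconds since the Unix epoch
as_datetime(0)           # 1970-01-01 00:00:00 UTC
as_datetime(1554942036)  # a datetime in April 2019, like the pings above
```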

Save the imported data frame in rds format

write_rds(df, "data/rds/part0.rds")

Check the dataframe

head(df)
# A tibble: 6 × 9
  trj_id driving_mode osname  pingtimestamp       rawlat rawlng speed bearing
  <chr>  <chr>        <chr>   <dttm>               <dbl>  <dbl> <dbl>   <int>
1 70014  car          android 2019-04-11 00:40:36   1.34   104.  18.9     248
2 73573  car          android 2019-04-18 10:17:03   1.32   104.  17.7      44
3 75567  car          android 2019-04-13 07:37:06   1.33   104.  14.0      34
4 1410   car          android 2019-04-20 03:41:33   1.26   104.  13.0     181
5 4354   car          android 2019-04-18 10:48:17   1.28   104.  14.8      93
6 32630  car          android 2019-04-16 06:14:18   1.30   104.  23.2      73
# ℹ 1 more variable: accuracy <dbl>

Extract trip-start locations (Origins)

trj_id identifies a trip and is duplicated across all of that trip's pings. To extract each trip's origin:

  • group by trj_id, one group per trip
  • arrange by pingtimestamp, earliest ping first
  • filter to keep only the first row of each group, which is the origin
  • use mutate to create new columns for the weekday, starting hour, and day of the month

origin_df <- df %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>%
  filter(row_number()==1) %>%
  mutate(weekday = wday(pingtimestamp,
                        label=TRUE,
                        abbr=TRUE),
         start_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))
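The group-sort-filter pattern above can be seen on a toy example (hypothetical pings, not the Grab data): two trips whose pings arrive out of order, where keeping row_number() == 1 after sorting returns each trip's earliest ping.

```r
library(dplyr)

# Hypothetical pings: trips "A" and "B", timestamps out of order
pings <- tibble(
  trj_id        = c("A", "A", "B", "B"),
  pingtimestamp = as.POSIXct(c(200, 100, 50, 300),
                             origin = "1970-01-01", tz = "UTC")
)

# Keep the earliest ping per trip, i.e. the origin
origins <- pings %>%
  group_by(trj_id) %>%
  arrange(pingtimestamp) %>%
  filter(row_number() == 1) %>%
  ungroup()
```

Note that arrange() sorts the whole data frame, but filter(row_number() == 1) is evaluated per group, so each trip still contributes exactly its earliest ping.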

Extract trip-end locations (Destinations)

Same as above, but arrange by desc(pingtimestamp) so that the first row of each group is the latest ping, i.e. the trip's destination

destination_df <- df %>%
  group_by(trj_id) %>%
  arrange(desc(pingtimestamp)) %>%
  filter(row_number()==1) %>%
  mutate(weekday = wday(pingtimestamp,
                        label=TRUE,
                        abbr=TRUE),
         end_hr = factor(hour(pingtimestamp)),
         day = factor(mday(pingtimestamp)))

Important: Saving data, saving memory

The raw data frames have many entries and take up a lot of memory. Save the much smaller intermediate origin and destination data frames so later sessions can start from them instead.

# eval: false (show the code without evaluating it)
# echo: false (the code will not appear to the reader)
write_rds(destination_df, "data/rds/destination_df.rds")
write_rds(origin_df, "data/rds/origin_df.rds")

Reimporting data

The old files are no longer needed, but we keep them to show our process.

destination_df <- read_rds("data/rds/destination_df.rds")
origin_df <- read_rds("data/rds/origin_df.rds")
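With the origins and destinations saved, a natural next step (sketched here with hypothetical coordinates, not the Grab data) is converting the rawlng/rawlat columns into sf point geometries and projecting them to SVY21 (EPSG:3414) so they align with mpsz2019. In practice, origin_df or destination_df would be passed instead of this toy data frame.

```r
library(sf)

# Hypothetical coordinates for illustration only
toy <- data.frame(trj_id = c("A", "B"),
                  rawlng = c(103.85, 103.95),
                  rawlat = c(1.30,   1.35))

# rawlng/rawlat are WGS 84 (EPSG:4326) longitude/latitude;
# st_transform() reprojects the points to SVY21 (EPSG:3414)
origin_sf <- st_as_sf(toy, coords = c("rawlng", "rawlat"), crs = 4326) |>
  st_transform(crs = 3414)
```

The resulting point layer can then be overlaid on mpsz2019 with tmap, for example to count trip origins per planning subzone.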