--- title: "Converting Text to Time Objects" author: "Hauke Sonnenberg" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Converting Text to Time Objects} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) print_times <- function(x) { old_pars <- par(mar = c(2.5, 4.5, 0.5, 0.5)) plot(x, seq_along(x), las = 1, xlab = "", ylab = "Index", ylim = c(length(x) + 0.4, 0.6)) par(old_pars) } ``` This tutorial describes how to convert text timestamps to `POSIXct` objects. The conversion becomes necessary when time differences are to be calculated. We consider two cases that differ in the way that the time recording device operates its clock: - Case 1: the clock switches between standard (winter) time and summer time. - Case 2: the clock does not switch between standard (winter) time and summer time but stays in standard time over the whole year. In both cases, using the base R function `as.POSIXct()` to convert text timestamps to `POSIXct` objects may lead to unintended results (as detailed below). This package contains a function `textToEuropeBerlinPosix()` that is a wrapper around `as.POSIXct()` but that is specialised on the two above cases. It returns `POSIXct` objects in time zone "Europe/Berlin" and can handle timestamps from both of the above cases as input. ## Creation of example timestamps We start by creating text timestamps as they could have been recorded by a measuring device. ```{r} # Define first and last timestamps from <- "2019-01-01" to <- "2020-01-01" step <- 3600 # seconds, i.e. one timestamp per hour # Case 1: local time, i.e. standard time in winter, summer time in summer tz <- "Europe/Berlin" times_local <- seq(as.POSIXct(from, tz), as.POSIXct(to, tz), by = step) # Case 2: always standard time (Note that GMT-1 refers to UTC+1!) tz <- "Etc/GMT-1" times_standard <- seq(as.POSIXct(from, tz), as.POSIXct(to, tz), by = step) ``` In both cases, all time differences are one (hour), as intended: ```{r} all(diff(times_local) == 1) all(diff(times_standard) == 1) ``` We use `format()` to simulate that the timestamps are recorded in a certain local format: ```{r} format <- "%d.%m.%y %H:%M" timestamps_local <- format(times_local, format = format) timestamps_standard <- format(times_standard, format = format) ``` Show the first three timestamps in each case to check whether the format was applied as intended: ```{r} writeLines(head(timestamps_local, 3)) writeLines(head(timestamps_standard, 3)) ``` Note that one of the local timestamps appears twice: ```{r} timestamps_local[duplicated(timestamps_local)] ``` In contrast, the standard timestamps are all unique: ```{r} any(duplicated(timestamps_standard)) ``` ## Case 1: Clock switches between standard time and summer time In this case the function `as.POSIXct()` may return unexpected results. These are caused by duplicates in the input vector of text timestamps. In the case considered in this chapter, duplicated timestamps are absolutely valid. They result from shifting time back from summer time to standard time. Let's see what happens if we call `as.POSIXct()` on the vector of local timestamps: ```{r} # Convert text timestamps to POSIXct objects times <- as.POSIXct(timestamps_local, tz = "Europe/Berlin", format = format) ``` ### Problem Even though time zone and time format were set correctly, the vector of returned `POSIXct` objects is corrupt. We can see the problem when looking at the time differences: ```{r} table(diff(times)) ``` Not all time differences are one hour (3600 seconds) anymore! Once there is no time difference, i.e. two adjacent times are identical, and once there is a time difference of two hours (7200 seconds), i.e. one hour is skipped. Read further down for the reason and for a function allowing to further investigate the problem. ### Solution Use the function `textToEuropeBerlinPOSIX()` to do the conversion. It interprets ambiguous timestamps correctly, provided that they are given in chronological order. See below for details. ```{r} # Convert text timestamps to POSIXct objects times_local_from_text <- kwb.datetime::textToEuropeBerlinPosix( timestamps_local, format = format ) ``` The created `POSIXct` objects are now identical to the original ones: ```{r} identical(times_local, times_local_from_text) ``` ## Case 2: Clock stays in standard time What happens in this case, when applying `as.POSIXct()` on the timestamps? ```{r} times <- as.POSIXct(timestamps_standard, tz = "Europe/Berlin", format = format) ``` ### Problem The vector `timestamps_standard` contains timestamps that do not exist in the time zone `Europe/Berlin` (for details, see below). Unfortunately, the function `as.POSIXct()` does not give a warning about this. We convince ourselves that something went wrong by looking at the frequencies of the time differences: ```{r} table(diff(times)) ``` Not all differences are one hour (3600 seconds) as expected. Once the time difference is zero, i.e. two adjacent times are identical, and once there is a time difference of two hours (7200 seconds), i.e. one hour is skipped. This is the behaviour on a Linux system, on Windows we would get a different, but also unintended result. ### Solution Use the function `textToEuropeBerlinPOSIX()` to do the conversion. Set `switches = FALSE` to tell the function that the timestamps were recorded by a clock that does not switch between standard time and summer time: ```{r} # Convert text timestamps to POSIXct objects times_standard_from_text <- kwb.datetime::textToEuropeBerlinPosix( timestamps_standard, format = format, switches = FALSE ) ``` The created `POSIXct` objects are now (almost) identical to the original ones. The only difference is the time zone attribute that is "Etc/GMT-1" in the original timestamps but "Europe/Berlin" in the timestamps provided by `textToEuropeBerlinPosix()`: ```{r} attr(times_standard, "tzone") attr(times_standard_from_text, "tzone") ``` We set the time zone of the original times to "Europe/Berlin" (that does not change the underlying time information!) in order to check for identity: ```{r} attr(times_standard, "tzone") <- "Europe/Berlin" identical(times_standard, times_standard_from_text) ``` See vignette ["Exkurs Zeitzonen"](timezones.html) (in German!) for details on this case. # Background and Details In R, time information are stored in objects of class `POSIXct`. The function `as.POSIXct()` is used to convert character strings representing points in time into the corresponding `POSIXct` objects. This chapter points out some important details about this function. Problems may arise with times recorded in time zones that change for Daylight Saving. This is the case for the time zone "Europe/Berlin" that is used in the following example. In this time zone, the time is given in Central European Time (CET) in winter and in Central European Summer Time (CEST) in summer. ## When does Summer Time start/end? Use the function `date_range_CEST()` to find out at what days summer time starts and ends, respectively: ```{r} kwb.datetime::date_range_CEST(2017:2019) ``` Note that this function returns character strings and not, for example, `Date` objects. ## Example: Reading local timestamps in Berlin, Germany Imagine a measuring device taking measurements every 30 minutes at some location in Berlin, Germany. The clock of the device is configured to switch between standard time and daylight saving time (summer time) and vice versa. In 2017, on October 29, when summer time is reverted back to standard time, the recorded timestamps around the time shift are: ```{r} # Define timestamps (character) timestamps <- c( "2017-10-29 01:30:00", # 1: CEST "2017-10-29 02:00:00", # 2: CEST "2017-10-29 02:30:00", # 3: CEST "2017-10-29 02:00:00", # 4: CET "2017-10-29 02:30:00", # 5: CET "2017-10-29 03:00:00" # 6: CET ) ``` The timestamps "02:00" and "02:30" appear twice, at indices 2 and 3, respectively, first and at indices 4 and 5, respectively, second. This is because at 03:00 (CEST) the clock is set back to 02:00 (CET). The first occurrences of the two timestamps refer to summer time (CEST) whereas the second occurrences refer to standard time (CET). ### What is the problem? What happens if we convert these timestamps to time objects? Using `as.POSIXct()` and the (correct) time zone "Europe/Berlin", we get the following vector of time objects: ```{r} # Convert timestamps to POSIXct and print them (times <- as.POSIXct(timestamps, tz = "Europe/Berlin")) ``` The function cannot distinguish between the first and second occurrences of the times 02:00 and 02:30. The output and the following plot reveal that the timestamps between 02:00 and 03:00 (exclusive) are always interpreted as being in summer time (CEST). ```{r fig.width = 5, fig.height = 2, echo = FALSE} print_times(times) ``` ### What is the solution (step by step)? How can we tell R that the first occurrences of 02:00 and 02:30 refer to CEST and the second ocurrences refer to CET? We could try the following: ```{r} as.POSIXct(tz = "Europe/Berlin", c( "2017-10-29 01:30:00 CEST", "2017-10-29 02:00:00 CEST", "2017-10-29 02:30:00 CEST", "2017-10-29 02:00:00 CET", "2017-10-29 02:30:00 CET", "2017-10-29 03:00:00 CET" )) ``` **Unfortunately, this does not work!** Again, everything between 02:00 and 03:00 (exclusive) is assumed to refer to CEST, as the output above indicates. However, R accepts a format in which the number of hours ahead of Coordinated Universal Time (UTC) is indicated in the timestamps. In our example this looks as follows: ```{r} iso_timestamps <- c( "2017-10-29 01:30:00+0200", "2017-10-29 02:00:00+0200", "2017-10-29 02:30:00+0200", "2017-10-29 02:00:00+0100", "2017-10-29 02:30:00+0100", "2017-10-29 03:00:00+0100" ) ``` Timestamps in CEST are two hours (and zero minutes) ahead of UTC. This is indicated in the timestamp by the ending *+0200*. Timestamps in CET are only one hour ahead of UTC and thus indicated by *+0100*. Timestamps given in this format can be converted to POSIXct objects by setting the `format` argument of the `as.POSIXct()` function to `"%F %T%z"`: ```{r} as.POSIXct(iso_timestamps, tz = "Europe/Berlin", format = "%F %T%z") ``` For the meaning of the placeholders %F, %T and %z, respectively, in the format string, see `?strftime`. The package kwb.datetime provides a function `isoToLocaltime()` that does the same: ```{r} kwb.datetime::isoToLocaltime(iso_timestamps) ``` In both cases, the timestamps are interpreted correctly, as also shown in this plot: ```{r fig.width = 5, fig.height = 2, echo = FALSE} out <- capture.output( print_times(kwb.datetime::isoToLocaltime(iso_timestamps)) ) ``` Unfortunately, the timestamps logged by measuring devices often do not contain the additional information on the UTC offset. For this case the package kwb.datetime provides functions that can be applied in a chain to perform a three step process: **Step 1:** Use the function `utcOffsetBerlinTime()` to determine the UTC offsets (for timestamps given in time zone "Europe/Berlin"): ```{r} # Guess and print the UTC offsets for the given timestamps (offsets <- kwb.datetime::utcOffsetBerlinTime(timestamps)) ``` This function requires the timestamps to be sorted in increasing order. Otherwise it cannot decide between CEST and CET for possibly unambiguous timestamps between 02:00 and 03:00 at the day of reverting time from CEST back to CET. **Step 2:** Use these offsets to create timestamps in full ISO 8601 format, i.e. ending in either *+0200* (when referring to CEST) or in *+0100* (when referring to CET): ```{r} # Create ISO 8601 timestamps and print them (iso_timestamps <- sprintf("%s%+03d00", timestamps, offsets)) ``` **Step 3:** Use the function `isoToLocaltime()` to convert these new timestamps from `character` into their corresponding `POSIXct` objects: ```{r} # Create POSIXct-objects in time zone "Europe/Berlin" and print them (kwb.datetime:::isoToLocaltime(iso_timestamps)) ``` ### What is the solution (one step)? The three steps presented above are performed within the function `textToEuropeBerlinPosix()` so that you can do the conversion of the original `timestamps` by calling: ```{r} kwb.datetime::textToEuropeBerlinPosix(timestamps) ``` ## Analyse a sequence of POSIXct objects The package contains a function `getEqualStepRanges()` that helps find unexpected changes in the time step within a vector of `POSIXct`. Applied to the example vector `times_local` from above, this function finds exactly one consistent sequence of times in which the time step is constantly one hour: ```{r} kwb.datetime::getEqualStepRanges(times_local) ``` As already shown above, using `as.POSIXct()` directly on the vector of text timestamps does not return the correct `times`. Using `getEqualStepRanges()` helps understand the problem. It shows four different sub-sequences within `times` in each of which the time step differs from the time step in the previous sub-sequence. ```{r fig.width = 6} # (Badly) Convert to POSIXct bad_times <- as.POSIXct(timestamps_local, tz = "Europe/Berlin", format = format) # Get information on the contained sub-sequences ranges <- kwb.datetime::getEqualStepRanges(bad_times) # Print the sub-sequences ranges # Plot the sub-sequences plot(ranges) ``` What happened? The timestamp `2019-10-27 02:00` appears twice! Once in Central European Summer Time (CEST) and once in Central European Time (CET). As shown above, using the function `textToEuropeBerlinPosix()` can solve the problem: ```{r} # Reformat the timestamps to ISO format iso <- kwb.datetime::reformatTimestamp(timestamps_local, format) # (Correctly) Convert to POSIXct good_times <- kwb.datetime::textToEuropeBerlinPosix(iso) # Show the contained sub-sequences (should be only one now!) kwb.datetime::getEqualStepRanges(good_times) ``` Check if the original date and time objects could be reproduced: ```{r} # Explicitly set the time zone before comparing attr(good_times, "tzone") <- "Europe/Berlin" identical(good_times, times_local) ```