Converting Text to Time Objects

This tutorial describes how to convert text timestamps to POSIXct objects. The conversion becomes necessary when time differences are to be calculated. We consider two cases that differ in the way that the time recording device operates its clock:

  • Case 1: the clock switches between standard (winter) time and summer time.
  • Case 2: the clock does not switch between standard (winter) time and summer time but stays in standard time over the whole year.

In both cases, using the base R function as.POSIXct() to convert text timestamps to POSIXct objects may lead to unintended results (as detailed below).

This package contains a function textToEuropeBerlinPosix() that is a wrapper around as.POSIXct() but that is specialised on the two above cases. It returns POSIXct objects in time zone “Europe/Berlin” and can handle timestamps from both of the above cases as input.

Creation of example timestamps

We start by creating text timestamps as they could have been recorded by a measuring device.

# Define first and last timestamps
from <- "2019-01-01"
to <- "2020-01-01"
step <- 3600 # seconds, i.e. one timestamp per hour

# Case 1: local time, i.e. standard time in winter, summer time in summer
tz <- "Europe/Berlin"
times_local <- seq(as.POSIXct(from, tz), as.POSIXct(to, tz), by = step)

# Case 2: always standard time (Note that GMT-1 refers to UTC+1!)
tz <- "Etc/GMT-1"
times_standard <- seq(as.POSIXct(from, tz), as.POSIXct(to, tz), by = step)

In both cases, all time differences are one (hour), as intended:

all(diff(times_local) == 1)
#> [1] TRUE
all(diff(times_standard) == 1)
#> [1] TRUE

We use format() to simulate that the timestamps are recorded in a certain local format:

format <- "%d.%m.%y %H:%M"
timestamps_local <- format(times_local, format = format)
timestamps_standard <- format(times_standard, format = format)

Show the first three timestamps in each case to check whether the format was applied as intended:

writeLines(head(timestamps_local, 3))
#> 01.01.19 00:00
#> 01.01.19 01:00
#> 01.01.19 02:00
writeLines(head(timestamps_standard, 3))
#> 01.01.19 00:00
#> 01.01.19 01:00
#> 01.01.19 02:00

Note that one of the local timestamps appears twice:

timestamps_local[duplicated(timestamps_local)]
#> [1] "27.10.19 02:00"

In contrast, the standard timestamps are all unique:

any(duplicated(timestamps_standard))
#> [1] FALSE

Case 1: Clock switches between standard time and summer time

In this case the function as.POSIXct() may return unexpected results. These are caused by duplicates in the input vector of text timestamps. In the case considered in this chapter, duplicated timestamps are absolutely valid. They result from shifting time back from summer time to standard time. Let’s see what happens if we call as.POSIXct() on the vector of local timestamps:

# Convert text timestamps to POSIXct objects
times <- as.POSIXct(timestamps_local, tz = "Europe/Berlin", format = format)

Problem

Even though time zone and time format were set correctly, the vector of returned POSIXct objects is corrupt. We can see the problem when looking at the time differences:

table(diff(times))
#> 
#>    0 3600 7200 
#>    1 8758    1

Not all time differences are one hour (3600 seconds) anymore! Once there is no time difference, i.e. two adjacent times are identical, and once there is a time difference of two hours (7200 seconds), i.e. one hour is skipped. Read further down for the reason and for a function allowing to further investigate the problem.

Solution

Use the function textToEuropeBerlinPOSIX() to do the conversion. It interprets ambiguous timestamps correctly, provided that they are given in chronological order. See below for details.

# Convert text timestamps to POSIXct objects
times_local_from_text <- kwb.datetime::textToEuropeBerlinPosix(
  timestamps_local, format = format
)
#> Converting 8761 timestamps to POSIXct ... ok. (0.09 secs)

The created POSIXct objects are now identical to the original ones:

identical(times_local, times_local_from_text)
#> [1] TRUE

Case 2: Clock stays in standard time

What happens in this case, when applying as.POSIXct() on the timestamps?

times <- as.POSIXct(timestamps_standard, tz = "Europe/Berlin", format = format)

Problem

The vector timestamps_standard contains timestamps that do not exist in the time zone Europe/Berlin (for details, see below). Unfortunately, the function as.POSIXct() does not give a warning about this. We convince ourselves that something went wrong by looking at the frequencies of the time differences:

table(diff(times))
#> 
#>    0 3600 7200 
#>    1 8758    1

Not all differences are one hour (3600 seconds) as expected. Once the time difference is zero, i.e. two adjacent times are identical, and once there is a time difference of two hours (7200 seconds), i.e. one hour is skipped. This is the behaviour on a Linux system, on Windows we would get a different, but also unintended result.

Solution

Use the function textToEuropeBerlinPOSIX() to do the conversion. Set switches = FALSE to tell the function that the timestamps were recorded by a clock that does not switch between standard time and summer time:

# Convert text timestamps to POSIXct objects
times_standard_from_text <- kwb.datetime::textToEuropeBerlinPosix(
  timestamps_standard, format = format, switches = FALSE
)
#> Converting 8761 timestamps to POSIXct ... ok. (0.02 secs)

The created POSIXct objects are now (almost) identical to the original ones. The only difference is the time zone attribute that is “Etc/GMT-1” in the original timestamps but “Europe/Berlin” in the timestamps provided by textToEuropeBerlinPosix():

attr(times_standard, "tzone")
#> [1] "Etc/GMT-1"
attr(times_standard_from_text, "tzone")
#> [1] "Europe/Berlin"

We set the time zone of the original times to “Europe/Berlin” (that does not change the underlying time information!) in order to check for identity:

attr(times_standard, "tzone") <- "Europe/Berlin"
identical(times_standard, times_standard_from_text)
#> [1] TRUE

See vignette “Exkurs Zeitzonen” (in German!) for details on this case.

Background and Details

In R, time information are stored in objects of class POSIXct. The function as.POSIXct() is used to convert character strings representing points in time into the corresponding POSIXct objects. This chapter points out some important details about this function.

Problems may arise with times recorded in time zones that change for Daylight Saving. This is the case for the time zone “Europe/Berlin” that is used in the following example. In this time zone, the time is given in Central European Time (CET) in winter and in Central European Summer Time (CEST) in summer.

When does Summer Time start/end?

Use the function date_range_CEST() to find out at what days summer time starts and ends, respectively:

kwb.datetime::date_range_CEST(2017:2019)
#>      begin        end         
#> 2017 "2017-03-26" "2017-10-29"
#> 2018 "2018-03-25" "2018-10-28"
#> 2019 "2019-03-31" "2019-10-27"

Note that this function returns character strings and not, for example, Date objects.

Example: Reading local timestamps in Berlin, Germany

Imagine a measuring device taking measurements every 30 minutes at some location in Berlin, Germany. The clock of the device is configured to switch between standard time and daylight saving time (summer time) and vice versa. In 2017, on October 29, when summer time is reverted back to standard time, the recorded timestamps around the time shift are:

# Define timestamps (character)
timestamps <- c(
  "2017-10-29 01:30:00", # 1: CEST
  "2017-10-29 02:00:00", # 2: CEST
  "2017-10-29 02:30:00", # 3: CEST
  "2017-10-29 02:00:00", # 4: CET
  "2017-10-29 02:30:00", # 5: CET
  "2017-10-29 03:00:00"  # 6: CET
)

The timestamps “02:00” and “02:30” appear twice, at indices 2 and 3, respectively, first and at indices 4 and 5, respectively, second. This is because at 03:00 (CEST) the clock is set back to 02:00 (CET). The first occurrences of the two timestamps refer to summer time (CEST) whereas the second occurrences refer to standard time (CET).

What is the problem?

What happens if we convert these timestamps to time objects? Using as.POSIXct() and the (correct) time zone “Europe/Berlin”, we get the following vector of time objects:

# Convert timestamps to POSIXct and print them
(times <- as.POSIXct(timestamps, tz = "Europe/Berlin"))
#> [1] "2017-10-29 01:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [3] "2017-10-29 02:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [5] "2017-10-29 02:30:00 CEST" "2017-10-29 03:00:00 CET"

The function cannot distinguish between the first and second occurrences of the times 02:00 and 02:30. The output and the following plot reveal that the timestamps between 02:00 and 03:00 (exclusive) are always interpreted as being in summer time (CEST).

What is the solution (step by step)?

How can we tell R that the first occurrences of 02:00 and 02:30 refer to CEST and the second ocurrences refer to CET? We could try the following:

as.POSIXct(tz = "Europe/Berlin", c(
  "2017-10-29 01:30:00 CEST",
  "2017-10-29 02:00:00 CEST",
  "2017-10-29 02:30:00 CEST",
  "2017-10-29 02:00:00 CET",
  "2017-10-29 02:30:00 CET",
  "2017-10-29 03:00:00 CET"
))
#> [1] "2017-10-29 01:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [3] "2017-10-29 02:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [5] "2017-10-29 02:30:00 CEST" "2017-10-29 03:00:00 CET"

Unfortunately, this does not work! Again, everything between 02:00 and 03:00 (exclusive) is assumed to refer to CEST, as the output above indicates. However, R accepts a format in which the number of hours ahead of Coordinated Universal Time (UTC) is indicated in the timestamps. In our example this looks as follows:

iso_timestamps <- c(
  "2017-10-29 01:30:00+0200",
  "2017-10-29 02:00:00+0200",
  "2017-10-29 02:30:00+0200",
  "2017-10-29 02:00:00+0100",
  "2017-10-29 02:30:00+0100",
  "2017-10-29 03:00:00+0100"
)

Timestamps in CEST are two hours (and zero minutes) ahead of UTC. This is indicated in the timestamp by the ending +0200. Timestamps in CET are only one hour ahead of UTC and thus indicated by +0100. Timestamps given in this format can be converted to POSIXct objects by setting the format argument of the as.POSIXct() function to "%F %T%z":

as.POSIXct(iso_timestamps, tz = "Europe/Berlin", format = "%F %T%z")
#> [1] "2017-10-29 01:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [3] "2017-10-29 02:30:00 CEST" "2017-10-29 02:00:00 CET" 
#> [5] "2017-10-29 02:30:00 CET"  "2017-10-29 03:00:00 CET"

For the meaning of the placeholders %F, %T and %z, respectively, in the format string, see ?strftime.

The package kwb.datetime provides a function isoToLocaltime() that does the same:

kwb.datetime::isoToLocaltime(iso_timestamps)
#> Converting 6 timestamps ... ok.
#> [1] "2017-10-28 23:30:00 UTC" "2017-10-29 00:00:00 UTC"
#> [3] "2017-10-29 00:30:00 UTC" "2017-10-29 01:00:00 UTC"
#> [5] "2017-10-29 01:30:00 UTC" "2017-10-29 02:00:00 UTC"

In both cases, the timestamps are interpreted correctly, as also shown in this plot:

Unfortunately, the timestamps logged by measuring devices often do not contain the additional information on the UTC offset. For this case the package kwb.datetime provides functions that can be applied in a chain to perform a three step process:

Step 1: Use the function utcOffsetBerlinTime() to determine the UTC offsets (for timestamps given in time zone “Europe/Berlin”):

# Guess and print the UTC offsets for the given timestamps
(offsets <- kwb.datetime::utcOffsetBerlinTime(timestamps))
#> [1] 2 2 2 1 1 1

This function requires the timestamps to be sorted in increasing order. Otherwise it cannot decide between CEST and CET for possibly unambiguous timestamps between 02:00 and 03:00 at the day of reverting time from CEST back to CET.

Step 2: Use these offsets to create timestamps in full ISO 8601 format, i.e. ending in either +0200 (when referring to CEST) or in +0100 (when referring to CET):

# Create ISO 8601 timestamps and print them
(iso_timestamps <- sprintf("%s%+03d00", timestamps, offsets))
#> [1] "2017-10-29 01:30:00+0200" "2017-10-29 02:00:00+0200"
#> [3] "2017-10-29 02:30:00+0200" "2017-10-29 02:00:00+0100"
#> [5] "2017-10-29 02:30:00+0100" "2017-10-29 03:00:00+0100"

Step 3: Use the function isoToLocaltime() to convert these new timestamps from character into their corresponding POSIXct objects:

# Create POSIXct-objects in time zone "Europe/Berlin" and print them
(kwb.datetime:::isoToLocaltime(iso_timestamps))
#> Converting 6 timestamps ... ok.
#> [1] "2017-10-28 23:30:00 UTC" "2017-10-29 00:00:00 UTC"
#> [3] "2017-10-29 00:30:00 UTC" "2017-10-29 01:00:00 UTC"
#> [5] "2017-10-29 01:30:00 UTC" "2017-10-29 02:00:00 UTC"

What is the solution (one step)?

The three steps presented above are performed within the function textToEuropeBerlinPosix() so that you can do the conversion of the original timestamps by calling:

kwb.datetime::textToEuropeBerlinPosix(timestamps)
#> Guessing time format ... ok. (0.00 secs) 
#> Converting 6 timestamps to POSIXct ... ok. (0.00 secs)
#> [1] "2017-10-29 01:30:00 CEST" "2017-10-29 02:00:00 CEST"
#> [3] "2017-10-29 02:30:00 CEST" "2017-10-29 02:00:00 CET" 
#> [5] "2017-10-29 02:30:00 CET"  "2017-10-29 03:00:00 CET"

Analyse a sequence of POSIXct objects

The package contains a function getEqualStepRanges() that helps find unexpected changes in the time step within a vector of POSIXct. Applied to the example vector times_local from above, this function finds exactly one consistent sequence of times in which the time step is constantly one hour:

kwb.datetime::getEqualStepRanges(times_local)
#>   from   to  from_time    to_time    step
#> 1    1 8761 2019-01-01 2020-01-01 1 hours

As already shown above, using as.POSIXct() directly on the vector of text timestamps does not return the correct times. Using getEqualStepRanges() helps understand the problem. It shows four different sub-sequences within times in each of which the time step differs from the time step in the previous sub-sequence.

# (Badly) Convert to POSIXct
bad_times <- as.POSIXct(timestamps_local, tz = "Europe/Berlin", format = format)

# Get information on the contained sub-sequences
ranges <- kwb.datetime::getEqualStepRanges(bad_times)

# Print the sub-sequences
ranges
#>   from   to           from_time             to_time      step
#> 1    1 7178 2019-01-01 00:00:00 2019-10-27 02:00:00 3600 secs
#> 2 7178 7179 2019-10-27 02:00:00 2019-10-27 02:00:00    0 secs
#> 3 7179 7180 2019-10-27 02:00:00 2019-10-27 03:00:00 7200 secs
#> 4 7180 8761 2019-10-27 03:00:00 2020-01-01 00:00:00 3600 secs

# Plot the sub-sequences
plot(ranges)

What happened? The timestamp 2019-10-27 02:00 appears twice! Once in Central European Summer Time (CEST) and once in Central European Time (CET). As shown above, using the function textToEuropeBerlinPosix() can solve the problem:

# Reformat the timestamps to ISO format
iso <- kwb.datetime::reformatTimestamp(timestamps_local, format)

# (Correctly) Convert to POSIXct
good_times <- kwb.datetime::textToEuropeBerlinPosix(iso)
#> Guessing time format ... ok. (0.00 secs) 
#> Converting 8761 timestamps to POSIXct ... ok. (0.09 secs)

# Show the contained sub-sequences (should be only one now!)
kwb.datetime::getEqualStepRanges(good_times)
#>   from   to  from_time    to_time    step
#> 1    1 8761 2019-01-01 2020-01-01 1 hours

Check if the original date and time objects could be reproduced:

# Explicitly set the time zone before comparing
attr(good_times, "tzone") <- "Europe/Berlin"

identical(good_times, times_local)
#> [1] TRUE