| Title: | R Package with Functions for Scraping Data of Wasserportal Berlin |
|---|---|
| Description: | R Package with Functions for Scraping Data of Wasserportal Berlin (https://wasserportal.berlin.de), which contains real-time data of surface water and groundwater monitoring stations. |
| Authors: | Hauke Sonnenberg [aut] (ORCID: <https://orcid.org/0000-0001-9134-2871>), Michael Rustler [aut, cre] (ORCID: <https://orcid.org/0000-0003-0647-7726>), AD4GD [fnd], DWC [fnd], IMPETUS [fnd], PROMISCES [fnd], Kompetenzzentrum Wasser Berlin gGmbH (KWB) [cph] |
| Maintainer: | Michael Rustler <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.5.0 |
| Built: | 2026-06-06 06:25:47 UTC |
| Source: | https://github.com/KWB-R/wasserportal |
Helper function: base URL for download
base_url_download()
Base URL for downloading the CSV/ZIP files prepared by the R package
Create Text Labels from Data Frame Columns
columns_to_labels(data, columns, fmt = "%s: %s", sep = ", ")
data |
data frame |
columns |
names of columns from which to create labels |
fmt |
format string passed to |
sep |
separator (default: ", ") |
vector of character with as many elements as there are rows in data
data <- data.frame(number = 1:2, name = c("adam", "eva"), value = 3:4)
columns <- c("name", "value")
columns_to_labels(data, columns)
columns_to_labels(data, columns, fmt = "<p>%s: %s</p>", sep = "")
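To illustrate the idea behind the fmt and sep arguments, here is a minimal base R sketch of how such labels could be assembled. The function name labels_sketch is hypothetical, and the assumption that fmt receives the column name followed by the column value is ours; the package's actual implementation and output may differ.

```r
# Hypothetical sketch, not the package's implementation.
# Assumption: fmt is applied to (column name, column value) pairs.
labels_sketch <- function(data, columns, fmt = "%s: %s", sep = ", ") {
  # One character vector of "name: value" parts per selected column
  parts <- lapply(columns, function(column) {
    sprintf(fmt, column, as.character(data[[column]]))
  })
  # Paste the parts row-wise so that each row yields one label
  do.call(paste, c(parts, sep = sep))
}

data <- data.frame(number = 1:2, name = c("adam", "eva"), value = 3:4)
labels <- labels_sketch(data, c("name", "value"))
print(labels)
```

The result has one element per row of data, which matches the return value described above.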
The tables that appear in the API documentation of the wasserportal (https://wasserportal.berlin.de/download/wasserportal_berlin_getting_data.pdf) have been added to the wasserportal package. This function returns a list of data frames with each element representing one of these tables.
get_api_tables(name = NULL)
name |
name of the element to select from the list of data frames. If NULL (the default), the whole list of data frames is returned. |
list of data frames, or the single data frame specified by the name argument
get_api_tables()
Get Daily Surfacewater Data: wrapper to scrape daily surface water data
get_daily_surfacewater_data(
  stations,
  variables = get_surfacewater_variables(),
  list2df = FALSE
)
stations |
stations as retrieved by |
variables |
variables as retrieved by |
list2df |
convert result list to data frame (default: FALSE) |
list or data frame with all available data from Wasserportal
## Not run:
stations <- wasserportal::get_stations()
variables <- wasserportal::get_surfacewater_variables()
variables
sw_data_daily <- wasserportal::get_daily_surfacewater_data(stations, variables)
## End(Not run)
Wrapper function to scrape all available raw groundwater data, i.e. groundwater level and quality data, and save them in a list
get_groundwater_data(
  stations,
  groundwater_options = get_groundwater_options(),
  debug = TRUE,
  stations_list = NULL
)
stations |
list as retrieved by |
groundwater_options |
as retrieved by |
debug |
print debug messages (default: TRUE) |
stations_list |
list of station metadata as returned by
|
list with elements "groundwater.level" and "groundwater.quality" data frames
## Not run:
stations <- wasserportal::get_stations()
gw_data_list <- get_groundwater_data(stations)
str(gw_data_list)
## End(Not run)
Helper function: get groundwater options
get_groundwater_options()
Returns the available groundwater data options, prepared for use
as input to get_groundwater_data
get_groundwater_options()
Wasserportal Berlin: get overview options for stations
get_overview_options()
list with shortcuts to station overview tables
(wasserportal.berlin.de/messwerte.php?anzeige=tabelle&thema=<shortcut>)
get_overview_options()
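The URL pattern above can be sketched as a small helper that inserts a shortcut into the overview table address. The function name overview_url_sketch and the shortcut value "gws" are assumptions made for illustration; the shortcuts that are actually valid are those returned by get_overview_options().

```r
# Hypothetical helper: builds the overview table URL for a given shortcut.
# Assumption: "gws" is used only as an example value for <shortcut>.
overview_url_sketch <- function(shortcut, base_url = "https://wasserportal.berlin.de") {
  sprintf("%s/messwerte.php?anzeige=tabelle&thema=%s", base_url, shortcut)
}

url <- overview_url_sketch("gws")
print(url)
```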
Helper function: get available station variables
get_station_variables(station_df)
station_df |
data frame with one row per station and columns "Messstellennummer", "Messstellenname" and additional columns each of which represents a variable that is measured at that station. If the variable columns contain the value "x" it means that the corresponding variable is measured and the name of the column is contained in the returned vector of variable names. |
returns names of available variables for station
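The selection rule described for station_df can be sketched in base R as follows. The function name station_variables_sketch and the variable column names "ws", "df" and "wt" are hypothetical; only the columns "Messstellennummer" and "Messstellenname" and the "x" marker are taken from the description above.

```r
# Hypothetical sketch of the rule: a column counts as an available
# variable if it is not an identifier column and contains the value "x".
station_variables_sketch <- function(station_df) {
  id_columns <- c("Messstellennummer", "Messstellenname")
  variable_columns <- setdiff(names(station_df), id_columns)
  is_measured <- vapply(variable_columns, function(column) {
    any(station_df[[column]] == "x", na.rm = TRUE)
  }, logical(1))
  variable_columns[is_measured]
}

# Made-up example row: variables "ws" and "wt" are measured, "df" is not
station_df <- data.frame(
  Messstellennummer = "123",
  Messstellenname = "Example",
  ws = "x",
  df = NA_character_,
  wt = "x"
)
available <- station_variables_sketch(station_df)
print(available)
```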
Get Stations
get_stations(
  type = c("list", "data.frame", "crosstable"),
  run_parallel = TRUE,
  n_cores = parallel::detectCores() - 1L,
  debug = TRUE
)
type |
vector of character describing the type(s) of output(s) to be
returned. Expected values (and default): |
run_parallel |
default: TRUE |
n_cores |
number of cores to use if |
debug |
logical indicating whether or not to show debug messages |
list with general station "overview" (either as list "overview_list" or as data.frame "overview_df") and a crosstable with information on which parameters are available per station ("x" if available, NA if not)
stations <- wasserportal::get_stations(n_cores = 2L)
str(stations)
Get Surface Water Quality for Multiple Monitoring Stations
get_surfacewater_qualities(station_ids, dbg = TRUE)
station_ids |
vector with ids of multiple (or one) monitoring stations |
dbg |
print debug messages (default: TRUE) |
data frame with water quality data for multiple monitoring stations
## Not run:
stations <- wasserportal::get_stations()
station_ids <- stations$overview_list$surface_water.quality$Messstellennummer
swq <- wasserportal::get_surfacewater_qualities(station_ids)
str(swq)
## End(Not run)
Get Surface Water Quality for One Monitoring Station
get_surfacewater_quality(station_id)
station_id |
id of surface water measurement station |
data frame with water quality data for one monitoring station
## Not run:
stations <- wasserportal::get_stations()
station_id <- stations$overview_list$surface_water.quality$Messstellennummer[1]
swq <- wasserportal::get_surfacewater_quality(station_id)
str(swq)
## End(Not run)
Helper function: get surface water variables
get_surfacewater_variables()
vector with surface water variables
Wasserportal Berlin: get master data for a single station
get_wasserportal_master_data(master_url)
master_url |
url with master data for single station as retrieved by
|
data frame with metadata for selected station
## Not run:
stations_list <- wasserportal::get_stations(type = "list")
# GW Station
master_url <- stations_list %>%
  kwb.utils::selectElements("groundwater.level") %>%
  kwb.utils::selectColumns("stammdaten_link")[1L]
get_wasserportal_master_data(master_url)
# SW Station
# Reduce to monitoring stations maintained by Berlin
master_urls <- stations_list %>%
  kwb.utils::selectElements("surface_water.water_level") %>%
  dplyr::filter(.data$Betreiber == "Land Berlin") %>%
  dplyr::pull(.data$stammdaten_link)
get_wasserportal_master_data(master_urls[1L])
## End(Not run)
Wasserportal Berlin: get master data for multiple stations
get_wasserportal_masters_data(master_urls, run_parallel = TRUE)
master_urls |
URLs to master data as found in column "stammdaten_link"
of the data frame returned by
|
run_parallel |
default: TRUE |
data frame with metadata for selected master urls
## Not run:
stations_list <- wasserportal::get_stations(type = "list")
# Reduce to monitoring stations maintained by Berlin
master_urls <- stations_list$surface_water.water_level %>%
  dplyr::filter(.data$Betreiber == "Land Berlin") %>%
  dplyr::pull(.data$stammdaten_link)
system.time(master_parallel <- get_wasserportal_masters_data(
  master_urls
))
system.time(master_sequential <- get_wasserportal_masters_data(
  master_urls,
  run_parallel = FALSE
))
## End(Not run)
Get Names and IDs of the Stations of wasserportal.berlin.de
get_wasserportal_stations(type = "quality")
type |
one of "quality", "level", "flow" |
Wasserportal Berlin: get stations overview table
get_wasserportal_stations_table(
  type = get_overview_options()$groundwater$level,
  url_wasserportal = wasserportal_base_url()
)
type |
type of stations table to retrieve. Valid options defined in
|
url_wasserportal |
base url to Wasserportal berlin (default:
|
data frame with master data of selected monitoring stations
types <- wasserportal::get_overview_options()
str(types)
sw_l <- wasserportal::get_wasserportal_stations_table(type = types$surface_water$water_level)
str(sw_l)
Get Names and IDs of the Variables of wasserportal.berlin.de
get_wasserportal_variables(station = NULL)
station |
station id. If given, only variables that are available for the given station are returned. |
Helper function: list data to csv or zip
list_data_to_csv_or_zip(data_list, file_prefix, to_zip)
data_list |
data in list form |
file_prefix |
file prefix |
to_zip |
whether or not to convert to zip file |
loops through list of data frames and uses list names as filenames
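The loop described above can be sketched in base R as follows. This is an illustration only: the function name write_list_sketch, the target_dir argument and the ".csv" naming scheme are assumptions, and the zip step is left out because utils::zip depends on an external zip program.

```r
# Hypothetical sketch: write each data frame of a named list to a CSV
# file whose name is derived from the list element's name.
write_list_sketch <- function(data_list, file_prefix = "", target_dir = tempdir()) {
  vapply(names(data_list), function(name) {
    file <- file.path(target_dir, paste0(file_prefix, name, ".csv"))
    utils::write.csv(data_list[[name]], file, row.names = FALSE)
    file
  }, character(1))
}

# Made-up example data
data_list <- list(
  level = data.frame(id = 1:2, value = c(30.1, 30.4)),
  quality = data.frame(id = 3L, value = 7.2)
)
files <- write_list_sketch(data_list, file_prefix = "stations_")
print(basename(files))
```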
Helper function: list masters data to csv
list_masters_data_to_csv(masters_data_list)
masters_data_list |
masters data in list form as retrieved by
|
loops through list of data frames and uses list names as filenames
## Not run:
stations_list <- get_stations(type = "list")
masters_data_csv_files <- list_masters_data_to_csv(stations_list)
masters_data_csv_files
## End(Not run)
Helper function: list timeseries data to zip
list_timeseries_data_to_zip(timeseries_data_list)
timeseries_data_list |
time series data in list form as retrieved by
|
loops through list of data frames and uses list names as filenames
## Not run:
stations <- wasserportal::get_stations()
# Groundwater Time Series
gw_tsdata_list <- wasserportal::get_groundwater_data(stations)
gw_tsdata_files <- wasserportal::list_timeseries_data_to_zip(gw_tsdata_list)
# Surface Water Time Series
sw_tsdata_list <- wasserportal::get_daily_surfacewater_data(stations)
sw_tsdata_files <- wasserportal::list_timeseries_data_to_zip(sw_tsdata_list)
## End(Not run)
Helper function to read CSV
read(text, ...)
text |
text |
... |
... additional arguments passed to |
data frame with values
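As a rough illustration of reading CSV content from a text string, the following base R sketch uses utils::read.table with its text argument. The function name read_text_sketch, the semicolon separator, the decimal comma and the column names in the sample are all assumptions; the format actually delivered by the Wasserportal may differ.

```r
# Hypothetical sketch: parse CSV text into a data frame.
# Assumptions: semicolon-separated fields, decimal comma, header row.
read_text_sketch <- function(text, ...) {
  utils::read.table(
    text = text,
    header = TRUE,
    sep = ";",
    dec = ",",
    stringsAsFactors = FALSE,
    ...
  )
}

# Made-up sample text
text <- "Datum;Einzelwert\n01.03.2021 00:00;1,5\n01.03.2021 00:15;1,7"
values <- read_text_sketch(text)
str(values)
```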
This function downloads and reads CSV files from wasserportal.berlin.de.
read_wasserportal(
  station,
  variables = NULL,
  from_date = as.character(Sys.Date() - 90L),
  type = "single",
  include_raw_time = FALSE,
  stations_crosstable
)
station |
station number, as found in column "Messstellennummer" of the
data frame returned by |
variables |
vector of variable identifiers, as returned by
|
from_date |
|
type |
one of "single" (the default), "daily", "monthly" |
include_raw_time |
if |
stations_crosstable |
data frame as returned by
|
The original timestamps (column timestamp_raw in the example below)
are not all plausible, e.g. "31.03.2019 03:00" appears twice! They are
corrected (column timestamp_corr) to represent a plausible sequence of
timestamps in Berlin Normal Time (UTC+01). Finally, a valid POSIXct timestamp
in time zone "Europe/Berlin" (UTC+01 in winter, UTC+02 in summer) is created,
together with additional information on the UTC offset (column
UTCOffset, 1 in winter, 2 in summer).
data frame read from the CSV file that the download provides. IMPORTANT: It is not yet clear how to interpret the timestamp, see example
## Not run:
# Get a list of available water quality stations and variables
stations_crosstable <- wasserportal::get_stations(type = "crosstable")
# Set the start date
from_date <- "2021-03-01"
# Read the timeseries (multiple variables for one station)
water_quality <- wasserportal::read_wasserportal(
  station = stations_crosstable$Messstellennummer[1L],
  from_date = from_date,
  include_raw_time = TRUE,
  stations_crosstable = stations_crosstable
)
# Look at the first few records
head(water_quality)
# Check the metadata
#kwb.utils::getAttribute(water_quality, "metadata")
# Set missing values to NA
water_quality[water_quality == -777] <- NA
# Look at the first few records again
head(water_quality)
### How was the original timestamp interpreted?
# Determine the days at which summer time starts and ends, respectively
from_year <- as.integer(substr(from_date, 1L, 4L))
switches <- kwb.datetime::date_range_CEST(from_year)
# Reformat to dd.mm.yyyy
switches <- kwb.datetime::reformatTimestamp(switches, "%Y-%m-%d", "%d.%m.%Y")
# Define a pattern to look for timestamps "around" the switches
pattern <- paste(switches, "0[1-4]", collapse = "|")
# Look at the data for these timestamps
water_quality[grepl(pattern, water_quality$timestamp_raw), ]
# The original timestamps (timestamps_raw) were not all plausible, e.g.
# for March 2019. This seems to have been fixed by the "wasserportal"!
sum(water_quality$timestamp_raw != water_quality$timestamp_corr)
## End(Not run)
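The relationship between timestamps in Berlin Normal Time (UTC+01), local Berlin time and the UTC offset, as described in the details of read_wasserportal, can be sketched in base R. The sample timestamps are made up and the approach shown is an assumption about one possible way to do the conversion, not the package's actual code.

```r
# Made-up timestamps around the switch to summer time in March 2019,
# assumed to be given in Berlin Normal Time (fixed offset UTC+01)
timestamp_corr <- c(
  "30.03.2019 23:00", "31.03.2019 01:00",
  "31.03.2019 02:00", "31.03.2019 03:00"
)

# Parse as if the text were UTC, then subtract one hour to get true UTC
utc <- as.POSIXct(timestamp_corr, format = "%d.%m.%Y %H:%M", tz = "UTC") - 3600

# Express the same instants in local Berlin time
local_time <- format(utc, format = "%Y-%m-%d %H:%M", tz = "Europe/Berlin")

# Derive the UTC offset in hours from the "%z" representation, e.g. "+0200"
utc_offset <- as.integer(format(utc, format = "%z", tz = "Europe/Berlin")) / 100

print(data.frame(timestamp_corr, local_time, utc_offset))
```

Working in a fixed offset first avoids the ambiguous or non-existent local times that occur on the days when summer time starts or ends.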
Read Wasserportal Raw
read_wasserportal_raw(
  variable,
  station,
  from_date,
  type = "single",
  include_raw_time = FALSE,
  handle = NULL,
  stations_crosstable,
  api_version = 2L
)
variable |
variable |
station |
station id |
from_date |
start date |
type |
one of "single", "daily", "monthly" (default: "single") |
include_raw_time |
TRUE or FALSE (default: FALSE) |
handle |
handle (default: NULL) |
stations_crosstable |
data frame as returned by
|
api_version |
integer number representing the version of wasserportal's API. 1L: before 2023, 2L: since 2023. Default: 2L |
????
read_wasserportal_raw_gw
read_wasserportal_raw_gw(
  station = 149,
  stype = "gws",
  type = "single_all",
  from_date = "",
  include_raw_time = FALSE,
  handle = NULL,
  as_text = FALSE,
  dbg = FALSE
)
station |
station id |
stype |
"gws" or "gwq" |
type |
"single" or "single_all" (if stype = "gwq") |
from_date |
(default: "") |
include_raw_time |
default: FALSE |
handle |
default: NULL |
as_text |
if TRUE, this function returns the raw text of the HTTP response from the Wasserportal. Otherwise (the default) the function tries to interpret the raw text as comma-separated values and returns a corresponding data frame. Use as_text = TRUE to analyse the raw text if an error occurs when converting the text to a data frame. |
dbg |
logical indicating whether or not to show debug messages. The default is FALSE |
data.frame with values
## Not run:
read_wasserportal_raw_gw(station = 149, stype = "gws")
read_wasserportal_raw_gw(station = 149, stype = "gwq")
## End(Not run)
Read CSV File from Package's "extdata" Folder
readPackageFile(file, ...)
file |
file name (without path) |
... |
additional arguments passed to |
data frame representing the content of file
Helper function: Base URL of Berlin Wasserportal
wasserportal_base_url()
string with base url of Berlin Wasserportal
Wasserportal Master Data: download and Import in R List
wp_masters_data_to_list(
  overview_list_names,
  target_dir = tempdir(),
  file_prefix = "stations_",
  is_zipped = FALSE
)
overview_list_names |
names of elements in the list returned by
|
target_dir |
target directory for downloading data (default: tempdir()) |
file_prefix |
prefix given to file names |
is_zipped |
are the data to be downloaded zipped (default: FALSE) |
downloads csv master data from Wasserportal
## Not run:
overview_list_names <- names(wasserportal::get_stations(type = "list"))
wp_masters_data_list <- wp_masters_data_to_list(overview_list_names)
## End(Not run)
Wasserportal Time Series Data: download and Import in R List
wp_timeseries_data_to_list(
  overview_list_names,
  target_dir = tempdir(),
  is_zipped = TRUE
)
overview_list_names |
names of elements in the list returned by
|
target_dir |
target directory for downloading data (default: tempdir()) |
is_zipped |
are the data to be downloaded zipped (default: TRUE) |
downloads (zipped) data from wasserportal
## Not run:
overview_list_names <- names(wasserportal::get_stations(type = "list"))
wp_timeseries_data_list <- wp_timeseries_data_to_list(overview_list_names)
## End(Not run)