| Title: | Markdown-Documented Data Preparation |
|---|---|
| Description: | R Package for Markdown-documented data preparation. |
| Authors: | Hauke Sonnenberg [aut, cre] (ORCID: <https://orcid.org/0000-0001-9134-2871>), SEMA-BERLIN-2 [fnd], Kompetenzzentrum Wasser Berlin gGmbH (KWB) [cph] |
| Maintainer: | Hauke Sonnenberg <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.3.0 |
| Built: | 2026-05-19 10:32:16 UTC |
| Source: | https://github.com/KWB-R/kwb.prep |
Apply Groups of Filter Criteria from Configuration
apply_filters(data, groups, length_column = NULL, id_columns = names(data)[1L])

| Argument | Description |
|---|---|
| data | data frame |
| groups | names of filter criteria groups defined in list returned by |
| length_column | name of column in |
| id_columns | names of column(s) in |

data, filtered according to the specified criteria. The returned data
frame has an attribute filter_info, a list with as many
elements as there are groups. The elements are named according to
the values given in groups. Each list element is itself a list with one
element overview (a data frame with one row per filter
criterion) and further elements removed_\<i\>, data frames containing
only the id_columns of the records that were removed in
the corresponding filter step i.

    # Define filter criteria
    criteria <- list(
      sepal = c(
        "sepal short" = "Sepal.Length < 5",
        "sepal narrow" = "Sepal.Width < 3"
      ),
      petal = c(
        "petal short" = "Petal.Length < 5",
        "petal narrow" = "Petal.Width < 3"
      )
    )

    # Write criteria to temporary yaml file
    tdir <- tempdir()
    yaml::write_yaml(criteria, file.path(tdir, "filter_criteria.yml"))

    # Set path to temporary "config" folder so that kwb.prep knows about it
    kwb.prep:::set_user_config_dir(tdir)

    # Apply filter groups "sepal" and "petal" to the iris dataset
    result <- apply_filters(iris, c("sepal", "petal"))

    # Have a look at the result
    str(result)

Apply Filter Criteria from List
applyFilter(data, criteria_list, element, length_column = NULL)

| Argument | Description |
|---|---|
| data | data frame |
| criteria_list | list of (named) vectors of character representing filter criteria |
| element | name of list element to be selected from |
| length_column | passed to |


    criteria_list <- list(
      apple = c("is red or green" = "colour %in% c('red', 'green')"),
      banana = c("is not straight" = "! straight")
    )

    fruit_properties <- data.frame(
      colour = c("green", "red", "yellow"),
      straight = c(TRUE, TRUE, FALSE)
    )

    applyFilter(fruit_properties, criteria_list, "apple")
    applyFilter(fruit_properties, criteria_list, "banana")

Details about the criteria applied and the number of rows matching each
criterion are returned in the attribute "details.filter". If a criterion
evaluates to NA, the corresponding row in the data frame is removed (just as
if the criterion had evaluated to FALSE).
applyFilterCriteria(x, criteria = NULL, lengthColumn = NULL, ...)

| Argument | Description |
|---|---|
| x | data frame |
| criteria | vector of character defining filter criteria to be evaluated in x |
| lengthColumn | name of the column containing lengths, e.g. "Length_raw" |
| ... | passed to |


    # Create a very simple data frame
    df <- data.frame(value = 1:10, group = rep(c("a", "b"), 5))

    # Show the data frame
    df

    # Filter for rows meeting two criteria
    result <- applyFilterCriteria(df, c(
      "value is below or equal to 5" = "value <= 5",
      "group is 'a'" = "group == 'a'"
    ))

    # Show the result
    result

    # Get the evaluation of each criterion in columns
    kwb.utils::getAttribute(result, "matches")

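The NA rule stated for applyFilterCriteria() (a criterion that evaluates to NA removes the row, just like FALSE) can be illustrated with a minimal base R sketch. The helper `filter_sketch` is hypothetical and is not taken from kwb.prep; it assumes that each criterion is an R expression evaluated with the columns of the data frame as variables.

```r
# Hypothetical helper for illustration only (not part of kwb.prep)
filter_sketch <- function(x, criteria) {
  keep <- rep(TRUE, nrow(x))
  for (criterion in criteria) {
    # Evaluate the criterion string using the columns of x as variables
    matches <- eval(parse(text = criterion), envir = x)
    # NA is treated like FALSE: the row is dropped
    keep <- keep & !is.na(matches) & matches
  }
  x[keep, , drop = FALSE]
}

df <- data.frame(value = c(1, NA, 7), group = c("a", "a", "b"))
result <- filter_sketch(df, c("value is at most 5" = "value <= 5"))
result
```

Only the first row is kept here: the second row is dropped because the comparison yields NA, the third because it yields FALSE.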
Provide all Objects of kwb.prep in the Global Environment
assign_objects()
Create labels for intervals defined by breaks in different possible styles
breaksToIntervalLabels(breaks, style = 5, ...)

| Argument | Description |
|---|---|
| breaks | numeric vector of breaks |
| style | passed to |
| ... | further arguments passed to |

Check if the argument can be used as a table name
check_table_name(table_name)

| Argument | Description |
|---|---|
| table_name | R object to be checked for usage as a table name |


    try(check_table_name(c("more", "than", "one", "string")))
    try(check_table_name("one_is_ok"))

Stop if File Name Does not End with Zip Extension
check_zip_extension(file)

| Argument | Description |
|---|---|
| file | path to file to check for .zip or .7z file name extension |

The function does not return anything but stops with a clear error
message if file does not end with something that looks
like the file extension of a compressed file.
Compare Two Columns of a Data Frame (Raw Vs Regrouped)
checkGrouping(data, column_raw, column_cat)

| Argument | Description |
|---|---|
| data | data frame |
| column_raw | name of column in |
| column_cat | name of column in |

Show Number of Unique Values in Selected Columns
checkNumberOfUnique(data, columns = names(data))

| Argument | Description |
|---|---|
| data | data frame |
| columns | names of columns in |

Collect Elements of Sublists
collect(x, element, default = NULL)

| Argument | Description |
|---|---|
| x | a list of lists |
| element | name of list element to be collected from each sublist of |
| default | value to be returned for lists that do not have an element called |


    x <- list(
      list(a = 1, b = 2),
      list(c = 3, a = 4),
      list(d = 5, e = 6)
    )

    collect(x, "a")
    collect(x, "a", default = 99)

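The idea behind collect() (pick one named element from each sublist, falling back to a default where it is missing) can be sketched in base R as follows. `collect_sketch` is a hypothetical stand-in, not the package code; in particular, the shape of the value returned by the real function (list or vector) is not stated in this manual, so the sketch simply returns a list.

```r
# Hypothetical re-implementation for illustration (not the kwb.prep code)
collect_sketch <- function(x, element, default = NULL) {
  lapply(x, function(sublist) {
    # Use the element if the sublist has it, otherwise the default
    if (element %in% names(sublist)) sublist[[element]] else default
  })
}

x <- list(
  list(a = 1, b = 2),
  list(c = 3, a = 4),
  list(d = 5, e = 6)
)

with_default <- collect_sketch(x, "a", default = 99)
str(with_default)
```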
Create a get_text() Function
create_text_getter(raw_strings = NULL, FUN = NULL)

| Argument | Description |
|---|---|
| raw_strings | list of string definitions (key = value pairs) |
| FUN | function to be called to get the string definitions |

a function that can be used to look up the string constant(s)

    get_text <- create_text_getter(
      list(hello_en = "good morning", hello_de = "sch<oe>ne Gr<ue><ss>e")
    )

    get_text("hello_en")
    get_text("hello_de")

    # get_text("no_such_key") # error with clear error message

In the character matrix the data frames appear one below the other. Each data frame has a header and each data frame is separated from the following data frame by an empty row.
dataFramesToTextMatrix(data_frames)

| Argument | Description |
|---|---|
| data_frames | list of data frames |


    data_frames <- list(
      data.frame(a = 1:3, b = 2:4),
      data.frame(a = 1:5, b = 2:6, c = 3:7)
    )

    dataFramesToTextMatrix(data_frames)

Apply Regrouping of Values in a Data Frame
doRegroupings(Data, regroup.actual = kwb.utils::selectElements(settings, "regroup.actual"), regroup.config = kwb.utils::selectElements(settings, "regroup.config"), settings = NULL, checkRemaining = TRUE, to.factor = FALSE, to.numeric = TRUE, dbg = TRUE)

| Argument | Description |
|---|---|
| Data | data frame |
| regroup.actual | default: settings$regroup.actual |
| regroup.config | default: settings$regroup.config |
| settings | list of settings that may contain the elements |
| checkRemaining | if TRUE (default), it is checked whether all values that occurred in a column to be regrouped have been considered in the regrouping |
| to.factor | if |
| to.numeric | (default: |
| dbg | if |

Frequency of Value Combinations in Data Frame Columns
fieldSummary(x, groupBy = names(x)[-1L], lengthColumn = "", na = "Unknown")

| Argument | Description |
|---|---|
| x | data frame |
| groupBy | vector of character naming the columns (fields) in |
| lengthColumn | optional. Name of column in |
| na | optional. Value to be treated as |


    n <- 1000L

    sample_replace <- function(x, ...) sample(x, size = n, replace = TRUE, ...)

    x <- data.frame(
      pipe_id = 1:n,
      material = sample_replace(c("clay", "concrete", "other")),
      age_cat = sample_replace(c("young", "old")),
      length = as.integer(rnorm(n, 50)),
      stringsAsFactors = FALSE
    )

    fieldSummary(x)
    fieldSummary(x, "age_cat")
    fieldSummary(x, "material")
    fieldSummary(x, "material", lengthColumn = "length")

Fill NA in First Vector With Values From Second Vector
fillUpNA(x, y, dbg = TRUE, name_x = NULL, name_y = NULL)

| Argument | Description |
|---|---|
| x | first vector |
| y | second vector |
| dbg | if |
| name_x | name of x |
| name_y | name of y |

x with NA replaced by the values in y at the
same positions
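
The replacement rule described for fillUpNA() can be written in plain base R. This is only a sketch of the stated behaviour (NA in x replaced by the value of y at the same position), not the package implementation; the debug messages controlled by dbg are left out.

```r
x <- c(1, NA, 3, NA)
y <- c(10, 20, 30, 40)

# Positions at which x is missing
is_missing <- is.na(x)

# Take the value from y at exactly these positions
filled <- x
filled[is_missing] <- y[is_missing]

filled
```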
Show String Constants Used in R Scripts
find_string_constants()
Get Names of CSV Files Referenced in Config
get_csv_filenames(config, keep_empty = FALSE)

| Argument | Description |
|---|---|
| config | configuration object (list) with one entry per "table", each of which is expected to have an entry "file" |
| keep_empty | logical. Whether or not to keep "file" entries that are empty ("") |

vector of character with the file names referenced in config

    config <- list(
      table_a = list(file = "table-a.csv"),
      table_b = list(file = "table-b.csv")
    )

    get_csv_filenames(config)

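What get_csv_filenames() is described to do (collect the "file" entries of a configuration list, optionally dropping empty ones) could look like this in base R. The sketch is an assumption about the mechanics, not the package code; whether the real function keeps the table names as vector names is not stated in this manual.

```r
config <- list(
  table_a = list(file = "table-a.csv"),
  table_b = list(file = ""),
  table_c = list(file = "table-c.csv")
)

# Extract the "file" entry of each table
files <- vapply(config, function(entry) entry$file, character(1))

# Drop empty entries, corresponding to keep_empty = FALSE
non_empty <- files[nzchar(files)]

non_empty
```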
Lower Case Extension of a File
get_lower_extension(file)

| Argument | Description |
|---|---|
| file | file path or file name |


    get_lower_extension("abc.XYZ")

Resolve Path from Path Dictionary in Config Folder
get_path(x = NULL, ...)

| Argument | Description |
|---|---|
| x | key to be looked up in the path dictionary |
| ... | possible key = value assignments to be used to replace \<placeholders\> in the path that was looked up |

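
The placeholder replacement mentioned for the ... argument can be pictured with a small base R sketch. `resolve_sketch` and the example path are hypothetical; the format of the actual path dictionary in the config folder is not described in this manual.

```r
# Hypothetical helper: replace <key> placeholders by the values given in ...
resolve_sketch <- function(path, ...) {
  replacements <- list(...)
  for (key in names(replacements)) {
    placeholder <- paste0("<", key, ">")
    path <- gsub(placeholder, replacements[[key]], path, fixed = TRUE)
  }
  path
}

resolved <- resolve_sketch("<root>/data/<year>.csv", root = "/tmp", year = "2024")
resolved
```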
Get list defining renamings in the form of from = to assignments from
a data frame read by a function that may be specified.
get_renamings(from, to = "column", data = NULL, reader = read_csv_file, ...)

| Argument | Description |
|---|---|
| from | name of column of |
| to | name of column of |
| data | data frame defining renamings |
| reader | reader function providing |
| ... | arguments passed to the |

list defining renamings as e.g. expected by
renameColumns
Get List of Renamings from Configuration
get_renamings_from_config(config, table_name, all = TRUE)

| Argument | Description |
|---|---|
| config | list with one element per table/csv file |
| table_name | name of list element within |
| all | if |

list with original names as names and internal names as values. The
list can be used in a call to renameColumns
Get a Set of Column Names from a Data Frame Defining Selections
get_selection(number = 1, data = NULL, reader = read_csv_file, ..., column = paste0("select.", number), target = "column")

| Argument | Description |
|---|---|
| number | number of the selection group, default: 1 |
| data | data frame defining groups of columns |
| reader | reader function providing |
| ... | arguments passed to the |
| column | name of column in |
| target | name of column in |

vector of column names
Get Text Constant
get_text(key = NULL, ..., raw_strings = get_raw_strings())

| Argument | Description |
|---|---|
| key | identifier |
| ... | additional arguments passed to |
| raw_strings | list with raw string definitions as key = value pairs |

(if key is NULL) a list with all text constants or the
text constant looked up for the given key
Get List of User-Defined Text Constants
get_user_strings()
List Files in Zip Archive
get_zipped_paths(zip_file, include_dirs = FALSE)

| Argument | Description |
|---|---|
| zip_file | path to zip archive |
| include_dirs | if |

paths to files contained in zip archive
Get Changes of Rows That Are Duplicated in Selected Columns
getChangesOfDuplicates(df, columns, add_columns = columns)

| Argument | Description |
|---|---|
| df | a data frame |
| columns | names of columns in |
| add_columns | names of additional columns that shall appear in the output even if there are no changes in these columns |

list of data frames. The list has as many elements as there are
different value combinations in columns that appear more than once
in df. Each element is a data frame with all rows from df
that have the same value combination in columns. By default the data
frame contains the columns given in columns and those columns out of
df in which there is at least one change over the values in the
different rows.

    df <- data.frame(
      id = 1:7,
      name = c("one", "one", "two", "two", "three", "three", "three"),
      type = c("A", "A", "B", "C", "D", "D", "D"),
      size = c(10, 11, 12, 12, 13, 13, 14),
      height = c(1, 1, 2, 3, 4, 4, 5)
    )

    df

    getChangesOfDuplicates(df, "name")
    getChangesOfDuplicates(df, c("name", "type"))

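The return value described for getChangesOfDuplicates() (one data frame per duplicated value combination, reduced to the key columns plus the columns that change) can be approximated in base R. `changes_sketch` is a hypothetical helper written for this illustration; it ignores the add_columns argument, and the element names and column order of the real result may differ.

```r
# Hypothetical sketch of the described logic (not the kwb.prep code)
changes_sketch <- function(df, columns) {
  # One key per row, built from the selected columns
  key <- do.call(paste, c(df[columns], sep = "|"))
  groups <- split(df, key)
  # Keep only value combinations that occur more than once
  groups <- groups[vapply(groups, nrow, integer(1)) > 1L]
  lapply(groups, function(g) {
    # Columns in which at least two rows of the group differ
    changed <- vapply(g, function(col) length(unique(col)) > 1L, logical(1))
    g[, union(columns, names(g)[changed]), drop = FALSE]
  })
}

df <- data.frame(
  id = 1:7,
  name = c("one", "one", "two", "two", "three", "three", "three"),
  type = c("A", "A", "B", "C", "D", "D", "D"),
  size = c(10, 11, 12, 12, 13, 13, 14),
  height = c(1, 1, 2, 3, 4, 4, 5)
)

changes <- changes_sketch(df, "name")
changes$one
```

For the name "one", only id and size differ between the two rows, so the sketch returns the columns name, id and size for that group.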
Get Integer Year Number from Column
getYearFromColumn(data, column)

| Argument | Description |
|---|---|
| data | data frame |
| column | representing a date or date and time |

vector of integer as long as the number of rows in data
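
A minimal base R sketch of extracting integer years from a date column follows. It assumes ISO-formatted date strings (YYYY-MM-DD) in a made-up column named `installed`; which input formats getYearFromColumn() itself accepts is not stated in this manual.

```r
data <- data.frame(
  installed = c("2020-05-01", "1999-12-31"),
  stringsAsFactors = FALSE
)

# Convert to Date, format the year as text, then convert to integer
years <- as.integer(format(as.Date(data$installed), "%Y"))

years
```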
Group values together that belong to the same intervals being defined by breaks
groupByBreaks(x, breaks, values = breaksToIntervalLabels(breaks), right = TRUE, add.Inf.limits = TRUE, to.factor = FALSE, columns = NULL, keyFields = NULL)

| Argument | Description |
|---|---|
| x | vector of values or a data frame. If |
| breaks | vector of breaks |
| values | values to be assigned |
| right | if TRUE the intervals are right-closed, else left-closed. |
| add.Inf.limits | if TRUE (default), -Inf and Inf are added to the left and right, respectively, of |
| to.factor | if |
| columns | |
| keyFields | |


    groupByBreaks(1:10, breaks = 5, values = c("<= 5", "> 5"))
    groupByBreaks(1:10, breaks = 5, right = FALSE, values = c("< 5", ">= 5"))

    # Prepare a simple data frame
    x <- kwb.utils::noFactorDataFrame(
      id = c("A", "B", "C"),
      value = c(10, 20, 30)
    )

    # Keep the ID column of the data frame
    groupByBreaks(x, breaks = 20, keyFields = "id")

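The effect of the arguments right and add.Inf.limits can be reproduced with base R's cut(). This is a sketch of the interval logic only, under the assumption that groupByBreaks() behaves like cut() with -Inf and Inf added to the breaks; it is not a statement about how the function is implemented.

```r
values <- 1:10

# right = TRUE: intervals are (-Inf, 5] and (5, Inf]
right_closed <- as.character(
  cut(values, breaks = c(-Inf, 5, Inf), labels = c("<= 5", "> 5"), right = TRUE)
)

# right = FALSE: intervals are [-Inf, 5) and [5, Inf)
left_closed <- as.character(
  cut(values, breaks = c(-Inf, 5, Inf), labels = c("< 5", ">= 5"), right = FALSE)
)

# The value 5 changes groups depending on which side is closed
right_closed[5]
left_closed[5]
```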
Does a File have a Zip Extension (.zip, .7z)?
has_zip_extension(file, expected = c("zip", "7z"))

| Argument | Description |
|---|---|
| file | path(s) to file(s) to be checked for zip extension |
| expected | expected file name extensions. Default: |

vector of logical

    all(has_zip_extension(c("a.zip", "b.ZIP", "c.Zip", "d.7z", "e.7Z"))) # TRUE
    has_zip_extension("a.txt") # FALSE

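The case-insensitive extension check shown in the example can be sketched with tools::file_ext() from base R. This is one plausible way to get the same results for these inputs, not necessarily what has_zip_extension() does internally.

```r
files <- c("a.zip", "b.ZIP", "c.Zip", "d.7z", "e.7Z", "a.txt")

# Lower-case the extension before comparing with the expected ones
is_zip <- tolower(tools::file_ext(files)) %in% c("zip", "7z")

is_zip
```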
The function stops with an error message if the file does not have the
file extension ".zip", if the zip file does not contain the expected csv
files, or if a csv file does not contain all expected fields (columns).
Expected file names and field names are provided in config. If
everything looks ok, the csv files in the zip file are extracted into a (new)
folder in the app's "run" directory. The app directory is provided in the
environment variable SEMA_BERLIN_PREP_APP_DIR.
import_db(zip_file, config, base_name = basename(zip_file))

| Argument | Description |
|---|---|
| zip_file | path to zip file containing csv files |
| config | configuration object (list) describing the csv files |
| base_name | base name of the folder to be created. The current date will also be encoded in the folder name. By default the base name of the zip file (file name without file extension) is used. |

Create a label for the interval defined by the lower boundary a and
the upper boundary b

intervalLabel(a, b, right = TRUE, style = 1, sep = ",", space = " ")

| Argument | Description |
|---|---|
| a | lower boundary |
| b | upper boundary |
| right | if TRUE (default) the interval is closed at the upper boundary |
| style | integer number between 1 and 5 indicating one of five possible styles to name the interval between |
| sep | separator to be used between lower and upper boundary |
| space | space between comparison operators and boundary values. |


    # Labels of different styles for right closed intervals (right = TRUE is the
    # default)
    intervalLabel(1, 10, style = 1) # "(1,10]"
    intervalLabel(1, 10, style = 2) # "<= 10"
    intervalLabel(1, 10, style = 3) # "> 1"
    intervalLabel(1, 10, style = 4) # "<= " "> 1" (vector of two elements!)
    intervalLabel(1, 10, style = 5) # "<= 10" "> " (vector of two elements!)

    # The same with left closed intervals:
    right <- FALSE
    intervalLabel(1, 10, right, style = 1) # "[1,10)"
    intervalLabel(1, 10, right, style = 2) # "< 10"
    intervalLabel(1, 10, right, style = 3) # ">= 1"
    intervalLabel(1, 10, right, style = 4) # "< " ">= 1" (vector of two elements!)
    intervalLabel(1, 10, right, style = 5) # "< 10" ">= " (vector of two elements!)

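How a style 1 label is composed can be sketched in a few lines of base R. `interval_label_sketch` is a hypothetical helper that reproduces only the style 1 results shown in the example comments above; the other styles and the space argument are not covered.

```r
# Hypothetical helper: bracket notation as in style = 1
interval_label_sketch <- function(a, b, right = TRUE, sep = ",") {
  # The closed side gets a square bracket, the open side a parenthesis
  left_bracket <- if (right) "(" else "["
  right_bracket <- if (right) "]" else ")"
  paste0(left_bracket, a, sep, b, right_bracket)
}

interval_label_sketch(1, 10)
interval_label_sketch(1, 10, right = FALSE)
```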
Print Data Frame as Markdown Table (Without Row Names by Default)
kable_no_rows(..., row.names = FALSE)

| Argument | Description |
|---|---|
| ... | passed to |
| row.names | passed to |

Rename Data Frame Columns and Print as Markdown
kable_translated(x, ...)

| Argument | Description |
|---|---|
| x | x |
| ... | passed to |

Convert Vector of Logical to Vector of "Ja"/"Nein"
logicalToYesNo(x, yesno = c("Ja", "Nein"))

| Argument | Description |
|---|---|
| x | vector of logical |
| yesno | vector of character of length two giving the strings to be used for |

vector of character

    logicalToYesNo(c(TRUE, FALSE, TRUE))
    logicalToYesNo(c(TRUE, FALSE, TRUE), yesno = c("Yeah!", "Oh no!"))

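The mapping from logical values to two strings is a one-liner in base R. The sketch below assumes that the first element of yesno stands for TRUE and the second for FALSE, as the default c("Ja", "Nein") suggests; how the package function treats NA is not stated in this manual.

```r
x <- c(TRUE, FALSE, TRUE)
yesno <- c("Ja", "Nein")

# First string for TRUE, second string for FALSE
labels <- ifelse(x, yesno[1], yesno[2])

labels
```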
Print Markdown Section Header
md_header(level, caption_key = "key?", caption = NULL, print = TRUE, msg = TRUE)

| Argument | Description |
|---|---|
| level | level |
| caption_key | caption_key |
| caption | caption |
| print | |
| msg | msg |

Overwrite the values in the target column with the values in the source column at indices where the values in the source column are not NA
overwriteIfNotNA(data, target_column, source_column)

| Argument | Description |
|---|---|
| data | data frame |
| target_column | name of target column |
| source_column | name of source column |

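
The overwrite rule described for overwriteIfNotNA() can be sketched directly in base R. The column names `target` and `source` are made up for this illustration, and the snippet is a reading of the description rather than the package code.

```r
data <- data.frame(
  target = c(1, 2, 3),
  source = c(NA, 20, NA)
)

# Rows in which the source column provides a value
use_source <- !is.na(data$source)

# Overwrite the target only in these rows
data$target[use_source] <- data$source[use_source]

data
```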
Print Result of Data Frame Comparison

    ## S3 method for class 'data_frame_diff'
    print(x, ...)

| Argument | Description |
|---|---|
| x | object of class "data_frame_diff" |
| ... | currently not used |

Print Number of NA Values in Given Column
printNumberOfNA(data, column, name = NULL)

| Argument | Description |
|---|---|
| data | data frame |
| column | column name |
| name | name of data |

Print Result of table() for Given Column
printTableForColumn(data, column, name = NULL)

| Argument | Description |
|---|---|
| data | data frame |
| column | column name |
| name | name of data |

Read and Filter "regroup_actual.csv"
read_actual_regrouping(name_actual, group = NULL, columns = NULL, as_list = TRUE)

| Argument | Description |
|---|---|
| name_actual | Base name of file in config folder, default: "regroup_actual". The file specifies which regroupings are actually to be applied and what the names of the input and output columns are. |
| group | Name of column in |
| columns | names of (input) columns that are to be regrouped. Only those regroupings are performed that work on these columns or columns that are created during the regrouping. By default |
| as_list | it |

Read Data Frame From CSV File
read_csv_file(file, sep = get_column_separator(), dec = ",", encoding = "UTF-8", na.strings = "", ..., remove_comments = TRUE, set_empty_string_to_na = FALSE, dbg = 1L)

| Argument | Description |
|---|---|
| file | path to csv file |
| sep | Column separator character. Default: semicolon ";" |
| dec | Decimal separator character. Default: comma "," |
| encoding | file encoding string. Default: "UTF-8". Possible other value: "unknown" |
| na.strings | strings occurring in the files representing NA (not available). Default: "" |
| ... | further arguments passed to |
| remove_comments | Should rows starting with "#" be removed (the default)? |
| set_empty_string_to_na | if |
| dbg | if |

Assign Values to Groups of Values
regroup(x, assignments, ignore.case = NULL, to.factor = FALSE, to.numeric = TRUE)

| Argument | Description |
|---|---|
| x | vector of values |
| assignments | list of assignments of the form \<key\> = \<values\> with \<values\> being a vector of elements to be looked up in |
| ignore.case | if |
| to.factor | if |
| to.numeric | if |

vector with as many elements as there are elements in x. The
vector contains \<key\> at positions where the elements in x appeared
in the vector \<values\> of a \<key\> = \<values\> assignment of
assignments

    regroup(c("A", "B", "C", "D"), assignments = list(
      "AB" = c("A", "B"),
      "CD" = c("C", "D")
    ))

    x <- c("A", "B", "C", "D", "E", "A")

    assignments <- list(
      "1" = c("A", "B"),
      "2" = c("C", "D")
    )

    regroup(x, assignments)

    # to.factor is ignored...
    regroup(x, assignments, to.factor = TRUE)

    # ... unless to.numeric is FALSE!
    regroup(x, assignments, to.factor = TRUE, to.numeric = FALSE)

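The lookup described in the return value of regroup() (each element of x is replaced by the key whose values contain it) can be sketched in base R. `regroup_sketch` is hypothetical: it returns a character vector, leaves unmatched elements as NA, and does not model ignore.case, to.factor or to.numeric. What the real function returns for unmatched elements is not stated in this manual.

```r
# Hypothetical sketch of the key lookup (not the kwb.prep code)
regroup_sketch <- function(x, assignments) {
  result <- rep(NA_character_, length(x))
  for (key in names(assignments)) {
    # Assign the key wherever x matches one of the values of this key
    result[x %in% assignments[[key]]] <- key
  }
  result
}

x <- c("A", "B", "C", "D", "E", "A")

assignments <- list(
  "1" = c("A", "B"),
  "2" = c("C", "D")
)

regrouped <- regroup_sketch(x, assignments)
regrouped
```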
Regroup Values According to Configuration
regroupedValues(values, config = NULL, labels = "labels1", to.factor = FALSE, to.numeric = TRUE, dbg = TRUE)

| Argument | Description |
|---|---|
| values | vector of values |
| config | configuration (list) describing how to regroup. If the list contains an element |
| labels | default: "labels1" |
| to.factor | if |
| to.numeric | (default: |
| dbg | if |

Which of the actual regroupings would be used if columns were
available in a data frame
regrouping_is_used(columns, actuals)

| Argument | Description |
|---|---|
| columns | vector of column names for which to check if they are subject to regrouping |
| actuals | list of elements |

vector of logical as long as actuals. Attribute column:
which columns would the data frame have after the regrouping?
Remove Rows That are NA in Given Column
removeRowsThatAreNaInColumn(data, column, dbg = TRUE)

| Argument | Description |
|---|---|
| data | data frame |
| column | column name |
| dbg | it |

data with rows removed that are NA in
data[[column]]

    df <- data.frame(a = c(1, NA, 3), b = c(11, 22, NA))

    df

    removeRowsThatAreNaInColumn(df, "a")
    removeRowsThatAreNaInColumn(df, "b")

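For comparison, the same row removal can be written with plain base R indexing. This shows the effect described in the return value, without the messages that the package function may print when dbg is set.

```r
df <- data.frame(a = c(1, NA, 3), b = c(11, 22, NA))

# Keep only rows in which column "a" is not NA
without_na_in_a <- df[!is.na(df[["a"]]), , drop = FALSE]

without_na_in_a
```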
Replace Values in Column in Rows Matching Condition
replaceByCondition(df, file = NULL, group = NULL, config = NULL, dbg = TRUE)

| Argument | Description |
|---|---|
| df | data frame in which to do substitutions |
| file | path to CSV file with columns |
| group | group name. If given, only rows in |
| config | optional. Data frame containing the configuration as being read from |
| dbg | if |


    # Create a very simple data frame
    df <- data.frame(a = 1:3)

    # Create a very simple configuration
    config <- read.table(sep = ",", header = TRUE, text = c(
      "group,target,condition,replacement",
      "g1,a,a==2,22",
      "g2,a,a==3,33"
    ))

    # Write the configuration to a temporary file
    file <- tempfile()
    write.csv(config, file)

    # Apply all replacements configured in the file ...
    replaceByCondition(df, file)

    # ... or in the configuration
    replaceByCondition(df, config = config)

    # Apply selected replacements
    replaceByCondition(df, file, group = "g1")
    replaceByCondition(df, file, group = "g2")

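The configuration-driven replacement shown in the example can be pictured as a loop over the configuration rows. The sketch below uses the column names target, condition and replacement from the example configuration and assumes that each condition is an R expression evaluated within the data frame; it is an illustration, not the implementation of replaceByCondition().

```r
df <- data.frame(a = 1:3)

config <- data.frame(
  target = c("a", "a"),
  condition = c("a == 2", "a == 3"),
  replacement = c(22, 33),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(config))) {
  column <- config$target[i]
  # Evaluate the condition with the columns of df as variables
  hit <- eval(parse(text = config$condition[i]), envir = df)
  # which() ignores NA, so rows with an undefined condition stay untouched
  df[[column]][which(hit)] <- config$replacement[i]
}

df
```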
Use Elements of Substitute at Indices Where Substitutes Are Not NA
replaceUnlessNA(x, substitute)

| Argument | Description |
|---|---|
| x | vector in which to substitute |
| substitute | vector containing substitutions |

Count NA in a Column and Give a Message about It
reportNA(data, column, subject = "in data")

| Argument | Description |
|---|---|
| data | data frame |
| column | name of column in |
| subject | value for placeholder subject in output: "NAs subject: count_NA" |

Set Column
set_column(df, column, value = NULL, indices = NULL, from = NULL, must_exist = TRUE)

| Argument | Description |
|---|---|
| df | data frame |
| column | column |
| value | value |
| indices | row indices |
| from | name of source column, optional |
| must_exist | is column assumed to exist? |

Set List of User-Defined Text Constants
set_user_strings(x)

| Argument | Description |
|---|---|
| x | list of key = (character) value assignments |

Stop with Error Message Looked Up by Keyword
stop_text(...)

| Argument | Description |
|---|---|
| ... | arguments passed to |

Stop with info message if element is not in expected set of elements
stopIfNotIn(element, elements, singular = "option", plural = paste0(singular, "s"), do_stop = TRUE)

| Argument | Description |
|---|---|
| element | element to be looked for in |
| elements | vector of possible elements |
| singular | name of object to appear in error message. Default: |
| plural | name of object (plural) to appear in error message. Default: |
| do_stop | if |

Stop If There Are Duplicates over given Columns
stopOnDuplicates(data, columns = names(data), name = NULL)

| Argument | Description |
|---|---|
| data | data frame |
| columns | names of columns over which to look for duplicates. By default, all columns in |
| name | name of data |

Unzip Archive
unzip_archive(zip_file, target_dir = tempdir(), flatten = TRUE, dbg = TRUE)

| Argument | Description |
|---|---|
| zip_file | path to archive file |
| target_dir | path to target directory |
| flatten | if |
| dbg | whether or not to show debug messages |

Write Information on Filtering to CSV files
write_filter_info(x, target_dir, prefix = deparse(substitute(x)), dbg = TRUE)

| Argument | Description |
|---|---|
| x | data frame as returned by |
| target_dir | path to directory into which to write csv files |
| prefix | string by which to prefix all files |
| dbg | whether or not to show debug messages |

x, unchanged, invisibly
Write a Markdown Chapter
write_markdown_chapter(x, caption_key = "key?", level = 3L, caption = NULL)

| Argument | Description |
|---|---|
| x | x |
| caption_key | caption_key |
| level | level |
| caption | caption |

Write CSV File in a Standardised Manner
writeStandardCsv(x, file, ...)

| Argument | Description |
|---|---|
| x | data frame |
| file | path to CSV file to be written |
| ... | additional arguments passed to |