Title: | Functions to Work with Path Dictionaries |
---|---|
Description: | This package provides functions to work with what I call path dictionaries. Path dictionaries are lists defining file and folder paths. In order not to repeat sub-paths, placeholders can be used. The package provides functions to find duplicated sub-paths and to define placeholders accordingly. |
Authors: | Hauke Sonnenberg [aut, cre] , Michael Rustler [ctb] , Kompetenzzentrum Wasser Berlin gGmbH (KWB) [cph] |
Maintainer: | Hauke Sonnenberg <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-01-09 05:26:17 UTC |
Source: | https://github.com/KWB-R/kwb.pathdict |
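To illustrate the idea from the package description, the following base R sketch shows a small path dictionary in which sub-paths are not repeated but referred to by placeholders. The `<key>` placeholder syntax, the example paths and the helper `resolve_placeholders()` are assumptions made for this illustration; they are not taken from the package.

```r
# Hypothetical path dictionary: "<root>" and "<data>" act as placeholders
dict <- list(
  root = "C:/projects/demo",
  data = "<root>/data",
  raw = "<data>/raw.csv"
)

# Illustrative helper (not part of kwb.pathdict): replace placeholders
# repeatedly until no entry changes any more
resolve_placeholders <- function(dict) {
  repeat {
    resolved <- dict
    for (key in names(dict)) {
      pattern <- paste0("<", key, ">")
      resolved <- lapply(resolved, function(p) {
        gsub(pattern, dict[[key]], p, fixed = TRUE)
      })
    }
    if (identical(resolved, dict)) break
    dict <- resolved
  }
  dict
}

resolve_placeholders(dict)$raw
# "C:/projects/demo/data/raw.csv"
```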
Get a Path Dictionary
get_dictionary_one_by_one(paths, n = 10)
paths |
vector of character strings representing file or folder paths |
n |
number of compression levels |
Convert a Named Vector to a Data Frame
named_vector_to_data_frame(x)
x |
named vector |
data frame with columns name, containing the names of x, and value, containing the values of x
kwb.pathdict:::named_vector_to_data_frame(c(a = 1, b = 2, c = 3))
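For orientation, the described return value can be reproduced in base R. This is a sketch of the documented result (columns name and value), not necessarily how the package implements it.

```r
x <- c(a = 1, b = 2, c = 3)

# One row per element: names go to "name", values go to "value"
df <- data.frame(
  name = names(x),
  value = unname(x),
  stringsAsFactors = FALSE
)

df
```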
Order Data Frame Decreasingly by one Column
order_decreasingly_by(df, column)
df |
data frame |
column |
name of column by which to order decreasingly. |
(df <- data.frame(a = 1:3, b = 11:13))
kwb.pathdict:::order_decreasingly_by(df, "a")
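The same ordering can be sketched with base R's `order()`. This is only an illustration of the described behaviour; whether the package keeps or resets the row names is not stated in this manual.

```r
df <- data.frame(a = 1:3, b = 11:13)

# Order the rows by column "a", largest value first
ordered <- df[order(df[["a"]], decreasing = TRUE), , drop = FALSE]

ordered$a
# 3 2 1
```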
Create Random File Paths Using English Words
random_paths(
  max_depth = 5,
  min_chars = 5,
  max_elements = 10,
  depth_to_leaf_weight = function(depth) 1.2^depth
)
max_depth |
maximum path depth |
min_chars |
minimum number of characters per folder or file name |
max_elements |
maximum number of elements (files or subfolders) in a folder |
depth_to_leaf_weight |
function that calculates a weight from the given path depth. The weight is used to increase the probability that an element of a folder is a file and not a subdirectory. By default the weight is calculated as 1.2^depth, i.e. for a folder at depth 10, an element is about six times (1.2^10 = 6.19) more likely to be a file than a subfolder |
# Make this example reproducible
set.seed(12059)

# Create random paths
paths <- kwb.pathdict::random_paths(max_depth = 5)

# Show the random paths
paths

# Frequency of path depths
table(lengths(kwb.file::split_paths(paths)))
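The effect of the default depth_to_leaf_weight function can be checked with a few lines of base R. The conversion of a weight into a probability shown here (weight / (weight + 1)) is an assumption for illustration only; how random_paths() uses the weight internally is not documented in this manual.

```r
# Default weight function from the usage above
depth_to_leaf_weight <- function(depth) 1.2^depth

depths <- 1:10
weights <- depth_to_leaf_weight(depths)

# Assumption: a weight w means "file" is w times as likely as "subfolder",
# which corresponds to a file probability of w / (w + 1)
p_file <- weights / (weights + 1)

data.frame(
  depth = depths,
  weight = round(weights, 2),
  p_file = round(p_file, 2)
)

round(weights[10], 2)
# 6.19
```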
Rescore and Reorder Frequency Data
rescore_and_reorder_frequency_data(frequency_data, placeholder_size)
frequency_data |
data frame with columns |
placeholder_size |
size of placeholder in number of characters. The path length will be reduced by this value before being multiplied with the count to calculate the score. |
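The scoring rule described for placeholder_size can be sketched as follows. The column names path, length, count and score as well as the helper `rescore()` are hypothetical, because the required columns of frequency_data are not listed in this manual; only the formula (length minus placeholder size, multiplied by the count) is taken from the text above.

```r
# Hypothetical frequency data (column names are assumptions)
frequency_data <- data.frame(
  path = c("a/b", "a/b/c/d/e", "x"),
  count = c(10L, 3L, 50L),
  stringsAsFactors = FALSE
)
frequency_data$length <- nchar(frequency_data$path)

# Illustrative helper: score = (length - placeholder_size) * count,
# then order by decreasing score
rescore <- function(frequency_data, placeholder_size) {
  frequency_data$score <-
    (frequency_data$length - placeholder_size) * frequency_data$count
  frequency_data[order(frequency_data$score, decreasing = TRUE), , drop = FALSE]
}

result <- rescore(frequency_data, placeholder_size = 2)
result$score
# 21 10 -50
```

The short but very frequent path "x" ends up last: replacing it with a two-character placeholder would make the paths longer, not shorter, so its score is negative.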
Decreasingly sorted frequencies of strings, by default weighted by their length. This function can be used to find the most "important" folder paths in terms of frequency and length.
sorted_importance(x, weighted = TRUE)
x |
vector of character strings |
weighted |
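if TRUE (the default), the frequency of each string is multiplied by its length; if FALSE, the plain frequencies are used |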
named integer vector (of class table) containing the decreasingly sorted importance values of the elements in x. The importance of a string is either its frequency in x (if weighted is FALSE) or the product of this frequency and the string length (if weighted is TRUE)
strings <- c("a", "a", "a", "bc", "bc", "cdefg")

(importance <- kwb.pathdict:::sorted_importance(strings))

# Check that each input element is mentioned in the output
all(unique(strings) %in% names(importance))

# weighted = FALSE just returns the frequencies of strings in x
(importance <- kwb.pathdict:::sorted_importance(strings, weighted = FALSE))

# Check if the sum of frequencies is the number of elements in x
sum(importance) == length(strings)

# You may use the function to assess the "importance" of directory paths
kwb.pathdict:::sorted_importance(dirname(kwb.pathdict:::example_paths()))
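The weighted importance described above (frequency multiplied by string length) can be reproduced with `table()` and `nchar()`. This is a minimal base R sketch of the stated definition, not the package's own code.

```r
strings <- c("a", "a", "a", "bc", "bc", "cdefg")

# Frequency of each distinct string
freq <- table(strings)

# Weighted importance: frequency times number of characters
importance <- sort(freq * nchar(names(freq)), decreasing = TRUE)

names(importance)
# "cdefg" "bc" "a"

as.vector(importance)
# 5 4 3
```

Although "a" is the most frequent string, the single occurrence of "cdefg" ranks first because its length outweighs its low frequency.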
Do Subfolder List Elements Start with Given Folder Names?
starts_with_parts(parts, elements)
parts |
list of list of character as returned by
|
elements |
vector of character giving the sequence of strings to be found in parts |
vector of logical as long as parts containing TRUE at positions i for which all(parts[[i]][seq_along(elements)] == elements) is TRUE
parts <- strsplit(c("a/b/c", "a/b/d", "b/c"), "/")

starts_with_parts(parts, elements = c("a", "b"))
starts_with_parts(parts, elements = c("b", "c"))

subdir_matrix <- kwb.file::to_subdir_matrix(parts)

starts_with_parts(subdir_matrix, elements = c("a", "b"))
starts_with_parts(subdir_matrix, elements = c("b", "c"))
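For the list input, the condition given in the return value description can be applied directly with `vapply()`. The helper name `starts_with()` is hypothetical, and wrapping the condition in `isTRUE()` (so that paths shorter than elements yield FALSE instead of NA) is an assumption of this sketch.

```r
parts <- strsplit(c("a/b/c", "a/b/d", "b/c"), "/")

# Illustrative helper: does each path start with the given elements?
starts_with <- function(parts, elements) {
  vapply(
    parts,
    function(p) isTRUE(all(p[seq_along(elements)] == elements)),
    logical(1)
  )
}

starts_with(parts, elements = c("a", "b"))
# TRUE TRUE FALSE

starts_with(parts, elements = c("b", "c"))
# FALSE FALSE TRUE
```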
In the subdir matrix, each row represents a file path. The different parts of the paths (the folder names) appear in the different columns. For example, the paths "a/b/c" and "d/e" are represented by a matrix with values "a", "b", "c" in the first and "d", "e", "" in the second row. Each cell of the subdir matrix that is not empty gets a number. If two cells of one column have the same number, this means that the paths to the cells are the same. See example.
to_cumulative_id(subdirs)
subdirs |
matrix of subdirectory names, as returned by |
# Create a very simple subdirectory matrix
(subdirs <- matrix(byrow = TRUE, ncol = 4, c(
  "a", "b", "c", "d",
  "a", "b", "d", "",
  "a", "c", "d", "e"
)))

# Give each non-empty cell of the matrix an ID
kwb.pathdict:::to_cumulative_id(subdirs)

# You can read the matrix column by column. The highest number represents the
# number of different paths that reach up to the corresponding path level.
# 1st column: The starting parts of the paths in depth 1 are the same: "a".
#   All cells have ID = 1.
# 2nd column: There are two different paths to the folders in depth 2:
#   "a/b" (ID = 1) and "a/c" (ID = 2).
# 3rd column: There are three different paths to the folders in depth 3:
#   "a/b/c" (ID = 1), "a/b/d" (ID = 2), "a/c/d" (ID = 3).
# 4th column: There are only two out of three paths that reach depth 4:
#   "a/b/c/d" (ID = 1), "a/c/d/e" (ID = 2)
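The numbering explained in the comments above can be sketched in base R by building the path prefix column by column and numbering the distinct prefixes in order of first appearance. The helper `cumulative_id()` is hypothetical, and the use of NA for empty cells is an assumption; the value that the package assigns to empty cells is not stated here.

```r
subdirs <- matrix(byrow = TRUE, ncol = 4, c(
  "a", "b", "c", "d",
  "a", "b", "d", "",
  "a", "c", "d", "e"
))

# Illustrative helper: number the distinct path prefixes per column
cumulative_id <- function(subdirs) {
  ids <- matrix(NA_integer_, nrow(subdirs), ncol(subdirs))
  prefix <- rep("", nrow(subdirs))
  for (j in seq_len(ncol(subdirs))) {
    filled <- subdirs[, j] != ""
    # Extend each row's path by the folder name in column j
    prefix <- paste(prefix, subdirs[, j], sep = "/")
    # Same prefix -> same ID; empty cells keep NA (assumption)
    ids[filled, j] <- match(prefix[filled], unique(prefix[filled]))
  }
  ids
}

ids <- cumulative_id(subdirs)
ids[, 3]
# 1 2 3
```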
Create Dictionary from Unique Strings
to_dictionary(x, prefix = "a", leading_zeros = FALSE)
x |
vector of strings |
prefix |
prefix to be given to the keys in the dictionary. Default: "a" |
leading_zeros |
whether to make all keys in the dictionary have the same length by adding leading zeros to the keys. Default: FALSE |
# Define input strings
x <- c("elephant", "mouse", "cat", "cat", "cat", "mouse", "cat", "cat")

# Create a dictionary for the unique values in x
kwb.pathdict:::to_dictionary(x)

# Note that "cat" is the first entry because it has the highest "importance"
kwb.pathdict:::sorted_importance(x)
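How such a dictionary could be built from the importance ranking is sketched below in base R. The key format (prefix followed by a running number, e.g. "a1") and the helper `make_keys()` are assumptions for illustration; the exact keys generated by to_dictionary() are not shown in this manual.

```r
x <- c("elephant", "mouse", "cat", "cat", "cat", "mouse", "cat", "cat")

# Rank the unique strings by frequency times string length
freq <- table(x)
importance <- sort(freq * nchar(names(freq)), decreasing = TRUE)

# Hypothetical key generator: prefix plus running number, optionally
# zero-padded so that all keys have the same length
make_keys <- function(n, prefix = "a", leading_zeros = FALSE) {
  width <- if (leading_zeros) nchar(as.character(n)) else 1
  paste0(prefix, formatC(seq_len(n), width = width, flag = "0"))
}

dict <- stats::setNames(
  as.list(names(importance)),
  make_keys(length(importance))
)

dict[[1]]
# "cat"

make_keys(12, leading_zeros = TRUE)[c(1, 12)]
# "a01" "a12"
```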
Substitute Values that are in a Dictionary with their Keys
use_dictionary(x, dict, method = "full")
x |
vector of character |
dict |
list of key = value pairs. Values of this list that are found in
|
method |
method to be applied, must be one of "full" or "part".
If "full", the full values must match, otherwise the values in |
x in which values or parts of the values are replaced with their short forms as they are defined in the dictionary dict
# Define a vector of long values
x <- c("What a nice day", "Have a nice day", "Good morning")

# Define short forms for full or partial values
dict_full <- list(wand = "What a nice day", gm = "Good morning")
dict_part <- list(w = "What", nd = "nice day", g = "Good")

# Replace long form values with their short forms
kwb.pathdict:::use_dictionary(x, dict_full, method = "full")
kwb.pathdict:::use_dictionary(x, dict_part, method = "part")
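The difference between the two methods can be sketched in base R with `match()` for full matches and `gsub()` for partial matches. The helper `substitute_values()` is hypothetical and inserts the keys verbatim; whether use_dictionary() additionally marks the keys (for example with placeholder brackets) is not stated here, so the package's actual output may look different.

```r
x <- c("What a nice day", "Have a nice day", "Good morning")
dict_full <- list(wand = "What a nice day", gm = "Good morning")
dict_part <- list(w = "What", nd = "nice day", g = "Good")

# Illustrative helper: replace dictionary values by their keys
substitute_values <- function(x, dict, method = c("full", "part")) {
  method <- match.arg(method)
  values <- unlist(dict)
  if (method == "full") {
    # Replace only elements that are equal to a dictionary value
    keys <- names(values)[match(x, values)]
    return(ifelse(is.na(keys), x, keys))
  }
  # Replace every occurrence of a dictionary value within the elements
  for (key in names(values)) {
    x <- gsub(values[[key]], key, x, fixed = TRUE)
  }
  x
}

substitute_values(x, dict_full, method = "full")
# "wand" "Have a nice day" "gm"

substitute_values(x, dict_part, method = "part")
# "w a nd" "Have a nd" "g morning"
```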