---
title: "Understanding Encoding"
author: "Hauke Sonnenberg"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Understanding Encoding}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Read this first:

[Escaping from character encoding hell in R on Windows](https://dss.iq.harvard.edu/blog/escaping-character-encoding-hell-r-windows)

## Introduction

In our document on best practices in research data management we recommend
sticking to a very basic set of characters when [naming files and folders](https://kwb-r.github.io/fakin.doc/best-practices.html#data-storage-naming).

You may ask, why? You may never have had any problems when working with files
containing spaces or special characters. If this is the case for you, you are
lucky. You then most probably 

- do not exchange files between computers with different operating systems (e.g.
Windows vs. Linux) and/or with different regional settings (e.g. German vs. 
English vs. French vs. Bulgarian),

- do not automate tasks by programming.

## Example Sessions

What does R tell us about Encoding?

```{r eval = FALSE}
?Encoding
```

We learn that R offers a function `Encoding()` as well as the functions
`enc2native()` and `enc2utf8()`. How do these functions work?

Let's have a look at the examples given in the R documentation:

```{r error = TRUE}
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)
Encoding(x) <- "latin1"
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x, xx))
c(x, xx)

### The following now gives an error
Encoding(xx) <- "bytes"
xx # will be encoded in hex
cat("xx = ", xx, "\n", sep = "")
```
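
The help page also mentions `enc2utf8()`, which the example above does not use.
The following is a minimal sketch of what it does to a string that is marked as
latin1. The variable names `s_latin1` and `s_utf8` are chosen for illustration
only.

```{r}
# A string with one latin1 byte (0xE7, the letter c with cedilla),
# explicitly marked as latin1
s_latin1 <- "fa\xE7ile"
Encoding(s_latin1) <- "latin1"

# enc2utf8() re-encodes the string and marks the result as UTF-8
s_utf8 <- enc2utf8(s_latin1)
Encoding(s_utf8)

# Same number of characters, but a different number of bytes
nchar(c(s_latin1, s_utf8), type = "chars")
nchar(c(s_latin1, s_utf8), type = "bytes")

# The single latin1 byte e7 has become the two UTF-8 bytes c3 a7
charToRaw(s_latin1)
charToRaw(s_utf8)
```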

## Writing to and Reading from Files

Let's write a line containing German special characters to a text file. We use
the function `writeText()` from our 
[kwb.utils](https://github.com/kwb-r/kwb.utils)
package. This function does no more than call `writeLines()`, but it additionally
gives a message about it and returns the path to the file:

```{r}
text <- "Schöne Grüße"

test_file <- kwb.utils::writeText(text, tempfile(fileext = ".txt"))
```

And now read the line back with `readLines()`:

```{r}
readLines(test_file)
```

OK, no problem so far, because we used the same system, and therefore the same
native encoding, to write and read the file.
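
Things look different when the file comes from a system with another encoding.
The following sketch simulates this under the assumption that the file was
written in latin1. The names `demo_text` and `latin1_file` are made up for this
example. The encoding is declared when reading and the result is then converted
to UTF-8.

```{r}
demo_text <- "Sch\u00f6ne Gr\u00fc\u00dfe"

# Simulate a file written on a latin1 system: convert the string to latin1
# and write its bytes unchanged
latin1_file <- tempfile(fileext = ".txt")
latin1_bytes <- charToRaw(iconv(demo_text, from = "UTF-8", to = "latin1"))
writeBin(latin1_bytes, latin1_file)

# Declare the encoding when reading, then convert to UTF-8
lines_latin1 <- readLines(latin1_file, encoding = "latin1", warn = FALSE)
lines_utf8 <- enc2utf8(lines_latin1)

Encoding(lines_latin1)

# After the conversion the bytes match those of the original UTF-8 string
identical(charToRaw(lines_utf8), charToRaw(demo_text))
```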

Let's have a look at the file byte-by-byte:

```{r}
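# Open in binary mode: readChar() is meant for binary-mode connections
con <- file(test_file, "rb")

# Read one byte at a time, as many bytes as the original string occupies
sapply(seq_len(nchar(text, "bytes")), function(i) {
  readChar(con, 1, useBytes = TRUE)
})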

close(con)
```

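Wow, that is interesting. I did not know that! Standard (ASCII) characters are
stored as one byte each and only the special characters (`ö`, `ü`, `ß`) are
stored as two bytes each. This is how UTF-8 encodes these characters. In a
single-byte encoding such as latin1, each of them would take just one byte.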
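
The number of bytes per character can also be counted directly with `nchar()`.
This is a small sketch using a string of its own (`demo_chars` is a name chosen
for this example), so that it does not depend on the file written above:

```{r}
# Split a UTF-8 string into characters and count the bytes of each character
demo_chars <- strsplit("Sch\u00f6ne Gr\u00fc\u00dfe", split = "")[[1]]
byte_counts <- nchar(demo_chars, type = "bytes")

data.frame(char = demo_chars, n_bytes = byte_counts)

# Number of characters versus total number of bytes
length(demo_chars)
sum(byte_counts)
```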

Now read the file again, this time as a vector of `raw`:

```{r}
(raw_bytes <- readBin(test_file, "raw", 100))
```

## Conversion Between Numeric Codes and Characters

What characters do the numeric codes given as hexadecimal numbers in `raw_bytes`
represent?

Let's first see what the hexadecimal codes are in the decimal system:

```{r}
as.integer(raw_bytes)
```
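
The conversion also works on plain text representations of the codes. As a
small sketch (with `hex_codes` as an example input, not taken from the file
above), `strtoi()` converts hexadecimal strings to integers and `sprintf()`
converts them back:

```{r}
hex_codes <- c("c3", "b6", "0a")

# Hexadecimal strings to decimal integers: c3 = 12 * 16 + 3 = 195
decimal_codes <- strtoi(hex_codes, base = 16L)
decimal_codes

# Decimal integers back to two-digit hexadecimal strings
sprintf("%02x", decimal_codes)

# Decimal integers to a vector of raw bytes
as.raw(decimal_codes)
```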

And now let's convert the codes to characters:

```{r}
rawToChar(raw_bytes)
```

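The function `rawToChar()` seems to know how to interpret the sequence of byte
codes. In fact it only copies the bytes into a string. The result is displayed
correctly as long as the bytes are valid in the encoding of the current session
(here: UTF-8).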

Note that the last character (`10` in the decimal and `0a` in the hexadecimal
system) represents the newline character `\n`. 
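
If we know that the bytes are UTF-8, we can tell R so by setting the encoding
mark of the resulting string. The following sketch does this for the two bytes
of the letter `ö` (`oe_string` is a name chosen for this example):

```{r}
# The two bytes that represent the letter o with umlaut in UTF-8
oe_string <- rawToChar(as.raw(c(0xc3, 0xb6)))

# Encoding mark of the string as returned by rawToChar()
Encoding(oe_string)

# Declare the bytes to be UTF-8. This changes the mark only, not the bytes
Encoding(oe_string) <- "UTF-8"
Encoding(oe_string)

# Two bytes, but one character
nchar(oe_string, type = "bytes")
nchar(oe_string, type = "chars")
```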

What happens if I ask `rawToChar()` to convert the first and second byte
representing the `ö` character separately?

```{r}
rawToChar(raw_bytes[4])
rawToChar(raw_bytes[5])
```

And now together:

```{r}
rawToChar(raw_bytes[4:5])
```

There is an argument `multiple` to `rawToChar()`. It is `FALSE` by default.
What happens if we set it to `TRUE`?

```{r}
rawToChar(raw_bytes[4:5], multiple = TRUE)
```

As the documentation says, setting `multiple` to `TRUE` returns the single
characters instead of a single string.

How does this look if we convert the whole string?

```{r}
(characters_1 <- rawToChar(raw_bytes, multiple = TRUE))
```

Is this the same as splitting the original string into single characters?

```{r}
strsplit(text, split = "")[[1]]
```

No, here each German special character is shown as one character instead of two
separate bytes. But we get the same result as with `rawToChar()` when setting
the argument `useBytes` to `TRUE`:

```{r}
(characters_2 <- strsplit(text, split = "", useBytes = TRUE)[[1]])

identical(characters_1[-length(characters_1)], characters_2)
```
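
Besides byte codes there is a second kind of numeric code for a character: its
Unicode code point. As a small sketch, the base R functions `utf8ToInt()` and
`intToUtf8()` convert between a UTF-8 string and its code points
(`code_points` is a name chosen for this example):

```{r}
# One integer per character, not per byte
code_points <- utf8ToInt("Sch\u00f6ne")
code_points

# Back to one string, or to a vector of single characters
intToUtf8(code_points)
intToUtf8(code_points, multiple = TRUE)

# Code point 246 (hexadecimal f6) is stored as the two bytes c3 b6 in UTF-8
charToRaw(intToUtf8(246L))
```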

## Replace Special Characters with ASCII Characters

```{r}
raw_bytes

rawToChar(raw_bytes)

gsub("\xc3\xb6", "oe", rawToChar(raw_bytes))
gsub("\xc3\xb6", "oe", rawToChar(raw_bytes), useBytes = TRUE)
```

OK, it seems that we can replace special characters if we know their byte codes
(here: `c3` and `b6` for letter `ö`).
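
The same idea can be extended to several characters at once. The function
`replace_umlauts()` below is a hypothetical helper written for this vignette,
not part of any package. It assumes that the common German transliterations
(`ae`, `oe`, `ue`, `ss`) are wanted and uses Unicode escapes instead of byte
codes, so that the script itself contains ASCII characters only.

```{r}
# Hypothetical helper: replace German special characters by ASCII sequences
replace_umlauts <- function(x) {
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
  to <- c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss")

  # Make sure that the input is in UTF-8, just as the patterns are
  x <- enc2utf8(x)

  # Replace one special character after the other, as fixed strings
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }

  x
}

replaced <- replace_umlauts("Sch\u00f6ne Gr\u00fc\u00dfe")
replaced
```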

It will be helpful to have a function that shows the special characters and the
corresponding byte codes:

```{r error = TRUE}
### This function needs to be checked!
kwb.fakin::get_special_character_info(text)
```
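
Independently of the function above, the idea can be sketched in base R. The
function `special_character_bytes()` is a hypothetical name introduced here. It
assumes that "special" means any character that occupies more than one byte in
UTF-8 and is not meant to reproduce the behaviour of
`kwb.fakin::get_special_character_info()`.

```{r}
# Hypothetical helper: list the multi-byte characters of a string together
# with their UTF-8 byte codes
special_character_bytes <- function(x) {
  # Unique characters of all strings in x, converted to UTF-8
  chars <- unique(unlist(strsplit(enc2utf8(x), split = "")))

  # Keep only the characters that need more than one byte
  chars <- chars[nchar(chars, type = "bytes") > 1L]

  # Byte codes of each character as space-separated hexadecimal numbers
  bytes <- vapply(chars, FUN.VALUE = character(1), USE.NAMES = FALSE, function(ch) {
    paste(as.character(charToRaw(ch)), collapse = " ")
  })

  data.frame(char = chars, bytes = bytes, stringsAsFactors = FALSE)
}

special_info <- special_character_bytes("Sch\u00f6ne Gr\u00fc\u00dfe")
special_info
```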