Talk about string encoding

Formerly in R4DS

---

### String encoding

When working with non-English text, another common challenge is file encoding.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:

```{r}
charToRaw("Hadley")
```

Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
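
To make the byte-to-character mapping concrete, here is a small sketch using only base R.
The variable names are ours, chosen for illustration; it shows the same bytes as decimal integers and then goes in the reverse direction, from bytes back to text.

```{r}
# The raw bytes behind the string, shown in hexadecimal
bytes <- charToRaw("Hadley")
bytes

# The same bytes as decimal numbers: 0x48 is 72, 0x61 is 97, and so on
codes <- as.integer(bytes)
codes

# Going the other way: turn the first two bytes back into characters
first_two <- rawToChar(as.raw(c(0x48, 0x61)))
first_two
```

Because every character in "Hadley" is plain ASCII, each one occupies exactly one byte.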

Things aren't so easy for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emoji.
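
You can see the Latin1 versus Latin2 ambiguity for yourself with base R's `iconv()`.
This is a sketch rather than a recipe: the encoding names that `iconv()` accepts vary by platform, and we assume here that `"ISO-8859-1"` and `"ISO-8859-2"` are available (check `iconvlist()` if they aren't).

```{r}
# A single byte, 0xb1, with no encoding attached
b <- rawToChar(as.raw(0xb1))

# Interpret that byte under two different encodings, converting to UTF-8
as_latin1 <- iconv(b, from = "ISO-8859-1", to = "UTF-8")
as_latin2 <- iconv(b, from = "ISO-8859-2", to = "UTF-8")

as_latin1
as_latin2

# In UTF-8, each of these characters takes two bytes instead of one
charToRaw(as_latin1)
charToRaw(as_latin2)
```

The byte never changes; only the encoding used to interpret it does.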

readr uses UTF-8 everywhere.
This is a good default, but it will fail for data produced by older systems that don't use UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:

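```{r}
#| message: false
library(readr)

# Latin1 bytes: \xf1 is "ñ" in Latin1 but is not valid UTF-8 on its own
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)

# Shift-JIS bytes for a Japanese greeting
x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
```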

To fix the problem, you need to specify the encoding via the `locale` argument:

```{r}
#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))

read_csv(x2, locale = locale(encoding = "Shift-JIS"))
```

How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.

```{r}
guess_encoding(x1)
guess_encoding(x2)
```
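
When the guess is inconclusive, trying a few candidates by hand can look something like the following sketch.
It uses only base R, the list of candidates is an arbitrary choice for illustration, and it again assumes your platform's `iconv()` knows these encoding names.

```{r}
# The bytes of "El Ni\xf1o", built explicitly so there's no ambiguity
x <- rawToChar(as.raw(c(0x45, 0x6c, 0x20, 0x4e, 0x69, 0xf1, 0x6f)))

# First check: is this already valid UTF-8?
validUTF8(x)

# If not, decode the same bytes under each candidate and compare by eye
candidates <- c("ISO-8859-1", "ISO-8859-2")
decoded <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)
decoded
```

Both candidates decode without error, which is why you still need to read the results: only one of them produces a real word.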

Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more, we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
