Skip to content

Talk about string encoding #1449

Open
Open
@hadley

Description

@hadley

Formerly in R4DS


String encoding

When working with non-English text another common challenge is file encodings.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using charToRaw():

charToRaw("Hadley")

Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the American Standard Code for Information Interchange.

Things aren't so easy for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte b1 is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emoji.

readr uses UTF-8 everywhere.
This is a good default, but will fail for data produced by older systems that don't know use UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:

#| message: false
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)

x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)

To fix the problem you need to specify the encoding via the locale argument:

#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))

read_csv(x2, locale = locale(encoding = "Shift-JIS"))

How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides guess_encoding() to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.

guess_encoding(x1)
guess_encoding(x2)

Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more we recommend reading the detailed explanation at http://kunststube.net/encoding/.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions