Description
Formerly in R4DS
String encoding
When working with non-English text, another common challenge is file encodings.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using charToRaw():
charToRaw("Hadley")
Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the American Standard Code for Information Interchange.
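As a small illustration of the byte-to-character mapping described above, the sketch below uses only base R. It shows the same bytes as hexadecimal and as decimal codes, then converts them back to text. The variable names are illustrative only.

```r
# Raw bytes behind an ASCII string (printed in hexadecimal).
bytes <- charToRaw("Hadley")
bytes

# The same bytes as decimal ASCII codes.
codes <- as.integer(bytes)
codes

# Going the other way: bytes back to a string.
round_trip <- rawToChar(bytes)
round_trip

# A single byte, 0x48, decodes to "H" under ASCII.
first_letter <- rawToChar(as.raw(0x48))
first_letter
```

Because every character here is plain ASCII, each one occupies exactly one byte, so the string and the raw vector have the same length.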
Things aren't so easy for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte b1 is "±", but in Latin2, it's "ą"!
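To see that one byte really does mean different things under different encodings, here is a minimal sketch using base R's iconv(). It assumes your platform's iconv accepts the names "ISO-8859-1" and "ISO-8859-2"; the supported names vary by system, and iconvlist() shows what is available.

```r
# A one-byte string containing only the byte b1.
b1 <- rawToChar(as.raw(0xb1))

# Interpret that byte as Latin1 and as Latin2, converting each to UTF-8.
# Assumption: both encoding names are known to the local iconv.
latin1 <- iconv(b1, from = "ISO-8859-1", to = "UTF-8")
latin2 <- iconv(b1, from = "ISO-8859-2", to = "UTF-8")

latin1
latin2

# The resulting Unicode code points differ:
utf8ToInt(latin1)  # 177, i.e. U+00B1 (plus-minus sign)
utf8ToInt(latin2)  # 261, i.e. U+0105 (a with ogonek)
```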
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emoji.
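One way UTF-8 covers so many characters is by using a variable number of bytes per character. The sketch below inspects a few characters with charToRaw(). The strings are written with Unicode escapes and passed through enc2utf8() so that the result should not depend on your session's locale; the variable names are illustrative only.

```r
# Characters written as Unicode escapes, forced to UTF-8.
n_tilde <- enc2utf8("\u00f1")      # n with tilde
emoji   <- enc2utf8("\U0001F600")  # grinning face emoji

# ASCII characters keep their single-byte representation in UTF-8.
charToRaw("H")

# Characters outside ASCII take two or more bytes.
charToRaw(n_tilde)  # c3 b1
charToRaw(emoji)    # f0 9f 98 80
```

Note that the single Latin1 byte f1 and the two UTF-8 bytes c3 b1 both stand for the same character, which is why reading one encoding as the other produces garbled text.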
readr uses UTF-8 everywhere.
This is a good default, but it will fail for data produced by older systems that don't use UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
#| message: false
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
To fix the problem, you need to specify the encoding via the locale argument:
#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))
read_csv(x2, locale = locale(encoding = "Shift-JIS"))
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides guess_encoding() to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
guess_encoding(x1)
guess_encoding(x2)
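Trying a few encodings by hand can be sketched in base R with iconv(), which reinterprets the same bytes under each candidate. This is only an illustration of the trial-and-error process, not how readr works internally. The candidates vector is a hypothetical shortlist, and the encoding names assume your platform's iconv supports them.

```r
# Bytes for "El Ni?o" where ? is the single byte f1, as an older
# Latin1 system would have written the n with tilde.
bytes <- c(charToRaw("El Ni"), as.raw(0xf1), charToRaw("o"))
x <- rawToChar(bytes)

# These bytes are not valid UTF-8, which is why the default fails.
validUTF8(x)

# Hypothetical shortlist of candidate encodings to try.
candidates <- c("UTF-8", "ISO-8859-1", "ISO-8859-2")

# Reinterpret the bytes under each candidate and convert to UTF-8.
# iconv() returns NA when the input cannot be converted.
attempts <- vapply(
  candidates,
  function(enc) iconv(x, from = enc, to = "UTF-8"),
  character(1)
)
attempts
```

Both Latin1 and Latin2 decode without error here, but only one gives the intended word. A conversion that succeeds is not necessarily correct, so you still need to read the result and judge whether it makes sense.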
Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more, we recommend reading the detailed explanation at http://kunststube.net/encoding/.