Why does Zig choose to use 0xAA as the undefined marker byte instead of 0xCC? #15603

Closed
m13253 opened this issue May 6, 2023 · 6 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.

@m13253

m13253 commented May 6, 2023

I had a question when I came across #15585:
Why does Zig choose to use 0xAA as the undefined marker byte instead of 0xCC?

As far as I know, Microsoft’s C compiler emits 0xCC as marker byte in Debug mode [1].
They originally chose 0xCC because it decodes as the x86 instruction “int3”, so the debugger will interrupt if the CPU tries to execute the padding bytes (back when DEP wasn’t a thing yet).

However, although modern CPUs protect against executing data sections, 0xCC has already become a sub-culture, and many programmers can spot it during a debugging session:

  • Google search “858993460” immediately gives you a lot of questions and answers about this bug [2].
    (Actually it’s −858993460, but you can’t search negative numbers on Google.)
  • DuckDuckGo search “0xCCCCCCCC” brings you to a Wikipedia page that explains it [3].
  • Similarly, “╠╠╠╠” [4], “쳌쳌쳌쳌” [5], “昍昍昍昍” [6].
  • “ÌÌÌÌÌÌÌÌ” [7] also leads to results, but they are harder to find due to search engines’ Unicode normalization.
  • “Why does my string consist of this Korean character repeated over and over? – The Old New Thing” [8]
  • Japanese C/C++ developers feel funny about “フフフフフフフフ” [9], as it’s their laughing sound.
  • But nothing competes with “烫烫烫烫” (hot hot hot hot) [10], as almost all Chinese developers know this nursery rhyme:
    “手持两把锟斤拷, 口中疾呼烫烫烫。” (“Holding two metal pieces of � [11], I shouted: Hot! Hot! Hot!”)
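
A minimal Python sketch of where strings like these come from, assuming the garbage is simply a run of 0xCC bytes read through the wrong text encoding. The codec list is an illustrative choice, not something taken from the post:

```python
# Eight bytes of 0xCC, as a debug-mode fill might leave them in a buffer.
fill = bytes([0xCC]) * 8

# Python codec names; which codec corresponds to which string in the list
# above is an assumption made for illustration.
codecs_to_try = ["cp437", "latin-1", "utf-16-le", "gbk", "shift_jis"]

# Decode the same bytes under each encoding.
decoded = {name: fill.decode(name) for name in codecs_to_try}

for name, text in decoded.items():
    print(f"{name:>10}: {text}")
```

Single-byte codecs yield eight characters, while the two-byte ones (UTF-16 and GBK here) yield four. Shift_JIS produces the half-width katakana form of the character.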

That is to say, 0xCC is a well-known value with a long history and a sub-culture. When a novice developer learning to code sees this, searching for the value (either in numeric form or as text in any encoding other than UTF-8) will lead them to the solution.
However, 0xAA won’t lead these young programmers anywhere: I searched for “1431655766”, “2863311530”, “6148914691236517206”, and “12297829382473034410”. Only the third number led me to a couple of web pages owned by LLVM [12] or GCC [13], neither of which explains anything related to uninitialized variables.
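
The numbers searched above are the fill byte repeated across a 32-bit or 64-bit integer, read either as unsigned or as signed two's complement. A small Python sketch of that arithmetic (the helper name `as_ints` is hypothetical):

```python
def as_ints(fill_byte: int, width: int) -> tuple[int, int]:
    """Return (unsigned, signed) readings of `width` bytes of `fill_byte`."""
    raw = bytes([fill_byte]) * width
    unsigned = int.from_bytes(raw, "little", signed=False)
    signed = int.from_bytes(raw, "little", signed=True)
    return unsigned, signed

for fill_byte in (0xCC, 0xAA):
    for width in (4, 8):
        unsigned, signed = as_ints(fill_byte, width)
        print(f"0x{fill_byte:02X} x {width} bytes: unsigned={unsigned} signed={signed}")
```

Byte order does not matter here because every byte is identical.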

TL;DR: I want to propose we modify 0xAA into 0xCC.


Image: Screenshot of a C++ debugger. [Image source]

@andrewrk andrewrk added the proposal label May 6, 2023
@andrewrk andrewrk added this to the 0.12.0 milestone May 6, 2023
@m13253 m13253 changed the title Why does Zig choose to use 0xAA as the undefined padding byte instead of 0xCC? Why does Zig choose to use 0xAA as the undefined marker byte instead of 0xCC? May 6, 2023
@matu3ba
Contributor

matu3ba commented Jun 17, 2023

They originally chose 0xCC because it decodes as the x86 instruction “int3”, so the debugger will interrupt if the CPU tries to execute the padding bytes (back when DEP wasn’t a thing yet).

  1. Having int3 on one architecture and not on another is surprising behavior. For example, aarch64 crashes with an access violation. From a quick glance, 0xcc looks like a valid instruction on other architectures.
    Why should Zig now grant x86 and aarch64 a special status? Why not RISC-V EBREAK or ECALL?
    This feels like endless churn.

  2. 0x00 and 0xFF and values close to them are too common as data pattern and non-regular byte patterns are hard to catch by eye or debugger. 0xAA = 0b10101010 is trivial to spot like 0x55 = 0b01010101.

@m13253
Author

m13253 commented Jun 17, 2023

  1. Having int3 on one architecture and not on another is surprising behavior. For example, aarch64 crashes with an access violation. From a quick glance, 0xcc looks like a valid instruction on other architectures.

Filling the stack with int3 is no longer effective on modern x86 systems because the stack is normally non-executable there (DEP). It doesn’t matter whether the value originates from x86 or ARM or RISC-V, as it’s no longer executed nowadays.
It remains more of a cultural magic number than a real instruction. 0xCC is 0xCC because C/C++ developers recognize this number (and its derivatives like −858993460), not because of its origin as an instruction.

  2. 0x00 and 0xFF and values close to them are too common as data pattern and non-regular byte patterns are hard to catch by eye or debugger. 0xAA = 0b10101010 is trivial to spot like 0x55 = 0b01010101.

No. 0xAA and 0x55 aren’t easier to spot than 0xCC, except on an oscilloscope. Electronic engineers like 0xAA and 0x55 because they work with oscilloscopes.
In fact, as I stated in my post above, 0xCC decoded using any encoding or any radix is directly searchable on Google, leading to the exact information about how to debug such an issue. 0xAA has no such body of existing search engine results.

@m13253
Author

m13253 commented Jun 17, 2023

P.S.:
Additionally, even on an oscilloscope (which is impractical for a regular Zig developer), 0xCC is easier to spot than 0xAA.
Because 0xAA may be mistakenly recognized as a 1-cycle clock signal, but 0xCC is 0b11001100 (slowed to 2-cycle).
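
To make the bit patterns under discussion concrete, here is a short Python sketch that prints each candidate byte in binary and counts how often the level flips when the byte is repeated. The helper names `bits` and `transitions` and the choice of four repeats are illustrative assumptions:

```python
def bits(byte: int) -> str:
    """Eight-character binary string for one byte, most significant bit first."""
    return format(byte, "08b")

def transitions(byte: int, repeats: int = 4) -> int:
    """Count 0->1 and 1->0 flips in `repeats` back-to-back copies of the byte."""
    stream = bits(byte) * repeats
    return sum(a != b for a, b in zip(stream, stream[1:]))

for candidate in (0xAA, 0x55, 0xCC):
    print(f"0x{candidate:02X} = 0b{bits(candidate)}, "
          f"flips over 4 bytes: {transitions(candidate)}")
```

0xAA and 0x55 flip on every bit, while 0xCC flips on every second bit, which is the "2-cycle" pattern described above.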

@matu3ba
Contributor

matu3ba commented Jun 17, 2023

Filling the stack with int3 is no longer effective on modern x86 systems because the stack is normally non-executable there (DEP). It doesn’t matter whether the value originates from x86 or ARM or RISC-V, as it’s no longer executed nowadays.

I do agree that it's more of a niche. Incremental compilation and/or dynamic (re)loading via the linker utilizes it.

Additionally, even on an oscilloscope (which is impractical for a regular Zig developer), 0xCC is easier to spot than 0xAA.
Because 0xAA may be mistakenly recognized as a 1-cycle clock signal, but 0xCC is 0b11001100 (slowed to 2-cycle).

That sounds more convincing to me. Additionally, the alternative, 0x33 = 0b00110011, has a higher usage probability, so 0xCC sounds better.

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jul 9, 2023
@mnemnion

mnemnion commented Jun 9, 2024

I was considering opening a bikeshed issue about 0xaa, but there's no need for two.

The reason I was going to do so, is that I recently spotted a valid 0xaaaaaaaaaaaaaaaa in my own code. Nothing bad happened as a result, mind you, it's just a bitmask for odd numbers, but there it was.

My proposal is 0xa5a5a5a5a5a5a5a5. That makes the bit pattern 0b10100101; repeated twice, it's 10100101 10100101. This is even less likely to show up in a real program than the current 0xaaaaa etc. It's odd on both ends, which has some advantages, and the hex pattern is jagged, but observably regular, making it that much easier to spot. It's also a better "weird oscilloscope" pattern than either 0xaa or 0xcc, for what that's worth.
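
A quick Python check of the proposed pattern. Reading "odd on both ends" as "both the lowest and the highest bit are set" is an assumption of this sketch:

```python
FILL = 0xA5

# Binary form of the single fill byte.
pattern = format(FILL, "08b")

# The same byte repeated across a 64-bit word.
word = int.from_bytes(bytes([FILL]) * 8, "little")

low_bit = word & 1    # least significant bit of the word
high_bit = word >> 63  # most significant bit of the word

print(f"0x{FILL:02x} = 0b{pattern}")
print(f"64-bit fill = {hex(word)}, low bit = {low_bit}, high bit = {high_bit}")
```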

The argument for 0xcc hinges entirely on tradition in Windows programming. That isn't a good reason. Searching for "Zig -6148914691236517206" and "Zig 0xaaaaaaa" both turn up the meaning on the top page, and this would remain true no matter what the uninitialized value is.

Do I think it's important to change the value we're using? No, I don't. I do think a5 has some modest advantages, though. I got curious enough to see if there was ever an issue about the value, and seeing as there is one, and it proposes a change which (my opinion) would be worse than status quo, I figured it was worth a post to toss another hat into the ring.

On the other hand, it doesn't literally scream at you like AAAAAAAAAAA does. That might be sufficient reason to keep the current choice.

@andrewrk andrewrk closed this as not planned Jun 9, 2024
@andrewrk
Member

andrewrk commented Jun 9, 2024

I like 0xaa

4 participants