A little update to the bar receipt encoding mystery: I was looking at the wrong code page! Although I studied Ancient Greek at school, we didn’t learn about ancient Greek 8-bit encodings. Thanks for nothing, German education system!
It turns out that code page 737 is the common Greek 8-bit encoding, not code page 869. With that, we can better reconstruct what happened to the receipts full of question marks.
Example
Let’s compare a good and a messed-up receipt: `Φ.Π.Α.` turns into `”.?.€.`
When we look at the code pages involved:
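Code page 737 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
80 | Α | Β | Γ | Δ | Ε | Ζ | Η | Θ | Ι | Κ | Λ | Μ | Ν | Ξ | Ο | Π | 8F |
90 | Ρ | Σ | Τ | Υ | Φ | Χ | Ψ | Ω | α | β | γ | δ | ε | ζ | η | θ | 9F |

Code page 1252 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
80 | € | | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | | Ž | | 8F |
90 | | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ | | ž | Ÿ | 9F |

(Empty cells are unassigned byte positions.)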
…we can trace the conversions of each character.
- Letters `Φ` and `Α` are encoded as 94 and 80 (hexadecimal) in code page 737
- When bytes 94 and 80 get parsed as 1252 data, they map to `”` and `€`
- The dot `.` is at 2E in both code pages and stays intact
- Letter `Π` is 8F in 737
- But 8F is not assigned in code page 1252 (a gap in the table above)
- It gets replaced with a `?`
- Result: `”.?.€.`
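
The trace above can be replayed in a few lines of Python. This is only a sketch: it assumes Python's built-in `cp737` and `cp1252` codecs match what the cash register and the printer actually used, and the helper name `garble` is made up for illustration. Python marks unassigned bytes with U+FFFD rather than `?`, so the sketch swaps that in to mimic the receipt.

```python
def garble(text: str, source: str = "cp737", target: str = "cp1252") -> str:
    """Encode text with one code page and misread the bytes with another."""
    raw = text.encode(source)
    # errors="replace" turns unassigned bytes into U+FFFD;
    # the receipts show a plain "?" instead, so mimic that here.
    return raw.decode(target, errors="replace").replace("\ufffd", "?")


receipt = "Φ.Π.Α."
print(receipt.encode("cp737").hex(" "))  # 94 2e 8f 2e 80 2e
print(garble(receipt))                   # ”.?.€.
```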
Something is still missing
Or rather: too much is missing! In other examples of good vs. mixed-up texts, there are more question marks than we would expect.
Original | Ξ Ε Ν Ο Δ Ο Χ Ε Ι Α Κ Ω (Ν) | Σ Υ Ν Ο Λ Ο |
---|---|---|
Expected | ? „ Œ Ž ƒ Ž • „ ˆ € ‰ — | ‘ “ Œ Ž Š Ž |
Actual | ? „ ? ? ƒ ? ? „ ? € ? ? | ? “ ? ? ? ? |
So the “target” code page cannot be the 1252 encoding we know today. It must be a variant with more gaps, i.e. unassigned byte positions, leading to more question marks in the output.
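
To see exactly which positions the mystery code page must be lacking, one can diff the expected 1252 output against what the receipt shows. A rough sketch, assuming Python's `cp737` and `cp1252` codecs; the `samples` dictionary simply retypes the two words from the table (without the trailing `(Ν)`), and the names `garble` and `missing` are hypothetical.

```python
def garble(text: str, source: str = "cp737", target: str = "cp1252") -> str:
    """Encode text with one code page and misread the bytes with another."""
    raw = text.encode(source)
    return raw.decode(target, errors="replace").replace("\ufffd", "?")


# Original word -> what was actually printed (retyped from the table above)
samples = {
    "ΞΕΝΟΔΟΧΕΙΑΚΩ": "?„??ƒ??„?€??",
    "ΣΥΝΟΛΟ": "?“????",
}

# Greek letter -> 1252 character that should have appeared but came out as "?"
missing = {}
for original, actual in samples.items():
    expected = garble(original)
    for letter, exp, act in zip(original, expected, actual):
        if exp != "?" and act == "?":
            missing[letter] = exp

for letter, exp in missing.items():
    byte = letter.encode("cp737")[0]
    print(f"{letter} (byte {byte:02X}): expected {exp}, got ?")
```

Every line it prints is a byte that code page 1252 assigns but the unknown target apparently does not.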
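Uppercase Greek letters | Α | Β | Γ | Δ | Ε | Ζ | Η | Θ | Ι | Κ | Λ | Μ | Ν | Ξ | Ο | Π | Ρ | Σ | Τ | Υ | Φ | Χ | Ψ | Ω |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
737 interpreted as 1252 | € | ? | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | ? | Ž | ? | ? | ‘ | ’ | “ | ” | • | – | — |
Observed in the examples (incomplete) | € | | | ƒ | „ | | | ‡ | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | “ | ” | ? | | ? |
737 interpreted as 1253 (wild guess) | € | ? | ‚ | ƒ | „ | … | † | ‡ | ? | ‰ | ? | ‹ | ? | ? | ? | ? | ? | ‘ | ’ | “ | ” | • | – | — |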
While code page 1253 (the 1252 variant for Greek) matches somewhat better, it’s not a full match. Capital Kappa `Κ` maps to `‰`, but it should be a `?`, etc.
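
The two computed rows of that comparison can be regenerated like this. Again a sketch under the assumption that Python's `cp737`, `cp1252` and `cp1253` codecs are faithful to the real code pages; `reinterpret` and `GREEK_UPPER` are names invented here. The alphabet is typed out literally because Unicode has a hole between Ρ and Σ, so a simple code point range would not work.

```python
GREEK_UPPER = "ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ"


def reinterpret(text: str, target: str, source: str = "cp737") -> str:
    """Encode text as `source`, then misread the bytes as `target`."""
    raw = text.encode(source)
    # Unassigned bytes become U+FFFD in Python; show them as "?" like the receipt.
    return raw.decode(target, errors="replace").replace("\ufffd", "?")


print("letters:", GREEK_UPPER)
for target in ("cp1252", "cp1253"):
    print(f"{target}: ", reinterpret(GREEK_UPPER, target))
```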
Phew! So we’re searching for a 1252-like encoding…
- that contains `€ ƒ ‡ “ ”`
- but not `Œ Š ‹ Ž ‰ • — ‘ ’`
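
Those two criteria can be turned into a small filter over candidate encodings. This is a hedged sketch, not a solution: the candidate list (Python's `cp1250` to `cp1258` codecs) is purely an assumption, the names `MUST_HAVE`, `MUST_LACK`, `decode_byte` and `fits` are hypothetical, and an old or vendor-specific printer code page may not exist as a Python codec at all.

```python
MUST_HAVE = "€ƒ‡“”"       # characters that did show up on the receipts
MUST_LACK = "ŒŠ‹Ž‰•—‘’"   # characters that came out as "?" instead
CANDIDATES = [f"cp{n}" for n in range(1250, 1259)]  # assumption: Windows code pages


def decode_byte(value: int, codec: str):
    """Return the character at a byte position, or None if it is unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def fits(codec: str) -> bool:
    """True if the codec keeps MUST_HAVE at its 1252 positions and leaves MUST_LACK unassigned."""
    keeps = all(
        decode_byte(byte, codec) == char
        for char, byte in zip(MUST_HAVE, MUST_HAVE.encode("cp1252"))
    )
    drops = all(decode_byte(byte, codec) is None for byte in MUST_LACK.encode("cp1252"))
    return keeps and drops


matches = [codec for codec in CANDIDATES if fits(codec)]
print(matches)
```

Any other codec name Python knows can be appended to `CANDIDATES` to widen the search.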
I think I’ll start looking at the beach, with a drink that has ‰! :)