How to finally understand how to use Unicode.

This post is not going to be long. It’s not going to have any examples, details, or anything of the sort. It will just teach you what Unicode is, how to use it, what an encoding is and what UTF-8 means (hint: it’s an encoding). If you sort-of, semi-understand what Unicode is, this should clarify everything.

Understanding Unicode is really simple. The first thing you need to know is that Unicode is a standard for representing and handling text in most of the world’s writing systems. If you want to write a program that prints text in languages other than English, Unicode is how you should do it.

Things get a bit more confusing when you read up on it, though, and you hear things like encodings mentioned (UTF-8, UTF-16, etc.). What is all this?

The best way to think about Unicode is that it’s magic. Seriously. Other representations, like ASCII, are just bytes that correspond to characters on the screen, but Unicode is magical. It’s not represented in bytes, it just exists in another, divine realm by itself. Each Unicode character isn’t bytes or bits, it’s a magical, idealised thing that exists in the magical Unicode realm. These can be characters of any language, all programs understand them, they print them properly and everything works fine and magically, because it’s Unicode.

The problems start when you try to save Unicode, though. Since Unicode characters are magical, they can’t be represented in the mundane realm of computer bits, so you have to somehow convert them to a format the computer can understand. This is where the concept of an encoding comes in. The encoding is just a way to save the magical Unicode character so all its magic can be preserved and it can be restored to its former magic status when your application needs it again.

UTF-8 is an encoding, which means that it’s a way to convert divine Unicode characters to mundane bits and bytes. UTF-16 is another encoding, and there are many more. There are also encodings that don’t support the full Unicode set and only support a certain language. If you try to encode a Unicode character in an encoding that can’t handle its awesome magic, you’re going to get an error. Also, you obviously have to decode the character (i.e. restore its magic) the same way that you encoded it, otherwise the magic will be all wrong and it will kill you and you will die.
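This isn’t really an examples kind of post, but if you want to watch the magic happen once, here is a minimal sketch in Python 3. The language choice and the sample string are my assumptions for illustration, and nothing about the idea depends on them. It shows the same text turning into different bytes under UTF-8 and UTF-16, coming back intact when decoded the way it was encoded, failing to fit into ASCII, and coming out garbled when decoded the wrong way.

```python
# A Unicode string: idealised characters, no bytes involved yet.
# (The sample text is arbitrary; any non-English text would do.)
text = "καλημέρα ♥"

# Encoding converts the characters into mundane bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

# Same characters, different bytes, depending on the encoding.
assert utf8_bytes != utf16_bytes

# Decoding with the encoding you used restores the original text.
restored = utf8_bytes.decode("utf-8")
assert restored == text

# An encoding that can't represent these characters raises an error.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding with a different encoding doesn't restore the text.
# Latin-1 accepts any byte, so here you get garbage instead of an error.
garbled = utf8_bytes.decode("latin-1")
assert garbled != text
```

Nobody dies in this version, but the garbled result is what “the magic will be all wrong” looks like in practice.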

So, that was it. I hope you now have a better understanding of how Unicode works and why you didn’t understand it before.

It’s just because nobody told you how it worked.

Magic.

Stavros' Stuff

On programming and other things.

Conceived on Aug 3, 2010

Stavros

Guy who likes computers
