Unicode Confusion

Josh · September 18, 2017

I have started switching the engine over to unicode by replacing all occurrences of std::string with std::wstring. There are a bunch of little functions and variables that have to be changed (char to wchar_t, "" to L"", etc.) but it is pretty straightforward.

Lua 5.3 supposedly supports unicode strings but the manual states that the lua_getglobal() function accepts a char* parameter:
https://www.lua.org/manual/5.3/manual.html#lua_getglobal

There's a little information here but it is not very clear:
https://www.lua.org/manual/5.3/manual.html#6.5

So how are you supposed to make unicode work in Lua? Switching data back and forth between wstrings and strings is a recipe for disaster.

Josh · September 18, 2017

More confusion. Apparently std::string supports UTF-8 in C++11:

std::string msg = u8"महसुस";

So I guess Leadwerks already supports unicode and my job is done?
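
One caveat worth illustrating: a std::string holding UTF-8 is only a byte buffer, so size() counts bytes rather than characters. Below is a minimal sketch, assuming the string contains valid UTF-8. The helper name CountCodePoints is made up for illustration and is not part of Leadwerks. The test bytes are spelled out with escapes because C++20 changed u8 literals to char8_t, which no longer converts directly to std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold valid UTF-8.
// Continuation bytes look like 10xxxxxx, so every other byte starts a new code point.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the UTF-8 bytes of "महसुस" (E0 A4 AE, E0 A4 B9, E0 A4 B8, E0 A5 81, E0 A4 B8), size() reports 15 while CountCodePoints returns 5.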

Josh · September 18, 2017

I believe this code will successfully open a weird-character file on any platform:

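#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

std::string filename = u8"⺹.txt";

#ifdef _WIN32
// Windows: convert UTF-8 to UTF-16 and call the wide-character API
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
auto f = _wfopen(converter.from_bytes(filename).c_str(), L"rb");
#else
// Other platforms: fopen takes the UTF-8 bytes as they are
auto f = fopen(filename.c_str(), "rb");
#endif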

Roland · September 18, 2017

Why not just use wstring, wchar_t, wfstream, etc. and that's it? A search and replace, and maybe some modification here and there, and it's done.

Josh · September 19, 2017

Unicode sucks because it uses a variable character size. This makes search and replace operations very difficult. However, Linux does not accept wstrings in commands like fopen. At this point I am thinking we will store strings as wstrings and then convert to UTF-8 std::strings when calling Linux system commands. Why is everything in Linux designed as if computers have one kb memory?

The whole unicode design is idiotic. They made a very complicated system when all they had to do was use 2 bytes per character and have one number for every character. I guess making something that actually works would be "boring". Yes, I know there are ancient Vietnamese characters that are no longer in use that push the character count past 65,000 but who cares about that? Why should we handicap modern computing for a bunch of Vietnamese people who died three centuries ago? They're dead so they don't care, and if they had anything interesting to say it would have been made into a movie already.
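
A side note on the two-byte idea: UTF-16 is itself variable-width, because code points above U+FFFF are stored as two 16-bit units (a surrogate pair). Here is a minimal sketch of that encoding rule. EncodeUtf16 is a hypothetical name, and the input is assumed to be a valid Unicode code point outside the surrogate range.

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16.
// Assumes cp is a valid code point (not itself a surrogate, not above U+10FFFF).
std::u16string EncodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Split the remaining 20 bits across a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

EncodeUtf16(0x52DD) yields one unit, while EncodeUtf16(0x1F600) yields the pair 0xD83D, 0xDE00. Code that indexes a UTF-16 wstring by position therefore has the same variable-width concern as UTF-8, only less often.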

Roland · September 19, 2017

Aaah. I see. Thanks (y)

Josh · September 19, 2017

I got a window created with chinese characters but I can't print them out to the console:

wprintf(L"%ls \n", L"A wide string");
wprintf(L"%ls \n", L"勝遂記暮恐村日性周報著身催");
wprintf(L"Why? 为什么？\n");
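
One thing worth checking here, offered as a guess at the cause rather than a diagnosis: on Windows the C runtime leaves stdout in a narrow text mode by default, and wprintf then tends to drop characters it cannot convert. Switching the stream with _setmode and _O_U16TEXT is the documented way to get UTF-16 output. The sketch below assumes that is the issue. PrepareWideConsole is a hypothetical name, and the console font still has to contain the glyphs.

```cpp
#include <clocale>
#include <cstdio>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

// Hypothetical helper: prepares stdout for wide-character output.
bool PrepareWideConsole()
{
#ifdef _WIN32
    // After this call only wide functions (wprintf, std::wcout) may be used on stdout.
    return _setmode(_fileno(stdout), _O_U16TEXT) != -1;
#else
    // Adopt the user's locale so wide characters are converted to its encoding;
    // fall back to the default "C" locale if the environment names an unavailable one.
    if (!std::setlocale(LC_ALL, "")) std::setlocale(LC_ALL, "C");
    return true;
#endif
}
```

Call PrepareWideConsole() once at startup, before the first wprintf.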

Josh · September 19, 2017

Also fails:

DWORD i;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"勝遂記暮恐村日性周報著身催\n", 14, &i, NULL);

Josh · September 19, 2017

I'm thinking my console font probably just cannot display the characters.

I tried to write a wstring to a text file but that didn't work out too well either when I opened it in Notepad++.

Josh · September 19, 2017

Okay, I discovered if you want to write a wstring (utf-16) text file you have to first write an unsigned short integer 65279 to the file.
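
The value 65279 is 0xFEFF, the Unicode byte order mark. Here is a minimal sketch of writing a UTF-16 little-endian text file that starts with it. WriteUtf16File is a hypothetical helper name. It takes a std::u16string so the 16-bit layout does not depend on the platform's wchar_t size.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes text as UTF-16LE preceded by the BOM (U+FEFF, decimal 65279).
bool WriteUtf16File(const std::string& path, const std::u16string& text)
{
    std::ofstream file(path, std::ios::binary);
    if (!file) return false;

    // Emit each 16-bit unit low byte first, independent of host endianness.
    auto put16 = [&file](char16_t unit)
    {
        file.put(static_cast<char>(unit & 0xFF));
        file.put(static_cast<char>((unit >> 8) & 0xFF));
    };

    put16(0xFEFF);
    for (char16_t unit : text) put16(unit);
    file.flush();
    return file.good();
}
```

Writing u"Hi" this way produces the six bytes FF FE 48 00 69 00. Editors such as Notepad++ use the leading FF FE to detect the encoding.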

Josh · September 19, 2017

Sadly, Leadwerks 5 will not support 15th century Vietnamese computers.

Einlander · September 19, 2017

I found this blog post a few days ago through reddit to be insightful. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

It made me realise everything other than utf-16 is basically a beautiful hack.

It also speaks about the wcs functions in C++.

Josh · September 19, 2017

1 hour ago, Einlander said:

I found this blog post a few days ago through reddit to be insightful. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

It made me realise everything other than utf-16 is basically a beautiful hack.

It also speaks about the wcs functions in C++.

Thanks for the info. You're right, everything but UCS-2 (two byte) Unicode is a stupid idea because it means you are translating text through two layers of conversion. (The fact that some characters no one uses go beyond the 65,000 character limit does not matter.)

So in Leadwerks we will replace all strings with wstring, replace all Windows API calls with Windows API -W, and for Lua or Linux system calls we convert the wstring to UTF-8 (for opening files, etc.). Strings will be stored in files as UCS-2.
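
The wstring to UTF-8 step could look roughly like the sketch below. It is hand-rolled to avoid depending on std::wstring_convert, and WideToUtf8 is a hypothetical name rather than an existing engine function. It assumes wchar_t holds UTF-16 units where it is 16 bits wide (Windows) and whole code points where it is 32 bits wide (typical on Linux), and it does not validate its input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 (no validation of the input).
std::string WideToUtf8(const std::wstring& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        char32_t cp = static_cast<char32_t>(wide[i]);

        // With 16-bit wchar_t, merge a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size())
        {
            char32_t low = static_cast<char32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be passed to fopen on Linux or to a Lua function that takes a char*, since both treat it as an opaque byte string.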

It is interesting to see that all the tech enthusiasts keep claiming UTF-8 is the best but people who actually write software use UTF-16:
https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

AggrorJorn · September 19, 2017

3 hours ago, Josh said:

Sadly, Leadwerks 5 will not support 15th century Vietnamese computers.

Dammit, there goes the vast majority of my target audience...

Josh · September 20, 2017

This is getting very complicated and I am reconsidering this.

Why do we need Russian and Chinese characters?

Loading or saving a file.
Drawing text on the screen or in a GUI element.
Storing a variable for one of the above two purposes.

Do we really need to change every other string in Leadwerks in order to accommodate these goals, or can we simply add overloads for a few commands and use std::wstring for internal file path values?

Do we care if the user can name an entity "汽车" in the editor, or should they be expected to use latin characters for something like this?

I don't know if Lua 5.3 will really support unicode strings.

I don't know if the Steamworks commands use unicode at all. They all just accept a char* value.

I don't know if these will be stored the same way on Windows and Linux.

I still have 2991 errors in the engine to fix. At first I thought we should change every single variable but now I am not sure if that is a good idea.

I could just add a few commands like this and be done with it:

Widget::SetText(const std::string& text)
Widget::SetText(const std::wstring& text)
Context::DrawText(const std::string& text)
Context::DrawText(const std::wstring& text)
shared_ptr<Model> LoadModel(const std::string& path)
shared_ptr<Model> LoadModel(const std::wstring& path)

However, this means potentially a mix of std::string and std::wstring values will be present in the engine.
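
One way to keep the mix contained is to let the narrow overload decode its argument and forward to the wide one, so only a single code path stores text. Below is a minimal sketch, assuming narrow strings are UTF-8. The Widget class here is a stand-in with only a text member, and Utf8ToWide and GetText are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder: assumes valid UTF-8 input and does no error recovery.
std::wstring Utf8ToWide(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i++]);
        // The lead byte tells how many continuation bytes follow.
        std::size_t extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : lead >= 0xC0 ? 1 : 0;
        char32_t cp = extra == 3 ? (lead & 0x07) : extra == 2 ? (lead & 0x0F)
                    : extra == 1 ? (lead & 0x1F) : lead;
        for (std::size_t n = 0; n < extra && i < utf8.size(); ++n, ++i)
        {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        if (sizeof(wchar_t) == 2 && cp >= 0x10000)
        {
            // 16-bit wchar_t (Windows): store as a surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            out += static_cast<wchar_t>(cp);
        }
    }
    return out;
}

// Stand-in for the engine's Widget class; only the text member is modeled.
class Widget
{
public:
    void SetText(const std::wstring& wide) { text = wide; }
    // Narrow overload: decode once at the boundary, then share the wide path.
    void SetText(const std::string& utf8) { SetText(Utf8ToWide(utf8)); }
    const std::wstring& GetText() const { return text; }

private:
    std::wstring text;
};
```

With this shape, existing calls that pass a std::string or a string literal keep compiling, and the engine internals only ever see std::wstring.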

Einlander · September 20, 2017

I would make sure that the rest of the engine works with utf8 and let lua itself fail with the encoding. 5.3 has utf8 support https://www.lua.org/manual/5.3/manual.html#6.5 but it's not very robust.

Since it is still early, you do have the option to choose a Lua-derived language or a completely different language not based on Lua. As distasteful as it may be, the API is changing, the scripts will need to be updated, and it might be simpler to start over early with something else.

Who knows.

Edited September 20, 2017 by Einlander

Josh · September 21, 2017

13 hours ago, Einlander said:

I would make sure that the rest of the engine works with utf8 and let lua itself fail with the encoding. 5.3 has utf8 support https://www.lua.org/manual/5.3/manual.html#6.5 but it's not very robust.

Since it is still early, you do have the option to choose a Lua-derived language or a completely different language not based on Lua. As distasteful as it may be, the API is changing, the scripts will need to be updated, and it might be simpler to start over early with something else.

Who knows.

Then say goodbye to String::Split(), Lower(), Upper(), Mid() and all other string manipulation commands, and your file paths will have to be 100% exact or files will fail to load. UTF-8 is a fraud and its proponents should be imprisoned for crimes against humanity.
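
For what it's worth, splitting on an ASCII delimiter does carry over to UTF-8 unchanged: every byte of a multi-byte sequence is 0x80 or higher, so a byte search for '/' or ',' can never land inside a character. A minimal sketch, where this free-standing Split is a hypothetical stand-in for String::Split:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical Split: cuts a UTF-8 string at every occurrence of an ASCII delimiter.
// Safe for UTF-8 because bytes below 0x80 never occur inside a multi-byte sequence.
std::vector<std::string> Split(const std::string& utf8, char delimiter)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = utf8.find(delimiter, start);
        if (pos == std::string::npos)
        {
            parts.push_back(utf8.substr(start));
            break;
        }
        parts.push_back(utf8.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

Case conversion is a different matter: Lower() and Upper() for non-ASCII text need Unicode case tables whichever encoding is used, and Mid() by character index requires walking the string from the start.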

Einlander · September 21, 2017

Hey now, I like utf8, I have just never had to deal with coding anything Unicode on Linux. All OSes could have conflicting implementations. Is there a BSD/public domain lib that handles Unicode?

Josh · September 21, 2017

56 minutes ago, Einlander said:

Hey now, I like utf8, I have just never had to deal with coding anything Unicode on Linux. All os's could have conflicting implementations. Is there a bsd/public domain lib that handles Unicode ?

Haha, yeah that is the catch. It's basically a compressed format so traversing it is impossible.
