Brett's Blog: Unicode Converter

Friday, June 29, 2007

Unicode Converter

Please leave any comments here for my Unicode Converter Firefox extension.

21 Comments:

It's great, really helps me read those unicode texts.

By Photographic Enthusiasts Group, at Tuesday, 04 September, 2007
You are welcome to submit your extension on http://www.babelzilla.org if you wish to get more languages for free.

By Goofy, at Sunday, 18 November, 2007
Thank you very much, goofy...

By Brett, at Sunday, 18 November, 2007
Hey Brett, great addon. This is pretty much what I was hoping to eventually do with mine, but never had the time (or talents) to put effort into it. Good job.

By Anonymous, at Tuesday, 17 June, 2008
Hi Jordan,

Did you see my mention of you in the Readme's acknowledgments? No special skills here, just lucky to have some time on hand to study...

One remaining idea left for the extension is for me to develop an auto-import routine for the text files I'm using from Unicode now into Firefox's SQLite database so that the user can search for a description, say "ligature", and find all the characters with that in the description (and maybe speed up the Ajax access of descriptions in the chart view while I'm at it, especially so as to support the (optional!) 27MB Unihan file with all the CJK (Chinese-Japanese-Korean) characters). If it's in a database, correlating the other data the Unicode texts provide would also be facilitated, such as to allow the user to immediately find the upper case variant of a form, etc. But that is a project, perhaps for another day.
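As a rough illustration of the database idea above, here is a minimal sketch in Python rather than in the extension's own code. It fills an in-memory SQLite table from Python's bundled `unicodedata` module instead of the Unicode text files, and the table name, column names, and helper functions are all hypothetical:

```python
import sqlite3
import unicodedata

# Hypothetical schema: one row per character, with its description and
# (where a single-character one exists) its upper case variant.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE characters ("
    " code_point INTEGER PRIMARY KEY,"
    " description TEXT NOT NULL,"
    " uppercase INTEGER)"
)

# A small sample only: Latin letters plus the Latin ligature presentation forms.
sample = list(range(0x0041, 0x0250)) + list(range(0xFB00, 0xFB07))
rows = []
for cp in sample:
    ch = chr(cp)
    name = unicodedata.name(ch, None)
    if name is None:  # e.g. control characters have no name
        continue
    upper = ch.upper()
    upper_cp = ord(upper) if len(upper) == 1 and upper != ch else None
    rows.append((cp, name, upper_cp))
conn.executemany("INSERT INTO characters VALUES (?, ?, ?)", rows)


def search(term):
    """Return (code_point, description) rows whose description contains term."""
    cur = conn.execute(
        "SELECT code_point, description FROM characters"
        " WHERE description LIKE ? ORDER BY code_point",
        (f"%{term}%",),
    )
    return cur.fetchall()


def uppercase_of(ch):
    """Look up the stored upper case variant of ch, or None if there is none."""
    row = conn.execute(
        "SELECT uppercase FROM characters WHERE code_point = ?", (ord(ch),)
    ).fetchone()
    return chr(row[0]) if row and row[0] is not None else None


for cp, description in search("ligature"):
    print(f"U+{cp:04X} {description}")
```

SQLite's `LIKE` is case-insensitive for ASCII by default, which is why a lower case search term still matches the upper case descriptions.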

You're welcome to build on or contribute to the code as and if you like. Anyhow, really, thanks for the encouragement and feel free to drop any suggestions if you like.

By Brett, at Tuesday, 17 June, 2008
Thank you.

ၬၭၳ

By Anonymous, at Friday, 20 June, 2008
Hello and thanks for your thanks, zinyaw.

By Brett, at Thursday, 26 June, 2008
Hi,

I was wondering if someone could help me here. I'm using Firefox now, but I cannot figure out how to enter Unicode symbols in forums and comboxes and the like.

For example, let's say I want to enter the Unicode symbol for a "check box" right here in this comment area for all of you to see. I found the code is
U+2611
but I cannot seem to make it work/display. I know about the ASCII Alt codes, but it is a little different for Unicode. I type Alt and 2661 and Alt + 2661 but nothing works.

I saw your Firefox Unicode program, but I'm worried that if others don't have the program they won't be able to see my Unicode symbols. I have Vista, but I would imagine even XP is equipped to show basic Unicode symbols like the checkbox and such without having to download programs.

I also saw some people doing Unicode that displayed whole sentences backwards as well as upside down. That would be cool to learn as well.

Sorry if this is an annoying question, but I have looked all over Google with no luck.

Thanks

By Nick, at Wednesday, 22 October, 2008
Hi Nick,

You'll need to type an ampersand followed by '#' and then 'x', then the code, then a semicolon ('x' indicates hexadecimal, which is the system used in the U+ notation that you refer to). You can also reference a character without the 'x', but in that case only decimal digits are allowed (and the number will be different from the U+ value).

So, you can do &#x2611; to show ☑ (note that in my code, in order to show you the ampersand, I've had to escape each ampersand with "&amp;"). If I forget the 'x', it will display ਲ਼, since &#2611; is read as decimal and thus maps to a different character than the hexadecimal notation does (a single hex digit counts from 0 to 15, kind of like a deck of cards counts up to 13; in order to do so, hex uses A-F (like Jack, Queen, King, etc.) in addition to the digits 0-9, but sometimes, as in this case, the letters are not needed).
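To make the hex-versus-decimal difference concrete, here is a minimal sketch in Python (not part of the extension; it only assumes the standard library's `html` module) that resolves both forms of the reference:

```python
import html

# "&#x2611;" is a hexadecimal reference: U+2611, the checked ballot box.
hex_ref = html.unescape("&#x2611;")

# "&#2611;" is a decimal reference: 2611 in decimal is 0x0A33, a Gurmukhi letter.
dec_ref = html.unescape("&#2611;")

# The decimal reference that really means U+2611 is 0x2611 == 9745.
same_char = html.unescape("&#9745;")

print(hex_ref, hex(ord(hex_ref)))   # ☑ 0x2611
print(dec_ref, hex(ord(dec_ref)))   # ਲ਼ 0xa33
print(same_char == hex_ref)         # True
```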

I am not very familiar with how different operating systems handle input by Unicode if they do (thus my creating this extension).

You don't have to worry that using my program will affect whether others can see the characters or not. The only factors that decide that are 1) whether you have saved the file as Unicode/UTF-8 and 2) whether the person's browser (and its fonts) support Unicode (or whatever other program they use: in Microsoft Word, the font you need is Arial Unicode MS to support as many characters as possible, though even that doesn't support all). Unicode is increasingly becoming the standard, so there's no reason to go with anything else. If the character is obscure, though, it is possible that no one will be able to read it (yet). You'd have to experiment to find out what works on what systems and in what programs. Be aware, however, that some websites automatically escape entities, so if you try to use the entity form (i.e., with ampersands), it is possible the ampersands will be converted into escaped form and the entity will show up as literal text. In my program, you can convert an entity that you type into the real character and paste it into the textbox, etc. (though using the character depends on that website's form accepting the encoding correctly and displaying it in the correct encoding). As you can see, Google does it correctly. (If you go to Tools->Page Info in Firefox, you can confirm that the encoding is indeed UTF-8 Unicode to know whether the character will work correctly, assuming the submission process doesn't corrupt anything.)

The other choice you have to make is whether to type in the symbol directly or to use an entity. (My program can help you convert back and forth between them.) The advantage of the character itself is, of course, that it is directly readable when in text form (e.g., in a website, when you view the code of the website, it will also show that character in your text editor). But the problem with this is if you have an old text editor which doesn't yet support Unicode. That's why people bother with entities. The entity is equivalent to the character, but since it uses ASCII, it won't get messed up in display or when saving, if someone's editor doesn't handle the character directly. But the disadvantage is that the entity is ugly to read and it is unclear what it represents unless you happen to remember the code. In XML/XHTML, you can make up your own entities to represent different sequences (e.g., you could define &checkbox; to represent the checkbox symbol); you can read up on DTDs to find out more about that.
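The back-and-forth conversion between a character and its entity can be sketched in a few lines of Python. This is only an illustration using the standard library, not the extension's actual code, and the variable names are made up for the example:

```python
import html

text = "Done \u2611"

# Character -> decimal entity: every non-ASCII character becomes &#NNNN;
as_entities = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Character -> hexadecimal entity, which matches the U+ notation.
as_hex_entities = "".join(
    ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text
)

# Entity -> character.
round_trip = html.unescape(as_entities)

# What a site that escapes entities does to a typed entity: the ampersand
# itself is escaped, so the reader sees the literal text "&#x2611;".
escaped = html.escape("&#x2611;")

print(as_entities)      # Done &#9745;
print(as_hex_entities)  # Done &#x2611;
print(escaped)          # &amp;#x2611;
```

The entity forms are pure ASCII, which is why they survive editors and pipelines that would mangle the character itself.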

As far as backwards/upside down text, Unicode only relates to the abstract representation of a character (e.g., the letter 'a', regardless of whether it is the handwritten style or the typed style, as a font would determine), and not so much as to how it is displayed. I therefore don't believe you'll find Unicode characters which are technically called "Backwards 'a'" unless there really is an independent use for a backwards 'a' (e.g., if there is a special use in mathematics). However, given how many scripts are represented in Unicode, some of whose characters look like backwards English, some have taken advantage of this fact to find characters that look like upside-down equivalents.

There are, though, some special characters like combining diacritics which do nothing alone, but which, say, add a dot to the preceding character in a given position (e.g., at the bottom).
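As a small illustration of a combining diacritic, the following Python sketch (standard library only) attaches a combining dot below to the letter 'a' and then asks for the precomposed form:

```python
import unicodedata

base = "a"
dot_below = "\u0323"  # COMBINING DOT BELOW: attaches to the preceding character

# Two code points that display as a single letter with a dot underneath.
combined = base + dot_below

# NFC normalization swaps the pair for the single precomposed character
# when Unicode defines one (here U+1EA1).
composed = unicodedata.normalize("NFC", combined)

print(unicodedata.name(dot_below))
print(len(combined), len(composed))  # 2 1
print(unicodedata.name(composed))
```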

Hope that answers your questions...

By Brett, at Wednesday, 22 October, 2008
Brett,

Sorry for not getting back earlier. I greatly appreciate your response; it helped me a lot.

About the backwards/upside down thing, you were correct. After I read your comments I looked into it more, and the text is not actually upside down: since there are so many 'shapes/letters' in Unicode, there is bound to be one that looks like an upside down version of each letter of the English alphabet, and so the converter program does a simple substitution.
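The substitution trick can be sketched as follows in Python. The mapping here is a tiny, hand-picked sample and is only an assumption about how such converters work; a real one would cover the whole alphabet:

```python
# Look-alike characters borrowed from elsewhere in Unicode (mostly phonetic
# letters). Only a few letters are mapped; anything else passes through.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "h": "\u0265",  # LATIN SMALL LETTER TURNED H
    "m": "\u026f",  # LATIN SMALL LETTER TURNED M
    "n": "u",
    "u": "n",
}


def flip(text):
    """Reverse the text and swap each letter for its upside-down look-alike."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text))


print(flip("human"))
```

The string is reversed as well as substituted because text turned upside down also reads from the other end.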

By Nick, at Tuesday, 28 October, 2008
Great to hear it helped, Nick. One last thing I think is helpful to know (at least to answer the nagging question of why Unicode has different encodings). Unicode alone does not tell how a file should be encoded. There are even several standard ways to encode a file into Unicode.

Although UTF-8 is more common, UTF-16 is another fairly common encoding of Unicode you may come across (there are different varieties), but you really shouldn't typically need to bother with anything besides UTF-8, at least in saving files. UTF-16 is more efficient in handling the CJK (Chinese-Japanese-Korean) characters (takes less memory to store), but for English it is less efficient. A good ASCII file is also a good UTF-8 file, whereas ASCII will not be a good UTF-16 file, so UTF-8 is popular with Western audiences. However, European accented characters will still get garbled if you try to read, say, a file in ISO 8859-1 (a superset of ASCII) as UTF-8, though the plain English letters will not.
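These size and compatibility claims are easy to check. Here is a minimal Python sketch (the sample strings are arbitrary choices for the example):

```python
english = "plain text"
cjk = "\u6f22\u5b57"  # two CJK characters

# ASCII text is byte-for-byte identical when saved as UTF-8.
ascii_is_utf8 = english.encode("ascii") == english.encode("utf-8")

# CJK: 3 bytes per character in UTF-8, 2 in UTF-16.
cjk_utf8_len = len(cjk.encode("utf-8"))       # 6
cjk_utf16_len = len(cjk.encode("utf-16-le"))  # 4

# English: 1 byte per character in UTF-8, 2 in UTF-16.
english_utf8_len = len(english.encode("utf-8"))       # 10
english_utf16_len = len(english.encode("utf-16-le"))  # 20

# An ISO 8859-1 file with an accented letter is not valid UTF-8...
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# ...and a UTF-8 file read as ISO 8859-1 shows the familiar garbling.
mojibake = "caf\u00e9".encode("utf-8").decode("iso-8859-1")

print(ascii_is_utf8, decodes_as_utf8, mojibake)
```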

UTF-16 is simpler in concept than UTF-8, since, with one group of exceptions where two 2-byte pseudo-characters (reserved for this purpose) are treated as one, UTF-16 represents each character with a fixed amount of memory (2 bytes). So an earlier form of UTF-16 (UCS-2, which simply didn't offer the 4-byte pairs) was chosen for internal use in JavaScript and Java. Thus, in most cases, you can, for example, loop through a string in these languages, one "character" segment at a time, and each segment will indeed itself be a valid character. (The exception is for those pseudo-characters (called surrogates), but their use is rare enough that it is uncommon to need to address them, and doing so is not difficult.) UTF-16 readers do not get confused into having to decide whether to treat a surrogate as an independent character, because surrogates have no independent meaning: 2-byte values in that numeric range (D800 to DFFF in hex) are only used as surrogates, so a processor reading a value in the first half of that range will know that it must read one more 2-byte value to determine the complete character.
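For the curious, the surrogate arithmetic can be sketched like this in Python. The function names are made up for the example, and Python's own codec is used only to confirm the result:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def is_surrogate(unit):
    """True if a 16-bit value can only be half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


char = "\U0001F600"  # a character outside the 2-byte range
high, low = to_surrogate_pair(ord(char))
encoded = char.encode("utf-16-be")

print(hex(high), hex(low))  # 0xd83d 0xde00
print(encoded.hex())        # d83dde00
```

Characters at or below U+FFFF, such as "A", still take a single 2-byte unit.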

UTF-8 varies more frequently in the amount of memory used for each character (1-4 bytes), and it is able to get by doing this since it reserves certain bits on each byte to indicate how long the byte sequence representing a full character will be. So a UTF-8-aware program reading the file won't get confused by the variable width in determining where one character begins and another one ends: the first byte will indicate, for example, "this character is going to be 3 bytes long".
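Here is a minimal sketch, in Python, of how a reader can use the first byte to find character boundaries. The function names are hypothetical, and it assumes well-formed input (a real decoder also validates the continuation bytes):

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte < 0x80:           # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a valid UTF-8 lead byte")


def split_utf8(data):
    """Split UTF-8 bytes into characters using only the lead bytes."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n].decode("utf-8"))
        i += n
    return chars


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
print([len(c.encode("utf-8")) for c in split_utf8(sample)])  # [1, 2, 3, 4]
```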

I think I covered the above correctly, though I'm by no means an expert. Hope that helps...

By Brett, at Tuesday, 28 October, 2008
*** EN

Great tool, even if I do not understand all the options in English (I am French; users of other languages would surely like translations too) :)

Hi, the 2.5.3 update was expected ;) Thanks.

1) The window takes up too much space; it is hard to see everything! Temporary problem => I am waiting for the next update.

2) It lacks ergonomics in use.

Great tool, thank you very much anyway.

*** FR

Great outil, même si je ne comprends pas toutes les options en anglais (Je suis français ; d'autres utilisateurs dans d'autres langues aimeraient sûrement des traductions) :)

Salut, la mise à jour 2.5.3 était attendue ;) Merci.

1) La fenêtre prend trop de place ; Difficile de tout voir ! Problème temporaire => J'attends la prochaine mise à jour.

2) Manque d'ergonomie pour utilisation.

Great outil, merci beaucoup tout de même.

By Anonymous, at Sunday, 18 January, 2009
Hello Frédéric,

Thanks for your comments!

1) There is a French translation that is far along, but not finished yet. I could include the partial translation, but I don't know if that would just be annoying to people. What do you think? Although the interface could be translated soon, some parts would take a LONG time to translate, such as the database of character descriptions (by Unicode.org). By the way, if you know someone who could help translate, they can go to http://www.babelzilla.org/index.php?option=com_wts&Itemid=88&language=7&extension=4533&type=filelist

2) Unfortunately, layout, as you can see, is not my strong area. I've asked for some help among Firefox designers, though, and I'm really hoping to get some. We have help for translators, but no formal way that I know of to get help from designers. One other person complained about an apparently opposite layout problem, that the window didn't fit on the screen, so I obviously need to do something.

If you know any designers who can help, please send them my way! :)

By Brett, at Sunday, 18 January, 2009
What is the license of this plugin? I found an MIT license in content/overlay.js, but I can't find one for the other files, so standard copyright applies.

I'd like to recommend this to people having trouble editing non-western content on a non-UTF-8 wiki. But I first would like to know if this is libre software.

By Anonymous, at Friday, 13 February, 2009
Hello Samgee,

Yes, but I will now put it under LGPL 2.1 (and dual license it with MPL 1.1: http://www.mozilla.org/MPL/MPL-1.1.html ) ...

I plan to fix that with notices, etc., in the next version, but I hereby release it now under LGPL 2.1 and MPL 1.1... Sound ok? Prefer another license to be added? If so, feel free to let me know your reasons. But you can consider the whole Unicode Input Tool/Converter program to now be available under either of these open source licenses.

By the way, does the wiki you are working with allow entities to be passed in unescaped? If not, entities (if that's what you were planning to pass in) might not work anyways. Hopefully the wiki can be made to work with Unicode though!...

By the way, feel free to offer any other recommendations you can think of. Unfortunately, I don't know how to address some of the formatting complaints I've received about my latest update: it works fine for me in Windows (Vista), and I thought the latest release would be more flexible, but I hope I can fix it to work fine with other systems, if the formatting problems are caused by different system implementations.

And also feel free to pass any would-be translators my way or directly to http://www.babelzilla.org/ (though there is a lot to translate with all of the obscure language names, etc., and the tens of thousands of character descriptions were not localized by Unicode, so I couldn't give an option to download a different database of character descriptions.)

By Brett, at Friday, 13 February, 2009
The Unicode database is under its own license, so I'm curious whether I can package it together with GPL code (making a triple-license along with LGPL and MPL)? My program would be accessing its contents of course, so I don't know whether it would be rendered incompatible.

And while I'm asking, if I can't use the full GPL, do you know where I can find the text of a LGPL/MPL 1.1 dual license (I found a triple one), so I can paste it at the Addons site?

By Brett, at Friday, 13 February, 2009
Those licenses sound very good. Any reason for choosing (L)GPLv2.1 over (L)GPLv3? It doesn't matter much, but it just makes sense to me to go with the updated version.

The wiki troubles are described in this thread: http://lists.gnu.org/archive/html/gnewsense-users/2009-02/msg00029.html

As for recommendations: I'm sure you are already aware that the design is not very well suited for low resolutions (like 800x600). I wish I could help you with that, but that's not really my area.

The add-on site doesn't require you to put the whole license(s) there, does it? I think it would be sufficient to just link to each of the licenses separately. The most important thing is that the distributed files in the add-on are clearly licensed.

By Anonymous, at Sunday, 15 February, 2009
Hi Samgee,

My reasons for LGPLv2.1 are more just because I'm resistant to change. :) Actually, no, if you'd like to know my reasons (perhaps in part based on incomplete information), I've just made a post here, where I pose some questions in the first paragraph related to my uncertainty on this.

I'll have to look into making the window itself flex to below 800x600 size, thanks.

No, you're right, the add-on site doesn't require you to put the whole license there. They have a big backlog for approving extensions, so whenever I may get around to uploading the version with licenses included, it may still take a while to get approved.
But I do plan to do that for the next release. In the meantime, legally speaking, I believe that my giving permission here effectively means you or anyone encountering this blog can make it GPL yourself (at least based on the current version). (Of course, I'm not violating GPL by my current version not having the notices, since I own the copyright.) But again, I DO plan to release it under the licenses I said (though perhaps under GPL3 if I can be persuaded)...

By the way, as I recall, shifting from ISO-8859-1 to UTF-8 should be relatively painless for a site just using English, but it would mess up European accented characters and such.

best wishes!

By Brett, at Tuesday, 17 February, 2009
Ok, I read up a bit more on GPL 3, and think I feel comfortable enough releasing it under LGPL 3.

I've just uploaded a version to the Addons site which includes the license text in each source file (except for the Unicode data files which are under their own copyright).

I didn't bother with MPL, but my prior statement here can be considered as permission to release it under that license as well, if anyone wishes.

I also checked for height problems and didn't see anything which explicitly required a minimum height. I reduced the textbox size in a few places which could have led to a minimum height, however, and allowed for text overflow in another area, so hopefully that might have fixed the problem, but it's hard for me to know.

By Brett, at Saturday, 21 February, 2009
I had to increase my screen resolution from 1024x768 to see the ok and cancel buttons.

Here are the error messages from Console² extension when opening Unicode Input Tool/Converter 2.5.6:

* Warning: XUL box for hbox element contained an inline #text child, forcing all its children to be wrapped in a block. Source file: chrome://charrefunicode/content/uresults.xul

* Database [xpconnect wrapped nsILocalFile] doesn't exist

* Warning: XUL box for caption element contained an inline #text child, forcing all its children to be wrapped in a block. Source file: chrome://charrefunicode/content/uresults.xul

Fx version:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10

By Morat, at Monday, 11 May, 2009
It's really great to have people take the time to give feedback, and good information at that. Unfortunately, though I've solved those two warnings (the error is probably because you haven't downloaded the Unihan database; see preferences), this doesn't solve the height problem, and I don't know how to solve it. I am not fixing the height, so I don't know what's happening... :(

By Brett, at Monday, 11 May, 2009
