HTTP payload character encoding requires odd workaround

Post by davidmark » Tue Nov 15, 2016 11:11 pm

Have long had a problem with Unicode strings passed from the Web server plugin. When I passed a Unicode string (e.g. u'Señor') to an event from a Python code module, all was well; but when the same event received its payload via HTTP, I could see in the log that it was a different Unicode string with some sort of odd encoding (the middle character was replaced with two numeric escape sequences). After extensive experimentation and searching around, I found that this fixed it:

text.encode("windows-1252").decode("utf-8")

...and the A with the hat character in this example somewhat matched what I saw when trying to display the string without these two extra calls: SeÃ±or

http://stackoverflow.com/questions/1311 ... -in-python

I'm no Python expert by any stretch, but this seems like a hard way to go. I would expect to see u'Señor' in the payload on HTTP events and not have to pass the result through the above two methods. Worst case scenario should be calling the second decode; having to call both seems to indicate a bug. Perhaps not, but expect it would be good to document this (or at least explain it here).

Post by kgschlosser » Wed Nov 16, 2016 2:57 am

Did you happen to simply try wrapping the UTF-8 encoded Unicode string in str()?


Unicode is a bear to deal with because there are only 128 characters in the standard ASCII set (256 in an 8-bit code page), and other languages have special characters. I don't understand why you are encoding and then decoding; you should be able to just decode it, or encode it. I can never remember which is which, but I think encoding makes a str() and decoding makes a unicode(). If you know the language, you should be able to use just one codec, like "latin-1". I don't know a lick of how all this encoding and decoding stuff works. I am trying to learn it, but there are so many different examples and articles on this, and a lot of them contradict each other.

Post by davidmark » Wed Nov 16, 2016 9:25 am

kgschlosser wrote: Did you happen to simply try wrapping the UTF-8 encoded Unicode string in str()?


Unicode is a bear to deal with because there are only 128 characters in the standard ASCII set (256 in an 8-bit code page), and other languages have special characters. I don't understand why you are encoding and then decoding; you should be able to just decode it, or encode it. I can never remember which is which, but I think encoding makes a str() and decoding makes a unicode(). If you know the language, you should be able to use just one codec, like "latin-1". I don't know a lick of how all this encoding and decoding stuff works. I am trying to learn it, but there are so many different examples and articles on this, and a lot of them contradict each other.


Tried everything. Nothing worked except the snippet I posted. Best clue is in the log as passing u'Señor' works fine and displays as:

u'Se\xf1or'

...But the payload for the HTTP event shows:

u'Se\xc3\xb1or'

...This seems to indicate a mixup between UTF-8 and Windows-1252.

https://en.wikipedia.org/wiki/Windows-1252
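
For anyone following along, here is a minimal sketch of the suspected mix-up, assuming the text goes over the wire as UTF-8 bytes and is then decoded with a single-byte codec. The variable names are made up for illustration and this is not the plug-in's code. It is written so it also runs on Python 3, although EventGhost itself uses Python 2.

```python
# Hypothetical reproduction of the mix-up; names are illustrative only.
original = u"Se\xf1or"                  # u'Señor', one code point for the n-tilde

# Assumption: the sender transmits the text as UTF-8 bytes.
wire_bytes = original.encode("utf-8")   # n-tilde becomes the two bytes C3 B1

# Decoding those bytes with a single-byte codec turns each byte into
# its own character, so the string grows from five to six characters.
misread = wire_bytes.decode("latin1")

# The workaround reverses the wrong decode, then decodes correctly.
repaired_cp1252 = misread.encode("windows-1252").decode("utf-8")
repaired_latin1 = misread.encode("latin1").decode("utf-8")
```

For this particular character both codecs round-trip, because neither C3 nor B1 falls in the 0x80-0x9F range where Windows-1252 and latin1 differ.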

The containing StackOverflow article seems to have some clues related to Windows-1252 and UTF-8, but I don't have time to look into this any further right now. Here is the document often referenced by similar articles:

https://docs.python.org/3/howto/unicode.html

Also, remember the Broadcaster issue? Seems the supplied workaround is just treating the symptom. The exceptions being caught are related to encoding and decoding Unicode.

http://www.eventghost.org/forum/viewtop ... 8&start=30

Somebody should fix this ASAP as UDP and HTTP are both broken for commonly occurring Unicode characters. One curly quote is all it takes to either throw an exception (now caught in Broadcaster) or munge the result (in the Web server). I can't entirely explain my workaround, but it fixed all of my problems related to OSD, email and speech. Suspect it will lead to a general fix for the related EG code.

Good luck! It seems Python is a real bear when it comes to character encoding. Certainly I just got lucky.

PS This forum lost my changes when my session timed out the first time I wrote this. :(

Post by davidmark » Wed Nov 16, 2016 6:09 pm

Damn this forum. It lost my post on a session timeout again. This BB thing sucks Beyond Belief.

Anyway, I had a detailed post, but will have to settle for the hint that it's worse than I originally thought. My workaround for the accented characters chokes on the apparently UTF-8 encoded curly quotes. I now have an even worse workaround that involves replacing the curly quotes, hyphens, etc. before doing the original workaround. Ugh.

This is an example of the payload shown in the print log:

u"\xe2\x80\x98"

...Now, I played around with that for an hour trying to turn it into a usable single quote. All I got were errors or a strange A-like character. On the other hand:

b"\xe2\x80\x98"

...is easily turned into the single quote (decode method IIRC). All I can figure is the above is not what's intended. I'm sure there's some way to make Python convert it to a usable string, but none of this should be necessary. Regardless, does anyone have a clue about that?
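
A possible explanation, sketched under the assumption that the payload is UTF-8 that was decoded as latin1: the code points U+0080 and U+0098 have no byte in Python's Windows-1252 codec (those byte values are assigned to other characters there), so encoding with "windows-1252" fails, while "latin1" maps every code point up to U+00FF straight back to the same byte. The names below are illustrative only.

```python
# Hypothetical check of why the curly quote breaks the cp1252 workaround.
garbled = u"\xe2\x80\x98"    # three code points, as shown in the print log
raw = b"\xe2\x80\x98"        # the same three values as actual bytes

# The byte string decodes directly to LEFT SINGLE QUOTATION MARK (U+2018).
from_bytes = raw.decode("utf-8")

# Windows-1252 cannot encode U+0080 or U+0098, so this attempt fails.
try:
    garbled.encode("windows-1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# latin1 maps U+0000..U+00FF one-to-one onto bytes, so it always succeeds here.
fixed = garbled.encode("latin1").decode("utf-8")
```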

Looked at the Web server plug-in (or a Web server plugin as there are two in my installation) on Github. Will go way out on a limb and point at this line as the culprit:

queries = [unquote_plus(part).decode("latin1") for part in queries]

Why would it use "latin1" instead of "utf-8"? If that's not it, there's something else silly going on with that plug-in (and Broadcaster as well). I'll try changing it the next time I feel like restarting EG.
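
To illustrate the difference the codec makes, here is a rough sketch using Python 3's urllib.parse.unquote_plus, which takes the codec as an encoding argument. This is an assumption-laden stand-in: the plug-in runs on Python 2, where unquote_plus returns a byte string that is decoded afterwards, and the sample query part below is invented.

```python
from urllib.parse import unquote_plus

# Invented query-string part: "Señor" plus a curly-quoted word,
# percent-encoded as UTF-8 the way browsers normally send it.
part = "Se%C3%B1or+%E2%80%98hi%E2%80%99"

# Interpreting the percent-decoded bytes as latin1 gives one character per byte.
as_latin1 = unquote_plus(part, encoding="latin-1")

# Interpreting them as UTF-8 recovers the intended text.
as_utf8 = unquote_plus(part, encoding="utf-8")
```

With "latin-1" the result contains the same escape sequences seen in the log; with "utf-8" it is the expected string.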

Thanks!

Post by davidmark » Wed Nov 16, 2016 6:34 pm

Yep, the "latin1" decoding in the Web server plug-in is the problem.

This appears to solve the whole thing:

text.encode("latin1").decode("utf-8")

So please change the decoding in the plug-in to "utf-8". Thanks!
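
One caveat worth sketching: once the plug-in decodes as UTF-8, payloads arrive correct, and running the workaround on an already-correct string raises an exception. A defensive wrapper along these lines may help in the meantime. The function name is hypothetical, and it assumes the input is a Unicode string (unicode in Python 2, str in Python 3).

```python
def repair_latin1_mojibake(text):
    """Undo a latin1-for-UTF-8 mix-up; return text unchanged if that fails.

    Hypothetical helper, not part of EventGhost or the Web server plug-in.
    """
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF (already decoded
        # properly) or its bytes are not valid UTF-8 (nothing to repair).
        return text


garbled_result = repair_latin1_mojibake(u"Se\xc3\xb1or")   # garbled payload
correct_result = repair_latin1_mojibake(u"Se\xf1or")       # already correct
quote_result = repair_latin1_mojibake(u"\u2018")           # already correct
```

This is a heuristic: a correct string that happens to look like valid UTF-8 when encoded as latin1 would be altered, so it is best removed once the plug-in itself is fixed.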

Post by davidmark » Wed Nov 16, 2016 6:47 pm

I created a pull request on Github. No idea how that affects your distribution; I've read there are issues there.

The change in the repository is awaiting approval by "Blackwind". Good luck and thanks for trying!

Post by Sem;colon » Wed Nov 16, 2016 11:05 pm

The webserver plugin on github is not the latest version, it would make sense to first merge it with the version in the webserver plugin support topic.

Post by davidmark » Wed Nov 16, 2016 11:07 pm

Sem;colon wrote:The webserver plugin on github is not the latest version, it would make sense to first merge it with the version in the webserver plugin support topic.


Fair enough, but somebody else will have to deal with that one. I just changed the one line on the version on GitHub.

Post by kgschlosser » Thu Nov 17, 2016 4:02 am

If you could go into the webserver thread and post the fix there, that would be very helpful. That way it gets to the right place and the author of the plugin can apply the fix.

I do thank you for looking into this. It's always a losing battle when it comes to Unicode encoding, because there is no 100% reliable method to figure out what a string has been encoded with.


And what you were seeing in the log makes sense: the character was sent as a two-byte UTF-8 sequence, and two hex escapes in place of one character is the giveaway that the bytes were decoded with a single-byte encoding like latin1.

Post by davidmark » Thu Nov 17, 2016 10:11 am

kgschlosser wrote: If you could go into the webserver thread and post the fix there, that would be very helpful. That way it gets to the right place and the author of the plugin can apply the fix.


It's just one string literal, but I suppose I can do it tomorrow.

kgschlosser wrote: I do thank you for looking into this. It's always a losing battle when it comes to Unicode encoding, because there is no 100% reliable method to figure out what a string has been encoded with.


No problem.

kgschlosser wrote: It's always a losing battle when it comes to Unicode encoding, because there is no 100% reliable method to figure out what a string has been encoded with.


No, UTF-8 is the standard for this; "latin1" was clearly an incorrect choice. As would be expected, my workaround fixed all of my issues related to this bug, and the plug-in update will fix the same issues without requiring other users to encode or decode anything.

https://en.wikipedia.org/wiki/Query_string
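
As a small illustration of the sending side, Python 3's urllib.parse.quote percent-encodes non-ASCII text as UTF-8 by default, which is consistent with the bytes seen in the payloads above. This is only a sketch and says nothing about what any particular client actually sends.

```python
from urllib.parse import quote, unquote

# quote() encodes the n-tilde as its two UTF-8 bytes, C3 and B1.
encoded = quote(u"Se\xf1or")

# unquote() decodes as UTF-8 by default, so the round trip is lossless.
decoded = unquote(encoded)
```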

Thanks!