  Thread: Story UTF-8 problems.

  1. #1

    Story UTF-8 problems.

    Aside from the asinine requirement of no ascii... For some reason my (actually UTF-8 encoded) story cannot be uploaded.
    Re-encoded it again using notepad++, however upon upload it STILL claims the file is not UTF-8.

    A proper method would be to server side re-encode things to UTF-8. It may garble stories with characters that aren't basic ASCII, but it is still better than a hard failure.

    Also when I hit back, I am not able to select the story again, the open dialog pops up, but when I select the text file, it does not populate the element.

    Also, might I suggest allowing uploads of flat html (with basic stripping of external urls and javascript) and EPUB?

    also you can view the document at http://evil-spork.net/nesetalis/Desires_of_Demons1.txt
    Last edited by nesetalis; 02-16-2013 at 03:18 PM.

  2. #2
    Retired Staff catalepsy's Avatar
    Join Date
    Nov 2012
    Location
    ::1
    Posts
    71
    Thanks for the report. We'll look into the back-button issue. Also, future improvements will include other upload types. I agree that the hard failure is a little off-putting. The really proper method is for the site to return a more-informative error that highlights the problem characters; this is again an improvement for a future release.

    I've looked at your story, too. While it has a UTF-8 byte-order mark, it is not properly encoded: there are Latin-1 characters scattered throughout, mostly the curly apostrophes. This is one of the hard problems with trying to determine character encoding by looking at the file, by the way: it says it's UTF-8 but has a Latin-1 bytestream. The other problem with "server-side re-encoding things" as UTF-8 is that the client does not transmit the character set, and it is not possible to reliably determine this by looking at the file. Assuming that it's Latin-1 (or its cousin, Windows-1252) and transcoding under that assumption is not better than rejecting badly encoded files outright; this is how mojibake propagates, and I'd rather not contribute to that.
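
To make the mismatch concrete, here is a minimal Python sketch. The sample bytes are an assumption, since the thread does not show the actual file contents: a UTF-8 byte-order mark followed by text whose apostrophe is the single Windows-1252 byte 0x92. The helper name `is_valid_utf8` is hypothetical. The sketch shows that a strict decode rejects such a file, and what decoding under a wrong guess does to text that was valid to begin with.

```python
import codecs

# Assumed sample: UTF-8 BOM, then an apostrophe stored as the single
# Windows-1252 byte 0x92 rather than the UTF-8 sequence E2 80 99.
upload = codecs.BOM_UTF8 + b"It\x92s a story"


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8-sig")  # "utf-8-sig" skips a leading BOM if present
        return True
    except UnicodeDecodeError:
        return False


claims_utf8 = upload.startswith(codecs.BOM_UTF8)  # the file "says" UTF-8
actually_utf8 = is_valid_utf8(upload)             # but the bytes disagree
print(claims_utf8, actually_utf8)

# Mojibake: correctly encoded UTF-8 decoded under a wrong Windows-1252 guess.
correct = "It\u2019s".encode("utf-8")  # b"It\xe2\x80\x99s"
garbled = correct.decode("cp1252")     # the one apostrophe becomes three characters
```

If `garbled` is then re-encoded and stored, the damage is permanent, which is the propagation the post warns about.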

    Finally, plain ASCII is a subset of UTF-8, so if that's really what you're uploading you'll never run into this problem.
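
The subset claim is easy to check. A small Python sketch (the sample string is just an illustration):

```python
text = "Plain ASCII story text."

# Encoding ASCII-only text gives identical bytes under both codecs.
same_bytes = text.encode("ascii") == text.encode("utf-8")

# Every one of the 128 ASCII byte values decodes identically as UTF-8.
all_ascii = bytes(range(128))
same_decoding = all_ascii.decode("ascii") == all_ascii.decode("utf-8")

print(same_bytes, same_decoding)
```

Any byte at 0x80 or above is outside ASCII, so a file that trips a UTF-8 check contains at least one non-ASCII byte.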

  3. #3
    hmm.. odd, thought I had switched out all the latin characters.
    But yes you are right, there are a few. And no, the proper method is not letting the user handle it. 99% of users don't have a clue what UTF-8 is, let alone how to properly re-encode something. If your server wants UTF-8, then accept a file, determine the byteorder, and check for non-encoding characters. Either display them as a ? like most displays do, or find/replace with regex the common ones (such as curly apostrophes.)
    We are in the age of facebook and twitter. Expecting the users to know encoding and code pages is a little unreasonable. Especially so considering that even programmers, for the most part, have trouble using UTF-8 correctly, so how can you expect the average joe to do it? If your site is UTF-8 only, then you HAVE to handle it gracefully, or at least fail in such a way that the user doesn't really care.

  4. #4
    Retired Staff catalepsy
    Quote Originally Posted by nesetalis View Post
    If your server wants UTF-8, then accept a file, determine the byteorder, and check for non-encoding characters. Either display them as a ? like most displays do, or find/replace with regex the common ones (such as curly apostrophes.)
    The problem lies in determining the encoding of the file; like I said earlier, the client does not send this information. I don't much like the idea of guessing, especially when files wind up with both UTF-8 and Latin-1 characters in the data. Users probably don't want us throwing away all of their accents and special characters and replacing them with ?-marks (what if someone uploads a file written entirely in Japanese?), so I'm loath to do that. I am also not on board with blindly accepting data that says it's UTF-8 when it isn't (say the previous Japanese user stuck a BOM on their Shift-JIS text?). Finally, hacking at mixed-encoding strings with regex is the antithesis of handling UTF-8 correctly (it is very much a PHP solution), so I'm not sure why you're recommending it.
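
A rough Python sketch of the "replace with ?-marks" approach and why it worries me. The sample inputs and the helper name `decode_lossy` are made up for illustration; Python's `errors="replace"` substitutes U+FFFD, the usual replacement character, rather than a literal question mark.

```python
import codecs


def decode_lossy(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return data.decode("utf-8-sig", errors="replace")


# Mostly-ASCII story with one stray Windows-1252 apostrophe (assumed sample).
mostly_ascii = b"It\x92s a story"

# Three kanji encoded as Shift-JIS, with a misleading UTF-8 BOM stuck on front.
japanese = codecs.BOM_UTF8 + "\u65e5\u672c\u8a9e".encode("shift_jis")

story = decode_lossy(mostly_ascii)  # one character lost, still readable
lost = decode_lossy(japanese)       # none of the original kanji survive
```

The first case loses one apostrophe. The second silently destroys the whole text, and nothing in the stored result says what the original bytes were.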
    Quote Originally Posted by nesetalis View Post
    We are in the age of facebook and twitter. Expecting the users to know encoding and code pages is a little unreasonable. [...] If your site is UTF-8 only, then you HAVE to handle it gracefully, or at least fail in such a way that the user doesn't really care.
    We are in the age of internationalization (meaning that proper Unicode handling is mandatory), and while I agree that the current solution is not ideal (by any means; a hard error does suck), you should be able to set your editor to save as UTF-8 by default and largely forget about the issue for now. In the future, improperly encoded uploads will be handled more gracefully; this is what we're stuck with for technical reasons.

    In short, we hear you, but this is a Hard Problem requiring More Work. Sorry. (Also, Windows is the only platform that doesn't use UTF-8 by default, so….) Ideally, when we hit an encoding error, we can throw up an intermediate "Warning!" page and highlight the offending characters. Maybe this page would offer a 'Stomp everything down to ?-marks' option; it'd be more likely to have a "Try other encoding" dropdown.
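
As a sketch of what that warning page would need underneath (not the site's actual code; the function names and the sample bytes are hypothetical), something like this in Python could locate the offending bytes and build the previews for a "Try other encoding" dropdown:

```python
from typing import Dict, List, Optional, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, bad_bytes) for every span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice being decoded.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad span
    return problems


def preview_decodings(data: bytes, encodings: List[str]) -> Dict[str, Optional[str]]:
    """Strictly decode under each candidate encoding; None means it failed."""
    previews: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            previews[name] = data.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews


# Assumed mixed-encoding sample: a valid UTF-8 e-acute (C3 A9) followed
# later by a Windows-1252 apostrophe (0x92).
sample = b"caf\xc3\xa9 it\x92s"
problems = find_invalid_utf8(sample)
previews = preview_decodings(sample, ["utf-8", "cp1252"])
```

Note that for this mixed sample neither candidate is right: strict UTF-8 fails, and Windows-1252 "succeeds" while garbling the accented letter that was valid UTF-8. That is why the user has to be shown the result and asked, rather than the server guessing.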

    Also, if you'd like to help us test further fixes, we'd certainly be glad for the assistance! Thanks.
    Last edited by catalepsy; 02-16-2013 at 08:23 PM.

  5. #5
    Actually Linux doesn't default to proper UTF-8 encoding either on most things. Hell, getting it to work in a terminal is such a pain sometimes.

    Linux usually defaults to ISO 8859-1 in the US. And yeah, I know it's not particularly easy, but displaying a garbled <?> character is, in my opinion, better than erroring out, since the text can be easily modified, and most readers CAN figure out what was being said if for the most part it's ASCII or UTF-8.

    The problem comes for me, from the fact that I use libre office to write, and I have to use something like notepad++ to re-encode. Libre office also defaults to ISO 8859-1 and I've not found a place to change the encoding. (though windows defaulting to UTF-16 is annoying.)

    A nice option would be an intermediary page that threw up the offending characters in highlight and let the user change them. As well as a nice interface for find and replace.. so you could find a specific character.. the — for instance in latin extended B, click it, then type in the string you want to use in UTF-8 instead in a text box, and it replaces all of them.

    as for help, I'd be happy to if I'm free! :3
    Last edited by nesetalis; 02-16-2013 at 09:42 PM.

  6. #6
    Sir:

    Wow where to start. I know this is a Beta, so let me first start by saying THANK you nice to see a new site coming up.

    I have a few questions. I was trying to UL a story to your site; what the SCAT is a UTF-8 encoding and why do we need it? I just WRITE stories, I am not a code monkey.

    A few issues I am having, having to re register here was a minor pain... caused me confusion for like 30 min till I realized I had to re register again

    I am also having an issue that it keeps telling me that the things I am viewing I MAY not want to see (based on the tags) I want to see EVERYTHING..and i cannot get the blinking thing to LET me see...it's even censoring pictures *I* uploaded from myself.

  7. #7
    Retired Staff catalepsy
    Quote Originally Posted by nesetalis View Post
    Libre office also defaults to ISO 8859-1 and I've not found a place to change the encoding. (though windows defaulting to UTF-16 is annoying.)
    In LibreOffice, take your original document in whatever encoding you have, and then Save As -> Text Encoded. That defaults to UTF-8. You can also hit "Edit filter settings" if you really want.
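
For anyone who would rather script the conversion than click through an editor, here is a minimal Python sketch. It only works if you already know the file's real source encoding, which is the part nobody can determine for you; the function names are hypothetical.

```python
from pathlib import Path


def transcode_bytes(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from a known source encoding to UTF-8."""
    # Strict decode: raises UnicodeDecodeError if the bytes are not valid
    # in source_encoding, instead of silently producing garbage.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


def transcode_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src in source_encoding and write it to dst as UTF-8."""
    dst.write_bytes(transcode_bytes(src.read_bytes(), source_encoding))


# Example: a Windows-1252 curly apostrophe (0x92) becomes the three-byte
# UTF-8 sequence E2 80 99.
converted = transcode_bytes(b"It\x92s", "cp1252")
```

A permissive single-byte encoding such as Latin-1 accepts any byte sequence, so a clean run with the wrong `source_encoding` still produces mojibake; check the output by eye.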

    Quote Originally Posted by nesetalis View Post
    I'd be happy to if I'm free! :3
    Excellent. We'll contact you eventually.

    Quote Originally Posted by Lordgriffin View Post
    I have a few questions. I was trying to UL a story to your site; what the SCAT is a UTF-8 encoding and why do we need it?
    UTF-8 is a standardized way to encode text, one that can represent non-English symbols as well as plain English ones. Most programs can actually generate this kind of file; it just takes knowing how. What program do you use to write?

  8. #8
    good to know about libreoffice.. Never saved out TXT from libreoffice, only doc, fodt, and html.

  9. #9
    I use Microsoft Word, sir... and what about the rest? It is preventing me from seeing mature explicit content when I want to see explicit content.

  10. #10
    enable +18 content in your settings.

 

 
