Follow TV Tropes

Following

Converting all data to UTF8 (now complete)

Go To

Hey tropers, we are about to undergo a large update to the tvtropes' database to convert the encoding. During this time the login system will be offline. You'll still be able to read TVTropes but you won't be able to post or edit anything during the conversion.

How long will the login system be down?

My most recent test run took 9 hours (my first test took 20+ hours so this is after a lot of optimizations to get down to 9 hours). ***it will now take 11 hours. see update below

Can you give me more details?

TVTropes was originally hosted on Windows' servers back in the day and all content on the site is encoded with Windows 1252 (a superset of ISO-8859-1, aka latin1).

According to the W3Tech, only 1.3% of internet traffic is ISO-8859-1. I'm guessing a large amount of that is TVTropes considering we have millions of pages defined with that charset.

The majority of the internet is encoded in UTF8. By doing this conversion it will bring us up to modern standards with the rest of the web and allow us to more easily support other languages and icons. It will also allow us to use more modern tools to help with editing such adding a WYSIWYG editor option. It will also help with code development as we often have to add special workarounds to continue to support this long deprecated charset.

Are we changing anything else?

While we have the login system offline, we are going to upgrade the edits history database table to include a sequence number. We've always wanted to do this but it could not be done while the site was online. That table has 40M+ rows of data. With this change we'll be able to make it so we can easily jump to any page when filtering to edits from a specific page or from a specific user. We have this for the forum and that's why you can jump to page 500 for example of a long running thread and not have a long delay.

When will this happen?

UPDATE: (take two) This is now scheduled for Mon Dec 5th at 8:30PM PST until Tue Dec 6th at 7:30AM

At 8:30PM PST you will be logged-off of TVTropes so make sure to save anything before that time. Once the conversion is done the next morning, I'll point the code to the new server and everyone should be auto logged in. If not, when you see the all clear announcement at the top of the page, go ahead and try logging in again.

The process will take approximately 11 hours. I had it down to 9 hours until I found examples of utf8 encoded values inside latin1 columns so I had to add some extra testing to ensure that data doesn't get double encoded in the process.

If you have any issues during the migration please send an email to thestaff@tvtropes.org

UPDATE: (complete) The migration is officially complete. It ended up taking nearly 12 hours but we got there. All data on TVTropes (hundreds of millions of rows of data) has now been converted from Latin1 to UTF8. We now have the same encoding as 99% of other websites and can support special characters and other languages. It will also allow us to build other tools such as a WYSIWYG editor.

Edited by itcdr on Dec 6th 2023 at 8:48:40 AM

Ayumi-chan Aramis from Calvard (Apprentice) Relationship Status: Serial head-patter
#26: Dec 1st 2023 at 9:34:00 PM

Once the site is converted, will this affect possible logging in once all is said and done.

For reference, I did not log in to the test run because I didn't thought much of it until now.

She/Her | Currently cleaning N/A
Lymantria Tyrannoraptoran Reptiliomorph from Toronto Since: Apr, 2015 Relationship Status: Historians will say we were good friends.
Tyrannoraptoran Reptiliomorph
#27: Dec 1st 2023 at 10:44:04 PM

Where is the link to the test run?

[down] I probably can’t access it, then. Oh well.

Edited by Lymantria on Dec 1st 2023 at 7:26:19 PM

Join the Five-Man Band cleanup project!
WarJay77 Discarded and Feeling Blue (Troper Knight)
Discarded and Feeling Blue
#28: Dec 1st 2023 at 11:13:38 PM

It's an engineer and mod thing, though you can see our updates in the engineer and mod chat. Rest assured that things work fine for the most part, as far as our tests have gone at least.

Currently Working On: Incorruptible Pure Pureness
thatsnumberwang Since: Oct, 2010
#29: Dec 2nd 2023 at 12:05:13 AM

I'm not really that technically minded so please forgive me if this is a dumb question. Does this mean we will no longer have to for example, use two speech marks for italics, or a * for a line break etc. there will just be an button press option to do it like on Word? Because if so, that is a vast improvement. It should cut down on a lot of editing mistakes.

Is a spell check option also possible with the new system? Again it would cut down on many mistakes.

Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#30: Dec 2nd 2023 at 12:14:46 AM

[up] If you mean the editing toolbar, then yes, but it's off-topic. And spellcheckers require quite a huge grammar libraries, that's why there are browser extensions many of which are paid, so it wouldn't surprise me if the site opts to not have own natively.

TroperWall / WikiMagic Cleanup
Lymantria Tyrannoraptoran Reptiliomorph from Toronto Since: Apr, 2015 Relationship Status: Historians will say we were good friends.
Tyrannoraptoran Reptiliomorph
#31: Dec 2nd 2023 at 12:34:56 AM

[up][up] A WYSIWYG editor is probably a good addition, but I wouldn’t want to be completely unable to edit the traditional way.

Join the Five-Man Band cleanup project!
Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#32: Dec 2nd 2023 at 12:37:08 AM

[up] You can switch between the two like on FANDOM.

TroperWall / WikiMagic Cleanup
HarmonyBunny2000 Bun who just exists from my workplace (AKA my bedroom) (Troper in training) Relationship Status: He makes me feel like I have a heart
Bun who just exists
#33: Dec 2nd 2023 at 12:39:17 AM

Hopefully the conversion will mean more reliable servers and faster load times.

thatsnumberwang Since: Oct, 2010
#34: Dec 2nd 2023 at 1:03:00 AM

Sorry for going off topic there, but thank you for answering the question. You guys do not get enough thanks for what you are doing behind the scenes imo smile

Fighteer Lost in Space from The Time Vortex (Time Abyss) Relationship Status: TV Tropes ruined my love life
Lost in Space
#35: Dec 2nd 2023 at 8:00:33 AM

Not trying to speak for the admins, but from my understanding:

  • Logins will not change. We still restrict usernames to the US alphabet (a-z, A-Z), digits (0-9), and hyphens (-), without diacritics. If that is ever changed in the future, we'll let people know.
  • I can't imagine that it would affect stored passwords, but if something horribly breaks in that regard, we can always revert to the backup database. If you weren't using extended characters in your password before, you shouldn't be affected regardless.
  • Load times and server reliability should not be affected by a code-page update. However, new features and changes to existing features can be developed more easily without the need to deal with all of the conversions.

Edited by Fighteer on Dec 2nd 2023 at 11:02:18 AM

"It's Occam's Shuriken! If the answer is elusive, never rule out ninjas!"
Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#36: Dec 2nd 2023 at 10:28:55 AM

Remind me, do normal characters in UTF 8-4 use 1 byte or all characters will use 4 bytes. Because it implies pages would need 4 times longer to download.

Don't believe logins or passwords could be affected unless the password validation function changes.

TroperWall / WikiMagic Cleanup
skewview Since: Jun, 2013
#37: Dec 2nd 2023 at 11:05:13 AM

UTF-8 encodes all valid Unicode code points using one to four one-byte (8-bit) code units. Code points with lower numerical values that tend to occur more frequently, are encoded using fewer bytes. So for common characters in Latin1, there will be no change (128 chars of ASCII), next 1920 will be 2 bytes, next 61440 - 3 bytes, and well, the next 1048576 will require 4 bytes — but if we've gone this far, we're literally covering all written symbols that are available to us in common use, including emoji (excluding cuneiform, ancient egyptian and so on, though from my understanding, these special symbols will be covered by a special extension covered by 8 bytes per char.).

edit: wikiword + clarification

Edited by skewview on Dec 2nd 2023 at 7:15:30 PM

AFK with issues, will return
Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#38: Dec 2nd 2023 at 11:11:08 AM

Then I guess nothing can get worse after this, we're just expanding the availavle chatset.

TroperWall / WikiMagic Cleanup
skewview Since: Jun, 2013
#39: Dec 2nd 2023 at 11:30:22 AM

Definitely, it is like the upgrade of upgrades, though I admit I'm biased.

But just to err on the side of caution, perhaps draft a few abuse preventative measures and guidelines?

Something like, no more than x emojis or something like that. Because initially, I actually thought it was an abuse preventative measure that was hardcoded.

AFK with issues, will return
Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#40: Dec 2nd 2023 at 11:34:19 AM

Don't see what kind of abuse could there be or why limit emojis outside our pre-existing "don't overstylize text unless it's JFF or Self-Demonstrating".

TroperWall / WikiMagic Cleanup
skewview Since: Jun, 2013
#41: Dec 2nd 2023 at 11:42:41 AM

I guess that rule does cover it, good future proofing right there!

I have a question, will the site have a banner up to inform everybody site-wide? (I got here from the reopened wishlist item by kory).

AFK with issues, will return
gjjones Musician/Composer from South Wales, New York Since: Jul, 2016
Musician/Composer
#42: Dec 2nd 2023 at 12:58:19 PM

Also, will the Japanese characters (i.e. Kanji, Hiragana, Katakana) on certain work pages be affected by the conversion?

He/His/Him. No matter who you are, always Be Yourself.
skewview Since: Jun, 2013
#43: Dec 2nd 2023 at 2:44:55 PM

No special character typical for a non Latin-language will be affected negatively, they simply won't require a keyboard entry scheme/combination for them to work, the coding on the server-side of TVT will simply not have to convert/maintain them in that format for storage any longer. This should actually reduce the byte for byte size of the special characters since they will not be kept in a format that describes them through an addressing scheme: &aacute; for á. This means 1112064 characters that cover every language currently in use and written, it even includes phonetic characters, mathematical symbols and emoji.

AFK with issues, will return
VampireBuddha Calendar enthusiast from Ireland (Wise, aged troper) Relationship Status: Complex: I'm real, they are imaginary
Calendar enthusiast
#44: Dec 2nd 2023 at 10:56:22 PM

Oh wow, I recall people were asking for this back when Eddie and Janitor were around. Great to see it's finally happening!smile

Does this mean we'll be able to use punctuation and diacritics directly in URLs and links?

Ukrainian Red Cross
skewview Since: Jun, 2013
#45: Dec 3rd 2023 at 5:14:10 AM

Yes, diacritics will work in URLs, even other characters that anyone can think of.

May I be so humble as to generalize: this change has no known disadvantage, both for TVT's user-base and its developers.

A few things might start to work better:

  • Internal search, and external indexing of pages for search.
  • Translation of pages to and from non-English languages.
  • Faster load times... well, slightly faster.
  • An entire content-conversion process on the server might be dropped.
  • Fewer instances of Data Vampires going around.
  • Since we are going into YMMV and WMG territory, we'll stop right here.

AFK with issues, will return
Lymantria Tyrannoraptoran Reptiliomorph from Toronto Since: Apr, 2015 Relationship Status: Historians will say we were good friends.
Tyrannoraptoran Reptiliomorph
#46: Dec 3rd 2023 at 12:25:18 PM

How will linking work? Will we be able to put punctuation and diacritics in links to articles? And will there be spaces between words in links?

[down] Okay.

Edited by Lymantria on Dec 3rd 2023 at 11:33:20 AM

Join the Five-Man Band cleanup project!
Amonimus the Retromancer from <<|Wiki Talk|>> (Sergeant) Relationship Status: In another castle
the Retromancer
#47: Dec 3rd 2023 at 12:46:10 PM

The change is to make foreign text to display as foreign text during editing instead of "&#73 ;&#39 ;&#25 ; &#53 ;&#86 ;&#75 ;&#72 ;&#10 ;&#19 ;&#18 ;&#11". I think people overthink what conversion means.

TroperWall / WikiMagic Cleanup
VampireBuddha Calendar enthusiast from Ireland (Wise, aged troper) Relationship Status: Complex: I'm real, they are imaginary
Calendar enthusiast
#48: Dec 3rd 2023 at 2:30:32 PM

What I'm asking is, will we be able to type {{You're Just Jealous}} in the editing window and have it link to You're Just Jealous? Because that would be awesometacular.

Edited by VampireBuddha on Dec 3rd 2023 at 10:30:55 AM

Ukrainian Red Cross
skewview Since: Jun, 2013
#49: Dec 3rd 2023 at 5:30:21 PM

^ I thought you were asking about the same thing that Cutegirl920fire was talking about the previous page... because, what you are asking about is punctuation — which might need a whishlist entry to ask for support.

Edited by skewview on Dec 3rd 2023 at 1:31:00 PM

AFK with issues, will return
itcdr Since: Aug, 2014
#50: Dec 3rd 2023 at 8:18:06 PM

UPDATE: I've been testing and found more edge cases. It turns out we have some utf8 encoded values inside some latin1 tables. A small amount relative to the full database (~15K rows out of hundreds of millions of rows of data). But those values get double encoded when I do a large scale convert so I've had to add some extra steps to the process to account for this with testing every row and column. It has taken the script from 9 hours to 11 hours to safely convert everything.

I'm running another full test tonight. Assuming it completes without errors in the morning then I'm planning to start the migration tomorrow night (Monday Dec 4th 8:30pm PST to Tuesday Dec 5th 7:30am). I'll add a site wide announcement tomorrow if it's a go.

Edited by itcdr on Dec 3rd 2023 at 8:18:54 AM


Total posts: 331
Top