Misencoding FAQ

  1. Misencoding FAQ
    1. Purpose
    2. FAQ
        1. When I used the tagger with this anime soundtrack CD, it showed a release with all kinds of garbage characters; what's up with that?
        2. Do you have to be able to read Chinese (etc.) to be able to fix these? Or is there an idiot proof approach that lets English-only editors do it?
        3. Google turned up something that might be a description of the release in question, but since I only read English, how can I be sure it's really the right one?
        4. I tried Google, but the only European words were CD-2 and RMX; none of the search results seem to be relevant. What else can I try?
        5. I found the FreeDB entry and selected Big5 encoding; now I have a page with a bunch of something that looks like it might be Chinese, but since I can't read it, how can I be sure it's the right encoding?
        6. This is an old release with no link to the original FreeDB entry and I can't find it with FreeDB search; why can't I just manually select the correct encoding for the MusicBrainz release page and use that?
        7. My browser has automatic detection of character encoding; why can't I use that instead of trial and error to discover the correct encoding to use?
        8. Instead of garbage characters some releases just have lots of ????s; how can I find the correct text for these, and why does this happen?
        9. I've corrected the encoding problems with this MusicBrainz release entry, but my MP3 player/application can't display Japanese (etc.). How can I get a translation?
        10. What special things should I do when correcting the encoding for an artist name?
        11. How can I tell what the sortname should be for an artist like 김종환 when I can't read Korean?
        12. I found several transliterations for an artist; which one should I use for the sortname?
        13. I really like fixing these misencoded MusicBrainz entries; how can I find more of them?

Purpose

This FAQ covers techniques for fixing misencoded entries for artists, releases and tracks. There is some overlap with InterNationalization, but that page is aimed mostly at MusicBrainz developers; this one helps editors discover the correct encoding for misencoded entries so that they can submit corrections.

If that still doesn't help, please find us in one of the MusicBrainzForums or via the contact page. If you would like to update the content of this page on the wiki, feel free to do so but please do not add questions without answers.

FAQ

When I used the tagger with this anime soundtrack CD, it showed a release with all kinds of garbage characters; what's up with that?

There are a number of different CharacterEncodings that can be used to represent accented characters and letters in non-Roman scripts such as Chinese. MusicBrainz uses Unicode, specifically the UTF-8 encoding, to represent these international characters. However, ID3 tags and FreeDB entries may use many other encodings, such as Big5 or KOI8-R. When these FreeDB entries are imported, or titles are copied from these ID3 tags without converting to UTF-8, a name like 김종환 can end up as ±èÁ¾È¯ instead.

To correct these misencoded entries, you need to find some way to display the artist name or release/track title correctly (it doesn't have to be in UTF-8, as long as the system knows the correct encoding); you can then copy and paste the correct name or title into the MusicBrainz editing pages.
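
The mechanism described above can be reproduced in a few lines of Python, which may help when deciding which encoding to try. This is only a sketch: it assumes the name was originally stored as EUC-KR and then wrongly read as ISO 8859-1 (Latin-1), and the function names `misencode` and `repair` are made up for illustration.

```python
def misencode(text, real_encoding, wrong_encoding="latin-1"):
    """Simulate a bad import: bytes in real_encoding read as wrong_encoding."""
    return text.encode(real_encoding).decode(wrong_encoding)


def repair(garbage, real_encoding, wrong_encoding="latin-1"):
    """Undo the mistake: recover the original bytes, then decode them properly."""
    return garbage.encode(wrong_encoding).decode(real_encoding)


garbage = misencode("김종환", "euc_kr")
print(garbage)                    # the "garbage" form of the name
print(repair(garbage, "euc_kr"))  # the original Hangul again
```

The repair step only works when the guessed `real_encoding` is the right one; a wrong guess either raises an error or produces a different kind of garbage.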


Do you have to be able to read Chinese (etc.) to be able to fix these? Or is there an idiot proof approach that lets English-only editors do it?

No ability to read foreign languages is required; in fact, you can even do it without Asian fonts in some cases (the Mozilla browser on one Linux system I use has no Asian fonts, and displays Unicode "box" characters with the Unicode hex digits; this is less desirable than proper fonts, but is workable).

If there are some European words in the titles, try to Google them and see if that turns up a listing on amazon.co.jp, yesasia.com, Boxup media or another music vendor site. Even if there are no European words in the titles, you can sometimes get a Google hit with the "garbage" itself (this works better with misencoded Russian, Hebrew, or Greek, where almost all the characters are alphabetic with accents, instead of the other symbols common for Asian text).


Google turned up something that might be a description of the release in question, but since I only read English, how can I be sure it's really the right one?

There are some sanity checks that you should make before you start cutting and pasting artist names and release/track titles. The first is that the number of tracks should be the same as on the misencoded release, and any recognizable European words or punctuation like (parentheses) should be in the same places in the same track titles on both MusicBrainz and the other music site. If the other music site has track times, they should be within 5-10 seconds of the times on MusicBrainz for all tracks.

Finally, the titles should be comparable in length for all tracks. It's unlikely that 사랑을 위하여(듀엣 조영남) would be misencoded as µ¹¾Æ¿Í. For Asian text (Chinese, Japanese, or Korean) a single character like 사 will typically be misencoded as two characters (in this case, »ç); for all other text, the misencodings will typically be one-for-one (this is also the case when misencoded entries are all or mostly question marks, regardless of whether the original text was Asian or not).
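
The track-time and title-length checks above can be sketched in Python. This is a rough illustration only: it assumes the garbage came from EUC-KR bytes shown one character per byte, and the helper names, the 10-second tolerance and the `slack` allowance are choices made for this example.

```python
def times_match(mb_seconds, other_seconds, tolerance=10):
    """Same number of tracks, and every track time within the tolerance."""
    if len(mb_seconds) != len(other_seconds):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(mb_seconds, other_seconds))


def expected_garbage_length(text, encoding):
    """Each byte of the original encoding shows up as one garbage character."""
    return len(text.encode(encoding))


def lengths_comparable(candidate, garbage, encoding, slack=2):
    """True if the garbage is about as long as the candidate title predicts."""
    return abs(expected_garbage_length(candidate, encoding) - len(garbage)) <= slack


print(times_match([215, 187], [218, 190]))
print(expected_garbage_length("사", "euc_kr"))
print(lengths_comparable("사랑을 위하여(듀엣 조영남)", "µ¹¾Æ¿Í", "euc_kr"))
```

The last call compares the long title with the short garbage string from the text, so it reports that the lengths are not comparable.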

If the entry you find passes these sanity checks, you can be reasonably sure that it is a match. If you want to be extra-sure, you can find the original FreeDB entry (see below) and use the correct encoding for viewing; you can then do a visual check that the funny symbols on the FreeDB page look very much like the funny symbols on the other music site page.

It is also a very good idea to add an edit note with a link to the release on another music site, showing the correctly encoded data; this allows voters to easily confirm that your "de"coding of the misencoding passes these sanity checks as well, and greatly increases the chance that your changes will be approved.


I tried Google, but the only European words were CD-2 and RMX; none of the search results seem to be relevant. What else can I try?

If you don't get a hit (or you get too many unrelated results) with Google, you can also just go to the original FreeDB entry. Since the server update on 2004-07-25, the ModBot adds a note to all FreeDB imports with the original FreeDB URL, which makes it very easy to find (with older ones, there is no note, and you must rely on the FreeDB search capabilities, which are very limited, and don't work well with accented letters or symbols).

Once you have the FreeDB entry, you can try viewing the FreeDB page by manually specifying the encoding in your browser. Browser encoding menus:

  • Internet Explorer:

    • Windows: you can select any supported encoding from the View -> Encoding -> More menu item.

    • MacOS X: View > Character Set > Your-Character-Set-Choice.

  • Gecko-based browsers (Mozilla, Netscape 7.x and Firefox): the View -> Character Coding -> More menu item offers submenus by region; the available encodings are listed on these submenus (earlier versions of Mozilla or Navigator use View -> Character Set).

  • KHTML

    • Safari has the View -> Text Encoding menu item.

    • Konqueror uses the View -> Set Encoding -> Manual menu item.

  • OmniWeb

    • 4.x: Place the character encoding pop-up in the tool bar, then Character Encoding > Region / Category / From Server > Encoding.

    • 5.xβ: I haven't downloaded this version yet.

  • Opera

    • MacOS X, 7.5.x: View > Encoding > Encoding-Category > Encoding. There is also an option called "Automatic Selection"; I don't use Opera much, so I can't say whether it auto-detects or just accepts whatever the server declares.

  • RealPlayer 10

    • MacOS X, using WebKit: There is no option to change encoding, default or instance.

(Please add instructions for other browsers if you know them!)
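
If you have saved the raw bytes of a FreeDB entry, the same trial and error can be done outside the browser. The following Python sketch tries a list of candidate encodings and keeps those that decode without error; the candidate list and the function name `try_encodings` are assumptions made for this example, not part of any MusicBrainz or FreeDB tool.

```python
CANDIDATES = ["utf-8", "big5", "gb18030", "euc_kr", "euc_jp", "shift_jis", "koi8_r"]


def try_encodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text} for every candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


raw = "김종환".encode("euc_kr")  # stand-in for bytes saved from a FreeDB page
for name, text in try_encodings(raw).items():
    print(f"{name:10} {text}")
```

A clean decode does not prove the encoding is right: single-byte encodings such as KOI8-R accept any byte sequence, so each surviving candidate still needs to be checked by eye or with a search.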


I found the FreeDB entry and selected Big5 encoding; now I have a page with a bunch of something that looks like it might be Chinese, but since I can't read it, how can I be sure it's the right encoding?

Open Google search in another browser window or tab, and cut and paste the artist name and release title into the search text box. If you get a bunch of hits that look like music vendors, you probably have the right encoding. If you just get a few hits, especially where the search text just appears in the middle of sentences, you may have coincidental matches for semi-garbage text. If you get no hits, even for just the artist name, you definitely have the wrong encoding; go back to the FreeDB page and try another encoding.


This is an old release with no link to the original FreeDB entry and I can't find it with FreeDB search; why can't I just manually select the correct encoding for the MusicBrainz release page and use that?

When FreeDB entries are imported into MusicBrainz, they undergo a conversion from ISO 8859-1 (Western European Latin-1) encoding to UTF-8 encoding, even if the FreeDB entry was encoded differently, using e.g. GB18030. When your browser attempts to interpret that UTF-8 text as GB18030, you will just get a different kind of garbage. Only interpreting the original unconverted FreeDB entry will work correctly.
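
The double conversion can be illustrated in Python. The title below is a made-up example, and the sketch assumes the import step mapped every byte to its Latin-1 character without loss, which the real import may not have done.

```python
original = "音乐"  # hypothetical title, stored by FreeDB as GB18030
freedb_bytes = original.encode("gb18030")

# The import treats every byte as a Latin-1 character, then stores UTF-8.
imported = freedb_bytes.decode("latin-1")
musicbrainz_bytes = imported.encode("utf-8")

# What a browser does when told the MusicBrainz page is GB18030:
wrong = musicbrainz_bytes.decode("gb18030", errors="replace")
print(wrong == original)      # the browser view is not the original title

# Undoing the Latin-1 step first gets back to the original FreeDB bytes.
recovered = musicbrainz_bytes.decode("utf-8").encode("latin-1").decode("gb18030")
print(recovered == original)  # succeeds only under the lossless assumption
```

A browser encoding menu cannot perform the intermediate step, which is why switching the encoding on the MusicBrainz page does not help.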


My browser has automatic detection of character encoding; why can't I use that instead of trial and error to discover the correct encoding to use?

Current versions of both Internet Explorer and Mozilla have excellent automatic detection of character encoding. These work by looking at the distribution of character codes on the page in question and comparing them to the distributions for typical text in different languages using different encodings. However, they will not override encoding tags such as the CONTENT="text/html; charset=ISO-8859-1" present in the FreeDB HTML pages. Most corrections from FreeDB can be made by forcing encoding to UTF-8.

You may get useful results by clicking through to the FreeDB text version of the page: look for the hexadecimal number hyperlink on the ids line (e.g. ids: misc / 220eca26) of the HTML page, which takes you to the text page. Try this with your browser and see if it displays the correct Japanese text on the second (text) page. Be sure that you have selected (checked) View -> Encoding -> Auto-Select for Internet Explorer, or View -> Character Coding -> Auto-Detect -> Universal for Mozilla.

The Konqueror browser has a semi-automatic detection of character encoding (View -> Set Encoding -> Automatic Detection -> Semi-Automatic), but I have found it to perform poorly. Even the more accurate automatic detection of Internet Explorer or Mozilla requires a sufficient amount of text in the track listing for good results, and can be thrown off by large amounts of European text in the titles.

In Apple's Safari (which uses the WebKit implementation of KHTML), you can set your default encoding in the Appearance preference pane; I recommend UTF-8. As of version 1.2.2 (v125.8), there does not seem to be a way to force auto-detection in the Mozilla tradition. To change the page encoding, go to View > Text Encoding > Your-Encoding-Choice.

If automatic detection isn't working for a particular entry, and you have checked that it is enabled, you may get better results by cutting and pasting the "garbage" artist, release name, and track numbers and titles into a plain text file with a .html extension (it may be helpful to omit European words when you do this). Open the .html file with Internet Explorer or Mozilla using double-click or a file:/path/name URL, and automatic detection may work after all. If even this doesn't work, check CharacterEncodings for some hints about how to distinguish different types of "garbage" and suggestions for likely encodings.
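
For the file to be useful, it has to contain the original bytes rather than the garbage re-saved in some other encoding. A minimal Python sketch of that step follows; it assumes the garbage characters are Latin-1 (or Windows-1252) renderings of the original bytes, and the helper names and file name are invented for this example.

```python
import tempfile
from pathlib import Path


def garbage_to_bytes(garbage):
    """Turn pasted garbage characters back into the bytes they came from."""
    try:
        return garbage.encode("latin-1")
    except UnicodeEncodeError:
        # Some systems show bytes 0x80-0x9F as Windows-1252 symbols instead.
        return garbage.encode("cp1252")


def write_test_page(garbage, path):
    """Write the raw bytes with no charset declaration so detection can run."""
    body = b"<html><body><pre>" + garbage_to_bytes(garbage) + b"</pre></body></html>"
    Path(path).write_bytes(body)
    return str(path)


page = write_test_page("±èÁ¾È¯", Path(tempfile.gettempdir()) / "garbage.html")
print(page)  # open this file in the browser and let auto-detection run
```

If the pasted text mixes characters that neither encoding can represent, `garbage_to_bytes` raises an error, which usually means some characters were already lost before the paste.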


Instead of garbage characters some releases just have lots of ????s; how can I find the correct text for these, and why does this happen?

If there are European words in the titles, Googling may still work. If it is a recent (post 2004-07-25) entry, the ModBot will have added a note with the original FreeDB URL to the edit that added the release (click on the "View release edits" link to see the note), and the original FreeDB entry will have the "garbage" rather than ????s, allowing you to specify the encoding or have it auto-detected. If there is no FreeDB URL and there is nothing other than ????s, the entry is useless and should be removed or voted down. However, even just one or two European words in a name or title, like disk-2, may allow you to use FreeDB advanced search to display all entries that match; you can then use the browser's search-on-page function to find the matching text exactly (this is admittedly rather difficult).

It's not entirely clear why some releases have ????s rather than garbage characters; however, an unscientific sampling of half a dozen entries with this problem added since the 2004-07-25 server update shows that 100% of the FreeDB entries for these are actually encoded in UTF-8.

There may be a bug in the FreeDB-import code where entries that are already in UTF-8 are converted to 8859-1 before being converted back to UTF-8. Since non-Western-European characters in UTF-8 are not representable in 8859-1, they could be replaced with ? characters, leading to this problem. In any case, if you are trying to guess the correct encoding for a FreeDB entry where the MusicBrainz release has ????s, try UTF-8 first, as it is very likely to be correct.
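
The suspected bug is easy to imitate in Python. This sketch only shows the general effect of forcing text through ISO 8859-1 with replacement; it is not taken from the actual import code, and the function name is made up.

```python
def lossy_roundtrip(text):
    """Force text through ISO 8859-1 and back, replacing what does not fit."""
    latin1_bytes = text.encode("latin-1", errors="replace")  # unmappable -> b"?"
    return latin1_bytes.decode("latin-1")


print(lossy_roundtrip("김종환"))       # ???
print(lossy_roundtrip("Café 김종환"))  # Café ???
```

Each lost character becomes exactly one question mark, while Western European characters survive, which matches the one-for-one pattern seen in these entries.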


I've corrected the encoding problems with this MusicBrainz release entry, but my MP3 player/application can't display Japanese (etc.). How can I get a translation?

Google searches on the artist and release/track titles plus some key English words may turn up translations, but this is usually limited to popular anime titles. Otherwise, you will have to learn to read the language (or find a friend who does) and translate it yourself. If you find (or make) a translation, you should see InterNationalization for style guidelines about how to enter this.


What special things should I do when correcting the encoding for an artist name?

When correcting an artist name, there are some extra considerations beyond what is necessary when correcting titles. First, if the artist name is made up of ????s, you should not use the Edit Artist Name link, since there are probably other releases by unrelated artists that got misencoded to the same number of ????s. Instead, use the Add Artist link on the navigation panel to create a new entry for the artist and then use the Move link on the release to move the release to that newly created artist.

When correcting an artist name (not creating a new one for a release with ???s) you should add an alias with the misencoded artist name. This will allow other FreeDB imports with encoding problems to still be assigned to the correct artist.

When correcting (or creating) an artist name, the ArtistSortname should be the official (or standard) transliteration of the name into Roman letters, with the family name (if any) first, followed by a comma. For example, the ArtistSortname for the artist 김종환 is Kim, Jong-Hwan. Romanization of sortnames serves two purposes: it provides some sort of useful display for those without the necessary fonts and, more importantly, it provides a meaningful and consistent universal sorting order. Without romanized sortnames, there is no meaningful order between 김종환 and Колибри and Γιάννης Κότσιρας and שבק ס for a person who only understands the common Roman characters [a-zA-Z]. It is important to point out that Unicode characters do have a natural sort order, and using it is better than leaving the field blank.
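
The difference between the two orderings can be seen in a short Python sketch. Only the sortname "Kim, Jong-Hwan" comes from this FAQ; the other romanizations in the dictionary are illustrative guesses and should not be treated as the official sortnames.

```python
names = ["김종환", "Колибри", "Γιάννης Κότσιρας", "שבק ס"]

# Plain string sorting compares Unicode code points, so the result is
# grouped by script: Greek, then Cyrillic, then Hebrew, then Hangul.
print(sorted(names))

# Hypothetical sortnames; only "Kim, Jong-Hwan" is given in the text above.
sortnames = {
    names[0]: "Kim, Jong-Hwan",
    names[1]: "Kolibri",
    names[2]: "Kotsiras, Giannis",
    names[3]: "Shabak Samech",
}
print(sorted(names, key=lambda name: sortnames[name].casefold()))
```

With romanized keys the four artists fall into ordinary alphabetical order regardless of script.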

Additionally, since the ArtistSortname field is not currently used for name searches, you should enter the Roman transliteration as an artist alias, using the most common form(s) of the name, in this case Jong-Hwan Kim and Jong Hwan Kim (an argument could probably be made for Kim Jong Hwan but since searches don't care much about word order, and aliases are rarely displayed, this is not critical). You should also enter aliases for any alternate transliterations.


How can I tell what the sortname should be for an artist like 김종환 when I can't read Korean?

A Google search for the artist name will usually turn up an entry in some page that gives a transliteration. In this case, the second page of results for 김종환 turned up an entry in kpopdb.com, the Korean Pop artist database. Adding a link to such a page as the edit note for the Edit Artist Name is strongly encouraged, even for AutoEditors, whose changes will go through without voting.

In particular, the yesasia.com music vendor site is one that has English versions for all their pages, including transliterations of the artist name; searching for an artist's Asian name and yesasia may turn up a relevant yesasia.com page. Their English transliterations are sometimes less common than other alternate forms, but they are always worth entering as aliases.

Another trick is to look at the URLs. Since non-European characters in URLs must be %-encoded, website developers often use English transliterations of artist names in the URL paths. Additional searches with the original name and the relevant part(s) of the URL will often turn up confirmation.
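
To see why developers prefer transliterations in paths, here is what %-encoding does to a Korean name, using Python's standard `urllib.parse`. The URL and the helper `readable_segments` are hypothetical and exist only for this illustration.

```python
from urllib.parse import quote, unquote, urlparse

encoded = quote("김종환")  # each UTF-8 byte becomes a %XX escape
print(encoded)             # %EA%B9%80%EC%A2%85%ED%99%98
print(unquote(encoded))    # 김종환


def readable_segments(url):
    """Path segments made only of ASCII letters, digits, hyphens or underscores."""
    parts = urlparse(url).path.split("/")
    return [
        p for p in parts
        if p and p.isascii() and p.replace("-", "").replace("_", "").isalnum()
    ]


# Hypothetical vendor URL, not a real page.
print(readable_segments("http://example.com/artist/kim-jong-hwan/" + encoded))
```

The readable segments are the parts worth feeding back into a search together with the original name.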

Another option, if you're having trouble finding a sortname for Chinese and Japanese names, is to use an OnlineTranslator, some of which can also do transliteration. There's gb.waiyu.org and big5.waiyu.org for simplified Chinese and traditional Chinese respectively, and the Japanese<=>English Dictionary for Japanese, which can be used by unchecking words, checking names and changing "word starting with pattern" to "word matching pattern". The Japanese dictionary can't search for several words at once, so you may need to remove characters from the end one by one until you get a match. For example, 松浦亜弥 and 松浦亜 won't match, but 松浦 will find Madzura, Matsumura, Matsura and Matsuura. Having found one half, searching for the other half, 亜弥, will find Aya. Searching Google for 松浦亜弥 and Aya will find several references to "Aya Matsuura" on the Japanese Amazon and HMV sites.


I found several transliterations for an artist; which one should I use for the sortname?

If there's an "official" site for the artist, you should prefer the one used there. If not, a plausible way to decide is to see which one gets more Google hits (Google-fight).


I really like fixing these misencoded MusicBrainz entries; how can I find more of them?

First, you might want to consider getting a life (or failing that, a degree in Linguistics :-). If you really are interested in doing this (maybe you're hoping for an AutoEditor nomination?), the MusicBrainz navigation toolbar has a report listing many (though not all) misencoded entries (you can also find this in the MusicBrainz navigation panel under Edit the Data -> Suggestions -> Wrong Charset). As of this writing (2004-07-29), there are 64 pages with 1595 possibly misencoded tracks; enough to keep even the most obsessive-compulsive editor busy for days if not weeks (I should know ;) ).


This mirror was last updated: 2008-11-20 19:59:59.792682+00