C++ - Can't read Unicode (Japanese) from a file
Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on standard output.
I am using Visual Studio 2008.
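#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    wstring line;
    // File containing Japanese characters, saved as a Unicode file.
    // The backslash in the path must be escaped ("\s" is not a valid escape).
    wifstream myfile("d:\\sample.txt");
    //myfile.imbue(locale("japanese_japan"));
    if (!myfile)
        cout << "while opening file error encountered" << endl;
    else
        cout << "file opened" << endl;
    //wcout.imbue(locale("japanese_japan"));
    // Loop on getline itself so a failed read is not printed as a stale line.
    while (getline(myfile, line))
        wcout << line << endl;
    myfile.close();
    system("pause");
    return 0;
}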
This program generates random output, and I don't see any Japanese text on the screen.
Oh boy. Welcome to the fun, fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...)
So the first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise for the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian?) Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16, and even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
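To make the endianness question concrete, here is a minimal sketch of assembling UTF-16 code units from raw bytes without going through a wstream. The helper name utf16_from_bytes is hypothetical, it assumes a C++11 compiler (std::u16string is not available in Visual Studio 2008), and the byte order has to be supplied by the caller because the bytes alone don't say which it is.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: assemble UTF-16 code units from raw file bytes.
// `big_endian` must come from outside knowledge (a BOM, or the user).
// A trailing odd byte is ignored; surrogate pairs are passed through as-is.
std::u16string utf16_from_bytes(const std::string& bytes, bool big_endian) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned b0 = static_cast<unsigned char>(bytes[i]);
        const unsigned b1 = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned hi = big_endian ? b0 : b1;  // most significant byte
        const unsigned lo = big_endian ? b1 : b0;  // least significant byte
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

Picking the wrong byte order does not fail; it silently yields different (and usually meaningless) code units, which is one way to end up with junk on screen.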
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
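The two steps just described might look roughly like the sketch below. The helper names read_all_bytes and to_utf16 are made up for illustration, and the conversion half is guarded by _WIN32 because MultiByteToWideChar only exists on Windows. Which code page to pass is an assumption you have to supply yourself: 932 for Shift-JIS or CP_UTF8 for UTF-8, for example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: read the whole file as raw bytes, with no decoding.
// Returns an empty string if the file cannot be opened.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

#ifdef _WIN32
// Hypothetical helper: convert bytes in the given code page to UTF-16.
// Returns an empty string on failure or on empty input.
std::wstring to_utf16(const std::string& bytes, UINT code_page) {
    if (bytes.empty()) return std::wstring();
    const int byte_count = static_cast<int>(bytes.size());
    // First call: ask how many wide characters the result needs.
    const int len = MultiByteToWideChar(code_page, 0, bytes.data(),
                                        byte_count, NULL, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(code_page, 0, bytes.data(), byte_count,
                        &wide[0], len);
    return wide;
}
#endif
```

On Windows, a call such as to_utf16(read_all_bytes("d:\\sample.txt"), 932) would treat the file as Shift-JIS. If the file is really in some other encoding, the call will still return something, just not the text you wanted.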
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - e.g., if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and not attempt to support others.
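As an illustration of the byte order mark hint, here is a small sketch that checks the first few bytes of a buffer against the standard Unicode BOM sequences. The BomKind enum and detect_bom function are hypothetical names. A match is only a hint, and the absence of a BOM tells you nothing (Shift-JIS and EUC-JP files never have one, and UTF-8 files often don't).

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; the names are illustrative only.
enum BomKind { BomNone, BomUtf8, BomUtf16LE, BomUtf16BE, BomUtf32LE, BomUtf32BE };

// True if `bytes` begins with the first `n` bytes of `prefix`.
static bool starts_with(const std::string& bytes, const char* prefix, std::size_t n) {
    return bytes.size() >= n && bytes.compare(0, n, prefix, n) == 0;
}

// Guess the Unicode variant from a leading byte order mark, if any.
BomKind detect_bom(const std::string& bytes) {
    // UTF-32 LE is checked before UTF-16 LE because its BOM also starts
    // with FF FE. A UTF-16 LE file whose first character is U+0000 would
    // be misreported here, so treat the result as a hint only.
    if (starts_with(bytes, "\xFF\xFE\x00\x00", 4)) return BomUtf32LE;
    if (starts_with(bytes, "\x00\x00\xFE\xFF", 4)) return BomUtf32BE;
    if (starts_with(bytes, "\xEF\xBB\xBF", 3))     return BomUtf8;
    if (starts_with(bytes, "\xFF\xFE", 2))         return BomUtf16LE;
    if (starts_with(bytes, "\xFE\xFF", 2))         return BomUtf16BE;
    return BomNone;
}
```

When the result is BomNone you are back to asking the user, guessing, or committing to one fixed encoding.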