C++ - Can't read Unicode (Japanese) from a file


Hi, I have a file containing Japanese text, saved as a Unicode file.

I need to read from the file and display the information on standard output.

I am using Visual Studio 2008.

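#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    wstring line;
    // File containing Japanese characters, saved as a Unicode file.
    // The backslash must be escaped inside a string literal.
    wifstream myfile("d:\\sample.txt");
    //myfile.imbue(locale("japanese_japan"));
    if (!myfile)
        cout << "An error was encountered while opening the file" << endl;
    else
        cout << "File opened" << endl;
    //wcout.imbue(locale("japanese_japan"));
    // Loop on getline itself so a failed read is never printed.
    while (getline(myfile, line))
    {
        wcout << line << endl;
    }
    myfile.close();
    system("pause");
    return 0;
}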

This program generates some random output, and I don't see any Japanese text on the screen.

Oh boy. Welcome to the fun, fun world of character encodings.

The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use it to this day...)

So the first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise for the interested reader.

Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16, and even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
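If the file does turn out to be little-endian UTF-16, one way to sidestep wstream and its buffer entirely is to read the raw bytes and pair them up yourself. This is only a rough sketch of that alternative, not the wstream route: `read_all_bytes` and `utf16le_to_wide` are made-up helper names, and it assumes a 16-bit `wchar_t` as on Windows (where `wchar_t` is wider, surrogate pairs would be left uncombined).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes.
// Binary mode keeps the stream from translating line endings.
std::string read_all_bytes(const char* path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: combine little-endian byte pairs into UTF-16 code
// units, skipping a leading FF FE byte order mark if there is one.
// A trailing odd byte is ignored.
std::wstring utf16le_to_wide(const std::string& bytes) {
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        i = 2;  // skip the BOM
    }
    std::wstring out;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>(lo | (hi << 8)));
    }
    return out;
}
```

Something like `utf16le_to_wide(read_all_bytes("d:\\sample.txt"))` would then give you a wide string to hand to a wide-character API.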

So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value; CP_ACP or CP_OEMCP are almost always the wrong answer.
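The convert step might look something like the sketch below. `bytes_to_wide` is a hypothetical helper name, and the code page is something you have to supply yourself (932 is Shift-JIS, `CP_UTF8` is UTF-8). The conversion only exists on Windows, so the sketch simply throws everywhere else.

```cpp
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: decode raw bytes in the given Windows code page
// into a UTF-16 std::wstring. Assumes the input length fits in an int.
std::wstring bytes_to_wide(const std::string& bytes, unsigned code_page) {
    if (bytes.empty()) return std::wstring();
#ifdef _WIN32
    const int in_len = static_cast<int>(bytes.size());
    // MB_ERR_INVALID_CHARS makes bad input fail instead of being silently
    // replaced. A few code pages (UTF-7, the ISO-2022 family) require
    // dwFlags == 0 instead.
    const DWORD flags = MB_ERR_INVALID_CHARS;
    // First call: ask how many wchar_t units the result needs.
    const int out_len = MultiByteToWideChar(code_page, flags,
                                            bytes.data(), in_len, NULL, 0);
    if (out_len == 0)
        throw std::runtime_error("input is not valid in this code page");
    std::wstring out(out_len, L'\0');
    // Second call: do the actual conversion into the buffer.
    if (MultiByteToWideChar(code_page, flags, bytes.data(), in_len,
                            &out[0], out_len) == 0)
        throw std::runtime_error("conversion failed");
    return out;
#else
    (void)code_page;
    throw std::runtime_error("bytes_to_wide: Windows-only sketch");
#endif
}
```

For a Shift-JIS file you would call it as `bytes_to_wide(file_bytes, 932)`, where `file_bytes` is whatever you read from the char stream.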

Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - e.g., if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess and rely on the user to correct you if you're wrong, or select a fixed character set and not attempt to support any others.
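For the byte order mark hint specifically, a check could be sketched like this. `sniff_bom` is a made-up name, and "unknown" only means no BOM was found, not that the file isn't Unicode; plenty of UTF-8 files have no mark at all.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess an encoding from a leading byte order mark.
// This is a hint only. FF FE 00 00 is reported as UTF-32LE, although it
// could also be UTF-16LE text that starts with a NUL character.
std::string sniff_bom(const std::string& bytes) {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return "UTF-8";
    // The 4-byte marks are checked before the 2-byte ones because the
    // UTF-32LE mark begins with the UTF-16LE mark.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return "UTF-32LE";
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return "UTF-16BE";
    return "unknown";
}
```

When this returns "unknown" you are back to asking the user, guessing, or picking one fixed character set.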

