.txt file and Japanese text

**shilpal** · August 8th, 2011, 07:54 AM

Hi,

In one of my VC++ applications I am using the FILE::Write() function to write data to a Notepad file. That works fine, but now my requirement is to write Japanese text data to a Notepad file. What do I have to do for this? Is there any option to do so? Please help me.

pal

**VictorN** · August 8th, 2011, 08:00 AM

I guess you mean .txt files (not a notepad).
You should create a UNICODE file (just use a UNICODE build) and preferably with a BOM.

**Igor Vartanov** · August 8th, 2011, 02:12 PM

now my requirement is to write Japanese text data

Any text is just a set of bytes. You compose those properly in memory, then you write the bytes to file. So, what is your real problem?

**shilpal** · August 8th, 2011, 11:30 PM

Yes VictorN, you are right, it is a .txt file. I want to write JP text to it.
What's your solution in that case?
Also, sometimes I want to write data to a .csv file. What about that one?

**shilpal** · August 8th, 2011, 11:33 PM

Thanks, Igor Vartanov.
Can you please tell me how to compose the bytes so that I can put JP text in .txt and .csv files?
My project's character set setting is Unicode.

**Chris_F** · August 8th, 2011, 11:46 PM

I would suggest using WideCharToMultiByte to convert your wstring or wchar* data to a UTF-8 encoded string, then you can write the data to file like normal, preferably without a BOM.
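The conversion suggested above could look roughly like the sketch below. It is a portable, hand-written stand-in for what `WideCharToMultiByte` does with `CP_UTF8`, so that it compiles outside Windows too; the names `utf16_to_utf8` and `write_bytes` are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch rather than anything stated in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encode UTF-16 text as UTF-8 bytes.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // stray low surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: write the bytes unchanged. Binary mode keeps the
// runtime from translating '\n' into "\r\n" on Windows.
bool write_bytes(const std::string& path, const std::string& bytes) {
    std::ofstream f(path, std::ios::binary);
    f.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return f.good();
}
```

A call such as `write_bytes("out.txt", utf16_to_utf8(u"\u65E5\u672C"))` would then store the six UTF-8 bytes for the two characters. In a VC++ Unicode build, calling `WideCharToMultiByte(CP_UTF8, 0, ...)` on the `wchar_t` data should give the same bytes for well-formed input without the hand-written loop.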

**Igor Vartanov** · August 9th, 2011, 12:03 AM

Can you please tell me how to compose the bytes so that I can put JP text in .txt and .csv files?

You know, this sounds a bit weird to me. You get some bytes from a database, a text file or user input. You should already know what the bytes are and what encoding they are in (code page 932, for example, or some flavor of Unicode, as Chris suggested). Then you put the bytes into a .txt or .csv file. I don't see any problem here, do you?

Specifically about putting the text into a .txt file: in case your bytes are in CP932, you need to do nothing but write the bytes directly to the file. It's important to understand that in this case, to see Japanese characters in Notepad, you have to set the system locale for non-Unicode text to Japanese.

In the case of some flavor of Unicode, I'd recommend not avoiding the BOM, as Victor already said.

**vcdebugger** · August 9th, 2011, 12:11 AM

what is BOM?

**Chris_F** · August 9th, 2011, 12:19 AM

Originally Posted by Igor Vartanov

I'd recommend not avoiding the BOM, as Victor already said.

There is no point in a BOM if you are using UTF-8, since the BOM is meant to indicate the byte order (endianness) and there simply is no such thing in UTF-8, so having one is meaningless. I think that dealing with UTF-8 is easier, and it is more readily supported by programs and other operating systems.

**Igor Vartanov** · August 9th, 2011, 02:16 AM

Originally Posted by Chris_F

There is no point in a BOM if you are using UTF-8, since the BOM is meant to indicate the byte order (endianness) and there simply is no such thing in UTF-8, so having one is meaningless.

Really? Other people (see the BOM article that Victor recommended) wouldn't agree with this statement of yours.

Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.
. . .
UTF-8

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF.
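As a minimal sketch of how a reader could use the BOM to tell the representations apart, the function below checks the first bytes of a buffer. The names `Bom` and `detect_bom` are hypothetical, and the sketch only covers UTF-8 and the two UTF-16 byte orders; UTF-32 is deliberately left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the leading bytes of a file's contents and report which BOM,
// if any, they form. UTF-32 BOMs are not handled here.
Bom detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return Bom::Utf16LE;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

A file without any of these prefixes comes back as `Bom::None`, in which case the reader still has to guess or be told the encoding.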

**Eri523** · August 9th, 2011, 07:03 AM

Originally Posted by Chris_F

There is no point in a BOM if you are using UTF-8, since the BOM is meant to indicate the byte order (endianness) and there simply is no such thing in UTF-8, so having one is meaningless. [...]

Besides what Igor already posted: there definitely is a point in using a BOM for UTF-8. It allows one to reliably distinguish UTF-8 from MBCS encoding, which is something I, at least, really appreciate.

**Chris_F** · August 9th, 2011, 10:01 AM

It may very well have its uses for some, but I personally think it is ugly and unnecessary. UTF-8 is supposed to be backward compatible with ASCII, and if you treat a UTF-8 document containing only ASCII characters as if it were ASCII encoded (as you ought to be able to do), then you will end up with 3 garbage characters at the beginning.

From Wikipedia:

While the Unicode Standard does allow a BOM in UTF-8,[2] it does not require or recommend it.[3] Byte order has no meaning in UTF-8[4] so a BOM serves only to identify a text stream or file as UTF-8.

Sure, I'm somewhat naive, but it's 2011 and I think everyone should just be using BOM-less UTF-8 for everything and pretend that other character encodings never even existed. Luckily I use Linux and I'm pretty much able to do just that.

**Eri523** · August 9th, 2011, 06:47 PM

Originally Posted by Chris_F

UTF-8 is supposed to be backward compatible with ASCII, and if you treat a UTF-8 document containing only ASCII characters as if it were ASCII encoded (as you ought to be able to do), then you will end up with 3 garbage characters at the beginning.

Well, strictly speaking, a UTF-8 file (without BOM) that only contains ASCII characters is no UTF-8 file, it's an ASCII file. And of course this has the advantage of being compatible with plainly everything (at least as far as the character set is concerned). If I wanted to get the best of both worlds without requiring the user to always make an explicit choice of character encoding, I'd scan the data before saving it to determine whether it actually does contain non-ASCII characters and insert a BOM only if it does. IMO that is a justifiable effort with respect to the convenience gain it yields, unless the file is really huge.
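The scan-before-saving idea could be sketched as follows, assuming the data is already UTF-8 encoded and held in memory. The helper names `is_pure_ascii` and `with_bom_if_needed` are hypothetical.

```cpp
#include <algorithm>
#include <string>

// True if every byte is in the 7-bit ASCII range.
bool is_pure_ascii(const std::string& utf8) {
    return std::all_of(utf8.begin(), utf8.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}

// Return the bytes to save: unchanged for pure ASCII data, otherwise
// prefixed with the UTF-8 BOM (EF BB BF).
std::string with_bom_if_needed(const std::string& utf8) {
    if (is_pure_ascii(utf8)) return utf8;
    return std::string("\xEF\xBB\xBF") + utf8;
}
```

The scan is a single linear pass over the data, which is why it only starts to matter for very large files.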

Most (Windows) programs I encounter nowadays simply default to MBCS encoding in the absence of a BOM, no matter what otherwise (and given they support more than one encoding at all). If I give them a BOM-less UTF-8 file with extended characters, I get something ugly as well, though admittedly perhaps not within the first three characters (that, OTOH, are pretty easy to locate).

However, .NET programs are an exception to the "most (Windows) programs" rule above: their stream reader and stream writer constructors prefer to default to BOM-less UTF-8 when writing or when reading a BOM-less file. So I need to make a little extra effort to "get the best of both worlds" in my .NET programs, but I think it's worth it, until the world has become 99% unicodified... (Not all of us share the bliss of writing exclusively or at least mostly on *nix...)

**olivthill2** · August 10th, 2011, 01:48 AM

I, too, was wondering whether I should write a BOM header or not.
Eventually, I decided to follow what Notepad (under Windows 7 home edition 32-bit) is doing.

And what does Notepad do when it saves some text containing exotic characters?
It writes the BOM bytes: FF FE.
And Notepad doesn't use UTF-8, but Unicode with a fixed length of 2 bytes per character.

In my software, I write them with good old C functions (fopen(), fputc(), fclose()) initially designed for ASCII text. I add the BOM, and I call fputc() twice for each character. This is a somewhat primitive way of doing things, but it works for me.
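A minimal sketch of that approach is shown below, assuming the text is held as UTF-16 code units and the stream was opened in binary mode. The helper name `put_utf16le_with_bom` is hypothetical. Note that the two-bytes-per-character rule only holds for characters in the Basic Multilingual Plane; anything beyond it takes two code units (a surrogate pair), which this loop writes as four bytes.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: write `text` to an already opened binary stream as
// UTF-16LE, preceded by the BOM bytes FF FE. Returns false on a write error.
bool put_utf16le_with_bom(std::FILE* f, const std::u16string& text) {
    if (std::fputc(0xFF, f) == EOF || std::fputc(0xFE, f) == EOF) return false;
    for (char16_t unit : text) {
        // Little endian: low byte first, then high byte.
        if (std::fputc(unit & 0xFF, f) == EOF) return false;
        if (std::fputc((unit >> 8) & 0xFF, f) == EOF) return false;
    }
    return true;
}
```

Typical use would be to open the file with `fopen(path, "wb")`, call the helper, and then `fclose()` the stream. The `"wb"` mode matters on Windows, because text mode would expand a 0x0A byte into 0x0D 0x0A and corrupt the 16-bit units.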

**Igor Vartanov** · August 10th, 2011, 03:15 AM

Originally Posted by olivthill2

And what does Notepad do when it saves some text containing exotic characters?
It writes the BOM bytes: FF FE.
And Notepad doesn't use UTF-8, but Unicode with a fixed length of 2 bytes per character.

Surprise!
