-
June 26th, 2013, 01:33 PM
#1
writing and reading bytes - a Visual Studio issue ?
I have used the following code to write and read wchar_t bytes from a disk file:
Code:
int WriteBytesW(wchar_t * wcp, int nsz, wchar_t * wcfilepath)
{
wfstream wf;
codecvt_utf16<wchar_t, 0x10ffff, little_endian> ccvt(1);
locale wloc(wf.getloc(), &ccvt);
wf.imbue(wloc);
wf.open(wcfilepath, ios::out | ios::binary);
if(!wf) { wprintf(_T("Unable to open file %s"), wcfilepath); return 0; }
wf.write((wchar_t *) wcp, (streamsize)(nsz));
wf.close();
return 1;
}// WriteBytesW(wchar_t * wcp, int nsz, wchar_t * wcfilepath)
/// reads raw bytes from a file all at once
/// see: http://www.cplusplus.com/reference/istream/istream/tellg/
int ReadBytesW(wchar_t * wcfilepath, wchar_t * pwbuf, long &lsz)
{
wfstream wf;
codecvt_utf16<wchar_t, 0x10ffff, little_endian> ccvt(1);
locale wloc(wf.getloc(), &ccvt);
wf.imbue(wloc);
// see: http://www.codeguru.com/forum/showthread.php?t=511113
wf.open(wcfilepath, ios::in|ios::binary);
if(!wf) { wprintf( _T("Unable to open file %s"), wcfilepath); return 0; }
// get length of file:
wf.seekg (0, wf.end);
int length = wf.tellg();
wf.seekg (0, wf.beg);
lsz = length;
pwbuf = new wchar_t [length+1];
wmemset(pwbuf, 0x0000, length+1);
wf.read(pwbuf, (streamsize) length);
wf.close();
// print content
for(int i = 0; i < length/2; i++)
{
printf("%0.4X ", pwbuf[i]);
}
printf("\n");
delete [] pwbuf; pwbuf = 0;
return 1;
}// ReadBytesW(wstring wsfilepath)
I have run this simple experiment where the wide byte 0xFFFF is present or absent.
Code:
int _tmain(int argc, _TCHAR* argv[])
{
wchar_t wbuf[10];
wbuf[0] = 0x1234;
wbuf[1] = 0x5678;
wbuf[2] = 0x9abc;
wbuf[3] = 0xef12;
wbuf[4] = 0xabcd;
wbuf[5] = 0xfe21;
wbuf[6] = 0xdcba;
wbuf[7] = 0x1f2a;
wbuf[8] = 0xefff;
wbuf[9] = 0x02ff;
int n = WriteBytesW(wbuf, 10, _T("bravo.dat"));
if(n) { printf("save bytes succeeded\n"); } else { printf("save bytes failed\n"); }
wchar_t * wbuf2 = 0;
long nsz = 0;
n = ReadBytesW(_T("bravo.dat"), wbuf2, nsz);
if(n) { printf("read bytes succeeded\n"); } else { printf("read bytes failed\n"); }
return 0;
}
Output:
save bytes succeeded
1234 5678 9ABC EF12 ABCD FE21 DCBA 1F2A EFFF 02FF
read bytes succeeded
nsz =: 20
Now, if wbuf[8] = 0xefff; is replaced by wbuf[8] = 0xffff;
Output:
save bytes succeeded
1234 5678 9ABC EF12 ABCD FE21 DCBA 1F2A 02FF
read bytes succeeded
nsz =: 18
Obviously, the 0xffff wbyte is not read. WHY ?
This presents a significant problem when attempting to read ALL wbytes from a file. Is there any workaround? Is this a VS problem?
mpliam
-
June 26th, 2013, 02:11 PM
#2
Re: writing and reading bytes - a Visual Studio issue ?
First, you don't need to use the _T() macro. If the parameter expects a "const wchar_t*", then just put an L on the front of the literal.
>> Obviously, the 0xffff wbyte is not read. WHY ?
It may have something to do with the fact that U+FFFF is not a valid character code point. Have you stepped through it in the debugger? Or do you have Express with no CRT source?
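For reference, a minimal sketch of what "not a valid character code point" means here. Unicode reserves 66 noncharacters: U+FDD0..U+FDEF, plus the last two code points of every plane (those whose low 16 bits are FFFE or FFFF). The helper name is_noncharacter is made up for illustration and is not part of any library. Whether this property is what makes the stream drop the value is a separate question.

```cpp
#include <cstdint>

// Returns true if cp is one of the 66 Unicode noncharacters.
// is_noncharacter is a hypothetical helper, not a standard function.
bool is_noncharacter(std::uint32_t cp)
{
    if (cp > 0x10FFFF) return false;                 // outside the Unicode range
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;   // contiguous block in the BMP
    return (cp & 0xFFFE) == 0xFFFE;                  // low 16 bits are FFFE or FFFF
}
```

With this check, 0xFFFF and 0xFFFE are noncharacters, while 0xEFFF (the value that round-trips fine in the first test) is not.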
gg
-
June 26th, 2013, 02:19 PM
#3
Re: writing and reading bytes - a Visual Studio issue ?
Works okay on 2012 Win7.
1234 5678 9ABC EF12 ABCD FE21 DCBA 1F2A EFFF 02FF
-
June 26th, 2013, 02:24 PM
#4
Re: writing and reading bytes - a Visual Studio issue ?
>> 1234 5678 9ABC EF12 ABCD FE21 DCBA 1F2A EFFF 02FF
Try it with that set to 0xffff.
Also, what is the size of the file? Wondering if write() did the initial filtering.
gg
-
June 26th, 2013, 02:50 PM
#5
Re: writing and reading bytes - a Visual Studio issue ?
I could reproduce the issue with VC++ 2010 on XP Pro SP3.
Originally Posted by Codeplug
Also, what is the size of the file? Wondering if write() did the initial filtering.
Your suspicion is right: It's the writing phase where the word gets dropped.
Originally Posted by Codeplug
It may have something to do with the fact that U+FFFF is not a valid character code point. [...]
I initially suspected something in that direction as well, but refrained from posting when I saw that the files are opened in binary mode. Can it still be a Unicode (non-)character issue?
[...] Have you stepped through it in the debugger? Or do you have Express with no CRT source?
Express does come with CRT sources, at least the 2010 version.
-
June 26th, 2013, 03:30 PM
#6
Re: writing and reading bytes - a Visual Studio issue ?
I've been using Win 7 (64-bit) Ultimate (Service Pack 1) on a Dell XPS 8300. Interestingly, if one merely writes and reads unsigned char (bytes) through a narrow fstream, every byte value from 0x00 to 0xFF is read back, even though the buffer has to be cast to (char*) for fstream::read(). Go figure.
Code:
int WriteBytesA(unsigned char uc[], int nz, char * sfilepath)
{
fstream f;
f.open(sfilepath, ios::out | ios::binary);
if(!f) { printf("Unable to open file %s", sfilepath); return 0; }
f.write((char *) uc, (streamsize)(nz));
f.close();
return 1;
}// WriteBytesA(unsigned char * uc, int nz, char * sfilepath)
int ReadBytesA(char * sfilepath, unsigned char * ucbuf, long &lsz)
{
fstream f;
f.open(sfilepath, ios::in|ios::binary);
if(!f) { printf("Unable to open file %s", sfilepath); return 0; }
// get length of file:
f.seekg (0, f.end);
long length = (long) f.tellg();
f.seekg (0, f.beg);
lsz = length;
ucbuf = new unsigned char [length+1];
memset(ucbuf, 0x00, length+1);
//f.read(ucbuf, (streamsize) length); // won't accept this
f.read((char*)ucbuf, (streamsize) length);
f.close();
// print content
for(int i = 0; i < length; i++)
{
printf("%0.2X ", ucbuf[i]);
}
printf("\n");
delete [] ucbuf; ucbuf = 0;
return 1;
}// ReadBytesA(char * sfilepath, unsigned char * pucbuf, long &lsz)
int _tmain(int argc, _TCHAR* argv[])
{
unsigned char uc[22];
uc[0] = 0x34;
uc[1] = 0x12;
uc[3] = 0x34;
uc[4] = 0x78;
uc[5] = 0x56;
uc[6] = 0xbc;
uc[7] = 0x9a;
uc[8] = 0x12;
uc[9] = 0xef;
uc[10] = 0xcd;
uc[11] = 0xab;
uc[12] = 0x21;
uc[13] = 0xfe;
uc[14] = 0xba;
uc[15] = 0xdc;
uc[16] = 0x2a;
uc[17] = 0x1f;
uc[18] = 0xff;
uc[19] = 0xef;
uc[20] = 0xff;
uc[21] = 0x02;
long nz = 22;
int n = WriteBytesA(uc, nz, "ssm.dat");
if(n) { printf("save bytes succeeded\n"); } else { printf("save bytes failed\n"); }
nz = 0;
memset(uc, 0x00, 22);
n = ReadBytesA("ssm.dat", uc, nz);
if(n) { printf("read bytes succeeded\n"); } else { printf("read bytes failed\n"); }
printf("nz = %ld\n", nz); // nz is a long, so use %ld
return 0;
}
Output:
save bytes succeeded
34 12 CC 34 78 56 BC 9A 12 EF CD AB 21 FE BA DC 2A 1F FF EF FF 02
read bytes succeeded
nz = 22
It may be that when one needs to write arbitrary wbytes to a disk file, the safest method is to first convert the wbytes into bytes and then save them as bytes (as in the byte-oriented code above). The following code does the conversion and lets you control the 'endianness':
Code:
/// Converts a wide char (wchar_t) array to an unsigned char (byte) array.
/// This routine converts the byte order depending upon the byte order marker.
/// Caller is responsible for allocating and deallocating uc memory.
/// The difference between bigE and littleE is whether the least significant
/// byte is at the lowest address or not.
/// BOM
/// UTF-16 (BE) 0xFEFF - highest value byte at lowest address index
/// UTF-16 (LE) 0xFFFE - lowest value byte at lowest address index
int wcstoucs(wchar_t wcs[], int nsz, unsigned char uc[], wchar_t wcbom )
{
printf("wcbom =: %0.4X\n", wcbom);
bool bLittleE = false;
bool bBigE = false;
if(wcbom == 0xFEFF) { bBigE = true; printf("big-endian\n");}
if(wcbom == 0xFFFE) { bLittleE = true; printf("little-endian\n");}
wchar_t wch = ' ';
int wdx = 0;
for(size_t i = 0; i < 2 * nsz; i+=2)
{
wch = wcs[wdx];
if(bBigE)
{
uc[i] = HIBYTE(wch); // big-endian: most significant byte at the lowest index
uc[i+1] = LOBYTE(wch);
}
if(bLittleE)
{
uc[i] = LOBYTE(wch); // little-endian (x86): least significant byte at the lowest index
uc[i+1] = HIBYTE(wch);
}
wdx++;
}
return 2 * nsz;
}// wcstoucs(wchar_t wcs[], int nsz, unsigned char uc[], wchar_t wcbom )
It occurs to me that the problem saving wide chars to disk might be a bug.
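A compact sketch of the same conversion idea, for comparison. It assumes the data is held as fixed-width 16-bit units (std::uint16_t) and takes the byte order as an explicit argument instead of a BOM value. The names ByteOrder and to_bytes are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not taken from any library.
enum class ByteOrder { Little, Big };

// Splits each 16-bit unit into two bytes in the requested order.
std::vector<unsigned char> to_bytes(const std::uint16_t* units,
                                    std::size_t count,
                                    ByteOrder order)
{
    std::vector<unsigned char> out;
    out.reserve(count * 2);
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char lo = static_cast<unsigned char>(units[i] & 0xFF);
        const unsigned char hi = static_cast<unsigned char>(units[i] >> 8);
        if (order == ByteOrder::Little) {
            out.push_back(lo);   // least significant byte first
            out.push_back(hi);
        } else {
            out.push_back(hi);   // most significant byte first
            out.push_back(lo);
        }
    }
    return out;
}
```

The returned vector can then be passed to a byte-oriented writer such as WriteBytesA. Because the bytes are produced by shifting and masking, the result does not depend on the host's own byte order.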
Last edited by Mike Pliam; June 26th, 2013 at 03:38 PM.
mpliam
-
June 26th, 2013, 05:09 PM
#7
Re: writing and reading bytes - a Visual Studio issue ?
I think Codeplug nailed it. Single stepping the code into CRT shows that 0xFFFF is the wide EOF character so your workaround isn't the way to go.
-
June 26th, 2013, 07:41 PM
#8
Re: writing and reading bytes - a Visual Studio issue ?
I am thinking it is a bug in wfstream ... you can use fstream instead.
Below is your code changed to fstream with a few other minor cosmetic changes:
Code:
// win32_console.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <cstdio>
#include <fstream>
#include <codecvt>
using namespace std;
int WriteBytesW(wchar_t * wcp, int nsz, wchar_t * wcfilepath)
{
ofstream wf(wcfilepath, ios::out | ios::binary);
if(!wf) { wprintf(L"Unable to open file %s", wcfilepath); return 0; }
codecvt_utf16<wchar_t, 0x10ffff, little_endian> ccvt(1);
locale wloc(wf.getloc(), &ccvt);
wf.imbue(wloc);
wf.write(reinterpret_cast<char*>(wcp),nsz);
return 1;
}// WriteBytesW(wchar_t * wcp, int nsz, wchar_t * wcfilepath)
/// reads raw bytes from a file all at once
/// see: http://www.cplusplus.com/reference/istream/istream/tellg/
int ReadBytesW(wchar_t * wcfilepath, wchar_t * & pwbuf, long &lsz)
{
ifstream wf(wcfilepath, ios::in|ios::binary);
if(!wf) { wprintf( L"Unable to open file %s", wcfilepath); return 0; }
codecvt_utf16<wchar_t, 0x10ffff, little_endian> ccvt(1);
locale wloc(wf.getloc(), &ccvt);
wf.imbue(wloc);
// see: http://www.codeguru.com/forum/showthread.php?t=511113
// get length of file:
wf.seekg (0, ios::end);
int length = wf.tellg();
wf.seekg (0, ios::beg);
lsz = length;
pwbuf = new wchar_t [length+1];
wmemset(pwbuf, 0x0000, length+1);
wf.read((char*)pwbuf, (streamsize) length);
wf.close();
printf("length = %d\n",length);
// print content
for(int i = 0; i < length/2; i++)
{
printf("%d : %0.4X \n", i,pwbuf[i]);
}
printf("\n");
delete [] pwbuf; pwbuf = 0;
return 1;
}// ReadBytesW(wstring wsfilepath)
int _tmain(int argc, _TCHAR* argv[])
{
wchar_t wbuf[10];
wbuf[0] = 0x1234;
wbuf[1] = 0x5678;
wbuf[2] = 0x9abc;
wbuf[3] = 0xef12;
wbuf[4] = 0xabcd;
wbuf[5] = 0xfe21;
wbuf[6] = 0xdcba;
wbuf[7] = 0x1f2a;
wbuf[8] = 0xffff;
wbuf[9] = 0x02ff;
int n = WriteBytesW(wbuf, 10*sizeof(wchar_t), L"bravo.dat");
if(n) { printf("save bytes succeeded\n"); } else { printf("save bytes failed\n"); }
wchar_t * wbuf2 = 0;
long nsz = 0;
n = ReadBytesW(L"bravo.dat", wbuf2, nsz);
if(n) { printf("read bytes succeeded\n"); } else { printf("read bytes failed\n"); }
return 0;
}
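The same narrow-stream approach can be reduced to a small round-trip sketch. It is written under a few assumptions that differ from the code above: the path is a narrow std::string (the wide-path overload is a Visual C++ extension), the data is held as std::uint16_t, and bytes are stored in host byte order. The function names write_units and read_units are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Writes 16-bit units as raw bytes in host byte order. Returns false on failure.
bool write_units(const std::string& path, const std::vector<std::uint16_t>& units)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(units.data()),
              static_cast<std::streamsize>(units.size() * sizeof(std::uint16_t)));
    return static_cast<bool>(out);
}

// Reads the whole file back as 16-bit units; a trailing odd byte is ignored.
// Returns an empty vector if the file cannot be opened or read.
std::vector<std::uint16_t> read_units(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // open at end
    if (!in) return {};
    const std::streamoff bytes = in.tellg();                   // file size in bytes
    if (bytes <= 0) return {};
    std::vector<std::uint16_t> units(static_cast<std::size_t>(bytes) / sizeof(std::uint16_t));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(units.data()),
            static_cast<std::streamsize>(units.size() * sizeof(std::uint16_t)));
    if (!in) return {};
    return units;
}
```

Because a char stream treats every byte value as ordinary data, a unit of 0xFFFF written with write_units should come back unchanged from read_units. Note that the length here is counted in bytes and converted to a unit count explicitly, which avoids mixing the two as the wide version does.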
-
June 27th, 2013, 05:13 AM
#9
Re: writing and reading bytes - a Visual Studio issue ?
Originally Posted by S_M_A
I think Codeplug nailed it. Single stepping the code into CRT shows that 0xFFFF is the wide EOF character so your workaround isn't the way to go.
But as the file is being opened in binary mode, should the bit contents of the file or what is being read/written make any difference? I'm inclined to agree with Philip that it looks like a bug in wfstream when in binary mode.
-
June 27th, 2013, 09:22 AM
#10
Re: writing and reading bytes - a Visual Studio issue ?
Originally Posted by 2kaud
But as the file is being opened in binary mode, should the bit contents of the file or what is being read/written make any difference? I'm inclined to agree with Philip that it looks like a bug in wfstream when in binary mode.
No, binary mode doesn't change the fact that the streams read and write elements of type char_type, interpreted according to the corresponding char_traits specialization. In this case, the implementation's choice to define char_traits<wchar_t>::int_type as unsigned short (with 0xFFFF used as EOF) looks legitimate to me. Moreover, I have no experience with character sets, but the resulting behavior of "ignoring" the noncharacter 0xFFFF is consistent with the UTF-16 spec, isn't it?
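The collision can be checked directly through char_traits. This is only a sketch; collides_with_eof is a made-up helper name. With Visual C++, wchar_t is 16 bits wide and WEOF is 0xFFFF, so wchar_t(0xFFFF) maps onto the end-of-file marker. On platforms with a 32-bit wchar_t, WEOF is typically a different value and 0xFFFF does not collide.

```cpp
#include <cwchar>   // WEOF
#include <string>   // std::char_traits

// True if wc, once converted to the stream's int_type, is indistinguishable
// from the end-of-file marker. collides_with_eof is a hypothetical helper.
bool collides_with_eof(wchar_t wc)
{
    typedef std::char_traits<wchar_t> traits;
    return traits::eq_int_type(traits::to_int_type(wc), traits::eof());
}
```

Calling collides_with_eof(static_cast<wchar_t>(WEOF)) reports a collision on any platform, while an ordinary character such as L'A' does not collide. A stream buffer's overflow() treats an argument equal to eof() as "no character to write", which is a plausible reason the value disappears during the write rather than the read.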