[RESOLVED] Read special characters like Swedish å ä ö using fgetc

**alderaan** · June 7th, 2011, 07:36 PM

I cannot use fgetc to read special characters like the Swedish å ä ö. I would like to read character by character using fgetc, show each character's decimal ASCII value in a console window, and then write the result to a file called ascii.txt.

So my question is: how do I specify the read encoding for special characters like the Swedish å ä ö, or better, use the default character set of the system the program runs on? I guess I should use UTF-7/UTF-8, or the "default" (Unicode) encoding, but how can I specify that when using fgetc?

I've tried to google this and haven't found the answer. I would be really happy if you could help me.

I'm using a very simple ANSI .txt file as an example. My whatever.txt contains only aåäöb.

The result is "97 -27 -28 -10 98", written to the console and to a .txt file.

The desired result is "97 134 132 148 98" (general extended ASCII).

Yes, I'm very new to programming in C/C++, so I would appreciate it if you could explain with that in mind.

I use Visual Studio 2010 as my compiler.

Please help me. Here is my full code:

#include "stdafx.h"
#include <stdio.h>
#include <fstream>

int print;
char characters[10];
char SPACE[] = " ";
int toint;
char text;
FILE *open, *converttoascii;

int main(int)
{
    open = fopen("whatever.txt", "r");
    converttoascii = fopen("ascii.txt", "a");

    do
    {
        text = fgetc(open);

        if (text == 'ÿ')
            break;

        if (text > 255)
            break;

        print = text;
        printf("%d ", print);

        toint = text;
        itoa(toint, characters, 10);

        fwrite(characters, strlen(characters), 1, converttoascii);
        fwrite(&SPACE, strlen(SPACE), 1, converttoascii);
    } while (text != EOF);

    printf("\n\nDONE!!\n\n");
    system("pause");

    return 0;
}

**Eri523** · June 8th, 2011, 07:41 AM

In Visual C++ the char data type is signed by default (the C++ standard leaves this to the implementation), so with 8 bits it can only represent numbers from -128 to 127, and any character with a code above 127 will come out as a negative number.

Try to make text a variable of type unsigned char instead of char and it should work.
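
A minimal sketch of that idea, assuming the file is read byte by byte: fgetc itself returns an int that holds the byte as an unsigned value (0 to 255) or EOF, so keeping the result in an int avoids the negative numbers and also keeps byte 255 (ÿ in the ANSI code page) distinct from EOF. The helper name read_byte_values is made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads up to max_values bytes from `in` and stores
   each one as a non-negative int. Returns the number of bytes stored. */
size_t read_byte_values(FILE *in, int *values, size_t max_values)
{
    size_t n = 0;
    int c; /* int, not char: fgetc returns 0..255 or EOF */

    assert(in != NULL && values != NULL);

    while (n < max_values && (c = fgetc(in)) != EOF) {
        values[n++] = c;
    }
    return n;
}
```

For the ANSI file from the question, opened with fopen("whatever.txt", "rb"), the array would hold 97 229 228 246 98, and each element can be printed with printf("%d ", values[i]).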

Please use code tags when posting code.

Ah, and... Welcome to CodeGuru!

**aamir121a** · June 8th, 2011, 08:40 AM

If that does not work, you can try WCHAR, which is Unicode. If you need to display the characters, please make sure you have Swedish fonts installed.

**aamir121a** · June 8th, 2011, 08:43 AM

Here is the complete code map for Unicode:

http://en.wikipedia.org/wiki/List_of...#Indic_scripts

**Eri523** · June 8th, 2011, 10:18 AM

Originally Posted by aamir121a

If that does not work, you can try WCHAR, which is Unicode [...].

Or, since we're not in the Windows-specific section here, use wchar_t and remain within the C++ standard. The Windows-specific WCHAR is nothing more than a typedef of wchar_t anyway.

**Chris_F** · June 8th, 2011, 10:57 AM

Using wchar_t isn't guaranteed to work either: on Windows it's 16 bits wide, so it would still fail on code points that need more than 16 bits. You're probably better off sticking to UTF-8 and, if you really need to fetch a single code point at a time, using fgetc and inspecting the value to see if it's the first byte of a multi-byte code point. If it is, continue fetching bytes until you have the full code point.
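
A rough sketch of that loop, assuming the file really is UTF-8. read_utf8_code_point is a hypothetical helper, not a library function, and it skips some validation (overlong encodings and surrogate values are not rejected).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one UTF-8 encoded code point from `in`.
   Returns the code point, -1 at end of file, or -2 on malformed input. */
long read_utf8_code_point(FILE *in)
{
    long cp;
    int extra; /* number of continuation bytes that must follow */
    int c;

    assert(in != NULL);

    c = fgetc(in);
    if (c == EOF)
        return -1;

    if (c < 0x80)                { cp = c;        extra = 0; } /* 0xxxxxxx */
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; } /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; } /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; } /* 11110xxx */
    else
        return -2; /* stray continuation byte or invalid lead byte */

    while (extra-- > 0) {
        c = fgetc(in);
        if (c == EOF || (c & 0xC0) != 0x80)
            return -2; /* truncated or malformed sequence */
        cp = (cp << 6) | (c & 0x3F); /* append 6 payload bits */
    }
    return cp;
}
```

If whatever.txt were saved as UTF-8, å would be stored as the two bytes 195 165, and the function would return 229 for it.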

**Eri523** · June 8th, 2011, 03:46 PM

Well, as I interpret TC++PL §4.3, the implementation-dependent size of wchar_t is sufficient to store the largest character set from the implementation's locale.

However, of course that's only safe as long as you can assume that the text files you're reading were written by programs compiled with the same C++ implementation that you use for your app. The size of a wchar_t under VC++ 2010 is 16 bits (as you already seemed to imply), but of course no one stops any programmer from writing out UTF-32 encoded text, which is, for instance, pretty easy in the .NET Framework.

Of course your proposed approach of using UTF-8 exclusively would be safe, but if you are able to enforce this, you can enforce using nothing larger than UTF-16 as well. (UTF-8 also has the advantage of being the "most compatible" one of the Unicode encodings with regard to plain ASCII which is what I like about it specifically, but that's not the topic of this thread.)
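
To make the 16-bit limit concrete, here is a small sketch of how UTF-16 represents a code point: anything up to U+FFFF fits in one 16-bit unit, while anything above is split into a surrogate pair, so a single 16-bit wchar_t cannot hold it. The function name encode_utf16 is an assumption for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
   Writes 1 or 2 units to `out` and returns how many were written,
   or 0 if `cp` is not a valid code point. */
int encode_utf16(unsigned long cp, unsigned short out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* reserved for surrogates, not characters */

    if (cp < 0x10000) {
        out[0] = (unsigned short)cp; /* fits in a single 16-bit unit */
        return 1;
    }

    if (cp <= 0x10FFFF) {
        unsigned long v = cp - 0x10000;                  /* 20 bits left */
        out[0] = (unsigned short)(0xD800 | (v >> 10));   /* high surrogate */
        out[1] = (unsigned short)(0xDC00 | (v & 0x3FF)); /* low surrogate */
        return 2;
    }

    return 0; /* beyond the Unicode range */
}
```

The Swedish letters all take one unit here (å is simply 0x00E5); only characters outside the first 65,536 code points need two.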

**alderaan** · June 9th, 2011, 07:15 PM

This is unfortunately not the solution. I've tried unsigned char.
The result is: 97 229 228 246 98
It should be: 97 134 132 148 98

According to the general extended ASCII table, åäö is 134 132 148, and ÅÄÖ is 143 142 153.

**alderaan** · June 9th, 2011, 07:52 PM

This is a possible solution (it works), but I don't want to have to define the right ASCII decimal for each wrong one using unsigned char or unsigned int.
It is only applicable to åäö and ÅÄÖ.
Of course I want the right ASCII decimal value whatever character set is being used,
for example the Cyrillic or Greek alphabet.

int theChar;
FILE *fp;

if((fp = fopen("whatever.txt", "r")) == NULL)
{
    printf("can't open file\n");
}

while((theChar=fgetc(fp)) != EOF)
{
    if(theChar==229)
        theChar=134;
    if(theChar==228)
        theChar=132;
    if(theChar==246)
        theChar=148;
    if(theChar==197)
        theChar=143;
    if(theChar==196)
        theChar=142;
    if(theChar==214)
        theChar=153;
    printf("%d ", theChar);
}

fclose(fp);

**Eri523** · June 9th, 2011, 08:30 PM

Originally Posted by alderaan

This is unfortunately not the solution. I've tried unsigned char.
The result is: 97 229 228 246 98
It should be: 97 134 132 148 98

According to the general extended ASCII table, åäö is 134 132 148, and ÅÄÖ is 143 142 153.

What you are getting are the byte values of the Windows ANSI code page (Windows-1252), which for these letters are the same as their Unicode character codes (IOW code points), but I must admit I don't have the slightest idea what it is that you want to get. Which "extended ASCII table" are you referring to? Nowadays Unicode should be the preferred choice if you have a choice.

Your if statement approach (that BTW would look much neater if you changed it into a switch statement, but that wouldn't change much about the principle) is something I wouldn't like either. I have no idea why you would, but if you really need character codes of an encoding that's not available from the C++ runtime or the OS (which seems to be Windows), you can use a translation table. It's basically an array that you index with the character code you have and that stores the character code you want at that position, so you can look it up there. You can set that up with some manual work if you have character code tables for both the encodings involved. You seem to have a table of that "extended ASCII", and Unicode tables can be found all over the place.

However, as you also mention the Cyrillic and Greek alphabets, it looks like you would need more than one of these translation tables. Depending on the number of character sets you need to support, this would make the approach impractical from some point on...
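
The translation table described above could be sketched like this. Only the six entries for åäö and ÅÄÖ come from the values posted in this thread; every other byte is passed through unchanged, which is an assumption that will be wrong for most other characters above 127. The names ansi_to_dos, init_ansi_to_dos and translate are made up for the example.

```c
#include <assert.h>

/* Hypothetical table: index = byte as read from the ANSI file,
   value = code for the same letter in the "extended ASCII" table. */
static unsigned char ansi_to_dos[256];

/* Fills the table: identity by default, then the known exceptions. */
void init_ansi_to_dos(void)
{
    int i;
    for (i = 0; i < 256; ++i)
        ansi_to_dos[i] = (unsigned char)i; /* assumption: unchanged */

    ansi_to_dos[229] = 134; /* å */
    ansi_to_dos[228] = 132; /* ä */
    ansi_to_dos[246] = 148; /* ö */
    ansi_to_dos[197] = 143; /* Å */
    ansi_to_dos[196] = 142; /* Ä */
    ansi_to_dos[214] = 153; /* Ö */
}

/* Looks up one byte value (0..255) as returned by fgetc. */
int translate(int byte)
{
    assert(byte >= 0 && byte <= 255);
    return ansi_to_dos[byte];
}
```

After one call to init_ansi_to_dos(), the whole if chain collapses to printf("%d ", translate(theChar)); inside the read loop.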

**alderaan** · June 9th, 2011, 08:39 PM

wchar_t is giving me the same result.

This is unfortunately not the solution.
The result is: 97 229 228 246 98
It should be: 97 134 132 148 98

According to the general extended ASCII table, åäö is 134 132 148, and ÅÄÖ is 143 142 153.

As for UTF-8, I don't know how to implement it in my C code. Please show me how to correct my code to use UTF-8. Another question:
When I create a text document (ANSI, MIME type "text/plain", I guess) by right-clicking my Windows 7 (64-bit) desktop and choosing New > Text Document, do I have a 16-bit, 32-bit or 64-bit encoded .txt file? Maybe you think I'm a stupid lamer; I confess I don't fully understand your help so far, but I don't get why reading a simple text file has to be so complicated in Visual Studio 2010.
I hope you have a good answer to that, and thank you for the welcome to CodeGuru.

**alderaan** · June 9th, 2011, 09:12 PM

I am referring to http://www.asciitable.com/
I don't understand the Unicode character codes.
If you type Alt + the ASCII decimal in a text document created with Microsoft Word, or in simple text editors like Notepad or Write in Microsoft Windows, you get the right character, provided you enter the right Alt + ASCII code.
It would be nice if the same definition applied when reading characters from a console line in C++.
My intention now is to write a conversion table so I can get the right ASCII decimal for each Unicode character in whatever language or character set.
Is an ASCII table country-specific? I don't think so?

The idea behind my questions is that I'm writing an encryption application to "scramble" the text in any text document, using prime-number scrambling of the binary equivalent of each character's ASCII decimal value.

**Eri523** · June 9th, 2011, 11:10 PM

Originally Posted by alderaan

Is an ASCII table country-specific? I don't think so?

Well, yes and no: the original ASCII set, which was based on character codes of only 7 bits, had no country-specific characters at all. And do you really believe all the country-specific characters of all the languages in the world fit into the 8-bit space of just 256 characters (more specifically, the upper half of that space, which comprises just 128 characters, since the lower half still holds the traditional 7-bit ASCII characters)?

That's why code pages were invented. A code page is a mapping of selected characters from a specific language into the very limited upper half of the 8-bit ASCII space. I couldn't find out what code page the table you linked to shows but since it contains all those graphics characters (that BTW eat up a considerable share of the limited character space that thereby becomes unavailable to language-specific characters) it seems to originate in the DOS world. I'm really tired at the moment and don't want to do deeper research on that now, but maybe someone already has jumped in when I'm back.

In turn, all the above was the reason to invent Unicode. It supports an incredible number of available characters, but of course they don't easily fit into 8-bit character code boxes. Therefore Unicode is more complex to handle but I think it's definitely worth the effort: It's not only really versatile, it also is the future (well, at least for the foreseeable part of it).
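
For what it's worth, the values quoted from that table (134, 132, 148 for åäö) match IBM code page 437, the original DOS character set, while 229, 228, 246 are the codes for the same letters in Windows-1252, the "ANSI" code page of a Western European Windows. The sketch below only illustrates how the same byte means different characters under the two code pages; it covers just these six byte values, and the names sample, samples and decode_sample are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A few byte values and the Unicode code point each one stands for
   under two different code pages. */
struct sample {
    int byte;
    long cp437;       /* meaning under IBM code page 437 (DOS) */
    long windows1252; /* meaning under Windows-1252 ("ANSI") */
};

static const struct sample samples[] = {
    { 132, 0x00E4, 0x201E }, /* ä  versus  „ */
    { 134, 0x00E5, 0x2020 }, /* å  versus  † */
    { 148, 0x00F6, 0x201D }, /* ö  versus  ” */
    { 228, 0x03A3, 0x00E4 }, /* Σ  versus  ä */
    { 229, 0x03C3, 0x00E5 }, /* σ  versus  å */
    { 246, 0x00F7, 0x00F6 }  /* ÷  versus  ö */
};

/* Hypothetical lookup: returns the code point for `byte` under the chosen
   code page, or -1 if the byte is not in the sample list above. */
long decode_sample(int byte, int use_cp437)
{
    size_t i;

    assert(byte >= 0 && byte <= 255);

    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        if (samples[i].byte == byte)
            return use_cp437 ? samples[i].cp437 : samples[i].windows1252;
    }
    return -1;
}
```

So byte 229 is å only if the reader assumes Windows-1252; read as code page 437 it would be the Greek letter σ, which is why the byte value alone never identifies a character without knowing the code page.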

**alderaan** · June 16th, 2011, 06:25 PM

Thank you for your help in gaining a better understanding of character encoding. I was hoping there was a method to always get the right ASCII decimal (country and character-set independent), but I now realize it's a lot more complicated than that.

The underlying problem behind my question is now solved.
My primary form application, which asks you to choose a .txt file from an OpenFileDialog, will use UTF-7 encoding for the StreamReader routine and default (Unicode) or UTF-8 encoding for the StreamWriter routine to produce a copy of the .txt file in use.
(I will use this copy later on to replace the .txt file to be scrambled.)

In this case I don't need the right ASCII decimal (as I thought) to achieve the desired result, because the "decryption" goes right back from the Unicode decimal codes, produced by reading character by character with fgetc into a wchar_t, to the desired characters.
When I do a Unicode-integer-to-array conversion, I get the same character array as in the original file.

This code works perfectly, but I don't know how it works with additional character sets and tables. I will try it on the Cyrillic, Greek and Chinese tables as soon as possible.
I guess I have to learn a lot more about this. I will return with an answer.
