What encoding would be best for my XML database?

**Mike Pliam** · November 6th, 2013, 05:18 PM

I have an idea for an XML database. I have been playing around with this for several years and have made considerable progress. Now I need to do some fine-tuning. I am asking your opinion, based on your own experience, of which encoding would be best to use for the XML database document. It seems that the choices are: 1) ANSI, 2) Unicode, 3) Unicode big endian, 4) UTF-8, and 5) UTF-16. Until recently, I thought I understood what each of these encodings meant. But after reading http://en.wikipedia.org/wiki/Utf-8#Description, I am more confused than ever. Mind you, my sole purpose in using an encoding other than ANSI was to enable various 'unicode' characters: Chinese, Hindi, and Cyrillic text. At first, I thought I needed UTF-16 to accomplish this. But I am now inclined to think that even ANSI or UTF-8 might suffice. Unfortunately, I have written a lot of code using UTF-16 encoding and have run into snags when attempting random access of the files (à la SAX). I would greatly value your opinions on this matter. Thanks.

**Arjay** · November 6th, 2013, 10:00 PM

To me, the most important part of a database is how well it does its job: its interface, its language support, its ACID properties, its fault tolerance, and so on. How it stores its data internally is of lesser importance.

In terms of databases, there are many choices available that have already worked through these issues, including several free ones like MS SQL and MySQL.

I am kind of wondering what the motivation would be to roll your own database using xml file storage? I'm not trying to be flip, but this sort of thing seems already very well done and readily accessible.

I guess I am asking what problem are you trying to solve other than doing it yourself?

**OReubens** · November 7th, 2013, 08:08 AM

The default encoding for XML stored as a file is UTF-8.

Internally, XML is always based on Unicode, and all the libraries work with it as such.
Only your APIs may be ANSI, in which case you'll have lots of conversions between ANSI and Unicode.
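As a rough illustration of the point that parsers hand back Unicode regardless of how the file is encoded, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The helper names `element_text` and `make_document` and the sample text are made up for this example; the only assumption about the parser is that it honours the BOM and the encoding declaration, which the bundled expat parser does for UTF-8 and UTF-16.

```python
import xml.etree.ElementTree as ET

# Sample text: Cyrillic "Privet" followed by two Chinese characters.
greeting = "\u041f\u0440\u0438\u0432\u0435\u0442 \u4e2d\u6587"


def make_document(encoding_name: str, codec: str) -> bytes:
    """Build a tiny XML document and serialize it with the given codec."""
    xml = f'<?xml version="1.0" encoding="{encoding_name}"?><msg>{greeting}</msg>'
    return xml.encode(codec)


def element_text(xml_bytes: bytes) -> str:
    """Parse raw XML bytes and return the root element's text as a str."""
    return ET.fromstring(xml_bytes).text


utf8_doc = make_document("UTF-8", "utf-8")
utf16_doc = make_document("UTF-16", "utf-16")  # this codec prepends a BOM

# The bytes on disk differ, but the parsed text is the same Unicode string.
assert utf8_doc != utf16_doc
assert element_text(utf8_doc) == element_text(utf16_doc) == greeting
```

The application code after the parse step never sees the file encoding, which is why the choice mostly matters for tooling and file size rather than for which characters can be stored.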

Unless you have a really good reason not to, keep xml in files stored in UTF-8, many tools won't work properly if you use a different encoding for the files.

"Raw Unicode" XML doesn't really exist. There's always an encoding, be it UTF-8, UTF-16, a code page, or something else.
Both UTF-8 and UTF-16 are encodings of Unicode; they are not raw Unicode. Both let you use the full range of roughly 1.1 million Unicode code points (not all of them assigned yet).
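To make that concrete, the following Python sketch encodes a few individual characters both ways and compares the byte counts. The `SAMPLES` table and the `encoded_lengths` helper are hypothetical names chosen for the illustration; `utf-16-le` is used so that no BOM is counted.

```python
# One representative character per script, written as escapes so the
# example does not depend on the encoding of this source text.
SAMPLES = {
    "Latin A": "A",                  # U+0041
    "Cyrillic Pe": "\u041f",         # U+041F
    "Devanagari Na": "\u0928",       # U+0928
    "CJK zhong": "\u4e2d",           # U+4E2D
    "Emoji (non-BMP)": "\U0001F600"  # U+1F600, outside the 16-bit range
}


def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for name, ch in SAMPLES.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    # Both encodings round-trip every character without loss.
    assert ch.encode("utf-8").decode("utf-8") == ch
    assert ch.encode("utf-16-le").decode("utf-16-le") == ch
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Neither encoding is fixed-width: UTF-8 uses one to four bytes per code point, and UTF-16 uses two or four. That variable width is worth keeping in mind for any random-access scheme that seeks by byte offset.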

**Mike Pliam** · November 9th, 2013, 05:27 PM

Originally Posted by Arjay

what the motivation would be to roll your own database using xml file storage?

XML fits nicely into a database structure, can be used independently of MFC or any other proprietary forms, is readily available and programmable, is useful for web-based applications, is much more readable and editable than binary documents, and besides all that, I just want to do it, no real reason.

Originally Posted by OReubens

Unless you have a really good reason not to, keep xml in files stored in UTF-8, many tools won't work properly if you use a different encoding for the files.

There is a problem with UTF-8 and 16-bit characters. I discovered this by accident when I tried to edit one of my UTF-16 XML files using FrontPage. I only changed UTF-16 to UTF-8. Saving the file resulted in the addition of the three leading bytes of the UTF-8 BOM, EF BB BF, and the conversion of all the wide (16-bit) chars to 8-bit chars. According to http://en.wikipedia.org/wiki/Byte_order_mark:

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5] [6]

Reasons the standard does not advocate the UTF-8 BOM include:
To encourage conversion to Unicode. Usually ASCII-based legacy encodings can be detected because sequences of bytes with the high bit set are unlikely to be UTF-8 sequences. Therefore the BOM is not needed to determine whether the text stream is UTF-8 or not. Using the BOM by convention adds to the work of programming, which would discourage programmers from using it and instead continue to use legacy encodings.[citation needed]
A plain ASCII file is in UTF-8 encoding. Requiring a BOM makes an artificial distinction between ASCII and UTF-8.[citation needed]
A language parser that transparently handles bytes with the high bit set in certain free-text contexts (such as string literals or comments) but otherwise uses a syntax defined only by ASCII characters, is already able to read and process UTF-8 correctly, even if it is not designed for Unicode. However the BOM at the start would violate its syntax and cause a parsing error. This is true of almost all languages written for personal computers and designed to handle legacy encodings such as CP1252.
It defeats software that uses pattern matching on the start of a text file, since it inserts 3 bytes before the pattern. Though commonly associated with the Unix shebang at the start of an interpreted script,[7] the problem is more widespread. For instance PHP will not recognize its leading commands in a page if a BOM is at the start.

Despite this, Microsoft compilers[8] and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

I don't know what will happen when I rewrite my xml files as UTF-8 with 16 bit wchars and no BOM. I'll get back with the results.
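For anyone trying the same conversion, here is a minimal Python sketch of re-encoding a file's bytes as UTF-8 without a BOM. The function name `to_utf8_no_bom` is invented for this example, and it makes two simplifying assumptions: input without a BOM is already UTF-8, and UTF-32 is not handled. It also leaves the XML declaration alone, so an `encoding="UTF-16"` attribute in the prolog would still need to be changed to `UTF-8` (or removed) separately.

```python
import codecs


def to_utf8_no_bom(data: bytes) -> bytes:
    """Re-encode BOM-marked UTF-8 or UTF-16 bytes as BOM-less UTF-8.

    Assumption for this sketch: bytes with no BOM are already UTF-8.
    """
    if data.startswith(codecs.BOM_UTF8):
        text = data.decode("utf-8-sig")   # strips the EF BB BF prefix
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")      # BOM selects byte order, then is dropped
    else:
        text = data.decode("utf-8")
    return text.encode("utf-8")


sample = "<r>\u4e2d</r>"                  # one Chinese character inside a tag
utf16_bytes = sample.encode("utf-16")     # BOM + two bytes per character here
utf8_bytes = to_utf8_no_bom(utf16_bytes)

assert not utf8_bytes.startswith(codecs.BOM_UTF8)
assert utf8_bytes.decode("utf-8") == sample
```

Note that the result contains no 16-bit characters at all: in UTF-8 the ASCII markup takes one byte per character and the Chinese character takes three, yet decoding returns exactly the same text as the UTF-16 original.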

**Arjay** · November 9th, 2013, 08:33 PM

Originally Posted by Mike Pliam

XML fits nicely into a database structure, can be used independently of MFC or any other proprietary forms, is readily available and programmable, is useful for web-based applications, is much more readable and editable than binary documents, and besides all that, I just want to do it, no real reason.

There are plenty of DB clients in C++ that aren't MFC. The ATL consumer classes work well, as does ADO. They all work with a variety of data sources (MS SQL, MySQL, ODBC, Excel, etc.). The thing about going with one of the many free offerings out there, such as MS SQL and MySQL, is that there is already extensive tool support. Other than wanting to do this for yourself, it is kind of tough to justify rolling your own, considering the amount of work involved in the DB, provider, and tools.

**Igor Vartanov** · November 10th, 2013, 03:43 AM

Originally Posted by Mike Pliam

XML fits nicely into a database structure, can be used independently of MFC or any other proprietary forms, is readily available and programmable, is useful for web-based applications, is much more readable and editable than binary documents, and besides all that, I just want to do it, no real reason.

Mike, whatever you believe, how much do you know about database engine internals, whether RDBMS or NoSQL? None of the existing well-known engines is XML-based. Don't you think there's some reason behind this fact?

XML is good for storing hierarchical data, (de-)serializing binary data structures, attributed mark-up, etc., but who told you that XML engines are equally good for database applications? Did you do any benchmarks? Reads? Writes? How many 'records' will it handle before the CPU chokes or RAM is exhausted? Will it provide multi-user access and resolve conflicts? Be transactional? Keep your data safe from corruption on crashes? Did you develop any requirements your 'database' must meet? I don't think so, as your reasoning about independence from MFC, readability, etc. has really nothing in common with the database world.
