I was wondering what the limitations are (performance and otherwise) for in memory collections.
I am building a new solution which will hold a lot of data. Normally I would use SQL Server to store the data. The solution is basically a business management solution. Storing customer and employee data. Processing sales, and various other things.
I have a book on WPF and its showing that that each class, customer, employee, transactions. etc, are being stored in a List<> which is serialized and saved. Instead of using 3rd party databases with my software.
I'm wondering how this works for a real-time program. would there be any performance difference between having say, 50,000 records in SQL vs. the same amount in a list collection? I'm assuming that when I desearialize the class all records are loaded into memory. How much RAM would this program use having 50k records in a list at run-time?
What other options do I have? I don't like using the open source MySQL and MS SQL is ungodly expensive. If I use It I will be limiting my software sales to only rich people, leaving almost all small businesses out.
Is using a database server the only option I have?
BigEd781
May 8th, 2009, 06:47 PM
How big are these records? If you want it to scalable, and if the information / database design would be non-trivial, use a database. Otherwise, your application may work for a while, but when someone has a 'database' of 500,000+ records it may just explode trying to load it all into memory.
Arjay
May 8th, 2009, 09:27 PM
SQL server doesn't have to be expensive - check out the SQL Server Compact Edition (formerly known as MSDE).
nelo
May 9th, 2009, 03:49 AM
What other options do I have? I don't like using the open source MySQL and MS SQL is ungodly expensive.
Just to add to what Arjay has said have a look at SQL Server 2005 or 2008 Express Edition. It is completely free. And Microsoft also gives you the Management Studio for free as well.
With regards to performance I think the database option would be best. Databases have been specially designed for large data. For example you can define indexes that make data retrieval easier. If you were looking for a particular record in a collection of type List<T> the lookup time will grow proportionally with the number of records. You would actually have to go through each record to find what you are looking for. You can off course consider using Dictionary<> where you wouldn't have a performance hit on look ups. But not everything maps to a dictionary. Ultimately you can design your application so that easily replace the data access layer. You can have a façade that exposes the functionality of the data access layer. Then you can actually have the two implementations and through configuration you can switch which implementation you use (database or xml files with in memory collections).
Incidentally you can still use the serialization mechanism and store the data in the database instead of data files. Also for the data does not change frequently you can cache it. It is generally quicker to do a lookup in memory as opposed to going all the way to a database and back every time you need some data.
pgrammer
May 9th, 2009, 11:06 AM
I don't know anything about how to cache data. I know that SQL Express is free but it has a 4GB limit and I'm not sure how long that would take to use up. Just the schema of my database with tables, indexes, stored procedures, etc, is about 10MB with no data in it.
I might go with SQL Server, I guess MySQL is free as well, but I've heard a lot of stories about how it doesn't always work correctly, it doesn't know how to handle Guid's which is a major part of my software, every record has a static Guid assigned to it so it can be looked up.
You mentioned indexes in database. If I go that way instead of a List or XML, I'm not 100% sure how those work. If its just another copy of the record in a index location or what. One thing that has always confused me was about indexing for example,
SelectEmployees
@_id uniqueidentifier
SELECT
FirstName,
LastName,
HireDate,
SocialSecurityNumber
FROM
Employees
WHERE
ID = @_id
If ID is a primary key, does it automatically use the index when looking this up, or do I need to call it manually, and second, so I need to add a seperate index for each of the fields (first,last,hire,etc...) included in that query? I've also noticed that you can add multiple fields to a single index, I don't know why. I'm not even sure if PrimaryKey and Index are the same thing. Then there is UniqueKey. Are they all indexes?
And lastly, If I had the following simple tables
TABLE Customers
ID uniqueidentifier
FirstName nvarchar(20)
LastName nvarchar(20)
TABLE CustomerNotes
note_id uniqueidentifier (needed for updating this note later)
customer_id uniqueidentifier
dateEntered datetime
noteText nvarchar(1500)
If I created a relationship between those two tables, Customer.ID would be the (i think its called) primary key, and CustomerNote.Customer_ID would be the forgin key?
I think in the long run maybe using List<> would be a bad idea, but using a database does have a couple downsides, the cost (which maybe can be avoided like you said) and also anyone that buys my software will have to get a copy of MS SQL Server and install it. What if they dont have a deticated machine to use for SQL Server, and what if they don't know how to setup the server. They just want to install my program and run it like any other.
I don't think that any of my data (Customers, Venders, Inventory, Employees) that I will save will ever get over 10,000 (probably not even half that) records each so if that could be managed nicely in a collection or XML file, it might be the way to go. The only possible object that will get over 10,000 is Transactions (sales they make on my cash register part of the program) but that shouldn't be a problem because my program (like any other cash register) will have a "Open Day" and "Close Day" function, or a "Open Week", "Close Week" function. At that time processes are ran and reports generated. Also the existing Transactions can be moved to a new serialized file of the class with a file name like "04102009.trans" (and that name is the week or day ending date). This file is moved to an "Archives" directory and then a new file is created so that the next time the register is open the Transactions will be reset back to 0 records. You can still lookup those old transactions if needed. Because each transaction has a serial number which will include the date and time of the transaction, and will be a scannable barcode on the receipt. When that barcode is scanned the software will know what file to look in, even if its located in an archive file.
I really cannot see the records getting over 10,000 for each class. However, I could design my program correctly like you were saying and if I notice that records are reaching that limit simply create SQL Database to hold the data at that time, and create a SQLProvider, instead of a SerialProvider. Drop that new DLL in the application path and update one setting in a configuration file. The program will then use the database for everything. I would create a simple utility that would take all of the existing records and flash them into the database and out of the file system.
Let's say I created my class and loaded 10,000 instances of it into a List<> just to test response time of read/write/search. How would I figure out how much RAM that List<> object is taking up? Is there a way to do that?
dglienna
May 9th, 2009, 02:23 PM
Your C# app can automatically deploy and set up MSQL Express versions. (I accidentally deployed it on a server that had SQL 2005 on it) Of course it didn't run correctly.
The other problem is if you have more than one terminal then you'll NEED a way to keep track of who did what and when. SQL can do that.
I don't think the actual size of a schema matters. Hard to believe that a simple cash register app won't fit.
pgrammer
May 9th, 2009, 03:22 PM
The other problem is if you have more than one terminal then you'll NEED a way to keep track of who did what and when. SQL can do that.
How can SQL do that for me? I didn't know it could.
nelo
May 9th, 2009, 04:35 PM
How can SQL do that for me? I didn't know it could.
Each time you make a connection SQL Server keeps a record of the machine that initiated that connection, what credentials where used, etc.
My first thought would have been to recommend a database. As stated earlier by dglienna you can easily deploy the SQL Server Express along with your application. However there might another option for you: SQL Server Compact Edition. This is an even more lightweight solution. Basically you would be deploying the database file and a few dlls (3 I think) that would allow you to connect to the database using standard ADO.NET. There a few limitations with this option. I think you can't run any DDL commands like creating tables, indexes at runtime. I'm not even sure it supports stored procedures. But then again it is very light. You would deploy it very easily. But I think to begin with you should go for the file based approach storing the serialised data in Xml files. If that is not good enough for some clients you could have another version of the application that would require a database. Talking about databases have you thought about MS Access? I know many don't rate it but it should do fine for a cash register application...:) Anyway so you could have two versions. Customers with higher data demands would recognise that more capability would come at an extra cost. I'm not aware of limitations of MySQL with regards to Guids (I've no experience with MySQL other than installing it). Bear in mind that it is an open source program. There might be people looking into those issues even now. Also I'm not entirely sure why you need to use guids in your system. A good reason for using Guids is where you can have data coming from different sources (often disconnected) or you need to merge the data (to keep it synchronised) or move it around without losing the uniqueness of each record. However if you have a setup whereby the application is always connected to the database you could do with autoincrementing integers for your primary keys (something which SQL Server and other databases do for you).
BigEd781
May 9th, 2009, 10:17 PM
I don't know anything about how to cache data. I know that SQL Express is free but it has a 4GB limit and I'm not sure how long that would take to use up. Just the schema of my database with tables, indexes, stored procedures, etc, is about 10MB with no data in it.
Well, I think you answered your own question. If you think that the database will grow to > 4GB, how would you expect to load that into a bunch of collections (i.e., into memory at one time)? You need to use a database if you are talking about any moderate amount of information.
dglienna
May 9th, 2009, 10:38 PM
Maybe a week later?
Arjay
May 10th, 2009, 03:50 PM
You mentioned indexes in database. If I go that way instead of a List or XML, I'm not 100% sure how those work.You'll need to read up on table indexes in relational databases. One thing nice about SQL server is that it comes with a tuning wizard that helps you optimize (and suggest) which columns to index.
As far as choosing between a db and a list, by the sounds of it, the work you are trying to do screams DATABASE. Consider that this kind of work is what db's do and they are designed and optimized for this. If you went with the list route, you would specifically need to write additional code if you wanted any of the traditional database features (such as indexes, transactions, acid behavior, etc.).
As far as scalability, practically you would never be able to reach 4GB with and in memory list (or xml), so if you are expecting these larger sizes, a db (that is designed for this) probably is the only option.
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.