Like many of you, sometimes I have to collaborate with others.
I've been given the task of importing a .csv file with various customer information, and performing validation on some of the fields.
Currently validation happens on specified fields of each record (there are multiple fields) within a foreach loop. Something along the lines of this:
foreach record in records
    find item 1 within record
    validate item 1
    if item 1 passes validation
        add item 1 to db
    if item 1 doesn't pass
        add a validation message to a list
    find item 2 within record
    validate item 2
    if item 2 passes validation
        add item 2 to db
    if item 2 doesn't pass
        add a validation message to a list
    ... and so on, until the end of the text file
Most of the validation can be done with a regex (numeric fields). Any record that doesn't pass validation doesn't get added to the db - but I will need to retain the original record along with a validation error message. Seems simple enough, right?
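The regex part plus "keep the original record and a message" could be sketched like this (Python for illustration only; the field names and patterns here are made up, not my real rules):

```python
import re

# Hypothetical patterns for the numeric fields -- substitute the real rules.
FIELD_PATTERNS = {
    "customer_id": re.compile(r"\d{6}"),
    "phone": re.compile(r"\d{10}"),
}

def validate_record(record):
    """Return a list of validation messages; an empty list means the record passed."""
    errors = []
    for field, pattern in FIELD_PATTERNS.items():
        value = record.get(field, "")
        if not pattern.fullmatch(value):
            errors.append(f"{field}: value {value!r} failed validation")
    return errors

# Tiny made-up sample standing in for the parsed .csv rows.
records = [
    {"customer_id": "123456", "phone": "5551234567"},
    {"customer_id": "12A456", "phone": "5551234567"},
]

valid_rows, rejected = [], []
for record in records:
    errors = validate_record(record)
    if errors:
        rejected.append((record, errors))  # keep the original record + its messages
    else:
        valid_rows.append(record)
```

Because the rejected list pairs each original record with its messages, nothing about the source row is lost when it fails.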
The last part of the validation involves checking a db table to see if the item exists. This would happen after the other regex checks. Since the text file could potentially contain tens of thousands of records, the concern is making tens of thousands of round trips to the database, one validation query per record. The db table itself contains 1 char(6) field and 1 int field and has approx 150,000 rows.
One alternate suggestion is to import the file, do the regex validation record by record in the foreach loop, and add the records to the db *before* the last validation check. The last check would then be a single query that retrieves all of the just-added records that don't exist in the lookup table (not existing means they're not valid). That eliminates the round trips to the db: one call validates the entire list instead of one item at a time. Then, of course, the offending records would have to be deleted from the db again.
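That single validation query would be an anti-join, something along these lines (sketched here with Python/SQLite and made-up table and column names, just to show the shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE imported (code TEXT, qty INTEGER);  -- rows just inserted from the file
    CREATE TABLE lookup   (code TEXT, qty INTEGER);  -- the existing char(6)/int table
    INSERT INTO imported VALUES ('ABC123', 1), ('ZZZ999', 2);
    INSERT INTO lookup   VALUES ('ABC123', 1);
""")

# One round trip: everything just inserted that has no match in the lookup table.
offenders = conn.execute("""
    SELECT i.code
    FROM imported i
    LEFT JOIN lookup l ON l.code = i.code
    WHERE l.code IS NULL
""").fetchall()
# offenders -> [('ZZZ999',)]

# ...and then those offending rows have to be deleted again.
conn.execute("DELETE FROM imported WHERE code NOT IN (SELECT code FROM lookup)")
```

The delete-afterwards step is exactly the part that feels wasteful about this approach.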
The problem I'm having with that suggestion is matching the rows that fail the last db check back to the original records they came from in the text file -- remember, I have to maintain a list of the original offending records and a validation message (explaining why each record wasn't added).
I'd really like to perform all of the validation checks before adding anything to the db. Since the repeated db calls are a concern, I wondered about reading the entire lookup table (the one I have to check existence in) into memory, holding on to it for the duration of processing the imported text file, and checking items against it while traversing the records in the foreach loop. Maybe use a list (can lists even hold that many items?), and check whether the list contains the item (or doesn't, in this case).
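The in-memory idea is simple to sketch (Python here with made-up names; in .NET a HashSet&lt;string&gt; rather than a List would be the analogue, since a hash set gives O(1) membership checks instead of a linear scan over 150,000 items):

```python
import sqlite3

# Stand-in for the real lookup table (char(6) key + int).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup (code TEXT, qty INTEGER);
    INSERT INTO lookup VALUES ('ABC123', 1), ('DEF456', 2);
""")

# One query up front: pull all ~150,000 char(6) keys into a set.
known_codes = {row[0] for row in conn.execute("SELECT code FROM lookup")}

# Existence checks inside the foreach loop are then in-memory, O(1) each:
record_passes = "ABC123" in known_codes   # True
record_fails  = "XYZ000" in known_codes   # False
```

For scale: 150,000 six-character strings are under 1 MB of raw character data, and even with per-string object overhead the in-memory structure should stay at a few MB, consistent with the 3 MB the table uses on disk.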
The table uses 3 MB of storage space in SQL. I don't know what that translates to in terms of memory usage within the program this is happening in, but I'd like to have a solid argument for using this method instead of the suggested alternative I wrote about above.
Will my idea (reading the entire table into memory) be more efficient than repeated calls to the db? And is it better than adding records and then deleting the ones that fail the validation check?
Perhaps storing the table in cache?
I'm open to any other suggestions.
Last edited by gdev; June 8th, 2012 at 07:42 AM.
Reason: Additional idea
I do very similar stuff at work all day. I actually have a dll that does an entire flat file validation. The dll reads a custom XML definition made for each individual file I insert into a database. In my XML it has <FieldSeparator/>, <TextQualifier/>, <FixedWidth/>, <FirstRowIsHeader/> and a couple other smaller features. Then I have
<Column name="" index="" maxLength="" minLength="" width="" required="" trim="" regex="" />
for the column validations. So I have an XmlReader that grabs all the info set within the XML and goes through my file. I actually use a for loop with ints (start, end, cursor), grab each separated value, and check for data within qualifiers if there are any. If a value fails the regex that was set in the XML, my validator grabs and holds the error; then after the file is completely processed it sends everything to a text file, with the specific error that caused the failure, for example: Column Count Mismatch, Failed Regex, Length of data is too long = ##, etc.
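A stripped-down version of that XML-driven validation might look like this (Python for illustration; the attribute names mirror the <Column/> element above, while the column definitions, separator, and error texts are made up):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical layout definition using the same attributes as the <Column/> element.
LAYOUT = """
<Layout>
  <Column name="code" index="0" maxLength="6" minLength="6" required="true" regex="[A-Z0-9]+" />
  <Column name="qty"  index="1" maxLength="9" minLength="1" required="true" regex="\\d+" />
</Layout>
"""

COLUMNS = ET.fromstring(LAYOUT).findall("Column")

def validate_line(line, sep=","):
    """Validate one delimited line against the XML layout; return error messages."""
    fields = line.split(sep)
    if len(fields) != len(COLUMNS):
        return ["Column Count Mismatch"]
    errors = []
    for col in COLUMNS:
        name = col.get("name")
        value = fields[int(col.get("index"))].strip()
        if col.get("required") == "true" and not value:
            errors.append(f"{name}: required field is empty")
        elif len(value) < int(col.get("minLength")):
            errors.append(f"{name}: Length of data is too short = {len(value)}")
        elif len(value) > int(col.get("maxLength")):
            errors.append(f"{name}: Length of data is too long = {len(value)}")
        elif not re.fullmatch(col.get("regex"), value):
            errors.append(f"{name}: Failed Regex")
    return errors
```

Held errors like these can then be written to a text file after the whole file is processed, as described above.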
Very easy to validate the file. Now, for the SQL issue: I do not recommend loading tables into memory, though the reason I don't is that some of the tables I work with can be anywhere between 1 MB and 2 GB+. For the tables I need to add non-existing records to, I have an automation/SQL job that is run daily every morning. Each file is assigned to a TaskID, with the validation XML in one column plus a couple of other columns. When the job kicks off it grabs all this info from SQL and uses an SSIS package; the data is validated before it hits SSIS, and if validation fails, that task fails. If it passes, it continues on to the SSIS package, and from there there are a multitude of ways to look up data in SQL or update tables and output a log of what already existed, etc.
Well that's my two cents, lol
Last edited by Deranged; June 10th, 2012 at 01:55 PM.