Thread: CSV Parser

  1. #1
    Join Date
    Jun 2012
    Posts
    1

    CSV Parser

    I was hoping someone might be able to help me with this scenario.

    You have a text file in the following format:

    Each record contains six fields. Each field is enclosed in quotation marks and the fields are delimited by commas: "VALUE","VALUE", etc.

    There is no additional delimiter between records, simply one long string of text. You have no idea how many records are in the file. There is absolutely no restriction on what can be contained in the fields themselves other than the length can range from 0 characters to 32000 characters per field. That is, any field can contain any combination of characters (including delimiters) in any order, such as "","","","", being the value of a field.

    How difficult would it be for an average programmer to write a CSV parser that can reliably and efficiently parse any file in this format that a team of testers could theoretically throw at it?

  2. #2
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: CSV Parser

    Not possible; the format is ambiguous. Suppose I wish to encode a single record with six fields: five of them empty and the other consisting of the three characters ","

    Suppose the first field is the "," field, then this would be encoded as:

    "","","","","","",""

    Suppose the last field is the "," field, then this would be encoded as

    "","","","","","","" (i.e. the same thing)

    Thus the decomposition of this format is ambiguous.

    If you wish to allow delimiters to be part of a field, you must use escape characters. Once that restriction is met, the task is trivial (<30 minutes, assuming a novice programmer).
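
    The collision can be checked mechanically. This is only a sketch: encode_naive is a hypothetical helper that wraps each field in quotes and joins with commas, with no escaping, which is my reading of the format described in the first post.

```python
def encode_naive(fields):
    """Wrap each field in quotes and join with commas, with no escaping."""
    return ",".join('"' + field + '"' for field in fields)


comma_field = '","'  # three characters: quote, comma, quote

first = encode_naive([comma_field, "", "", "", "", ""])
last = encode_naive(["", "", "", "", "", comma_field])

print(first)
print(first == last)  # True
```

    Two different records produce the same text, so no parser can recover which one was meant.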
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  3. #3
    Join Date
    Nov 2011
    Posts
    36

    Re: CSV Parser

    I am really curious about this question, which I am having a hard time understanding. Are you saying you have 6 columns, and one column could be something like:

    "Value1, "Value","1"still","Value1still","Value2","Value3","Value4","Value5","Value6"

    or

    " "," ","Value2","Value3","etc.."
    ^^
    That is value 1?

    Can the text qualifier be changed, by any chance? Or does it always have to be a quote? Because you can always parse the file and, when you run into inconsistent lines, set them aside, then return the multiple possible parses of each such line once the rest is done.

  4. #4
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: CSV Parser

    return multiple combinations of that line parsed
    Yes, but down that pathway lies madness. Suppose you have one field that is composed of:

    ","","","","","

    and all other fields empty then the representation is:

    "","","","","","","","","","",""

    This has 10 commas, any 5 of which could be the real field separators, so it can be split up into fields 10 choose 5 (=252) ways. Surely you want ONE interpretation of each row, not hundreds. Note that if you choose slightly more perverse field contents (say a similar field with 44 commas) then you can easily return millions of possibilities for each record.
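
    A brute-force count makes the blow-up concrete. The sketch below assumes the unescaped format from the first post and treats a split as a choice of which commas act as field separators; count_splits is a hypothetical helper, not anything from the thread. For the 11-pair string it finds 252 readings, which is 10 choose 5.

```python
from itertools import combinations
from math import comb


def count_splits(text, width=6):
    """Count the ways to read text as `width` quoted fields when nothing is escaped."""
    if not (text.startswith('"') and text.endswith('"')):
        return 0
    # A comma can separate fields only if it sits between a closing
    # quote and an opening quote.
    candidates = [
        i for i, ch in enumerate(text)
        if ch == "," and 0 < i < len(text) - 1
        and text[i - 1] == '"' and text[i + 1] == '"'
    ]
    count = 0
    for seps in combinations(candidates, width - 1):
        bounds = [-1, *seps, len(text)]
        # Every field needs room for its own opening and closing quote.
        if all(b - a - 1 >= 2 for a, b in zip(bounds, bounds[1:])):
            count += 1
    return count


record = ",".join(['""'] * 11)        # the 11-pair string above
print(count_splits(record))           # 252
print(count_splits(record) == comb(10, 5))  # True
print(comb(49, 5))                    # 1906884
```

    The last line is the 44-comma case: the record then has 49 commas in total, and choosing 5 of them gives about 1.9 million readings.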

  5. #5
    Join Date
    Nov 2011
    Posts
    36

    Re: CSV Parser

    Well there are many ways to reduce those combinations.

    1. We don't really know what kind of data he is working with. If one of the columns holds dates, say column 4, you can check for the dates with a regex. Once you find the date in column 4, there can only be two more columns after it and three before it. This cuts down the number of combinations significantly.

    2. When I'm not using my command-line flat-file validator, I use a form: if the parser cannot work out the column count on a line, or can only match part of it consistently, it opens a window showing that line and I can tell it visually which text belongs to which column. (Granted, this still requires a manual step, but it is the most effective way. If you want it automated, whoever is supplying the files should format them better, or you have to deal with the files and find ways that work.)


    But like I said, we still don't know what the data is. Because this is a super easy fix if for instance the first column being:
    "","","","","","",
    and second column is a date column:
    6/19/2012 14:37:42
    The date regex won't match any of the six apparent columns inside the ACTUAL column 1. When it finally does match, you have found column 2, so everything before it must be column 1. Of course, if there might be more than one date column, look for another match before making any rash decisions.
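
    Roughly, the date-anchor idea looks like this. It is a sketch under stated assumptions: column 2 is always a date shaped like 6/19/2012 14:37:42, and DATE_FIELD and split_at_date are hypothetical names made up for the illustration.

```python
import re

# Assumption: column 2 always looks like 6/19/2012 14:37:42.
# The pattern includes the '","' that closes column 1 and opens column 2.
DATE_FIELD = re.compile(r'","(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2})"')


def split_at_date(record):
    """Return (column1, date), using the first date-shaped field as the anchor."""
    match = DATE_FIELD.search(record)
    if match is None:
        raise ValueError("no date-shaped field found")
    column1 = record[1:match.start()]  # skip the record's opening quote
    return column1, match.group(1)


column1 = '"","","","","","",'
record = '"' + column1 + '","6/19/2012 14:37:42","c","d","e","f"'
print(split_at_date(record) == (column1, "6/19/2012 14:37:42"))  # True
```

    This only pins down columns 1 and 2. The columns after the date are still ambiguous unless they have anchors of their own, and a date-shaped string inside column 1 would fool the first match.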

    This of course can get extremely annoying to code, but I am used to running into stupiddddddd flat files.


    *Edit:

    BioPhysEngr,

    This is another example why I would not use String.Split. We had talked about it briefly in a previous topic. Using other means can make it possible to parse an ugly separated value file.
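
    String.Split is .NET, but the same failure shows up with Python's str.split, used here only as an analogue. Splitting on every comma ignores the quotes, so a comma inside a field cuts the field in two:

```python
record = '"Smith, John","42"'   # two fields, the first contains a comma

pieces = record.split(",")
print(pieces)        # ['"Smith', ' John"', '"42"']
print(len(pieces))   # 3
```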
    Last edited by Deranged; June 19th, 2012 at 04:43 PM.

  6. #6
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: CSV Parser

    Quote Originally Posted by Deranged View Post
    BioPhysEngr,

    This is another example why I would not use String.Split. We had talked about it briefly in a previous topic. Using other means can make it possible to parse an ugly separated value file.
    Yup! I was thinking of your comment when I was reading this new post.

  7. #7
    Join Date
    Nov 2011
    Posts
    36

    Re: CSV Parser

    But either way, it can be a pain, lol

  8. #8
    Join Date
    Jun 2012
    Posts
    4

    Re: CSV Parser

    Either the quote character must be forbidden within the fields, or it must be escaped, preferably with another quote in front. Otherwise the problem is not solvable.

    So a record whose first field is a single quote character, followed by 5 empty fields, would be encoded thus:
    """","","","","",""

    Provided that, any decent programmer should be able to do this in a few minutes +/- the time to code the I/O portion.
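
    A minimal sketch of such a parser, once quotes inside fields are doubled. Two assumptions of mine that the thread does not settle: records are joined by the same comma that separates fields (if they were simply butted together, the closing quote of one record and the opening quote of the next would look like an escaped quote), and every record has exactly six fields. The function names are hypothetical.

```python
def encode_fields(fields):
    """Quote every field, doubling any quote inside it."""
    return ",".join('"' + f.replace('"', '""') + '"' for f in fields)


def parse_fields(text):
    """Parse a stream of quoted, comma-separated fields with doubled-quote escaping."""
    fields, i, n = [], 0, len(text)
    while i < n:
        if text[i] != '"':
            raise ValueError(f"expected opening quote at offset {i}")
        i += 1
        buf = []
        while True:
            if i >= n:
                raise ValueError("unterminated field")
            if text[i] == '"':
                if i + 1 < n and text[i + 1] == '"':
                    buf.append('"')  # doubled quote: one literal quote
                    i += 2
                    continue
                i += 1               # lone quote: end of field
                break
            buf.append(text[i])
            i += 1
        fields.append("".join(buf))
        if i < n:
            if text[i] != ",":
                raise ValueError(f"expected comma at offset {i}")
            i += 1
            if i >= n:
                raise ValueError("trailing comma")
    return fields


def parse_records(text, width=6):
    """Group the flat field list into records of `width` fields."""
    fields = parse_fields(text)
    if len(fields) % width:
        raise ValueError("field count is not a multiple of the record width")
    return [fields[k:k + width] for k in range(0, len(fields), width)]


print(parse_records('"""","","","","",""'))
# [['"', '', '', '', '', '']]
```

    The field "," that was ambiguous earlier now encodes as """,""" and reads back unchanged.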

  9. #9
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: CSV Parser

    Quote Originally Posted by jmcdtucson View Post
    Either the quote character must be forbidden within the fields, or it must be escaped, preferably with another quote in front.
    To avoid confusion, choosing an escape character other than the field delimiter would probably be simpler. The most common escape character is a backslash: \

  10. #10
    Join Date
    Jun 2012
    Posts
    4

    Re: CSV Parser

    Quote Originally Posted by BioPhysEngr View Post
    To avoid confusion, choosing an escape character other than the field delimiter would probably be simpler. The most common escape character is a backslash: \
    Yeah, that's pretty common, but in 'standard' CSV (there really is no standard) a quote inside a field is escaped by doubling it. Either would work, but if you use a backslash, you have to escape the backslash too.

    http://en.wikipedia.org/wiki/Comma-separated_values
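
    For comparison, a sketch of the backslash variant. It is not any particular standard, and escape_field and decode_fields are hypothetical names. The point is the order in the encoder: the backslash has to be escaped first, or the backslashes added in front of quotes would themselves get doubled.

```python
def escape_field(value):
    """Quote a field, escaping backslashes first and then quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def decode_fields(text):
    """Parse quoted, comma-separated fields that use backslash escapes."""
    fields, i, n = [], 0, len(text)
    while i < n:
        if text[i] != '"':
            raise ValueError(f"expected opening quote at offset {i}")
        i += 1
        buf = []
        while i < n and text[i] != '"':
            if text[i] == "\\":
                i += 1  # take the character after the backslash literally
                if i >= n:
                    raise ValueError("dangling escape")
            buf.append(text[i])
            i += 1
        if i >= n:
            raise ValueError("unterminated field")
        i += 1  # closing quote
        fields.append("".join(buf))
        if i < n:
            if text[i] != ",":
                raise ValueError(f"expected comma at offset {i}")
            i += 1
            if i >= n:
                raise ValueError("trailing comma")
    return fields


fields = ['a"b', "c\\d", '","', "", '\\"', "x"]
encoded = ",".join(escape_field(f) for f in fields)
print(decode_fields(encoded) == fields)  # True
```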

  11. #11
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,016

    Re: CSV Parser

    Quote Originally Posted by jmcdtucson View Post
    Yeah, that's pretty common, but in 'standard' CSV (there really is no standard) the quote is escaped with a double quote.
    Ah, I didn't know that. Madness! :-) Thank you for educating me, then.
