Read Extremely large file efficiently in C#. Currently using StreamReader

  1. #1
    Join Date
    May 2012
    Location
    Earth!
    Posts
    9

    Read Extremely large file efficiently in C#. Currently using StreamReader

    I have a JSON file that is 50 GB and beyond. Below is what I have written to read a very small chunk of the JSON; I now need to modify it to read the large file.

    Code:
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.Serialization.Json;
    using System.Text;
    using System.Xml;

    internal static IEnumerable<T> ReadJson<T>(string filePath)
    {
        DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
        using (StreamReader sr = new StreamReader(filePath))
        {
            string line;
            // Read one JSON record per line until the end of the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
                XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
                var myPerson = ser.ReadObject(jsonReader);
                jsonReader.Close();

                yield return (T)myPerson;
            }
        }
    }
    Would it suffice to specify the buffer size when constructing the StreamReader in the current code?
    Please correct me if I am wrong here: the buffer size basically specifies how much data is read from disk into memory at a time. So if the file is 100 MB with a 5 MB buffer, it reads 5 MB into memory at a time until the entire file has been read.
    Assuming my understanding above is right, what would be the ideal buffer size for such a large text file? Would int.MaxValue be a bad idea? On a 64-bit PC, int.MaxValue is 2147483647; I presume the buffer size is in bytes, which works out to about 2 GB, and that alone could take time to fill. I was looking at something like 100 MB - 300 MB as the buffer size.
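    To see the mechanics of that, the buffer size can be passed straight to the StreamReader constructor. A minimal sketch (the helper name OpenWithBuffer is mine, not from the thread):

```csharp
using System.IO;
using System.Text;

// Hypothetical helper: open a reader with an explicit internal buffer size.
static StreamReader OpenWithBuffer(string filePath, int bufferSizeBytes)
{
    // bufferSize is in bytes; StreamReader refills the buffer from disk as
    // it is consumed, so memory use stays near bufferSizeBytes no matter
    // how large the file is.
    return new StreamReader(
        filePath,
        Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true,
        bufferSize: bufferSizeBytes);
}
```

    So yes, specifying the buffer size alone keeps memory bounded; the open question in the thread is only what size performs best.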
    Last edited by BioPhysEngr; August 24th, 2012 at 11:38 PM. Reason: change quote tags to code tags

  2. #2
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,006

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    Looks OK to me. Although I usually write this loop as:

    Code:
    using (StreamReader r = new StreamReader(path))
    {
        while (!r.EndOfStream)  // This is a little cleaner to my eyes
        {
            string line = r.ReadLine();
            // Processing
        }
    }
    I would not choose a 2 GB buffer (I think the per-process limit on 32-bit systems is only about 3 GB). Also, huge memory consumption could cause thrashing (swapping memory in and out of the hard drive in a multitasking environment), which would negate any performance benefit. 100 MB should be entirely adequate.
    Best Regards,

    BioPhysEngr
    http://blog.biophysengr.net
    --
    All advice is offered in good faith only. You are ultimately responsible for effects of your programs and the integrity of the machines they run on.

  3. #3
    Join Date
    Apr 2010
    Posts
    131

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    Don't read line by line if you don't need to -- use ReadBlock() instead.

  4. #4
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,006

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    Do you have data to suggest ReadBlock performs better than ReadLine?

  5. #5
    Join Date
    Apr 2010
    Posts
    131

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    There are tons of factors that come into play: the type of drive, the interface, fragmentation, how long each "line" is, free memory and caching, and so on. If the lines are small, ReadBlock will cause one IO exchange for, say, 100 MB of data, whereas ReadLine will require a million or more. Generally (and I do mean generally, given all the factors involved), the fewer round trips, the better.

    Since you are going to buffer anyway, ReadBlock also seems easier for your task: you don't have to measure bytes, you can just request a 100 MB block at a time. Converting ReadLine to ReadBlock takes about five lines of code. I'm interested to see whether it makes a difference in your particular super-large-file scenario; if you decide to try it, please let me know. Otherwise, just a friendly suggestion. =)
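    A minimal sketch of that chunked ReadBlock loop (the method name, chunk size, and processChunk callback are illustrative, not from the post):

```csharp
using System;
using System.IO;

// Read the file in fixed-size character chunks and hand each chunk to a
// processing callback, instead of paying one call per line.
static void ReadInChunks(string filePath, int chunkChars, Action<char[], int> processChunk)
{
    char[] buffer = new char[chunkChars];
    using (var reader = new StreamReader(filePath))
    {
        int charsRead;
        // ReadBlock loops internally until the buffer is full or the stream
        // ends, so charsRead < chunkChars only happens on the final chunk.
        while ((charsRead = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            processChunk(buffer, charsRead);
        }
    }
}
```

    One caveat: a chunk boundary can split a line (or a JSON record) in half, so the callback has to stitch chunk edges together itself, which ReadLine handles for free.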


  7. #7
    Join Date
    Apr 2010
    Posts
    131

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    Okay, I got curious. I created a text file with between 10 and 50 random characters per line, totaling 13,526,630,400 bytes, and wrote four methods:

    METHOD A uses ReadLine a million lines at a time, appending to a StringBuilder. After each million lines, the StringBuilder's length is captured and the StringBuilder is nulled. This simulates basic processing on the same thread;
    METHOD B uses ReadLine with no processing, just reading the file (as if processing happened on another thread);
    METHOD C is the same as A but uses ReadBlock instead of ReadLine, in 100 MB chunks;
    METHOD D is the same as B but with ReadBlock instead of ReadLine.

    Tests were run in random order three times each with average results shown.

    On a Vertex 3 SATA 6 Gb/s SSD:

    Method A: 101 seconds, 77% processor thread;
    Method B: 82 seconds, 54% thread;
    Method C: 91 seconds, 38% thread;
    Method D: 75 seconds, 30% thread.


    So on my machine, ReadBlock cuts run time by roughly 10%, with an equivalent reduction in processor usage. Not only is there evidence that ReadBlock is faster than ReadLine in this situation, but also that the Read() call itself is faster than even basic processing of the results: moving the processing to another thread saves about 20% of the time. Overall, combining ReadBlock with off-thread processing would (in this case) produce a 26% time saving on a 13 GB file.
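    One way to sketch the reader/processor split that Methods B and D simulate is a bounded producer/consumer hand-off; the method name, chunk size, and process delegate below are illustrative, not the benchmark code:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// One task reads chunks off disk; another consumes them, so IO and
// processing overlap instead of alternating on one thread.
static void ReadAndProcess(string filePath, int chunkChars, Action<string> process)
{
    // Bounded capacity keeps memory in check: the reader blocks if it
    // gets more than a few chunks ahead of the processor.
    using (var queue = new BlockingCollection<string>(boundedCapacity: 4))
    {
        var processor = Task.Run(() =>
        {
            foreach (string chunk in queue.GetConsumingEnumerable())
                process(chunk);
        });

        using (var reader = new StreamReader(filePath))
        {
            char[] buffer = new char[chunkChars];
            int read;
            while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
                queue.Add(new string(buffer, 0, read));
        }
        queue.CompleteAdding();  // signal end of input, let the processor drain
        processor.Wait();
    }
}
```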


    Your mileage may vary - happy coding!!
    Last edited by mrgr8avill; August 26th, 2012 at 09:51 PM.

  8. #8
    Join Date
    Feb 2011
    Location
    United States
    Posts
    1,006

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    Quality analysis, thanks!

  9. #9
    Join Date
    Apr 2010
    Posts
    131

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    You're welcome. With a 50 GB file to process and your username, I figured your work was probably more important than me catching "Army Wives" live, lol. Hope you see similar performance gains, and I'll keep my eye out for anything faster in the .NET Framework. Cheers!

  10. #10
    Join Date
    Jul 2012
    Posts
    90

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    According to the research of Peter Kukol and Jim Gray (Microsoft Technical Report MSR-TR-2004-136) in their report "Sequential File Programming Patterns and Performance with .NET", optimal throughput is achieved with a buffer of 64 KB (65,536 bytes). Above this point, the cost in processor cycles per byte read causes a slight decrease in throughput.
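    As a sketch of applying that finding (OpenSequential is a hypothetical helper; the FileOptions.SequentialScan flag additionally hints the OS cache manager that access will be sequential):

```csharp
using System.IO;
using System.Text;

// Open the underlying FileStream with a 64 KB buffer, per the report's
// finding, and hint sequential access so the OS can read ahead.
static StreamReader OpenSequential(string filePath)
{
    const int BufferSize = 64 * 1024;  // 65,536 bytes
    var stream = new FileStream(
        filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
        BufferSize, FileOptions.SequentialScan);
    return new StreamReader(stream, Encoding.UTF8, true, BufferSize);
}
```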
    Last edited by CGKevin; August 30th, 2012 at 07:38 AM.

  11. #11
    Join Date
    Jul 2012
    Posts
    90

    Re: Read Extremely large file efficiently in C#. Currently using StreamReader

    I probably should have elaborated on that last post. It is due to the way native Windows IO handles buffering. When .NET calls the CreateFile API passing a buffer size (during creation of the stream), Windows actually creates three buffers of that size: the post-read buffer, the read buffer, and the read-ahead buffer. Shuffling bytes between these buffers creates a per-byte processor-cycle cost for reading from disk. Even if you use the unbuffered stream class in .NET, this native buffering (using a default buffer size of, I believe, 8 KB) still takes place. The only way to have truly unbuffered IO in .NET is to obtain your file handle from a CreateFile API call (using the FILE_FLAG_NO_BUFFERING flag) and pass that handle to the stream constructor.

    The upshot is that above a buffer size of 64 KB you experience diminishing returns on throughput.
    Last edited by CGKevin; August 29th, 2012 at 04:13 PM.
