
  1. #1
    Join Date
    Dec 2008
    Posts
    49

    Excessively slow file sequence access

    I am having trouble getting good performance from CreateFile/ReadFile when accessing a sequence of files.

    I am using FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN.

    My sequence has files of ~10 MB each, with 200 files in the sequence.

    If I store the same data in a single contiguous file, I can read 40 chunks/sec (400 MB/sec). If I read the files as a sequence, I can read 10 chunks/sec (100 MB/sec).

    I read the files in 3 API calls after opening the files:

    Read header (4kb max)
    Seek
    Read data (~10 MB)
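    In Win32 terms, the three calls above might look like this (a minimal sketch; the 4 KB header size and the data offset are hypothetical placeholders, and with FILE_FLAG_NO_BUFFERING the file offset, read length, and buffer address must all be sector-aligned - VirtualAlloc returns page-aligned memory, which satisfies that):

```cpp
#include <windows.h>

// Sketch of the open / read-header / seek / read-data pattern described
// above. Header size and data offset are hypothetical placeholders.
bool read_chunk(const wchar_t *path, void *data_buf, DWORD data_len)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                           0);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    // page-aligned header buffer keeps NO_BUFFERING happy
    void *header = VirtualAlloc(0, 4096, MEM_COMMIT, PAGE_READWRITE);
    DWORD n = 0;
    bool ok = header != 0 &&
              ReadFile(h, header, 4096, &n, 0) != FALSE;    // 1) read header

    LARGE_INTEGER off;
    off.QuadPart = 4096; // hypothetical sector-aligned data offset
    ok = ok && SetFilePointerEx(h, off, 0, FILE_BEGIN)      // 2) seek
            && ReadFile(h, data_buf, data_len, &n, 0);      // 3) read data

    if (header)
        VirtualFree(header, 0, MEM_RELEASE);
    CloseHandle(h);
    return ok;
}
```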

    How can I increase the performance of this situation?

    I am using Windows XP Pro.

  2. #2
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Excessively slow file sequence access

    High-end SATA SSDs top out at ~200 MB/s (avg. read) - so either your measurements are flawed, or I'm extremely jealous of your storage setup.

    Since we're talking about performance, we'll need some good benchmarks to come to any rational conclusions. Naturally we can assume that opening 200 files isn't going to be as fast as opening a single file - just by virtue of having to call CreateFile 199 extra times. But that overhead could be completely "washed out" compared to the time it takes just to process all that data.

    What are you doing with this data that you're reading?

    gg

  3. #3
    Join Date
    Dec 2008
    Posts
    49

    Re: Excessively slow file sequence access

    Be jealous. It is a RAID array that should be able to do 400 MB/sec.

    I didn't expect the file sequence to be as fast, but I also didn't expect it to be more than twice as slow. I'd be happy with half as fast.

    Right now, I'm doing nothing with the data. I have a multicore CPU and my data processing won't saturate the remaining cores.

  4. #4
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Excessively slow file sequence access

    Taking more than twice as long to read the same amount of data, but spread across 200 files instead of one, does seem a bit fishy - especially on a RAID array. Perhaps we can say more after seeing the file reading code and actual measurements.

    Let's start with something simple:
    Code:
    struct scoped_Timer
    {
        LARGE_INTEGER m_start, m_end, m_freq;
        scoped_Timer()
        {
            QueryPerformanceFrequency(&m_freq);
            QueryPerformanceCounter(&m_start);
        }//constructor
        ~scoped_Timer()
        {
            QueryPerformanceCounter(&m_end);
            printf("Time (ms) = %I64u\n",
                (m_end.QuadPart - m_start.QuadPart) / (m_freq.QuadPart / 1000));
        }//destructor
    };//scoped_Timer
    
    bool stdio_read(const char *pathname)
    {
        size_t len; // fread returns size_t
        unsigned char buffer[8 * 1024];
    
        {//Timer Block
            scoped_Timer timer;
    
            FILE *file;
            if ((file = fopen(pathname, "rb")) == NULL)
            {
                printf("Failed to open file %s\n", pathname);
                return false;
            }//if
    
            while ((len = fread(buffer, 1, sizeof(buffer), file)) )
                ;
    
            fclose(file);
        }//Timer Block
    
        return true;
    }//stdio_read
    What is the size of the one big file? What is the size of each of the 200 smaller files?
    What times do you get for reading the one big file vs. the sum of reading the 200 smaller files?

    gg

  5. #5
    Join Date
    Dec 2008
    Posts
    49

    Re: Excessively slow file sequence access

    Chunks are 10 MB, so the one contiguous file is 2 gig; each file in the sequence is 10 MB.

    It takes 5 seconds to read the single file, 20 seconds to read the sequence.

    Btw, I've read more about the format I am dealing with, and realized it is possible to guarantee the max file size, so I am now reading the entire file in a single call. It looks like this:

    Open file
    Get file size (GetFileSizeEx)
    Read entire file
    Close
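    A minimal sketch of that sequence (buffer management is simplified; with FILE_FLAG_NO_BUFFERING the read length must be a sector multiple, so the request is rounded up here and ReadFile simply stops at EOF - 4096 is assumed to cover the sector size):

```cpp
#include <windows.h>

// Open / get size / read entire file / close, as described above.
bool read_whole_file(const wchar_t *path, void *buf, DWORD buf_len,
                     DWORD *bytes_read)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, 0,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                           0);
    if (h == INVALID_HANDLE_VALUE)
        return false;

    LARGE_INTEGER sz;
    BOOL ok = GetFileSizeEx(h, &sz);

    // round the request up to a 4 KB multiple for NO_BUFFERING
    DWORD want = ok ? (DWORD)((sz.QuadPart + 4095) & ~4095ull) : 0;
    ok = ok && want <= buf_len && ReadFile(h, buf, want, bytes_read, 0);

    CloseHandle(h);
    return ok != FALSE;
}
```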

    Is there some kind of file pool I should be using for this task? Or some way to prefetch metadata for files later in the sequence, if that can be helped by a cache of some kind (file table info?)

  6. #6
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Excessively slow file sequence access

    Code:
    #define _CRT_SECURE_NO_WARNINGS
    #include <windows.h>
    #include <stdio.h>
    
    //------------------------------------------------------------------------------
    
    struct scoped_Timer
    {
        LARGE_INTEGER m_start, m_freq;
        scoped_Timer()
        {
            QueryPerformanceFrequency(&m_freq);
            QueryPerformanceCounter(&m_start);
        }//constructor
    
        DWORD Elapsed_ms()
        {
            LARGE_INTEGER end;
            QueryPerformanceCounter(&end);
            return  (DWORD)((end.QuadPart - m_start.QuadPart) / 
                            (m_freq.QuadPart / 1000));
        }//Elapsed_ms
    };//scoped_Timer
    
    //------------------------------------------------------------------------------
    
    DWORD stdio_read(const char *pathname)
    {
        unsigned char buffer[4 * 1024];
        scoped_Timer timer;
    
        FILE *file;
        if ((file = fopen(pathname, "rb")) == NULL)
        {
        printf("Failed to open file for reading %s\n", pathname);
            return 0;
        }//if
    
        while (fread(buffer, sizeof(buffer), 1, file))
            ;
    
        fclose(file);
        return timer.Elapsed_ms();
    }//stdio_read
    
    //------------------------------------------------------------------------------
    
    bool create_files();
    
    const size_t big_sz = 2147483647; // 2 gig (fits in unsigned)
    const size_t chunk_sz = 1024 * 1024 * 10;
    
    const size_t num_chunks = (big_sz / chunk_sz) + 1; // +1 for good measure
    
    int main()
    {
        if (!create_files())
            return 1;
    
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    
        DWORD big_tm = stdio_read("file_big.bin");
        DWORD chunk_tm = 0;
        
        size_t n = 0;
        for (; n < num_chunks; ++n)
        {
            char fn[32];
            sprintf(fn, "file_%u.bin", n + 1);
            chunk_tm += stdio_read(fn);
        }//for
    
        printf("Big file size   = %u\n", big_sz);
        printf("Chunk file size = %u\n", chunk_sz);
        printf("# Chunks        = %u\n", num_chunks);
        printf("Chunk Total     = %u\n", num_chunks * chunk_sz);
        printf("Total Diff      = %d\n\n", big_sz - (num_chunks * chunk_sz));
    
        printf("Time to read big file = %u ms\n", big_tm);
        printf("Time to read chunks   = %u ms\n", chunk_tm);
    
        return 0;
    }//main
    
    //------------------------------------------------------------------------------
    
    bool create_file(const char *pathname, size_t sz)
    {
        FILE *file;
        if ((file = fopen(pathname, "wb")) == NULL)
        {
            printf("Failed to open file for writing %s\n", pathname);
            return false;
        }//if
    
        bool ret = true;
        if (fseek(file, sz, SEEK_SET))
        {
            printf("Failed to seek for %s\n", pathname);
            ret = false;
        }//if
    
        char c = 0;
        if (fwrite(&c, 1, 1, file) != 1)
        {
            printf("Failed to write to %s\n", pathname);
            ret = false;
        }//if
    
        fclose(file);
        return ret;
    }//create_file
    
    //------------------------------------------------------------------------------
    
    bool create_files()
    {
        if (!create_file("file_big.bin", big_sz))
            return false;
    
        size_t n = 0;
        for (; n < num_chunks; ++n)
        {
            char fn[32];
            sprintf(fn, "file_%u.bin", n + 1);
            if (!create_file(fn, chunk_sz))
                return false;
        }//for
    
        return true;
    }//create_files
    Code:
    Big file size   = 2147483647
    Chunk file size = 10485760
    # Chunks        = 205
    Chunk Total     = 2149580800
    Total Diff      = -2097153
    
    Time to read big file = 9446 ms
    Time to read chunks   = 8839 ms
    So chunk-reading read an extra 2097153 bytes and was still "faster". What does the above program produce on your system?

    gg
    Last edited by Codeplug; January 10th, 2009 at 09:36 PM.

  7. #7
    Join Date
    Dec 2008
    Posts
    49

    Re: Excessively slow file sequence access

    First of all, thanks for your assistance! You are going above and beyond what I expected.

    But, your test is flawed:

    1) stdio will buffer the file operations. Using FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN with the Windows API file I/O makes a massive difference - a 50-100% increase in performance from this pair of flags.

    2) Since stdio is buffering, it is likely your tests are being affected by Windows caching. I have noticed this effect when using smaller sequences (~150 files or less): the second pass over a sequence is extremely fast. In your case, the files/file table info are probably still in Windows' cache because you wrote the files recently. However, I can't rely on the Windows cache to help me out, because a 2 gig/200 file sequence is tiny compared to the real-world data this app must process. My real files/file sequences will be in the hundreds-of-gigs range. Also, I need it to be fast the first time through the sequence, not just after caching.

    3) I don't think this will make a big difference in the relative performance between the two methods, but you are reading in 4 KB blocks, whereas I typically read multi-megabyte blocks at once.

  8. #8
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Excessively slow file sequence access

    Those may be reasons not to compare standard I/O with Win32 I/O. But the test itself is only "flawed" because you don't care about stdio performance. My thinking was that if there was really a 2x+ slowdown in reading chunks, then I would expect to see the same slowdown in APIs built on top of CreateFile/ReadFile - and I already had that code lying around.

    There is, however, a major flaw in the file creation code. Since NTFS supports "sparse files", the resulting zero-filled files may actually occupy much less space on disk than their nominal size. I should have noticed this in the previous results: reading 2 gigs in 9446 ms is the equivalent of 227.3 MB/s, and my one SATA drive has a max burst of 123 MB/s (according to HDTach). Using the Win32 code (provided below), the zero-filled 2 gig file was read in less than half a second! So I went back and filled the files with 0xA5, and the timings started making sense.
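    One way to confirm the sparse-file effect is to compare a file's logical size with the bytes it actually occupies on disk - GetCompressedFileSize reports the allocated size for sparse and compressed files (a sketch; the path is whichever test file you want to inspect):

```cpp
#include <windows.h>
#include <cstdio>

// Print logical size vs. actual on-disk allocation for a file. For a
// zero-filled file created via seek-past-end on NTFS, the two can differ
// wildly - which would explain impossibly fast "read" times.
void report_on_disk_size(const wchar_t *path)
{
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if (!GetFileAttributesExW(path, GetFileExInfoStandard, &fad))
        return;
    ULONGLONG logical =
        ((ULONGLONG)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;

    DWORD hi = 0;
    DWORD lo = GetCompressedFileSizeW(path, &hi);
    if (lo == INVALID_FILE_SIZE && GetLastError() != NO_ERROR)
        return;
    ULONGLONG on_disk = ((ULONGLONG)hi << 32) | lo;

    wprintf(L"%s: logical %I64u bytes, on disk %I64u bytes\n",
            path, logical, on_disk);
}
```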

    Below is my new and improved benchmarking code. Highlights include: 1) Pages are locked into memory to eliminate potential paging to disk. 2) Demonstrates how to overlap work with I/O requests using FILE_FLAG_OVERLAPPED. 3) Improved stdio performance due to "S" mode.
    Code:
    #define _CRT_SECURE_NO_WARNINGS
    #include <windows.h>
    #include <iostream>
    #include <sstream>
    #include <cstdio>
    using namespace std;
    
    //------------------------------------------------------------------------------
    
    struct no_copy
    {
    protected:
        no_copy() {}
        ~no_copy() {}
    private:
        no_copy(const no_copy&);
        const no_copy& operator=(const no_copy&);
    };//no_copy
    
    //------------------------------------------------------------------------------
    
    class scoped_Timer : public no_copy
    {
        LARGE_INTEGER m_start, m_freq;
    public:
        scoped_Timer()
        {
            QueryPerformanceFrequency(&m_freq);
        Sleep(0); // start the timer at the beginning of a new quantum
            QueryPerformanceCounter(&m_start);
        }//constructor
    
        DWORD Elapsed_ms()
        {
            LARGE_INTEGER end;
            QueryPerformanceCounter(&end);
            return (DWORD)
                ((end.QuadPart - m_start.QuadPart) / (m_freq.QuadPart / 1000));
        }//Elapsed_ms
    };//scoped_Timer
    
    //------------------------------------------------------------------------------
    
    class LockedPhysicalMemory : public no_copy
    {
        void *m_p;
        size_t m_sz;
    public:
        explicit LockedPhysicalMemory(size_t sz) : m_p(0), m_sz(sz) 
        {
            void *p = VirtualAlloc(0, m_sz, MEM_COMMIT, PAGE_READWRITE);
            if (!p)
            {
                cerr << "Failed to alloc pages, le = " << GetLastError() << endl;
                return;
            }//if
    
            if (!VirtualLock(p, m_sz))
            {
                if (GetLastError() == ERROR_WORKING_SET_QUOTA)
                {
                    SIZE_T minWSS, maxWSS;
                    HANDLE hProc = GetCurrentProcess();
                    if (GetProcessWorkingSetSize(hProc, &minWSS, &maxWSS))
                    {
                        // increase to allow for m_sz + 10 pages (assuming 4k page)
                        const size_t increase = m_sz + (10 * 4096);
                        maxWSS += increase;
                        minWSS += increase;
                        
                        if (!SetProcessWorkingSetSize(hProc, minWSS, maxWSS))
                        {
                            cout << "SetProcessWorkingSetSize failed, le = " 
                                 << GetLastError() << endl;
                        }//if
                        else if (!VirtualLock(p, m_sz))
                        {
                            cout << "VirtualLock failed, le = " 
                                 << GetLastError() << endl;
                        }//else if
                        else
                            m_p = p;
                    }//if
                }//if
    
                if (!m_p)
                {
                    cerr << "Failed to lock pages, le = " << GetLastError() << endl;
                    VirtualFree(p, 0, MEM_RELEASE);
                }//if
            }//if
            else
                m_p = p;
        }//constructor
    
        ~LockedPhysicalMemory() 
        {
            if (m_p)
            {
                VirtualUnlock(m_p, m_sz);
                VirtualFree(m_p, 0, MEM_RELEASE);
            }//if
        }//destructor
    
        void* Ptr() const {return m_p;}
        size_t Size() const {return m_sz;}
    };//LockedPhysicalMemory
    
    //------------------------------------------------------------------------------
    
    class scoped_HANDLE : public no_copy
    {
        HANDLE m_h;
    public:
        explicit scoped_HANDLE(HANDLE h) : m_h(h) {}
        ~scoped_HANDLE() {if (m_h && m_h != INVALID_HANDLE_VALUE) CloseHandle(m_h);}
    };//scoped_HANDLE
    
    //------------------------------------------------------------------------------
    
    class scoped_FILE : public no_copy
    {
        FILE *m_f;
    public:
        explicit scoped_FILE(FILE *f) : m_f(f) {}
        ~scoped_FILE() {if (m_f) fclose(m_f);}
    };//scoped_FILE
    
    //------------------------------------------------------------------------------
    
    DWORD win32_read(const wchar_t *pathname);
    DWORD stdio_read(const wchar_t *pathname);
    
    // comment this to use stdio_read() instead of win32_read
    #define WIN32_READ
    
    const size_t big_sz = 2147483647; // 2 gig (fits in unsigned)
    const size_t chunk_sz = 1024 * 1024 * 10; // 10 meg
    const size_t num_chunks = (big_sz / chunk_sz) + 1; // +1 for good measure
    
    // Tried: 4k, 32k, 64k, 128k, 1M, 4M
    const size_t buff_sz = 1024 * 32;
    
    void *buff1, *buff2;
    HANDLE ev_read;
    
    //------------------------------------------------------------------------------
    
    int main()
    {
        DWORD (*pfn_read)(const wchar_t*);
        const char *fn_name;
    
        LockedPhysicalMemory lpm1(buff_sz), 
                             lpm2(buff_sz);
        buff1 = lpm1.Ptr();
        buff2 = lpm2.Ptr();
        if (!buff1 || !buff2)
            return 1; // error already logged
    
    #ifdef WIN32_READ
        pfn_read = &win32_read;
        fn_name = "win32_read";
        ev_read = CreateEventW(0, TRUE, FALSE, 0);
        if (!ev_read)
        {
            cerr << "Failed to create event, le = " << GetLastError() << endl;
            return 1;
        }//if
        scoped_HANDLE sh_ev_read(ev_read);
    #else
        pfn_read = &stdio_read;
        fn_name = "stdio_read";
    #endif
    
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
    
        cout << "Reading large file...please wait" << endl;
        DWORD big_tm = pfn_read(L"file_big.bin");
        
        cout << "Reading file chunks...please wait" << endl;
        DWORD chunk_tm = 0;
        size_t n = 0;
        for (; n < num_chunks; ++n)
        {
            wostringstream fns;
            fns << L"file_" << n + 1 << L".bin";
            chunk_tm += pfn_read(fns.str().c_str());
        }//for
    
        size_t chunk_tot_sz = num_chunks * chunk_sz;
    
        cout << "Big file size   = " << big_sz << endl;
        cout << "Chunk file size = " << chunk_sz << endl;
        cout << "# Chunks        = " << num_chunks << endl;
        cout << "Chunk Total     = " << chunk_tot_sz << endl;
        cout << endl;
        
        cout << "Test                  = " << fn_name << endl;
        cout << "Buffer size           = " << buff_sz / 1024 << " K" << endl;;
        cout << "Time to read big file = " << big_tm << " ms" << endl;
        cout << "Time to read chunks   = " << chunk_tm << " ms" << endl;
        
        // NOTE: 1 MB = 1,000,000 bytes (used by HDTach and marketing folks)
        double big_thrput = big_sz / (big_tm / 1000.0) / 1000000;
        double chunk_thrput = chunk_tot_sz / (chunk_tm / 1000.0) / 1000000;
        cout << "Big throughput        = " << big_thrput << " MB/s" << endl;
        cout << "Chunk throughput      = " << chunk_thrput << " MB/s" << endl;
        cout << "Throughput %Diff      = "
             << (chunk_thrput - big_thrput) / 
                ((chunk_thrput + big_thrput) / 2) * 100 << " %" << endl;
    
        return 0;
    }//main
    
    //------------------------------------------------------------------------------
    
    DWORD win32_read(const wchar_t *pathname)
    {
        OVERLAPPED osReader = {0};
        osReader.hEvent = ev_read;
        
        void *read_buff = buff1,
             *work_buff = buff2;
        DWORD nLastRead;
        ULONGLONG &read_offset = *reinterpret_cast<ULONGLONG*>(&osReader.Pointer);
    
        scoped_Timer timer;
    
        const DWORD flags = FILE_FLAG_NO_BUFFERING | 
                            FILE_FLAG_OVERLAPPED;
        HANDLE hFile = CreateFileW(pathname, GENERIC_READ, FILE_SHARE_READ, 0,
                                   OPEN_EXISTING, flags, 0);
        if (hFile == INVALID_HANDLE_VALUE)
        {
            wcerr << L"Failed to open file for reading: " << pathname
                  << L", le = " << GetLastError() << endl;
            return 0;
        }//if
        scoped_HANDLE sh_hFile(hFile);
    
        if (!ReadFile(hFile, read_buff, buff_sz, 0, &osReader))
        {
            const DWORD le = GetLastError();
            if (le != ERROR_IO_PENDING)
            {
                wcerr << L"ReadFile failed: " << pathname << L", le = " 
                      << GetLastError() << endl;
                return 0;
            }//if
        }//if
    
        bool bNoMoreReading = false;
        for (;;)
        {
            if (!GetOverlappedResult(hFile, &osReader, &nLastRead, TRUE))
            {
                wcerr << L"GetOverlapedResult failed: " << pathname << L", le = " 
                      << GetLastError() << endl;
                return 0;
            }//if
    
            read_offset += nLastRead;
    
            // read just completed into read_buff, which we'll work on after 
            // issuing the next read
            std::swap(read_buff, work_buff);
    
            if (!ReadFile(hFile, read_buff, buff_sz, 0, &osReader))
            {
                const DWORD le = GetLastError();
                if (le != ERROR_IO_PENDING)
                {
                    if (le == ERROR_HANDLE_EOF ||     // read up to EOF
                        le == ERROR_INVALID_PARAMETER) // offset beyond EOF
                    {
                        bNoMoreReading = true;
                    }//if
                    else
                    {
                        wcerr << L"ReadFile failed: " << pathname << L", le = " 
                              << GetLastError() << endl;
                        return 0;
                    }//else
                }//if
            }//if
            //else not possible
    
            // "overlap" of our work on work_buff while the OS is off reading 
            // the next buff_sz bytes into read_buff
            //DoWork(work_buff, nLastRead);
    
            if (bNoMoreReading)
                break;
        }//for
    
        // NOTE: CloseHandle() not included in time
        return timer.Elapsed_ms();
    }//win32_read
    
    //------------------------------------------------------------------------------
    
    // Leaving these commented seems to give to best results
    //#define CRT_NO_BUFFERS
    //#define CRT_PHYS_BUFFERS
    
    DWORD stdio_read(const wchar_t *pathname)
    {
        scoped_Timer timer;
    
        FILE *file;
        if ((file = _wfopen(pathname, L"rbS")) == NULL)
        {
        printf("Failed to open file for reading %ls\n", pathname);
            return 0;
        }//if
        scoped_FILE sf_file(file);
    
    #if defined(CRT_PHYS_BUFFERS)
        // setup CRT buffering to use our physical mem block
        setvbuf(file, reinterpret_cast<char*>(buff1), _IOFBF, buff_sz);
    #elif defined(CRT_NO_BUFFERS)
        // disable CRT buffer
        setvbuf(file, 0, _IONBF, 0);
    #else
        // use default buffers
    #endif
    
        while (fread(buff2, buff_sz, 1, file))
            ;
    
        // NOTE: fclose() not included in time
        return timer.Elapsed_ms();
    }//stdio_read
    
    //------------------------------------------------------------------------------
    And here are my results:
    Code:
    Big file size   = 2147483647
    Chunk file size = 10485760
    # Chunks        = 205
    Chunk Total     = 2149580800
    
    Test = win32_read
    
    Buffer size           = 4 K
    Big throughput        = 38.5144 MB/s
    Chunk throughput      = 36.5202 MB/s
    Throughput %Diff      = -5.31522 %
    
    Buffer size           = 32 K
    Big throughput        = 42.1522 MB/s
    Chunk throughput      = 38.3258 MB/s
    Throughput %Diff      = -9.509 %
    
    Buffer size           = 64 K
    Big throughput        = 41.164 MB/s
    Chunk throughput      = 38.0733 MB/s
    Throughput %Diff      = -7.80106 %
    
    Buffer size           = 128 K
    Big throughput        = 41.3025 MB/s
    Chunk throughput      = 38.2549 MB/s
    Throughput %Diff      = -7.66147 %
    
    Buffer size           = 1024 K
    Big throughput        = 42.035 MB/s
    Chunk throughput      = 39.4556 MB/s
    Throughput %Diff      = -6.33051 %
    
    Buffer size           = 4096 K
    Big throughput        = 41.2074 MB/s
    Chunk throughput      = 38.1564 MB/s
    Throughput %Diff      = -7.68861 %
    Observations:
    - Buffer size has next to no effect on throughput, starting at 32K
    - Big throughput is ~41.5 MB/s
    - Chunk throughput is ~38.5 MB/s
    - Reading from chunks is always less than 10% slower

    Here are some other benchmarks to demonstrate the utility of FILE_FLAG_NO_BUFFERING:
    Code:
    Test                  = win32_read (FILE_FLAG_OVERLAPPED only)
    Buffer size           = 32 K
    Big throughput        = 37.7208 MB/s
    Chunk throughput      = 31.4671 MB/s
    Throughput %Diff      = -18.0775 %
    
    Test                  = win32_read (overlapped + FILE_FLAG_SEQUENTIAL_SCAN)
    Buffer size           = 32 K
    Big throughput        = 41.8409 MB/s
    Chunk throughput      = 38.3641 MB/s
    Throughput %Diff      = -8.66967 %
    Observations:
    - FILE_FLAG_SEQUENTIAL_SCAN alone achieves nearly identical performance to FILE_FLAG_NO_BUFFERING when reading sequentially. (Using page-aligned memory in both cases may be a factor. FILE_FLAG_NO_BUFFERING is useful when access isn't always sequential.)

    And one more visit to stdio:
    Code:
    Test                  = stdio_read (default CRT buffers)
    Buffer size           = 32 K
    Big throughput        = 35.321 MB/s
    Chunk throughput      = 38.7298 MB/s
    Throughput %Diff      = 9.20653 %
    Observations:
    - stdio incurs extra overhead when dealing with a single large file (for some reason).
    - For smaller files, the performance is near identical to Win32

    Of course, the other disadvantage with stdio is that you can't overlap work with I/O requests.

    gg

  9. #9
    Join Date
    Dec 2008
    Posts
    49

    Re: Excessively slow file sequence access

    I ran this test, I get ~700 MB/sec for stdio big and chunked. The best this RAID array does is ~550 MB/sec, so this result is being affected by caching.

    I used one of my large data files as the file_big.bin, and I got 300 MB/sec. I changed the buffer size from 4k to 4 MB, and it gets 510 MB/sec. The chunked result got 12 gig/sec, so clearly something went wrong there.

    I think the RAID array is making this behave too differently on our two machines. I think my problem is a bottleneck in Windows, but on a standard SATA drive you are hitting the hard drive's limits first.

    I don't know that I've ever observed a useful difference when setting SEQUENTIAL_SCAN (I just set it because it may have an effect on some machines/drives, and I do access sequentially), but NO_BUFFERING absolutely makes a huge difference. FYI, I took out FILE_FLAG_NO_BUFFERING in your test, and throughput dropped from 510 MB/sec to 190 MB/sec. I suspect stdio performance would be on par with Windows I/O without FILE_FLAG_NO_BUFFERING.
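    For anyone trying FILE_FLAG_NO_BUFFERING: it imposes alignment rules - the file offset, the read length, and the buffer address must all be multiples of the volume sector size. A sketch of querying the sector size and allocating a conforming buffer (the volume root, e.g. L"C:\\", is whatever volume you're reading from):

```cpp
#include <windows.h>

// Query the volume sector size and allocate a read buffer that satisfies
// FILE_FLAG_NO_BUFFERING's alignment rules. VirtualAlloc returns
// page-aligned memory, which is always at least sector-aligned.
void *alloc_unbuffered_read_buffer(const wchar_t *volume_root,
                                   DWORD want, DWORD *actual)
{
    DWORD sectors_per_cluster, bytes_per_sector, free_clusters, total_clusters;
    if (!GetDiskFreeSpaceW(volume_root, &sectors_per_cluster, &bytes_per_sector,
                           &free_clusters, &total_clusters))
        return 0;

    // round the requested length up to a sector multiple
    *actual = (want + bytes_per_sector - 1) / bytes_per_sector
              * bytes_per_sector;
    return VirtualAlloc(0, *actual, MEM_COMMIT, PAGE_READWRITE);
}
```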
    Last edited by Retarded; January 13th, 2009 at 02:38 AM.

  10. #10
    Join Date
    Nov 2003
    Posts
    1,902

    Re: Excessively slow file sequence access

    >> I ran this test, I get ~700 MB/sec for stdio big and chunked.
    It doesn't make any sense that the highest numbers you've posted (~700) are coming from stdio. All stdio provides is in-process software caching, which isn't really used effectively given the way the file is being read. It's just an extra layer on top of Win32 I/O with SEQUENTIAL_SCAN (with mode "S") - so it shouldn't do any better than that (on the same set of files, of course). But then you say you got 190 after removing NO_BUFFERING (and I assume you kept SEQUENTIAL_SCAN).

    >> I suspect that the stdio performance would be on par with windows IO without FILE_FLAG_NO_BUFFERING.
    Right - with SEQUENTIAL_SCAN on, and assuming mode "S" is used with stdio. Which is why I don't understand the ~700 vs. 190 numbers. On the same set of files, stdio should be as fast, if not slightly slower due to the extra layer.

    >> I got 300 MB/sec. I changed the buffer size from 4k to 4 MB, and it gets 510
    What about 64K or 512K? 4K wasn't even optimal on my hardware. Finding the smallest buffer that achieves maximum throughput is ideal.
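    A portable sketch of such a sweep using plain stdio and std::chrono (the file name passed in is hypothetical; run each size against the same large file and look for the knee in throughput):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Read the whole file sequentially with the given buffer size; returns
// total bytes read and stores the elapsed wall time in milliseconds.
size_t timed_read(const char *path, size_t buf_sz, long long *elapsed_ms)
{
    std::vector<char> buf(buf_sz);
    std::FILE *f = std::fopen(path, "rb");
    if (!f)
        return 0;

    auto t0 = std::chrono::steady_clock::now();
    size_t total = 0, n;
    while ((n = std::fread(&buf[0], 1, buf_sz, f)) > 0)
        total += n;
    *elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();

    std::fclose(f);
    return total;
}
```

    Calling timed_read at 4K, 64K, 512K, and 4M on the same file, then dividing bytes by milliseconds, locates the smallest buffer that reaches peak throughput.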

    Questions:
    Are you using the same set of files for all tests?
    Do any of these files have long runs of zeros in them?
    Would you mind posting a screen shot of an HDTach analysis of your system? So we know what to expect. (http://www.simplisoftware.com/Public...request=HdTach)
    Does your raid controller have any on board cache?

    I want to get our systems as close to "apples to apples" as possible by reading the same files with the same data. Here's the file creation program so that we'll be operating on the same file data. This will rule out any sparse-file optimizations that are affecting results:
    Code:
    #define _CRT_SECURE_NO_WARNINGS
    #include <iostream>
    #include <sstream>
    #include <cstdio>
    using namespace std;
    
    //------------------------------------------------------------------------------
    
    const size_t buff_sz = 1024 * 1024;
    char buff[buff_sz];
    
    const size_t big_sz = 2147483647; // 2 gig (fits in unsigned)
    const size_t chunk_sz = 1024 * 1024 * 10; // 10 meg
    const size_t num_chunks = (big_sz / chunk_sz) + 1; // +1 for good measure
    
    //------------------------------------------------------------------------------
    
    struct scoped_FILE
    {
        FILE *m_f;
        explicit scoped_FILE(FILE *f) : m_f(f) {}
        ~scoped_FILE() {if (m_f) fclose(m_f);}
    };//scoped_FILE
    
    //------------------------------------------------------------------------------
    
    bool create_file(const wchar_t *pathname, size_t sz)
    {
        FILE *file;
        if ((file = _wfopen(pathname, L"wb")) == NULL)
        {
            wcerr << L"Failed to open file for writing: " << pathname << endl;
            return false;
        }//if
        scoped_FILE sf_file(file);
    
        size_t full_writes = sz / buff_sz;
        size_t n = 0;
        for (; n < full_writes; ++n)
        {
            if (fwrite(buff, buff_sz, 1, file) != 1)
            {
                wcerr << L"Failed to write to: " << pathname << endl;
                return false;
            }//if
        }//for
    
        size_t partial_sz = sz % buff_sz;
        if (partial_sz &&
            (fwrite(buff, partial_sz, 1, file) != 1))
        {
            wcerr << L"Failed to write to: " << pathname << endl;
            return false;
        }//if
    
        return true;
    }//create_file
    
    //------------------------------------------------------------------------------
    
    int main()
    {
        memset(buff, 0xa5, buff_sz);
    
        cout << "Creating file_big.bin...please wait" << endl;
        if (!create_file(L"file_big.bin", big_sz))
        return 1;
    
        size_t n = 0;
        for (; n < num_chunks; ++n)
        {
            wostringstream fns;
            fns << L"file_" << n + 1 << L".bin";
            const wstring &pathname = fns.str();
            wcout << L"Creating " << pathname << L"...please wait" << endl;
    
            if (!create_file(pathname.c_str(), chunk_sz))
                return 1;
        }//for
    
        return 0;
    }//main
    I also have a raid system which would be a better "apple" for comparison - although not as nice as yours. I've modified the benchmark code slightly to avoid any confusion as to what's being run (full source is attached below).

    For the test runs, I've stuck with 128K and 4M buffers:
    Code:
    Big file size   = 2147483647
    Chunk file size = 10485760
    # Chunks        = 205
    Chunk Total     = 2149580800
    
    ******** FILE_FLAG_NO_BUFFERING ********
    
    Test                  = win32_read_nocache
    Buffer size           = 128 K
    Big throughput        = 85.4278 MB/s
    Chunk throughput      = 78.0445 MB/s
    Throughput %Diff      = -9.03302 %
    
    Test                  = win32_read_nocache
    Buffer size           = 4096 K
    Big throughput        = 83.017 MB/s
    Chunk throughput      = 74.3825 MB/s
    Throughput %Diff      = -10.9714 %
    
    ******** FILE_FLAG_SEQUENTIAL_SCAN ********
    
    Test                  = win32_read_seq
    Buffer size           = 128 K
    Big throughput        = 83.0812 MB/s
    Chunk throughput      = 75.4451 MB/s
    Throughput %Diff      = -9.63394 %
    
    Test                  = win32_read_seq
    Buffer size           = 4096 K
    Big throughput        = 75.1867 MB/s
    Chunk throughput      = 42.8246 MB/s
    Throughput %Diff      = -54.8458 %
    
    ******** stdio, "S", default CRT buffers ********
    
    Test                  = stdio_read
    Buffer size           = 128 K
    Big throughput        = 68.2803 MB/s
    Chunk throughput      = 66.0759 MB/s
    Throughput %Diff      = -3.28144 %
    Interesting results! On my RAID system, sequential reading of chunks has horrible performance with a large buffer size. Even the large single-file throughput seems to be affected.

    So you are probably right that the file-system cache can cause poor performance for high-throughput controllers/drivers. On my RAID system, the size of the read buffer seems to trigger a bottleneck with sequential scan (which wasn't seen on my single-SATA system).

    And stdio doesn't do nearly as well in any category on my RAID system.

    To appease my curiosity, would you please generate the test files with the attached code and run the same tests - and copy/paste the results here. And provide an HDTach screenshot.

    Here's what HDTach shows on my raid system - which is a Silicon Image (Sil3132) adapter on a PCI Express bus with 2 SATA-II drives (attachment: HDTach.png).

    And here is the full, revised, benchmark code: main_IO_Benchmark.cpp

    gg
    Last edited by Codeplug; September 9th, 2011 at 09:04 AM. Reason: Fixed bug in main_IO_Benchmark.cpp

  11. #11
    Join Date
    Feb 2009
    Posts
    1

    Re: Excessively slow file sequence access

    Hey Retarded, we have the same configuration as you and we're hitting the same problem. The only difference is that we're trying to read 2 huge files (with FILE_FLAG_NO_BUFFERING, of course) at 200 MB/s on RAID 0 (ICH10R chipset). When we read only one file, everything is fine (200 MB/s), but if we try to access two files, the performance falls off (120 MB/s).

    We also discovered that the problem comes from the RAID, because on a single disk we can access multiple files without a problem (though limited to about 100 MB/s on a 10,000 RPM hard disk).

    Just as info: the SEQUENTIAL_SCAN flag is only useful without the NO_BUFFERING flag, because it only affects the system cache's performance.

    Did you find a solution?
