CodeGuru Home VC++ / MFC / C++ .NET / C# Visual Basic VB Forums Developer.com
Page 2 of 4 FirstFirst 1234 LastLast
Results 16 to 30 of 53
  1. #16
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    Thank you for your kind and informative message.

    C is in fact, random, without repetitions. It is basically an array representing a subset of
    graph vertices. B - is a row in the graph's adjacency matrix.

    Yes, all threads are accessing the same matrix and performing identical code. I am not
    sure what the "false sharing" means though.

    I will post a "Greedy algorithm for the Minimal Independent Set in random Graph", about 40 lines
    of C-code. But then this has to be called in parallel with random graph subsets to reproduce my scenario,
    I personally use VC++ 2017 and "parallel_for" constructs in the calling procedure.

  2. #17
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Is there anyway to make this C-code faster?

    I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.

    The test result timings should only be used in relation to each other and have no other meaning. The results in increasing order of timing are

    Code:
         for (size_t i = 0; i<N; ++i)
              Result |= B[C[i]];
    timing: 4.68439

    Code:
         for (size_t i = 0; i < N; Result |= B[C[i++]]);
    timing: 4.69151

    Code:
         for (size_t i = 0; i < N && !Result; Result = B[C[i++]]);
    timing: 5.45841

    Code:
         for (int *cp = C - 1, *ce = C + N; cp != ce; Result |= B[*++cp]);
    timing: 5.65873

    Code:
         for (int *cp = C - 1, *ce = C + N; cp != ce && !Result; Result = B[*++cp]);
    timing: 6.37585
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  3. #18
    Join Date
    Oct 2008
    Posts
    1,456

    Re: Is there anyway to make this C-code faster?

    Quote Originally Posted by mkh01 View Post
    I am not sure what the "false sharing" means though.
    basically, "false sharing" occurs when two or more cores try to access different memory location lying in the same cache line; they look like being indipendent read/writes but turn out effectively resulting in serial execution because the cache choerency protocol disallows concurrent access of the same cache line.

  4. #19
    Join Date
    Oct 2008
    Posts
    1,456

    Re: Is there anyway to make this C-code faster?

    Quote Originally Posted by 2kaud View Post
    I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.
    can you post the asm ? assuming you populated B and C randomly and uniformly, I'd guess the compiler vectorized the code via a gather addressing SIMD instruction in the first case ... BTW, in all linear algebra libraries I know ( where vectorization is a must ) indices addressing is used because pointer based loops inhibit vectorization

  5. #20
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Is there anyway to make this C-code faster?

    I've done a lot more testing, and the results vary significantly as to the distribution of 1's in B and how B is accessed via C. If a 1 is accessed in B fairly quickly in the loop then the code that tests for Result is quicker than those that don't. However if a 1 is accessed late on in the loop, then the code that doesn't test for Result is the quicker.

    In order to investigate further, I've quickly knocked up some test code that will do multiple iterations. Consider
    Code:
    #pragma warning(disable : 4121)
    #define WIN32_LEAN_AND_MEAN
    #include <Windows.h>
    
    #include <iostream>
    #include <set>
    #include <chrono>
    #include <random>
    using namespace std;
    using namespace std::chrono;
    
    using fun = size_t(*)(BYTE &);
    
    enum statis {tavrg = 0, tmaxm, tminm, cavrg, cmaxm, cminm, eend = cminm + 1};
    
    const size_t sizeB = 5000000;
    const size_t sizeC = 10000;
    const size_t iter = 200;
    
    BYTE* const B = new BYTE[sizeB];
    size_t* const C = new size_t[sizeC];
    size_t const N = sizeC;
    
    size_t fun1(BYTE & Result)
    {
    	Result = 0;
    	size_t i = 0;
    
    	for (; i < N; ++i)
    		Result |= B[C[i]];
    
    	return i;
    }
    
    size_t fun2(BYTE & Result)
    {
    	Result = 0;
    	size_t i = 0;
    
    	for (; i < N; Result |= B[C[i++]]);
    
    	return i;
    }
    
    size_t fun3(BYTE & Result)
    {
    	Result = 0;
    	size_t i = 0;
    
    	for (; i < N && !Result; Result = B[C[i++]]);
    
    	return i;
    }
    
    size_t fun4(BYTE & Result)
    {
    	Result = 0;
    	size_t *cp = C, *ce = C + N;
    
    	for (; cp != ce; Result |= B[*cp++]);
    
    	return cp - C;
    }
    
    size_t fun5(BYTE & Result)
    {
    	Result = 0;
    	size_t *cp = C, *ce = C + N;
    
    	for (; cp != ce && !Result; Result = B[*cp++]);
    
    	return cp - C;
    }
    
    const fun funcs[] = { fun1, fun2, fun3, fun4, fun5 };
    const size_t funsize = sizeof(funcs) / sizeof(funcs[0]);
    
    int main()
    {
    	double res[funsize][eend] = { 0.0 };
    
    	for (size_t r = 0; r < funsize; ++r)
    		res[r][tminm] = res[r][cminm] = 9999999.0;
    
    	auto Bgen = mt19937 { random_device {}() };
    	auto Cgen = mt19937 { random_device {}() };
    	auto Bud = uniform_int_distribution<short> { 0, 1 };
    	auto Cud = uniform_int_distribution<size_t> { 0, sizeB - 1 };
    
    	for (size_t it = 0; it < iter; ++it) {
    		set<size_t> sst;
    
    		for (size_t i = 0; i < sizeB; ++i)
    			B[i] = static_cast<BYTE>(Bud(Bgen));
    
    		for (size_t i = 0; i < sizeC; ++i) {
    			size_t r;
    
    			do {
    				r = Cud(Cgen);
    			} while (sst.insert(r).second == false);
    
    			C[i] = r;
    		}
    
    		for (size_t f = 0; f < funsize; ++f) {
    			const auto start = high_resolution_clock::now();
    
    			BYTE Res;
    			const size_t c = funcs[f](Res);
    
    			const auto stop = high_resolution_clock::now();
    			const auto dur = duration<double, nano>(stop - start).count();
    
    			res[f][tavrg] += dur;
    			res[f][cavrg] += c;
    
    			if (res[f][tmaxm] < dur)
    				res[f][tmaxm] = dur;
    
    			if (res[f][tminm] > dur)
    				res[f][tminm] = dur;
    
    			if (res[f][cmaxm] < c)
    				res[f][cmaxm] = c;
    
    			if (res[f][cminm] > c)
    				res[f][cminm] = c;
    		}
    	}
    
    	cout << "For " << iter << " iterations\n\n\t\t";
    
    	for (size_t i = 0; i < funsize; ++i)
    		cout << "fun" << i + 1 << "\t";
    
    	cout << "\n\nAverage time: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][tavrg] / iter << "\t";
    
    	cout << "\n\nMaximum time: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][tmaxm] << "\t";
    
    	cout << "\n\nMinimum time: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][tminm] << "\t";
    
    	cout << "\n\nAverage count: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][cavrg] / iter << "\t";
    
    	cout << "\n\nMaximum count: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][cmaxm] << "\t";
    
    	cout << "\n\nMinimum count: \t";
    	for (size_t i = 0; i < funsize; ++i)
    		cout << res[i][cminm] << "\t";
    
    	cout << endl;
    
    	delete[] B;
    	delete[] C;
    }
    This gives on my system an output for 1 run of

    Code:
    For 200 iterations
    
                    fun1    fun2    fun3    fun4    fun5
    
    Average time:   60313.2 46729.4 322     46525.9 254.835
    
    Maximum time:   105494  63612   396     58871   396
    
    Minimum time:   51759   42276   0       41881   0
    
    Average count:  10000   10000   2.195   10000   2.195
    
    Maximum count:  10000   10000   8       10000   8
    
    Minimum count:  10000   10000   1       10000   1
    Where the loop using pointers with termination test (fun5) is the quickest!

    PS

    The code in the funx() functions has been changed to have Result as a ref parameter. Upon looking at the generated assembler (for superbonzo post #19), the wily compiler was just returning the value of N for fun1() and fun2() as the whole code for the function and had removed completely the for loop as Result wasn't used! Doh!! Now that Result is a ref parameter, the whole for loop has to be executed. Hence the new timings
    Last edited by 2kaud; September 2nd, 2017 at 04:41 PM. Reason: PS
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  6. #21
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    Sincere thanks for your help.

    I am working on a self contained C-code to reproduce my usage scenarios
    and timings.
    It is under VC++ 2017 and should be ready shortly.

    Should I make it a separate post? This one is getting very long...

  7. #22
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Is there anyway to make this C-code faster?

    Should I make it a separate post? This one is getting very long...
    As it's all related to the original interesting (IMO!) question regarding speeding up code, I would suggest there's no need to start a new thread.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  8. #23
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    Thanks again.

    I was able to make a short code that somewhat reproduce my usage scenario.

    Actual function I am using is "Principal", the rest including random graph generation
    is just a demo code.

    I also included the previous suggestions on codeguru and timed them.

    Here is a link to googledrive:

    https://drive.google.com/file/d/0B2l...ew?usp=sharing


    and the C-code. This was compiled and run using VC++ 2017

    Code:
    #include	"stdafx.h"
    #include	<ppl.h>
    
    using namespace std;
    using namespace concurrency;
    
    
    #define	IXS_PARALLEL
    
    
    typedef unsigned _int16		VTX,*PVTX;
    
    
    enum {
    	ITER_DEPTH	= 200
    };
    
    
    __forceinline double	ffrand() { return rand() / (double)RAND_MAX; }
    
    
    int	GetNumLogicalCpus()
    {
    SYSTEM_INFO SysInfo;
    	GetSystemInfo(&SysInfo);
    	return SysInfo.dwNumberOfProcessors;
    }
    
    
    double GetElapsedTime( LARGE_INTEGER Start )
    {
    	// stop timer
    LARGE_INTEGER t;
    	QueryPerformanceCounter(&t);
    LARGE_INTEGER freq;
    	QueryPerformanceFrequency(&freq);
    
    	return double(t.QuadPart - Start.QuadPart)/freq.QuadPart;
    }
    
    
    
    ///////////////////////////////////////////////////////////
    //
    //
    //
    ///////////////////////////////////////////////////////////
    static int	Principal_ExtractIndependentSets(
    		PBYTE		G,
    		int		NV,
    		PVTX		S,
    		int		N,
    		PINT		ISlng,
    		PVTX		Temp )
    {
    	if ( !N ) return 0;
    
    PVTX	Xset = Temp;
    
    int	NX = 0;
    int	NumRem = N;
    	while (NumRem) {
    		Xset[0] = S[0];
    		int	xlng = 1;
    
    		PVTX	Dst = S;
    		for (int r = 1; r<NumRem; r++) {
    			int	re = S[r];
    			const PBYTE rConn = G+re*NV;
    
    			//XXXXXXXXXXXXXXXX
    			// BOTTLENECK
    			//
    
    	
    			// VariantA
    			BYTE  cn = rConn[Xset[0]];
    			for (int i = 1; i<xlng; i++)
    				cn |= rConn[Xset[i]];
    
    
    			/*
    			// Variant B: 15% slower on  i7-5930K
    			
    			BYTE cn = 0;
    			for (int i = 0; i<xlng; i++) {
    				cn |= rConn[Xset[i]];
    				if ( cn ) break;
    			}
    			*/
    
    			/*
    			// VariantC:   40% slowe
    			BYTE cn = 0;
    			for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );
    			*/
    
    			//XXXXXXXXXXXXXXXXXXX
    
    			if (cn)	*Dst++ = re;
    			else		Xset[xlng++] = re;
    		}
    
    		Xset += xlng;
    		ISlng[NX++] = xlng;
    		NumRem = int(Dst-S);
    	}
    
    	movemem( S,Temp,N*sizeof(VTX));
    	return NX;
    }
    
    
    ///////////////////////////////////////////////////////////////////////
    //
    //	Function creates a random graph and tests the "Independent Set"
    //	function.
    //	Testing graphs  NV=1000,  Density=0.5,  Number Iterations = 20
    //
    ////////////////////////////////////////////////////////////////////////
    double	IXSetTest(
    		int		NV,
    		double	Dens,
    		int		NumIter )
    {
    PINT  ISlngBuf	= NULL;
    PVTX  TempBuf	= NULL;
    PBYTE M		= NULL;
    PVTX  Subset      = NULL;
    bool  res		= false;
    
    
    
    
    	// Generate the graph
    	M = (PBYTE)calloc(NV*NV, 1);
    	if ( !M ) goto func_exit;
    
    	srand(100);
    	for (int i = 0; i<NV; i++) {
    		M[i*NV + i] = 1;
    		for (int j = i + 1; j<NV; j++)
    			if (ffrand()<Dens)
    				M[i*NV + j] = M[j*NV + i] = 1;
    	}
    
    	
    
    	// just a sample initialization 
    	Subset = (PVTX)malloc(NV * sizeof(VTX));
    
    	// ... and create a "random" subset
    	for (int j = 0; j<NV; j++) Subset[j] = j;
    
    
    
    const int	NumCpu = GetNumLogicalCpus(),
    		NumRepeat = NumCpu*ITER_DEPTH;
    
    	ISlngBuf = (PINT)malloc( NumRepeat*NV*sizeof(int));
    	if (!ISlngBuf) goto func_exit;
    
    	TempBuf  = (PVTX)malloc( NumRepeat*NV *sizeof(VTX));
    	if (!TempBuf) goto func_exit;
    
    LARGE_INTEGER Start;
    	QueryPerformanceCounter(&Start);
    
    
    	for ( int i=0; i<NumIter; i++ )
    #ifdef IXS_PARALLEL
    		parallel_for(size_t(0), size_t(NumRepeat), size_t(1), [=](size_t r) {
    #else
    		for ( int r=0; r<NumRepeat; r++ ) {
    #endif
    			int boffs = r*NV;
    			Principal_ExtractIndependentSets( M,NV,Subset,NV,ISlngBuf+boffs,TempBuf+boffs );
    
    #ifdef IXS_PARALLEL
    		});
    #else
    		}
    #endif
    
    
    double Tm = GetElapsedTime( Start );
    
    	PR_DisplayDoubleValue(_T("Elapsed Time (sec)"),Tm );
    
    
    	res = true;
    func_exit:
    	free( ISlngBuf );
    	free( TempBuf );
    	free( M );
    	free( Subset );
    
    	return Tm;
    }
    Last edited by 2kaud; September 2nd, 2017 at 12:31 PM. Reason: Added code tags

  9. #24
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    I posted my usage scenario and fully functional source code.
    Thank you so much.

  10. #25
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    >>disallows concurrent access of the same cache line.

    Super interesting.
    I am contemplating how to account for this in the code....

  11. #26
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    I posted the actual code that mostly resembles my results in practice.

    SIMD: I actually did several recompilations (VC++) with SSE, SSE2 abd "No Enchanced Instructions"
    flag and timing results were identical.

  12. #27
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    Thanks
    My code looks ancient compared to yours, i need time to read through
    this.

  13. #28
    Join Date
    Aug 2017
    Posts
    31

    Re: Is there anyway to make this C-code faster?

    Also, slightly off the original question, I would not mind to
    have a graph represented as a 1-bit matrix ( better cached), but
    then the accumulation loop will slow down too much, but may
    be i just was not able to figure it out...

  14. #29
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Is there anyway to make this C-code faster?

    When posting code, please use code tags so that the code is readable.

    For variant c,

    Code:
    for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );
    should be

    Code:
    for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn = rConn[*xp++] );
    If you change this, what does Variant c then give for your system?

    I'll have a look at the code when I have some time, but we have a quite major situation here so I don't know when that will be.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

  15. #30
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,824

    Re: Is there anyway to make this C-code faster?

    Quote Originally Posted by mkh01 View Post
    Thanks
    My code looks ancient compared to yours, i need time to read through
    this.
    The code in my post #20 is c++. The c++ standard changes about every 3 years. We've now on c++17 for which VS2017 release 15.3.3 supports many (but not yet all) new features. The previous standard was c++14.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)

Page 2 of 4 FirstFirst 1234 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  





Click Here to Expand Forum to Full Width

Featured