Is there anyway to make this C-code faster?

**mkh01** · September 1st, 2017, 02:09 PM

Thank you for your kind and informative message.

C is in fact, random, without repetitions. It is basically an array representing a subset of
graph vertices. B - is a row in the graph's adjacency matrix.

Yes, all threads are accessing the same matrix and performing identical code. I am not
sure what the "false sharing" means though.

I will post a "Greedy algorithm for the Minimal Independent Set in random Graph", about 40 lines
of C-code. But then this has to be called in parallel with random graph subsets to reproduce my scenario,
I personally use VC++ 2017 and "parallel_for" constructs in the calling procedure.

**2kaud** · September 1st, 2017, 02:34 PM

I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.

The test result timings should only be used in relation to each other and have no other meaning. The results in increasing order of timing are

Code:

     for (size_t i = 0; i<N; ++i)
          Result |= B[C[i]];

timing: 4.68439

Code:

     for (size_t i = 0; i < N; Result |= B[C[i++]]);

timing: 4.69151

Code:

     for (size_t i = 0; i < N && !Result; Result = B[C[i++]]);

timing: 5.45841

Code:

     for (int *cp = C - 1, *ce = C + N; cp != ce; Result |= B[*++cp]);

timing: 5.65873

Code:

     for (int *cp = C - 1, *ce = C + N; cp != ce && !Result; Result = B[*++cp]);

timing: 6.37585

**superbonzo** · September 1st, 2017, 03:11 PM

Originally Posted by mkh01

I am not sure what the "false sharing" means though.

basically, "false sharing" occurs when two or more cores try to access different memory location lying in the same cache line; they look like being indipendent read/writes but turn out effectively resulting in serial execution because the cache choerency protocol disallows concurrent access of the same cache line.

**superbonzo** · September 1st, 2017, 03:17 PM

Originally Posted by 2kaud

I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.

can you post the asm ? assuming you populated B and C randomly and uniformly, I'd guess the compiler vectorized the code via a gather addressing SIMD instruction in the first case ... BTW, in all linear algebra libraries I know ( where vectorization is a must ) indices addressing is used because pointer based loops inhibit vectorization

**2kaud** · September 1st, 2017, 05:14 PM

I've done a lot more testing, and the results vary significantly as to the distribution of 1's in B and how B is accessed via C. If a 1 is accessed in B fairly quickly in the loop then the code that tests for Result is quicker than those that don't. However if a 1 is accessed late on in the loop, then the code that doesn't test for Result is the quicker.

In order to investigate further, I've quickly knocked up some test code that will do multiple iterations. Consider

Code:

#pragma warning(disable : 4121)
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

#include <iostream>
#include <set>
#include <chrono>
#include <random>
using namespace std;
using namespace std::chrono;

using fun = size_t(*)(BYTE &);

enum statis {tavrg = 0, tmaxm, tminm, cavrg, cmaxm, cminm, eend = cminm + 1};

const size_t sizeB = 5000000;
const size_t sizeC = 10000;
const size_t iter = 200;

BYTE* const B = new BYTE[sizeB];
size_t* const C = new size_t[sizeC];
size_t const N = sizeC;

size_t fun1(BYTE & Result)
{
	Result = 0;
	size_t i = 0;

	for (; i < N; ++i)
		Result |= B[C[i]];

	return i;
}

size_t fun2(BYTE & Result)
{
	Result = 0;
	size_t i = 0;

	for (; i < N; Result |= B[C[i++]]);

	return i;
}

size_t fun3(BYTE & Result)
{
	Result = 0;
	size_t i = 0;

	for (; i < N && !Result; Result = B[C[i++]]);

	return i;
}

size_t fun4(BYTE & Result)
{
	Result = 0;
	size_t *cp = C, *ce = C + N;

	for (; cp != ce; Result |= B[*cp++]);

	return cp - C;
}

size_t fun5(BYTE & Result)
{
	Result = 0;
	size_t *cp = C, *ce = C + N;

	for (; cp != ce && !Result; Result = B[*cp++]);

	return cp - C;
}

const fun funcs[] = { fun1, fun2, fun3, fun4, fun5 };
const size_t funsize = sizeof(funcs) / sizeof(funcs[0]);

int main()
{
	double res[funsize][eend] = { 0.0 };

	for (size_t r = 0; r < funsize; ++r)
		res[r][tminm] = res[r][cminm] = 9999999.0;

	auto Bgen = mt19937 { random_device {}() };
	auto Cgen = mt19937 { random_device {}() };
	auto Bud = uniform_int_distribution<short> { 0, 1 };
	auto Cud = uniform_int_distribution<size_t> { 0, sizeB - 1 };

	for (size_t it = 0; it < iter; ++it) {
		set<size_t> sst;

		for (size_t i = 0; i < sizeB; ++i)
			B[i] = static_cast<BYTE>(Bud(Bgen));

		for (size_t i = 0; i < sizeC; ++i) {
			size_t r;

			do {
				r = Cud(Cgen);
			} while (sst.insert(r).second == false);

			C[i] = r;
		}

		for (size_t f = 0; f < funsize; ++f) {
			const auto start = high_resolution_clock::now();

			BYTE Res;
			const size_t c = funcs[f](Res);

			const auto stop = high_resolution_clock::now();
			const auto dur = duration<double, nano>(stop - start).count();

			res[f][tavrg] += dur;
			res[f][cavrg] += c;

			if (res[f][tmaxm] < dur)
				res[f][tmaxm] = dur;

			if (res[f][tminm] > dur)
				res[f][tminm] = dur;

			if (res[f][cmaxm] < c)
				res[f][cmaxm] = c;

			if (res[f][cminm] > c)
				res[f][cminm] = c;
		}
	}

	cout << "For " << iter << " iterations\n\n\t\t";

	for (size_t i = 0; i < funsize; ++i)
		cout << "fun" << i + 1 << "\t";

	cout << "\n\nAverage time: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][tavrg] / iter << "\t";

	cout << "\n\nMaximum time: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][tmaxm] << "\t";

	cout << "\n\nMinimum time: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][tminm] << "\t";

	cout << "\n\nAverage count: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][cavrg] / iter << "\t";

	cout << "\n\nMaximum count: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][cmaxm] << "\t";

	cout << "\n\nMinimum count: \t";
	for (size_t i = 0; i < funsize; ++i)
		cout << res[i][cminm] << "\t";

	cout << endl;

	delete[] B;
	delete[] C;
}

This gives on my system an output for 1 run of

Code:

For 200 iterations

                fun1    fun2    fun3    fun4    fun5

Average time:   60313.2 46729.4 322     46525.9 254.835

Maximum time:   105494  63612   396     58871   396

Minimum time:   51759   42276   0       41881   0

Average count:  10000   10000   2.195   10000   2.195

Maximum count:  10000   10000   8       10000   8

Minimum count:  10000   10000   1       10000   1

Where the loop using pointers with termination test (fun5) is the quickest!

PS

The code in the funx() functions has been changed to have Result as a ref parameter. Upon looking at the generated assembler (for superbonzo post #19), the wily compiler was just returning the value of N for fun1() and fun2() as the whole code for the function and had removed completely the for loop as Result wasn't used! Doh!!

Now that Result is a ref parameter, the whole for loop has to be executed. Hence the new timings

**mkh01** · September 2nd, 2017, 07:30 AM

Sincere thanks for your help.

I am working on a self contained C-code to reproduce my usage scenarios
and timings.
It is under VC++ 2017 and should be ready shortly.

Should I make it a separate post? This one is getting very long...

**2kaud** · September 2nd, 2017, 07:42 AM

Should I make it a separate post? This one is getting very long...

As it's all related to the original interesting (IMO!) question regarding speeding up code, I would suggest there's no need to start a new thread.

**mkh01** · September 2nd, 2017, 11:22 AM

Thanks again.

I was able to make a short code that somewhat reproduce my usage scenario.

Actual function I am using is "Principal", the rest including random graph generation
is just a demo code.

I also included the previous suggestions on codeguru and timed them.

Here is a link to googledrive:

https://drive.google.com/file/d/0B2l...ew?usp=sharing

and the C-code. This was compiled and run using VC++ 2017

Code:

#include	"stdafx.h"
#include	<ppl.h>

using namespace std;
using namespace concurrency;


#define	IXS_PARALLEL


typedef unsigned _int16		VTX,*PVTX;


enum {
	ITER_DEPTH	= 200
};


__forceinline double	ffrand() { return rand() / (double)RAND_MAX; }


int	GetNumLogicalCpus()
{
SYSTEM_INFO SysInfo;
	GetSystemInfo(&SysInfo);
	return SysInfo.dwNumberOfProcessors;
}


double GetElapsedTime( LARGE_INTEGER Start )
{
	// stop timer
LARGE_INTEGER t;
	QueryPerformanceCounter(&t);
LARGE_INTEGER freq;
	QueryPerformanceFrequency(&freq);

	return double(t.QuadPart - Start.QuadPart)/freq.QuadPart;
}



///////////////////////////////////////////////////////////
//
//
//
///////////////////////////////////////////////////////////
static int	Principal_ExtractIndependentSets(
		PBYTE		G,
		int		NV,
		PVTX		S,
		int		N,
		PINT		ISlng,
		PVTX		Temp )
{
	if ( !N ) return 0;

PVTX	Xset = Temp;

int	NX = 0;
int	NumRem = N;
	while (NumRem) {
		Xset[0] = S[0];
		int	xlng = 1;

		PVTX	Dst = S;
		for (int r = 1; r<NumRem; r++) {
			int	re = S[r];
			const PBYTE rConn = G+re*NV;

			//XXXXXXXXXXXXXXXX
			// BOTTLENECK
			//

	
			// VariantA
			BYTE  cn = rConn[Xset[0]];
			for (int i = 1; i<xlng; i++)
				cn |= rConn[Xset[i]];


			/*
			// Variant B: 15% slower on  i7-5930K
			
			BYTE cn = 0;
			for (int i = 0; i<xlng; i++) {
				cn |= rConn[Xset[i]];
				if ( cn ) break;
			}
			*/

			/*
			// VariantC:   40% slowe
			BYTE cn = 0;
			for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );
			*/

			//XXXXXXXXXXXXXXXXXXX

			if (cn)	*Dst++ = re;
			else		Xset[xlng++] = re;
		}

		Xset += xlng;
		ISlng[NX++] = xlng;
		NumRem = int(Dst-S);
	}

	movemem( S,Temp,N*sizeof(VTX));
	return NX;
}


///////////////////////////////////////////////////////////////////////
//
//	Function creates a random graph and tests the "Independent Set"
//	function.
//	Testing graphs  NV=1000,  Density=0.5,  Number Iterations = 20
//
////////////////////////////////////////////////////////////////////////
double	IXSetTest(
		int		NV,
		double	Dens,
		int		NumIter )
{
PINT  ISlngBuf	= NULL;
PVTX  TempBuf	= NULL;
PBYTE M		= NULL;
PVTX  Subset      = NULL;
bool  res		= false;




	// Generate the graph
	M = (PBYTE)calloc(NV*NV, 1);
	if ( !M ) goto func_exit;

	srand(100);
	for (int i = 0; i<NV; i++) {
		M[i*NV + i] = 1;
		for (int j = i + 1; j<NV; j++)
			if (ffrand()<Dens)
				M[i*NV + j] = M[j*NV + i] = 1;
	}

	

	// just a sample initialization 
	Subset = (PVTX)malloc(NV * sizeof(VTX));

	// ... and create a "random" subset
	for (int j = 0; j<NV; j++) Subset[j] = j;



const int	NumCpu = GetNumLogicalCpus(),
		NumRepeat = NumCpu*ITER_DEPTH;

	ISlngBuf = (PINT)malloc( NumRepeat*NV*sizeof(int));
	if (!ISlngBuf) goto func_exit;

	TempBuf  = (PVTX)malloc( NumRepeat*NV *sizeof(VTX));
	if (!TempBuf) goto func_exit;

LARGE_INTEGER Start;
	QueryPerformanceCounter(&Start);


	for ( int i=0; i<NumIter; i++ )
#ifdef IXS_PARALLEL
		parallel_for(size_t(0), size_t(NumRepeat), size_t(1), [=](size_t r) {
#else
		for ( int r=0; r<NumRepeat; r++ ) {
#endif
			int boffs = r*NV;
			Principal_ExtractIndependentSets( M,NV,Subset,NV,ISlngBuf+boffs,TempBuf+boffs );

#ifdef IXS_PARALLEL
		});
#else
		}
#endif


double Tm = GetElapsedTime( Start );

	PR_DisplayDoubleValue(_T("Elapsed Time (sec)"),Tm );


	res = true;
func_exit:
	free( ISlngBuf );
	free( TempBuf );
	free( M );
	free( Subset );

	return Tm;
}

**mkh01** · September 2nd, 2017, 11:24 AM

I posted my usage scenario and fully functional source code.
Thank you so much.

**mkh01** · September 2nd, 2017, 11:26 AM

>>disallows concurrent access of the same cache line.

Super interesting.
I am contemplating how to account for this in the code....

**mkh01** · September 2nd, 2017, 11:31 AM

I posted the actual code that mostly resembles my results in practice.

SIMD: I actually did several recompilations (VC++) with SSE, SSE2 abd "No Enchanced Instructions"
flag and timing results were identical.

**mkh01** · September 2nd, 2017, 11:35 AM

Thanks
My code looks ancient compared to yours, i need time to read through
this.

**mkh01** · September 2nd, 2017, 11:41 AM

Also, slightly off the original question, I would not mind to
have a graph represented as a 1-bit matrix ( better cached), but
then the accumulation loop will slow down too much, but may
be i just was not able to figure it out...

**2kaud** · September 2nd, 2017, 12:38 PM

When posting code, please use code tags so that the code is readable.

For variant c,

Code:

for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );

should be

Code:

for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn = rConn[*xp++] );

If you change this, what does Variant c then give for your system?

I'll have a look at the code when I have some time, but we have a quite major situation here so I don't know when that will be.

**2kaud** · September 2nd, 2017, 12:59 PM

Originally Posted by mkh01

Thanks
My code looks ancient compared to yours, i need time to read through
this.

The code in my post #20 is c++. The c++ standard changes about every 3 years. We've now on c++17 for which VS2017 release 15.3.3 supports many (but not yet all) new features. The previous standard was c++14.

Thread: Is there anyway to make this C-code faster?

Thread Tools

Display

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Re: Is there anyway to make this C-code faster?

Posting Permissions