-
September 1st, 2017, 02:09 PM
#16
Re: Is there anyway to make this C-code faster?
Thank you for your kind and informative message.
C is in fact, random, without repetitions. It is basically an array representing a subset of
graph vertices. B - is a row in the graph's adjacency matrix.
Yes, all threads are accessing the same matrix and performing identical code. I am not
sure what the "false sharing" means though.
I will post a "Greedy algorithm for the Minimal Independent Set in random Graph", about 40 lines
of C-code. But then this has to be called in parallel with random graph subsets to reproduce my scenario,
I personally use VC++ 2017 and "parallel_for" constructs in the calling procedure.
-
September 1st, 2017, 02:34 PM
#17
Re: Is there anyway to make this C-code faster?
I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.
The test result timings should only be used in relation to each other and have no other meaning. The results in increasing order of timing are
Code:
for (size_t i = 0; i<N; ++i)
Result |= B[C[i]];
timing: 4.68439
Code:
for (size_t i = 0; i < N; Result |= B[C[i++]]);
timing: 4.69151
Code:
for (size_t i = 0; i < N && !Result; Result = B[C[i++]]);
timing: 5.45841
Code:
for (int *cp = C - 1, *ce = C + N; cp != ce; Result |= B[*++cp]);
timing: 5.65873
Code:
for (int *cp = C - 1, *ce = C + N; cp != ce && !Result; Result = B[*++cp]);
timing: 6.37585
All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
September 1st, 2017, 03:11 PM
#18
Re: Is there anyway to make this C-code faster?
Originally Posted by mkh01
I am not sure what the "false sharing" means though.
basically, "false sharing" occurs when two or more cores try to access different memory location lying in the same cache line; they look like being indipendent read/writes but turn out effectively resulting in serial execution because the cache choerency protocol disallows concurrent access of the same cache line.
-
September 1st, 2017, 03:17 PM
#19
Re: Is there anyway to make this C-code faster?
Originally Posted by 2kaud
I've done some tests - and I'm amazed at the results! This is against what previous tests have shown (a few years ago) so it looks like MS has done a lot of work with their optimisations.
can you post the asm ? assuming you populated B and C randomly and uniformly, I'd guess the compiler vectorized the code via a gather addressing SIMD instruction in the first case ... BTW, in all linear algebra libraries I know ( where vectorization is a must ) indices addressing is used because pointer based loops inhibit vectorization
-
September 1st, 2017, 05:14 PM
#20
Re: Is there anyway to make this C-code faster?
I've done a lot more testing, and the results vary significantly as to the distribution of 1's in B and how B is accessed via C. If a 1 is accessed in B fairly quickly in the loop then the code that tests for Result is quicker than those that don't. However if a 1 is accessed late on in the loop, then the code that doesn't test for Result is the quicker.
In order to investigate further, I've quickly knocked up some test code that will do multiple iterations. Consider
Code:
#pragma warning(disable : 4121)
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <iostream>
#include <set>
#include <chrono>
#include <random>
using namespace std;
using namespace std::chrono;
using fun = size_t(*)(BYTE &);
enum statis {tavrg = 0, tmaxm, tminm, cavrg, cmaxm, cminm, eend = cminm + 1};
const size_t sizeB = 5000000;
const size_t sizeC = 10000;
const size_t iter = 200;
BYTE* const B = new BYTE[sizeB];
size_t* const C = new size_t[sizeC];
size_t const N = sizeC;
size_t fun1(BYTE & Result)
{
Result = 0;
size_t i = 0;
for (; i < N; ++i)
Result |= B[C[i]];
return i;
}
size_t fun2(BYTE & Result)
{
Result = 0;
size_t i = 0;
for (; i < N; Result |= B[C[i++]]);
return i;
}
size_t fun3(BYTE & Result)
{
Result = 0;
size_t i = 0;
for (; i < N && !Result; Result = B[C[i++]]);
return i;
}
size_t fun4(BYTE & Result)
{
Result = 0;
size_t *cp = C, *ce = C + N;
for (; cp != ce; Result |= B[*cp++]);
return cp - C;
}
size_t fun5(BYTE & Result)
{
Result = 0;
size_t *cp = C, *ce = C + N;
for (; cp != ce && !Result; Result = B[*cp++]);
return cp - C;
}
const fun funcs[] = { fun1, fun2, fun3, fun4, fun5 };
const size_t funsize = sizeof(funcs) / sizeof(funcs[0]);
int main()
{
double res[funsize][eend] = { 0.0 };
for (size_t r = 0; r < funsize; ++r)
res[r][tminm] = res[r][cminm] = 9999999.0;
auto Bgen = mt19937 { random_device {}() };
auto Cgen = mt19937 { random_device {}() };
auto Bud = uniform_int_distribution<short> { 0, 1 };
auto Cud = uniform_int_distribution<size_t> { 0, sizeB - 1 };
for (size_t it = 0; it < iter; ++it) {
set<size_t> sst;
for (size_t i = 0; i < sizeB; ++i)
B[i] = static_cast<BYTE>(Bud(Bgen));
for (size_t i = 0; i < sizeC; ++i) {
size_t r;
do {
r = Cud(Cgen);
} while (sst.insert(r).second == false);
C[i] = r;
}
for (size_t f = 0; f < funsize; ++f) {
const auto start = high_resolution_clock::now();
BYTE Res;
const size_t c = funcs[f](Res);
const auto stop = high_resolution_clock::now();
const auto dur = duration<double, nano>(stop - start).count();
res[f][tavrg] += dur;
res[f][cavrg] += c;
if (res[f][tmaxm] < dur)
res[f][tmaxm] = dur;
if (res[f][tminm] > dur)
res[f][tminm] = dur;
if (res[f][cmaxm] < c)
res[f][cmaxm] = c;
if (res[f][cminm] > c)
res[f][cminm] = c;
}
}
cout << "For " << iter << " iterations\n\n\t\t";
for (size_t i = 0; i < funsize; ++i)
cout << "fun" << i + 1 << "\t";
cout << "\n\nAverage time: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][tavrg] / iter << "\t";
cout << "\n\nMaximum time: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][tmaxm] << "\t";
cout << "\n\nMinimum time: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][tminm] << "\t";
cout << "\n\nAverage count: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][cavrg] / iter << "\t";
cout << "\n\nMaximum count: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][cmaxm] << "\t";
cout << "\n\nMinimum count: \t";
for (size_t i = 0; i < funsize; ++i)
cout << res[i][cminm] << "\t";
cout << endl;
delete[] B;
delete[] C;
}
This gives on my system an output for 1 run of
Code:
For 200 iterations
fun1 fun2 fun3 fun4 fun5
Average time: 60313.2 46729.4 322 46525.9 254.835
Maximum time: 105494 63612 396 58871 396
Minimum time: 51759 42276 0 41881 0
Average count: 10000 10000 2.195 10000 2.195
Maximum count: 10000 10000 8 10000 8
Minimum count: 10000 10000 1 10000 1
Where the loop using pointers with termination test (fun5) is the quickest!
PS
The code in the funx() functions has been changed to have Result as a ref parameter. Upon looking at the generated assembler (for superbonzo post #19), the wily compiler was just returning the value of N for fun1() and fun2() as the whole code for the function and had removed completely the for loop as Result wasn't used! Doh!! Now that Result is a ref parameter, the whole for loop has to be executed. Hence the new timings
Last edited by 2kaud; September 2nd, 2017 at 04:41 PM.
Reason: PS
All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
September 2nd, 2017, 07:30 AM
#21
Re: Is there anyway to make this C-code faster?
Sincere thanks for your help.
I am working on a self contained C-code to reproduce my usage scenarios
and timings.
It is under VC++ 2017 and should be ready shortly.
Should I make it a separate post? This one is getting very long...
-
September 2nd, 2017, 07:42 AM
#22
Re: Is there anyway to make this C-code faster?
Should I make it a separate post? This one is getting very long...
As it's all related to the original interesting (IMO!) question regarding speeding up code, I would suggest there's no need to start a new thread.
All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
September 2nd, 2017, 11:22 AM
#23
Re: Is there anyway to make this C-code faster?
Thanks again.
I was able to make a short code that somewhat reproduce my usage scenario.
Actual function I am using is "Principal", the rest including random graph generation
is just a demo code.
I also included the previous suggestions on codeguru and timed them.
Here is a link to googledrive:
https://drive.google.com/file/d/0B2l...ew?usp=sharing
and the C-code. This was compiled and run using VC++ 2017
Code:
#include "stdafx.h"
#include <ppl.h>
using namespace std;
using namespace concurrency;
#define IXS_PARALLEL
typedef unsigned _int16 VTX,*PVTX;
enum {
ITER_DEPTH = 200
};
__forceinline double ffrand() { return rand() / (double)RAND_MAX; }
int GetNumLogicalCpus()
{
SYSTEM_INFO SysInfo;
GetSystemInfo(&SysInfo);
return SysInfo.dwNumberOfProcessors;
}
double GetElapsedTime( LARGE_INTEGER Start )
{
// stop timer
LARGE_INTEGER t;
QueryPerformanceCounter(&t);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
return double(t.QuadPart - Start.QuadPart)/freq.QuadPart;
}
///////////////////////////////////////////////////////////
//
//
//
///////////////////////////////////////////////////////////
static int Principal_ExtractIndependentSets(
PBYTE G,
int NV,
PVTX S,
int N,
PINT ISlng,
PVTX Temp )
{
if ( !N ) return 0;
PVTX Xset = Temp;
int NX = 0;
int NumRem = N;
while (NumRem) {
Xset[0] = S[0];
int xlng = 1;
PVTX Dst = S;
for (int r = 1; r<NumRem; r++) {
int re = S[r];
const PBYTE rConn = G+re*NV;
//XXXXXXXXXXXXXXXX
// BOTTLENECK
//
// VariantA
BYTE cn = rConn[Xset[0]];
for (int i = 1; i<xlng; i++)
cn |= rConn[Xset[i]];
/*
// Variant B: 15% slower on i7-5930K
BYTE cn = 0;
for (int i = 0; i<xlng; i++) {
cn |= rConn[Xset[i]];
if ( cn ) break;
}
*/
/*
// VariantC: 40% slowe
BYTE cn = 0;
for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );
*/
//XXXXXXXXXXXXXXXXXXX
if (cn) *Dst++ = re;
else Xset[xlng++] = re;
}
Xset += xlng;
ISlng[NX++] = xlng;
NumRem = int(Dst-S);
}
movemem( S,Temp,N*sizeof(VTX));
return NX;
}
///////////////////////////////////////////////////////////////////////
//
// Function creates a random graph and tests the "Independent Set"
// function.
// Testing graphs NV=1000, Density=0.5, Number Iterations = 20
//
////////////////////////////////////////////////////////////////////////
double IXSetTest(
int NV,
double Dens,
int NumIter )
{
PINT ISlngBuf = NULL;
PVTX TempBuf = NULL;
PBYTE M = NULL;
PVTX Subset = NULL;
bool res = false;
// Generate the graph
M = (PBYTE)calloc(NV*NV, 1);
if ( !M ) goto func_exit;
srand(100);
for (int i = 0; i<NV; i++) {
M[i*NV + i] = 1;
for (int j = i + 1; j<NV; j++)
if (ffrand()<Dens)
M[i*NV + j] = M[j*NV + i] = 1;
}
// just a sample initialization
Subset = (PVTX)malloc(NV * sizeof(VTX));
// ... and create a "random" subset
for (int j = 0; j<NV; j++) Subset[j] = j;
const int NumCpu = GetNumLogicalCpus(),
NumRepeat = NumCpu*ITER_DEPTH;
ISlngBuf = (PINT)malloc( NumRepeat*NV*sizeof(int));
if (!ISlngBuf) goto func_exit;
TempBuf = (PVTX)malloc( NumRepeat*NV *sizeof(VTX));
if (!TempBuf) goto func_exit;
LARGE_INTEGER Start;
QueryPerformanceCounter(&Start);
for ( int i=0; i<NumIter; i++ )
#ifdef IXS_PARALLEL
parallel_for(size_t(0), size_t(NumRepeat), size_t(1), [=](size_t r) {
#else
for ( int r=0; r<NumRepeat; r++ ) {
#endif
int boffs = r*NV;
Principal_ExtractIndependentSets( M,NV,Subset,NV,ISlngBuf+boffs,TempBuf+boffs );
#ifdef IXS_PARALLEL
});
#else
}
#endif
double Tm = GetElapsedTime( Start );
PR_DisplayDoubleValue(_T("Elapsed Time (sec)"),Tm );
res = true;
func_exit:
free( ISlngBuf );
free( TempBuf );
free( M );
free( Subset );
return Tm;
}
Last edited by 2kaud; September 2nd, 2017 at 12:31 PM.
Reason: Added code tags
-
September 2nd, 2017, 11:24 AM
#24
Re: Is there anyway to make this C-code faster?
I posted my usage scenario and fully functional source code.
Thank you so much.
-
September 2nd, 2017, 11:26 AM
#25
Re: Is there anyway to make this C-code faster?
>>disallows concurrent access of the same cache line.
Super interesting.
I am contemplating how to account for this in the code....
-
September 2nd, 2017, 11:31 AM
#26
Re: Is there anyway to make this C-code faster?
I posted the actual code that mostly resembles my results in practice.
SIMD: I actually did several recompilations (VC++) with SSE, SSE2 abd "No Enchanced Instructions"
flag and timing results were identical.
-
September 2nd, 2017, 11:35 AM
#27
Re: Is there anyway to make this C-code faster?
Thanks
My code looks ancient compared to yours, i need time to read through
this.
-
September 2nd, 2017, 11:41 AM
#28
Re: Is there anyway to make this C-code faster?
Also, slightly off the original question, I would not mind to
have a graph represented as a 1-bit matrix ( better cached), but
then the accumulation loop will slow down too much, but may
be i just was not able to figure it out...
-
September 2nd, 2017, 12:38 PM
#29
Re: Is there anyway to make this C-code faster?
When posting code, please use code tags so that the code is readable.
For variant c,
Code:
for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn|= rConn[*xp++] );
should be
Code:
for ( PVTX xp=Xset,Xend=Xset+xlng; xp!=Xend && !cn; cn = rConn[*xp++] );
If you change this, what does Variant c then give for your system?
I'll have a look at the code when I have some time, but we have a quite major situation here so I don't know when that will be.
All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!
C++23 Compiler: Microsoft VS2022 (17.6.5)
-
September 2nd, 2017, 12:59 PM
#30
Re: Is there anyway to make this C-code faster?
Originally Posted by mkh01
Thanks
My code looks ancient compared to yours, i need time to read through
this.
The code in my post #20 is c++. The c++ standard changes about every 3 years. We've now on c++17 for which VS2017 release 15.3.3 supports many (but not yet all) new features. The previous standard was c++14.
All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!
C++23 Compiler: Microsoft VS2022 (17.6.5)
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
Click Here to Expand Forum to Full Width
|