Any way to improve this?

Re: Any way to improve this?

1. The total number of 5 letter words is 25,266.

2. It takes only 1 second to load the vector.

Re: Any way to improve this?

I did a quick little test, and I might have been wrong about the using
char[6] instead of string not being much faster.

Instead of vector<string> I used vector<S>, with S defined as:

Code:

struct S { char buff[6]; };

And it cut the processing time by over 50%.

Re: Any way to improve this?

Code:

struct S { char buff[6]; }; int main() { Graph g; vector<S> wordList; ifstream file("words"); if( !file ) { cerr << "Cannot open the dictionary list " << endl; return 1; } else { string word; if (file.is_open()) { while (getline (file,word)) { if (word.size() == 5) { wordList.push_back(word); } } file.close();

Ok, this is what I have right now but I'm not sure how to read in and store the words in the buffer.

Re: Any way to improve this?

Here is was testing code (I created a file with 400000 words in it,
with 36000 being five letter words ... all random.

Code:

#include <vector> #include <ctime> #include <iostream> #include <fstream> using namespace std; struct S { char buff[6]; }; int main() { vector<S> wordList; clock_t start , finish; start = clock(); ifstream in("p.txt"); S s; int nFive = 0; char buff[80]; // must be large enough for each line in the file while (in.getline(buff,80)) { if (strlen(buff) == 5) { strcpy(s.buff,buff); wordList.push_back(s); ++nFive; } } finish = clock(); cout << "cycles to read/push data = " << finish - start << "\n"; cout << "number of 5 letter words = " << nFive << "\n"; start = clock(); for (size_t i = 0; i < wordList.size(); ++i) { for (size_t j = i+1; j < wordList.size(); ++j) { int count = 0; for (int k = 0; k < 5; ++k) { if ( wordList[i].buff[k] != wordList[j].buff[k]) { if( ++count > 1) break; } } if (count == 1) { // } } } finish = clock(); cout << "cycles to process data = " << finish - start << "\n"; return 0; }

Re: Any way to improve this?

Quote:

Originally Posted by Philip Nicoletti

I might have been wrong about the using
char[6] instead of string not being much faster...

So that makes me what? Right? ;)

Re: Any way to improve this?

I'm not sure if the purpose of 'improve' is clarity or speed, but I thought I'd try this version....

It assumes VS2K5.NET, a console application supporting MFC (so the Windows functions for file mapping are included).

I only bothered with the file read and stuffing the vector - the processing I didn't bother working on....yet, I have an idea and I'll come back with that in a bit....

This version ran on a 2Ghz AMD 32 bit cpu, on an example file of 400,000 words of random sizes, in which 39,000 or so were 5 letter words, and loaded the 5 letter words into a vector of struct S as depicted by Philip Nicoletti.

The example took 16 clock ticks ( 16ms ), which I think is under the resolution of the timer.

So I tried a file of 1.6 million words that had 159,000 + 5 letter words in it.

It did that in 62ms.

It uses a read only memory mapped file to pull the data in....

Eventually I got around to trying a sample of 40 million words (a 300Mbyte source file). The standard streams approach took 25 seconds when the file was already in cache, whereas the memory mapped file approach took only 3.188 seconds on the same volume.

Memory mapped files are widely recognized as the fasted mechanism for Windows to read data from a file.

Code:

#include "stdafx.h" // supplied by VSNET for a console build with MFC support // The one and only application object CWinApp theApp; #include <vector> using namespace std; struct S { char buff[ 6 ]; }; int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { // ignore this opening nonsense, it's about MFC initialization int nRetCode = 0; // initialize MFC and print and error on failure if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) { // TODO: change error code to suit your needs _tprintf(_T("Fatal Error: MFC initialization failed\n")); nRetCode = 1; } else { // the stuff above this (from _tmain) is supplied by VS Studio - ignore it for now // ============================================== // Here's where the code actually begins..... // ============================================== // I start the clock, and end it, in 'processing' steps compatible with Philip Nicoletti's example clock_t tstart = clock(); vector<S> wordlist; HANDLE memfile = CreateFile( _T("p.txt"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL ); unsigned long fsize = GetFileSize( memfile, NULL ); if ( memfile != INVALID_HANDLE_VALUE ) { HANDLE hMap = CreateFileMapping( memfile, NULL, PAGE_READONLY, 0, fsize, NULL ); void *fileview = MapViewOfFile( hMap, FILE_MAP_READ, 0, 0, fsize ); char *FileBase = (char *) fileview; char *Limit = FileBase + fsize; char *ptr = FileBase; int word5count = 0, wordcount=0; S tmpS; while( ptr < Limit ) { char *wstart = ptr; while( *ptr >= 65 && ptr < Limit ) ++ptr; if ( ptr < Limit ) { if ( (ptr - wstart) == 5 ) { ++word5count; strncpy( tmpS.buff, wstart, 5 ); tmpS.buff[5] = 0; wordlist.push_back( tmpS ); } ++wordcount; } while( *ptr < 65 && ptr < Limit ) ++ptr; } clock_t tend = clock(); printf("Word Count %d, 5 Count %d\n", wordcount, word5count ); printf("Duration %d\n", tend - tstart ); UnmapViewOfFile( fileview ); CloseHandle( hMap ); CloseHandle( memfile ); } } return nRetCode; }

It's a tad hasty, and very C like. Just a look into another 'way' that focuses on speed of file reading.

If I were going to use something like this I'd make a file mapping object and use that to return a pointer to the base of the file.

...and as to earlier posts I didn't reply to....(and sorry for the tangent on fragmentation )

VladimirF:

Somehow I was thinking 400 million words, not 400 thousand words - probably the gin & tonic :)

IllegalCharacter:

Regarding garbage collection.....when you say it can move things around, we're talking about different things.....

It's what spoulson was talking about (requiring descriptors).

The operating system uses that to perform VM.

However, that doesn't affect fragmentation of RAM in C++, because the fragmentation applies to the numeric value of the pointers, not their physical location in RAM. Think about it carefully and you'll realize we're talking about two different subjects there. Think about what it would take if a 'standard' C++ or C program had, say, a couple of pointers on a buffer, and garbage collection moved the allocated memory to another location - how would it 'know' to change the value each of these pointers identified in kind?

It wouldn't.

This means the heap C++ is managing is fragmented because of the values of the pointers it's dispensed from CRT - the couldn't be relocated. Even if you moved the physical RAM around (which does happen when VM activates for virtually any reason), the locations identified by these pointers are not moved - for a C++ program, and neither is the heap within the CRT.

Garbage collection works in C# and VB because the handles (which replace pointers, one might think 'references' in Java) are indirect. The OS can run through all handles and 'adjust' the locations in RAM they're identifying, without altering the handle as 'known' by the application level code.

If, in C++, you used smart pointers, and never standard pointers, the indirection posed by having the pointer in 'two parts' would permit a garbage collection system to function.

Re: Any way to improve this?

OK, a few hours later...I have something interesting.

The idea here is to make the S struct hold a few copies of the word, rotated through the 5 permutations.

The vector is sorted by each of these permutations, and the list is compared (searched) in sorted order for matches by 4 characters.

The result SHOULD be all pairs that differ only in 1 character....maybe I misunderstand the requirement.

I have kept, after my version, the original processing version from Philip Nicoletti for comparison.

This reduced the speed of the research algorithm from about 60 seconds on a volume 39,000 to under 0.3 seconds.

That's right, 300 milliseconds on a 2Ghz AMD 32 bit CPU (actually, the research portion was 265 ms, the file loading was another 32 ms ).

In the original algorithm, the use of " if ( count==1 ) " means the algorithm rejects duplicates. I think duplicates actually fit the spec of a match, but perhaps not. Anyway, comment in my example section shows how to match the Philip Nicolleti's collection set either way.

Code:

#include "stdafx.h" // This is a console app with MFC support for memory mapped files CWinApp theApp; #include <vector> #include <algorithm> using namespace std; // S taken from Philip Nicoletti's version, adding features struct S { char buff[ 6 ]; char compbuf[ 4 ][ 6 ]; // used to store rotations of buff // naive but functional generation of rotations // ABCDE - buff // BCDEA - 0 of compbuf // CDEAB - 1 // DEABC - 2 // EABCD - 3 inline void prep() { strcpy( compbuf[ 0 ], buff + 1 ); strncat( compbuf[ 0 ], buff, 1 ); strcpy( compbuf[ 1 ], buff + 2 ); strncat( compbuf[ 1 ], buff, 2 ); strcpy( compbuf[ 2 ], buff + 3 ); strncat( compbuf[ 2 ], buff, 3 ); strcpy( compbuf[ 3 ], buff + 4 ); strncat( compbuf[ 3 ], buff, 4 ); } }; // series of compare functions for STL sort, one for each 'column' of the rotations bool Scomp0( S & s1, S & s2 ) { if ( strncmp( s1.buff, s2.buff, 5 ) > 0 ) return true; return false; } bool Scomp1( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[0], s2.compbuf[0], 5 ) > 0 ) return true; return false; } bool Scomp2( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[1], s2.compbuf[1], 5 ) > 0 ) return true; return false; } bool Scomp3( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[2], s2.compbuf[2], 5 ) > 0 ) return true; return false; } bool Scomp4( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[3], s2.compbuf[3], 5 ) > 0 ) return true; return false; } int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { int nRetCode = 0; // This target is a console app with MFC support (for mem mapped files only) // initialize MFC and print and error on failure if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) { // TODO: change error code to suit your needs _tprintf(_T("Fatal Error: MFC initialization failed\n")); nRetCode = 1; } else { // ================================= // This is the top of "main" - the junk just above is Microsoft doing IT's stuff // ================================= clock_t tstart = clock(); vector<S> wordlist; // 'out of the book' read only memory mapped file HANDLE memfile = CreateFile( _T("p.txt"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL ); unsigned long fsize = GetFileSize( memfile, NULL ); if ( memfile != INVALID_HANDLE_VALUE ) { HANDLE hMap = CreateFileMapping( memfile, NULL, PAGE_READONLY, 0, fsize, NULL ); void *fileview = MapViewOfFile( hMap, FILE_MAP_READ, 0, 0, fsize ); char *FileBase = (char *) fileview; char *Limit = FileBase + fsize; char *ptr = FileBase; int word5count = 0, wordcount=0; S tmpS; // at this point the file looks like a block of RAM while( ptr < Limit ) { char *wstart = ptr; while( *ptr >= 65 && ptr < Limit ) ++ptr; if ( ptr < Limit ) { if ( (ptr - wstart) == 5 ) { // have a word 5 characters long ++word5count; strncpy( tmpS.buff, wstart, 5 ); tmpS.buff[5] = 0; tmpS.prep(); // generate rotations wordlist.push_back( tmpS ); } ++wordcount; } while( *ptr < 65 && ptr < Limit ) ++ptr; } clock_t tend = clock(); printf("Word Count %d, 5 Count %d\n", wordcount, word5count ); printf("Duration %d\n", tend - tstart ); UnmapViewOfFile( fileview ); CloseHandle( hMap ); CloseHandle( memfile ); tstart = clock(); sort( wordlist.begin(), wordlist.end(), Scomp0 ); int test_edgecount = 0; vector<S>::iterator Si = wordlist.begin(); vector<S>::iterator Slimit = wordlist.end(); vector<S>::iterator St = wordlist.begin(); S *prev = NULL; // loop through the default word (non rotated) // the array is now sorted by the non-rotated word while( Si != Slimit ) { if ( prev ) { St = Si; while ( St != Slimit && strncmp( prev->buff, (*St).buff, 4 ) ==0 ) { // note, this test of 5 characters rejects an exact match, which // gives the same results as the Philip Nicoletti algorithm // however, if you want to collect duplicates (exact matches) // take this match anyway (skip the test, take the 'edge' if ( strncmp( prev->buff, (*St).buff, 5 ) !=0 ) { // we have a target ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } // now, loop through each 'rotation' of the word. Sort the array // by that rotation, and search the list again for( int comp_pass=0; comp_pass < 4; ++comp_pass ) { switch( comp_pass ) { case 0: sort( wordlist.begin(), wordlist.end(), Scomp1 ); break; case 1: sort( wordlist.begin(), wordlist.end(), Scomp2 ); break; case 2: sort( wordlist.begin(), wordlist.end(), Scomp3 ); break; case 3: sort( wordlist.begin(), wordlist.end(), Scomp4 ); break; } Si = wordlist.begin(); Slimit = wordlist.end(); prev = NULL; while( Si != Slimit ) { if ( prev ) { St = Si; while ( St != Slimit && strncmp( prev->compbuf[comp_pass], (*St).compbuf[comp_pass], 4 ) ==0 ) { if ( strncmp( prev->compbuf[comp_pass], (*St).compbuf[comp_pass], 5 ) !=0 ) { ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } } tend = clock(); printf("exp algorithm duration %d, edgecount %d\n", tend - tstart, test_edgecount ); // from here down is the original method from Philip Nicoletti's code // comment out if you don't want two tests timed against each other tstart = clock(); int edgecount = 0; for (size_t i = 0; i < wordlist.size(); ++i) { for (size_t j = i+1; j < wordlist.size(); ++j) { int count = 0; for (int k = 0; k < 5; ++k) { if ( wordlist[i].buff[k] != wordlist[j].buff[k]) { if( ++count > 1) break; } } // note, this 'rejects' exact matches - use count <= 1 if you want // to collect exact matches if (count == 1) { ++edgecount; } } } tend = clock(); printf("Duration %d, count %d\n", tend - tstart, edgecount ); } } return nRetCode; }

In case it's not clear, here's my thinking.....

Consider non-word letter combinations like these:

BKFPR - BAFPR

In this case the only mismatch is the second character

BKFPR - BKCPR

Here the mismatch is the 3rd

Now, look at this pair again:

BKFPR - BAFPR
FPRBK - FPRBA < - the same pair, rotated

In the structure S, the second row is a rotation of the main word.
It would be found in row 2 of the compbuf array

Notice that if the entire list is sorted, these two would match up, missing
only the 4th character of the rotated version.

That's the basic idea. Using that feature, I sort the list on each of
these rotated permutations.

It appears you can gain over a 50 fold improvement in speed.

The old books are right about optimization.

More than a shift from one storage method or another, a shift of algorithm is most often the backbone of performance.

Sometimes things actually look simpler when they're faster (like Qsort).

Sometimes they look downright ugly and involved.

Here, the trick is to do just about anything to avoid that loop inside a loop inside a loop (the most interior loop was on each character).

For 39,000 words of 5 characters that's turning out to be billions of iterations - geometric expansion. Yet, the algorithm looks compact, lightweight, even good.

The thing I ended up splattering on the forum is a 5 step memory hog, and there's some naive methods used (I'm not doing this for work you know :) ), but it does avoid the geometric expansion problem.

I tested on a database of 1.6 million words, with a '5 word' count of 159,940. It took 125 ms to load the data, 1.296 seconds for my version to research it, while the original method took ... well, I don't know, I'm still waiting...

I'll post back when it finishes....

Ah, there it is...... 976.891 seconds.

Interesting curve....

Volume 39000, 0.3 sec vs 60.something sec.

Volume 159000, 1.296 sec vs 976.891 sec.

Volume 3999802, 52.297 sec vs ...I don't know, days (fair calculations suggest 7 days, or about 165 to 170 hours)

That last one was from a file of 40,000,000 words, out of which not quite 4 million were 5 letter (making the probability of matches increase). It it took 3.469 secs to load that file.

With just a tad of engineering, this could easily be split into threads for dual/quad core advantage (so could the previous algorithm).

If I were on some commercial project, and this was an experiment to finding a solution, I think I would consider this a significant candidate, but I'd take the example and re-factor nearly everything with better OOP techniques, to generalize the algorithm a little more (maybe consider 6, 7, n letter words), use an object for the file mapping, support up to quad core threading (higher thread/core support would be a little nastier), and rework that naive rotation in struct S (in fact, to support threading, struct S would be split into pieces I think).

Re: Any way to improve this?

Bump

I hope the moderators will forgive a triple post (yish, I'm bad).

I wanted Ehump20 to find this, as it had been pushed down to page 2. The post above represents a dramatic speed improvement over the given method (0.3 sec vs 60 sec on my machine), and that's what Ehump20 had asked for...any way to improve this...

Re: Any way to improve this?

Quote:

Originally Posted by JVene

...any way to improve this...

1. Remove your compbuf array; you never use different rotations at the same time - so you could rotate in-place. Will make your code simpler and save some memory.
2. Pad your struct to 8 byte
3. To compare first four characters, instead of strncmp() use integer compare.

4. MAKE SURE YOUR CODE ISN'T OPTIMIZED OUT!
<correction> - sorry, I see that you are printing the edge count, so ignore my remark #4!

Re: Any way to improve this?

Hey VladimirF

Thanks for looking!

My reactions to your ideas.....

Quote:

Remove your compbuf array; you never use different rotations at the same time - so you could rotate in-place. Will make your code simpler and save some memory.

I don't think so. One of the points I make at the end is that this may lead to dividing the material into threads, which would need all 4 copies for each thread to operate separately - meaning all 4 would, in that case, be used simultaneously.

Rotating in place would destroy the original (though it could be returned in one extra phase), and I see in the OP's code, the 'count' of edges is a place where the OP's code posts the output to some graph routine.

Quote:

Pad your struct to 8 byte

This would definately help the CPU access. Something for OP to keep in mind.

Quote:

To compare first four characters, instead of strncmp() use integer compare.

Very good point - probably gain quite a bit since there will be around 300,000 of these or so, depending on the data.

My version was a 'concept' post, which I mentioned was a candidate for complete re-work for a production code (if that were the target), but since the performance gain was over 100 time the speed of the OP version, and in volumes of 4 million target words (out of 40 million in the source base), the improvement is over 10,000 times faster, I thought I'd gone far enough.

...but, like they say, there's usually something else you can squeeze out.

Your suggestions might drop my 0.3 sec (vs 60 seconds) down to 0.2 :)

...when I see a triple nested loop, I just can't help myself anymore :)

Re: Any way to improve this?

Quote:

Originally Posted by JVene

Your suggestions might drop my 0.3 sec (vs 60 seconds) down to 0.2 :)

Well, in my 400,000 test (all 5 letter words, we don't need to measure file reading / parsing anymore) I got ~50% saving: from 3.235 sec to 1.542 sec
The potential saving from parallel procesing is unknown. Anyway, you should probably keep five rotations in the separate vectors to avoid concurent access to the same memory pages from different threads.
But the real saving is from keeping data close together, to reduce cache misses.
Anyway, your idea is great!
Here is my code for those who is interested (it might be just two of us :) )

Code:

// General sorting idea and algo by JVene // minor modifications by Vlad struct S { char buff[ 8 ]; }; // naive but functional generation of rotations // ABCDE - buff // BCDEA - 0 of compbuf // CDEAB - 1 // DEABC - 2 // EABCD - 3 inline void rotate( char b[ 8 ] ) { char ch = b[0]; for(int i = 0; i < 4; ++i) b[i] = b[i+1]; b[4] = ch; } // single compare function for STL sort bool Scomp0( S & s1, S & s2 ) { if ( strncmp( s1.buff, s2.buff, 5 ) > 0 ) return true; return false; } int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { int nRetCode = 0; // ... the file reading part is yours... -rotation... tstart = clock(); int test_edgecount = 0; vector<S>::iterator Si; vector<S>::iterator Slimit; vector<S>::iterator St; for( int comp_pass=0; comp_pass < 5; ++comp_pass ) { sort( wordlist.begin(), wordlist.end(), Scomp0 ); Si = wordlist.begin(); Slimit = wordlist.end(); St = wordlist.begin(); S *prev = NULL; // loop through the default word (non rotated) // the array is now sorted by the non-rotated word while( Si != Slimit ) { if ( prev ) { St = Si; //while ( St != Slimit && strncmp( prev->buff, (*St).buff, 4 ) ==0 ) while ( St != Slimit && ( *(int*)prev->buff == *(int*)(*St).buff ) ) { // note, this test of 5 characters rejects an exact match, which // gives the same results as the Philip Nicoletti algorithm // however, if you want to collect duplicates (exact matches) // take this match anyway (skip the test, take the 'edge' //if ( strncmp( prev->buff, (*St).buff, 5 ) !=0 ) if ( prev->buff[4] != (*St).buff[4] ) { // we have a target ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } // rotate it one char at a time; // the sort is at the top of this "while" loop Si = wordlist.begin(); Slimit = wordlist.end(); while( Si != Slimit ) { rotate( (*Si).buff ); ++Si; } } tend = clock(); printf("exp algorithm duration %d, edgecount %d\n", tend - tstart, test_edgecount ); } } return nRetCode; }

Re: Any way to improve this?

Yet another dramatic performance gain.

VladimirF's proposals doubled the performance on a large sample.

At 4 million words (from a base of 40 million), the example I posted run the study at 52 seconds (give or take a few tenths).

With Vlad's changes, the run completed in 26 seconds.

Still, at the moment of output, the original data is unavailable without a copy of the data for rotation. That wouldn't do well for the OP's purpose, so a second member in S would be required (the objective for OP isn't just a count, but a collection of the values - I think of the original and the match).

This makes the entire routine much more compact and readable. The integer compare reduces the match test from a function call and string scan down to a few inline machine instructions!

Considering that that OP required over 160 hours to complete 4 million words (I think many trillions of comparisons were used), cutting it down to 26 seconds is a really good demonstration of various techniques and thought avenues.

In fact, the OP complained that a 30K word sample (or thereabouts) was taking 30 seconds.

Should be snap of the finger this way :)

Re: Any way to improve this?

Yeap, as I thought - it's just JVene and I here...
I thought of another great improvement: if we are not counting duplicates - why do we push them into the vector?
Unfortunately in my 400,000 random sample, 393,000+ were unique! (Who there complained about rand() function? ). Not much saving, but a lot of overhead...

Re: Any way to improve this?

Quote:

Originally Posted by JVene

...Still, at the moment of output, the original data is unavailable without a copy of the data for rotation. That wouldn't do well for the OP's purpose, so a second member in S would be required (the objective for OP isn't just a count, but a collection of the values - I think of the original and the match).

Well, just do one more rotation - it takes no time at all! :wave: