Any way to improve this?

Code:

int main() { Graph g; vector<string> wordList; ifstream file("words"); if( !file ) { cerr << "Cannot open dictionary.list " << endl; return 1; } else { string word; if (file.is_open()) { while (getline (file,word)) { if (word.size() == 5) { wordList.push_back(word); } } file.close(); for (vector<string>::size_type i = 0; i < wordList.size(); ++i) { g.addVertex(wordList[i]); } for (vector<string>::size_type i = 0; i < wordList.size(); ++i) { const string & word1 = wordList[i]; for (vector<string>::size_type j = i+1; j < wordList.size(); ++j) { int count = 0; const string & word2 = wordList[j]; for (int k = 0; k < 5; ++k) { if (word1[k] != word2[k]) { if( ++count > 1) break; } } if (count == 1) { g.addEdge(word1, word2, 1); g.addEdge(word2, word1, 1); } } } string word1; string word2; string again; do { cout << "Please enter a five letter word:"; cin >> word1; cout << endl << "Please enter another five letter word:"; cin >> word2; cout << endl; g.unweighted(word1); g.printPath(word2); cout << "Would you like to run this again? (Y or N)"; cin >> again; }while(again != "N"); } else { cout << "Unable to open file"; return 0; } } }

This code works just fine. It has a list of all 5 letter words, and two words are connected if they can be changed to one or another by a one letter change. The only problem with it is that it takes about 30 seconds to run until it prompts the user for input, which seems like a very long time. The file "words" has a list of almost 400,000 words in it, so maybe there is no way to get around this slowness because of the size of the file and number of compares that have to be performed. Anyone have any ideas on how to improve the performance of this? or is there no way around it?

Re: Any way to improve this?

Try using a list instead of a vector, I'm not sure if this is true or not but every time you call push_back with the vector it has to allocate a new array and copy all the elements into it again. Since you don't really need the random access of the vector, it will function the same way if you use a list and the list doesn't take a lot of time when it grows.

Re: Any way to improve this?

Hi.

Quote:

Originally Posted by IllegalCharacter

every time you call push_back with the vector it has to allocate a new array and copy all the elements into it again

Not 'every time'... but, usually (some implementations) when you reach the end of a vector it's resized to twice the original size. That's why iterators to vector can become invalid. Note, however, that a list will actually allocate memory for 'every' push back.

Anyway, if the file is always around this big, a first (and quick) attempt to improve performance is to initialize the vector with 400.000 elements. Try this and let us know.

Re: Any way to improve this?

IllegalCharacter has a good point here.

Vectors of this size don't always work out so well, and as you push the limits of RAM you run into a RAM fragmentation problem.

That is, unless you have a 64bit OS and gobs of RAM.

Certainly we know that the string will allocate it's own space, and that will be a small amount, but the vector is 'an array of string objects' that require contiguous RAM. As you fill the vector, the problem IllegalCharacter identified causes RAM to be fragmented, among all those small string objects.

Eventually you reach a 'straw that breaks the camel's back' situation, where you have available RAM, with VM to back it up, but you get an out of memory error because a contiguous block of RAM can no longer be found to expand the vector, for it's array of string objects.

The list all but eliminates this problem, so long as you don't have OTHER large sized objects requiring contiguous RAM for allocation.

The copying of the vector is also a speed issue that is alleviated with 'size' for the projected volume - and, too, if it succeeds, it succeeds. That helps avoid the fragmentation because, since it no longer copies or moves the vector - things stay fairly organized in RAM (give or take).

If you're working on example code, this is all fine - you get it working, move on to the next assignment.

On the other hand, if you're preparing to write 'production code' - this is one of those issues that can require understanding. It's also a reason C# is becoming to popular - you don't have to think of this stuff so much.

This whole problem was much worse 'back in the day' - when 32Mbytes of RAM was $2,000. These days, though - when the problem sizes increase, it works out to be the same thing all over again.

Re: Any way to improve this?

Code:

list<string>::iterator i; list<string>::iterator j; for (i = wordList.begin(); i != wordList.end(); ++i) { const string & word1 = *i; for (j = ++i; j != wordList.end(); ++j) { int count = 0; const string & word2 = *j; for (int k = 0; k < 5; ++k) { if (word1[k] != word2[k]) { if( ++count > 1) break; } } if (count == 1) { g.addEdge(word1, word2, 1); g.addEdge(word2, word1, 1); } } }

Ok, so I tried implementing it as a list and I am trying to use iterators to search through the list, but I am having trouble trying to figure out how to use them so they still compare all pairs of words.

Re: Any way to improve this?

Your new code should work, except for when you assign j to ++i. This will correctly assign j, however it will also increment your i. Then when the end of the first for loop is reached, your i will be incremented as well, so you'll be jumping over every second word. I assume this is not what you want, since your first program did not do this.

Quote:

Originally Posted by ltcmelo

Anyway, if the file is always around this big, a first (and quick) attempt to improve performance is to initialize the vector with 400.000 elements. Try this and let us know.

This should work just fine, also without much worries about memory fragmentation as you are not allocating several small chunks. If there is enough memory free (not necessarily contiguous) then I believe the OS will defragment the memory to make a block big enough for your vector and all will be nice. Lists will also get the job done, they DO allocate memory each time but its not much and there is no copying of data each time which is the real slow-down of a vector.

Re: Any way to improve this?

How can I change the assignment to j so that it works correctly and is always one place ahead of i? I'm having trouble with this.

Re: Any way to improve this?

Quote:

If there is enough memory free (not necessarily contiguous) then I believe the OS will defragment the memory to make a block big enough for your vector

This is true in managed C++, if using managed types - but this isn't the managed C++ forum, so I assume it's not a managed target.

In native (unmanaged, as in the way it's been for decades) - there is no de-fragmentation of RAM by the standard CRT.

It can 'psuedo' de-fragment, if you release all memory :)

That's why I pointed this out in the earlier post in this thread.

Re: Any way to improve this?

Quote:

Originally Posted by Ehump20

How can I change the assignment to j so that it works correctly and is always one place ahead of i? I'm having trouble with this.

When you initialize j, you set it equal to i, and then before you do anything to it you put j++, like this:

Code:

for(j = i; j < wordList.end(); j++){ j++; //... etc...

Quote:

Originally Posted by JVene

there is no de-fragmentation of RAM by the standard CRT.

I've never actually used managed C++, so I don't know how it handles memory allocation. Doesn't Windows handle memory defragmentation? It's silly that it doesn't (if it doesn't), however I suppose Windows isn't exactly the most efficient software out there...

Re: Any way to improve this?

It's not so much a 'Windows' problem, but an issue of how memory is represented.

Say you have a few thousand objects allocated from the heap, all held by pointers.

Ask yourself, how could the operating system move the memory around without invalidating the pointers?

It can't.

In managed code, it's expected that physical locations can change because the allocated memory isn't held by a pointer, but by a handle. The execution of the program is stopped while .NET's runtime moves memory around. The definition of the language (even the extensions to C++) allow this to happen, such that when the defrag is over, execution can continue without having invalidated handles identifying allocated material.

To say it's silly that Windows doesn't do this for unmanaged code is, well....a bit naive, if you'll pardon my saying that (I don't mean to be rude about that). This is true for all C and C++ (Pascal and a host of other languages, too) - whether in Linux, Mac, DOS, Windows - anywhere.

The entire concept of garbage collection (which can be read as de-fragmentation of the heap) is a computer science classic. There may be C or C++ 'like' languages that support it, there may even be a garbage collected manager for C and C++ you can use (I'm not aware of any).

Garbage collection imposes a performance penalty. For certain targets it can be a reasonable performance penalty, favoring the management of the heap over speed. There are entire categories of application targets for which the performance penalty is inappropriate, and those targets are exactly where C++ is most suitable. For that, C and C++ developers must take the issue into account. It's obviously viable to do so, because some of the world's most ambitious, reliable and high performance applications are written in C++ (3DS Max and Photoshop, for example).

This issue is exactly why some developers prefer C# over C++ (and, conversely, why C++ programmers prefer C++ - when they know how to handle the issue).

Re: Any way to improve this?

The reason I believe that defragmentation isn't done in native C++ is because it's too low level.

Managed C++ is another layer of abstraction on top of C++, where it includes a higher level memory allocation and garbage collection scheme.

At the native C++ level, all memory is flat. On a 32-bit PC with up to 4GB of ram, all that memory is addressable with a 32-bit pointer. When calling malloc/new you are returned a pointer for that 4GB span of addressable space. It's static. If you move the memory block, the pointer becomes invalid.

All Intel-based processors from 386's and up provided the 'protected mode' that officially made the 640k barrier obsolete. (remember that?) When in protected mode, a 32-bit pointer addresses up to 4GB of RAM, instead of the segment:index pair that could address only 1MB. However, the 32-bit pointers are still relative to a base offset that's defined in the descriptor table. Code references a descriptor by referencing the index (aka selector) in the descriptor table of the element and places it in the appropriate 'ds' or 'es' selector register (no longer a segment register).

It would be feasible then to create descriptors for separate memory blocks, where you could defragment silently by moving the memory and updating the descriptor. The selector and pointer will remain valid.

Great idea, but it is not always feasible for code to create a descriptor for every block of memory ever allocated. The local descriptor table can only hold 8191 elements. Therefore, we must utilize static pointers on a common selector. Usually, this is a local descriptor table associated with the running process. Therefore, you have only access to memory in your own process and cannot corrupt memory outside, or you get a 'segmentation fault' or 'page fault exception' depending on your platform.

Sorry for the ramble. :)

Re: Any way to improve this?

I love how this topic went from helping somebody speed something up to a discussion of memory fragmentation, etc. You guys should teach a course ;)

Quote:

Originally Posted by JVene

Ask yourself, how could the operating system move the memory around without invalidating the pointers?

It can't.

Actually it can, since the memory addresses used in a program are in logical memory space, whereas the OS has access to supervisor mode and can manipulate memory pages in machine space. I don't think this has an effect on what we are talking about, but anyway.

I myself am not a huge fan of managed languages, as the garbage collection system can be a bit clunky and you can't do pointer tricks! I love the beauty of a good smart pointer.

Re: Any way to improve this?

Quote:

Originally Posted by JVene

Vectors of this size don't always work out so well, and as you push the limits of RAM you run into a RAM fragmentation problem.

That is, unless you have a 64bit OS and gobs of RAM.

What "that size"? What "gobs of RAM"?
400,000 5-characters words take 2MB, 2.5MB if you count zero-terminators.
This amount of memory is not a problem for years now.
If all words are 5 characters long, there is a very little benefit of keeping them in strings. You would use "string" class when you do some string manipulations, specifically if you need to change its size.
A simple 6-char buffer will do just fine.
You can get the size of the file, divide by 6 to get maximum number of 5-characters words and reserve() that space in your vector. Then, you don't even need a vector. With a fixed word length, a char array will do too.
Also, you need to do some sort of profiling before talking about performance. Do you know WHAT takes 30 seconds? Put a few call to GetTickCount() between your blocks of code to narrow down the problem.
Looks like your triple-nested loop is a suspect. Also, how much time does it take to add things to your Graph? Comment all (well, all three) calls to it and see if it saves any time (or most of it).
BTW, using list for a large number of small objects is NOT a good idea. The overhead will kill you. The benefits of the list over vector are cheap insertions / additions / deletes. The code here doesn't insert or delete, and the additions are taken care of by a call to reserve(). The drawback of the list is more expensive access, which we have plenty of in this example.

Re: Any way to improve this?

How do I create a 6 char buffer for this?

Re: Any way to improve this?

Some things:

1) What is the number of 5 letter words ? If there are only 400000
total words in the file, I suspect the number of 5 letter words is
a lot less.

2) Even if all the words are 5 letters, it should not take long to read
them and push them onto a vector. Have you timed this part only ?
I doubt that this will take much over a second. I doubt changing to a list
or a char[6] buffer will decrease your total time by much.

3) The main problem is the algorithm is O(N*N). Unfortunately, off the
top of my head, I can not think of any way of reducing this do
O(N*LOGN) or better.

Re: Any way to improve this?

1. The total number of 5 letter words is 25,266.

2. It takes only 1 second to load the vector.

Re: Any way to improve this?

I did a quick little test, and I might have been wrong about the using
char[6] instead of string not being much faster.

Instead of vector<string> I used vector<S>, with S defined as:

Code:

struct S { char buff[6]; };

And it cut the processing time by over 50%.

Re: Any way to improve this?

Code:

struct S { char buff[6]; }; int main() { Graph g; vector<S> wordList; ifstream file("words"); if( !file ) { cerr << "Cannot open the dictionary list " << endl; return 1; } else { string word; if (file.is_open()) { while (getline (file,word)) { if (word.size() == 5) { wordList.push_back(word); } } file.close();

Ok, this is what I have right now but I'm not sure how to read in and store the words in the buffer.

Re: Any way to improve this?

Here is was testing code (I created a file with 400000 words in it,
with 36000 being five letter words ... all random.

Code:

#include <vector> #include <ctime> #include <iostream> #include <fstream> using namespace std; struct S { char buff[6]; }; int main() { vector<S> wordList; clock_t start , finish; start = clock(); ifstream in("p.txt"); S s; int nFive = 0; char buff[80]; // must be large enough for each line in the file while (in.getline(buff,80)) { if (strlen(buff) == 5) { strcpy(s.buff,buff); wordList.push_back(s); ++nFive; } } finish = clock(); cout << "cycles to read/push data = " << finish - start << "\n"; cout << "number of 5 letter words = " << nFive << "\n"; start = clock(); for (size_t i = 0; i < wordList.size(); ++i) { for (size_t j = i+1; j < wordList.size(); ++j) { int count = 0; for (int k = 0; k < 5; ++k) { if ( wordList[i].buff[k] != wordList[j].buff[k]) { if( ++count > 1) break; } } if (count == 1) { // } } } finish = clock(); cout << "cycles to process data = " << finish - start << "\n"; return 0; }

Re: Any way to improve this?

Quote:

Originally Posted by Philip Nicoletti

I might have been wrong about the using
char[6] instead of string not being much faster...

So that makes me what? Right? ;)

Re: Any way to improve this?

I'm not sure if the purpose of 'improve' is clarity or speed, but I thought I'd try this version....

It assumes VS2K5.NET, a console application supporting MFC (so the Windows functions for file mapping are included).

I only bothered with the file read and stuffing the vector - the processing I didn't bother working on....yet, I have an idea and I'll come back with that in a bit....

This version ran on a 2Ghz AMD 32 bit cpu, on an example file of 400,000 words of random sizes, in which 39,000 or so were 5 letter words, and loaded the 5 letter words into a vector of struct S as depicted by Philip Nicoletti.

The example took 16 clock ticks ( 16ms ), which I think is under the resolution of the timer.

So I tried a file of 1.6 million words that had 159,000 + 5 letter words in it.

It did that in 62ms.

It uses a read only memory mapped file to pull the data in....

Eventually I got around to trying a sample of 40 million words (a 300Mbyte source file). The standard streams approach took 25 seconds when the file was already in cache, whereas the memory mapped file approach took only 3.188 seconds on the same volume.

Memory mapped files are widely recognized as the fasted mechanism for Windows to read data from a file.

Code:

#include "stdafx.h" // supplied by VSNET for a console build with MFC support // The one and only application object CWinApp theApp; #include <vector> using namespace std; struct S { char buff[ 6 ]; }; int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { // ignore this opening nonsense, it's about MFC initialization int nRetCode = 0; // initialize MFC and print and error on failure if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) { // TODO: change error code to suit your needs _tprintf(_T("Fatal Error: MFC initialization failed\n")); nRetCode = 1; } else { // the stuff above this (from _tmain) is supplied by VS Studio - ignore it for now // ============================================== // Here's where the code actually begins..... // ============================================== // I start the clock, and end it, in 'processing' steps compatible with Philip Nicoletti's example clock_t tstart = clock(); vector<S> wordlist; HANDLE memfile = CreateFile( _T("p.txt"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL ); unsigned long fsize = GetFileSize( memfile, NULL ); if ( memfile != INVALID_HANDLE_VALUE ) { HANDLE hMap = CreateFileMapping( memfile, NULL, PAGE_READONLY, 0, fsize, NULL ); void *fileview = MapViewOfFile( hMap, FILE_MAP_READ, 0, 0, fsize ); char *FileBase = (char *) fileview; char *Limit = FileBase + fsize; char *ptr = FileBase; int word5count = 0, wordcount=0; S tmpS; while( ptr < Limit ) { char *wstart = ptr; while( *ptr >= 65 && ptr < Limit ) ++ptr; if ( ptr < Limit ) { if ( (ptr - wstart) == 5 ) { ++word5count; strncpy( tmpS.buff, wstart, 5 ); tmpS.buff[5] = 0; wordlist.push_back( tmpS ); } ++wordcount; } while( *ptr < 65 && ptr < Limit ) ++ptr; } clock_t tend = clock(); printf("Word Count %d, 5 Count %d\n", wordcount, word5count ); printf("Duration %d\n", tend - tstart ); UnmapViewOfFile( fileview ); CloseHandle( hMap ); CloseHandle( memfile ); } } return nRetCode; }

It's a tad hasty, and very C like. Just a look into another 'way' that focuses on speed of file reading.

If I were going to use something like this I'd make a file mapping object and use that to return a pointer to the base of the file.

...and as to earlier posts I didn't reply to....(and sorry for the tangent on fragmentation )

VladimirF:

Somehow I was thinking 400 million words, not 400 thousand words - probably the gin & tonic :)

IllegalCharacter:

Regarding garbage collection.....when you say it can move things around, we're talking about different things.....

It's what spoulson was talking about (requiring descriptors).

The operating system uses that to perform VM.

However, that doesn't affect fragmentation of RAM in C++, because the fragmentation applies to the numeric value of the pointers, not their physical location in RAM. Think about it carefully and you'll realize we're talking about two different subjects there. Think about what it would take if a 'standard' C++ or C program had, say, a couple of pointers on a buffer, and garbage collection moved the allocated memory to another location - how would it 'know' to change the value each of these pointers identified in kind?

It wouldn't.

This means the heap C++ is managing is fragmented because of the values of the pointers it's dispensed from CRT - the couldn't be relocated. Even if you moved the physical RAM around (which does happen when VM activates for virtually any reason), the locations identified by these pointers are not moved - for a C++ program, and neither is the heap within the CRT.

Garbage collection works in C# and VB because the handles (which replace pointers, one might think 'references' in Java) are indirect. The OS can run through all handles and 'adjust' the locations in RAM they're identifying, without altering the handle as 'known' by the application level code.

If, in C++, you used smart pointers, and never standard pointers, the indirection posed by having the pointer in 'two parts' would permit a garbage collection system to function.

Re: Any way to improve this?

OK, a few hours later...I have something interesting.

The idea here is to make the S struct hold a few copies of the word, rotated through the 5 permutations.

The vector is sorted by each of these permutations, and the list is compared (searched) in sorted order for matches by 4 characters.

The result SHOULD be all pairs that differ only in 1 character....maybe I misunderstand the requirement.

I have kept, after my version, the original processing version from Philip Nicoletti for comparison.

This reduced the speed of the research algorithm from about 60 seconds on a volume 39,000 to under 0.3 seconds.

That's right, 300 milliseconds on a 2Ghz AMD 32 bit CPU (actually, the research portion was 265 ms, the file loading was another 32 ms ).

In the original algorithm, the use of " if ( count==1 ) " means the algorithm rejects duplicates. I think duplicates actually fit the spec of a match, but perhaps not. Anyway, comment in my example section shows how to match the Philip Nicolleti's collection set either way.

Code:

#include "stdafx.h" // This is a console app with MFC support for memory mapped files CWinApp theApp; #include <vector> #include <algorithm> using namespace std; // S taken from Philip Nicoletti's version, adding features struct S { char buff[ 6 ]; char compbuf[ 4 ][ 6 ]; // used to store rotations of buff // naive but functional generation of rotations // ABCDE - buff // BCDEA - 0 of compbuf // CDEAB - 1 // DEABC - 2 // EABCD - 3 inline void prep() { strcpy( compbuf[ 0 ], buff + 1 ); strncat( compbuf[ 0 ], buff, 1 ); strcpy( compbuf[ 1 ], buff + 2 ); strncat( compbuf[ 1 ], buff, 2 ); strcpy( compbuf[ 2 ], buff + 3 ); strncat( compbuf[ 2 ], buff, 3 ); strcpy( compbuf[ 3 ], buff + 4 ); strncat( compbuf[ 3 ], buff, 4 ); } }; // series of compare functions for STL sort, one for each 'column' of the rotations bool Scomp0( S & s1, S & s2 ) { if ( strncmp( s1.buff, s2.buff, 5 ) > 0 ) return true; return false; } bool Scomp1( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[0], s2.compbuf[0], 5 ) > 0 ) return true; return false; } bool Scomp2( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[1], s2.compbuf[1], 5 ) > 0 ) return true; return false; } bool Scomp3( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[2], s2.compbuf[2], 5 ) > 0 ) return true; return false; } bool Scomp4( S & s1, S & s2 ) { if ( strncmp( s1.compbuf[3], s2.compbuf[3], 5 ) > 0 ) return true; return false; } int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { int nRetCode = 0; // This target is a console app with MFC support (for mem mapped files only) // initialize MFC and print and error on failure if (!AfxWinInit(::GetModuleHandle(NULL), NULL, ::GetCommandLine(), 0)) { // TODO: change error code to suit your needs _tprintf(_T("Fatal Error: MFC initialization failed\n")); nRetCode = 1; } else { // ================================= // This is the top of "main" - the junk just above is Microsoft doing IT's stuff // ================================= clock_t tstart = clock(); vector<S> wordlist; // 'out of the book' read only memory mapped file HANDLE memfile = CreateFile( _T("p.txt"), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL ); unsigned long fsize = GetFileSize( memfile, NULL ); if ( memfile != INVALID_HANDLE_VALUE ) { HANDLE hMap = CreateFileMapping( memfile, NULL, PAGE_READONLY, 0, fsize, NULL ); void *fileview = MapViewOfFile( hMap, FILE_MAP_READ, 0, 0, fsize ); char *FileBase = (char *) fileview; char *Limit = FileBase + fsize; char *ptr = FileBase; int word5count = 0, wordcount=0; S tmpS; // at this point the file looks like a block of RAM while( ptr < Limit ) { char *wstart = ptr; while( *ptr >= 65 && ptr < Limit ) ++ptr; if ( ptr < Limit ) { if ( (ptr - wstart) == 5 ) { // have a word 5 characters long ++word5count; strncpy( tmpS.buff, wstart, 5 ); tmpS.buff[5] = 0; tmpS.prep(); // generate rotations wordlist.push_back( tmpS ); } ++wordcount; } while( *ptr < 65 && ptr < Limit ) ++ptr; } clock_t tend = clock(); printf("Word Count %d, 5 Count %d\n", wordcount, word5count ); printf("Duration %d\n", tend - tstart ); UnmapViewOfFile( fileview ); CloseHandle( hMap ); CloseHandle( memfile ); tstart = clock(); sort( wordlist.begin(), wordlist.end(), Scomp0 ); int test_edgecount = 0; vector<S>::iterator Si = wordlist.begin(); vector<S>::iterator Slimit = wordlist.end(); vector<S>::iterator St = wordlist.begin(); S *prev = NULL; // loop through the default word (non rotated) // the array is now sorted by the non-rotated word while( Si != Slimit ) { if ( prev ) { St = Si; while ( St != Slimit && strncmp( prev->buff, (*St).buff, 4 ) ==0 ) { // note, this test of 5 characters rejects an exact match, which // gives the same results as the Philip Nicoletti algorithm // however, if you want to collect duplicates (exact matches) // take this match anyway (skip the test, take the 'edge' if ( strncmp( prev->buff, (*St).buff, 5 ) !=0 ) { // we have a target ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } // now, loop through each 'rotation' of the word. Sort the array // by that rotation, and search the list again for( int comp_pass=0; comp_pass < 4; ++comp_pass ) { switch( comp_pass ) { case 0: sort( wordlist.begin(), wordlist.end(), Scomp1 ); break; case 1: sort( wordlist.begin(), wordlist.end(), Scomp2 ); break; case 2: sort( wordlist.begin(), wordlist.end(), Scomp3 ); break; case 3: sort( wordlist.begin(), wordlist.end(), Scomp4 ); break; } Si = wordlist.begin(); Slimit = wordlist.end(); prev = NULL; while( Si != Slimit ) { if ( prev ) { St = Si; while ( St != Slimit && strncmp( prev->compbuf[comp_pass], (*St).compbuf[comp_pass], 4 ) ==0 ) { if ( strncmp( prev->compbuf[comp_pass], (*St).compbuf[comp_pass], 5 ) !=0 ) { ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } } tend = clock(); printf("exp algorithm duration %d, edgecount %d\n", tend - tstart, test_edgecount ); // from here down is the original method from Philip Nicoletti's code // comment out if you don't want two tests timed against each other tstart = clock(); int edgecount = 0; for (size_t i = 0; i < wordlist.size(); ++i) { for (size_t j = i+1; j < wordlist.size(); ++j) { int count = 0; for (int k = 0; k < 5; ++k) { if ( wordlist[i].buff[k] != wordlist[j].buff[k]) { if( ++count > 1) break; } } // note, this 'rejects' exact matches - use count <= 1 if you want // to collect exact matches if (count == 1) { ++edgecount; } } } tend = clock(); printf("Duration %d, count %d\n", tend - tstart, edgecount ); } } return nRetCode; }

In case it's not clear, here's my thinking.....

Consider non-word letter combinations like these:

BKFPR - BAFPR

In this case the only mismatch is the second character

BKFPR - BKCPR

Here the mismatch is the 3rd

Now, look at this pair again:

BKFPR - BAFPR
FPRBK - FPRBA < - the same pair, rotated

In the structure S, the second row is a rotation of the main word.
It would be found in row 2 of the compbuf array

Notice that if the entire list is sorted, these two would match up, missing
only the 4th character of the rotated version.

That's the basic idea. Using that feature, I sort the list on each of
these rotated permutations.

It appears you can gain over a 50 fold improvement in speed.

The old books are right about optimization.

More than a shift from one storage method or another, a shift of algorithm is most often the backbone of performance.

Sometimes things actually look simpler when they're faster (like Qsort).

Sometimes they look downright ugly and involved.

Here, the trick is to do just about anything to avoid that loop inside a loop inside a loop (the most interior loop was on each character).

For 39,000 words of 5 characters that's turning out to be billions of iterations - geometric expansion. Yet, the algorithm looks compact, lightweight, even good.

The thing I ended up splattering on the forum is a 5 step memory hog, and there's some naive methods used (I'm not doing this for work you know :) ), but it does avoid the geometric expansion problem.

I tested on a database of 1.6 million words, with a '5 word' count of 159,940. It took 125 ms to load the data, 1.296 seconds for my version to research it, while the original method took ... well, I don't know, I'm still waiting...

I'll post back when it finishes....

Ah, there it is...... 976.891 seconds.

Interesting curve....

Volume 39000, 0.3 sec vs 60.something sec.

Volume 159000, 1.296 sec vs 976.891 sec.

Volume 3999802, 52.297 sec vs ...I don't know, days (fair calculations suggest 7 days, or about 165 to 170 hours)

That last one was from a file of 40,000,000 words, out of which not quite 4 million were 5 letter (making the probability of matches increase). It it took 3.469 secs to load that file.

With just a tad of engineering, this could easily be split into threads for dual/quad core advantage (so could the previous algorithm).

If I were on some commercial project, and this was an experiment to finding a solution, I think I would consider this a significant candidate, but I'd take the example and re-factor nearly everything with better OOP techniques, to generalize the algorithm a little more (maybe consider 6, 7, n letter words), use an object for the file mapping, support up to quad core threading (higher thread/core support would be a little nastier), and rework that naive rotation in struct S (in fact, to support threading, struct S would be split into pieces I think).

Re: Any way to improve this?

Bump

I hope the moderators will forgive a triple post (yish, I'm bad).

I wanted Ehump20 to find this, as it had been pushed down to page 2. The post above represents a dramatic speed improvement over the given method (0.3 sec vs 60 sec on my machine), and that's what Ehump20 had asked for...any way to improve this...

Re: Any way to improve this?

Quote:

Originally Posted by JVene

...any way to improve this...

1. Remove your compbuf array; you never use different rotations at the same time - so you could rotate in-place. Will make your code simpler and save some memory.
2. Pad your struct to 8 byte
3. To compare first four characters, instead of strncmp() use integer compare.

4. MAKE SURE YOUR CODE ISN'T OPTIMIZED OUT!
<correction> - sorry, I see that you are printing the edge count, so ignore my remark #4!

Re: Any way to improve this?

Hey VladimirF

Thanks for looking!

My reactions to your ideas.....

Quote:

Remove your compbuf array; you never use different rotations at the same time - so you could rotate in-place. Will make your code simpler and save some memory.

I don't think so. One of the points I make at the end is that this may lead to dividing the material into threads, which would need all 4 copies for each thread to operate separately - meaning all 4 would, in that case, be used simultaneously.

Rotating in place would destroy the original (though it could be returned in one extra phase), and I see in the OP's code, the 'count' of edges is a place where the OP's code posts the output to some graph routine.

Quote:

Pad your struct to 8 byte

This would definately help the CPU access. Something for OP to keep in mind.

Quote:

To compare first four characters, instead of strncmp() use integer compare.

Very good point - probably gain quite a bit since there will be around 300,000 of these or so, depending on the data.

My version was a 'concept' post, which I mentioned was a candidate for complete re-work for a production code (if that were the target), but since the performance gain was over 100 time the speed of the OP version, and in volumes of 4 million target words (out of 40 million in the source base), the improvement is over 10,000 times faster, I thought I'd gone far enough.

...but, like they say, there's usually something else you can squeeze out.

Your suggestions might drop my 0.3 sec (vs 60 seconds) down to 0.2 :)

...when I see a triple nested loop, I just can't help myself anymore :)

Re: Any way to improve this?

Quote:

Originally Posted by JVene

Your suggestions might drop my 0.3 sec (vs 60 seconds) down to 0.2 :)

Well, in my 400,000 test (all 5 letter words, we don't need to measure file reading / parsing anymore) I got ~50% saving: from 3.235 sec to 1.542 sec
The potential saving from parallel procesing is unknown. Anyway, you should probably keep five rotations in the separate vectors to avoid concurent access to the same memory pages from different threads.
But the real saving is from keeping data close together, to reduce cache misses.
Anyway, your idea is great!
Here is my code for those who is interested (it might be just two of us :) )

Code:

// General sorting idea and algo by JVene // minor modifications by Vlad struct S { char buff[ 8 ]; }; // naive but functional generation of rotations // ABCDE - buff // BCDEA - 0 of compbuf // CDEAB - 1 // DEABC - 2 // EABCD - 3 inline void rotate( char b[ 8 ] ) { char ch = b[0]; for(int i = 0; i < 4; ++i) b[i] = b[i+1]; b[4] = ch; } // single compare function for STL sort bool Scomp0( S & s1, S & s2 ) { if ( strncmp( s1.buff, s2.buff, 5 ) > 0 ) return true; return false; } int _tmain(int argc, TCHAR* argv[], TCHAR* envp[]) { int nRetCode = 0; // ... the file reading part is yours... -rotation... tstart = clock(); int test_edgecount = 0; vector<S>::iterator Si; vector<S>::iterator Slimit; vector<S>::iterator St; for( int comp_pass=0; comp_pass < 5; ++comp_pass ) { sort( wordlist.begin(), wordlist.end(), Scomp0 ); Si = wordlist.begin(); Slimit = wordlist.end(); St = wordlist.begin(); S *prev = NULL; // loop through the default word (non rotated) // the array is now sorted by the non-rotated word while( Si != Slimit ) { if ( prev ) { St = Si; //while ( St != Slimit && strncmp( prev->buff, (*St).buff, 4 ) ==0 ) while ( St != Slimit && ( *(int*)prev->buff == *(int*)(*St).buff ) ) { // note, this test of 5 characters rejects an exact match, which // gives the same results as the Philip Nicoletti algorithm // however, if you want to collect duplicates (exact matches) // take this match anyway (skip the test, take the 'edge' //if ( strncmp( prev->buff, (*St).buff, 5 ) !=0 ) if ( prev->buff[4] != (*St).buff[4] ) { // we have a target ++test_edgecount; } ++St; } } prev = & (*Si); ++Si; } // rotate it one char at a time; // the sort is at the top of this "while" loop Si = wordlist.begin(); Slimit = wordlist.end(); while( Si != Slimit ) { rotate( (*Si).buff ); ++Si; } } tend = clock(); printf("exp algorithm duration %d, edgecount %d\n", tend - tstart, test_edgecount ); } } return nRetCode; }

Re: Any way to improve this?

Yet another dramatic performance gain.

VladimirF's proposals doubled the performance on a large sample.

At 4 million words (from a base of 40 million), the example I posted run the study at 52 seconds (give or take a few tenths).

With Vlad's changes, the run completed in 26 seconds.

Still, at the moment of output, the original data is unavailable without a copy of the data for rotation. That wouldn't do well for the OP's purpose, so a second member in S would be required (the objective for OP isn't just a count, but a collection of the values - I think of the original and the match).

This makes the entire routine much more compact and readable. The integer compare reduces the match test from a function call and string scan down to a few inline machine instructions!

Considering that that OP required over 160 hours to complete 4 million words (I think many trillions of comparisons were used), cutting it down to 26 seconds is a really good demonstration of various techniques and thought avenues.

In fact, the OP complained that a 30K word sample (or thereabouts) was taking 30 seconds.

Should be snap of the finger this way :)

Re: Any way to improve this?

Yeap, as I thought - it's just JVene and I here...
I thought of another great improvement: if we are not counting duplicates - why do we push them into the vector?
Unfortunately in my 400,000 random sample, 393,000+ were unique! (Who there complained about rand() function? ). Not much saving, but a lot of overhead...

Re: Any way to improve this?

Quote:

Originally Posted by JVene

...Still, at the moment of output, the original data is unavailable without a copy of the data for rotation. That wouldn't do well for the OP's purpose, so a second member in S would be required (the objective for OP isn't just a count, but a collection of the values - I think of the original and the match).

Well, just do one more rotation - it takes no time at all! :wave: