SortNames Program

**inbugable** · December 19th, 2006, 10:14 AM

I was asked to write a sorting program in c++

Specification:
A program to sort a list of names is required.
The file also contains the person’s age and phone number.
The main challenge is that the entire list can not fit into
physical memory at any given time.
There is no limit to the amount of physical disk space.

The Input

An ASCII TSV (Tab Separated values) file containing a list of names.
Each line is of the format:
<Last Name>, <First Name><tab><age><tab><Phone Number><CR>

Sample:

Kalou, Salomon 30 476-2769
Ad****or, Emmanuel 31 310-4951
Morais, Nuno 37 660-6290
Poom, Mart 34 469-8379
Shevchenko, Andriy 26 922-2033
Obi, Mikel 35 888-9370
Ljungberg, Fredrik 34 492-1755
Davenport, Calum 28 240-7734
O'Hara, Jamie 30 333-5168

The Output

A file in the same format sorted by Last Name.

The Program

The final program must take three arguments

Maxnames – the maximum number of names which can be loaded
into memory at any given time
Input - name of the input file
Output – name of the results output file

For example: SortNames.exe 10 input.txt output.txt

So what I manage to achieve so far is to read 10 record from file, sort them and output the result.

CodeListing:

<code>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

using namespace std;

////////////////////////////////////
//Header
////////////////////////////////////
struct PersonInfo
{
string firstName;
string secondName;
int age;
string phoneNumber;

PersonInfo(const string& fName = "",
const string& sName = "",
int personsAge = 0,
const string& number = ""):
firstName(fName), secondName(sName), age(personsAge), phoneNumber(number)
{}

friend bool operator<(const PersonInfo& person1, const PersonInfo& person2);
};

bool operator<(const PersonInfo& person1, const PersonInfo& person2)
{
int c1 = person1.firstName.compare(person2.firstName);
int c2 = person1.secondName.compare(person2.secondName);
int c3 = person1.phoneNumber.compare(person2.phoneNumber);

return (c1 < 0) ||
(c1 == 0 && c2 < 0) ||
(c1 == 0 && c2 == 0 && person1.age < person2.age) ||
(c1 == 0 && c2 == 0 && person1.age == person2.age && c3 < 0);
}

vector<PersonInfo> PersonInfoslist;

void Run(const char* infile, int* maxEntries);
//void WriteToTempFile(const vector<PersonInfo>& p);

////////////////////////////////////
//Source
////////////////////////////////////
void Run(const char* infile,int maxEntries)
{
ifstream inputFile(infile, ios::in);

if (!inputFile)
{
cerr << "Input file "<< infile << " could not be opened!"<<endl;
exit(1);
}

for (int i = 0;i < maxEntries;i++)
{
PersonInfo pInfo;
char* buffer;
inputFile.getline( buffer,2048);
string line = buffer;

int itemNum = 0;
string token;
size_t p0 = 0, p1 = string::npos;

while(p0 != string::npos)
{
p1 = line.find_first_of(",\t ", p0);
if(p1 != p0)
{
token = line.substr(p0, p1 - p0);
}
p0 = line.find_first_not_of(",\t ", p1);

switch( itemNum )
{
case 0:
pInfo.firstName = token;
itemNum++;
break;
case 1:
pInfo.secondName = token;
itemNum++;
break;
case 2:
pInfo.age = atoi(token.c_str());
itemNum++;
break;
case 3:
pInfo.phoneNumber = token;
itemNum=0;
PersonInfoslist.push_back(pInfo);
break;
default:
cout << "Error in tokenizing string!" << endl;
break;
}
}
}

//Sort the items in the vector
sort(PersonInfoslist.begin(), PersonInfoslist.end());

for(vector<PersonInfo>::const_iterator it = PersonInfoslist.begin(); it != PersonInfoslist.end(); ++it)
{
cout << it->firstName << ", " << it->secondName << ", " << it->age << ", " << it->phoneNumber << endl;
}
}

int main(int argc, char** argv)
{
if (argc != 4)
{
cout << "Usage: SortNames <Number of names to read> <Input file> <Output file>" << endl;
exit(1);
}

//Getting the arguments from command line.
int maxEntries = atoi(argv[1]);
const char* inFile = argv[2];
const string outFile = argv[3];

Run(inFile,maxEntries);

return 0;
}
</code>

So I was wondering if a generous C++ programmer out there will be kind enough to help me out on this task. Thanks in advance for any suggestion on improving the current solution

**GNiewerth** · December 19th, 2006, 10:28 AM

There´s an algorithm described on wikipedia:

External sorting

It´s not that difficult, give it a try.

Regards,
Guido

**dcjr84** · December 19th, 2006, 02:02 PM

Also, please use [ code ][ /code ] tags when posting source code.

It is almost impossible to read the code you have posted.

**inbugable** · December 21st, 2006, 05:39 AM

Where I am confuse is who I can generate the temp files. I have to read 100 records to read by 10 at the time. I was thinking about the following:

Read from file;
If person starting with letter A or a
Create a file A.txt and append the person data

When whole finish I can open the entire files created and sort them one after the other, by appending them in the output.txt file. But there is a problem with this technique, it can result to an long switch or if statement.

Any body can help ?

**GNiewerth** · December 21st, 2006, 06:35 AM

Hi,

you can´t bucket-sort your whole recordset like you stated, because you can´t assume that the records are equally distributed. Say if you have 100,000,000,000 records of person and all of them begin with an "A" you actually don´t split your temporary file and eventually run out of memory when loading it into memory. You must split the temporary files into a fixed number of records.

What you have to do is following:

1) determine a reasonable size of records that can be processed in memory, say 10,000 or 100,000, or maybe 1,000,000,000.

2) read a chunk (size calculated in 1) from your input file, sort them and write them back into a temporary file. The last chunk won´t be complete, but that doesn´t matter.
Count the number of temporary files you have created, you´ll need it later. Starting with 0 you can name your files like tmpsrt00001.dat, tmpsrt00002.dat,... Now you have a number of temporary files with their records sorted.

3) Create a queue for each temporary file.

4) Pick the smallest front record from all queues and write it into the final output file.
Remove that record from the queue.
If there are records left in the temporary file, read one record and append it to the queue.
If there are no more records in the temporary file and the queue is empty, mark the queue as fully processed.

5) Repeat step 4) until all queues are fully processed.

6) delete the temporary files.

That´s the basic algorithm, performance issues neglected. To improve performance you can read multiple records into your queue, like 1000, so you reduce slow I/O operations.

Regards,
Guido

Thread: SortNames Program

Thread Tools

Display

SortNames Program

Re: SortNames Program

Re: SortNames Program

Re: SortNames Program

Re: SortNames Program

Posting Permissions