[RESOLVED] Extracting time stamps from pdf
Hi, I am trying to automatically extract all time stamps in a pdf file. These are typically in a line like:
when="2010-07-30T15:20:30+04:00"
For this I was thinking of using CStdioFile and its ReadString function, but somehow this doesn't work. My example code is below. Is this because a pdf is not a true text file, or because the strings read can be longer than some maximum? What is my mistake, or does anyone have another quick way of reading the file and extracting the desired text between the quotes?
Code:
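CStdioFile InputFile;
if (InputFile.Open(FileName, CFile::modeRead)) // CStdioFile opens in text mode by default
{
    CString Line;
    const CString ToFind(_T("when"));
    while (InputFile.ReadString(Line))
    {
        const int Pos = Line.Find(ToFind);
        if (Pos != -1)
        {
            // Split on '"' starting at the match: field 0 is when=, field 1 is the quoted value
            CString Item;
            AfxExtractSubString(Item, Line.Mid(Pos), 1, _T('"'));
            AfxMessageBox(Item); // to be replaced with further processing
        }
    }
    InputFile.Close();
}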
Re: Extracting time stamps from pdf
Well, open your .pdf file in Notepad and search for the text "when". Is it found?
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
VictorN
Well, open your .pdf file in Notepad and search for the text "when". Is it found?
Already did that. What if there are a gazillion instances? I want to automate it. I thought the ten lines above could do it. Obviously I was wrong.
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
Simon666
Already did that. What if there are a gazillion instances? I want to automate it. I thought the ten lines above could do it. Obviously I was wrong.
Did what? Did you open your .pdf file (meaning the file that for sure contains the text you are looking for) in Notepad?
If you did, what was the search result?
Re: Extracting time stamps from pdf
I don't use CStdioFile, but in general, if you open a file in text mode and read it in a loop, you can hit EOF early. Try opening it in binary mode.
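For comparison, here is a minimal sketch of a binary-mode read in standard C++ rather than MFC. The helper name ReadAllBytes is made up for illustration. It assumes the file fits comfortably in memory, which should hold for a handful of pdf files.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the raw bytes of the file,
// or an empty string if the file cannot be opened.
std::string ReadAllBytes(const std::string& path)
{
    // std::ios::binary turns off newline translation and, on Windows,
    // the text-mode handling of byte 0x1A (Ctrl+Z) as end of file.
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return std::string();

    // Copy every byte, including NULs, into the string.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The returned std::string carries its own length, so embedded NUL bytes do not cut the data short the way they would with a C-style string.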
Re: Extracting time stamps from pdf
if (InputFile.Open(FileName, CFile::modeRead | CFile::typeBinary))
This lets me read a lot further, but it still exits way before the actual end of the file.
Re: Extracting time stamps from pdf
Some things to consider:
1) I do not think that CString::Find() will work if there are embedded NULs in the string before the time stamp.
2) One "line" could contain multiple time stamps.
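Both points can be sidestepped by searching the raw bytes instead of individual lines. The following is only a sketch in standard C++, not MFC. The helper name ExtractWhenValues is made up, and it assumes each value is written as when="..." in plain single-byte text.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: collects every value that follows when=" up to the
// next double quote. std::string::find works on the full length of the
// buffer, so embedded NUL bytes do not stop the search.
std::vector<std::string> ExtractWhenValues(const std::string& data)
{
    const std::string marker = "when=\"";
    std::vector<std::string> values;

    std::string::size_type pos = 0;
    while ((pos = data.find(marker, pos)) != std::string::npos)
    {
        const std::string::size_type start = pos + marker.size();
        const std::string::size_type end = data.find('"', start);
        if (end == std::string::npos)
            break; // unterminated value: nothing more to extract

        values.push_back(data.substr(start, end - start));
        pos = end + 1; // keep scanning, so several stamps per "line" are all found
    }
    return values;
}
```

The buffer has to be built with an explicit length (for example from a binary read), because constructing a std::string from a plain char pointer stops at the first NUL.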
Re: Extracting time stamps from pdf
I've looked at several .pdf files and none of them have "when" in them. The only time-related data I could find was /CreationDate, e.g. /CreationDate (D:20060208110100)
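If /CreationDate turns out to be the only time stamp available, its value can be split into fields. This is a sketch under assumptions: the PdfDate struct and ParsePdfDate name are made up, and it requires all 14 digits after "D:" to be present (as in the example above), although the PDF date format also allows shorter strings. Any time zone suffix is ignored.

```cpp
#include <cctype>
#include <string>

// Hypothetical container for the parsed fields.
struct PdfDate
{
    int year, month, day, hour, minute, second;
};

// Hypothetical helper: parses the leading D:YYYYMMDDHHmmSS part of a PDF
// date string. Returns false if the prefix or any of the 14 digits is missing.
bool ParsePdfDate(const std::string& text, PdfDate& out)
{
    if (text.size() < 16 || text.compare(0, 2, "D:") != 0)
        return false;

    for (std::string::size_type i = 2; i < 16; ++i)
        if (!std::isdigit(static_cast<unsigned char>(text[i])))
            return false;

    // All characters in range are digits, so std::stoi cannot throw here.
    auto number = [&text](std::string::size_type pos, std::string::size_type len) {
        return std::stoi(text.substr(pos, len));
    };

    out.year   = number(2, 4);
    out.month  = number(6, 2);
    out.day    = number(8, 2);
    out.hour   = number(10, 2);
    out.minute = number(12, 2);
    out.second = number(14, 2);
    return true;
}
```

The sketch does no range checking, so a value such as month 13 would be accepted as is.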
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
2kaud
I've looked at several .pdf files and none of them have "when" in them. The only time-related data I could find was /CreationDate, e.g. /CreationDate (D:20060208110100)
Yeah, it was the first thing I did before posting my first answer:
Quote:
Originally Posted by
VictorN
Well, open your .pdf file in Notepad and search for the text "when". Is it found?
Just because I couldn't find "when" (either as ANSI or UNICODE) in some of my .pdf files!
However, OP seems to ignore my opinion... :eek:
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
Philip Nicoletti
Some things to consider:
1) I do not think that CString::Find() will work if there are embedded NULs in the string before the time stamp.
2) One "line" could contain multiple time stamps.
1) I checked; that doesn't occur.
2) Same.
Quote:
Originally Posted by
2kaud
I've looked at several .pdf files and none of them have "when" in them. The only time-related data I could find was /CreationDate, e.g. /CreationDate (D:20060208110100)
It is metadata of pictures from Adobe Photoshop.
Quote:
Originally Posted by
VictorN
Just because I couldn't find "when" (either as ANSI or UNICODE) in some of my .pdf files!
However, OP seems to ignore my opinion... :eek:
Victor, I did not ignore you. I addressed that issue specifically. There are about 80 entries per pdf and 10 pdfs. I didn't want to do all of the roughly 800 entries manually when roughly 10 lines of code could do it, but by now the time spent will be about equal.
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
Simon666
Victor, I did not ignore you. I addressed that issue specifically. There are about 80 entries per pdf and 10 pdfs. I didn't want to do all of the roughly 800 entries manually when roughly 10 lines of code could do it, but by now the time spent will be about equal.
So could you find in notepad at least one of these "80 entries per pdf" or not?
Re: Extracting time stamps from pdf
Re: Extracting time stamps from pdf
What "why"?
Are they found as ANSI or UNICODE?
Is your build ANSI or UNICODE?
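One way to answer the ANSI-or-UNICODE question programmatically is to search the raw file bytes for both encodings of the text. This is a sketch in standard C++ with made-up helper names (ToUtf16Le, FindEncodings). It assumes the search text is plain ASCII and that "UNICODE" in the file means UTF-16 little-endian.

```cpp
#include <string>

// Hypothetical helper: encodes plain ASCII text as UTF-16LE bytes,
// i.e. each character followed by a zero byte.
std::string ToUtf16Le(const std::string& ascii)
{
    std::string bytes;
    bytes.reserve(ascii.size() * 2);
    for (char c : ascii)
    {
        bytes.push_back(c);
        bytes.push_back('\0');
    }
    return bytes;
}

// Hypothetical result type: which encodings of the text were found.
struct EncodingHits
{
    bool ansi;
    bool utf16le;
};

// Hypothetical helper: looks for `text` in `data` both as single-byte
// characters and as UTF-16LE.
EncodingHits FindEncodings(const std::string& data, const std::string& text)
{
    EncodingHits hits;
    hits.ansi    = data.find(text) != std::string::npos;
    hits.utf16le = data.find(ToUtf16Le(text)) != std::string::npos;
    return hits;
}
```

If only the utf16le flag comes back true, a single-byte search for "when" on that file would miss every entry.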
Re: Extracting time stamps from pdf
Anyway, I got an idea: I might use GetPosition to check whether I am anywhere near the end of the file and, if not, just call ReadString again in a while loop. I'll try that first.
Re: Extracting time stamps from pdf
Quote:
Originally Posted by
Simon666
Anyway, I got an idea: I might use GetPosition to check whether I am anywhere near the end of the file and, if not, just call ReadString again in a while loop. I'll try that first.
Or you could respond to Victor's questions.