  1. #1
    Join Date
    Dec 2012
    Posts
    18

    Extracting Binary Data?

    I don't suppose anyone knows anything about reading stream data from PDF files?

    I'm trying to read stream data from a stream in a pdf file compressed in LZW. My first step is to read this compressed data into a buffer. The second step is to decompress it. I know how to decompress it, it is the first step I am having trouble with.

    How should I read this data? If I read each character into a char the numerical output will not exceed 256. LZW compressed output should exceed 256.

    I read the pdf in binary mode and have tried the read function like below.

    From what I understand (don't quote me), I should be reading in 12 bits at a time. Is that correct, and if so, should I be using a bitfield? If I do need a bitfield, how do I read the data from the stream into a 12-bit bitfield without restricting the binary value of the characters being read from the compressed stream?

    Please remember: I am not decompressing yet. That is the next step.

    Code:
    ifstream inputfilestream;
    inputfilestream.open("myfile.pdf", ios::in | ios::binary);  // ios::in (not ios::out) for an input stream
    char mychar;  // unsigned char gives an error because read() expects a char*; it can be converted to unsigned char later.
    
    while (inputfilestream.read(&mychar, 1))  // read() needs the address of the buffer
    {
    // do something with the char.. I have stored it in an int vector 
    }

  2. #2
    Join Date
    Jul 2013
    Posts
    576

    Re: Extracting Binary Data?

    Why don't you just read the pdf file as a sequence of bytes (in binary mode) and store the result in a byte array?
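
A minimal sketch of that suggestion, assuming the whole file fits in memory. The name `readAllBytes` is hypothetical and only for illustration; it is not from any library.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read an entire file, in binary mode, into a byte buffer.
// Returns an empty vector if the file cannot be opened.
std::vector<unsigned char> readAllBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return {};
    // istreambuf_iterator yields raw characters and does not skip whitespace.
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

Each element of the result is in the range 0 to 255, so nothing is lost by storing the data as bytes. Wider values only appear later, when the bytes are regrouped.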

  3. #3
    2kaud's Avatar
    2kaud is offline Super Moderator Power Poster
    Join Date
    Dec 2012
    Location
    England
    Posts
    7,822

    Re: Extracting Binary Data?

    How should I read this data? If I read each character into a char the numerical output will not exceed 256. LZW compressed output should exceed 256.
    Data is stored in a file as a sequence of bytes where each byte is 8 bits and each byte has a maximum decimal value of 255. You need to read these bytes as binary sequentially and store them into an array of unsigned char. Once you have read them into the array, then you can process them as required. If LZW compression stores, as you suggest, 12 bits at a time then you will need to process the unsigned char array 12 bits at a time - but this is a function of processing the data and not reading the data.
    All advice is offered in good faith only. All my code is tested (unless stated explicitly otherwise) with the latest version of Microsoft Visual Studio (using the supported features of the latest standard) and is offered as examples only - not as production quality. I cannot offer advice regarding any other c/c++ compiler/IDE or incompatibilities with VS. You are ultimately responsible for the effects of your programs and the integrity of the machines they run on. Anything I post, code snippets, advice, etc is licensed as Public Domain https://creativecommons.org/publicdomain/zero/1.0/ and can be used without reference or acknowledgement. Also note that I only provide advice and guidance via the forums - and not via private messages!

    C++23 Compiler: Microsoft VS2022 (17.6.5)
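
One way to process a byte array N bits at a time is a small bit reader. This is only a sketch: `BitReader` is a hypothetical helper, and it assumes the bits are packed most significant bit first. As far as I know, the PDF LZWDecode filter packs codes high-order bit first and lets the code width grow from 9 to 12 bits, which is why the width is a parameter rather than a fixed 12.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pulls codes of a given bit width out of a byte buffer,
// most significant bit first. The buffer must outlive the reader.
class BitReader
{
public:
    explicit BitReader(const std::vector<unsigned char>& data) : data_(data) {}

    // Reads `width` bits (1..16) into `code`.
    // Returns false when fewer than `width` bits remain.
    bool read(unsigned width, std::uint16_t& code)
    {
        assert(width >= 1 && width <= 16);
        while (bitCount_ < width)
        {
            if (pos_ >= data_.size())
                return false;
            acc_ = (acc_ << 8) | data_[pos_++];  // append the next byte
            bitCount_ += 8;
        }
        bitCount_ -= width;
        // Take the top `width` of the pending bits and mask off older ones.
        code = static_cast<std::uint16_t>((acc_ >> bitCount_) & ((1u << width) - 1u));
        return true;
    }

private:
    const std::vector<unsigned char>& data_;
    std::size_t pos_ = 0;
    std::uint32_t acc_ = 0;    // pending bits live in the low bitCount_ bits
    unsigned bitCount_ = 0;
};
```

For example, the bytes 0x79 0x79 0x79 read with a width of 12 give 0x797 and then 0x979. The file is still read as plain bytes; only this processing step sees values above 255.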

  4. #4
    Join Date
    Dec 2012
    Posts
    18

    Re: Extracting Binary Data?

    Quote Originally Posted by 2kaud View Post
    Data is stored in a file as a sequence of bytes where each byte is 8 bits and each byte has a maximum decimal value of 255. You need to read these bytes as binary sequentially and store them into an array of unsigned char. Once you have read them into the array, then you can process them as required. If LZW compression stores, as you suggest, 12 bits at a time then you will need to process the unsigned char array 12 bits at a time - but this is a function of processing the data and not reading the data.
    Thanks. I agree with the above. This is what I'm doing, but I suppose I was thinking that this is a standard problem: extracting 8-bit characters and processing them to form 12-bit values by concatenation or whatever. Is there a standard way of doing this that you know of? I would assume this is a more common problem than I realize.

  5. #5
    Join Date
    Jul 2013
    Posts
    576

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    Is there a standard way of doing this that you know of?
    Yes, bit fiddling.

    3 times 8 bits equals 2 times 12 bits.
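
A sketch of that 3-bytes-to-2-codes regrouping with shifts and masks. `unpack12` is a hypothetical name, and the layout assumes the first code occupies the most significant bits; a format that packs the other way round would need the shifts changed.

```cpp
#include <array>
#include <cstdint>

// Split three bytes (24 bits) into two 12-bit values, assuming
// most-significant-bit-first packing:
//   b0 = aaaaaaaa  b1 = aaaabbbb  b2 = bbbbbbbb
std::array<std::uint16_t, 2> unpack12(unsigned char b0, unsigned char b1, unsigned char b2)
{
    // First value: all of b0 followed by the high nibble of b1.
    const std::uint16_t first = static_cast<std::uint16_t>((b0 << 4) | (b1 >> 4));
    // Second value: the low nibble of b1 followed by all of b2.
    const std::uint16_t second = static_cast<std::uint16_t>(((b1 & 0x0F) << 8) | b2);
    return {first, second};
}
```

With the bytes 0xAB, 0xCD, 0xEF this gives 0xABC and 0xDEF, which makes the grouping easy to check by eye.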

  6. #6
    Join Date
    Dec 2012
    Posts
    18

    Re: Extracting Binary Data?

    Yes, I actually have a processing solution for this. Can you think of another circumstance where this approach is common?
    If what I'm doing is an isolated case then chances are I'm approaching this the wrong way.

    I know how to process bits, just unsure how common the need for this is?

  7. #7
    Join Date
    Jul 2013
    Posts
    576

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    I know how to process bits, just unsure how common the need for this is?
    When you do low-level processing of bitfields there's not much of a choice.

    But it's tedious and error-prone so if you don't have to do it yourself it's much better to use existing code. I would first try to locate some open source LZW library.
    Last edited by razzle; November 30th, 2013 at 07:06 AM.

  8. #8
    Join Date
    Dec 2012
    Posts
    18

    Re: Extracting Binary Data?

    Well, I can almost guarantee that there is no such library. There are libraries for processing compressed data of different descriptions but not for processing binary streams.

    When you say concatenation do you mean it to be like the below?

    The numbers 121, 121 are 01111001, 01111001. If I concatenate the first 8 bits with the next 4 bits, would I get 01111001 0111?

    But you have said that I concatenate the less significant bits, so this would be 01111001 1001, which equals 12 bits.
    Is that version correct? I'm trying to understand because you have told me the direct opposite of what someone else did. It's important that I get this right.

    Whichever version it is, I'm assuming that the 4th byte is linked to the least significant or most significant 4 bits of the 5th byte. You will need to read this carefully to understand what I am asking.
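
The two readings of the 121, 121 example can be written out as arithmetic. The function names below are hypothetical. If the data is packed most significant bit first (which, to my knowledge, is how PDF LZWDecode packs its codes), the first reading is the one that applies, and the leftover low 4 bits of the second byte start the next code.

```cpp
#include <cstdint>

// Reading 1: all of a, then the TOP 4 bits of b (MSB-first packing).
constexpr std::uint16_t withHighNibble(unsigned char a, unsigned char b)
{
    return static_cast<std::uint16_t>((a << 4) | (b >> 4));
}

// Reading 2: all of a, then the BOTTOM 4 bits of b.
constexpr std::uint16_t withLowNibble(unsigned char a, unsigned char b)
{
    return static_cast<std::uint16_t>((a << 4) | (b & 0x0F));
}

// 121 is 01111001 in binary (0x79).
static_assert(withHighNibble(121, 121) == 0x797, "01111001 0111");
static_assert(withLowNibble(121, 121) == 0x799, "01111001 1001");
```

In decimal those are 1943 and 1945, so the two interpretations really do produce different codes and only one can match the encoder.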

  9. #9
    Join Date
    Jul 2013
    Posts
    576

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    Well, I can almost guarantee that there is no such library.
    And I'm certain there is. A quick search on the net and this came up,

    http://marknelson.us/1989/10/01/lzw-data-compression/

    It's a Dr. Dobb's article with several source code listings.

  10. #10
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    Well, I can almost guarantee that there is no such library.
    Are you saying there is no library that can decompress a stream that is LZW encoded??
    There are libraries for processing compressed data of different descriptions but not for processing binary streams.
    Where the data comes from is irrelevant. You have a buffer, it has data, the data gets processed.

    It's incredible to say "there is no such library", especially for an algorithm as well known as Lempel-Ziv-Welch. There are many libraries and open source projects showing compression/decompression of LZW data using C and C++. If you want an example, there is libtiff that has LZW compression/decompression. Here is another:

    http://www.codeproject.com/Articles/...ng-Binary-Tree

    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; November 30th, 2013 at 11:43 AM.

  11. #11
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    I don't suppose anyone knows anything about reading stream data from PDF files?

    I'm trying to read stream data from a stream in a pdf file compressed in LZW. My first step is to read this compressed data into a buffer. The second step is to decompress it. I know how to decompress it, it is the first step I am having trouble with.

    How should I read this data?
    A PDF object describes the LZW stream, and this object gives you a length attribute that tells you in bytes the size of the LZW data. So why not just read the number of bytes described by what the length is stating? Then after you do that, you process the buffer.

    Also, I don't understand your code example. A PDF file is much more than an LZW stream. You have dictionaries, objects, cross-reference tables, etc. that have nothing to do with LZW data. Maybe you should first identify when a PDF object is describing an LZW stream, and then you process this stream accordingly.

    Regards,

    Paul McKenzie

  12. #12
    Join Date
    Dec 2012
    Posts
    18

    Re: Extracting Binary Data?

    Quote Originally Posted by Paul McKenzie View Post
    Are you saying there is no library that can decompress a stream that is LZW encoded?? Where the data comes from is irrelevant. You have a buffer, it has data, the data gets processed.

    It's incredible to say "there is no such library", especially for an algorithm as well known as Lempel-Ziv-Welch. There are many libraries and open source projects showing compression/decompression of LZW data using C and C++. If you want an example, there is libtiff that has LZW compression/decompression. Here is another:

    http://www.codeproject.com/Articles/...ng-Binary-Tree

    Regards,

    Paul McKenzie
    The decompression process is simple. All I meant is that there is no such library that will let me read the 8-bit data into 12-bit groups. That's all.

    Yes, there are several LZW implementations. Let's not get mixed up between reading the compressed data and extracting data from the decompressed data.

    Let's not get ahead here. I know how to extract a stream. It's just the reading of the compressed data (before decompression) I'm trying to get right.
    Last edited by tomadom; November 30th, 2013 at 09:57 PM.

  13. #13
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Extracting Binary Data?

    Quote Originally Posted by tomadom View Post
    It's just the reading of the compressed data (before decompression) I'm trying to get right.
    So again, why not read the LZW object's description? It has the length telling you how long the data is. So you position the file pointer to the beginning of that data and call the read() function reading that many bytes.
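
A rough sketch of that "seek, then read /Length bytes" step. `readStreamBytes` is a hypothetical name, and it assumes the caller has already worked out the data offset and the length from the object. The cast is what gets around the "unsigned char gives an error" problem, since read() wants a char*.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read `length` bytes starting at byte `offset` of the file.
// Returns false if the file cannot be opened or is too short.
bool readStreamBytes(const std::string& path, std::streamoff offset, std::size_t length,
                     std::vector<unsigned char>& out)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return false;
    in.seekg(offset, std::ios::beg);  // position at the first stream byte
    out.resize(length);
    // read() takes a char*, so the unsigned buffer is reinterpreted for the call.
    in.read(reinterpret_cast<char*>(out.data()), static_cast<std::streamsize>(length));
    return static_cast<std::size_t>(in.gcount()) == length;
}
```

One call replaces the byte-at-a-time loop, and the buffer it fills is what gets handed to the bit-level processing afterwards.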

    But again, your code sample in your first post doesn't make sense. You're supposed to read the PDF file, parse it, looking for objects that describe an LZW stream. Once you find that object, you know everything about the stream in terms of its length. Object data occurs usually after the stream indicator for the object.

    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; November 30th, 2013 at 10:08 PM.

  14. #14
    Join Date
    Apr 1999
    Posts
    27,449

    Re: Extracting Binary Data?

    This:
    Code:
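    inputfilestream.open("myfile.pdf", ios::in | ios::binary);  // ios::in (not ios::out) for an input stream
    char mychar;  // unsigned char gives an error because read() expects a char*; it can be converted to unsigned char later.
    
    while (inputfilestream.read(&mychar, 1))  // read() needs the address of the buffer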
    doesn't make sense.

    You don't parse PDF files this simplistically. Where do you skip over the PDF header? Where do you parse looking for objects, dictionaries, etc.? Where is the check for the stream / endstream markers delineating the LZW data? Even more basic, where is the test to see if you even have an LZW stream (a PDF file can have Flate, ASCII85, or any number of other encodings for a stream)?
    Code:
    1 0 obj
    << /Length 10234
    /Filter [/LZWDecode]
    ...
    >>
    stream
    This is the LZW stream.....
    endstream
    That is an example of what an object would look like. The /Length tells you how many bytes long the stream is, and the /Filter tells you that the stream is LZW encoded. Then the stream / endstream markers delineate the data. Also, the /Filter can have multiple filters on it, so the stream data may even be encoded twice, first with something like ASCII85 and then with LZW.
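
To make the object layout above concrete, here is a deliberately naive sketch that pulls the /Length value and the start of the data out of a buffer holding the file contents. `StreamSpan` and `findStreamSpan` are hypothetical names. It assumes /Length is a direct integer (it can also be an indirect reference such as "12 0 R", which this would misread) and that the word "stream" does not occur inside the dictionary, so it is no substitute for a real PDF parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct StreamSpan
{
    std::size_t offset = 0;  // index of the first data byte after "stream"
    std::size_t length = 0;  // value of a direct /Length entry
};

// Finds the first "/Length <digits>" at or after `from`, then the next
// "stream" keyword. Returns false if either is missing.
bool findStreamSpan(const std::string& pdf, std::size_t from, StreamSpan& span)
{
    const std::string key = "/Length";
    std::size_t p = pdf.find(key, from);
    if (p == std::string::npos)
        return false;
    p += key.size();
    while (p < pdf.size() && std::isspace(static_cast<unsigned char>(pdf[p])))
        ++p;
    if (p >= pdf.size() || !std::isdigit(static_cast<unsigned char>(pdf[p])))
        return false;
    std::size_t length = 0;
    while (p < pdf.size() && std::isdigit(static_cast<unsigned char>(pdf[p])))
        length = length * 10 + static_cast<std::size_t>(pdf[p++] - '0');

    std::size_t s = pdf.find("stream", p);
    if (s == std::string::npos)
        return false;
    s += 6;  // skip the keyword itself
    // The keyword is followed by CRLF or LF before the data begins.
    if (s < pdf.size() && pdf[s] == '\r')
        ++s;
    if (s < pdf.size() && pdf[s] == '\n')
        ++s;
    span.offset = s;
    span.length = length;
    return true;
}
```

The resulting offset and length identify exactly the bytes to copy out of the file buffer before any LZW handling starts.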

    Regards,

    Paul McKenzie
    Last edited by Paul McKenzie; November 30th, 2013 at 10:20 PM.

  15. #15
    Join Date
    Dec 2012
    Posts
    18

    Re: Extracting Binary Data?

    Thanks for your input. But I know all this and have done this in my code. I only provided the relevant bits in my first post so that we didn't get sidelined (at which, I must confess, I failed miserably). Yes, it ascertains the filter type and the stream and endstream markers. All is good on that front. All I have asked is how the 8-bit characters are transformed into 12-bit chunks. That's it. I actually know how to process the data but, as asked above, just wanted to confirm whether there is a common approach to this transformation from 8-bit compressed to 12-bit compressed data. The decompression part comes later, down the road a ways. Let's forget how to find the stream or decompress it. Believe me, I know how to do this.
