How do you debug a binary format?












11















I would like to be able to debug a program that builds a binary format. Right now I basically print out the input data to the binary parser, then go deep into the code and print out the mapping from input to output, then take the output mapping (integers) and use that to locate the corresponding integers in the binary. This is pretty clunky, and it requires that I modify the source code deeply just to get at the mapping between input and output.



What it seems like you could do is view the binary in different ways (in my case I'd like to view it in 8-bit chunks as decimal numbers, because that's pretty close to the input). Actually, some numbers are 16-bit, some 8-bit, some 32-bit, etc. So maybe there would be a way to view the binary with each of these different-sized numbers highlighted in some way.



The only way I can see that being possible is if you build a visualizer specific to the actual binary format/layout, so that it knows where in the sequence the 32-bit numbers should be, where the 8-bit numbers should be, etc. That is a lot of work, and kind of tricky in some situations, so I'm wondering if there's a general way to do it.
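To illustrate what I mean: if the layout were described as a table, the visualizer could be quite small. A rough Python sketch (the field names and layout here are made up, just to show the idea):

```python
import struct

# Hypothetical layout: (field name, struct format code).
# "B" = unsigned 8-bit, "<H" = little-endian 16-bit, "<I" = little-endian 32-bit.
LAYOUT = [("version", "B"), ("flags", "B"), ("count", "<H"), ("checksum", "<I")]

def annotated_dump(data: bytes) -> list[str]:
    """Decode each field at its offset and show name, width, and value."""
    lines, offset = [], 0
    for name, fmt in LAYOUT:
        size = struct.calcsize(fmt)
        (value,) = struct.unpack_from(fmt, data, offset)
        lines.append(f"{offset:04x}  {name:<8} {8 * size:>2}-bit  {value}")
        offset += size
    return lines

for line in annotated_dump(bytes([2, 1, 0x34, 0x12, 0x78, 0x56, 0x34, 0x12])):
    print(line)
```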



I'm also wondering what the general way of debugging this type of thing currently is, so maybe I can get some ideas on what to try.




























  • 71





    You got one answer saying "use the hexdump directly, and do this and that additionally"- and that answer got a lot of upvotes. And a second answer, 5 hours later(!), saying only "use a hexdump". Then you accepted the second one in favor of the first? Seriously?

    – Doc Brown
    2 days ago








  • 4





    While you might have a good reason to use a binary format, do consider whether you can just use an existing text format like JSON instead. Human readability counts a lot, and machines and networks are typically fast enough that using a custom format to reduce size is unnecessary nowadays.

    – jpmc26
    2 days ago








  • 2





    The tool you want is a hex editor. If you're using Windows, HxD is pretty good.

    – immibis
    2 days ago






  • 3





    @jpmc26 there's still a lot of use for binary formats and always will be. Human readability is usually secondary to performance, storage requirements, and network performance. And there are still many areas where especially network performance is poor and storage limited. Plus don't forget all the systems having to interface with legacy systems (both hardware and software) and having to support their data formats.

    – jwenting
    yesterday






  • 3





    @jwenting No, actually, developer time is usually the most expensive piece of an application. Sure, that may not be the case if you're working at Google or Facebook, but most apps don't operate at that scale. And when your developers' time is the most expensive resource, human readability counts for a lot more than the extra 100 milliseconds the program takes to parse it.

    – jpmc26
    yesterday


















Tags: debugging, binary






asked 2 days ago









Lance Pollard









5 Answers


















74














For ad-hoc checks, just use a standard hexdump and learn to eyeball it.



If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python - ideally this will be driven directly from a message spec document or IDL, and be as automated as possible (so there's no chance of manually introducing the same bug in both decoders).



Lastly, don't forget you should be writing unit tests for your decoder, using known-correct canned input.
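To make that concrete, here is a minimal sketch of what such a spec-driven decoder might look like in Python; the spec format and field names are invented for illustration, not taken from any actual IDL:

```python
import struct

# Hypothetical message spec, as it might be extracted from a spec document:
# one "name:format" pair per line, using struct format codes.
SPEC = """
msg_type:B
length:<H
payload_id:<I
"""

def parse_spec(spec: str) -> list[tuple[str, str]]:
    """Turn the textual spec into (name, format) pairs."""
    return [tuple(line.split(":")) for line in spec.split() if line]

def decode(data: bytes, spec: str) -> dict:
    """Decode each field in order at its running offset."""
    fields, offset = {}, 0
    for name, fmt in parse_spec(spec):
        (fields[name],) = struct.unpack_from(fmt, data, offset)
        offset += struct.calcsize(fmt)
    return fields

# Unit test with known-correct canned input, as suggested above.
canned = bytes([7, 0x05, 0x00, 0x2A, 0x00, 0x00, 0x00])
assert decode(canned, SPEC) == {"msg_type": 7, "length": 5, "payload_id": 42}
```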






answered 2 days ago
Useless



















  • 1





    "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

    – Mast
    2 days ago











  • I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice but hits a practicability wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBus decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.

    – Peter A. Schneider
    yesterday



















9














The first step is that you need a way to find or define a grammar that describes the structure of the data, i.e. a schema.



An example of this is a COBOL language feature informally known as a copybook. In COBOL programs you define the structure of the data in memory, and that structure maps directly to the way the bytes are stored. This is common in languages of that era, as opposed to common contemporary languages, where the physical layout of memory is an implementation concern abstracted away from the developer.



A Google search for binary data schema language turns up a number of tools; an example is Apache DFDL. There may already be UIs for this as well.
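For a small taste of the copybook idea in a contemporary language, Python's ctypes lets you declare a layout once and have named fields map directly onto the bytes. The record below is hypothetical:

```python
import ctypes

class Record(ctypes.LittleEndianStructure):
    """Declarative layout: field order and widths define the byte format."""
    _pack_ = 1  # no padding, like a packed wire format
    _fields_ = [
        ("version", ctypes.c_uint8),
        ("count", ctypes.c_uint16),
        ("checksum", ctypes.c_uint32),
    ]

# Map raw bytes onto the declared structure.
raw = bytes([1, 0x0A, 0x00, 0x78, 0x56, 0x34, 0x12])
rec = Record.from_buffer_copy(raw)
print(rec.version, rec.count, hex(rec.checksum))
```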



























  • 2





    This feature is not reserved to 'ancient' era languages. C and C++ structs and unions can be memory-aligned, and C# has StructLayoutAttribute, which I have used to transmit binary data.

    – Kasper van den Berg
    2 days ago






  • 1





    @KaspervandenBerg Unless you are saying that C and C++ added these recently, I consider that the same era. The point is that these formats were not simply for data transmission, though they were used for that, they mapped directly to how the code worked with data in memory and on disk. That's not, in general, how newer languages tend to work though they may have such features.

    – JimmyJames
    2 days ago











  • @KaspervandenBerg C++ does not do that as much as you think it does. It is possible using implementation-specific tooling to align and eliminate padding (and, admittedly, increasingly the standard is adding features for this kind of thing) and member order is deterministic (but not necessarily the same as in memory!).

    – Lightness Races in Orbit
    yesterday



















5














ASN.1, Abstract Syntax Notation One, provides a way of specifying a binary format.




  • DDT - develop using sample data and unit tests.

  • A textual dump can be helpful. If it's in XML, you can collapse/expand sub-hierarchies.

  • ASN.1 is not strictly needed, but a grammar-based, more declarative file specification is easier to work with.

























  • 5





    If the never-ending parade of security vulnerabilities in ASN.1 parsers is any indication, adopting it would certainly provide good exercise in debugging binary formats.

    – Mark
    2 days ago






  • 1





    @Mark many small byte arrays (in varying hierarchy trees) are often not handled correctly (securely) in C (for instance, there are no exceptions). Never underestimate the low-level, inherently unsafe nature of C. ASN.1 in, for instance, Java does not expose this problem. Since ASN.1 grammar-directed parsing can be done safely, even a C implementation could be built on a small and safe code base. And some of the vulnerabilities are inherent in the binary format itself: one can exploit "legal" constructs of the format's grammar that have disastrous semantics.

    – Joop Eggen
    yesterday



















1














I'm not sure I fully understand, but it sounds like you have a parser for this binary format, and you control the code for it. So this answer is built on that assumption.



A parser will in some way be filling up structs, classes, or whatever data structure your language has. If you implement a ToString for everything that gets parsed, then you end up with a very easy to use and easily maintained method of displaying that binary data in a human readable format.



You would essentially have:



byte[] arrayOfBytes; // initialized somehow
Object obj = Parser.parse(arrayOfBytes);
Logger.log(obj.ToString());


And that's it, from the standpoint of using it. Of course this requires that you implement/override the ToString function for your object class/struct/whatever, and you would also have to do so for any nested classes/structs/whatevers.



You can additionally use a conditional statement to prevent the ToString function from being called in release code so that you don't waste time on something that won't be logged outside of debug mode.



Your ToString might look like this:



return String.format("%d,%d,%d,%d", int32var, int16var, int8var, int32var2);

// OR

return String.format("%s:%d,%s:%d,%s:%d,%s:%d", varName1, int32var, varName2, int16var, varName3, int8var, varName4, int32var2);


Your original question makes it sound like you've somewhat attempted to do this, and that you think this method is burdensome, but you have also at some point implemented parsing a binary format and created variables to store that data. So all you have to do is print those existing variables at the appropriate level of abstraction (the class/struct the variable is in).



This is something you should only have to do once, and you can do it while building the parser. And it will only change when the binary format changes (which will already prompt a change to your parser anyways).



In a similar vein: some languages have robust features for turning classes into XML or JSON. C# is particularly good at this. You don't have to give up your binary format, you just do the XML or JSON in a debug logging statement and leave your release code alone.
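In languages with less built-in support, the same debug-logging trick is still only a few lines. For example, in Python with dataclasses (all names here are hypothetical):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Header:
    version: int
    flags: int

@dataclass
class Message:
    header: Header
    payload_id: int

# After parsing, dump the whole object tree as one debug log line.
msg = Message(header=Header(version=2, flags=1), payload_id=42)
print(json.dumps(asdict(msg), indent=2))
```

Since asdict recurses into nested dataclasses, nested structures come along for free.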



I'd personally recommend not going the hex dump route, because it's prone to errors (did you start on the right byte, are you sure when you're reading left to right that you're "seeing" the correct endianness, etc).



Example: Say your ToStrings spit out variables a,b,c,d,e,f,g,h. You run your program and notice a bug with g, but the problem really started with c (but you're debugging, so you haven't figured that out yet). If you know the input values (and you should) you'll instantly see that c is where problems start.



Compare that to a hex dump that just tells you 338E 8455 0000 FF76 0000 E444 ...: if your fields vary in size, where does c begin, and what is its value? A hex editor will tell you, but my point is that this is error-prone and time-consuming. Not only that, but you can't easily or quickly automate a test via a hex viewer. Printing out a string after parsing the data will tell you exactly what your program is 'thinking', and is one step along the path to automated testing.





































    1














    Other answers have described viewing a hex dump, or writing out object structures in e.g. JSON. I think combining both of these is very helpful.

    Using a tool that can render the JSON on top of the hex dump is really useful; I wrote an open-source tool called dotNetBytes that parses .NET binaries. Here's a view of an example DLL.



    dotNetBytes Example




























      protected by gnat yesterday



      Thank you for your interest in this question.
      Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



      Would you like to answer one of these unanswered questions instead?














      5 Answers
      5






      active

      oldest

      votes








      5 Answers
      5






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      74














      For ad-hoc checks, just use a standard hexdump and learn to eyeball it.



      If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python - ideally this will be driven directly from a message spec document or IDL, and be as automated as possible (so there's no chance of manually introducing the same bug in both decoders).



      Lastly, don't forget you should be writing unit tests for your decoder, using known-correct canned input.






      share|improve this answer



















      • 1





        "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

        – Mast
        2 days ago











      • I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice but hits a practicability wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBus decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.

        – Peter A. Schneider
        yesterday
















      74














      For ad-hoc checks, just use a standard hexdump and learn to eyeball it.



      If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python - ideally this will be driven directly from a message spec document or IDL, and be as automated as possible (so there's no chance of manually introducing the same bug in both decoders).



      Lastly, don't forget you should be writing unit tests for your decoder, using known-correct canned input.






      share|improve this answer



















      • 1





        "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

        – Mast
        2 days ago











      • I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice but hits a practicability wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBus decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.

        – Peter A. Schneider
        yesterday














      74












      74








      74







      For ad-hoc checks, just use a standard hexdump and learn to eyeball it.



      If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python - ideally this will be driven directly from a message spec document or IDL, and be as automated as possible (so there's no chance of manually introducing the same bug in both decoders).



      Lastly, don't forget you should be writing unit tests for your decoder, using known-correct canned input.






      share|improve this answer













      For ad-hoc checks, just use a standard hexdump and learn to eyeball it.



      If you want to tool up for a proper investigation, I usually write a separate decoder in something like Python - ideally this will be driven directly from a message spec document or IDL, and be as automated as possible (so there's no chance of manually introducing the same bug in both decoders).



      Lastly, don't forget you should be writing unit tests for your decoder, using known-correct canned input.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered 2 days ago









      UselessUseless

      8,98922036




      8,98922036








      • 1





        "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

        – Mast
        2 days ago











      • I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice but hits a practicability wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBus decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.

        – Peter A. Schneider
        yesterday














      • 1





        "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

        – Mast
        2 days ago











      • I find a separate decoder well worth the effort if the binary data plays an important part in the application (or system, in general). This is especially true if the data format is variable: data in fixed layouts can be spotted in a hexdump with a little practice but hits a practicability wall fast. We debugged USB and CAN traffic with commercial packet decoders, and I have written a PROFIBus decoder (where variables spread across bytes, completely unreadable in a hex dump), and found all three of them immensely helpful.

        – Peter A. Schneider
        yesterday








      1




      1





      "just use a standard hexdump and learn to eyeball it." Yup. In my experience, multiple sections of anything up to 200 bits can be written down on a whiteboard for grouped comparison, which sometimes helps with this kind of thing to get started.

      – Mast
      2 days ago
      9














      The first step is to find or define a grammar that describes the structure of the data, i.e. a schema.



      An example of this is a language feature of COBOL informally known as the copybook. In COBOL programs you would define the structure of the data in memory, and that structure mapped directly to the way the bytes were stored. This is common to languages of that era, as opposed to most contemporary languages, where the physical layout of memory is an implementation concern abstracted away from the developer.



      A Google search for "binary data schema language" turns up a number of tools. An example is DFDL (Data Format Description Language), implemented by Apache Daffodil. There may already be UIs for this as well.
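      The schema idea above can be sketched in a few lines of Python (the field names and layout below are hypothetical, standing in for whatever your real format defines): a declarative list of (name, type) pairs drives both the decoding and an annotated dump, so you see each 8/16/32-bit value next to its offset and raw bytes.

```python
import struct

# Hypothetical schema: each entry is (field name, struct format code).
# ">" means big-endian; B = 8-bit, H = 16-bit, I = 32-bit unsigned.
SCHEMA = [
    ("version",  ">B"),
    ("msg_type", ">B"),
    ("length",   ">H"),
    ("payload",  ">I"),
]

def annotate(data: bytes):
    """Decode `data` against SCHEMA, yielding (offset, name, raw hex, value)."""
    offset = 0
    for name, fmt in SCHEMA:
        size = struct.calcsize(fmt)
        raw = data[offset:offset + size]
        (value,) = struct.unpack(fmt, raw)
        yield offset, name, raw.hex(), value
        offset += size

blob = bytes([0x01, 0x02, 0x00, 0x05, 0x00, 0x00, 0x03, 0xE8])
for offset, name, raw_hex, value in annotate(blob):
    print(f"{offset:04x}  {name:<8} {raw_hex:<8} = {value}")
```

      When the format changes, only the SCHEMA list changes; the viewer itself stays untouched.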
      • 2





        This feature is not exclusive to languages of that era. C and C++ structs and unions can be memory-aligned. C# has StructLayoutAttribute, which I have used to transmit binary data.

        – Kasper van den Berg
        2 days ago






      • 1





        @KaspervandenBerg Unless you are saying that C and C++ added these recently, I consider that the same era. The point is that these formats were not simply for data transmission (though they were used for that); they mapped directly to how the code worked with data in memory and on disk. That's not, in general, how newer languages work, though they may have such features.

        – JimmyJames
        2 days ago











      • @KaspervandenBerg C++ does not do that as much as you think it does. It is possible, using implementation-specific tooling, to align and eliminate padding (and, admittedly, the standard is increasingly adding features for this kind of thing), and member order is deterministic (but not necessarily the same as in memory!).

        – Lightness Races in Orbit
        yesterday

      – JimmyJames
      answered 2 days ago, edited 2 days ago

      5














      ASN.1, Abstract Syntax Notation One, provides a way of specifying a binary format.




      • DDT: develop using sample data and unit tests.

      • A textual dump can be helpful. If it is XML, you can collapse/expand sub-hierarchies.

      • ASN.1 is not strictly needed, but a grammar-based, more declarative file specification makes things easier.
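      The "sample data plus unit tests" point can be sketched as a round-trip test (the record layout here is hypothetical; substitute your own format): encode a known record, assert the exact bytes produced against bytes written out by hand from the spec, then assert that decoding them yields the original record. This catches drift in either direction.

```python
import struct

# Hypothetical record: 8-bit id, 16-bit count, 32-bit total (big-endian).
FMT = ">BHI"

def encode(record):
    return struct.pack(FMT, *record)

def decode(data):
    return struct.unpack(FMT, data)

# Known sample; expected bytes derived by hand from the (hypothetical) spec:
# 7 -> 07, 258 -> 01 02, 70000 -> 00 01 11 70.
sample = (7, 258, 70000)
expected = bytes.fromhex("07010200011170")

assert encode(sample) == expected, encode(sample).hex()
assert decode(expected) == sample
print("round-trip OK")
```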
      • 5





        If the never-ending parade of security vulnerabilities in ASN.1 parsers is any indication, adopting it would certainly provide good exercise in debugging binary formats.

        – Mark
        2 days ago






      • 1





        @Mark many small byte arrays (in varying hierarchy trees) are often not handled securely in C (for instance, without exceptions). Never underestimate the low-level, inherently unsafe nature of C. ASN.1 in, for instance, Java does not expose this problem. As ASN.1 grammar-directed parsing can be done safely, even a C implementation could have a small and safe code base. And part of the vulnerabilities are inherent to the binary format itself: one can exploit "legal" constructs of the format's grammar that have disastrous semantics.

        – Joop Eggen
        yesterday

      – Joop Eggen
      answered 2 days ago











      1














      I'm not sure I fully understand, but it sounds like you have a parser for this binary format, and you control the code for it. So this answer is built on that assumption.



      A parser will, in some way, be filling up structs, classes, or whatever data structure your language has. If you implement a ToString for everything that gets parsed, you end up with an easy-to-use, easily maintained way of displaying that binary data in a human-readable format.



      You would essentially have:



      byte[] arrayOfBytes; // initialized somehow
      Object obj = Parser.parse(arrayOfBytes);
      Logger.log(obj.ToString());


      And that's it, from the standpoint of using it. Of course, this requires you to implement/override the ToString function for your Object class/struct/whatever, and you would also have to do so for any nested classes/structs/whatevers.



      You can additionally use a conditional statement to prevent the ToString function from being called in release code so that you don't waste time on something that won't be logged outside of debug mode.



      Your ToString might look like this:



      return String.Format("%d,%d,%d,%d", int32var, int16var, int8var, int32var2);

      // OR

      return String.Format("%s:%d,%s:%d,%s:%d,%s:%d", varName1, int32var, varName2, int16var, varName3, int8var, varName4, int32var2);


      Your original question makes it sound like you've somewhat attempted to do this, and that you think this method is burdensome, but you have also at some point implemented parsing a binary format and created variables to store that data. So all you have to do is print those existing variables at the appropriate level of abstraction (the class/struct the variable is in).



      This is something you should only have to do once, and you can do it while building the parser. And it will only change when the binary format changes (which will already prompt a change to your parser anyway).



      In a similar vein: some languages have robust features for turning classes into XML or JSON. C# is particularly good at this. You don't have to give up your binary format, you just do the XML or JSON in a debug logging statement and leave your release code alone.
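      The same idea can be sketched in Python (the field names below are hypothetical, standing in for whatever your parser fills in): dataclasses plus the standard json module give you a structured debug dump almost for free.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical parsed structures -- stand-ins for your parser's output types.
@dataclass
class Header:
    version: int
    msg_type: int

@dataclass
class Packet:
    header: Header
    length: int
    payload: int

pkt = Packet(header=Header(version=1, msg_type=2), length=5, payload=1000)

# One line turns the whole parsed tree into a readable debug dump,
# nested objects included.
print(json.dumps(asdict(pkt), indent=2))
```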



      I'd personally recommend not going the hex-dump route, because it's prone to errors (did you start on the right byte? are you sure, reading left to right, that you're "seeing" the correct endianness? etc.).



      Example: Say your ToStrings spit out variables a,b,c,d,e,f,g,h. You run your program and notice a bug with g, but the problem really started with c (but you're debugging, so you haven't figured that out yet). If you know the input values (and you should) you'll instantly see that c is where problems start.



      Compared to that, a hex dump just tells you 338E 8455 0000 FF76 0000 E444 ...; if your fields vary in size, where does c begin, and what's its value? A hex editor will tell you, but my point is that this is error-prone and time-consuming. Not only that, but you can't easily or quickly automate a test via a hex viewer. Printing out a string after parsing the data will tell you exactly what your program is 'thinking', and is one step along the path to automated testing.

      – Shaz
      answered yesterday

















              1














              Other answers have described viewing a hex dump, or writing out object structures in, e.g., JSON. I think combining both of these is very helpful.



              Using a tool that can render the JSON on top of the hex dump is really useful. I wrote an open-source tool that parses .NET binaries, called dotNetBytes; here's a view of an example DLL.



              dotNetBytes Example
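              A crude version of that overlay idea can be sketched in Python (the three-field layout is hypothetical): print the hex bytes once, then one underline row per field marking which bytes it spans.

```python
import struct

# Hypothetical layout driving the overlay, schema-style.
SCHEMA = [("version", ">B"), ("length", ">H"), ("payload", ">I")]

def overlay_dump(data: bytes) -> str:
    """Render a hex dump with one caret row per field, marking its byte span."""
    hex_row = " ".join(f"{b:02x}" for b in data)
    out = [hex_row]
    offset = 0
    for name, fmt in SCHEMA:
        size = struct.calcsize(fmt)
        start = offset * 3           # each byte occupies "xx " = 3 columns
        width = size * 3 - 1         # carets cover the field, minus trailing space
        out.append(" " * start + "^" * width + f" {name}")
        offset += size
    return "\n".join(out)

blob = bytes([0x01, 0x00, 0x05, 0x00, 0x00, 0x03, 0xE8])
print(overlay_dump(blob))
```

              For the sample bytes this prints the hex row with "version", "length", and "payload" each underlined beneath their own bytes.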

                  – Carl Walsh
                  answered yesterday

                      protected by gnat yesterday