[compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
Hi, all.
Is there any interest in a cross-platform C++ library for creating, reading
and writing binary compound files?

Briefly, a compound file is a file system for storing files and directories
within a single file on disk. The compound file format was originally
developed by Microsoft and is now part of Microsoft's Open
Specifications Documentation. Compound files are used across various
platforms and applications and are not restricted to any specific domain. In
general, any application can store information this way. I found interest
in such a library on the Internet and decided to implement it.

Additional information can be found here:

Wikipedia article:
http://en.wikipedia.org/wiki/Compound_File_Binary_Format

Benefits of Compound Files:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa378938(v=vs.85).aspx

[MS-CFB] Open specification for compound files:
http://msdn.microsoft.com/en-us/library/dd942138.aspx

Regards,
Alexander Voitenko.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [compond_file_binary] Gauging interest in a possible library submission.

Andrey Semashev-2
On Fri, Nov 30, 2012 at 6:15 PM, А В <[hidden email]> wrote:

> Hi, all.
> Is there any interest in a cross-platform C++ library which allows to
> create/read/write binary compound files?
>
> Briefly, compound file is filesystem for storing files and directories
> within a single file on a disk. Initially compound file format is
> developed by Microsoft and now is the part of Microsoft's Open
> Specifications Documentation. Compound files are used across various
> platforms and applications, and not restricted by some specific domain. In
> general, any application can store information in such a way. I've found
> on the Internet interest in such library and decided to implement it.

Does it provide any benefits compared to mounting a loop device and
working with it through traditional file system interfaces? It seems
odd to have a library duplicating file system operations.


Re: [compond_file_binary] Gauging interest in a possible library submission.

Olaf van der Spek-3
On Fri, Nov 30, 2012 at 5:44 PM, Andrey Semashev
<[hidden email]> wrote:
> Does it provide any benefits compared to mounting a loop device and
> working with it through traditional file system interfaces? It seems
> odd to have a library duplicating file system operations.

Is it? We don't use loop devices to read/write tar/zip files, do we?
Can you even mount loop devices as an unprivileged user?
Does Windows even have loop devices?


--
Olaf


Re: [compond_file_binary] Gauging interest in a possible library submission.

Andrey Semashev-2
On December 2, 2012 8:01:12 PM Olaf van der Spek <[hidden email]> wrote:
> On Fri, Nov 30, 2012 at 5:44 PM, Andrey Semashev
> <[hidden email]> wrote:
> > Does it provide any benefits compared to mounting a loop device and
> > working with it through traditional file system interfaces? It seems
> > odd to have a library duplicating file system operations.
>
> Is it? We don't use loop devices to read/write tar/zip files, do we?

On OS X you do mount packages. You also typically mount various image
files. Library-level access to conventional archive files is a legacy
from older systems (read Windows and DOS) that did not support flexible
mounting.

> Can you even mount loop devices as unprivileged user?

Ok, you have a point about unprivileged mounting.

> Does Windows even have loop devices?

According to Wikipedia, it does now. But then again, Windows is not the
whole world.




Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
In reply to this post by Andrey Semashev-2
> Does it provide any benefits compared to mounting a loop device and
> working with it through traditional file system interfaces? It seems
> odd to have a library duplicating file system operations.

First of all, compound files are portable. You can use the same data in the same compound files on different operating systems, not only Linux, Windows and Mac OS.

Please note that both Unix-like systems and Windows struggle with limits on the number of loop devices:
http://www.tldp.org/HOWTO/CDServer-HOWTO/addloops.html [Linux]
http://stackoverflow.com/questions/1944877/maximum-number-of-drives-in-windows [Windows]

But with such a library you can simultaneously read and write thousands (millions?) of such files without restrictions from the system side. The library can be linked statically, so the only thing required to access such a file is to launch your application. The application need not be written in C++, because analogous libraries already exist for Java (http://poi.apache.org) and C# (http://openmcdf.sourceforge.net).

Also, if I want to use a lot of different compound files with my data within my application, it is too awkward to mount and unmount devices at runtime, and it also requires super-user privileges.

Please do not think of compound files only as a file system replacement. Imagine software that produces documents. Such documents use a binary format and have a very complex internal structure, so that structure can be represented as a file system in which logical parts are grouped into folders and files.
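
To illustrate the idea of a document as a small file system, here is a minimal in-memory sketch. The names `storage`, `stream` and `count_streams` are hypothetical; they are not the API of the proposed library or of any Microsoft implementation. They only show logical parts grouped into folders (storages) and files (streams).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical types, for illustration only.
struct stream {
    std::vector<unsigned char> bytes;   // raw content of one "file"
};

struct storage {
    std::map<std::string, std::unique_ptr<storage>> storages;  // sub-folders
    std::map<std::string, stream> streams;                      // files

    // Return the child storage with this name, creating it if needed.
    storage& child(const std::string& name) {
        assert(!name.empty());
        std::unique_ptr<storage>& slot = storages[name];
        if (!slot) slot.reset(new storage());
        return *slot;
    }
};

// Count every stream reachable from the given storage.
inline std::size_t count_streams(const storage& s) {
    std::size_t n = s.streams.size();
    for (const auto& kv : s.storages) n += count_streams(*kv.second);
    return n;
}
```

A document could then keep, say, its text under `root.child("Text")` and its pictures under `root.child("Images")`, while the user sees a single file.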

Examples of software that really uses compound files:
http://www.corel.com/corel with its .CDX format
http://www.amwa.tv/ with its .AWM file format
and there are others, though they are less well known.

Of course, a lot of Microsoft applications use them, but I don't want to mention them for religious reasons ;-)

A user of such applications may not even know that the internal representation of his documents is an entire file system.

In my usual work I treat compound files like a sort of archive: I store hundreds of them on my hard drive, and often copy, rename, move or zip them and send them via e-mail.

Regards,
Alexander Voitenko.

Re: [compond_file_binary] Gauging interest in a possible library submission.

Minh Phanivong

On 03/12/2012, at 6:33 AM, Alexander Voitenko <[hidden email]> wrote:

>> Does it provide any benefits compared to mounting a loop device and
>> working with it through traditional file system interfaces? It seems
>> odd to have a library duplicating file system operations.
>
> First of all, compound files are portable. You can use the same data, in the
> same compound files within different operating systems, not limited to
> Linux, Windows and Mac OS.
>
> Please note, that both Unix-like systems and Windows struggling with limit
> of loop devices
> http://www.tldp.org/HOWTO/CDServer-HOWTO/addloops.html [Linux]
> http://stackoverflow.com/questions/1944877/maximum-number-of-drives-in-windows
> [Windows]
>
> But with a such library you can simultaneously read and write
> thousands(millions?) of such files without restrictions from the system
> side. Library can be linked statically and only thing that you will require
> to access such file - to launch your application and not necessarily written
> in C++, because already exist analogous libraries in Java
> http://poi.apache.org and C# http://openmcdf.sourceforge.net worlds.
>
> Also, if I want to use lot of different compound files with my data within
> my application, it is to weird mount and unmount devices from the runtime
> and also require super-user privileges.
>
> Please do not think about compound files only as file system replacement.
> Imagine software that can produce some documents. Such documents use binary
> format and have very complex internal structure, so such structure can be
> represented as file system where some logical parts are grouped in folders
> and files.
>
> Example of such software that really uses compound files:
> http://www.corel.com/corel with it .CDX format
> http://www.amwa.tv/ with it .AWM file format
> and exist some others, but not so famous.
>
> Of course lot of Microsoft applications use them, but I don't want mention
> them by religious reasons ;-)
>
> User of such applications even can not know that internal representation of
> his documents is an entire file system.
>
> In my usual work I deal with compound files like with some sort of archives.
> Store hundreds of them on my hard drive, often copy, rename, move or zip
> them and send via e-mail.
>
> Regards,
> Alexander Voitenko.
>
>

I guess you can add tar, zip and iso to your examples as well; these should also be accessible at user level without a super-user/root mount.


Re: [compond_file_binary] Gauging interest in a possible library submission.

Klaim - Joël Lamotte
I'm interested in such a library.
Is it correct that the file content isn't compressed?

Joel Lamotte


Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
> Is it correct that the file content isn't compressed?
Yes, this is correct.

Of course, if you need to, you can create a stream inside a compound file and write data that is compressed or encrypted in some way. The overall overhead of representing binary data as a file system is small: about 0.8% of the user's data size.
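
A rough way to see where a figure of this size can come from, assuming the version 3 layout from [MS-CFB] with 512-byte sectors and one 4-byte FAT entry per sector: the FAT alone costs 4/512, about 0.78%. The helper below is only a back-of-the-envelope sketch. `fat_overhead_percent` is a made-up name, and the estimate ignores the header, the directory and the mini stream.

```cpp
#include <cassert>

// Percentage of space spent on FAT entries: one entry per data sector.
// Defaults assume [MS-CFB] version 3 (512-byte sectors, 4-byte entries).
inline double fat_overhead_percent(unsigned sector_size = 512,
                                   unsigned fat_entry_size = 4) {
    assert(sector_size > 0);
    return 100.0 * fat_entry_size / sector_size;
}
```

With the 4096-byte sectors of version 4 files, the same estimate drops below 0.1%.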

If you have any additional questions, please feel free to ask.

Re: [compond_file_binary] Gauging interest in a possible library submission.

Klaim - Joël Lamotte
On Mon, Dec 3, 2012 at 10:34 AM, Alexander Voitenko <[hidden email]> wrote:

> If you have any additional questions, please feel free to ask.


Yes, you mentioned the fact that it's not obvious to people not knowing the
format that the file is organized like a filesystem.
Does this mean that there is active obfuscation or is it just a side effect?

Joel Lamotte


Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
> Yes, you mentioned the fact that it's not obvious to people not knowing the
> format that the file is organized like a filesystem.
> Does this mean that there is active obfuscation or is it just a side effect?

The internal file system format is not obfuscated. Yes, this is just a side effect. For example, Microsoft Project's .mpp files are binary compound files inside; they are highly structured and contain about 100 directory entries (files and folders), but the user just opens them in MS Project or Open Project (a free open-source alternative), works with the content, and does not care how his or her data is organized inside the file.

Re: [compond_file_binary] Gauging interest in a possible library submission.

Brian Ravnsgaard Riis
Den 03-12-2012 11:47, Alexander Voitenko skrev:
> Internal file system format is not obfuscated. Yes, this is just a side
> effect. For example, Microsoft Project's .mpp files are binary compound
> files inside, highly structured and contain about 100 directory
> entries(files and folders), but user just open it within MS Project or Open
> Project(free open source alternative) and work with content and does not
> care about how his or her data is organized inside that file.

How does your intended solution differ from (or compare to) simply
writing documents as a (possibly renamed) .zip file?

Or maybe standardising (as such) an interface for this is the purpose of
your submission?

Regards,
  Brian Riis



Re: [compond_file_binary] Gauging interest in a possible library submission.

Gottlob Frege
On Tue, Dec 4, 2012 at 7:42 AM, Brian Ravnsgaard Riis
<[hidden email]> wrote:

>
> How does your intended solution differ from (or compare to) simply writing
> documents as a (possibly renamed) .zip file?
>
> Or maybe standardising (as such) an interface for this is the purpose of
> your submission?
>
> Regards,
>  Brian Riis
>
>
>
I think this is leaning towards my thinking as well: I'd be more interested
in a library that forms a generic interface to any hierarchical structured
file (tar, zip, compound files, etc), than a library that just did one
particular format.

i.e., a common interface with "pluggable" back-ends for various formats.
Would that be possible?

That would seem more "Boost-ish" to me.
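
A rough sketch of what such an interface could look like. Everything here (`container_backend`, `memory_backend`, `copy_all`) is hypothetical and only illustrates the shape of the idea: generic code talks to an abstract interface, and each format (compound file, zip, tar) would supply its own back-end. The in-memory back-end merely stands in for a real format.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical common interface; a real one would need far more operations.
class container_backend {
public:
    virtual ~container_backend() {}
    virtual std::vector<std::string> list() const = 0;
    virtual std::string read(const std::string& path) const = 0;
    virtual void write(const std::string& path, const std::string& data) = 0;
};

// Stand-in back-end that keeps its entries in memory.
class memory_backend : public container_backend {
public:
    std::vector<std::string> list() const override {
        std::vector<std::string> names;
        for (const auto& kv : entries_) names.push_back(kv.first);
        return names;
    }
    std::string read(const std::string& path) const override {
        auto it = entries_.find(path);
        assert(it != entries_.end());   // precondition: the entry exists
        return it->second;
    }
    void write(const std::string& path, const std::string& data) override {
        entries_[path] = data;
    }
private:
    std::map<std::string, std::string> entries_;
};

// Format-independent code: copy every entry from one container to another.
inline void copy_all(const container_backend& from, container_backend& to) {
    for (const std::string& name : from.list()) to.write(name, from.read(name));
}
```

Code like `copy_all` would then work unchanged for any pair of formats, which is where format-specific features that do not fit the common interface become the hard part.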


Tony


Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
In reply to this post by Brian Ravnsgaard Riis
> How does your intended solution differ from (or compare to) simply
> writing documents as a (possibly renamed) .zip file?

Yes, zip files without compression are similar to compound files in some ways. Both provide a hierarchy of directory entries and methods to iterate through them, and both provide access to the stored data.

I am not an expert in zip archives, so my assumptions may be wrong.
But I can see several advantages over zip files:
- As I understand it, zip files work sequentially: you put one file inside, then another, then you can delete some files, but only as whole chunks. With compound files you can open several streams and keep them open, some for writing and some for reading, and make modifications simultaneously. Of course this leads to data fragmentation, but that is another story ;-) With compound files you do not need to extract the internal content to modify it; you can do it "in place", even if you want to extend the stored data by writing a new chunk at the end of a stream. The compound file's internal file system provides all the facilities needed to do such operations at minimal cost.

- Faster entry lookup. All child entries of a directory are organized as a red-black tree at the format level, so the entire directory hierarchy looks like a tree of red-black trees.

- Faster entry deletion. As far as I can see, zip files explicitly remove deleted files and then recalculate the CRC checksum. Compound files only mark the sectors of a deleted entry as "unused" in the File Allocation Table. Yes, in the common case the actual data is not removed and can be recovered with tools such as hex editors, but this is similar to how other file systems behave.
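
The sector bookkeeping behind the first and third points can be sketched as follows. The FAT is modelled as a plain array in which entry i holds the number of the sector that follows sector i; the marker values `FREESECT` and `ENDOFCHAIN` are the ones defined in [MS-CFB]. The function names are made up for illustration and this is not the code of the library. It is a simplified sketch that ignores the mini FAT and the DIFAT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint32_t FREESECT   = 0xFFFFFFFFu;  // unallocated sector
const std::uint32_t ENDOFCHAIN = 0xFFFFFFFEu;  // last sector of a stream

// Append one sector to the chain that starts at `start`, in place.
// Reuses a free sector if there is one, otherwise grows the FAT.
inline std::uint32_t append_sector(std::vector<std::uint32_t>& fat,
                                   std::uint32_t start) {
    assert(start < fat.size());
    std::uint32_t last = start;
    while (fat[last] != ENDOFCHAIN) last = fat[last];   // walk to the tail
    std::size_t fresh = 0;
    while (fresh < fat.size() && fat[fresh] != FREESECT) ++fresh;
    if (fresh == fat.size()) fat.push_back(FREESECT);   // no free sector left
    fat[fresh] = ENDOFCHAIN;
    fat[last] = static_cast<std::uint32_t>(fresh);
    return static_cast<std::uint32_t>(fresh);
}

// Delete a stream: mark every sector of its chain as free.
// The sector contents themselves are not touched.
inline void free_chain(std::vector<std::uint32_t>& fat, std::uint32_t start) {
    std::uint32_t cur = start;
    while (cur != ENDOFCHAIN) {
        assert(cur < fat.size());
        std::uint32_t next = fat[cur];
        fat[cur] = FREESECT;
        cur = next;
    }
}
```

Neither operation moves existing data, which is why extending a stream fragments the file and why deleted content stays recoverable until its sectors are reused.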

Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
In reply to this post by Gottlob Frege
> ie a common interface with "pluggable" back-ends for various formats.
> Would that be possible?
I think yes, but with some restrictions, because some formats have unique features that are not shared with other formats.
What is the Boost policy on depending on third-party software? Of course, I do not want to reinvent the wheel and write yet another implementation of the zlib library.

> That would seem more "Boost-ish" to me.
Yes, I completely agree that this would be more "Boost-ish", and my current implementation lacks some generality. So, at this point, I will probably release my library as free-standing code on a site like GitHub or Google Code.


Re: [compond_file_binary] Gauging interest in a possible library submission.

Brian Ravnsgaard Riis
Den 05-12-2012 10:09, Alexander Voitenko skrev:
>> ie a common interface with "pluggable" back-ends for various formats.
>> Would that be possible?
> I think yes, but with some restrictions, because some formats have unique
> specific features that are not shared among another formats.
> What is the boost policy about depending on a third-party software? Of
> course, I do not want to reinvent the wheel and write yet another
> implementation of zlib library.

That's understandable, and should not be necessary.

I'd suggest trying to define a clean boundary between the interface and
the implementation. You already have one back-end that makes the entire
package usable. Adding another backend later, or as an option, that
supports zlib/bzip2/whatever compression should be easier then.

As long as the library has one working implementation that does not
depend on 3rd party software, I don't think there'll be any problems.
Boost.Locale already does this with ICU, which is required for certain
feature/platform combinations, but the library can be used without ICU;
it'll then not support the features that ICU provides. A similar
approach should be possible with zlib in your case.

Regards,
  Brian Riis



Re: [compond_file_binary] Gauging interest in a possible library submission.

Klaim - Joël Lamotte
On Wed, Dec 5, 2012 at 12:56 PM, Brian Ravnsgaard Riis <[hidden email]
> wrote:

> I'd suggest trying to define a clean boundary between the interface and
> the implementation. You already have one back-end that makes the entire
> package usable. Adding another backend later, or as an option, that
> supports zlib/bzip2/whatever compression should be easier then.


I totally agree with this.

I see tons of cases where such a library would have been useful to me,
where zip is an imperfect tool but a possible alternative.

Joel Lamotte


Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
In reply to this post by Brian Ravnsgaard Riis
> I'd suggest trying to define a clean boundary between the interface and
> the implementation. You already have one back-end that makes the entire
> package usable. Adding another backend later, or as an option, that
> supports zlib/bzip2/whatever compression should be easier then.

I agree with your thoughts, Brian. To be sure this goal is achieved, I must investigate the possibility thoroughly and implement at least one more back-end.

For now I have several questions about integration with Boost.

1) Compound files use UTF-16 encoding for names, and I use a third-party library for UTF-8 <-> UTF-16 conversions:
http://utfcpp.sourceforge.net/
But it looks like Boost.Locale can be used as a replacement. Can it?

2) I like TDD and automated tests, so I have implemented a lot of them (about 7000 lines for now, and I plan to add more). Most of my tests are not unit tests but integration tests; for such software, I think testing in terms of usage scenarios is better. As a result, I have a bunch of test data files totalling about 100 MB. Of course it is unacceptable to ship all of this within the Boost distribution. Is it possible to split the tests? Module tests could be included in the Boost distribution available to the end user, while the integration tests stay in the repository and are used only in nightly builds or whatever you have there.

3) Compound files use a red-black tree as a feature of the format. To deal with compound files correctly, I need access to low-level data of the red-black tree, such as the color of a node. I searched the Internet for a free, well-tested C++ library that allows access to all the data within a tree but found none, so I decided to write my own implementation using a TDD approach.

It was developed according to this book:
http://www.amazon.com/Introduction-Algorithms-Includes-CD-Rom-Thomas/dp/0072970545
with my tiny modifications. ]:)
Is there any interest in binary and red-black tree implementations as a separate component? Or does Boost already have such facilities, and am I wasting my time reinventing one more wheel? :-)

4) For integral types I use the typedefs from <stdint.h>. Is that acceptable, or should I switch to the similar types from Boost?
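
To make point 3 concrete: in [MS-CFB] each 128-byte directory entry stores its own red-black tree links, so the library has to read and write the node color and sibling IDs directly. The sketch below decodes those fields from a raw entry. The offsets follow the directory entry layout in the specification as I read it; the struct and function names are only illustrative and are not the interface of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum node_color { red = 0, black = 1 };   // color flag values from [MS-CFB]

struct tree_links {                        // illustrative name
    node_color color;
    std::uint32_t left_sibling;
    std::uint32_t right_sibling;
    std::uint32_t child;
};

// Read a little-endian 32-bit value from a byte buffer.
inline std::uint32_t read_u32le(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | static_cast<std::uint32_t>(p[1]) << 8
         | static_cast<std::uint32_t>(p[2]) << 16
         | static_cast<std::uint32_t>(p[3]) << 24;
}

// Decode the tree-related fields of one 128-byte directory entry.
inline tree_links parse_tree_links(const unsigned char* entry, std::size_t size) {
    assert(entry != 0 && size >= 128);
    tree_links t;
    t.color         = entry[67] == 0 ? red : black;  // offset 67: color flag
    t.left_sibling  = read_u32le(entry + 68);        // offset 68: left sibling ID
    t.right_sibling = read_u32le(entry + 72);        // offset 72: right sibling ID
    t.child         = read_u32le(entry + 76);        // offset 76: child ID
    return t;
}
```

A general-purpose container such as std::map hides exactly these fields, which is why I could not reuse one.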

Regards,
Alexander.

Re: [compond_file_binary] Gauging interest in a possible library submission.

Hartmut Kaiser
In reply to this post by Alexander Voitenko

> Is there any interest in a cross-platform C++ library which allows to
> create/read/write binary compound files?
>
> Briefly, compound file is filesystem for storing files and directories
> within a single file on a disk. Initially compound file format is
> developed by Microsoft and now is the part of Microsoft's Open
> Specifications Documentation. Compound files are used across various
> platforms and applications, and not restricted by some specific domain. In
> general, any application can store information in such a way. I've found
> on the Internet interest in such library and decided to implement it.
>
> Additional information can be found here:
>
> Article on the Wikipedia:
> http://en.wikipedia.org/wiki/Compound_File_Binary_Format
>
> Benefits of Compound Files:
> http://msdn.microsoft.com/en-us/library/windows/desktop/aa378938(v=vs.85).aspx
>
> [MS-CFB] Open specification for compound files:
> http://msdn.microsoft.com/en-us/library/dd942138.aspx

I'm interested.

Does your library share the property with the MS compound file
implementation that it guarantees fail safety and data consistency even
under power loss scenarios?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu





Re: [compond_file_binary] Gauging interest in a possible library submission.

Hartmut Kaiser
In reply to this post by Alexander Voitenko

> > Is there any interest in a cross-platform C++ library which allows to
> > create/read/write binary compound files?
> >
> > Briefly, compound file is filesystem for storing files and directories
> > within a single file on a disk. Initially compound file format is
> > developed by Microsoft and now is the part of Microsoft's Open
> > Specifications Documentation. Compound files are used across various
> > platforms and applications, and not restricted by some specific
> > domain. In general, any application can store information in such a
> > way. I've found on the Internet interest in such library and decided to
> > implement it.
> >
> > Additional information can be found here:
> >
> > Article on the Wikipedia:
> > http://en.wikipedia.org/wiki/Compound_File_Binary_Format
> >
> > Benefits of Compound Files:
> > http://msdn.microsoft.com/en-us/library/windows/desktop/aa378938(v=vs.85).aspx
> >
> > [MS-CFB] Open specification for compound files:
> > http://msdn.microsoft.com/en-us/library/dd942138.aspx
>
> I'm interested.
>
> Does your library share the property with the MS compound file
> implementation that it guarantees fail safety and data consistency even
> under power loss scenarios?

More generally, does your library support the properties of the MS compound
storage implementation as described here:

 
http://msdn.microsoft.com/en-us/library/windows/desktop/aa378871(v=vs.85).aspx

?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu





Re: [compond_file_binary] Gauging interest in a possible library submission.

Alexander Voitenko
First of all, I want to mention that my library is not finished yet. I am close to finishing the features planned for the first release, but some additional effort is needed. I do not plan to support all the features of the MS implementation in the first release; I want to develop the library in an evolutionary way. I also plan to add some features in later releases that the MS implementation does not support: storage compacting and defragmentation. I found on the Internet that users need them.


> More generally, does your library support the properties of the MS compound
> storage implementation as described here:
>
>
> http://msdn.microsoft.com/en-us/library/windows/desktop/aa378871(v=vs.85).aspx ?

Here is that list with my comments.

[Incremental access.]
Implemented and tested.

[Multiple use.]
Not implemented, but I often think about it. The main problem: I cannot decide whether to use the C++11 threading model or switch to a cross-platform library (Boost.Thread, maybe?).
For now, one thread per compound file is supported.
Also, implementing and testing threading will require a lot of effort, and I do not plan to include multithreading support in the first release.

[Transaction processing.]
Not supported. Planned for later releases.

[Low-memory saves.]
Yes. I designed the library to be memory efficient. All needed memory is allocated when a stream is opened.
I plan to test memory issues by emulating std::bad_alloc in various places.
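
One way such a test could be set up is sketched below, assuming the code under test accepts a standard-style allocator. `failing_allocator`, `budget` and `fill_fails` are hypothetical helpers, not part of the library: the allocator throws std::bad_alloc once a preset number of allocations has been used up.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Number of allocations still allowed before failure is simulated.
// A global counter keeps the sketch short; it is not thread-safe.
inline int& budget() { static int remaining = 0; return remaining; }

template <class T>
struct failing_allocator {
    typedef T value_type;
    failing_allocator() {}
    template <class U> failing_allocator(const failing_allocator<U>&) {}

    T* allocate(std::size_t n) {
        if (budget() <= 0) throw std::bad_alloc();   // simulated exhaustion
        --budget();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};

template <class T, class U>
bool operator==(const failing_allocator<T>&, const failing_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const failing_allocator<T>&, const failing_allocator<U>&) { return false; }

// Returns true if filling a vector with `count` elements hit the simulated failure.
inline bool fill_fails(std::size_t count, int allowed_allocations) {
    assert(allowed_allocations >= 0);
    budget() = allowed_allocations;
    try {
        std::vector<int, failing_allocator<int> > v;
        for (std::size_t i = 0; i < count; ++i) v.push_back(static_cast<int>(i));
    } catch (const std::bad_alloc&) {
        return true;
    }
    return false;
}
```

Running the same scenario with the budget set to 0, 1, 2 and so on would exercise a failure at each allocation point in turn.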