Serialization cumulatively.

Serialization cumulatively.

tcamuso

I'm trying to serialize the data from multiple passes of an app
on various files. Each pass generates data that must be serialized.
If I simply append each new serialization to the file, then only the
first archive ever gets deserialized, because the record count read
at deserialization time describes only that first archive.

What I'm doing is deserializing on each new pass, deleting the
original file, and then serializing everything again with the
new information.

If there were only a few files to process, this would not be a
problem. However there are thousands of files.

Additionally, on each new pass, I am checking to see if a
certain type of record has already been saved. So, with every
pass, I must look up in a deeper and deeper database.

Currently, it's taking almost an hour to process about 3000
files, with an average of 55,000 lines per file. It is a
huge amount of data.

I'm looking for a way to reduce the time this processing takes.

Does anybody have a better idea than to cycle through the
serialize-deserialize-lookup-serialize sequence for each
file?

Re: Serialization cumulatively.

Steven Clark
There are probably some constraints you didn't mention.  Here are some ideas based on various different guesses.

* At 80 bytes per line, that's a total of about 15 GB of data.  With a moderately beefy computer you can hold it all in memory.

* You can store the intermediate results unserialized, just dumping your structs into files.  Only serialize when you're finished.  Or, keep all your intermediate results in memory until you're finished.

* Depending on what you're doing, using an actual database to store your intermediate results might improve performance.

* Reorganize your algorithm so it computes the final results for a file in one pass.  Perhaps you can read each file, store some information in memory, then write results for each file.

* Store the intermediate results for all 3000 files in one file.  Mmap the intermediate results file; this is another variation of the suggestion not to serialize intermediate results.

* Fix the program that reads the serialized files, so that it can read an arbitrary number of serialized records rather than just one.  I'm sure this can be done - slurp in a serialized record, see if you're at the end of file, if not then repeat.
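
A minimal sketch of that last idea - each pass appending its own complete text archive, and the reader looping until end of file - assuming a placeholder record type and file name (untested; the delimiter handling between appended archives may need adjustment for your Boost version):

#include <fstream>
#include <istream>
#include <string>
#include <vector>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/vector.hpp>

// Placeholder record type; substitute whatever is actually serialized.
struct record {
    int value = 0;
    template<class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) { ar & value; }
};

// Each pass appends one self-contained archive to the same file.
void append_pass(const std::string &path, const std::vector<record> &recs)
{
    std::ofstream ofs(path, std::ios::app);
    {
        boost::archive::text_oarchive oa(ofs);
        oa << recs;                 // recs is a const ref, which keeps the library's const check happy
    }                               // archive is complete here
    ofs << '\n';                    // delimiter between appended archives
}

// Read archives back to back until the file is exhausted.
std::vector<record> read_every_pass(const std::string &path)
{
    std::vector<record> all;
    std::ifstream ifs(path);
    while (ifs.peek() != std::ifstream::traits_type::eof()) {
        std::vector<record> recs;
        boost::archive::text_iarchive ia(ifs);   // one archive per appended pass
        ia >> recs;
        all.insert(all.end(), recs.begin(), recs.end());
        ifs >> std::ws;             // skip the delimiter before the next header
    }
    return all;
}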

If none of these ideas are useful, at least they should help point out what other constraints you have, that were not evident in your first message.

Steven J. Clark
VGo Communications


Re: Serialization cumulatively.

tcamuso
On 03/12/2015 10:58 AM, Steven Clark wrote:
> There are probably some constraints you didn't mention.

Of course. :)

> Here are some ideas based on various different guesses.

And thank you so much for taking the time to respond to my post.

> * At 80 bytes per line, that's a total of about 15 GB of data.  With
> a moderately beefy computer you can hold it all in memory.
>
> * You can store the intermediate results unserialized, just dumping
> your structs into files.  Only serialize when you're finished.  Or,

True that, but one of the details I omitted is that this app is linked
with libsparse, which is like lint on steroids. This tool parses
preprocessed files and creates a tree in memory of all the symbols in
the file. My code walks this tree to create a database of info germane
to our purposes. Of course, this uses more memory again. With about
3000 files to process, there isn't enough memory on the average
workstation to contain it all at once.

When I tried to do this all in memory, even a big kahuna machine
with 32 GB of memory and 48 cores tanked after about the 100th
file.

> * Depending on what you're doing, using an actual database to store
> your intermediate results might improve performance.

Tried that. The performance of boost serialization trumps the
performance of a dbms. :)

> * Reorganize your algorithm so it computes the final results for a
> file in one pass.  Perhaps you can read each file, store some
> information in memory, then write results for each file.
>
> * Store the intermediate results for all 3000 files in one file.
> Mmap the intermediate results file; this is another variation of the
> suggestion not to serialize intermediate results.
>
> * Fix the program that reads the serialized files, so that it can
> read an arbitrary number of serialized records rather than just one.
> I'm sure this can be done - slurp in a serialized record, see if
> you're at the end of file, if not then repeat.

These steps offer the most promise.

The code already reads all the serialized records into a vector in
memory with one deserialization call.

The fault lies in the algorithm I am using to manage duplicate
symbols when I encounter them.

What I do for every symbol is ...

. create a new node (vertex)
. search the existing list for duplicates
. if the symbol is a duplicate, add its connections (edges) to the
   pre-existing node and delete the new node.
. next

Performance drops from about 3 files per second to less than one per
second by the end. For the 3000+ files, it takes more than 50 minutes
on an 8-core machine with 16 GB of memory.
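
In code, the merge loop described above amounts to something like the
following sketch (the types are placeholders; the linear search over
the whole vector is what dominates the run time):

#include <algorithm>
#include <string>
#include <vector>

// Placeholder types; the real nodes and edges carry much more information.
struct edge { std::string to; };

struct node {
    std::string symbol;
    std::vector<edge> edges;
};

// Called once per symbol encountered in a file.
void add_symbol(std::vector<node> &graph, const node &fresh)
{
    // Linear scan for a duplicate - O(graph.size()) per symbol, so the
    // whole build grows roughly quadratically with the symbol count.
    auto it = std::find_if(graph.begin(), graph.end(),
        [&](const node &n) { return n.symbol == fresh.symbol; });

    if (it != graph.end()) {
        // Duplicate: fold the new node's connections (edges) into the
        // pre-existing node and drop the new node.
        it->edges.insert(it->edges.end(),
                         fresh.edges.begin(), fresh.edges.end());
    } else {
        graph.push_back(fresh);
    }
}

Keying the lookup with something like an unordered_map from symbol name
to vector index would make each lookup roughly constant time instead of
linear.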

To speed things up, I've created a nodes-only list, which reduces
the size of the vector to be searched by a factor of 4. I haven't
got this working yet, so I can't say what the performance gain
will be.

> If none of these ideas are useful, at least they should help point
> out what other constraints you have, that were not evident in your
> first message.
>
> Steven J. Clark VGo Communications

Many thanks, Steven. I realize how busy everybody is, and I really
appreciate the thoughtful and valuable input.

Regards,
Tony

Re: Serialization cumulatively.

Robert Ramey
In reply to this post by tcamuso
Tony Camuso wrote
> I'm trying to serialize the data from multiple passes of an app
> on various files. Each pass generates data that must be serialized.
> If I simply append each serialization, then deserialization will
> only occur for the first instance, since the number of records
> read by the deserialization will only be for the first instance.
>
> What I'm doing is deserializing on each new pass, deleting the
> original file, and then serializing everything again with the
> new information.
I'm not sure I understand what you're trying to do - but of course this is the list so I can just answer anyway.

Why doesn't the following work?
text_oarchive oa

struct serializable_data {...} data;

loop n times
    oa << data;
    // alter data
endloop

// close archive

// later
text_iarchive ia
loop n times
    ia >> data
    // do something with the data
endloop

OK, a couple of problems:
a) tracking prevents writing the same data multiple times
b) serialization requires that the data be const, precisely to keep users
from doing this exact sort of thing - which would be a mistake in the
presence of tracking.

Solution - turn tracking off and cast away constness:

text_oarchive oa

struct serializable_data {...} data;

BOOST_CLASS_TRACKING(serializable_data, boost::serialization::track_never)

loop n times
    oa << const_cast<serializable_data &>(data);
    // alter data
endloop

// close archive

// later
text_iarchive ia
loop n times
    ia >> data
    // do something with the data
endloop
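
A compilable version of the sketch above, assuming a trivial record type
and leaving out error handling, might look like this (with tracking
disabled, the const_cast in the sketch should no longer be needed):

#include <fstream>
#include <string>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <boost/serialization/tracking.hpp>

struct serializable_data {
    std::string symbol;
    int count = 0;
    template<class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) {
        ar & symbol & count;
    }
};

// Disable address tracking so the same object can be written
// repeatedly with different contents.
BOOST_CLASS_TRACKING(serializable_data, boost::serialization::track_never)

void write_n(const char *path, int n)
{
    std::ofstream ofs(path);
    boost::archive::text_oarchive oa(ofs);
    serializable_data data;
    for (int i = 0; i < n; ++i) {
        data.symbol = "sym";
        data.count = i;
        // With track_never in effect, the non-const object can be
        // streamed directly; no cast is required.
        oa << data;
    }
}

void read_n(const char *path, int n)
{
    std::ifstream ifs(path);
    boost::archive::text_iarchive ia(ifs);
    serializable_data data;
    for (int i = 0; i < n; ++i) {
        ia >> data;
        // do something with the data
    }
}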

I have no idea if this is helpful - but maybe it's food for thought

Robert Ramey

Re: Serialization cumulatively.

tcamuso
On 03/13/2015 12:35 PM, Robert Ramey wrote:

> Tony Camuso wrote
>> I'm trying to serialize the data from multiple passes of an app
>> on various files. Each pass generates data that must be serialized.
>> If I simply append each serialization, then deserialization will
>> only occur for the first instance, since the number of records
>> read by the deserialization will only be for the first instance.
>>
>> What I'm doing is deserializing on each new pass, deleting the
>> original file, and then serializing everything again with the
>> new information.
>
> I'm not sure I understand what you're trying to do - but of course this is
> the list so I can just answer anyway.

Hi, Robert. I sent a response to Steven this morning, posted here ...
http://lists.boost.org/boost-users/2015/03/83963.php
... that gives a little more detail about what I'm trying to do.

> Why doesn't the following work?
>
> Ok a couple of problems:
> a) tracking prevents writing of data multiple times
> b) serialization requires that data be const just to prevent users from
> doing this exact sort of thing - which is a mistake in the presence of
> tracking.
>
> Solution - turn tracking off and cast away constness

How do I disable tracking? That sounds like it may be very useful. I must
do my own tracking, as described in my response to Steven, so having boost
track for me is redundant, and probably hurts performance.

By "cast away constness" do you mean a const_cast from
"const type" to "type"?
 
> I have no idea if this is helpful - but maybe it's food for thought
>
> Robert Ramey

Thanks, Robert. I appreciate any and all input.

Regards,
Tony Camuso


Re: Serialization cumulatively.

Robert Ramey
Hmmm - I included a code sketch of what I had in mind.  Does it not show up?

Robert Ramey

Re: Serialization cumulatively.

tcamuso
On 03/14/2015 05:22 PM, Robert Ramey wrote:
> Hmmm - I included a code sketch of what I had in mind.  Does it not show up?
>
> Robert Ramey
>

It shows up on the nabble link you gave me, but not on the boost users list at
http://lists.boost.org/boost-users/2015/03/83965.php

Thanks for the link!


Re: Serialization cumulatively.

tcamuso
In reply to this post by tcamuso
Hi, Robert.

I would have answered sooner, but had other issues arise.

I had a look at your code, and that's basically what I'm already doing.

Problem is that the time to process the files this way grows worse than linearly, since processing each file takes incrementally longer as the database grows. It takes about an hour to process around 3000 files holding about 15 GB of data. Sounds reasonable, until you compare it to the compiler, which whizzes through all the same files, and more, in only a few minutes.

When I serialize the output without trying to recreate the whole database for each file, the length of time to process these 3000 files drops to about 5 minutes, which is a much more acceptable number for my target users. This yields one very large file with about 3000 appended serializations.

What I'd like to do, because I think it would be much faster, is to go through the one big file and deserialize each of those serializations as they are encountered. Early testing showed that it would only take a few minutes to integrate these pieces into one whole.

If there were linefeeds in the serialized data, the code to do this would be much simpler.

Is there another, more architected way for me to deserialize an aggregate of serializations?


Re: Serialization cumulatively.

Robert Ramey
tcamuso wrote
> Problem is that the time to process the files this way grows worse than linearly ....
I would think that this is solvable, but I can't really comment without spending significant time looking at the specific code.
> If there were linefeeds in the serialized data, the code to do this would be much simpler.
I don't even remember if there should be line feeds in there.  Certainly the xml archives have linefeeds.
But again, I'd have to spend a lot of time looking at your specific case.  Of course you could hire me by the hour if you like.
> Is there another, more architected way for me to deserialize an aggregate of serializations?
I think what you want to do should be possible in an efficient way.  However, it would require spending enough time with the library to understand how it works at a deeper level.  I realize that this defeats the original appeal of the library to some extent.  But it's still better than writing a new system from scratch.

Robert Ramey

Re: Serialization cumulatively.

tcamuso
On 03/20/2015 12:17 PM, Robert Ramey [via Boost] wrote:
> tcamuso wrote
>
>> Problem is that the time to process the files this way grows worse than linearly,
 
> I would think that this is solvable, but I can't really comment
> without spending significant time looking at the specific code.

Understood.

>> If there were linefeeds in the serialized data, the code to do this
>>  would be much simpler.

> I don't even remember if there should be line feeds in there.
> certainly xml archives have linefeeds. But again, I'd have to spend
> a lot of time looking at your specific case.

Hmm... can I save a text archive as XML? Does the serializer care
whether XML tags are present?

Interestingly, the text archiver was giving me linefeeds for a while.
Now they aren't there. I didn't change any of the serialization code,
but I did change the classes and structs that get serialized.

> Of course you could hire me by the hour if you like.

:) I don't have that kind of money.

> Is there another, more architected way for me to deserialize an
> aggregate of serializations?
>
> I think what you want to do should be possible in an efficient way.
> However, it would require spending enough time with the library to
> understand how it works at a deeper level.

Unfortunately, time is of the essence.

> I realized that this defeats the original appeal of the library to
> some extent.  But it's still better than writing a new system from
> scratch.

Agreed. I need to get to the knee point with this app, then I can
address these things in maintenance mode.

> Robert Ramey

Re: Serialization cumulatively.

Nat Goodspeed-2
In reply to this post by tcamuso
On Fri, Mar 20, 2015 at 10:37 AM, tcamuso <[hidden email]> wrote:

> What I'd like to do, because i think it would be much faster, is to go
> through the one big file and deserialize each of those serializations as
> they are encountered. Early testing showed that it would only take a few
> minutes to integrate these pieces into one whole.

If there were some simple way for user code to recognize the end of an
archive, maybe you could interpose an input filter using
Boost.Iostreams. The filter would present EOF to its caller on
spotting the end of archive, but leave the underlying file open (with
its read pointer adjusted to immediately after the end of archive) for
the application to bind another instance of the same filter onto the
same underlying file.

> If there were linefeeds in the serialized data, the code to do this would be
> much simpler.

Maybe you could derive an archive type yourself from one of the
existing ones that differs only in appending an easily-recognizable
marker when finished writing?
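
Short of deriving a new archive class, the marker idea can be
approximated at the application level: render each archive into a
string, write it out followed by a sentinel line, and on the read side
split on the sentinel before handing each chunk to a text_iarchive. A
rough sketch, with a made-up sentinel and a caller-supplied record type:

#include <fstream>
#include <sstream>
#include <string>
#include <boost/archive/text_iarchive.hpp>
#include <boost/archive/text_oarchive.hpp>

// Made-up sentinel; pick something that cannot occur in your data.
static const std::string kEndMarker = "---end-of-archive---";

// Append one archive plus a trailing marker line to the big file.
template<class T>
void append_archive(std::ofstream &out, const T &data)
{
    std::ostringstream ss;
    {
        boost::archive::text_oarchive oa(ss);
        oa << data;
    }
    out << ss.str() << '\n' << kEndMarker << '\n';
}

// Read the next archive, i.e. everything up to the marker line.
// Returns false when the file is exhausted.
template<class T>
bool read_next_archive(std::ifstream &in, T &data)
{
    std::string chunk, line;
    while (std::getline(in, line) && line != kEndMarker) {
        chunk += line;
        chunk += '\n';
    }
    if (chunk.empty())
        return false;
    std::istringstream ss(chunk);
    boost::archive::text_iarchive ia(ss);
    ia >> data;
    return true;
}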

Re: Serialization cumulatively.

tcamuso
On 03/20/2015 02:18 PM, Nat Goodspeed wrote:

> On Fri, Mar 20, 2015 at 10:37 AM, tcamuso <[hidden email]> wrote:
>
>> What I'd like to do, because i think it would be much faster, is to go
>> through the one big file and deserialize each of those serializations as
>> they are encountered. Early testing showed that it would only take a few
>> minutes to integrate these pieces into one whole.
>
> If there were some simple way for user code to recognize the end of an
> archive, maybe you could interpose an input filter using
> Boost.Iostreams. The filter would present EOF to its caller on
> spotting the end of archive, but leave the underlying file open (with
> its read pointer adjusted to immediately after the end of archive) for
> the application to bind another instance of the same filter onto the
> same underlying file.

I must look into Boost.Iostreams, because I can recognize the end of an
archive by detecting the "serialization::archive" string at the
beginning of another. This is a hack, I realize, and there's no guarantee
that this banner won't change in the future.

>> If there were linefeeds in the serialized data, the code to do this would be
>> much simpler.
>
> Maybe you could derive an archive type yourself from one of the
> existing ones that differs only in appending an easily-recognizable
> marker when finished writing?

This may be what's needed. I'm almost out of time on this project, so I
may have to eat the long time it takes to build the database and revisit
this another day when I'm in maintenance mode. I will be happy to post my
results.


Re: Serialization cumulatively.

Robert Ramey
In reply to this post by tcamuso
tcamuso wrote
> Interestingly, the text archiver was giving me linefeeds for a while.
I always thought it worked that way.  But I don't remember.
> Now they aren't there.
which surprises me.
>> Of course you could hire me by the hour if you like.
> :) I don't have that kind of money.
how much does it cost all your customers to run your program for hours?

Robert Ramey

Re: Serialization cumulatively.

tcamuso
On 03/20/2015 03:22 PM, Robert Ramey wrote:

> tcamuso wrote
>> Interestingly, the text archiver was giving me linefeeds for a while.
>
> I always thought it worked that way.  But I don't remember.
>
>> Now they aren't there.
>
> which surprise me.
>
>>> Of course you could hire me by the hour if you like.
>> :) I don't have that kind of money.
>
> how much does it cost all your customers to run your program for hours?

My customers are my fellow engineers who will likely run it as
a cron job at night, with all their other cron jobs. However,
there are times when you need to refresh on the spot, and waiting
an hour is a hideous prospect. Of course, most of us are balancing
more than one thing at a time, so it's just another context switch
in the big scheme of things.

Basically, what this thing does is look for exported symbols in the
Linux kernel. It uses the sparse library to do this. We are looking
for deeply nested structures that could affect the kernel application
binary interface (KABI). If changes are made to those structures
that are not KABI-safe, then problems can emerge with 3rd party apps
that use the KABI.

The idea is to provide kernel developers with a tool that can
expose whether the data structure they are considering for change
could affect the KABI. We have means to protect such changes,
but it's difficult to know when to use them without a tool that
can plumb the depths looking for any and all dependencies an
exported symbol may have.

Many thanks and warm regards,
Tony Camuso
Platform Enablement
Red Hat


Re: Serialization cumulatively.

tcamuso
Greetings Robert.

Given the assistance you and the other boost cognoscenti
provided while I was developing my project, I feel that I
owe you an update.

What I decided to do in the end was to use a distributed
database model. The code generates a data file for each
preprocessed kernel source file. Rather than squashing
those together into one large database, I left them
distributed in their respective source directories.

Processing the whole kernel now takes only about 5 minutes on my
desktop. The lookup utility can find anything in less than a minute.
Performance is enhanced all around, though the database taken
collectively is about ten times larger than if I had squashed it
into one file. The trade-off of disk space for performance was
well worth it.

The project is at a decent knee-point, though there are a
few things I'm sure my fellow engineers will want to add
or change.

You can track the progress of the project at
https://github.com/camuso/kabiparser

Thanks and regards,
Tony Camuso
Red Hat Platform Kernel
