caching fusion map (idea/query)


caching fusion map (idea/query)

Josh
I'm trying to figure out a good way to implement a lazy and transparent value cache ...by which I mean something like a fusion map which, whenever queried for an actual value, first ensures that said
value is up-to-date before returning it.

The up-to-date mechanism would be maintained according to some internal flags such as 'up-to-date', 'changed-since-last-used', and/or version counters for the various values. I want it to be as efficient as possible (especially regarding all possible compiler optimizations), and allow flexibility regarding the key types and actual data.

After some thought (and much mopping of brows), I am beginning to convince myself that the way to do this is to make a new sequence type, as illustrated in the triple.cpp example program. The new sequence would basically wrap/duplicate the fusion map, but add the necessary flags/version counters, and implement all query/assignment functions via the cache maintaining functions.
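To make the idea concrete, here is a hand-written version of roughly what the generated code should amount to (the key types, the single flag and the inline recomputation are purely illustrative):

#include <boost/fusion/include/map.hpp>
#include <boost/fusion/include/at_key.hpp>

// illustrative key types
struct radius {};
struct area   {};

// minimal sketch: wrap a fusion::map, add a flag, and route every
// query/assignment through the cache-maintenance code
template <typename Map>
class cached_map
{
    Map  data_;
    bool area_dirty_;   // the real thing would generate one flag/counter per value

public:
    explicit cached_map(Map const& m) : data_(m), area_dirty_(true) {}

    // assignment is monitored so dependants can be flagged
    void set_radius(double r)
    {
        boost::fusion::at_key<radius>(data_) = r;
        area_dirty_ = true;
    }

    // query first ensures the value is up-to-date, then returns it
    double const& get_area()
    {
        if (area_dirty_)
        {
            double const r = boost::fusion::at_key<radius>(data_);
            boost::fusion::at_key<area>(data_) = 3.14159265 * r * r;
            area_dirty_ = false;
        }
        return boost::fusion::at_key<area>(data_);
    }
};

The real thing would of course generate the flags/counters from the dependency specification rather than hard-coding them like this.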

I'm thinking that the dependency relationships among the values could be specified using type symbols such as,
'depends<id1, mpl::set<id2, id3> >' which would indicate that the data associated with type symbol 'id1' becomes out-of-date whenever the data associated with 'id2' or 'id3' is changed. These dependency symbols could be given to the cache type as template parameters, and the function for recomputing the value associated with 'id1' could be a static method in a template class parameterized by a tag type associated with the cache as another template parameter.
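In rough outline, that specification might look like this (id1/id2/id3, the cache tag and the 'recompute' name are all just placeholders):

#include <boost/mpl/set.hpp>

struct id1 {}; struct id2 {}; struct id3 {};   // illustrative key types
struct my_cache_tag {};                        // tag identifying a particular cache

// 'id1 is out-of-date whenever id2 or id3 changes'
template<typename Id, typename Sources> struct depends {};
typedef depends<id1, boost::mpl::set<id2, id3> > id1_dependencies;

// recompute function as a static method of a class parameterized by a cache tag
template<typename CacheTag, typename Id> struct recompute;

template<> struct recompute<my_cache_tag, id1>
{
    template<typename Cache>
    static void apply(Cache& c);   // would read id2/id3 from c and write id1
};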

(The type symbols representing dependencies would probably be a bit more sophisticated, because I'm thinking of various ways of handling the propagation of 'out-of-date' information, e.g., flag a value as out-of-date whenever a dependency is changed, or instead, mark a value with dependent values as changed-since-last-used, or instead, mark values with an incremental version counter; as well as, orthogonally, dependencies known at compile time and/or modifiable at runtime.)

Another aspect of sophistication I'd like to include is assignment/query of multiple values simultaneously, since this may affect the optimal way to manage the cache flags/update the data. E.g., if C depends on both A and B, then assigning A and then assigning B incurs two checks/automatic updates of the value of C, but an assignment that somehow encodes the simultaneity can avoid this. This could be achieved with a template assignment function like 'assign<id1, id2>(cache, value1, value2);' or something like that...
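Something of this shape, splitting the plain writes from a single propagation pass (raw_set and mark_dependants_out_of_date are hypothetical cache members):

// write both values, then do one cache-maintenance pass, rather than
// flagging/updating C twice
template<typename Id1, typename Id2, typename Cache, typename V1, typename V2>
void assign(Cache& cache, V1 const& v1, V2 const& v2)
{
    cache.template raw_set<Id1>(v1);                         // plain write, no propagation
    cache.template raw_set<Id2>(v2);
    cache.template mark_dependants_out_of_date<Id1, Id2>();  // one pass for both keys
}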

Anyway, I'd be very grateful for any responses, but in particular:

  - this is pointless (because...)
  - this has been done (look at...)
  - you should also...

and most of all:

  - there's a better way (which is...)


Thanks for reading this.

Joshua Hale.



Re: caching fusion map (idea/query)

Joel de Guzman
On 9/8/2011 10:38 AM, Josh wrote:

> I'm trying to figure out a good way to implement a lazy and transparent value cache
> ...by which I mean something like a fusion map which, whenever queried for an actual
> value, first ensures that said value is up-to-date before returning it.
> [...]
> Anyway, I'd be very grateful for any responses, but in particular:
> [...]
> - there's a better way (which is...)

I think this makes perfect sense. I am not aware if this has been done
before. Go for it!

While you are on it; write a blog about it :-)

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://boost-spirit.com





Re: caching fusion map (idea/query)

Josh

Thanks for the encouragement and confirmation that I am not duplicating an effort already made.

It seems like a mechanism with considerable utility to me, but I suppose that's hard to judge.

Either way, I guess it's not very complicated:

(i) Implement all the functions that return actual data, like 'at_key<Id>' etc., as "cache-safe" and
strictly "query-only", where "cache-safe" means that they first invoke the cache maintenance
function/s to ensure that requested values are up-to-date, and "query-only" means that they
return by value or const reference.

(ii) Add corresponding new cache-specific functions like 'set_at_key<Id>' which allow assignment
monitored by the cache maintenance function/s.

(iii) Add more new functions like 'at_keys<Id1, Id2, ...>(type1&, type2&)'. This could be done
with a type sequence and a single tie argument, but I want to avoid the cost of constructing the
tie and facilitate inlining optimizations through a transparent 'at_keys' (I guess using a tie
blocks them...?). A rough sketch of (ii) and (iii) follows this list.

(iv) Use a bit of MPL to build the flags and version counters needed by the cache, and write a few
functions to maintain the cache in response to single/multiple assignments/queries.
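Here's that sketch of (ii) and (iii) (raw_ref, refresh and mark_dependants_out_of_date are hypothetical internals of the cache, not settled names):

// (ii) assignment monitored by the cache
template<typename Id, typename Cache, typename T>
void set_at_key(Cache& c, T const& v)
{
    c.template raw_ref<Id>() = v;                      // direct access to the stored value
    c.template mark_dependants_out_of_date<Id>();      // cache maintenance
}

// (iii) query several keys at once without building a tie/tuple
template<typename Id1, typename Id2, typename Cache, typename T1, typename T2>
void at_keys(Cache& c, T1& out1, T2& out2)
{
    c.template refresh<Id1, Id2>();                    // one maintenance pass for both keys
    out1 = c.template raw_ref<Id1>();
    out2 = c.template raw_ref<Id2>();
}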

There are a few design choices that I'm not sure about:

I think the cache type could be designed in such a way that it requires no more specification than
the id types and actual data types, i.e., all update functions and dependency relationships would
be specified with template specializations expected by the cache implementation. The advantage of
this would be that cache types and other containers could be easily converted using functions
like 'as_list'. On the other hand, that probably means that you would have to jump through some
hoops if you wanted to use the same key types in different contexts (two different cache types
containing similar data would both identify the same update functions if they are encoded by key
type alone). Maybe that's not such a bad thing though.

Alternatively, the update functions and dependency relationships could be specified via a tag
type given as an extra template parameter to the cache type (or as a class providing the information
and functions). This avoids the restriction described in the paragraph above, but means that
you would have to specify it whenever you wanted to do an 'as_cache' like conversion.
... maybe that's not such a bad thing though.

Maybe, instead of making at_key etc. cache-safe and query-only, these functions should be
implemented as direct accessors (e.g., returning non-const references), with added functions like
'get_at_key<Id>' which are cache-safe and query-only (there are several other functions with
cache-specific behaviour, so why not make them all new).


I suppose the implications of these options will become a bit clearer when I actually implement it.
Any foresight would be most appreciated.

I'll feed back any meaningful progress, though I have to start working on some bread-and-butter
stuff for a while from next week to earn my keep. :(



On 09/09/11 09:54, Joel de Guzman wrote:

> On 9/8/2011 10:38 AM, Josh wrote:
>> [...]
> I think this makes perfect sense. I am not aware if this has been done
> before. Go for it!
>
> While you are on it; write a blog about it :-)
>
> Regards,


Re: caching fusion map (idea/query)

Gordon Woodhull
In reply to this post by Josh
Hi Josh,

This sounds like a really interesting application of Fusion.  I will be interested to see what you come up with.

On Sep 7, 2011, at 10:38 PM, Josh wrote:
> I'm trying to figure out a good way to implement a lazy and transparent value cache ...by which I mean something like a fusion map which, whenever queried for an actual value, first ensures that said value is up-to-date before returning it.
> [...]
> I'm thinking that the dependency relationships among the values could be specified using type symbols such as 'depends<id1, mpl::set<id2, id3> >' [...]
> [...]
> Another aspect of sophistication I'd like to include is: assignment/query of multiple values simultaneously since this may affect the optimal way to manage the cache flags/update the data. [...]

Of course I am biased but whenever I hear the word "dependency" my thoughts go to graph data structures and topological sort.  

I am not clear if you mean to recalculate dependencies eagerly in response to changes, or just set dirty flags and calculate lazily in response to requests.  It is basically the same problem, only changing which direction you read the links.  IIUC with the eager approach doing multiple changes at once will get you some efficiency gains; with the lazy approach it's multiple requests.

My extremely humble advice is to create an MPL.Graph [1] where the field tags are the vertices and the dependencies are edges.

Then (for the lazy approach) you could run one depth_first_search from each field requested, appending each field to an MPL sequence on finish_vertex, but reusing the color map so that no field gets visited twice.  The resulting MPL sequence will give you a total ordering of all the fields that might need to be computed.

That part is all at compile-time and it wouldn't be wise to try to mix that with actually checking the dirty flag (code explosion).  So just mpl::for_each on the sorted dependencies, updating the fields that are marked dirty.
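Roughly, the runtime half would then be just this (sorted_deps stands for whatever the compile-time sort produced; dirty/recompute are whatever your cache exposes, not MPL.Graph API):

#include <boost/mpl/vector.hpp>
#include <boost/mpl/for_each.hpp>

struct id1 {}; struct id2 {}; struct id3 {};   // the field tags (vertices)

// suppose the compile-time topo sort produced this (sources before dependants)
typedef boost::mpl::vector<id2, id3, id1> sorted_deps;

template<typename Cache>
struct update_if_dirty
{
    Cache& cache;
    explicit update_if_dirty(Cache& c) : cache(c) {}

    template<typename Id>
    void operator()(Id) const
    {
        if (cache.template dirty<Id>())        // runtime flag check
            cache.template recompute<Id>();    // user-supplied update function
    }
};

template<typename Cache>
void refresh(Cache& cache)
{
    boost::mpl::for_each<sorted_deps>(update_if_dirty<Cache>(cache));
}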

For the eager approach, there's no dirty flag, you're just updating the whole DAG that's downstream.

Of course, topological sort is just a fancy name for tree recursion printing the children first, so you could code this fairly easily without MPL.Graph, but it shouldn't cost you much over keeping a bunch of mpl::sets yourself, and I hope you find the code & abstraction useful.

[Taking this a little further, this is one possible sort of Fusion.Graph, where the vertices are heterogeneous data in a Fusion container and the edges are pure metadata.  I don't have any code for that ..yet.]

I'd be glad to discuss this with you either offline or on the main Boost dev list.

Cheers,
Gordon

[1] source msm/mpl_graph; docs http://metagraph.info/mpl.graph/
I also briefly mentioned this topic in my BoostCon paper & talk, under "data flow".




Re: caching fusion map (idea/query)

Gordon Woodhull

On Sep 15, 2011, at 1:15 PM, Gordon Woodhull wrote:

> That part is all at compile-time and it wouldn't be wise to try to mix that with actually checking the dirty flag (code explosion).  So just mpl::for_each on the sorted dependencies, updating the fields that are marked dirty.
>
> For the eager approach, there's no dirty flag, you're just updating the whole DAG that's downstream.

Correction: in the lazy version you'd need to mark everything downstream as dirty whenever a set of changes comes in. Then, when handling a set of requests, calculate everything upstream (compile-time topo sort) that's got its dirty bit (runtime) set.

So the lazy version actually requires topo sort in both directions and benefits from both change sets and request sets.  Pretty cool, and IMO MPL.Graph would make it pretty easy (once you wrap your head around it).

Cheers,
Gordon



Re: caching fusion map (idea/query)

Josh

Gordon,

Thanks for your interest and enthusiasm.

Using the MPL.Graph stuff sounds like the smart way to do it. I haven't used it yet so that route would cost me a bit of mental pre-computation.

Anyway, I think we have the same mechanism in mind, i.e., pro-actively propagated 'out-of-date' information, and lazily queried 'changed-since-last-used' information. I had three data models for this in mind: boolean 'up-to-date' flags, boolean 'changed' flags, and integer 'version' counters. The dependencies would be specified as types, and all vanish at compile time leaving only the optimal flag/version maintenance code at runtime.

I think all three mechanisms require, in general, that a value 'know' all the values that depend on it. Let's talk about 'source' values, which have a number of 'dependent' values. Regarding these three mechanisms:

    'out-of-date' propagation: the source value must know all of its dependents in order to mark them all out-of-date when it is changed.

    'changed-since-last-used': a source value could provide a 'changed' flag, which is set whenever its value is changed, and cleared when its dependent values are updated or marked 'out-of-date' sometime later. If a source value can have multiple dependents then none of them can safely clear this flag unless they know that all other dependents are up-to-date or marked as out-of-date. So I'm thinking there should be a 'propagate-change-information' method that causes all lazy dependents to be marked out-of-date or updated, and then clears the source value's 'changed-since-last-used' flag.

    'version' propagation: in this case a source value usually doesn't need to know the values that are lazily checking its version number, except that, practically, the version variable might have to cycle when it reaches its numerical limit. In that rare case, the dependents could all have their 'latest-version' copy of the source value's 'version' counter changed to MAX_VERSION and the 'version' changed back to 0. (A small sketch of the version check follows this list.)
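For the 'version' mechanism the per-value bookkeeping would amount to something like this (purely illustrative):

// data held for a source value and for one of its dependents (sketch only)
struct source_slot
{
    double   value;
    unsigned version;        // incremented on every assignment to 'value'
};

struct dependent_slot
{
    double   value;
    unsigned seen_version;   // source version this value was last computed from
};

inline bool needs_update(source_slot const& src, dependent_slot const& dep)
{
    return dep.seen_version != src.version;   // lazy check; no fan-out bookkeeping
}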

In addition to dependencies fully specified at compile time, I want to provide support for run-time dependency manipulation. I think this is quite straightforward, and could be achieved by adding a runtime version of the three mechanisms. Rather than instantiating a bool or version counter type, they would contain a runtime set of attached values, e.g., as a std::set<boost::weak_ptr<T> >, or by references or whatever. Anyway, instead of checking their flag, they would have to check every element in the set. Likewise, when being informed they are out-of-date, they would set each element in the set out-of-date.

Regarding the update of values in the cache, it would be nice if the cache automatically selected the best functions for computing the values. In particular (forgive me if I am repeating what I described in an earlier message), if a value X is the source for two values Y and Z deduced from it, there could be a function to compute Y from X, Z from X, or both Y and Z simultaneously from X. Assuming each of these functions is the cheapest solution when, respectively, Y, Z, or both are required, it would be nice if we could query the cache with a get<Y>, get<Z>, get<Y,Z> method that selects the most efficient computation. Likewise, when setting values there's no point in recomputing a value that is about to be set at the same time, so methods like set<A>, set<B> and set<A,B> would be nice (sorry, I should probably use 'query' and 'assign' to avoid confusion with a set data type). I think adding this kind of behaviour is pretty straightforward, by providing the cache with a class (e.g., as a template parameter to the cache) that has the compute<Y>, compute<Z>, compute<Y,Z> methods and, e.g., typedefs an mpl::set<Y, Z, mpl::set<Y,Z> > computable_values;.
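i.e., something of this shape (all names made up; the cache would pick the entry of computable_values that covers the requested keys most cheaply):

#include <boost/mpl/set.hpp>

struct Y {}; struct Z {};   // illustrative keys for the deduced values

// what can be recomputed in a single call
typedef boost::mpl::set<Y, Z, boost::mpl::set<Y, Z> > computable_values;

// one class template per computable entry, specialized per key (or key set)
template<typename Keys> struct compute;

template<> struct compute<Y>
{
    template<typename Cache> static void apply(Cache& c);   // Y from X
};

template<> struct compute<Z>
{
    template<typename Cache> static void apply(Cache& c);   // Z from X
};

template<> struct compute< boost::mpl::set<Y, Z> >
{
    template<typename Cache> static void apply(Cache& c);   // both at once, cheaper than two calls
};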

I also had an idea for a possible nifty extension to do some automatic code optimization. In some cases it's more efficient to use lazy propagation, in others immediate propagation, and it might not be obvious which of the alternative compute<Y>, compute<Y,Z>, etc. methods described above yields the most efficient solution. So, timing statistics could be gathered by adding internal timers around the assign, query, and compute methods. The cache could also try various alternative propagation schemes by backing itself up every time it is queried/assigned and performing the required actions using different schemes. Then, when you quit, it could dump out some timing statistics. You copy these back into the source (e.g., the class providing the compute<..> methods can have an optional typedef mpl::map<mpl::pair<X, mpl::int_<expected_duration> >, ...> expected_durations;), and when you recompile without a CACHE_PROFILING flag it selects the best cache propagation mechanisms and compute<..> methods for you automatically. This would always be a bit rough though, I guess.

Anyway, like I said, I'm very grateful for your interest and enthusiasm.

Unfortunately, I have some imminent deadlines on other stuff at the moment, so it might be several weeks until I get back to this. On the other hand, a fairly simple version without all the bells and whistles really shouldn't be too much effort and I'm intending to have a crack at it when I get a few spare days.

Best regards,

Josh.

> So the lazy version actually requires topo sort in both directions and benefits from both change sets and request sets.

Yup. A lazy value that checks its sources cannot have dependents that expect their sources to inform them when they are out of date (unless it's sufficient for them to stay in sync with their lazy sources even when those lazy sources might be out of sync with their own sources).

...if that's the angle you had in mind..!




Re: caching fusion map (idea/query)

Josh
In reply to this post by Gordon Woodhull

Regarding the specification of dependencies, and values to keep in the cache, I'm considering a couple of ideas:

(i) A cache is specified with the same template arguments as a fusion map (a sequence of pairs of key type and data type). The dependencies are specified for the key types independently, e.g., something like:

namespace cache_stuff {

    template<typename Key> struct depends_on : boost::mpl::set<> {};   // by default, no dependencies

    template<> struct depends_on<my_id0> : boost::mpl::set<my_id1, my_id2> {};   // my_id0 is stale when my_id1 or my_id2 changes

} // namespace cache_stuff

The functions for re-computing a value(s) are also specified independently in terms of the key types. They could perhaps be template functions or static methods of template classes. The functions available for recomputing the value associated with a given key must also be declared (e.g., in a similar way to the depends_on struct), and the keys for the values they require as arguments must also be defined. Anyway, the point is that the functions for recomputing the value(s) associated with a given key(s) using a single function/static method are defined independently of the cache. This means that the cache implementation just looks them up, chooses the most appropriate of the applicable functions and uses it to update the values it holds whenever necessary.
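For example, the update functions could be looked up the same way as depends_on (update_fn and its members are just illustrative):

namespace cache_stuff {

    // update function for a key, found by the cache implementation via
    // specialization on the key type alone (as with depends_on above)
    template<typename Key> struct update_fn;

    template<> struct update_fn<my_id0>
    {
        typedef boost::mpl::set<my_id1, my_id2> arguments;   // the values it reads

        template<typename Cache>
        static void apply(Cache& c);   // reads my_id1/my_id2 from c, writes my_id0
    };

} // namespace cache_stuff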

An advantage of this approach is that the cache class template can be designed as 'just another fusion container' (I'm only really thinking about fusion maps, but the other associative container, the set, could be an easy extension, as perhaps could the non-associative types, but skip that speculation for now), except that when you use a cache it automatically keeps values up-to-date as-and-when needed. E.g., you can do an as_map(cache_type&) to get a map with the same structure but which does not do the maintenance part for you. You might then manipulate the contents of the cache, add new cached values or remove some others, and then turn it back into a cache type. However you arrive at the new cache type, the point is that just by designating it as a cache container you enable all the caching functions and dependencies defined for the key types/data it contains.

A disadvantage is that all the dependency information and updating functions must be specified independently so the key types are locked into specific interpretations. The cache also does not contain any other instantiated data apart from the key indexed data.

(ii) Alternatively, a cache can be specified via a single template parameter, Cache_Spec, which contains all the typedefs defining the keys/data contained, the dependency relationships, and the methods/functions available for updating a particular value(s); indeed, those methods may be written in the Cache_Spec class itself.
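Roughly (everything here is made up to show the shape):

#include <boost/fusion/include/map.hpp>
#include <boost/mpl/map.hpp>
#include <boost/mpl/pair.hpp>
#include <boost/mpl/set.hpp>
#include <vector>

struct my_id0 {}; struct my_id1 {}; struct my_id2 {};   // illustrative key types (as in (i))

struct my_cache_spec
{
    // contents of the cache
    typedef boost::fusion::map<
        boost::fusion::pair<my_id0, double>,
        boost::fusion::pair<my_id1, double>,
        boost::fusion::pair<my_id2, double>
    > contents;

    // my_id0 goes out-of-date when my_id1 or my_id2 changes
    typedef boost::mpl::map<
        boost::mpl::pair<my_id0, boost::mpl::set<my_id1, my_id2> >
    > dependencies;

    std::vector<double> scratch;   // spec-local working data for the updates

    template<typename Cache>
    void update_my_id0(Cache& c);  // may use 'scratch', writes my_id0
};

// the cache type would then just be:  cache<my_cache_spec>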

An advantage of this approach is that the dependencies and functions can be specified in different ways for the same key types, and that the Cache_Spec class can have its own data to use during computations (e.g., if they require a lot of allocation, or exploit coherency).

A disadvantage is that it is harder to manipulate the cache as a fusion container, and indeed, it seems more like a specific kind of view type.


Anyway, for the time being I prefer the latter approach. Any comments, ideas, cold water would be greatly appreciated.


Josh.


