Odd dlopen behavior

Odd dlopen behavior

Davidson, Josh
I'm wrapping a C++ project with Py++/Boost.Python under Windows and Linux.  Everything in Windows is working fine, but I'm a bit confused by the behavior in Linux.  The C++ project is built into a single shared library called libsimif, but I'd like to split it up into 3 separate extension modules.  For simplicity, I'll only discuss two of them, since the behavior for the third is identical.  The first, called storage, contains definitions of data structures.  It has no dependencies on anything defined in either of the other two extension modules.  The second module, control, uses data structures that are defined in storage.  On the C++ side of things, the headers and source files for storage and control are in entirely different directories.  I've tried a number of different configurations to build the extensions, but one thing that has remained consistent is that for storage, I am only generating Py++ wrappers for the headers included in the storage directory and only building source files in that directory along with the Py++ generated sources.  Ditto for the control extension.

The configuration that currently works passes libsimif as a library to the distutils.Extension constructor.  Before starting Python, I need to ensure that libsimif can be found via LD_LIBRARY_PATH.  Then I can launch Python and import either module (or import from them) and everything works as expected.  Here is some sample output from this working configuration:
>>> import ast.simif.model_io.storage as storage
>>> import ast.simif.model_io.control as control
>>> dir(storage)
['DiscreteStore', 'PulseStore', 'RtStore', 'SerialStore', 'SharedMemoryBuilder', 'SharedMemoryDeleter', 'SpaceWireStore', '__doc__', '__file__', '__name__', '__package__']
>>> dir(control)
['DiscreteController', 'ModelIoController', 'PulseController', 'RtController', 'SerialController', 'SpaceWireController', '__doc__', '__file__', '__name__', '__package__']
>>> storage.__file__
'ast/simif/model_io/storage.so'
>>> control.__file__
'ast/simif/model_io/control.so'


As you can see, both modules have their own shared library and unique set of symbols.  Now here is why I am confused.  In Linux, we've always set the dlopen flags to include RTLD_NOW and RTLD_GLOBAL.  If I do that, this is what happens:
>>> import sys
>>> import DLFCN
>>> sys.setdlopenflags(DLFCN.RTLD_NOW | DLFCN.RTLD_GLOBAL)
>>> import ast.simif.model_io.storage as storage
>>> import ast.simif.model_io.control as control
__main__:1: RuntimeWarning: to-Python converter for DiscreteStore::FrameData already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for PulseStore::FrameData already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for RtStore::Link already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for RtStore::FrameData already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for RtStore::RtData already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for SerialStore::FrameData already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for SharedMemoryBuilder already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for SharedMemoryDeleter already registered; second conversion method ignored.
>>> dir(storage)
['DiscreteStore', 'PulseStore', 'RtStore', 'SerialStore', 'SharedMemoryBuilder', 'SharedMemoryDeleter', 'SpaceWireStore', '__doc__', '__file__', '__name__', '__package__']
>>> dir(control)
['DiscreteStore', 'PulseStore', 'RtStore', 'SerialStore', 'SharedMemoryBuilder', 'SharedMemoryDeleter', '__doc__', '__file__', '__name__', '__package__']
>>> storage.__file__
'ast/simif/model_io/storage.so'
>>> control.__file__
'ast/simif/model_io/control.so'

So, here storage imports ok, but control complains about a bunch of duplicate registrations.  Then when inspecting the modules, control is completely wrong: it lists storage's classes instead of its own.  It's as if it tried to import storage twice, even though __file__ reports the correct shared libraries.  Perhaps not surprisingly, if I change the import order and import control ahead of storage, this is what happens:

>>> import sys
>>> import DLFCN
>>> sys.setdlopenflags(DLFCN.RTLD_NOW | DLFCN.RTLD_GLOBAL)
>>> import ast.simif.model_io.control as control
>>> dir(control)
['DiscreteController', 'ModelIoController', 'PulseController', 'RtController', 'SerialController', 'SpaceWireController', '__doc__', '__file__', '__name__', '__package__']
>>> import ast.simif.model_io.storage as storage
__main__:1: RuntimeWarning: to-Python converter for DiscreteController already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for PulseController already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for RtController already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for SerialController already registered; second conversion method ignored.
__main__:1: RuntimeWarning: to-Python converter for SpaceWireController already registered; second conversion method ignored.
>>> dir(storage)
['DiscreteController', 'ModelIoController', 'PulseController', 'RtController', 'SerialController', 'SpaceWireController', 'SpaceWireStore', '__doc__', '__file__', '__name__', '__package__']

Similar behavior, but now the storage import is FUBAR.  Does anyone understand what is going on here?

I'm using x64 Python 2.6.6 on x64 RHEL 6.  Gcc version 4.4.6.
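
For anyone reproducing this on a current Python 3: the DLFCN module used above is gone, and the same constants are available as os.RTLD_NOW and os.RTLD_GLOBAL on Unix. A minimal sketch of limiting the flag change to a handful of imports, so that the rest of the process keeps the default flags. The helper name dlopen_flags is hypothetical and not part of any library:

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def dlopen_flags(flags):
    """Temporarily change the flags Python passes to dlopen() when it
    imports extension modules, then put the previous flags back."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield previous
    finally:
        # Restore the old flags so later imports are not affected.
        sys.setdlopenflags(previous)


# Usage (the module path is the one from this thread):
#
#     with dlopen_flags(os.RTLD_NOW | os.RTLD_GLOBAL):
#         import ast.simif.model_io.storage as storage
```

This only narrows where RTLD_GLOBAL applies. It does not by itself remove a symbol clash between two extensions that are both imported inside the block.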

Thanks,
Josh



_______________________________________________
Cplusplus-sig mailing list
[hidden email]
http://mail.python.org/mailman/listinfo/cplusplus-sig
Re: Odd dlopen behavior

Niall Douglas
On 30 Jan 2012 at 1:21, Davidson, Josh wrote:

> Similar behavior, but now the storage import is FUBAR.  Does anyone
> understand what is going on here?
>
> I'm using x64 Python 2.6.6 on x64 RHEL 6.  Gcc version 4.4.6.

It's never popular for me to say this, but shared libraries really
aren't implemented well in ELF. It's always more unnecessary work
there due to its bad design.

Have you applied symbol visibility as per
http://www.nedprod.com/programs/gccvisibility.html? It should be a
cinch if you already have windows support in there.

On the wider issue, BPL has no concept of DLL/SO type ownership, so
if DLL A defines a class Foo and DLL B defines a class Foo with a
completely different definition, all BPL can do is complain when it
sees the type's registration code being duplicated without knowing if
it's serious or not. Needless to say, any binary generated here can't
work reliably unless one disables one or the other of class Foo.
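
The "complain but carry on" behaviour described above can be pictured with a toy registry. This is only a sketch of the idea in Python, not Boost.Python's actual implementation; the class ConverterRegistry and its owner argument are hypothetical:

```python
import warnings


class ConverterRegistry:
    """Toy process-wide registry keyed by C++ type name."""

    def __init__(self):
        self._converters = {}

    def register(self, type_name, converter, owner):
        """Register a converter; return False if the type was already taken."""
        if type_name in self._converters:
            # The registry is keyed by type name only, so it cannot tell
            # whether the second registration is the same type or a
            # different one. All it can do is warn and keep the first.
            warnings.warn(
                "to-Python converter for %s already registered; "
                "second conversion method ignored." % type_name,
                RuntimeWarning,
                stacklevel=2,
            )
            return False
        self._converters[type_name] = (converter, owner)
        return True

    def owner_of(self, type_name):
        """Name of the library whose registration won."""
        return self._converters[type_name][1]
```

Here the owner is recorded purely for illustration; the point of the paragraph above is that BPL keeps no such information.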

Now regarding your issue, Py++ has to make the assumption that thunk
code must be generated for each type for each module output even
though those can't be combined without runtime warnings. If you've
implemented the GCC visibility stuff above and you still have a
problem, you need to start marking the clashing symbols as weak or
inline so GNU ld elides the duplicates at runtime.

I'm sure Py++ can insert the required markup automagically - Roman
might be able to help here.

If that isn't a runner, start chopping out sections of API mirrored
into the Python space, or if you need that section then break your
common DLL/SO into its own python module and have that be imported by
the modules using that common DLL/SO. Remember that you can split a
large DLL/SO in multiple Python module representations as needed.

HTH,
Niall

--
Technology & Consulting Services - ned Productions Limited.
http://www.nedproductions.biz/. VAT reg: IE 9708311Q. Company no:
472909.



Re: EXTERNAL: Re: Odd dlopen behavior

Davidson, Josh
Ok, well I did figure out the discrepancy between these extensions and previous extensions that have been built that required setting RTLD_GLOBAL.  What I'm doing for these extensions is instead of building in all of the original C++ code AND the Py++ generated code into the extension, I'm only building in the Py++ generated sources  and linking an existing shared library containing the original C++ definitions.  Is this non-standard or bad practice?  

One issue with this is I'm now forced to deliver both the Python extension shared libraries and the original shared libraries.  Not a huge deal, but it does add a little work on the deployment and maintenance end.
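
A rough sketch of that build setup, written as the keyword arguments one might hand to distutils.Extension (or the setuptools equivalent). The source path and library directory are made up for illustration, other libraries the build needs are left out, and the optional runtime_library_dirs entry is an assumption of mine rather than something used in this thread: it asks the linker to embed a run-time search path so that LD_LIBRARY_PATH is not needed.

```python
def extension_kwargs(module_name, generated_sources, lib_dir, embed_rpath=False):
    """Build Extension keyword arguments for a module that contains only
    the Py++ generated sources and links against the existing libsimif."""
    kwargs = {
        "name": module_name,
        "sources": list(generated_sources),
        # Link against libsimif.so instead of compiling the C++ sources in.
        "libraries": ["simif"],
        "library_dirs": [lib_dir],
    }
    if embed_rpath:
        # Optional: record lib_dir in the extension as a run-time search path.
        kwargs["runtime_library_dirs"] = [lib_dir]
    return kwargs


# Hypothetical paths, for illustration only.
storage_kwargs = extension_kwargs(
    "ast.simif.model_io.storage",
    ["build/pybindings/storage/storage.main.cpp"],
    "/opt/simif/lib",
)
# Extension(**storage_kwargs) would then go into setup(ext_modules=[...]).
```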


Re: EXTERNAL: Re: Odd dlopen behavior

Niall Douglas
On 31 Jan 2012 at 16:44, Davidson, Josh wrote:

> Ok, well I did figure out the discrepancy between these extensions and
> previous extensions that have been built that required setting
> RTLD_GLOBAL.  What I'm doing for these extensions is instead of building
> in all of the original C++ code AND the Py++ generated code into the
> extension, I'm only building in the Py++ generated sources  and linking
> an existing shared library containing the original C++ definitions.  Is
> this non-standard or bad practice?  

The big problem with shared objects exporting lots of symbols was
that the Linux runtime shared object linker used to have O(N^3)
complexity. As a result, every time you ran a program linking to a
ginormous shared object you'd get a pause of several seconds as it
bound the symbols.

Now, some years ago with KDE and OpenOffice taking forever to load,
some eyeballs were turned onto this problem and I know they were
going to get it down to O(N^2). There was talk of replacing bits
with O(N), but it would introduce ABI compat problems among other
things. Another angle was making it use multiple cores. My attention
ended up moving elsewhere so I have no idea what has happened since.
It could still be O(N^2), it could be O(N) or somewhere in between.

> One issue with this is I'm now forced to deliver both the Python
> extension shared libraries and the original shared libraries.  Not a
> huge deal, but it does add a little work on the deployment and
> maintenance end.

On systems with sane DLL designs like Windows and Mac OS X, you'd
generally keep the Python bindings separate from the library being
bound as it's cleaner and more self-contained. You can also issue
smaller self-contained ABI compatible releases as hotfixes etc etc.

On the insanity that is ELF, generally you can make inter-SO problems
go away by linking everything into a ginormous monolithic SO. However
you used to get that O(N^3)/O(N^2) problem I mentioned and maybe you
still do. So, sometimes you just have to get your hands dirty and
start with hack scripts which post-process the SOs to make their
symbol tables sane, or write your own SO loader and binder
implementation using dlopen() et al and bypass the system linker
altogether :)

Sadly the ISO standards work to enforce sanity in shared libraries
across all platforms got dropped from C11 and C++11, but I certainly
will try to push that forward again for C11 TR1 along with a few
other items on my shopping list (I'm the ISO SC22 convenor for
Ireland, though Ireland is only an observer). The problem, as always,
is a lack of sponsorship or funding by anyone who cares enough to
have the problem fixed properly - and it is a difficult problem to
get correct. In the end, as much as these problems are annoying and
cost time to people like you, the cost of fixing them isn't seen as
business relevant by those with the resources.

HTH,
Niall




Re: EXTERNAL: Re: Odd dlopen behavior

Davidson, Josh
Niall, great information, but I did track this problem down to a quirk with Py++.

I've had a great deal of trouble finding a reliable way to actually write modules with module_builder.  Originally, I had been using split_module, but I've run into several cases where it goes off into the weeds and tries to write files whose names exceed the maximum file name length.  Generally, this occurs when wrapping classes that go nuts on specialization, since module_builder uses the name of the class in the name of the bindings file.  Here is one quick, extreme example where this occurs when trying to wrap members of boost:
  File "C:\Users\davidsj2\workspace\SimCommon\src\Python\goes\build\bindings.py", line 325, in _generate
    files = mb.split_module(self._bindingsDir)
  File "c:\Python26\lib\site-packages\pyplusplus\module_builder\boost_python_builder.py", line 375, in split_module
    , encoding=self.encoding)
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\__init__.py", line 37, in write_multiple_files
    mfs.write()
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\multiple_files.py", line 406, in write
    self.split_classes()
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\multiple_files.py", line 307, in split_classes
    map( self.split_class, class_creators )
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\multiple_files.py", line 294, in split_class
    self.split_class_impl( class_creator )
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\multiple_files.py", line 268, in split_class_impl
    , self.create_function_code( function_name ) ) )
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\multiple_files.py", line 61, in write_file
    writer.writer_t.write_file( fpath, content, self.files_sum_repository, self.encoding )
  File "c:\Python26\lib\site-packages\pyplusplus\file_writers\writer.py", line 150, in write_file
    f = codecs.open( fpath, 'w+b', encoding )
  File "c:\Python26\lib\codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: 'C:\\Users\\davidsj2\\workspace\\SimCommon\\build\\win64\\pybindings\\goes\\boost\\dividable2_less__boost_scope_date_time_scope_date_duration_less__boost_scope_date_time_scope_duration_traits_adapted__greater__comma__int_comma__boost_scope_detail_scope_empty_base_less__boost_scope_date_time_scope_date_duration_less__boost_scope_date_time_scope_duration_traits_adapted__greater___greater___greater_.pypp.hpp'
make: *** [all] Error 1
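
One generic way around over-long generated names is to truncate them and append a short hash, so that distinct classes still get distinct files. The following is only a sketch: short_file_stem is a hypothetical helper, the 64-character limit is arbitrary, and where (or whether) something like this can be hooked into Py++ is left open.

```python
import hashlib


def short_file_stem(class_alias, max_len=64):
    """Return class_alias unchanged if it is short enough, otherwise a
    readable prefix followed by a hash of the full name."""
    if max_len < 16:
        raise ValueError("max_len must leave room for the hash suffix")
    if len(class_alias) <= max_len:
        return class_alias
    digest = hashlib.sha1(class_alias.encode("utf-8")).hexdigest()[:12]
    # Keep a readable prefix and add the digest so that two long names
    # sharing the same prefix do not map to the same file.
    keep = max_len - len(digest) - 1
    return class_alias[:keep] + "_" + digest
```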

After finding references to this problem as far back as 2006, I decided to switch over to balanced_split_module.  This has its own set of problems.  The first is that it is highly prone to divide by zero errors.  One quick way to reproduce this issue is to wrap one class and specify a split count of 2.  Obviously not a wise combo, but it's an easy error case that Py++ should handle.
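
For comparison, a balanced split that tolerates "one class, two files" is not hard to write. This is a standalone sketch of the idea, not Py++'s algorithm, and it makes no claim about where exactly Py++ divides by zero:

```python
def balanced_split(items, count):
    """Split items into at most `count` contiguous, non-empty chunks whose
    sizes differ by no more than one."""
    if count < 1:
        raise ValueError("count must be at least 1")
    items = list(items)
    if not items:
        return []
    # Never produce more chunks than there are items; this is the guard
    # that keeps a split count of 2 with a single class from going wrong.
    count = min(count, len(items))
    base, extra = divmod(len(items), count)
    chunks, start = [], 0
    for i in range(count):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks
```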

So anyways, the root of *this* problem is how balanced_split_module creates its registration functions.  For each extension, it creates one register function for each file it writes, in the form:  void register_classes_<N>().  Obviously, these collide when you create more than one extension using balanced_split_module and enable RTLD_GLOBAL.  One quick solution to this problem would be to prepend the extension name to the name of the registration functions, e.g.: <module>_register_classes_<N>.  Since the module name is used to name the files, it's easily accessible and would solve a lot of problems.  Of course, if you have modules with the same name in different packages, you would run into this again.
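
To make the collision concrete, here is a toy model of the symbol lookup involved. It is a sketch in plain Python, not a description of the real dynamic linker: Loader, make_library and the init<module> names are stand-ins, and the only rule modelled is that with global scope the first library to define a name wins.

```python
class Loader:
    """Toy dynamic loader that models symbol lookup and nothing else."""

    def __init__(self, global_scope):
        self.global_scope = global_scope  # True models RTLD_GLOBAL
        self.global_symbols = {}          # first definition of a name wins

    def load(self, init_name, symbols):
        """Load a library (dict of name -> function), run its init function
        and return the class names that ended up in the new module."""
        if self.global_scope:
            for name, func in symbols.items():
                self.global_symbols.setdefault(name, func)

        def resolve(name):
            # With global scope, an earlier library's definition shadows
            # the one in the library being loaded.
            if self.global_scope and name in self.global_symbols:
                return self.global_symbols[name]
            return symbols[name]

        module = []
        resolve(init_name)(resolve, module)
        return module


def make_library(module_name, classes, prefix=""):
    """Symbols of one extension; `prefix` models the proposed rename."""
    reg_name = prefix + "register_classes_1"
    init_name = "init" + module_name

    def register_classes(module):
        module.extend(classes)

    def init(resolve, module):
        # The init function calls the registration function by name.
        resolve(reg_name)(module)

    return init_name, {init_name: init, reg_name: register_classes}


loader = Loader(global_scope=True)
loader.load(*make_library("storage", ["DiscreteStore"]))
clashed = loader.load(*make_library("control", ["DiscreteController"]))
# clashed == ["DiscreteStore"]: control ran storage's register_classes_1
```

Running the same two loads with global_scope=False, or with prefix="storage_" and prefix="control_", gives control its own classes back. That matches the observation that the modules behave correctly without RTLD_GLOBAL.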

Josh

Re: EXTERNAL: Re: Odd dlopen behavior

Niall Douglas
On 2 Feb 2012 at 6:00, Davidson, Josh wrote:

> Niall, great information, but I did track this problem down to a quirk with Py++.
> [snip]
> After finding references to this problem as far back as 2006, I decided
> to switch over to balanced_split_module.  This has its own set of
> problems.  The first is that it is highly prone to divide by zero
> errors.  One quick way to reproduce this issue is to wrap one class and
> specify a split count of 2.  Obviously not a wise combo, but it's an
> easy error case that Py++ should handle.

It's been a long time since I used Py++, or indeed BPL. Neither has
seen much work done on them in recent years, so I should imagine both
will have suffered from a certain amount of bitrot.

> So anyways, the root of *this* problem is how balanced_split_module
> creates its registration functions.  For each extension, it creates one
> register function for each file it writes in the form:  void
> register_classes_<N>()    Obviously, these collide when you create more
> than one extension using balanced_split_module and enable RTLD_GLOBAL.

In my own code, registration functions are always static and pass
their own address into the runtime. The runtime does a two-pass
initialisation, so first off it eliminates any duplicate registration
functions whose address lies within the same DLL. Then and only then
does it start complaining.

That solution allows you to throw registration functions anywhere and
let the runtime sort out what's what. It also lets you operate
per-DLL and DLL-specialised registries, something I suggested to Jim
Bosch for the next release of BPL a few months back. Knowing which
DLL registered what is also very useful for debugging.
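
A toy version of that two-pass scheme, as I read the description above. This is a sketch under assumptions: in native code the owning DLL would be worked out from the registration function's address, whereas here it is simply passed in as a string, and Registration and initialise are hypothetical names.

```python
import warnings
from collections import namedtuple

# One queued registration: which type, and which DLL/SO it came from.
Registration = namedtuple("Registration", "type_name dll")


def initialise(pending):
    """Two-pass initialisation over queued registrations.

    Pass 1 silently drops repeats of a type that come from the same DLL.
    Pass 2 warns about what is left: one type registered by different DLLs.
    """
    # Pass 1: keep only the first entry per (type, DLL) pair.
    seen = set()
    unique = []
    for reg in pending:
        key = (reg.type_name, reg.dll)
        if key not in seen:
            seen.add(key)
            unique.append(reg)

    # Pass 2: only now complain, and say which DLLs are involved.
    registry = {}
    for reg in unique:
        if reg.type_name in registry:
            warnings.warn(
                "%s registered by both %s and %s; keeping the first"
                % (reg.type_name, registry[reg.type_name].dll, reg.dll),
                RuntimeWarning,
            )
            continue
        registry[reg.type_name] = reg
    return registry
```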

> One quick solution to this problem would be to prepend the extension
> name to the name of the registration functions, e.g.:
> <module>_register_classes_<N>  Since the module name is used to name the
> files, it's easily accessible and would solve a lot of problems.  Of
> course, if you have modules with the same name in different packages you
> would run into this again.

This is a bad solution. Firstly, who is to say that type Foo in
extension X is or is not the same as type Foo in extension Y even if
both have the same type, same length, same traits and live in the
same namespace? Yes I know it's a violation of ODR, but in the real
world maybe they are the same, and maybe they aren't - people use
DLLs to violate ODR all the time, it's one of their big utilities.
You need a way of explicitly specifying if the types are equal or
aren't. Type registration functions aren't the place to do it,
however in my own code I have a concept of "type conversion"
registration functions which are used to declare runtime type
conversions which kick in if the static (metaprogrammed) type
conversions fail. When you say type Foo is the same between
registries X and Y it simply bumps that type out of X and Y and into
the common parent registry between X and Y, so searches fail in X and
Y and jump into the parent where they get resolved as being
identical.
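
The "bump into the common parent" idea can be sketched with a small registry tree. Again this is an illustration only, with hypothetical names (Registry, declare_same), and it handles just the simplest case of two registries sharing a direct parent:

```python
class Registry:
    """Type registry that falls back to its parent when a lookup fails."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.types = {}

    def register(self, type_name, info):
        self.types[type_name] = info

    def lookup(self, type_name):
        """Search this registry, then its ancestors.
        Returns (name of the registry that answered, info)."""
        reg = self
        while reg is not None:
            if type_name in reg.types:
                return reg.name, reg.types[type_name]
            reg = reg.parent
        raise KeyError(type_name)


def declare_same(type_name, x, y):
    """Declare that type_name means the same type in registries x and y."""
    if x.parent is None or x.parent is not y.parent:
        raise ValueError("this sketch needs x and y to share a direct parent")
    # Remove the type from both children and place one entry in the parent,
    # so lookups from either side now resolve to the same record.
    info = x.types.pop(type_name)
    y.types.pop(type_name)
    x.parent.register(type_name, info)
```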

Now me and Dave Abrahams come to a disagreement after this - he does
not feel that there ought to be separate static and runtime type
registries. And I can see his point. But we disagree :). I'm very
fond of binding together other people's libraries in order to
personally avoid writing as much code as possible, so I think in
terms of ways of getting other people's code to cooperate, and for
that I like to override when needs be which is blatant ODR breaking.
Dave has a different approach, and therefore a different philosophy.

(And before Dave points out that that wasn't my philosophy in the
past, and indeed he once argued with me strenuously to use Boost
rather than write my own metaprogramming library, I admit I was wrong
and he was right)

However, back to the point. It would appear you're running into well
known limitations in BPL. Some years ago, people's needs were
generally simpler and BPL worked without issue for them. As years
move on, more and more complex use cases are becoming common.

Bitrot. It consumes all code eventually.

Niall



