
NuDB: A fast key/value insert-only database for SSD drives in C++11


NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
I just released version 1.0.0 of my header-only C++11 library NuDB.
It's quite a mature library; I haven't had to make any changes in a
while. It's been running on production servers for two years or so with
no problems, managing ever-growing databases of around 2 TB.

I'm wondering what the level of interest, if any, there would be for
making this a part of Boost. You can check it out here:

https://github.com/vinniefalco/NuDB


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
Vinnie Falco wrote:

> I just released version 1.0.0 of my header-only C++11 library NuDB.
> It's quite a mature library; I haven't had to make any changes in a while.
> It's been running on production servers for two years or so with no problems,
> managing ever-growing databases of around 2 TB.
>
> I'm wondering what the level of interest, if any, there would be for
> making this a part of Boost. You can check it out here:
>
> https://github.com/vinniefalco/NuDB

The obvious question one might have is, in what scenarios is a database that
only supports insertions, but not updates or deletions, useful?

A follow-up to that is, is it not possible to add update/delete support to
NuDB by just appending the new data and making the key file point to the new
location (using zero-sized data for deletions)?

Last, the README says "Value sizes from 1 to 2^32 bytes (4GB)", but the file
format says uint48_t. Which is it? 2^32 is perhaps a bit limiting nowadays.



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, Mar 21, 2017 at 5:01 PM, Peter Dimov via Boost
<[hidden email]> wrote:
> The obvious question one might have is, in what scenarios is a database that
> only supports insertions, but not updates or deletions, useful?

So this actually comes up quite a bit in decentralized systems, where
the cryptographic digest (or hash) of a binary object is used as the
key to identify the object. It's called "content-addressable storage":
https://en.wikipedia.org/wiki/Content-addressable_storage

In digital currencies such as Bitcoin and Ripple these values
represent immutable information such as transactions that have taken
place in the past. It also comes up in distributed storage systems
such as file sharding, to represent pieces of files.

There's no question that this library is quite specialized and
applicable only in niche use-cases. But for those cases, it works
amazingly well.

> A follow-up to that is, is it not possible to add update/delete support to
> NuDB by just appending the new data and making the key file point to the new
> location (using zero-sized data for deletions)?

That is certainly possible, but space for the unused objects created
as a result of updates and deletes could not be reclaimed in real
time. Administrators would have to use an offline process to take
a database and re-create it to exclude those objects. This could take
quite a long time for databases in the tens of terabytes, on the order
of days or weeks.

It is also not a use-case required for the applications that I
developed NuDB for, so I haven't done it. But it could be explored.

> Last, the README says "Value sizes from 1 to 2^32 bytes (4GB)", but the file
> format says uint48_t. Which is it? 2^32 is perhaps a bit limiting nowadays.

Probably the documentation is a touch out of date. The database
imposes a limit of 2^32 bytes for individual values. I think that's
big enough. NuDB is designed for databases that have an enormous number
of keys, not a small number of keys with enormous values. If an
application needs a key/value store where typical values are in excess
of 4GB, it doesn't need support for billions of keys (since a billion
of those values would exceed all known storage limits).

Also, I think I changed that limit from 2^48 down to 2^32 so that NuDB
behaves identically whether built for 32-bit or 64-bit processor
targets.

Thanks for the questions!


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 21/03/2017 20:36, Vinnie Falco via Boost wrote:
> I just released version 1.0.0 of my header-only C++11 library NuDB.
> It's quite a mature library; I haven't had to make any changes in a
> while. It's been running on production servers for two years or so with
> no problems, managing ever-growing databases of around 2 TB.
>
> I'm wondering what the level of interest, if any, there would be for
> making this a part of Boost. You can check it out here:
>
> https://github.com/vinniefalco/NuDB

As I said to you at CppCon, I'm very keen on the concept. It's why I
started AFIO back in 2012: to one day deliver a low-level key-value
store suitable for C++ standardisation. I'm still at it five years
later, with Outcome being my next step towards getting that achieved, as
the key-value store's API uses a ton of outcome::result<T>.

But as I also said to you at CppCon, I'm not keen on your NuDB. I have
problems with its API, with its flexibility and scalability, and lots
more problems with its implementation. I am unsure what benefit it
confers over the much safer choices of SQLite or UnQLite, both of which
are battle-tested and written by database experts to ensure that other
people's data is never, ever lost.

I felt when I reviewed NuDB's implementation some time ago that your
implementation would be highly susceptible to data loss in a wide set of
circumstances. I also felt it wouldn't scale well to the > 1M IOPS
storage just around the corner. I appreciate that, as you say, it's been
running in production for two years, but I would suspect it is being
used in a highly restricted use case where corner cases don't present.
As a library in Boost, you would find people doing all sorts of daft
things with it, and it would therefore need to be absolutely rock-solid
reliable on a wide variety of operating systems, filing systems and use
cases, e.g. over a Samba share. This is why it's taking me so long:
writing the testing for this stuff takes a very, very long time.

All that said, I haven't delivered a key-value store after five years of
outside-of-work effort. You have. I would therefore endorse you bringing
this to review, but be aware that I am fairly certain it would be
rejected in a review. That said, the feedback you'd get would be very
useful, not just to you but also to me for my planned key-value store,
so if you are up for getting this to a state ready to submit to Boost,
I'd welcome it.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, Mar 21, 2017 at 7:36 PM, Niall Douglas via Boost
<[hidden email]> wrote:
> On 21/03/2017 20:36, Vinnie Falco via Boost wrote:
> I am unsure what benefit it
> confers over the much safer choices of SQLite or UnQLite both of which
> are battle tested and written by database experts to ensure that other
> people's data is never, ever lost.

SQLite is painfully slow compared to NuDB. UnQLite supports features
not present in NuDB such as cursor iteration, moving, and deleting.
The tradeoff for not supporting these features is that NuDB can make
better claims about its performance.

> I felt when I reviewed NuDB's implementation some time ago that your
> implementation would be highly susceptible to data loss in a wide set of
> circumstances.

NuDB makes some assumptions about the underlying file system. In
particular, it assumes that when a call to fsync() returns, the data
has been reliably written. If these invariants are met there is no data
loss.

In fact, one of NuDB's unit tests exhaustively verifies that the
database recovery process performs correctly after an I/O failure. The
unit test works by instantiating a database with a custom File
template parameter. This custom file implementation simulates an I/O
failure on the Nth operation. The test starts with N=1 and runs until a
failure occurs, then increments N and re-runs until there are no
failures. At each iteration, the test verifies that the database
recovery process left the database in a consistent state. You can see
this code here:
https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125

I don't doubt that you think this database is vulnerable to data loss
under some condition. However, I would appreciate specifics so that I
can verify under what conditions your claim is correct, or if it is
correct at all.

> I also felt it wouldn't scale well to the > 1M IOPS
> storage just around the corner.

While it hasn't yet been tested on these non-existent devices, I am
optimistic, since NuDB was specifically designed for very intense
fetch workloads with light concurrent insertions. Fetches can happen
concurrently, so it should be possible to saturate the read throughput
of a device. See:
https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_store.ipp#L213

> I would suspect it is being
> used in a highly restricted use case where corner cases don't present.

Maybe. I'll counter by saying, there are only three operations
possible on an open database: insert, fetch, and close.
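
For a sense of the full API surface, typical use looks roughly like the
example below (paraphrased from the README; the exact signatures should
be checked against the repository):

  // Paraphrased from the NuDB README; verify signatures against the repository.
  #include <nudb/nudb.hpp>
  #include <cstddef>
  #include <cstdint>

  int main()
  {
      using key_type = std::uint32_t;
      nudb::error_code ec;
      nudb::create<nudb::xxhasher>(
          "db.dat", "db.key", "db.log",   // data, key and log files
          1, nudb::make_salt(),           // appnum, salt
          sizeof(key_type),
          nudb::block_size("."),          // native block size
          0.5f, ec);                      // load factor
      nudb::store db;
      db.open("db.dat", "db.key", "db.log", ec);
      key_type k = 1;
      char const v[] = "value";
      db.insert(&k, v, sizeof(v), ec);    // insert
      db.fetch(&k,                        // fetch
          [&](void const* data, std::size_t size)
          {
              // use data, size
          }, ec);
      db.close(ec);                       // close
  }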

> if you are up for getting this to a state ready to submit to Boost,
> I'd welcome it.

Cool! The docs just need a bit of work; the library code is more than ready.

Thanks for taking the time to go through it!


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, 21 Mar 2017 20:00:39 -0400
Vinnie Falco via Boost <[hidden email]> wrote:

> On Tue, Mar 21, 2017 at 7:36 PM, Niall Douglas via Boost
> <[hidden email]> wrote:
> > On 21/03/2017 20:36, Vinnie Falco via Boost wrote:
> > I am unsure what benefit it
> > confers over the much safer choices of SQLite or UnQLite both of
> > which are battle tested and written by database experts to ensure
> > that other people's data is never, ever lost.  
>
> SQLite is painfully slow compared to NuDB. UnQLite supports features
> not present in NuDB such as cursor iteration, moving, and deleting.
> The tradeoff for not supporting these features is that NuDB can make
> better claims about its performance.
>

Have you compared this to LMDB? That should be the best comparison for
performance and features, although I've never seen LMDB compared
directly to UnQLite.

One thing that I noticed is that the documentation of NuDB indirectly
indicates that loss of recent data can occur, since an insertion is not
immediately synced to disk. This is going to give NuDB a write
advantage when comparing against many databases.

> > I felt when I reviewed NuDB's implementation some time ago that your
> > implementation would be highly susceptible to data loss in a wide
> > set of circumstances.  
>
> NuDB makes some assumptions about the underlying file system. In
> particular, that when a call to fsync() returns it is assured the data
> is reliably written. If these invariants are met there is no data
> loss.
>
> In fact, one of NuDB's unit tests exhaustively verifies that the
> database recovery process performs correctly after an I/O failure. The
> unit test works by instantiating a database with a custom File
> template parameter. This custom file implementation simulates an I/O
> failure on the Nth operation. The test runs with N=1 to failure, then
> increments N, and re-runs the test until there are no failures. At
> each iteration, the test verifies that the database recover process
> left the database in a consistent state. You can see this code here:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I don't doubt that you think this database is vulnerable to data loss
> under some condition. However, I would appreciate specifics so that I
> can verify under what conditions your claim is correct, or if it is
> correct at all.

The pathological case is power loss, which cannot be unit tested
easily (and probably isn't possible). I _think_ you took this into
consideration with the log file approach. But one potential issue is
when power loss occurs before the log file header is completely
synced. The filesystem could sync the file metadata first (size, etc.),
so the file appears large enough to contain the entire header metadata,
but actually contains nothing or only a portion of the data. Niall, thoughts?

Also:

  - The key file is append-only, so NuDB has multiple blocks of sorted
    records. Hopefully the insertions do not create lots of blocks, or
    the algorithm degrades into a linear search.
  - There does not appear to be an advantage to writing the blocks
    to the log file first. There is already a time window after an
    insertion where data loss can occur, so the data and key file sizes
    in the header seem sufficient for rollback.

> > I also felt it wouldn't scale well to the > 1M IOPS
> > storage just around the corner.  
>
> While it hasn't yet been tested on these non-existent devices, I am
> optimistic, since NuDB was specifically designed for very intense
> fetch workloads with light concurrent insertions. Fetches can happen
> concurrently, it should be possible to saturate the read throughput of
> a device. See:
> https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_store.ipp#L213

This is surprisingly unlikely to scale to the IOPS Niall mentioned. The
reader lock requires exclusive cacheline access to synchronize with the
writer, which forces the core to wait for a response from other
cores/processors. IIRC the problem is amplified in multi-socket
systems where the communication must leave the "die". The speculative
locking feature may help here, but Niall tested this a while ago and
showed some surprising performance quirks (although I do not recall
specifics). Niall or Dimov should be better informed on this subject,
and may correct me a bit.

Generally this is why something like LMDB does so well in benchmarks;
the design/implementation was tailored to modern virtual page caching
and cachelines. Which isn't to say it's 100% perfect, just that it's
likely to perform better than any implementation which does not take
those system designs into consideration.

> > I would suspect it is being
> > used in a highly restricted use case where corner cases don't
> > present.  
>
> Maybe. I'll counter by saying, there are only three operations
> possible on an open database: insert, fetch, and close.
>
> > if you are up for getting this to a state ready to submit to Boost,
> > I'd welcome it.  
>
> Cool! The docs just need a bit of work, the library code is more than
> ready.
>
> Thanks for taking the time to go through it!
>

Lee


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost
<[hidden email]> wrote:
> Have you compared this to LMDB?

In our application (rippled) LMDB performs quite poorly even compared
to LevelDB, which performed poorly compared to NuDB. It's true that the
rippled workload is quite niche.

> The pathological case is power loss, which cannot be unit tested
> easily (and probably isn't possible).

I think this can be unit tested, and I believe that NuDB's unit test
covers the case of power loss. I think we can agree that power loss on
a read is uninteresting (since it can't corrupt data). The unit test
models a power loss as a fatal error during a write. The test
exercises all possible fatal errors using an incremental approach (I
alluded to this in my previous message).

The database template requires a File type to instantiate:

  template<class Hasher, class File>
  class basic_store;

NuDB comes with posix_file and win32_file. The unit test uses its own
fail_file which wraps the native file and generates a simulated error
on the Nth write:
https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.hpp#L143

The test implementation tries to insert a bunch of records into a new
database using a fail_file, incrementing the counter as it goes. On a
failure it tries to recover (which can also result in a failed write).
After the recovery attempt, whether or not it failed, the test verifies
that the database is still consistent. You can see the test implementation:
https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
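
In outline, the loop is shaped roughly like this (a sketch with made-up
helper names, not the actual test code):

  // Sketch only: the helper names here are hypothetical, not NuDB's test code.
  for(std::size_t n = 1;; ++n)
  {
      fail_counter fc(n);          // arms fail_file to fail on the n-th write
      error_code ec;
      insert_records(fc, ec);      // all I/O goes through the fail_file wrapper
      if(!ec)
          break;                   // n exceeded the number of writes: test done
      recover(fc, ec);             // recovery may itself hit the injected failure
      verify_consistency();        // database must be consistent either way
  }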

I *think* this test proves that the implementation maintains
consistency on any power loss. I would very much like anyone who is
interested to tell me whether this approach works.

>   - The key-file is append only so NuDB has multiple blocks of sorted
>     records. So hopefully the insertions do not cause lots of blocks or
>     the algorithm becomes a linear search.

There's no sorting going on here; the key file is an on-disk hash
table where each bucket holds a good-sized number of keys (so that a
single read from the key file does the most work).
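
A minimal in-memory model of that layout, just to illustrate the shape
(not the actual on-disk format):

  // Illustrative model only: one bucket per hash slot, each holding many
  // entries, with overflowed entries spilling into chained buckets.
  #include <cstdint>
  #include <memory>
  #include <vector>

  struct entry  { std::uint64_t hash; std::uint64_t offset; std::uint64_t size; };
  struct bucket { std::vector<entry> entries; std::unique_ptr<bucket> spill; };

  // Return the entry whose hash matches h, or nullptr; the caller then reads
  // the value from the data file at entry::offset.
  entry const* find(std::vector<bucket> const& buckets, std::uint64_t h)
  {
      for(bucket const* b = &buckets[h % buckets.size()]; b; b = b->spill.get())
          for(entry const& e : b->entries)
              if(e.hash == h)
                  return &e;
      return nullptr;
  }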

> This is surprisingly unlikely to scale to the IOPS Niall mentioned. The
> reader lock requires exclusive cacheline access to synchronize with the
> writer, which forces the core to wait for a response from other
> cores/processors.

That's possible. I haven't benchmarked it for that scenario and I am
not knowledgeable on optimizing for memory access / cache performance.

Thanks


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
> I think this can be unit tested, and I believe that NuDB's unit test
> covers the case of power loss. I think we can agree that power loss on
> a read is uninteresting (since it can't corrupt data). The unit test
> models a power loss as a fatal error during a write. The test
> exercises all possible fatal errors using an incremental approach (I
> alluded to this in my previous message).

A power loss is more like a fatal error that fails to execute any
subsequent clean-up code, so it might not be quite the same.

There are also more pathological cases such as where a write has been
partially successful and done some subset of increasing the file size,
zeroing the extra file space, and writing some subset of the intended
data.  So it's not necessarily that data is missing; there might be
invalid data in its place.




Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, Mar 21, 2017 at 11:08 PM, Vinnie Falco via Boost  wrote:
> I just released version 1.0.0 of my header-only C++11 library NuDB.
> It's quite a mature library; I haven't had to make any changes in a
> while. It's been running on production servers for two years or so with
> no problems, managing ever-growing databases of around 2 TB.
>
> I'm wondering what the level of interest, if any, there would be for
> making this a part of Boost. You can check it out here:

I only had a brief look - one quick question: Is your arena allocating
storage for a sequence of uint8_t's, and you're constructing 'element'
objects in that storage?

Glen


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Wed, Mar 22, 2017 at 05:13:31PM +1300, Gavin Lambert via Boost wrote:

> On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
> >I think this can be unit tested, and I believe that NuDB's unit test
> >covers the case of power loss. I think we can agree that power loss on
> >a read is uninteresting (since it can't corrupt data). The unit test
> >models a power loss as a fatal error during a write. The test
> >exercises all possible fatal errors using an incremental approach (I
> >alluded to this in my previous message).
>
> A power loss is more like a fatal error that fails to execute any subsequent
> clean-up code, so it might not be quite the same.

Agreed.

The hardware is losing power. The peripheral devices will likely not all lose
power simultaneously. The kernel is suffering a fatal crash. In my experience,
when you pull the power plug on a server, the execution of the kernel simply
ends without any trace of what happened. The application will suffer a fatal
error without any trace of it having happened.

Karen.
--
Karen Shaeffer                 The subconscious mind is driven by your deeply
Neuralscape Services           held beliefs -- not your deeply held desires.


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Wed, Mar 22, 2017 at 5:13 AM, Gavin Lambert via Boost
<[hidden email]> wrote:

> On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
>>
>> I think this can be unit tested, and I believe that NuDB's unit test
>> covers the case of power loss. I think we can agree that power loss on
>> a read is uninteresting (since it can't corrupt data). The unit test
>> models a power loss as a fatal error during a write. The test
>> exercises all possible fatal errors using an incremental approach (I
>> alluded to this in my previous message).
>
>
> A power loss is more like a fatal error that fails to execute any subsequent
> clean-up code, so it might not be quite the same.

Recovery obviously runs on the next power on. ;)

> There are also more pathological cases such as where a write has been
> partially successful and done some subset of increasing the file size,
> zeroing the extra file space, and writing some subset of the intended data.
> So it's not necessarily that data is missing; there might be invalid data in
> its place.

Do modern FS still allow that to happen? I guess some do..
Don't (log) records contain a checksum to guard against this?


--
Olaf


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Wed, Mar 22, 2017 at 12:36 AM, Niall Douglas via Boost <
[hidden email]> wrote:

> [...] much safer choices of SQLite or UnQLite both of which are battle
> tested

and written by database experts to ensure that other people's data is
> never, ever lost.
>

Very much so for SQLite, but not so for UnQLite IMHO, which is a
"mostly abandoned" project of changing SQLite3 to use a new key-value
store back-end, thus creating in the process a new "SQLite4". The
original was led by Dr Hipp, who later concentrated back on SQLite3,
and that project was basically "ripped off" by some random people.

You shouldn't associate the latter with the former, or claim it's
battle-tested. --DD


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
>> I am unsure what benefit it
>> confers over the much safer choices of SQLite or UnQLite both of which
>> are battle tested and written by database experts to ensure that other
>> people's data is never, ever lost.
>
> SQLite is painfully slow compared to NuDB. UnQLite supports features
> not present in NuDB such as cursor iteration, moving, and deleting.
> The tradeoff for not supporting these features is that NuDB can make
> better claims about its performance.

You are correct that NuDB is a single-purpose design, and so will always
be faster.

However, SQLite configured into "I don't care about data loss" mode is
pretty swift. "Painfully slow" is not the correct term to use except for
its default, data-loss-caring mode.

>> I felt when I reviewed NuDB's implementation some time ago that your
>> implementation would be highly susceptible to data loss in a wide set of
>> circumstances.
>
> NuDB makes some assumptions about the underlying file system. In
> particular, that when a call to fsync() returns it is assured the data
> is reliably written. If these invariants are met there is no data
> loss.
>
> In fact, one of NuDB's unit tests exhaustively verifies that the
> database recovery process performs correctly after an I/O failure. The
> unit test works by instantiating a database with a custom File
> template parameter. This custom file implementation simulates an I/O
> failure on the Nth operation. The test runs with N=1 to failure, then
> increments N, and re-runs the test until there are no failures. At
> each iteration, the test verifies that the database recover process
> left the database in a consistent state. You can see this code here:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I don't doubt that you think this database is vulnerable to data loss
> under some condition. However, I would appreciate specifics so that I
> can verify under what conditions your claim is correct, or if it is
> correct at all.

The single biggest thing which struck me when reading your
implementation was that you make very strong assumptions about your
storage, and that generally leads to corner case data loss.

Plucking an example at random, as it is too long since I examined your
source, but a classic mistake is to assume this is sequentially consistent:

int keyfd, storefd;
write(storefd, data, datalen);
fsync(storefd);
write(keyfd, key, keylen);
fsync(keyfd);

Here the programmer writes the value being stored, persists it, then
writes the key for the newly stored value, and persists that. Most
programmers unfamiliar with filing systems will assume that the fsync to
the storefd cannot happen after the fsync to the keyfd. They are wrong;
that is a permitted reorder. fsyncs are only guaranteed to be sequentially
consistent *on the same file descriptor*, not across different file
descriptors. Technically speaking, you could even dupfd() the same fd to
the same inode and fsync is not required to be sequentially consistent
between those two fds, though most major filing systems do implement that
because the locking is kept at inode level.

I can't remember if NuDB makes that sort of mistake, but the single
biggest thing which struck me was your assumption that storage is
reliable and predictable and that the syscalls you wrote will be
executed in the order you wrote them. None of those are true except in a
very limited set of circumstances, i.e. power never goes off, storage never
fails, ECC RAM is present, bitflips never occur, you are on a very
recent Linux kernel, your ext4 filing system hasn't been mounted with
unusual flags, etc.

You also rely heavily on the kernel page caching architecture to give
your performance. That's a good move as most people trying to invent
their own caching architecture on top of O_DIRECT do a far worse job.
However, to really get IOPS out of the storage just around the corner,
where hitting 1M 4K IOPS is achievable, you can't avoid O_DIRECT.

>> I also felt it wouldn't scale well to the > 1M IOPS
>> storage just around the corner.
>
> While it hasn't yet been tested on these non-existent devices, I am
> optimistic, since NuDB was specifically designed for very intense
> fetch workloads with light concurrent insertions. Fetches can happen
> concurrently, it should be possible to saturate the read throughput of
> a device. See:
> https://github.com/vinniefalco/NuDB/blob/master/include/nudb/impl/basic_store.ipp#L213

The MacBook Pro I am typing on right now hits 300k 4K IOPS just fine.
Some of the big-RAM-drive systems on which I've tested AFIO were busting
past 2M IOPS. These are not non-existent devices; they are just
currently high end, and rapidly heading towards the consumer.

>> I would suspect it is being
>> used in a highly restricted use case where corner cases don't present.
>
> Maybe. I'll counter by saying, there are only three operations
> possible on an open database: insert, fetch, and close.

I was referring more to the use environment. For example, atomic appending
to a file is super slow on ZFS. It may be fast on ext4, and there you
should use atomic appends, but on other filing systems you need to set
the extent to maximum and use lazy allocation via an atomically updated,
manually kept watermark.
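
(By way of illustration only, the watermark approach amounts to
something like this:)

  // Illustration only: pre-extend the file, then hand out write offsets from
  // an atomic watermark instead of relying on O_APPEND.
  #include <atomic>
  #include <cstddef>
  #include <unistd.h>

  std::atomic<off_t> watermark{0};

  ssize_t append(int fd, void const* data, std::size_t len)
  {
      // Reserve len bytes at the current end, then write at the reserved offset.
      off_t const at = watermark.fetch_add(static_cast<off_t>(len));
      return ::pwrite(fd, data, len, at);  // file was ftruncate()'d large up front
  }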

There's lots of this stuff; it's why AFIO v2 has a concept of "storage
profiles" which keep a precanned mix of AFIO filing-system algorithms
for a given storage combo. That way, if you load your key-value store on
ZFS on Linux, the storage profile says to use algorithms A, B and D. If
you load it on NTFS on Windows, use algorithms B, C and F instead. AFIO
provides many algorithms, all accessible via an abstract base class, so
things using them don't need to care.
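
(The shape of that idea, sketched in the abstract rather than AFIO's
actual interface:)

  // Sketch of the pattern only, not AFIO's actual interface.
  #include <cstddef>
  #include <memory>
  #include <string>

  struct append_algorithm
  {
      virtual ~append_algorithm() {}
      virtual void append(int fd, void const* data, std::size_t len) = 0;
  };

  struct storage_profile
  {
      std::unique_ptr<append_algorithm> appender;  // chosen per filing system
      // ... plus the other precanned algorithm choices ...
  };

  // Probe the filing system under `path` and return the best-suited algorithm
  // mix, e.g. atomic appends on ext4, watermark-based appends on ZFS.
  storage_profile make_profile(std::string const& path);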

>> if you are up for getting this to a state ready to submit to Boost,
>> I'd welcome it.
>
> Cool! The docs just need a bit of work, the library code is more than ready.
>
> Thanks for taking the time to go through it!

I appreciate I'm full of nothing but criticism, but I wanted to give you
some idea of how a peer review would go.

I do want to say thank you for writing your library, and for intending to
bring it to Boost. It may well be that the community will accept NuDB if
it very substantially reduces the guarantees it offers to essentially
none, i.e. this is a fast, insert-only key-value store with no guarantees.

But I will mention that if I don't need to provide any guarantees, I
believe I could roll something with AFIO v2 even in its current very
alpha state that will match or beat NuDB in less than a hundred lines of
code and about one week of work. I even reckon I can achieve key-value
deletion if users are willing to accept a reasonable ceiling on total
item count of maybe a few million items, and use of a recent
extents-based filing system.

I don't wish to do your work down, but a key-value store with no
guarantees really is that easy to implement with AFIO v2, right now.
It's the guarantees and achieving consistent performance across
environments which is hard.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Wed, Mar 22, 2017 at 11:43 AM, Niall Douglas via Boost
<[hidden email]> wrote:

> Plucking straight from random as it was too long ago I examined your
> source, but a classic mistake is to assume this is sequentially consistent:
>
> int keyfd, storefd;
> write(storefd, data, datalen);
> fsync(storefd);
> write(keyfd, key, keylen);
> fsync(keyfd);
>
> Here the programmer writes the value being stored, persists it, then
> writes the key to newly stored value, persists that. Most programmers
> unfamiliar with filing systems will assume that the fsync to the storefd
> cannot happen after the fsync to the keyfd. They are wrong, that is a
> permitted reorder. fsyncs are only guaranteed to be sequentially
> consistent *on the same file descriptor* not different file descriptors.

Just curious, how is that permitted?

Isn't fsync() supposed to ensure data is on durable storage before it returns?


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 22/03/2017 04:13, Gavin Lambert via Boost wrote:

> On 22/03/2017 16:08, Vinnie Falco via Boost wrote:
>> I think this can be unit tested, and I believe that NuDB's unit test
>> covers the case of power loss. I think we can agree that power loss on
>> a read is uninteresting (since it can't corrupt data). The unit test
>> models a power loss as a fatal error during a write. The test
>> exercises all possible fatal errors using an incremental approach (I
>> alluded to this in my previous message).
>
> A power loss is more like a fatal error that fails to execute any
> subsequent clean-up code, so it might not be quite the same.
>
> There are also more pathological cases such as where a write has been
> partially successful and done some subset of increasing the file size,
> zeroing the extra file space, and writing some subset of the intended
> data.  So it's not necessarily that data is missing; there might be
> invalid data in its place.

There are a few rings to testing data loss safety:

Ring 1: Does my application code correctly handle all possible errors in
all possible contexts?

This can be tested using Monte Carlo methods, fuzzing, parameter
permutations, unit and functional testing.


Ring 2: Does my code correctly handle sudden stop?

This can be tested using LXC containers where you kill -9 the container
mid-test, with Monte Carlo runs for verification.


Ring 3: Does my code correctly handle sudden kernel stop?

This can be tested using kvm or qemu where you kill -9 the virtualised
OS mid-test.


Ring 4: Does my code correctly handle sudden power loss to the CPU?

This can be tested using a few dozen cheap odroid devices where you
manually trip their watchdog hard reset hardware feature. This solution
has the big advantage of not requiring the SSD used to be sudden power
loss safe :)


Ring 5: Does my code correctly handle sudden power loss to the storage?

It requires more work, and you'll find endless bugs in the kernel, filing
system and the storage device, but you can install a hardware switch to
cut power to the storage device mid-test. This is a never-ending "fun"
task; it's far too uncommonly tested by the kernel vendors, but it's a
great simulation of how well faulty storage is handled.


Ring 6: Does my code correctly handle sudden power loss to the system?

Unlike Ring 5 this is actually a better tested situation. Sudden power
loss to everything at once is probably less buggy than Ring 5. Still,
you can get data loss at any level from the kernel, to the SATA chip, to
the device itself.


There are also other test rings not related to sudden power loss. For
example, single and paired bit flips are not uncommon in terabytes of
storage, either transient or permanent. These can be simulated using kvm
with you manually flipping random bits in the disc image as it runs. You
might become quite appalled at what data gets destroyed by bugs in the
filing system when facing flipped bits.
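
(A crude fault injector for that last case might look like this;
illustrative only:)

  // Crude illustration: flip one random bit somewhere in a disc image file.
  #include <cstdlib>
  #include <fcntl.h>
  #include <unistd.h>

  int flip_random_bit(char const* image_path, off_t image_size)
  {
      int const fd = ::open(image_path, O_RDWR);
      if(fd < 0)
          return -1;
      off_t const pos = static_cast<off_t>(std::rand()) % image_size;
      unsigned char byte;
      if(::pread(fd, &byte, 1, pos) != 1)
      {
          ::close(fd);
          return -1;
      }
      byte ^= static_cast<unsigned char>(1u << (std::rand() % 8));  // flip one bit
      ::pwrite(fd, &byte, 1, pos);
      return ::close(fd);
  }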

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 22/03/2017 09:30, Dominique Devienne via Boost wrote:

> On Wed, Mar 22, 2017 at 12:36 AM, Niall Douglas via Boost <
> [hidden email]> wrote:
>
>> [...] much safer choices of SQLite or UnQLite both of which are battle
>> tested
>
> and written by database experts to ensure that other people's data is
>> never, ever lost.
>>
>
> Very much so for SQLite, but not so for UnQLite IMHO, which is a "mostly
> abandoned"
> project of changing SQLite3 to use a new key-value store back-end, thus
> create in the
> process a new "SQLite4". The original was lead by Dr Hipp, who later
> concentrated back
> on SQLite3, and that project was basically "ripped-off" by some random
> people.

That's interesting, thank you.

> You shouldn't associate the latter with the former, or claim it's battle
> tested. --DD

My claim was based on UnQLite using the same storage abstraction layer
as SQLite. I am very confident in that layer. The stuff UnQLite adds on
top could certainly lose data, but it'll do so much less frequently
and in far fewer of the usual cases, thanks to using the SQLite
abstraction layer.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 22/03/2017 10:50, Olaf van der Spek wrote:

> On Wed, Mar 22, 2017 at 11:43 AM, Niall Douglas via Boost
> <[hidden email]> wrote:
>> Plucking straight from random as it was too long ago I examined your
>> source, but a classic mistake is to assume this is sequentially consistent:
>>
>> int keyfd, storefd;
>> write(storefd, data, datalen);
>> fsync(storefd);
>> write(keyfd, key, keylen);
>> fsync(keyfd);
>>
>> Here the programmer writes the value being stored, persists it, then
>> writes the key to newly stored value, persists that. Most programmers
>> unfamiliar with filing systems will assume that the fsync to the storefd
>> cannot happen after the fsync to the keyfd. They are wrong, that is a
>> permitted reorder. fsyncs are only guaranteed to be sequentially
>> consistent *on the same file descriptor* not different file descriptors.
>
> Just curious, how is that permitted?
>
> Isn't fsync() supposed to ensure data is on durable storage before it returns?

A common misconception. Here is the POSIX wording:

"The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes. The nature of the transfer
is implementation-defined. The fsync() function shall not return until
the system has completed that action or until an error is detected.

[SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the fsync()
function shall force all currently queued I/O operations associated with
the file indicated by file descriptor fildes to the synchronized I/O
completion state. All I/O operations shall be completed as defined for
synchronized I/O file integrity completion. [Option End]"


So, without _POSIX_SYNCHRONIZED_IO, all fsync() guarantees is that it
will not return until the *request* for the transfer of outstanding data
to storage has completed. In other words, it pokes the OS to start
flushing data now rather than later, and may return before the data has
actually reached storage. OS X implements this sort of fsync(), for example.
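
(On OS X you have to ask for the stronger behaviour explicitly; a
minimal sketch:)

  // Minimal sketch: on OS X, plain fsync() does not force data through the
  // drive's write cache; fcntl(F_FULLFSYNC) requests that, at a large cost.
  #include <fcntl.h>
  #include <unistd.h>

  int full_sync(int fd)
  {
  #if defined(F_FULLFSYNC)
      return ::fcntl(fd, F_FULLFSYNC);
  #else
      return ::fsync(fd);   // elsewhere, fall back to plain fsync()
  #endif
  }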

With _POSIX_SYNCHRONIZED_IO, you get stronger guarantees that upon
return from the syscall, "synchronized I/O file integrity completion"
has occurred. Linux infamously claims _POSIX_SYNCHRONIZED_IO, yet
ext2/ext3/ext4 don't implement it fully and will happily reorder fsyncs
of the metadata needed to later retrieve an fsynced write of data. So the
data itself is sequentially consistent as written on fsync return, but
the metadata needed to later retrieve it is not; that can be reordered
with respect to other fsyncs.

AFIO v1 and v2 take care of this sort of stuff for you. If you tell AFIO
you want a handle to a file to write reliably, AFIO does what is needed
to make it reliable. Be prepared to give up lots of performance, however
(and hence why async file i/o starts to become very useful: you can
queue up lots of writes, and completion handlers will fire when the
write really has reached storage in a way that is always retrievable in
the future - excluding bugs in the kernel, filing system, storage device, etc.).

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/



Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On 23/03/2017 00:46, Niall Douglas via Boost wrote:
> On 22/03/2017 10:50, Olaf van der Spek wrote:
>> Isn't fsync() supposed to ensure data is on durable storage before it returns?
[...]

> So, without _POSIX_SYNCHRONIZED_IO, all fsync() guarantees is that it
> will not return until the *request* for the transfer of outstanding data
> to storage has completed. In other words, it pokes the OS to start
> flushing data now rather than later, and returns immediately. OS X
> implements this sort of fsync() for example.
>
> With _POSIX_SYNCHRONIZED_IO, you get stronger guarantees that upon
> return from the syscall, "synchronized I/O file integrity completion"
> has occurred. Linux infamously claims _POSIX_SYNCHRONIZED_IO, yet
> ext2/ext3/ext4 don't implement it fully and will happily reorder fsyncs
> of the metadata needed to later retrieve a fsynced write of data. So the
> data itself is written on fsync return sequentially consistent, but not
> the metadata to later retrieve it, that can be reordered with respect to
> other fsyncs.

There are other factors, too; even if the OS has fully flushed the data to
the storage device, the storage device itself might be holding some of
it in an internal volatile buffer and may possibly even perform actual
non-volatile writes out of order.

(Disclaimer: I haven't looked at the internals of storage hardware for a
long time so perhaps they make better guarantees than I think.)




Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Wed, Mar 22, 2017 at 12:21 AM, Glen Fernandes via Boost
<[hidden email]> wrote:
> I only had a brief look - one quick question: Is your arena allocating
> storage for a sequence of uint8_t's, and you're constructing 'element'
> objects in that storage?

Kind of: each object allocated in the arena is an `element` type,
followed by a variable number of bytes of storage.
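
Roughly this layout, as a simplified sketch (not the actual code):

  // Simplified sketch of the arena layout, not the actual NuDB code: each
  // allocation is an element header immediately followed by `size` value bytes.
  #include <cstddef>
  #include <cstdint>
  #include <new>

  struct element
  {
      element*    next;   // intrusive list of elements, in insertion order
      std::size_t size;   // number of value bytes that follow this header
      std::uint8_t* data() { return reinterpret_cast<std::uint8_t*>(this + 1); }
  };

  // `p` points at raw arena storage with room for sizeof(element) + size bytes,
  // suitably aligned for element.
  element* construct_element(void* p, std::size_t size)
  {
      element* e = ::new(p) element;
      e->next = nullptr;
      e->size = size;
      return e;   // caller copies the value bytes into e->data()
  }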


Re: NuDB: A fast key/value insert-only database for SSD drives in C++11

Boost - Dev mailing list
On Tue, 21 Mar 2017 23:08:30 -0400
Vinnie Falco via Boost <[hidden email]> wrote:
> On Tue, Mar 21, 2017 at 10:50 PM, Lee Clagett via Boost
> <[hidden email]> wrote:
>
[...snip...]

> > The pathological case is power loss, which cannot be unit tested
> > easily (and probably isn't possible).  
>
> I think this can be unit tested, and I believe that NuDB's unit test
> covers the case of power loss. I think we can agree that power loss on
> a read is uninteresting (since it can't corrupt data). The unit test
> models a power loss as a fatal error during a write. The test
> exercises all possible fatal errors using an incremental approach (I
> alluded to this in my previous message).
>
> The database template requires a File type to instantiate:
>
>   template<class Hasher, class File>
>   class basic_store;
>
> NuDB comes with posix_file and win32_file. The unit test uses its own
> fail_file which wraps the native file and generates a simulated error
> on the Nth write:
> https://github.com/vinniefalco/NuDB/blob/master/extras/nudb/test/fail_file.hpp#L143
>
> The test implementation tries to insert a bunch of records into a new
> database using a fail_file, incrementing the counter as it goes. On a
> failure it tries to recover (which can also result in a failed write).
> After the recover fails, it verifies that the database is still
> consistent. You can see the test implementation:
> https://github.com/vinniefalco/NuDB/blob/master/test/recover.cpp#L125
>
> I *think* this test proves that the implementation maintains
> consistency on any power loss. I would very much like anyone who is
> interested to tell me whether this approach works.

The other responses to this thread reiterated what I thought could
occur - there should be corruption "races" from a write call to file
sync completion.

Writing the blocks to the log file is superfluous, because it involves
writing to multiple sectors and there is no mechanism to detect a
partial write after power failure. So that information cannot be read
back reliably. If these writes were omitted, the presence of the log
file header would indicate whether a commit failed to complete, and this
header should fit in a single sector. The key and data file sizes in
that header indicate what portion at the end of the key and data files
should be ignored.
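
Concretely, something like this would do (illustrative, not NuDB's
actual log format):

  // Illustrative single-sector commit record, not NuDB's actual log format.
  // On recovery: if the checksum does not match, the commit never completed;
  // truncate the key and data files back to the recorded sizes and discard
  // everything past them.
  #include <cstdint>

  struct commit_header
  {
      std::uint64_t key_file_size;  // valid length of the key file
      std::uint64_t dat_file_size;  // valid length of the data file
      std::uint64_t checksum;       // over the two sizes; detects a torn write
  };
  static_assert(sizeof(commit_header) <= 512, "must fit in one 512-byte sector");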

If the log file header were moved to the beginning of the key file, then
after the records are appended and synced, the header could be overwritten
with the new file lengths. The key file would not need to be opened with
the `_POSIX_SYNCHRONIZED_IO` flag, and an fsync after writing the header
could be omitted (which means a commit also gets "relaxed" disk flushing).
This should improve commit performance.

At this point I'm sure Niall will tell me that writing <512 bytes to
the beginning of the file is also not an atomic operation, which is
exactly what the SQLite documentation assumes. The power-loss recovery
algorithm gets tougher with this assumption.

> >   - The key-file is append only so NuDB has multiple blocks of
> > sorted records. So hopefully the insertions do not cause lots of
> > blocks or the algorithm becomes a linear search.  
>
> There's no sorting going on here, the key file is an on-disk hash
> table where each bucket holds a good-sized number of keys (so that a
> single read from the key file does the most work).

I jumped straight into the internal fetch function, which searches
sorted records within a single bucket and follows a linked list of
spills. Reading the README first would've made it clear that there was
more to the implementation.

So the worst-case performance is a linked-list walk on a hash collision.
Was performance the primary reason for the choice of the default hash
implementation?

Lee
