[text] SIMD UTF-8 decoding

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[text] SIMD UTF-8 decoding

Boost - Dev mailing list
Dear All,

I have been looking at the UTF-8 decoding code in the proposed
Boost.Text, as this is a problem I've looked at myself in the past.
I've mentioned an issue with the copyright in another message.
Here are my other observations.

1. The SIMD code is x86-specific.  It doesn't need to be; I think
it could use gcc's vector builtins to do the same thing and be
portable to other SIMD implementations.  (Clang provides the same
builtins; I'm not sure about what you need to do on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else.  I guess I was
expecting something more sophisticated.

3. The code used for bytes >= 0x80, and in all cases for non-x86,
is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
around lines 400-560.  It implements a state machine, which surprises
me; it takes much less code and gives better performance if you write
out the bit-testing and shifting etc. explicitly.  This seems to be
about 50% slower than my existing UTF-8 decoding code.

4. There aren't enough comments anywhere in the code I've looked at!


Regards, Phil.







_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list
On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
<[hidden email]> wrote:

>
> Dear All,
>
> I have been looking at the UTF-8 decoding code in the proposed
> Boost.Text, as this is a problem I've looked at myself in the past.
> I've mentioned an issue with the copyright in another message.
> Here are my other observations.
>
> 1. The SIMD code is x86-specific.  It doesn't need to be; I think
> it could use gcc's vector builtins to do the same thing and be
> portable to other SIMD implementations.  (Clang provides the same
> builtins; I'm not sure about what you need to do on MSVC/Windows.)
> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

That page describes vector-friendly data types and arithmetic
operations.  It does not seem to support the operations actually used
in the code currently in Boost.Text.

> 2. The SIMD code only seems to provide a fast path for bytes < 0x80,
> falling back to sequential code for everything else.  I guess I was
> expecting something more sophisticated.

The code makes the fast path extra fast, but the slow path, being
quite branchy, is not really amenable to vectorization.  If you have
an implementation that proves that claim false, I'm happy to use it.

> 3. The code used for bytes >= 0x80, and in all cases for non-x86,
> is here:
> https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
> around lines 400-560.  It implements a state machine, which surprises
> me; it takes much less code and gives better performance if you write
> out the bit-testing and shifting etc. explicitly.  This seems to be
> about 50% slower than my existing UTF-8 decoding code.

Could you point me to that code, and let me use your benchmarks to
verify?  I'm happy to do something faster!

> 4. There aren't enough comments anywhere in the code I've looked at!

I only put comments where something unclear or unexpected is
happening.  The intention is that the rest of the code is clear enough
to read on its own.  Particularly in the case of Boost.Text, where
most of the code follows one or more Unicode specifications, I tend to
put a comment indicating where the online description of an algorithm
might be found, and that's it -- except for API docs, of course.

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Mon, 15 Jun 2020 at 14:06, Phil Endecott via Boost <[hidden email]>
wrote:

> Dear All,
>
> I have been looking at the UTF-8 decoding code in the proposed
> Boost.Text, as this is a problem I've looked at myself in the past.
> I've mentioned an issue with the copyright in another message.
> Here are my other observations.
>
> 1. The SIMD code is x86-specific.  It doesn't need to be; I think
> it could use gcc's vector builtins to do the same thing and be
> portable to other SIMD implementations.  (Clang provides the same
> builtins; I'm not sure about what you need to do on MSVC/Windows.)
> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html


Bar writing intel assembler for ml64 [a stand-alone assembler, the only
option on Windows-x64 is Intel Intrinsics (
https://software.intel.com/sites/landingpage/IntrinsicsGuide/# ). Over and
above the Intel Intrinsics, Microsoft offers 'a number of extensions',
which often correspond to some builtin_ in clang/gcc. I'm not sure this is
related, but there is also the vectorcall-ABI.

degski

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Zach Laine wrote:

> On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
> <[hidden email]> wrote:
>> 1. The SIMD code is x86-specific.  It doesn't need to be; I think
>> it could use gcc's vector builtins to do the same thing and be
>> portable to other SIMD implementations.  (Clang provides the same
>> builtins; I'm not sure about what you need to do on MSVC/Windows.)
>> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
>
> That page describes vector-friendly data types and arithmetic
> operations.  It does not seem to support the operations actually used
> in the code currently in Boost.Text.

I think it has most of what's needed, though it seems that the
type conversion __builtin_convertvector, which is needed to
expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
in newer versions of g++ than I have.


>> 2. The SIMD code only seems to provide a fast path for bytes < 0x80,
>> falling back to sequential code for everything else.  I guess I was
>> expecting something more sophisticated.
>
> The code makes the fast path extra fast, but the slow path, being
> quite branchy, is not really amenable to vectorization.  If you have
> an implementation that proves that claim false, I'm happy to use it.

Well I've never written any SIMD code before, but I thought I'd
have a look.  The following is just a proof of concept; it just
converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1.
(Because, as noted above, I don't have a good way to spread out
the bytes.)

int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p)
{
  // Attempt to decode the subset of UTF-8 with code points < 256.
  // Format is either 0xxxxxxx          -> 0xxxxxxx
  //               or 110---xx 10yyyyyy -> xxyyyyyy
  // The input mustn't start or finish in the middle of a multi-byte
  // character.
  // Other inputs produce undefined outputs.

  // Process 16 bytes at a time (for 128-bit SIMD).
  using uint8_x16 = uint8_t  __attribute__((vector_size(16)));

  // Load input:
  uint8_x16 in;
  for (int i = 0; i < 16; ++i) in[i] = in_p[i];

  // Each byte is one of three types:
  uint8_x16 is_onebyte       = (in & 0b10000000) == 0b00000000;
  uint8_x16 is_first_of_two  = (in & 0b11100000) == 0b11000000;
  uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000;

  // (If all bytes are is_onebyte we could take a fast path now.)

  // Get the bits that each byte contributes to the result:
  uint8_x16 bits = is_onebyte       ? in
                 : is_first_of_two  ? (in << 6)
                 : is_second_of_two ? (in & 0b00111111)
                 : 0;

  // Make a shifted copy:
  // (Isn't there a better way to do this?)
  uint8_x16 shifts = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 };
  uint8_x16 shifted = __builtin_shuffle(bits,shifts);

  // Combine the two bytes, where required:
  uint8_x16 combined = is_first_of_two ? (bits | shifted)
                     : bits;

  // Form a shuffle to pack the result, skipping the second bytes;
  // this is non-SIMD.  Any tricks to make this quicker?
  uint8_x16 shuffle;
  int to = 0, from = 0;
  while (from < 16) {
    shuffle[to++] = from++;
    if (is_second_of_two[from]) ++from;
  }

  // Apply the shuffle:
  uint8_x16 packed = __builtin_shuffle(combined,shuffle);

  // Store result.
  for (int i = 0; i < 16; ++i) out_p[i] = packed[i];
  return to;  // Number of characters in result.
}

That seems to work, and does generate vector instructions on ARM64.
I've no idea if it faster than the alternatives.


>> 3. The code used for bytes >= 0x80, and in all cases for non-x86,
>> is here:
>> https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
>> around lines 400-560.  It implements a state machine, which surprises
>> me; it takes much less code and gives better performance if you write
>> out the bit-testing and shifting etc. explicitly.  This seems to be
>> about 50% slower than my existing UTF-8 decoding code.
>
> Could you point me to that code, and let me use your benchmarks to
> verify?  I'm happy to do something faster!

Essentially you could just replace the body of the advance() function at
line 536 of transcode_iterator.hpp with something like this:

char8_t b0 = *(p++);
IF_LIKELY((b0&0x80)==0) return b0;
char8_t b1 = *(p++);
IF_LIKELY((b0&0xe0)==0xc0) return (b1&0x3f) | ((b0&0x1f)<<6);
char8_t b2 = *(p++);
IF_LIKELY((b0&0xf0)==0xe0) return (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
char8_t b3 = *(p++);
IF_LIKELY((b0&0xf8)==0xf0) return (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);

That will be quick, but it does lack a few things; it doesn't check if
it has reached the end of the input and it doesn't do any error checking.  


Regards, Phil.



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list
> I think it has most of what's needed, though it seems that the
> type conversion __builtin_convertvector, which is needed to
> expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
> in newer versions of g++ than I have.
Than it's likely not very useful for now. Maybe later once that compiler
version is more wide-spread
>    // Attempt to decode the subset of UTF-8 with code points < 256.
>    // Format is either 0xxxxxxx          -> 0xxxxxxx
>    //               or 110---xx 10yyyyyy -> xxyyyyyy
>    // The input mustn't start or finish in the middle of a multi-byte
>    // character.
>    // Other inputs produce undefined outputs.
Good code for that special case. But I think "undefined outputs" is not
acceptable. I've seen other SIMD UTF-8 conversions around and they
basically all focus on ASCII converting as much as possible and fallback
to one-by-one decoding once a non-ascii is found
> That will be quick, but it does lack a few things; it doesn't check if
> it has reached the end of the input and it doesn't do any error checking.

So not really usable either. BUT: Compare to Boost.Locale which has a
`decode` and `decode_valid` function where the latter assumes valid UTF-8
However checking for end-of-input is a must obviously.

BTW: Does Boost.Text have functions or overloads where you can specify
that text is in a specific encoding/normalization?
If not I think this should be added. Sometimes you get text from an
internal function and know those things so you can skip verification and
conversion




_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

smime.p7s (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list
Alexander Grund wrote:
> I've seen other SIMD UTF-8 conversions around and they
> basically all focus on ASCII converting as much as possible and
> fallback to one-by-one decoding once a non-ascii is found

The question is, do they do that because they've determined that
that gives the best performance (for some benchmark input), or
have they not tried to do more with the SIMD code?


Regards, Phil.





_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [text] SIMD UTF-8 decoding

Boost - Dev mailing list

Am 18.06.20 um 13:10 schrieb Phil Endecott via Boost:
> Alexander Grund wrote:
>> I've seen other SIMD UTF-8 conversions around and they basically all
>> focus on ASCII converting as much as possible and fallback to
>> one-by-one decoding once a non-ascii is found
>
> The question is, do they do that because they've determined that
> that gives the best performance (for some benchmark input), or
> have they not tried to do more with the SIMD code?
I guess the former which would be my intuition. It is easy to detect the
first byte of a multi-byte UTF-8 sequence in SIMD and also easy to bulk
convert single-byte UTF-8 sequences. Once you get to converting the
multi-byte sequence then SIMD doesn't make sense anymore. To much
checking to do: How many bytes to "squash", end-of-input, shortest
value, legal value, ...
So summary: Once it requries branching it doesn't make sense to use SIMD
anymore.



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

smime.p7s (6K) Download Attachment