What happened to Boost.Nowide?

17 messages

What happened to Boost.Nowide?

Boost - Dev mailing list
Hi,

I'm a user of (a fork of) Boost.Nowide and already fixed some issues I
found and was looking into getting them upstream.

I also wanted to know, if it is finally in Boost. Unfortunately this
does not seem to be the case. I found
https://lists.boost.org/Archives/boost//2017/06/236475.php which
accepted it into Boost. This is from mid-2017 and nothing has happened
since.

Does anyone know what the status of Boost.Nowide is? It seems the
filestream parts are now incorporated into Boost.FileSystem. So it seems
only cin/cout/cerr, the args wrapper and the C-functions (fopen, ...)
are missing. Especially the first 2 are very useful in writing
cross-platform code.
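
To illustrate the C-function part, here is a minimal sketch of what a UTF-8-aware `fopen` wrapper can look like. The names `utf8_fopen` and `widen` are hypothetical and this is not Boost.Nowide's actual code; it only assumes the Win32 call `MultiByteToWideChar` and the Microsoft CRT function `_wfopen`. On other platforms it simply forwards to `std::fopen`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#ifdef _WIN32
#include <string>
#include <windows.h>

// Convert a NUL-terminated UTF-8 string to UTF-16.
// Returns an empty string if the input is empty or not valid UTF-8.
static std::wstring widen(const char* utf8) {
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8, -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, &wide[0], len);
    wide.resize(static_cast<std::size_t>(len) - 1);  // drop the terminator counted in len
    return wide;
}
#endif

// Hypothetical wrapper: callers always pass UTF-8, regardless of platform.
std::FILE* utf8_fopen(const char* path, const char* mode) {
    assert(path && mode);
#ifdef _WIN32
    const std::wstring wpath = widen(path);
    const std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    return std::fopen(path, mode);  // assumption: the narrow API already takes UTF-8
#endif
}
```
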

Might those be integrated into some other Boost Library?

Thanks,
Alex




_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost


Re: What happened to Boost.Nowide?

On Wed, Nov 6, 2019 at 2:11 AM Alexander Grund via Boost <
[hidden email]> wrote:

> Does anyone know what the status of Boost.Nowide is? It seems the
> filestream parts are now incorporated into Boost.FileSystem. So it seems
> only cin/cout/cerr, the args wrapper and the C-functions (fopen, ...)
> are missing. Especially the first 2 are very useful in writing
> cross-platform code.
>

I don't know the status of Boost.Nowide. However, its usefulness has
diminished with the introduction of UTF-8 code page support in Windows 10 in
May this year. See
https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?


On 07.11.19 at 03:39, Yakov Galka wrote:

> On Wed, Nov 6, 2019 at 2:11 AM Alexander Grund via Boost wrote:
>
>     Does anyone know what the status of Boost.Nowide is? It seems the
>     filestream parts are now incorporated into Boost.FileSystem. So it
>     seems
>     only cin/cout/cerr, the args wrapper and the C-functions (fopen, ...)
>     are missing. Especially the first 2 are very useful in writing
>     cross-platform code.
>
>
> I don't know the status of Boost.Nowide. However, its usefulness has
> diminished with the introduction of UTF-8 codepage support in Windows
> 10 in May this year. See
> https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page.

Interesting, thanks! Great to see that someone at MS finally made the
right decision, so that Windows is no longer the only OS not supporting UTF-8.

However, it does require Windows 10 version 1903 at minimum and a change to
the manifest(s). So maybe Nowide still has some use.
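
Whether a process actually runs with the UTF-8 code page can be checked at run time. A minimal sketch, assuming the Win32 `GetACP` function and the `CP_UTF8` constant; the helper name `narrow_api_is_utf8` is made up for this example, and the non-Windows branch simply assumes a UTF-8 environment:

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Returns true when narrow-character ("ANSI") APIs can be expected to take UTF-8.
bool narrow_api_is_utf8() {
#ifdef _WIN32
    // GetACP() reports the active ANSI code page of the process;
    // CP_UTF8 (65001) is what the manifest setting discussed above selects.
    return GetACP() == CP_UTF8;
#else
    // Assumption for this sketch: POSIX-like systems pass bytes through
    // unchanged, so UTF-8 input reaches the file system as-is.
    return true;
#endif
}
```

A program could fall back to explicit wide-character conversions whenever this returns false, which is the case on older Windows versions or without the manifest change.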

Alex




Re: What happened to Boost.Nowide?

On 07.11.19 at 03:39, Yakov Galka wrote:
> However, its usefulness has diminished with the introduction of
> UTF-8 codepage support in Windows 10 in May this year. See
> https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page.

I just realized how unfortunate it is that this didn't happen 3
years (or so) ago. Now not only does `boost::filesystem::path` use
`wchar_t`, but the C++17 `std::filesystem::path` does so as well. So we now
have costly conversions and waste half the space on Windows for no gain :/
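
The "half the space" remark can be made concrete with a small sketch. The helper names are hypothetical, and the comparison assumes a path made only of ASCII characters and a platform where `char16_t` is two bytes wide:

```cpp
#include <cstddef>
#include <string>

// Storage needed for the code units of a narrow (e.g. UTF-8) path.
std::size_t bytes_as_utf8(const std::string& path) {
    return path.size() * sizeof(char);
}

// Storage needed for the code units of the same path held as UTF-16.
std::size_t bytes_as_utf16(const std::u16string& path) {
    return path.size() * sizeof(char16_t);
}
```

For an ASCII-only path such as `C:/Users/alex/notes.txt`, the UTF-16 form takes twice as many bytes as the narrow form, on top of the conversion performed whenever a narrow string is turned into a path.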





Re: What happened to Boost.Nowide?

On Mon, Nov 11, 2019 at 4:19 AM Alexander Grund via Boost <
[hidden email]> wrote:

> I just noticed that it is very unfortunate, that this didn't happen 3
> years (or so) ago. Now not only `boost::filesystem::path` is using
> `wchar` but also the C++17 `std::filesystem::path` does so. So we now
> have costly conversions and wasting half the space on windows for no gain
> :/
>

I raised this issue many years ago. In fact boost filesystem v2 was better
in this respect, because it followed the established convention of having a
templated basic_path<char>, thus not committing to a specific char type.
Alas, v2 was deprecated and v3 was lobbied into WG21 for standardization.
It was an unprecedented case of introducing a "char on some platforms,
wchar_t on others" interface into the standard, which is a bad decision
from a portability standpoint.

While we are at it, I would like to say that boost filesystem should have
never introduced a path class in the first place. filesystem::path is just
a glorified string with no extra invariants. Any string -> path conversion
copies the data, even if it's already in the right encoding, even on
operating systems that don't need any conversions at all. There goes your
"don't pay for what you don't use" principle. Most can agree that C++'s
spirit is to separate containers from algorithms. A proper design would
introduce path manipulation functions that work on any string types, and
let users use std::string or even char[] for storage.
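
As a rough illustration of such a storage-agnostic function, here is a sketch that finds the final path component in any character range without copying. The names `filename_begin` and `filename_offset` are invented for this example, and it assumes `/` is the only separator, which is a simplification:

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Returns an iterator to the start of the final component of [first, last).
// Works on std::string, std::wstring, plain char arrays, and so on.
template <class ForwardIt>
ForwardIt filename_begin(ForwardIt first, ForwardIt last) {
    ForwardIt start = first;
    for (ForwardIt it = first; it != last; ++it) {
        if (*it == '/') start = std::next(it);  // remember the position after each separator
    }
    return start;
}

// Convenience overload for std::string storage: offset of the final component.
inline std::size_t filename_offset(const std::string& path) {
    return static_cast<std::size_t>(
        filename_begin(path.begin(), path.end()) - path.begin());
}
```

The caller keeps ownership of the storage; nothing is converted or copied unless the caller asks for it.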

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?

On Mon, Nov 11, 2019 at 1:09 AM Alexander Grund via Boost <
[hidden email]> wrote:

> However it does require Win10 1903 minimum and a change to the
> manifest(s). So maybe Nowide still has some use.
>

Looks like there is a way to set UTF-8 globally for existing applications:
https://stackoverflow.com/q/56419639/277176
Though I haven't tried it yet.

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?

On 4/12/2019 05:25, Yakov Galka wrote:
> On Mon, Nov 11, 2019 at 4:19 AM Alexander Grund wrote:
> I raised this issue many years ago. In fact boost filesystem v2 was better
> in this respect, because it followed the established convention of having a
> templated basic_path<char>, thus not committing to a specific char type.
> Alas, v2 was deprecated and v3 was lobbied into WG21 for standardization.
> It was an unprecedented case of introducing a "char on some platforms,
> wchar_t on others" interface into the standard, which is a bad decision
> from portability stand point.

While I agree in principle, the simple fact is that performing string
transcoding on filesystem paths is a Very Bad Idea™, since both Windows
and Linux treat them as opaque byte sequences -- but Windows' native
encoding is UTF-16 and Linux' is (mostly) UTF-8.

So, while unfortunate, v3 made the correct choice.  Paths have to be
kept in their original encoding between original source (command line,
file, or UI) and file API usage, otherwise you can get weird errors when
transcoding produces a different byte sequence that appears identical
when actually rendered, but doesn't match the filesystem.  Transcoding
is only safe when you're going to do something with the string other
than using it in a file API.
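
One way this can happen, assuming a conversion step that also normalizes the text, is the difference between a precomposed and a decomposed accent. The sketch below (function names are hypothetical) builds two UTF-8 file names that render identically but differ byte for byte:

```cpp
#include <string>

// "café.txt" with the accent as one precomposed code point U+00E9 (bytes C3 A9).
std::string name_precomposed() { return "caf\xC3\xA9.txt"; }

// "café.txt" with a plain 'e' followed by combining acute U+0301 (bytes CC 81).
std::string name_decomposed() { return "cafe\xCC\x81.txt"; }

// A file system that compares names as opaque code-unit sequences
// effectively performs this comparison.
bool names_match_bytewise(const std::string& a, const std::string& b) {
    return a == b;
}
```

If a file was created under one spelling and the program later asks for the other, a byte-wise lookup fails even though both names look the same on screen.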

> While we are at it, I would like to say that boost filesystem should have
> never introduced a path class in the first place. filesystem::path is just
> a glorified string with no extra invariants. Any string -> path conversion
> copies the data, even if it's already in the right encoding, even on
> operating systems that don't need any conversions at all. There goes your
> "don't pay for what you don't use" principle. Most can agree that C++'s
> spirit is to separate containers from algorithms. A proper design would
> introduce path manipulation functions that work on any string types, and
> let users use std::string or even char[] for storage.

While copying is unfortunate, these things are rarely on a
performance-critical path, and the benefits of having consistent
compose/decompose operations on paths vastly outweigh that, in my
opinion.  Combined with the need to maintain native encoding for paths,
separated algorithms don't seem particularly useful -- just less
convenient to use.


Re: What happened to Boost.Nowide?

On Tue, Dec 3, 2019 at 2:19 PM Gavin Lambert via Boost <
[hidden email]> wrote:

> While I agree in principle, the simple fact is that performing string
> transcoding on filesystem paths is a Very Bad Idea™, since both Windows
> and Linux treat them as opaque byte sequences -- but Windows' native
> encoding is UTF-16 and Linux' is (mostly) UTF-8.
>

Unix paths can be stored in a narrow string already, where fopen() always
magically worked for any text. Windows paths can be transcoded losslessly
into WTF-8 and back.

> So, while unfortunate, v3 made the correct choice.  Paths have to be
> kept in their original encoding between original source (command line,
> file, or UI) and file API usage, otherwise you can get weird errors when
> transcoding produces a different byte sequence that appears identical
> when actually rendered, but doesn't match the filesystem.  Transcoding
> is only safe when you're going to do something with the string other
> than using it in a file API.
>

See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
and back losslessly. The unprecedented introduction of a platform specific
interface into the standard was, still is, and will always be, a horrible
mistake.
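
For readers unfamiliar with WTF-8, the encoding direction can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, and the only rule applied beyond ordinary UTF-8 encoding is that an unpaired surrogate is written as a three-byte sequence instead of being rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one code point (possibly a lone surrogate) using the UTF-8 bit layout.
static void append_code_point(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encode possibly ill-formed UTF-16 as WTF-8.
std::string utf16_to_wtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Proper surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        // Lone surrogates fall through unchanged and become three bytes.
        append_code_point(out, cp);
    }
    return out;
}
```

Well-formed UTF-16 comes out as ordinary UTF-8; only ill-formed input produces bytes that a strict UTF-8 decoder would reject.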


> While copying is unfortunate, these things are rarely on a
> performance-critical path, and the benefits of having consistent
> compose/decompose operations on paths vastly outweighs that, in my
> opinion.  Combined with the need to maintain native encoding for paths,
> separated algorithms don't seem particularly useful -- just less
> convenient to use.
>

The path parsing and modification functions could be storage agnostic. Some
prefer the x.join(y) syntax over join(x,y), but that's just a preference
originating from the OOP crowd.
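
A free-function `join(x, y)` in that style might look like the sketch below. It is deliberately simplistic: it assumes `/` as the separator and does not treat an absolute second argument specially, so it is not a drop-in replacement for the `operator/` of `filesystem::path`.

```cpp
#include <string>

// Join two path fragments stored in any std::basic_string-like type.
template <class String>
String join(const String& base, const String& leaf) {
    typedef typename String::value_type Char;
    const Char sep = static_cast<Char>('/');
    if (base.empty()) return leaf;
    if (leaf.empty()) return base;
    String result(base);
    // Insert a separator only if the base does not already end with one.
    if (result[result.size() - 1] != sep) result.push_back(sep);
    result.append(leaf);
    return result;
}
```

The same template serves `std::string` and `std::wstring` callers, which is the storage-agnostic property under discussion.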

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?

On 7/01/2020 14:58, Yakov Galka wrote:

>> So, while unfortunate, v3 made the correct choice.  Paths have to be
>> kept in their original encoding between original source (command line,
>> file, or UI) and file API usage, otherwise you can get weird errors when
>> transcoding produces a different byte sequence that appears identical
>> when actually rendered, but doesn't match the filesystem.  Transcoding
>> is only safe when you're going to do something with the string other
>> than using it in a file API.
>
> See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
> and back losslessly. The unprecedented introduction of a platform specific
> interface into the standard was, still is, and will always be, a horrible
> mistake.

Given that WTF-8 is not itself supported by the C++ standard library
(and the other formats are), that doesn't seem like a valid argument.
You'd have to campaign for that to be added first.

The main problem though is that once you start allowing transcoding of
any kind, it's a slippery slope to other conversions that can make lossy
changes (such as applying different canonicalisation formats, or
adding/removing layout codepoints such as RTL markers).

Also, if you read the WTF-8 spec, it notes that it is not legal to
directly concatenate two WTF-8 strings (you either have to convert it
back to UTF-16 first, or execute some special handling for the trailing
characters of the first string), which immediately renders it a poor
choice for a path storage format.  And indeed a poor choice for any
purpose.  (I suspect many people who are using it have conveniently
forgotten that part.)



Although on a related note, I think C++11/17 dropped the ball a bit on
the new encoding-specific character types.  It's definitely an
improvement on the prior method, but it would have been better to do
something like:

     struct ansi_encoding_t;
     struct utf_encoding_t;
     typedef encoded_char<ansi_encoding_t, 8> char_t;
     typedef encoded_char<utf_encoding_t, 8> char8_t;
     typedef encoded_char<utf_encoding_t, 16> char16_t;

Where "encoded_char<E,N>" has storage size equal to N bits (blittable,
and otherwise behaves like a standard integer type) but also carries
around an arbitrary encoding tag type E.  This could be used to
distinguish "a string encoded in UTF-8" from "a string encoded in WTF-8"
or "a string encoded in EBCDIC".  And supplemental libraries could
define additional encodings and conversion functions, and algorithms
could operate on generic strings of any encoding.
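
A compilable approximation of this idea is sketched below. It is only an illustration: the tag names and `describe` are hypothetical, different names are used to avoid clashing with the built-in `char16_t`, and the wrapper is a plain struct and not something that "behaves like a standard integer type".

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

struct utf_encoding {};  // tag: well-formed UTF
struct wtf_encoding {};  // tag: WTF-8, may contain encoded lone surrogates

// Code unit of a given width that carries its encoding in the type.
template <class Encoding, unsigned Bits> struct encoded_char;
template <class Encoding> struct encoded_char<Encoding, 8>  { std::uint8_t  value; };
template <class Encoding> struct encoded_char<Encoding, 16> { std::uint16_t value; };

// Same storage size as the raw code unit, but distinct types per encoding.
static_assert(sizeof(encoded_char<utf_encoding, 8>) == 1, "8-bit unit is one byte");
static_assert(sizeof(encoded_char<utf_encoding, 16>) == 2, "16-bit unit is two bytes");
static_assert(!std::is_same<encoded_char<utf_encoding, 8>,
                            encoded_char<wtf_encoding, 8>>::value,
              "UTF-8 and WTF-8 units must not be interchangeable");

typedef std::vector<encoded_char<utf_encoding, 8>> utf8_text;
typedef std::vector<encoded_char<wtf_encoding, 8>> wtf8_text;

// Overload resolution picks the function matching the encoding tag.
std::string describe(const utf8_text&) { return "UTF-8"; }
std::string describe(const wtf8_text&) { return "WTF-8"; }
```

Passing a `wtf8_text` to a function that only accepts `utf8_text` is then a compile error and not a silent mix-up.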


Re: What happened to Boost.Nowide?

Gavin Lambert wrote:

> The main problem though is that once you start allowing transcoding of any
> kind, it's a slippery slope to other conversions that can make lossy
> changes (such as applying different canonicalisation formats, or
> adding/removing layout codepoints such as RTL markers).

There's no such slippery slope, no canonicalization, no adding or removing
anything. You just WTF-8 encode whatever Windows gives you, and WTF-8 decode
the path before passing it to Windows.

> Also, if you read the WTF-8 spec, it notes that it is not legal to
> directly concatenate two WTF-8 strings (you either have to convert it back
> to UTF-16 first, or execute some special handling for the trailing
> characters of the first string), which immediately renders it a poor
> choice for a path storage format.

Do you have a specific example in which concatenation won't work for the use
outlined above? Because I can't think of any.



Re: What happened to Boost.Nowide?

On Tue, Jan 7, 2020 at 3:17 PM Gavin Lambert via Boost <
[hidden email]> wrote:

> > See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
> > and back losslessly. The unprecedented introduction of a platform specific
> > interface into the standard was, still is, and will always be, a horrible
> > mistake.
>
> Given that WTF-8 is not itself supported by the C++ standard library
> (and the other formats are), that doesn't seem like a valid argument.
> You'd have to campaign for that to be added first.
>

It doesn't need to be added to the standard. My claim was that instead of
adding a wchar_t/char Heisenstring into the standard and proliferating the
number of fstream constructors, one could stick to char interfaces and
demand that the "basic execution character set would be capable of storing any
Unicode data". A Windows implementation could do that with WTF-8 to allow
lossless transcoding.

> The main problem though is that once you start allowing transcoding of
> any kind, it's a slippery slope to other conversions that can make lossy
> changes (such as applying different canonicalisation formats, or
> adding/removing layout codepoints such as RTL markers).
>

The truth is that there's already transcoding happening. Mount a Windows
partition on Unix or vice versa. It's expected to have some breakage there
if the filenames contain invalid sequences.


> Also, if you read the WTF-8 spec, it notes that it is not legal to
> directly concatenate two WTF-8 strings (you either have to convert it
> back to UTF-16 first, or execute some special handling for the trailing
> characters of the first string), which immediately renders it a poor
> choice for a path storage format.  And indeed a poor choice for any
> purpose.  (I suspect many people who are using it have conveniently
> forgotten that part.)
>

Paths are, almost always, concatenated with ASCII separators (or other
valid strings) in-between. Even when concatenating malformed strings
directly, the issue isn't there if the result is passed immediately back to
the "UTF-16" system.


> Although on a related note, I think C++11/17 dropped the ball a bit on
> the new encoding-specific character types.  [...]
>

C++11 over-engineered it, and you keep over-engineering it even further.
Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC
strings in one program *at compile time*.

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?

On 8/01/2020 12:57, Yakov Galka wrote:
> Paths are, almost always, concatenated with ASCII separators (or other
> valid strings) in-between. Even when concatenating malformed strings
> directly, the issue isn't there if the result is passed immediately back to
> the "UTF-16" system.

But the conversion from WTF-8 to UTF-16 can interpret the joining point
as a different character, resulting in a different sequence.  Unless
I've misread something, this could occur if the first string ended in an
unpaired high surrogate and the second started with an unpaired low
surrogate (or rather the WTF-8 equivalents thereof).  Unlikely, perhaps,
but not impossible.

>> Although on a related note, I think C++11/17 dropped the ball a bit on
>> the new encoding-specific character types.  [...]
>
> C++11 over-engineered it, and you keep over-engineering it even further.
> Can't think of a time anybody had to mix ASCII, UTF-8, WTF-8 and EBCDIC
> strings in one program *at compile time*.

You've just suggested cases where apps will contain both UTF-8 and
WTF-8, which would be useful to distinguish between at compile time --
both to allow overloading to automatically select the correct conversion
function and to give you compile errors if you accidentally try to pass
a WTF-8 string to a function that expects pure UTF-8, or vice versa.

The same applies for other cases.  That's why C++20 introduced char8_t,
so that you wouldn't accidentally pass UTF-8 strings to methods
expecting other char formats.

This could even be extended to other forms of two-way data encoding,
such as UUEncoding or Base64.  I don't think that's over-engineering,
that's just basic data conversion and type safety.


Re: What happened to Boost.Nowide?

Gavin Lambert wrote:
> But the conversion from WTF-8 to UTF-16 can interpret the joining point as
> a different character, resulting in a different sequence.  Unless I've
> misread something, this could occur if the first string ended in an
> unpaired high surrogate and the second started with an unpaired low
> surrogate (or rather the WTF-8 equivalents thereof).

I don't see why you think this would present a problem. The conversion of
the first string will end in an unpaired high surrogate. The conversion of
the second string will start with an unpaired low surrogate. The two, when
concatenated, will form a valid UTF-16 encoding of a non-BMP character.
Where is the issue here?



Re: What happened to Boost.Nowide?

On Tue, Jan 7, 2020 at 5:16 PM Peter Dimov via Boost <[hidden email]>
wrote:

> Gavin Lambert wrote:
> > But the conversion from WTF-8 to UTF-16 can interpret the joining point as
> > a different character, resulting in a different sequence.  Unless I've
> > misread something, this could occur if the first string ended in an
> > unpaired high surrogate and the second started with an unpaired low
> > surrogate (or rather the WTF-8 equivalents thereof).
>
> I don't see why you think this would present a problem. The conversion of
> the first string will end in an unpaired high surrogate. The conversion of
> the second string will start with an unpaired low surrogate. The two, when
> concatenated, will form a valid UTF-16 encoding of a non-BMP character.
> Where is the issue here?
>

That's my point essentially. However Gavin refers to the fact that the
current WTF-8 spec explicitly says that an encoding of high/low surrogate
pairs is invalid in WTF-8.

For example

UTF-16: d83d de09

should be encoded as

WTF-8: f0 9f 98 89

But if one "UTF-16" string ended in d83d and the other started with de09,
concatenating in WTF-8 would yield

"Invalid WTF-8": ed a0 bd ed b8 89

The spec explicitly prohibits this. The rationale behind this is to have a
unique representation of any "UTF-16" stream, just like UTF-8 requires
shortest representations. It might be important for security reasons if
you're going to compare those "invalid WTF-8" strings, but it is not an
issue if the next thing you do is converting them back to UTF-16.
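
The round trip described here can be checked with a lenient decoder. The following is a sketch under stated assumptions, with a hypothetical function name: it accepts both the canonical four-byte form and the concatenated three-byte surrogate forms, and it does not reject overlong encodings, which a production decoder would have to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode WTF-8, including the "invalid" concatenated surrogate form, to UTF-16.
std::u16string wtf8_to_utf16_lenient(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("unexpected lead byte");
        if (in.size() - i < len) throw std::invalid_argument("truncated sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            // BMP code point or lone surrogate: emitted as a single code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp <= 0x10FFFF) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point out of range");
        }
    }
    return out;
}
```

Both `f0 9f 98 89` and `ed a0 bd ed b8 89` decode to the same two code units, d83d de09, so the result handed back to the "UTF-16" system is identical. The two byte strings still compare unequal as bytes, which is where the uniqueness concern applies.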

--
Yakov Galka
http://stannum.co.il/


Re: What happened to Boost.Nowide?

Yakov Galka wrote:

> That's my point essentially. However Gavin refers to the fact that the
> current WTF-8 spec explicitly says that an encoding of high/low surrogate
> pairs is invalid in WTF-8.

Ah that.

Yes, concatenating two character sequences can result in technically invalid
WTF-8. But that's not an issue unique to Windows. You can do the same on any
non-Windows platform. It's still not clear how this prevents a `path` class
from storing ~WTF-8 on Windows, or exposing a char-based API that ~WTF-8
decodes when passing to Windows, and encodes on the reverse trip.



Re: What happened to Boost.Nowide?

On 8/01/2020 14:43, Peter Dimov wrote:
> Yes, concatenating two character sequences can result in technically
> invalid WTF-8. But that's not an issue unique to Windows. You can do the
> same on any non-Windows platform. It's still not clear how this prevents
> a `path` class from storing ~WTF-8 on Windows, or exposing a char-based
> API that ~WTF-8 decodes when passing to Windows, and encodes on the
> reverse trip.

It could.  And if you're only round-tripping it to file APIs and doing
nothing else, then you can probably get away with that.

But there's probably other code that wants to do manipulation on the
path (swapping extensions, passing to some UI, truncating the filename
to 10 characters, etc).  Now there are more parts of the system that need
to know you have data in not-legal-WTF-8 format, and how to deal with that.

(Or more likely you end up passing it to something that expects legal
UTF-8 without telling it otherwise, and it mostly works -- until it
doesn't.)


Re: What happened to Boost.Nowide?

Gavin Lambert wrote:

> On 8/01/2020 14:43, Peter Dimov wrote:
> > Yes, concatenating two character sequences can result in technically
> > invalid WTF-8. But that's not an issue unique to Windows. You can do the
> > same on any non-Windows platform. It's still not clear how this prevents
> > a `path` class from storing ~WTF-8 on Windows, or exposing a char-based
> > API that ~WTF-8 decodes when passing to Windows, and encodes on the
> > reverse trip.
>
> It could.  And if you're only round-tripping it to file APIs and doing
> nothing else, then you can probably get away with that.
>
> But there's probably other code that wants to do manipulation on the path
> (swapping extensions, passing to some UI, truncating the filename to 10
> characters, etc).  Now there's more parts of the system that needs to know
> you have data in not-legal-WTF-8 format, and how to deal with that.

No, there aren't any (new) problems with that. That is, there aren't
problems you wouldn't have otherwise, on other platforms. Vanilla POSIX can
have any NTBS at all as a path/file name; macOS has UTF-8 NFD paths/file
names. Any code you have that tries to truncate the filename to 10
characters (for whatever definition of character) is already broken. This is
simply not an operation that can be done portably on a path or file name.
(And any code that assumes that a file name will roundtrip, or that two
different file names can't refer to the same file/directory entry, is also
broken.)
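
The ambiguity of "10 characters" shows up even with well-formed UTF-8. A small sketch, with hypothetical helper names and the assumption that the input is valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Number of code points, counted as bytes that are not continuation bytes.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// True when cutting the string after `bytes` bytes would land inside
// a multi-byte sequence and leave invalid UTF-8 behind.
bool cut_splits_sequence(const std::string& s, std::size_t bytes) {
    return bytes < s.size() &&
           (static_cast<unsigned char>(s[bytes]) & 0xC0) == 0x80;
}
```

For the decomposed name `"cafe\xCC\x81.txt"` the three plausible counts disagree: 10 bytes, 9 code points, and 8 user-perceived characters. Cutting after 5 bytes splits the combining accent's two-byte sequence, and cutting after 4 bytes keeps the bytes valid but silently drops the accent.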

