Silly Boost.Locale default narrow string encoding in Windows


Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
When I engage the compiler-in-my-mind to the example given at

   http://cppcms.sourceforge.net/boost_locale/html/

namely

<code>
#include <boost/locale.hpp>
#include <boost/filesystem/path.hpp>
#include <boost/filesystem/fstream.hpp>

int main()
{
     // Create and install global locale
     std::locale::global(boost::locale::generator().generate(""));
     // Make boost.filesystem use it
     boost::filesystem::path::imbue(std::locale());
     // Now Works perfectly fine with UTF-8!
     boost::filesystem::ofstream hello("שלום.txt");
}
</code>

then it fails to work when the literal string is replaced with a `main`
argument.

A conversion is then necessary and must be added.

It breaks the principle of least surprise.

It breaks the principle of not paying for what you don't (want to) use.

I understand, from discussions elsewhere, that the author(s) have chosen
a narrow string encoding that requires inefficient & awkward conversions
in all directions, for political/religious reasons. Maybe my
understanding of that is faulty, that it's no longer politics & religion
but outright war (and maybe that war is even over, with even Luke
Skywalker dead or deadly wounded). However, I still ask:

why FORCE INEFFICIENCY & AWKWARDNESS on Boost users  --  why not just do
it right, using the platforms' native encodings.


Cheers,

- Alf


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
Alf P. Steinbach wrote:

> However, I still ask:
>
> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users  --  why not just do
> it right, using the platforms' native encodings.

Comment out the imbue line.

(The platform's native encoding is UTF-16. The "ANSI" code page, which is
not necessarily ANSI or ANSI-like at all, despite your assertion, is not
"native"; the OS just converts from/to it as needed. Your program will work
fine until it's given a file name that is not representable in the ANSI CP.)



Re: Silly Boost.Locale default narrow string encoding in Windows

Artyom Beilis
In reply to this post by Alf P. Steinbach
>
> then it fails to work when the literal string is replaced
> with a `main` argument.
>
> A conversion is then necessary and must be added.
>
> It breaks the principle of least surprise.
>
> It breaks the principle of not paying for what you don't
> (want to) use.
>

Did you read this?

http://beta.boost.org/doc/libs/1_48_0_beta1/libs/locale/doc/html/default_encoding_under_windows.html

You can **easily** switch to ANSI as default...

But you don't want to (rather switch to UTF-16 or UTF-8)
especially when you actually use localization... :-)

> I understand, from discussions elsewhere, that the
> author(s) have chosen a narrow string encoding that requires
> inefficient & awkward conversions in all directions, for
> political/religious reasons.

No, you haven't read the rationale correctly, and you didn't read
what is written at the link I gave.

If you write "Windows only" software, you should either
set the ANSI option or use the native encoding, UTF-16.

If not, stick to cross-platform UTF-8.

> Maybe my understanding of that
> is faulty, that it's no longer politics & religion but
> outright war (and maybe that war is even over, with even
> Luke Skywalker dead or deadly wounded). However, I still
> ask:
>
> why FORCE INEFFICIENCY & AWKWARDNESS on Boost
> users  --  why not just do it right, using the
> platforms' native encodings.
>

Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.

-----------------------------------------------------

If you are still not convinced: using UTF-8 by default was one
of the important pluses this library brings, and it was noticed
by many reviewers.

Artyom



Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
In reply to this post by Peter Dimov-5
On 27.10.2011 18:47, Peter Dimov wrote:
> Alf P. Steinbach wrote:
>
>> However, I still ask:
>>
>> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
>> it right, using the platforms' native encodings.
>
> Comment out the imbue line.

But that line is much of the point, isn't it?


> (The platform's native encoding is UTF-16. The "ANSI" code page, which
> is not necessarily ANSI or ANSI-like at all, despite your assertion,

The article you responded to did not contain the word "ANSI".

Thus, when you refer to an assertion about "ANSI", you have fantasized
something.

I hope you are not going to go on like that.


> [ANSI] is not "native"; the OS just converts from/to it as needed.

OK, you need to learn quite a bit but

(1) you appear to be very sure that you're already knowledgeable, and

(2) you attribute things to me that you have just fantasized.

That makes it very difficult to teach you.

For narrow character strings in Windows, "native" and "ANSI" are
interchangeable terms.

They mean the same, namely the codepage identified by the GetACP() function.

This is not a particular codepage, it is configurable.

On my machine, and most probably on yours, it is codepage 1252, Windows
ANSI Western.

"Native" means the encoding used and expected by the OS' API functions.

For narrow character strings in Windows, that's Windows ANSI.


> Your program

No, again you're wrong: it's the Boost.Locale documentation's program.


> will work fine until it's given a file name that is not representable in
> the ANSI CP.)

Nope, sorry, for any /reasonable interpretation/ of what you're writing.

I can imagine that maybe you're thinking about setting ANSI CP to 65001,
which however is not reasonable.


Cheers & hth.,

- Alf



Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
In reply to this post by Artyom Beilis
On 27.10.2011 19:06, Artyom Beilis wrote:
>
> Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.

Try using UTF-16 with narrow strings.


Cheers & hth.,

- Alf




Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
In reply to this post by Alf P. Steinbach
Alf P. Steinbach wrote:

> On 27.10.2011 18:47, Peter Dimov wrote:
> > Alf P. Steinbach wrote:
> >
> >> However, I still ask:
> >>
> >> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
> >> it right, using the platforms' native encodings.
> >
> > Comment out the imbue line.
>
> But that line is much of the point, isn't it?

There wouldn't be much point in calling imbue if you didn't want a change in
the boost::filesystem default behavior, which is to convert using the ANSI
CP (or the OEM CP if AreFileApisANSI() returns false, if I'm not mistaken).


> > (The platform's native encoding is UTF-16. The "ANSI" code page, which
> > is not necessarily ANSI or ANSI-like at all, despite your assertion,
>
> The article you responded to did not contain the word "ANSI".
>
> Thus, when you refer to an assertion about "ANSI", you have fantasized
> something.

http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL-filenames-with-g-in-Windows-a-solution-tp3936857p3944493.html

> I hope you are not going to go on like that.
>
>
> > [ANSI] is not "native"; the OS just converts from/to it as needed.
>
> OK, you need to learn quite a bit but
>
> (1) you appear to be very sure that you're already knowledgeable, and
>
> (2) you attribute things to me that you have just fantasized.
>
> That makes it very difficult to teach you.
>
> For narrow character strings in Windows, "native" and "ANSI" are
> interchangeable terms.

I will accept your definition for the time being and restate what I just
said without using "native":

Under Windows (NT+ and NTFS), the narrow character API is a wrapper over the
wide character API. The system converts from/to the ANSI code page as
needed. The narrowing conversion may lose data.

> > Your program
>
> No, again you're wrong: it's the Boost.Locale documentation's program.
>
>
> > will work fine until it's given a file name that is not representable in
> > the ANSI CP.)
>
> Nope, sorry, for any /reasonable interpretation/ of what you're writing.

File names on NTFS are not necessarily representable in the ANSI code page.
A program that uses narrow strings in the ANSI code page to represent paths
will not necessarily be able to open all files on the system.
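
The data loss described above can be illustrated with a small portable sketch. It is not what Windows does internally: ISO-8859-1 merely stands in for "the ANSI code page" (its 256 values coincide with U+0000..U+00FF, so no lookup table is needed), and the names `narrow_latin1` and `demo_name_collision` are made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Narrow a sequence of Unicode code points to a single-byte encoding.
// Assumption: ISO-8859-1 stands in for the ANSI code page; a real code
// page would need a conversion table.
std::string narrow_latin1(const std::u32string& text, bool& lossy)
{
    std::string out;
    lossy = false;
    for (char32_t cp : text) {
        if (cp <= 0xFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
        } else {
            out.push_back('?');   // the original character cannot be recovered
            lossy = true;
        }
    }
    return out;
}

// Two different file names (Japanese and Cyrillic) collapse to the same
// narrow string, so a narrow-only program cannot tell them apart.
inline void demo_name_collision()
{
    bool lossy_a = false, lossy_b = false;
    const std::string a = narrow_latin1(U"\u65E5\u672C.txt", lossy_a);
    const std::string b = narrow_latin1(U"\u043A\u043E.txt", lossy_b);
    assert(lossy_a && lossy_b);
    assert(a == b);
}
```

A name such as "café.txt" survives this particular narrowing unchanged, which is why a program of this kind can appear to work until it meets a name outside the code page.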



Re: Silly Boost.Locale default narrow string encoding in Windows

Mateusz Loskot
In reply to this post by Alf P. Steinbach
On 27 October 2011 18:19, Alf P. Steinbach
<[hidden email]> wrote:
> On 27.10.2011 19:06, Artyom Beilis wrote:
>>
>> Windows native encoding is not ANSI. It is Wide/UTF-16 encoding.
>
> Try using UTF-16 with narrow strings.

You simply don't do that, do you, without conversion to wide string type.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org


Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
In reply to this post by Peter Dimov-5
On 27.10.2011 20:01, Peter Dimov wrote:

> Alf P. Steinbach wrote:
>> On 27.10.2011 18:47, Peter Dimov wrote:
>> > Alf P. Steinbach wrote:
>> >
>> >> However, I still ask:
>> >>
>> >> why FORCE INEFFICIENCY & AWKWARDNESS on Boost users -- why not just do
>> >> it right, using the platforms' native encodings.
>> >
>> > Comment out the imbue line.
>>
>> But that line is much of the point, isn't it?
>
> There wouldn't be much point in calling imbue if you didn't want a
> change in the boost::filesystem default behavior, which is to convert
> using the ANSI CP (or the OEM CP if AreFileApisANSI() returns false, if
> I'm not mistaken).

Oh there is.

It is a level of indirection.

You want Boost.Filesystem to assume /the same/ narrow character encoding
as Boost.Locale, whatever it is.

And to quote the docs where I found that program,

"Boost Locale fully supports both narrow and wide API. The default
character encoding is assumed to be UTF-8 on Windows."


>> > (The platform's native encoding is UTF-16. The "ANSI" code page, which
>> > is not necessarily ANSI or ANSI-like at all, despite your assertion,
>>
>> The article you responded to did not contain the word "ANSI".
>>
>> Thus, when you refer to an assertion about "ANSI", you have fantasized
>> something.
>
> http://boost.2283326.n4.nabble.com/Making-Boost-Filesystem-work-with-GENERAL-filenames-with-g-in-Windows-a-solution-tp3936857p3944493.html

That's a different context and a different discussion, where it was
neither necessary nor natural to dot the i's and cross the t's to
perfection.

Talk about dragging in things from out of the blue.

If you wanted to point out the possibility of e.g. a Japanese codepage
as ANSI, then you should have done that over there, in that thread. I
mean in the context where it could make sense and where it could help
prevent readers getting a wrong impression. If it was that important.


[snippety]


> Under Windows (NT+ and NTFS), the narrow character API is a wrapper over
> the wide character API. The system converts from/to the ANSI code page
> as needed. The narrowing conversion may lose data.

OK, we're just talking about two different meanings of "native", for two
different contexts: windows internals, and windows apps.

The relevant context for discussing Boost.Locale's treatment of narrow
strings, is the application level.


>> > [the program] will work fine until it's given a file name that is not
>> > representable in the ANSI CP.)
>>
>> Nope, sorry, for any /reasonable interpretation/ of what you're writing.
>
> File names on NTFS are not necessarily representable in the ANSI code
> page. A program that uses narrow strings in the ANSI code page to
> represent paths will not necessarily be able to open all files on the
> system.

Right, that's one reason why modern Windows programs should best be
wchar_t based. Other reasons include efficiency (avoiding conversions)
and simple convenience. Some API functions do not have narrow wrappers.

However, a default assumption of UTF-8 encoding for narrow strings, as
in Boost.Locale, seems to me to clash with most uses of narrow strings.

For example, if you output UTF-8 on standard output, and then try to
pipe that through `more` in Windows' [cmd.exe], you get this:


<example>
d:\dave> chcp 65001
Active code page: 65001

d:\dave> echo "imagine this is utf8" | more
Not enough memory.

d:\dave> _
</example>


So utf-8 is, to put it less than strongly, not very practical as a
general narrow-character encoding in Windows.

The example that I gave at top of the thread was passing a `main`
argument further on, when using Boost.Locale. It causes trouble because
in Windows `main` arguments are by convention encoded as ANSI, while
Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8 generally
yields gobbledygook, except for the pure ASCII common subset.
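
To make the "gobbledygook" point concrete, here is a minimal sketch of a structural UTF-8 check. The function name `is_valid_utf8` is made up for this illustration, and codepage 1252 bytes are written as explicit escapes so the example does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is structurally valid UTF-8: correct lead and
// continuation bytes, no overlong forms, no surrogates, nothing above
// U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        unsigned long cp;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;                       // stray continuation or bad lead byte
        if (s.size() - i - 1 < extra) return false;   // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[extra]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;
        i += extra + 1;
    }
    return true;
}

// "blåbær" in codepage 1252 versus UTF-8. The literals are split so that
// a hex escape is never directly followed by a hex digit such as 'b'.
inline void demo_ansi_is_not_utf8()
{
    assert(!is_valid_utf8("bl\xE5" "b\xE6" "r"));          // cp1252 bytes
    assert(is_valid_utf8("bl\xC3\xA5" "b\xC3\xA6" "r"));   // UTF-8 bytes
    assert(is_valid_utf8("plain-ascii.txt"));              // common subset
}
```

Only the pure ASCII name is valid under both interpretations, which matches the common-subset remark above.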

But with ANSI as Boost.Locale default, with that more reasonable choice
of default, the imbue call would not cause trouble, but would instead
help to avoid trouble  --  which is surely the original intention.


Cheers & hth.,

- Alf



Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
Alf P. Steinbach wrote:
> On 27.10.2011 20:01, Peter Dimov wrote:
...
> > File names on NTFS are not necessarily representable in the ANSI code
> > page. A program that uses narrow strings in the ANSI code page to
> > represent paths will not necessarily be able to open all files on the
> > system.
>
> Right, that's one reason why modern Windows programs should best be
> wchar_t based.

This is one of the two options. The other is using UTF-8 for representing
paths as narrow strings. The first option is more natural for Windows-only
code, and the second is better, in practice, for portable code because it
avoids the need to duplicate all path-related functions for char/wchar_t.
The motivation for using UTF-8 is practical, not political or religious.

> The example that I gave at top of the thread was passing a `main` argument
> further on, when using Boost.Locale. It causes trouble because in Windows
> `main` arguments are by convention encoded as ANSI, while Boost.Locale has
> UTF-8 as default. Treating ANSI as UTF-8 generally yields gobbledygook,
> except for the pure ASCII common subset.

Yes. If you (generic second person, not you specifically) want to take your
paths from the narrow API, a UTF-8 default is not practical. But then
again, you shouldn't take your paths from the narrow API, because it can't
represent the names of all the files the user may have.



Re: Silly Boost.Locale default narrow string encoding in Windows

Artyom Beilis
In reply to this post by Alf P. Steinbach
> From: Alf P. Steinbach <[hidden email]>
>
> [...]
>
> It is a level of indirection.
>
> You want Boost.Filesystem to assume /the same/ narrow
> character encoding as Boost.Locale, whatever it is.
>
> And to quote the docs where I found that program,
>
> "Boost Locale fully supports both narrow and wide API. The
> default character encoding is assumed to be UTF-8 on
> Windows."
>


I will say it once again, for the last time.

1. Boost.Locale is a **localization** library, and localization
   today is done using **Unicode**, not cp1252, cp936 or cp1255.

   And UTF-8 is the **Unicode** encoding for narrow strings.

   So _any_ localization library **must** use a Unicode encoding,
   otherwise it will be useless crap.

2. If you write software for Windows and want to use ANSI encoding
   by default, all you need is to add a _single_ line to your code.

   I give you a choice to use whatever you want. But the default
   should be suitable for **localization**, the purpose this library
   was written for.
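
For readers looking for that single line: as far as the Boost.Locale reference documents it, the switch is `generator::use_ansi_encoding`. The fragment below is a minimal sketch, assuming Boost.Locale is installed and linked; the option is described as Windows-specific, so on other platforms it is expected to leave the default behaviour unchanged.

```cpp
#include <boost/locale.hpp>
#include <locale>

int main()
{
    boost::locale::generator gen;
    // Ask for the ANSI code page instead of UTF-8 for narrow strings
    // (documented as having an effect under Windows only).
    gen.use_ansi_encoding(true);
    std::locale::global(gen.generate(""));
}
```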


Now, you may not like the design of the Boost.Locale library, or you
may not like its defaults. Legitimate. But using UTF-8 by default
was one of the few points that had total agreement among all
Boost.Locale reviewers.

Using UTF-8 by default is indeed a strategic decision. You may
call it political, I may call it practical. You may not like
it, but this is what will remain, because it is the way the library
is designed and it is one of its central parts.

You don't like it? OK... I have given you an option to change it.
I think you and other users will survive this one extra line
that changes the default encoding to ANSI instead of the
cross-platform UTF-8.

Best Regards,


Artyom Beilis
--------------
CppCMS - C++ Web Framework:   http://cppcms.sf.net/
CppDB - C++ SQL Connectivity: http://cppcms.sf.net/sql/cppdb/



Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
In reply to this post by Peter Dimov-5
On 27.10.2011 21:07, Peter Dimov wrote:

> Alf P. Steinbach wrote:
>> On 27.10.2011 20:01, Peter Dimov wrote:
> ...
>> > File names on NTFS are not necessarily representable in the ANSI code
>> > page. A program that uses narrow strings in the ANSI code page to
>> > represent paths will not necessarily be able to open all files on the
>> > system.
>>
>> Right, that's one reason why modern Windows programs should best be
>> wchar_t based.
>
> This is one of the two options. The other is using UTF-8 for
> representing paths as narrow strings. The first option is more natural
> for Windows-only code, and the second is better, in practice, for
> portable code because it avoids the need to duplicate all path-related
> functions for char/wchar_t. The motivation for using UTF-8 is practical,
> not political or religious.

Thanks for that clarification of the current thinking at Boost.

I suspected that people envisioned those two choices as an exhaustive
set of alternatives, what to choose from, but I wasn't sure.

Anyway, happily, the apparent forced choice between two inefficient
ungoods, is not necessary  --  i.e. it's a false dichotomy.

For, there are at least THREE options for representing paths and other
strings internally in the program, in portable single-source code:

   1. wide character based (UTF-16 in Windows, possibly UTF-32 in *nix),
      as you described above,

   2. narrow character based (UTF-8), as you described above, and

   3. the most natural sufficiently general native encoding, 1 or 2
      depending on the platform that the source is being built for.

Option 3 means  --  it requires, as far as I can see  --  some
abstraction that hides the narrow/wide representation so as to get
source code level portability, which is all that matters for C++. It
doesn't need to involve very much. Some typedefs, traits, references.

Prior art in this direction, includes Microsoft's [tchar.h].

For example, write a portable string literal like this:

     PS( "This is a portable string literal" )

As compared to options 1 and 2, the benefits of option 3 include:

   * no inefficient conversions except at the external boundary of the
     program (and then in practice only in Windows, where they already
     happen),

   * no problems with software and tools that don't understand a chosen
     "universal" (option 1 or 2) encoding,

   * no need to duplicate functions to adapt to underlying OS: one has
     at hand exactly what the OS API wants.

The main drawback is IMO the need to use something like a PS macro for
string and character literals, or a C++11 /user defined literal/.
Windows programmers are used to that, writing _T("blah") all the time as
if Windows 95 was still extant. So, considering that all that current
labor is being done for no reward whatsoever, I think it should be no
problem convincing programmers that writing a few characters more in
order to get portable string literals, is worth it; it just needs
exposure to examples from some authoritative source...
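
A rough sketch of what such a PS macro could look like follows. Only the name `PS` comes from the text above; `PS_IMPL`, `native_char` and `native_string` are hypothetical names, and the choice of `wchar_t` on Windows and `char` elsewhere is this sketch's reading of option 3, not an existing Boost facility.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
    typedef wchar_t native_char;      // UTF-16 code units on Windows
#   define PS_IMPL(lit) L##lit
#else
    typedef char native_char;         // narrow, by convention UTF-8 on most *nix systems
#   define PS_IMPL(lit) lit
#endif
#define PS(lit) PS_IMPL(lit)          // extra level so macro arguments expand first

typedef std::basic_string<native_char> native_string;

// The same source line yields a wide literal on Windows and a narrow
// one elsewhere; character literals work the same way.
inline void check_portable_literal()
{
    const native_string s = PS("This is a portable string literal");
    assert(s.size() == 33);
    assert(s[0] == PS('T'));
}
```

Everything that touches `native_string` then has to be written against `native_char`, which is the cost that the replies in this thread weigh against a plain `char`-based UTF-8 convention.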


>> The example that I gave at top of the thread was passing a `main`
>> argument further on, when using Boost.Locale. It causes trouble
>> because in Windows `main` arguments are by convention encoded as ANSI,
>> while Boost.Locale has UTF-8 as default. Treating ANSI as UTF-8
>> generally yields gobbledygook, except for the pure ASCII common subset.
>
> Yes. If you (generic second person, not you specifically) want to take
> your paths from the narrow API, an UTF-8 default is not practical. But
> then again, you shouldn't take your paths from the narrow API, because
> it can't represent the names of all the files the user may have.

That's an unrelated issue, really, but I think Boost could use a "get
undamaged program arguments in portable strings" thing, if it isn't
there already?
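
One possible shape for such a helper is sketched below, under the assumption that "portable string" means UTF-8 in a `std::string` (the thread does not settle that). The name `utf8_args` is hypothetical. On Windows it ignores the `main` arguments and re-reads the command line through the wide API; elsewhere it passes the arguments through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

#ifdef _WIN32
#   include <cwchar>        // std::wcslen
#   include <windows.h>     // GetCommandLineW, WideCharToMultiByte, LocalFree
#   include <shellapi.h>    // CommandLineToArgvW (link with shell32)
#endif

// Hypothetical helper: program arguments as UTF-8 strings.
std::vector<std::string> utf8_args(int argc, char** argv)
{
#ifdef _WIN32
    (void)argc; (void)argv;             // the ANSI-converted copies are ignored
    std::vector<std::string> out;
    int wargc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (!wargv) return out;
    for (int i = 0; i < wargc; ++i) {
        const int len = static_cast<int>(std::wcslen(wargv[i]));
        // First call asks for the required size, second call converts.
        const int n = len == 0 ? 0 : WideCharToMultiByte(
            CP_UTF8, 0, wargv[i], len, nullptr, 0, nullptr, nullptr);
        std::string s(static_cast<std::size_t>(n), '\0');
        if (n > 0)
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], len, &s[0], n, nullptr, nullptr);
        out.push_back(s);
    }
    LocalFree(wargv);
    return out;
#else
    assert(argc >= 0 && argv != nullptr);
    return std::vector<std::string>(argv, argv + argc);   // passed through unchanged
#endif
}
```

Called as `utf8_args(argc, argv)` at the top of `main`, this keeps file names that are not representable in the ANSI code page intact on Windows.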


Cheers & hth.,

- Alf



Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
Alf P. Steinbach wrote:
> On 27.10.2011 21:07, Peter Dimov wrote:
> > Alf P. Steinbach wrote:
...

> >> Right, that's one reason why modern Windows programs should best be
> >> wchar_t based.
> >
> > This is one of the two options. The other is using UTF-8 for
> > representing paths as narrow strings. The first option is more natural
> > for Windows-only code, and the second is better, in practice, for
> > portable code because it avoids the need to duplicate all path-related
> > functions for char/wchar_t. The motivation for using UTF-8 is practical,
> > not political or religious.
>
> Thanks for that clarification of the current thinking at Boost.

My opinion is not representative of all of Boost, although I've found that
there is substantial agreement between people who write portable software
that needs to deal with paths (#2, UTF-8, as the way to go).

>   3. the most natural sufficiently general native encoding, 1 or 2
>      depending on the platform that the source is being built for.

Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type, 3c,
providing both char and wchar_t overloads. They all have their problems;
people don't move to UTF-8 merely out of spite.

> Prior art in this direction, includes Microsoft's [tchar.h].

This works, more or less, once you've accumulated the appropriate library of
_T macros, _t functions and T/t typedefs. I've never heard of it actually
being used for a portable code base, but I admit that it's possible to do
things this way, even if it's somewhat alien to POSIX people.

The advantage of using UTF-8 is that, apart from the border layer that calls
the OS (and that needs to be ported either way), the rest of the code is
happily char[]-based. There's no need to be aware of the fact that literals
need to be quoted or that strlen should be spelled _tcslen. There's no need
to convert paths to an external representation when writing them into a
portable config/project file.
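
The "border layer" can be quite small. The following is a sketch of its core, a hand-written UTF-8 to UTF-16 conversion; the name `utf8_to_utf16` is made up here, malformed input is replaced by U+FFFD rather than rejected, and overlong forms are not checked, so this is an illustration rather than a hardened decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16 code units. Malformed sequences become U+FFFD.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;          // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                     // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

inline void demo_border_conversion()
{
    // U+00E5 (two UTF-8 bytes) becomes a single UTF-16 code unit.
    assert(utf8_to_utf16("\xC3\xA5") == std::u16string(1, char16_t(0x00E5)));
}
```

On Windows the resulting code units would be handed to a wide API function such as `CreateFileW`; on POSIX systems the UTF-8 string would typically be passed to the OS as it is.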

> That's an unrelated issue, really, but I think Boost could use a "get
> undamaged program arguments in portable strings" thing, if it isn't there
> already?

We'll be back to the question of what constitutes a portable string. I'd
prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
TCHAR[].



Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
On 27.10.2011 23:56, Peter Dimov wrote:

> Alf P. Steinbach wrote:
>> On 27.10.2011 21:07, Peter Dimov wrote:
>> > Alf P. Steinbach wrote:
> ...
>> >> Right, that's one reason why modern Windows programs should best be
>> >> wchar_t based.
>> >
>> > This is one of the two options. The other is using UTF-8 for
>> > representing paths as narrow strings. The first option is more natural
>> > for Windows-only code, and the second is better, in practice, for
>> > portable code because it avoids the need to duplicate all path-related
>> > functions for char/wchar_t. The motivation for using UTF-8 is
>> > practical, not political or religious.
>>
>> Thanks for that clarification of the current thinking at Boost.
>
> My opinion is not representative of all of Boost, although I've found
> that there is substantial agreement between people who write portable
> software that needs to deal with paths (#2, UTF-8, as the way to go).
>
>> 3. the most natural sufficiently general native encoding, 1 or 2
>> depending on the platform that the source is being built for.
>
> Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type,
> 3c, providing both char and wchar_t overloads. They all have their
> problems; people don't move to UTF-8 merely out of spite.
>
>> Prior art in this direction, includes Microsoft's [tchar.h].
>
> This works, more or less, once you've accumulated the appropriate
> library of _T macros, _t functions and T/t typedefs. I've never heard of
> it actually being used for a portable code base,

[tchar.h], plus the similar support in <windows.h>, was heavily used for
porting applications between Windows 9x ANSI and Windows NT Unicode,
before Microsoft introduced the Layer for Unicode in 2001 or thereabouts
(the layer allowed wchar_t-apps to run in Windows 9x).

I'm not saying it's a good C++ approach for that porting  --  it's not,
since it was designed for the C language.

I just gave it as an example of prior art, which includes a neat header
where the names of the relevant functions to wrap (or whatever) can be
extracted by a small Python script. ;-)


> but I admit that it's
> possible to do things this way, even if it's somewhat alien to POSIX
> people.
>
> The advantage of using UTF-8 is that, apart from the border layer that
> calls the OS (and that needs to be ported either way), the rest of the
> code is happily char[]-based.

Oh.

I would be happy to learn this.

How do I make the following program work with Visual C++ in Windows,
using narrow character string?


<code>
#include <stdio.h>
#include <fcntl.h>      // _O_U8TEXT
#include <io.h>         // _setmode, _fileno
#include <windows.h>

int main()
{
     //SetConsoleOutputCP( 65001 );
     //_setmode( _fileno( stdout ), _O_U8TEXT );
     printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
}
</code>


The out-commented code is from my random efforts to Make It Work(TM).

It refused.

By the way, I'm hoping Boost isn't supporting old versions of g++.

Because old versions of g++ choked on a BOM at start of UTF-8 encoded
source code, while Visual C++ requires that BOM... So, UTF-8 source code
ungood with old versions of g++, if Visual C++ is also used.


> There's no need to be aware of the fact
> that literals need to be quoted or that strlen should be spelled
> _tcslen. There's no need to convert paths to an external representation
> when writing them into a portable config/project file.

Hm, I'm not so sure.

I'd like to see this magic in action before believing in it, e.g., the
program above working with narrow chars and printf, with Visual C++.


>> That's an unrelated issue, really, but I think Boost could use a "get
>> undamaged program arguments in portable strings" thing, if it isn't
>> there already?
>
> We'll be back to the question of what constitutes a portable string. I'd
> prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
> TCHAR[].

No, not TCHAR, which was designed for the C language (and is an ugly
uppercase name to boot).

Instead, like this:


<code>
#include "u/stdio_h.h"      // u::CodingValue, u::sprintf, U

#undef UNICODE
#define UNICODE
#include <windows.h>        // MessageBox

int main()
{
     u::CodingValue  buffer[80];

     sprintf( buffer, U( "The answer is %d!" ), 6*7 );  // Koenig lookup.
     MessageBox(
         0,
         buffer->rawPtr(),
         U( "This is a title!" )->rawPtr(),
         MB_ICONINFORMATION | MB_SETFOREGROUND
         );
}
</code>


I coded up that support after reading the article I'm responding to now,
because I felt that without coding it up I would be just spewing gut
feelings and hunches. Well-informed such, but still. So I coded. :-)


Cheers & hth.,

- Alf



Re: Silly Boost.Locale default narrow string encoding in Windows

Yakov Galka
On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach <
[hidden email]> wrote:

> On 27.10.2011 23:56, Peter Dimov wrote:
>
>> Alf P. Steinbach wrote:
>>
>>> On 27.10.2011 21:07, Peter Dimov wrote:
>>> > Alf P. Steinbach wrote:
>>>
>> ...
>>
>>> >> Right, that's one reason why modern Windows programs should best be
>>> >> wchar_t based.
>>> >
>>> > This is one of the two options. The other is using UTF-8 for
>>> > representing paths as narrow strings. The first option is more natural
>>> > for Windows-only code, and the second is better, in practice, for
>>> > portable code because it avoids the need to duplicate all path-related
>>> > functions for char/wchar_t. The motivation for using UTF-8 is
>>> > practical, not political or religious.
>>>
>>> Thanks for that clarification of the current thinking at Boost.
>>>
>>
>> My opinion is not representative of all of Boost, although I've found
>> that there is substantial agreement between people who write portable
>> software that needs to deal with paths (#2, UTF-8, as the way to go).
>>
>>> 3. the most natural sufficiently general native encoding, 1 or 2
>>> depending on the platform that the source is being built for.
>>>
>>
>> Yes, with its various suboptions. 3a, TCHAR, 3b, template on char_type,
>> 3c, providing both char and wchar_t overloads. They all have their
>> problems; people don't move to UTF-8 merely out of spite.
>>
>>> Prior art in this direction, includes Microsoft's [tchar.h].
>>>
>>
>> This works, more or less, once you've accumulated the appropriate
>> library of _T macros, _t functions and T/t typedefs. I've never heard of
>> it actually being used for a portable code base,
>>
>
> [tchar.h], plus the similar support in <windows.h>, was heavily used for
> porting applications between Windows 9x ANSI and Windows NT Unicode, before
> Microsoft introduced the Layer for Unicode in 2001 or thereabouts (the layer
> allowed wchar_t-apps to run in Windows 9x).
>
> I'm not saying it's a good C++ approach for that porting  --  it's not,
> since it was designed for the C language.
>
> I just gave it as an example of prior art, which includes a neat header
> where the names of the relevant functions to wrap (or whatever) can be
> extracted by a small Python script. ;-)
>
>
>
>> but I admit that it's
>> possible to do things this way, even if it's somewhat alien to POSIX
>> people.
>>
>> The advantage of using UTF-8 is that, apart from the border layer that
>> calls the OS (and that needs to be ported either way), the rest of the
>> code is happily char[]-based.
>>
>
> Oh.
>
> I would be happy to learn this.
>
> How do I make the following program work with Visual C++ in Windows, using
> narrow character string?
>
>
> <code>
> #include <stdio.h>
> #include <fcntl.h>      // _O_U8TEXT
> #include <io.h>         // _setmode, _fileno
> #include <windows.h>
>
> int main()
> {
>    //SetConsoleOutputCP( 65001 );
>    //_setmode( _fileno( stdout ), _O_U8TEXT );
>    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
> }
> </code>
>

How will you make this program portable?

> The out-commented code is from my random efforts to Make It Work(TM).
>
> It refused.
>

This is because Windows narrow chars can't be UTF-8. You could make it
portable by:

int main()
{
    boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
}


>
> By the way, I'm hoping Boost isn't supporting old versions of g++.
>
> Because old versions of g++ choked on a BOM at start of UTF-8 encoded
> source code, while Visual C++ requires that BOM... So, UTF-8 source code
> ungood with old versions of g++, if Visual C++ is also used.


If you don't use wide chars, you can trick VC++ into using UTF-8 string
literals. Just save the file as UTF-8 *without* BOM. It will just embed them
verbatim into the executable.

>> There's no need to be aware of the fact
>> that literals need to be quoted or that strlen should be spelled
>> _tcslen. There's no need to convert paths to an external representation
>> when writing them into a portable config/project file.
>>
>
> Hm, I'm not so sure.
>
> I'd like to see this magic in action before believing in it, e.g., the
> program above working with narrow chars and printf, with Visual C++.


See above and see
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036


>
>>> That's an unrelated issue, really, but I think Boost could use a "get
>>> undamaged program arguments in portable strings" thing, if it isn't
>>> there already?
>>>
>>
>> We'll be back to the question of what constitutes a portable string. I'd
>> prefer UTF-8 on Windows and whatever was passed on POSIX. You'd prefer
>> TCHAR[].
>>
>
> No, not TCHAR, which was designed for the C language (and is an ugly
> uppercase name to boot).
>
> Instead, like this:
>
>
> <code>
> #include "u/stdio_h.h"      // u::CodingValue, u::sprintf, U
>
> #undef UNICODE
> #define UNICODE
> #include <windows.h>        // MessageBox
>
> int main()
> {
>    u::CodingValue  buffer[80];
>
>    sprintf( buffer, U( "The answer is %d!" ), 6*7 );  // Koenig lookup.
>    MessageBox(
>        0,
>        buffer->rawPtr(),
>        U( "This is a title!" )->rawPtr(),
>        MB_ICONINFORMATION | MB_SETFOREGROUND
>        );
> }
> </code>
>

You judge from a non-portable code point of view. How about:

#include <cstdio>
#include "gtkext/message_box.h" // for gtkext::message_box

int main()
{
    char buffer[80];
    sprintf(buffer, "The answer is %d!", 6*7);
    gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah, ...);
}

And unlike your code, it's magically portable! (thanks to GTK using UTF-8 on
Windows)

Sincerely,
--
Yakov


Re: Silly Boost.Locale default narrow string encoding in Windows

Stewart, Robert
In reply to this post by Alf P. Steinbach
Alf P. Steinbach wrote:

>
> Option 3 means  --  it requires, as far as I can see  --  some
> abstraction that hides the narrow/wide representation so as to
> get source code level portability, which is all that matters
> for C++. It doesn't need to involve very much. Some typedefs,
> traits, references.
>
> For example, write a portable string literal like this:
>
>      PS( "This is a portable string literal" )
[snip]

> The main drawback is IMO the need to use something like a PS
> macro for string and character literals, or a C++11 /user
> defined literal/.
> Windows programmers are used to that, writing _T("blah") all
> the time as if Windows 95 was still extant. So, considering
> that all that current labor is being done for no reward
> whatsoever, I think it should be no problem convincing
> programmers that writing a few characters more in order to get
> portable string literals, is worth it; it just needs exposure
> to examples from some authoritative source...

The problem with that approach is that existing, non-Windows, code must be painstakingly altered to introduce such manual portability constructs.  If code was already written using the Microsoft facilities for portability, it's a relatively easy transition to make (s/_T/PS/, for example).

Regardless of authoritative examples, inertia is against your idea.
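
For readers who have not seen the idea spelled out, here is a minimal sketch of what a PS-style portable literal could look like. The original definition was snipped from the quoted post, so everything below (PS_IMPL, portable_char, portable_string, portable_length) is an assumption made for illustration, not Alf's actual code: wide literals on Windows, narrow literals elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names throughout; the thread only shows PS( "..." ) in use.
#ifdef _WIN32
typedef wchar_t portable_char;      // native Windows API encoding (UTF-16)
#define PS_IMPL(lit) L##lit         // paste the wide-literal prefix
#else
typedef char portable_char;         // native narrow encoding elsewhere
#define PS_IMPL(lit) lit
#endif
#define PS(lit) PS_IMPL(lit)        // extra level so macro arguments expand first

typedef std::basic_string<portable_char> portable_string;

// Length in code units (not user-perceived characters) of a portable string.
std::size_t portable_length(const portable_char* s)
{
    assert(s != 0);
    return portable_string(s).size();
}
```

The cost being debated is visible here: every literal has to be wrapped in PS(...), and every function that takes text has to be written against portable_char rather than plain char.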

_____
Rob Stewart                           [hidden email]
Software Engineer                     using std::disclaimer;
Dev Tools & Components
Susquehanna International Group, LLP  http://www.sig.com


Re: Silly Boost.Locale default narrow string encoding in Windows

Alf P. Steinbach
In reply to this post by Yakov Galka
On 28.10.2011 12:36, Yakov Galka wrote:

> On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach<
> [hidden email]>  wrote:
>
>> On 27.10.2011 23:56, Peter Dimov wrote:
>>>
>>> The advantage of using UTF-8 is that, apart from the border layer that
>>> calls the OS (and that needs to be ported either way), the rest of the
>>> code is happily char[]-based.
>>
>> Oh.
>>
>> I would be happy to learn this.
>>
>> How do I make the following program work with Visual C++ in Windows, using
>> narrow character string?
>>
>>
>> <code>
>> #include<stdio.h>
>> #include<fcntl.h>       // _O_U8TEXT
>> #include<io.h>          // _setmode, _fileno
>> #include<windows.h>
>>
>> int main()
>> {
>>     //SetConsoleOutputCP( 65001 );
>>     //_setmode( _fileno( stdout ), _O_U8TEXT );
>>     printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>> }
>> </code>
>>
>
> How will you make this program portable?

Well, that was *my* question.

The claim that this minimal "Hello, world!" program puts to the test
is that "the rest of the [UTF-8 based] code is happily char[]-based".

Apparently that is not so.


>> The out-commented code is from my random efforts to Make It Work(TM).
>>
>> It refused.
>>
>
> This is because windows narrow-chars can't be UTF-8. You could make it
> portable by:
>
> int main()
> {
>      boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
> }

Thanks, TIL boost::printf.

The idea of UTF-8 as a universal encoding seems now to be to use some
workaround such as boost::printf for each and every case where it turns
out that it doesn't work portably.

When every portability problem has been diagnosed and special cased to
use functions that translate to/from UTF-8, and ignoring the
efficiency aspect of that, then UTF-8 just magically works, hurray.

E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code,
then just replace with 'boost::fopen', or 'my_special_casing::fopen'.

However, with these workaround details made manifest, it is /much less/
convincing than the original general vague claim that UTF-8 just works.


[snip]

> You judge from a non-portable code point of view. How about:
>
> #include <cstdio>
> #include "gtkext/message_box.h" // for gtkext::message_box
>
> int main()
> {
>      char buffer[80];
>      sprintf(buffer, "The answer is %d!", 6*7);
>      gtkext::message_box(buffer, "This is a title!", gtkext::icon_blah_blah,
> ...);
> }
>
> And unlike your code, it's magically portable! (thanks to gtk using UTF-8 on
> windows)

Aha. When you use a library L that translates in platform-specific ways
to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.

However, try to pass a `main` argument over to gtkext::message_box.

Then you have involved some /other code/ (namely the runtime library
code that calls 'main') that may not necessarily translate for you, and
in fact in Windows is extremely unlikely to translate for you.

Such code is prevalent.

Most code does not translate to/from UTF-8.


Cheers & hth., & thanks for mention of boost::printf,

- Alf


PS: With C++11 there is no longer any reason to use <cstdio> instead of
<stdio.h>, because <cstdio> no longer formally guarantees to not pollute
the global namespace (and in practice it has never honored its C++98
guarantee). The code above is a good example why <stdio.h> is preferable
-- it is too easy to write non-portable code with <cstdio>, such as
using unqualified sprintf (not to mention size_t!).



Re: Silly Boost.Locale default narrow string encoding in Windows

Yakov Galka
On Fri, Oct 28, 2011 at 13:17, Alf P. Steinbach <
[hidden email]> wrote:

> On 28.10.2011 12:36, Yakov Galka wrote:
>
>> On Fri, Oct 28, 2011 at 04:23, Alf P. Steinbach<
>> [hidden email]> wrote:
>>
>>  On 27.10.2011 23:56, Peter Dimov wrote:
>>>
>>>>
>>>> The advantage of using UTF-8 is that, apart from the border layer that
>>>> calls the OS (and that needs to be ported either way), the rest of the
>>>> code is happily char[]-based.
>>>>
>>>
>>> Oh.
>>>
>>> I would be happy to learn this.
>>>
>>> How do I make the following program work with Visual C++ in Windows,
>>> using
>>> narrow character string?
>>>
>>>
>>> <code>
>>> #include<stdio.h>
>>> #include<fcntl.h>       // _O_U8TEXT
>>> #include<io.h>          // _setmode, _fileno
>>> #include<windows.h>
>>>
>>> int main()
>>> {
>>>    //SetConsoleOutputCP( 65001 );
>>>    //_setmode( _fileno( stdout ), _O_U8TEXT );
>>>    printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
>>> }
>>> </code>
>>>
>>>
>> How will you make this program portable?
>>
>
> Well, that was *my* question.
>
> The claim that this minimal "Hello, world!" program puts to the test is
> that "the rest of the [UTF-8 based] code is happily char[]-based".
>
> Apparently that is not so.


My point is that you cannot talk about things without comparison.


>>> The out-commented code is from my random efforts to Make It Work(TM).
>>
>>>
>>> It refused.
>>>
>>>
>> This is because windows narrow-chars can't be UTF-8. You could make it
>> portable by:
>>
>> int main()
>> {
>>     boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
>> }
>>
>
> Thanks, TIL boost::printf.
>
> The idea of UTF-8 as a universal encoding seems now to be to use some
> workaround such as boost::printf for each and every case where it turns out
> that it doesn't work portably.
>

You pull things out of context. We should COMPARE the UTF-8 approach to the
wide-char-on-Windows, narrow-char-elsewhere approach. Your approach
involves using your own printf just as well:

#include "u/stdio_h.h"      // u::CodingValue, u::printf, U
// ADL or not ADL? Depends on what exactly U is.
printf(U("Blåbærsyltetøy! 日本国 кошка!\n"));    // ADL?
u::printf(U("Blåbærsyltetøy! 日本国 кошка!\n")); // or not ADL?

but anyway you have to do O(N) work to wrap the N library functions you use.

Your approach is in no way better.


> [...]
>
> [snip]
>
>> You judge from a non-portable code point of view. How about:
>>
>> #include <cstdio>
>>
>> #include "gtkext/message_box.h" // for gtkext::message_box
>>
>> int main()
>> {
>>     char buffer[80];
>>     sprintf(buffer, "The answer is %d!", 6*7);
>>     gtkext::message_box(buffer, "This is a title!",
>> gtkext::icon_blah_blah,
>> ...);
>> }
>>
>> And unlike your code, it's magically portable! (thanks to gtk using UTF-8
>> on
>> windows)
>>
>
> Aha. When you use a library L that translates in platform-specific ways
> to/from UTF-8 for you, then UTF-8 is magically portable. For use of L.
>
> However, try to pass a `main` argument over to gtkext::message_box.
>

See the argv explanation in
http://permalink.gmane.org/gmane.comp.lib.boost.devel/225036

--
Yakov


Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
In reply to this post by Alf P. Steinbach
Alf P. Steinbach wrote:

> How do I make the following program work with Visual C++ in Windows, using
> narrow character string?
>
> <code>
> #include <stdio.h>
> #include <fcntl.h>      // _O_U8TEXT
> #include <io.h>         // _setmode, _fileno
> #include <windows.h>
>
> int main()
> {
>      //SetConsoleOutputCP( 65001 );
>      //_setmode( _fileno( stdout ), _O_U8TEXT );
>      printf( "Blåbærsyltetøy! 日本国 кошка!\n" );
> }
> </code>

Output to a console wasn't our topic so far (and is not one of my strong
points), but the specific problem with this program is that the embedded
literal is not UTF-8, as the warning C4566 tells us, so there is no way for
you to get UTF-8 in the output. (You should be able to set VC++'s code page
to 65001, but I don't think you can.)

int main()
{
    printf( utf8_encode( L"кошка" ).c_str() );
}
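
The utf8_encode call above is not defined anywhere in the thread. A minimal sketch of such a helper follows, assuming it takes a std::wstring and returns UTF-8 in a std::string; it handles both 16-bit wchar_t (UTF-16 with surrogate pairs, as in Visual C++) and 32-bit wchar_t, and substitutes U+FFFD for invalid input. The signature and the replacement policy are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: named in the post but never shown there.
std::string utf8_encode(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);

        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            unsigned long lo = static_cast<unsigned long>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates and out-of-range values become U+FFFD.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
        assert(cp <= 0x10FFFF);

        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Because the input is a wide literal, the compiler converts the source text to wchar_t itself, so the result does not depend on the narrow execution character set that triggers C4566.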

This is not a practical problem for "proper" applications because Russian
text literals should always come from the equivalent of gettext and never be
embedded in code.

int main()
{
    printf( gettext( "cat" ).c_str() );
}

So, yes, I admit that you can't easily write a portable application (or a
command-line utility) that has its Russian texts hardcoded, if that's your
point. But you can write a command-line utility that can take кошка.txt as
input and work properly, which is what I've been saying, and what sparked
the original debate (argv[1]).
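
A rough sketch of the border layer this implies for argv: on Windows, ignore the narrow argv handed to main, fetch the wide command line and convert it to UTF-8; elsewhere, pass the arguments through untouched. The function name utf8_args is hypothetical (it is not a Boost API); GetCommandLineW, CommandLineToArgvW and WideCharToMultiByte are the real Win32 calls, and the Windows branch needs shell32.

```cpp
#include <cassert>
#include <cwchar>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW
#endif

// Hypothetical helper: returns the program arguments as UTF-8 on Windows
// and as "whatever was passed" on other platforms.
std::vector<std::string> utf8_args(int argc, char** argv)
{
    assert(argc >= 0);
    std::vector<std::string> args;
#ifdef _WIN32
    (void)argv;                       // the narrow argv may already be lossy
    int n = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &n);
    if (!wargv) return args;
    for (int i = 0; i < n; ++i) {
        std::string s;
        int wlen = static_cast<int>(std::wcslen(wargv[i]));
        if (wlen > 0) {
            // First call computes the size, second call converts.
            int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], wlen,
                                          NULL, 0, NULL, NULL);
            s.resize(len);
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], wlen,
                                &s[0], len, NULL, NULL);
        }
        args.push_back(s);
    }
    LocalFree(wargv);
#else
    for (int i = 0; i < argc; ++i) args.push_back(argv[i]);
#endif
    return args;
}
```

A main function would call utf8_args(argc, argv) once at startup and hand the resulting strings to the char-based rest of the program.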



Re: Silly Boost.Locale default narrow string encoding in Windows

Beman Dawes
In reply to this post by Alf P. Steinbach
Alf,

On Thu, Oct 27, 2011 at 5:12 PM, Alf P. Steinbach
<[hidden email]> wrote:
>...
> Thanks for that clarification of the current thinking at Boost.
>...

Please understand that Boost isn't a single library, but rather a
collection of 100 or so individual libraries. So there isn't any
single "current thinking at Boost" on any topic that has library or
application dependent aspects.

That said, Peter Dimov's replies do represent the thinking of many
Boost developers and library maintainers, including me:-)

--Beman


Re: Silly Boost.Locale default narrow string encoding in Windows

Peter Dimov-5
In reply to this post by Alf P. Steinbach
Alf P. Steinbach wrote:
> On 28.10.2011 12:36, Yakov Galka wrote:
> > This is because windows narrow-chars can't be UTF-8. You could make it
> > portable by:
> >
> > int main()
> > {
> >      boost::printf("Blåbærsyltetøy! 日本国 кошка!\n");
> > }
>
> Thanks, TIL boost::printf.

No, I don't think that this works. The problem here is not the printf call,
it's the literal. When a char[] that does contain the proper UTF-8 text is
passed, printf works under chcp 65001.

In principle, you should still need to use the hypothetical boost::printf,
though, if you want the program to properly support arbitrary code pages
(not that the text above can be output in any code page other than 65001).
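
To make the "it's the literal, not printf" point concrete, here is a small sketch in which the char[] is guaranteed to hold UTF-8 because the bytes are spelled out with hex escapes rather than left to the compiler's source and execution character sets. The function names are made up for the example; SetConsoleOutputCP is the real Win32 call and plays the role of chcp 65001. Whether the glyphs then render still depends on the console font.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#ifdef _WIN32
#include <windows.h>   // SetConsoleOutputCP, CP_UTF8
#endif

// "кошка" as explicit UTF-8 bytes, independent of how the source file is
// saved and of the compiler's narrow execution character set.
const char* cat_utf8()
{
    return "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0";
}

// Prints a UTF-8 string with plain printf; on Windows the console output
// code page is switched to UTF-8 first (the programmatic chcp 65001).
void print_utf8_line(const char* text)
{
    assert(text != NULL);
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
#endif
    std::printf("%s\n", text);
}
```

Calling print_utf8_line(cat_utf8()) sends exactly ten bytes of text plus a newline to stdout on every platform.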

> When every portability problem has been diagnosed and special cased to use
> functions that translate to/from UTF-8, and ignoring the
> efficiency aspect of that, then UTF-8 just magically works, hurray.
>
> E.g., if 'fopen( "rød.txt", "r" )' fails in the universal UTF-8 code, then
> just replace with 'boost::fopen', or 'my_special_casing::fopen'.

Yes, exactly. It's not a silver bullet, but... try coming up with a better
alternative.
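
For concreteness, a sketch of what such a replacement fopen might look like. The name utf8_fopen and the helper widen are hypothetical, standing in for the 'boost::fopen' or 'my_special_casing::fopen' mentioned above (neither of which is shown in the thread). On Windows it converts the UTF-8 path to UTF-16 with MultiByteToWideChar and calls _wfopen; elsewhere it forwards to std::fopen unchanged.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>
#ifdef _WIN32
#include <windows.h>

// UTF-8 -> UTF-16 for the Windows branch only.
static std::wstring widen(const char* utf8)
{
    std::wstring w;
    int len = static_cast<int>(std::strlen(utf8));
    if (len == 0) return w;
    // First call computes the size, second call converts.
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, len, NULL, 0);
    w.resize(n);
    MultiByteToWideChar(CP_UTF8, 0, utf8, len, &w[0], n);
    return w;
}
#endif

// Hypothetical border-layer function: the path is always UTF-8.
std::FILE* utf8_fopen(const char* path, const char* mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    return std::fopen(path, mode);   // the bytes are passed through as-is
#endif
}
```

With this in place the call from the quoted example becomes utf8_fopen( "r\xC3\xB8" "d.txt", "r" ); the literal is split so that the 'd' is not swallowed by the hex escape.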

