How to access unicode strings through MEX/Engine C interfaces?

Question

Szabolcs on 18 Feb 2013

https://it.mathworks.com/matlabcentral/answers/63980-how-to-access-unicode-strings-through-mex-engine-c-interfaces

Accepted answer: Jan

Short version

How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX interface?

Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as

 fid = fopen('test.txt', 'r', 'l', 'UTF-8');  % read mode, little-endian, UTF-8 encoded
 s = fscanf(fid, '%s')                        % no semicolon, so s is displayed
 fclose(fid);

If I first do feature('DefaultCharacterSet', 'UTF-8') and then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This shows that MATLAB stores it as Unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by code 26. What I need is access to the underlying Unicode data as a C string in any Unicode format (UTF-8, UTF-16, etc.). Is this possible?
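
To make the symptom concrete: code 26 is the ASCII SUB (substitute) control character. The plain C sketch below reproduces the same kind of lossy narrowing on a buffer of 16-bit code units. It only illustrates the observed behavior and is not a claim about how mxArrayToString is implemented. The function name utf16_to_latin1_lossy and the macro LATIN1_SUB are hypothetical, and the uint16_t buffer merely stands in for MATLAB character data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII SUB (decimal 26), written wherever a character does not fit. */
#define LATIN1_SUB 0x1A

/*
 * Hypothetical helper: narrow `len` 16-bit code units to ISO-8859-1 bytes.
 * `dst` must have room for len + 1 bytes; the result is NUL-terminated.
 * Returns how many code units had to be replaced by SUB.
 */
size_t utf16_to_latin1_lossy(const uint16_t *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t replaced = 0;
    for (size_t i = 0; i < len; ++i) {
        if (src[i] <= 0xFF) {
            /* ISO-8859-1 coincides with U+0000..U+00FF. */
            dst[i] = (unsigned char)src[i];
        } else {
            /* Anything above U+00FF has no Latin-1 byte. */
            dst[i] = LATIN1_SUB;
            ++replaced;
        }
    }
    dst[len] = '\0';
    return replaced;
}
```

A non-zero return value means information was lost, which is exactly the situation described above for text that is not representable in Latin-1.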

Addendum: The doc page of mxArrayToString explicitly states that

"[mxArrayToString] supports multibyte encoded characters."

So how can I get the multibyte non-Latin-1 characters then?

----

Original long version

What character encoding does MATLAB use internally, if any, and is there a way to control this? To be precise, I would like to know if there is a way to guarantee that any character array I retrieve is going to adhere to a particular encoding, preferably a Unicode one.

I am interfacing MATLAB with another library through the MATLAB Engine interface and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?

Related things I found:

This here says that it uses UTF-16, but that's not what I see when I retrieve strings in C code.
I found references to feature('DefaultCharacterSet', 'UTF-8') on the web. This appears to control what encoding the input commands (engEvalString) are assumed to have, and how the output is encoded. If I supply a UTF-8 encoded á as s='á' and then retrieve it in C, I get an ISO-Latin-1 encoded á. If I send something that is not in Latin-1, I get nonsense (actually character code 26). (At least this is my impression after a few simple tests; these are time consuming.)

In light of this finding, I'd like to know: does MATLAB support Unicode for all its strings? If yes, how do I get access to these from the C interface? (Any Unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support Unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?

Also, any pointers to the relevant documentation on the issue are most welcome.

(I should probably mention that I was testing this on OS X, as I'm aware that there are differences in the implementation of the MATLAB Engine interface between platforms.)

2 Comments

Jan on 18 Feb 2013

See also: http://www.mathworks.com/matlabcentral/answers/3198-convert-matlab-string-to-wchar-in-c-mex-under-windows-and-linux

Amro on 18 Feb 2013

@Szabolcs: I just finished preparing an answer for Stack Overflow only to find the question deleted... Can you reopen it, I think I have some useful info (with working examples and all) :)

Answer 1

Jan on 18 Feb 2013

https://it.mathworks.com/matlabcentral/answers/63980-how-to-access-unicode-strings-through-mex-engine-c-interfaces#answer_75551

Sorry, this does not match your question exactly, but perhaps it is useful for the topic.

See also: Answers: Matlab string to wchar. I got this message from the support:

 I believe mxChar was originally intended to be UTF-16, however the surrogate pair style unicode characters do not appear to be fully supported. However I suspect passing these characters through MATLAB 'mxChar' to the operating system should still be fine as MATLAB links against ICU (International Components for Unicode).
 For compilers that have 'wchar_t' as a 16-bit value and use encoding schemes UTF-16 / UCS-2, this code will be safe.
 For 32-bit 'wchar_t' values, you would need to do a conversion from UTF-16 to the encoding scheme employed by the operating system. For basic MATLAB strings to UTF-32, you could potentially just leave the upper 16-bits at zero. However as you expect, there may be certain strings obtained from the operating system that are in surrogate pair form, which require a slightly more advanced conversion. It may be better to utilize a separate library such as ICU to do the conversion between UTF-16 and the Linux encoding scheme.
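
The "slightly more advanced conversion" mentioned in the support message can be sketched in plain C roughly as follows. This is a minimal sketch under some assumptions: the input is a buffer of 16-bit code units (a uint16_t array stands in for mxChar data, which in a MEX file would presumably be obtained with mxGetChars and mxGetNumberOfElements), the function name utf16_to_utf32 is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice of this sketch, not something prescribed by MATLAB or ICU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode `len` UTF-16 code units into UTF-32.
 * `dst` must have room for `len` entries (the output is never longer).
 * Unpaired surrogates become U+FFFD. Returns the number of code points.
 */
size_t utf16_to_utf32(const uint16_t *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        uint32_t cu = src[i];
        if (cu >= 0xD800 && cu <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            if (i + 1 < len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                uint32_t lo = src[++i];
                dst[n++] = 0x10000 + ((cu - 0xD800) << 10) + (lo - 0xDC00);
            } else {
                dst[n++] = 0xFFFD;   /* unpaired high surrogate */
            }
        } else if (cu >= 0xDC00 && cu <= 0xDFFF) {
            dst[n++] = 0xFFFD;       /* stray low surrogate */
        } else {
            dst[n++] = cu;           /* BMP: upper 16 bits stay zero */
        }
    }
    return n;
}
```

For strings without surrogate pairs, the last branch is the only one taken, which matches the "leave the upper 16 bits at zero" shortcut in the message above. For production use, a library such as ICU is likely the safer route.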

2 Comments

Szabolcs on 18 Feb 2013

This is actually very useful, as well as this, which I did not find until now. I did not realize that mxChars were two (or four) bytes, unfortunately. Also, the library I am interfacing with does not support surrogate pairs either (which is pretty annoying in general, but somewhat convenient in this case).

I'll accept once I manage to get things working using this information.
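
For the case described here, where surrogate pairs are not supported on the receiving side anyway, a BMP-only conversion to UTF-8 might look like the sketch below. It is plain C working on a uint16_t buffer that stands in for the mxChar data. The function name bmp_utf16_to_utf8 and the (size_t)-1 error convention are hypothetical choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode `len` BMP-only UTF-16 code units as UTF-8.
 * `dst` must have room for 3 * len + 1 bytes; the result is NUL-terminated.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if a surrogate code unit is encountered.
 */
size_t bmp_utf16_to_utf8(const uint16_t *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        uint16_t cu = src[i];
        if (cu >= 0xD800 && cu <= 0xDFFF) {
            dst[0] = '\0';           /* reject instead of guessing */
            return (size_t)-1;
        }
        if (cu < 0x80) {             /* 1 byte: U+0000..U+007F */
            dst[n++] = (unsigned char)cu;
        } else if (cu < 0x800) {     /* 2 bytes: U+0080..U+07FF */
            dst[n++] = (unsigned char)(0xC0 | (cu >> 6));
            dst[n++] = (unsigned char)(0x80 | (cu & 0x3F));
        } else {                     /* 3 bytes: rest of the BMP */
            dst[n++] = (unsigned char)(0xE0 | (cu >> 12));
            dst[n++] = (unsigned char)(0x80 | ((cu >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (cu & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

Rejecting surrogates outright keeps the failure visible to the caller instead of silently producing bytes the other library cannot interpret.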

Szabolcs on 18 Feb 2013

@Jan, I know this is not related, but since you seem to have a lot of experience with the C interface, could you take a look at this? Have you ever tried to transfer "class" type values? When I attempt this, MATLAB crashes. If you don't have experience with this, please just ignore this comment. (Background: I'm interfacing MATLAB with another language, so all the unusual edge cases are coming out...)
