
How to access unicode strings through MEX/Engine C interfaces?

Short version
How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX interface?
Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
If I first do feature('DefaultCharacterSet', 'UTF-8'), then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This proves that MATLAB stores it as unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters replaced by code 26. What I need is access to the underlying unicode data as a C string in any unicode format (UTF-8, UTF-16, etc.). Is this possible?
Addendum: The doc page of mxArrayToString explicitly states that
  • "[mxArrayToString] supports multibyte encoded characters."
So how can I get the multibyte non-Latin-1 characters then?
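For readers unfamiliar with "code 26": it is the ASCII SUB (substitute) control character, 0x1A, which converters commonly emit for characters the target encoding cannot represent. The following is only an illustrative sketch of that kind of lossy narrowing, not a description of how mxArrayToString is implemented. The function name narrow_to_latin1 is hypothetical, and uint16_t is assumed to stand in for a 16-bit character type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SUB_CHAR 0x1A /* ASCII SUB, decimal 26 */

/* Narrow 16-bit code units to Latin-1 bytes (hypothetical helper).
 * Code units above U+00FF have no Latin-1 equivalent and become SUB.
 * `out` must have room for n + 1 bytes; a terminating NUL is written. */
void narrow_to_latin1(const uint16_t *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (in[i] <= 0xFF) ? (unsigned char)in[i] : SUB_CHAR;
    }
    out[n] = '\0';
}
```

With this sketch, 'á' (U+00E1) survives as the byte 0xE1, while a character such as 'α' (U+03B1) is replaced by 26, which matches the symptom described above.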
----
Original long version
What character encoding does MATLAB use internally, if any, and is there a way to control this? To be precise, I would like to know if there is a way to guarantee that any character array I retrieve is going to adhere to a particular encoding, preferably a unicode one.
I am interfacing MATLAB with another library through the MATLAB Engine interface and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?
Related things I found:
  • This here says that it uses UTF-16, but that's not what I see when I retrieve strings in C code.
  • I found references to feature('DefaultCharacterEncoding', 'UTF-8') on the web. What this appears to do is control what encoding the input commands (engEvalString) are assumed to have, and how the output is encoded. If I supply a UTF-8 encoded á as s='á', then retrieve this in C, I get an ISO-Latin-1 encoded á. If I send something that's not in Latin-1, I get nonsense (actually character code 26). (At least this is my impression after a few simple tests---these are time consuming)
In light of this finding, I'd like to know: does MATLAB support unicode for all its strings? If yes, how do I get access to these from the C interface? (Any unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?
Also, any pointers to the relevant documentation on the issue are most welcome.
(I should probably mention that I was testing this on OS X as I'm aware that there are differences in the implementation of the matlab engine interface between platforms.)
  2 Comments
Amro on 18 Feb 2013
@Szabolcs: I just finished preparing an answer for Stack Overflow only to find the question deleted... Can you reopen it, I think I have some useful info (with working examples and all) :)


Accepted Answer

Jan on 18 Feb 2013
Sorry, this does not match your question exactly, but perhaps it is useful for the topic.
See also: Answers: Matlab string to wchar. I got this message from the support:
I believe mxChar was originally intended to be UTF-16, however the surrogate pair style unicode characters do not appear to be fully supported. However I suspect passing these characters through MATLAB 'mxChar' to the operating system should still be fine as MATLAB links against ICU (International Components for Unicode).
For compilers that have 'wchar_t' as a 16-bit value and use encoding schemes UTF-16 / UCS-2, this code will be safe.
For 32-bit 'wchar_t' values, you would need to do a conversion from UTF-16 to the encoding scheme employed by the operating system. For basic MATLAB strings to UTF-32, you could potentially just leave the upper 16-bits at zero. However as you expect, there may be certain strings obtained from the operating system that are in surrogate pair form, which require a slightly more advanced conversion. It may be better to utilize a separate library such as ICU to do the conversion between UTF-16 and the Linux encoding scheme.
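The "slightly more advanced conversion" mentioned in the support message can be sketched in plain C as below. This is a minimal illustration under stated assumptions, not MathWorks code: uint16_t is assumed to match the width of mxChar, the input buffer is assumed to come from something like mxGetChars, and the function name utf16_to_utf32 is hypothetical. Unpaired surrogates are mapped to U+FFFD, which is one common policy among several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-16 code units into UTF-32 code points (hypothetical helper).
 * `in` stands in for the 16-bit buffer of a MATLAB char array.
 * Returns the number of code points written, or (size_t)-1 if `out`
 * (capacity `cap`) is too small. Unpaired surrogates become U+FFFD. */
size_t utf16_to_utf32(const uint16_t *in, size_t n, uint32_t *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i; /* consume the low surrogate as well */
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            cp = 0xFFFD;
        }
        if (j >= cap) {
            return (size_t)-1;
        }
        out[j++] = cp;
    }
    return j;
}
```

For strings that contain no surrogates, this reduces to the "leave the upper 16 bits at zero" case described above. A library such as ICU remains the more robust choice for production use.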
  2 Comments
Szabolcs on 18 Feb 2013
This is actually very useful, as well as this, which I did not find until now. I did not realize that mxChars were two (or four) bytes, unfortunately. Also, the library I am interfacing with does not support surrogate pairs either (which is pretty annoying in general, but somewhat convenient in this case).
I'll accept once I manage to get things working using this information.
Szabolcs on 18 Feb 2013
@Jan, I know this is not related, but since you seem to have a lot of experience with the C interface, could you take a look at this? Have you ever tried to transfer "class" type values? When I attempt this, MATLAB crashes. If you don't have experience with this, please just ignore this comment. (Background: I'm interfacing MATLAB with another language, so all the unusual edge cases are coming out...)


More Answers (1)

Walter Roberson on 18 Feb 2013
MATLAB uses a 16 bit character internally, but it does not use UTF-anything. It simply uses the first 65536 Unicode code points.
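If each 16-bit character is taken to be a single code point in the range U+0000 to U+FFFF, as described here, producing UTF-8 for an external library needs at most three bytes per character. The sketch below assumes uint16_t matches the character width and that the buffer was obtained from the char array by some means such as mxGetChars. The name bmp_to_utf8 is hypothetical, and replacing surrogate-range values with U+FFFD is a choice made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode 16-bit code units, each treated as one BMP code point, as a
 * NUL-terminated UTF-8 string (hypothetical helper).
 * Returns the number of bytes written excluding the NUL, or (size_t)-1
 * if `out` (capacity `cap`, including the NUL) is too small. */
size_t bmp_to_utf8(const uint16_t *in, size_t n, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    if (cap == 0) {
        return (size_t)-1;
    }
    size_t j = 0;
    for (size_t i = 0; i < n; ++i) {
        uint16_t c = in[i];
        if (c >= 0xD800 && c <= 0xDFFF) {
            c = 0xFFFD; /* surrogates are not valid on their own in UTF-8 */
        }
        size_t need = (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        if (j + need + 1 > cap) { /* keep one byte for the NUL */
            return (size_t)-1;
        }
        if (need == 1) {
            out[j++] = (char)c;
        } else if (need == 2) {
            out[j++] = (char)(0xC0 | (c >> 6));
            out[j++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[j++] = (char)(0xE0 | (c >> 12));
            out[j++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[j++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return j;
}
```

A buffer of 3 * n + 1 bytes is always large enough for n input characters under these assumptions.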
