Encoding problems in TeXmacs pdf exports

re4zuaFe · April 20, 2024, 5:49am

I am opening this post because the bug report was mistakenly closed.

I looked into the problem, and the exported pdf contains a line:

/Title <FEFF003100201D538>

which is of course incorrect: FEFF indicates that the string is encoded in UTF-16, yet the payload contains an odd number of hex digits, so it cannot be a sequence of 16-bit units. The problem occurs when converting the character 𝔸, which indeed corresponds to U+1D538: to encode it in UTF-16, it must be split into two 16-bit words, D835 DD38, rather than written as 1D538.
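The arithmetic behind the two words can be sketched as follows. This is only an illustration: hex4 and code_point_to_utf16be_hex are hypothetical helper names, not TeXmacs functions, and the sketch assumes its input is a valid code point (no surrogates, at most 0x10FFFF).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format one 16-bit code unit as exactly four uppercase hex digits.
static std::string hex4(std::uint32_t unit) {
    char buf[5];
    std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(unit & 0xFFFFu));
    return std::string(buf);
}

// Encode a single code point as UTF-16BE hex digits.
// Hypothetical helper for illustration; input is assumed to be valid.
std::string code_point_to_utf16be_hex(std::uint32_t cp) {
    if (cp < 0x10000u) return hex4(cp);           // BMP: a single code unit
    std::uint32_t v    = cp - 0x10000u;           // 20-bit offset into the supplementary planes
    std::uint32_t high = 0xD800u + (v >> 10);     // high surrogate carries the top 10 bits
    std::uint32_t low  = 0xDC00u + (v & 0x3FFu);  // low surrogate carries the bottom 10 bits
    return hex4(high) + hex4(low);
}
```

For 𝔸, calling code_point_to_utf16be_hex(0x1D538) yields the eight hex digits D835DD38, whereas a BMP character such as "1" (0x31) stays a single unit, 0031.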

mgubi · May 25, 2021, 9:05pm

I will take a look and try to understand whether there is a bug in the PDF export or whether this can be fixed on the TeXmacs side.

re4zuaFe · May 26, 2021, 9:57am

This bug is still reproducible in TeXmacs 1.99.20. The tm file is attached at the end.

If the conversion is handled by TeXmacs, I propose using the C++ standard library to perform it: codecvt_utf16 and wstring_convert.

I don’t know what came of the discussion between @darcy and Joris about modernizing the C++ code. I don’t think it is worth converting everything, but where code is buggy, it might be good to replace it with correct, modern C++11 code.

I modified the sample code; it produces the correct UTF-16BE (as well as UTF-8 and UTF-16LE) encodings:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>
 
// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}
 
int main()
{
    // wide character data
    // std::wstring wstr =  L"z\u00df\u6c34\U0001f34c"; // or L"zß水🍌"
    std::wstring wstr = L"𝔸";
 
    // wide to UTF-8
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(wstr);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
 
    // wide to UTF-16le
    std::wstring_convert<std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>> conv2;
    std::string u16str = conv2.to_bytes(wstr);
    std::cout << "UTF-16le conversion produced " << u16str.size() << " bytes:\n";
    hex_print(u16str);

    // wide to UTF-16be
    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv3;
    std::string u16bestr = conv3.to_bytes(wstr);
    std::cout << "UTF-16be conversion produced " << u16bestr.size() << " bytes:\n";
    hex_print(u16bestr);
}

The tm file:

<TeXmacs|1.99.20>

<style|generic>

<\body>
  <section|<math|\<bbb-A\>>>

  <subsection|<math|\<bbb-B\>>>

  <subsubsection|<math|\<bbb-A\>>>

  <section|<math|\<bbb-C\>>>

  <section|<math|\<bbb-U\>>>
</body>

<\initial>
  <\collection>
    <associate|page-height|auto>
    <associate|page-medium|paper>
    <associate|page-type|letter>
    <associate|page-width|auto>
  </collection>
</initial>

<\references>
  <\collection>
    <associate|auto-1|<tuple|1|1>>
    <associate|auto-2|<tuple|1.1|1>>
    <associate|auto-3|<tuple|1.1.1|1>>
    <associate|auto-4|<tuple|2|1>>
    <associate|auto-5|<tuple|3|1>>
  </collection>
</references>

<\auxiliary>
  <\collection>
    <\associate|toc>
      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|1<space|2spc><with|mode|<quote|math>|\<bbb-A\>>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-1><vspace|0.5fn>

      <with|par-left|<quote|1tab>|1.1<space|2spc><with|mode|<quote|math>|\<bbb-B\>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-2>>

      <with|par-left|<quote|2tab>|1.1.1<space|2spc><with|mode|<quote|math>|\<bbb-A\>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-3>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|2<space|2spc><with|mode|<quote|math>|\<bbb-C\>>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-4><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|3<space|2spc><with|mode|<quote|math>|\<bbb-U\>>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-5><vspace|0.5fn>
    </associate>
  </collection>
</auxiliary>

mgubi · May 26, 2021, 1:33pm

TeXmacs does not use the standard library (it was written before the standardisation). I see no point in mixing languages, nor in switching language, to cure a small problem. I will check whether there is a bug or just a missing conversion somewhere.

mgubi · May 26, 2021, 2:10pm

I’ve found the problem. It is indeed a bug, in the function utf8_to_pdf_hex_string in converter.cpp: we do not properly convert to UTF-16 the characters outside the Basic Multilingual Plane (i.e. Unicode code points > 0xFFFF). I’m writing a fix.
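To make the failure mode concrete, here is a small sketch that imitates the behaviour described in this thread: each code point is printed in hex and padded to at least four digits, with no surrogate handling. It is a reconstruction under assumptions, not the TeXmacs source; naive_pdf_hex_string and has_whole_code_units are hypothetical names, and the input is taken as already-decoded code points rather than a Cork or UTF-8 string.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Imitates the buggy conversion: pad each code point to at least four hex
// digits. Code points above 0xFFFF therefore produce five or six digits.
std::string naive_pdf_hex_string(const std::vector<std::uint32_t>& cps) {
    std::string out = "<FEFF";
    for (std::uint32_t cp : cps) {
        char buf[9];  // up to 8 hex digits plus the terminating NUL
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(cp));
        out += buf;
    }
    return out + ">";
}

// A well-formed UTF-16 hex string has a multiple of four digits between the
// angle brackets (the FEFF byte order mark included).
bool has_whole_code_units(const std::string& pdf_hex) {
    if (pdf_hex.size() < 2) return false;
    return (pdf_hex.size() - 2) % 4 == 0;
}
```

Feeding it the code points of the title "1 𝔸" (0x31, 0x20, 0x1D538) reproduces the string from the exported file, <FEFF003100201D538>, and has_whole_code_units rejects it because 17 digits do not split into 16-bit units.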

mgubi · May 26, 2021, 2:19pm

This is the patch if you want to try it.

Index: src/Data/String/converter.cpp
===================================================================
--- src/Data/String/converter.cpp	(revision 13515)
+++ src/Data/String/converter.cpp	(working copy)
@@ -877,15 +877,31 @@
   return result;
 }
 
+
 string
-utf8_to_hex_string (string s) {
-  string result;
+utf8_to_utf16be_string (string s) {
+  string result, hex;
   int i, n= N(s);
   for (i=0; i<n; ) {
     unsigned int code= decode_from_utf8 (s, i);
-    string hex= as_hexadecimal (code);
-    while (N(hex) < 4) hex = "0" * hex;
-    result << hex;
+    // see e.g. https://en.wikipedia.org/wiki/UTF-16
+    if (code >= 0x10000) {
+      // supplementary planes
+      unsigned int code2= code - 0x10000;
+      unsigned int w1= 0xD800 + (code2 >> 10);
+      unsigned int w2= 0xDC00 + (code2 & 0x3FF);
+      hex= as_hexadecimal (w1);
+      while (N(hex) < 4) hex = "0" * hex;
+      result << hex;
+      hex= as_hexadecimal (w2);
+      while (N(hex) < 4) hex = "0" * hex;
+      result << hex;
+    } else {
+      // basic planes
+      string hex= as_hexadecimal (code);
+      while (N(hex) < 4) hex = "0" * hex;
+      result << hex;
+    }
   }
   return result;
 }
@@ -892,5 +908,5 @@
 
 string
 utf8_to_pdf_hex_string (string s) {
-  return "<FEFF" * utf8_to_hex_string (cork_to_utf8 (s)) * ">";
-}
\ No newline at end of file
+  return "<FEFF" * utf8_to_utf16be_string (cork_to_utf8 (s)) * ">";
+}
Index: src/Data/String/converter.hpp
===================================================================
--- src/Data/String/converter.hpp	(revision 13515)
+++ src/Data/String/converter.hpp	(working copy)
@@ -121,7 +121,6 @@
 string convert_char_entities (string s);
 string convert_char_entity (string s, int& start, bool& success);
 string utf8_to_hex_entities (string s);
-string utf8_to_hex_string (string s);
 string utf8_to_pdf_hex_string (string s);
 
 #endif // CONVERTER_H

re4zuaFe · May 26, 2021, 4:14pm

I think that you could write a tester: enumerate all unsigned integers up to 2^32-1, convert each one into a string of 5 bytes (including a \0 at the end), then pass it both to your function and to std::codecvt_utf16 above and compare the results. This should finish in a reasonable amount of time (based on my experience from more than a decade ago, an enumeration up to 2^28 should take less than a second); you could reduce the range a bit if it takes too long.
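A narrower variant of such a tester can be sketched without any reference library. Note that this swaps in a different technique from the one proposed above: instead of comparing against std::codecvt_utf16, it enumerates every supplementary code point and checks that the surrogate pair lies in the expected ranges and decodes back to the original value. The function names are hypothetical, and the encoder simply restates the arithmetic of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Encoder under test: split a supplementary code point into a surrogate pair.
std::pair<std::uint32_t, std::uint32_t> to_surrogates(std::uint32_t cp) {
    std::uint32_t v = cp - 0x10000u;
    return std::make_pair(0xD800u + (v >> 10), 0xDC00u + (v & 0x3FFu));
}

// Decoder written separately from the encoder, using multiplication
// instead of shifts, so that the two do not share the same expression.
std::uint32_t from_surrogates(std::uint32_t high, std::uint32_t low) {
    return 0x10000u + (high - 0xD800u) * 0x400u + (low - 0xDC00u);
}

// Count the code points in [0x10000, 0x10FFFF] whose pair is out of range
// or does not decode back to the original value.
std::uint32_t count_round_trip_failures() {
    std::uint32_t failures = 0;
    for (std::uint32_t cp = 0x10000u; cp <= 0x10FFFFu; ++cp) {
        std::pair<std::uint32_t, std::uint32_t> p = to_surrogates(cp);
        bool in_range = p.first >= 0xD800u && p.first <= 0xDBFFu &&
                        p.second >= 0xDC00u && p.second <= 0xDFFFu;
        if (!in_range || from_surrogates(p.first, p.second) != cp) ++failures;
    }
    return failures;
}
```

The loop covers about a million values, so it runs quickly. It does not exercise the UTF-8 decoding step or malformed input, which the full enumeration over byte strings would.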

mgubi · May 27, 2021, 12:57pm

I suppose I could. Anyway, I was not suggesting that you should test it. I already did, and it seems to work fine on your example document.

re4zuaFe · May 27, 2021, 1:14pm

Such a test finds potential problems that a few “hand-made” samples cannot reveal, including, say, possible mistakes in the Wikipedia description.

mgubi · May 27, 2021, 1:58pm

I’m aware of that. We do not have a systematic testing strategy for TeXmacs, so I would not know where to put a test. I agree it would be good practice, but I probably would not implement what you suggest anyway, since it seems to me to cover a very small “error surface”. It would be more useful to me to know whether there is a mistake in the Wikipedia description of the encoding, where I can find a better source, or whether you see a problem with my algorithm. Code review is an equally effective strategy for avoiding bugs.

Btw, it could be a nice and useful project for a newcomer to implement a test infrastructure and various tests of basic routines.

re4zuaFe · May 27, 2021, 4:41pm

I don’t know whether there is a mistake in the Wikipedia description, but as you know, in academia we don’t regard Wikipedia as a reliable reference (I have previously spotted mathematical errors there and corrected them, but I am not sure about the quality of its programming articles).

I know that code review is an effective strategy, but I once did a careful code review during a coding competition a decade ago. I even tried to prove some loop invariants by hand, yet my code still turned out to be wrong because of an incorrect initialization value.

I don’t know the testing strategies used in industry. When I participated in coding competitions, it was common to write a program implementing a “naive” algorithm, write a data generator, and then compare the results produced by the two programs. As far as I remember, the chance that my code was correct was almost zero.

jeroen · June 2, 2021, 9:27pm

I just discovered that there are already several tests in the tests subdirectory. They use the Google Test framework (https://github.com/google/googletest). Many of them have been implemented by @darcy

mgubi · June 3, 2021, 6:35am

Yes, indeed. I should say that personally I do not much like that it depends on a large codebase like googletest. I do not think it is necessary for us. The test code is also written in a style different from the main codebase (it uses C++11). @darcy, is it necessary to use the library? Can’t we just write two or three macros and some support code (maybe a couple of cpp files) to run the tests ourselves? What was your reason for choosing googletest? For example, Qt already has a testing framework, so we could use that instead. It would be nice to have some more tests, even if I’m a bit skeptical that they are critical for us. The codebase is quite old and well tested; there are not many bugs, and the ones that remain are unlikely to be caught by tests unless we really have ~100% test coverage of the codebase. We are not really doing test-driven development, at least not at the lower levels of the code. Some parts are difficult to test, like the font selection system, but I feel that is where we may need it most. Tests could help detect regressions in case we decide to make big reorganisations of the code. What is your opinion? Personally, I think that if some newcomer would like to try to improve the testing, it is a great way to learn one’s way around the codebase. But I would really like not to depend on googletest, and also to write the tests in the style of the main codebase.
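As a rough idea of what "two or three macros and some support code" could look like, here is a minimal sketch. Everything in it is hypothetical (the CHECK macro, the counters, the test and runner names); it is not taken from the TeXmacs tests directory and makes no claim to match the style of the main codebase. It only avoids C++11 features and any external framework.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical minimal test support: two counters and one macro.
static int tests_run = 0;
static int tests_failed = 0;

// Record a check; on failure, report the file, line and failing expression.
#define CHECK(cond)                                              \
    do {                                                         \
        ++tests_run;                                             \
        if (!(cond)) {                                           \
            ++tests_failed;                                      \
            std::fprintf(stderr, "%s:%d: CHECK failed: %s\n",    \
                         __FILE__, __LINE__, #cond);             \
        }                                                        \
    } while (0)

// Example test: the surrogate pair for U+1D538, computed as in the patch.
void test_surrogate_pair() {
    unsigned int code2 = 0x1D538u - 0x10000u;
    CHECK(0xD800u + (code2 >> 10) == 0xD835u);
    CHECK(0xDC00u + (code2 & 0x3FFu) == 0xDD38u);
}

// Run every registered test and return the number of failed checks,
// which a main function could use as the process exit status.
int run_all_tests() {
    test_surrogate_pair();
    std::printf("%d checks, %d failed\n", tests_run, tests_failed);
    return tests_failed;
}
```

A test executable would then only need a main that returns run_all_tests(). What a framework such as Google Test or QtTest adds on top of this is mostly test discovery, fixtures and richer failure output.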

re4zuaFe · June 3, 2021, 2:03pm

I don’t understand the motivation for “general tests”. In my opinion, tests should be closely tied to the code: when the code is written with some contracts, it can be tested against them with a random data generator. (Contracts can be useful even without tests; they can be understood as an informal version of Floyd-Hoare logic. There is an example in one of Dijkstra’s letters.) Tests also make sense when you have two implementations and want to compare them.

jeroen · June 4, 2021, 8:20am

Do we need to use C++11 for Google Test? I see that it needs a C++11 capable compiler, but can’t we still write tests in our own style?
We may not be able to cover all of the code, but it would still be valuable to test as much as possible. For example, the Russian hyphenation bug could easily have been caught by a simple test. I’ve written one now, and it has already been very useful for further debugging. Such bugs may reappear in the future, for example if we decide to change the encoding to something other than Cork.

I saw in one of @darcy’s posts that there is an intention to use Catch2 instead of Google Test, but I don’t know why. QtTest also looks interesting, as it’s lightweight.