Hyphenation doesn't work with Russian language (only)

syroezhkin · May 7, 2021, 4:03pm

Yep! Love Manjaro for that.

syroezhkin · May 20, 2021, 4:48pm

So, I was just able to test the latest version, 1.99.20. Hyphenation now works, but it does not follow the rules of the Russian language. If you look closely, you will notice that the break points seem to be shifted one letter to the right (as if an array index were increased by one).

TEXMACS            SHOULD BE
рек-онструкции     ре-конструкции
изв-естными        из-вестными
паралл-ельного     парал-лельного
перс-пективы       пер-спективы

But if I specify the hyphenation manually in Format / Hyphenate as…, then everything works correctly.

I also checked whether the words from the exception list in the hyphen.english file are handled correctly. Interestingly, these words are hyphenated correctly, but after a while the program just crashed.

pireddag · May 20, 2021, 5:30pm

This for example I cannot interpret looking at the hyphenation patterns.
I find:
из1в2
and I do not find other rules that stop the hyphenation there, while I find
-в8е8
8в8е-
which I think apply when the word has a hyphen of its own, and nothing else for breaking ве.
I am interpreting the patterns above as “hyphenate as з-в”

Let us see if anyone else posts some insight.

The crash is a bug I think, worth reporting.

Edit: read this sentence now

What I wrote seems to me compatible with the idea.

jeroen · May 20, 2021, 10:28pm

Yes, something odd is going on.
I’ve turned on some debugging output in src/System/Language/hyphenate.cpp to see the applied hyphenation rules, the resulting penalties and the end result.

The applied rules:

.известными.
  ы => ы1
  м => 1м
  тн => 2т1н
  ны => 1ны
  изв => из1в2
  зве => з1ве
  вес => 1вес
  ест => е1ст
  стн => 2стн

Penalties:

.известными. --> [ 100000000, 100000000, 10000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000 ]

The end result:

Hyphen известными, 2
Yields изв-, естными

Edit: the penalties are indicating a split after the letter в

jeroen · May 20, 2021, 6:45pm

Considering the rules shown and the algorithm explained here, it seems that the results should be
и0з1в2е2с2т1н0ы1м0и, resulting in feasible hyphenation as из-вест-ны-ми. So, something is definitely wrong in the penalties.
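
For readers who want to check that calculation, here is a minimal sketch in Python (not the C++ of hyphenate.cpp) of how Liang-style TeX patterns are usually combined: every matching pattern contributes its digits to the gaps between letters, the maximum wins, and odd values mark allowed breaks. The function names are hypothetical and not taken from TeXmacs, the rule list is just the nine rules from the debugging output above, and the sketch ignores the minimum prefix and suffix lengths that a real hyphenator enforces.

```python
# The nine rules reported as applied to "известными" in the debugging output.
RULES = ["ы1", "1м", "2т1н", "1ны", "из1в2", "з1ве", "1вес", "е1ст", "2стн"]


def parse_pattern(pattern):
    """Split a pattern such as 'из1в2' into its letters and its digits.

    values[i] is the digit written before letters[i]; the last entry is the
    digit written after the final letter. Missing digits count as 0.
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch.isdigit():
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def gap_values(word, patterns):
    """Return one value per gap between adjacent letters of word."""
    padded = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[i] is the gap before padded[i]
    for pattern in patterns:
        letters, values = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:             # apply the pattern at every occurrence
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
            start = padded.find(letters, start + 1)
    # word[i] sits at padded[i + 1]; keep only the gaps inside the word.
    return [scores[i + 1] for i in range(1, len(word))]


def hyphenate(word, patterns):
    """Insert '-' at every gap whose combined value is odd."""
    pieces, start = [], 0
    for gap, value in enumerate(gap_values(word, patterns)):
        if value % 2 == 1:             # gap lies between word[gap] and word[gap + 1]
            pieces.append(word[start:gap + 1])
            start = gap + 1
    pieces.append(word[start:])
    return "-".join(pieces)


print(gap_values("известными", RULES))
print(hyphenate("известными", RULES))
```

With these nine rules the gap values come out as 0 1 2 2 2 1 0 1 0, which is the same string of digits as above, and the word is split as из-вест-ны-ми.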

pireddag · May 20, 2021, 8:27pm

So: we should file a bug report. I propose that either you or @syroezhkin does that. This is a bit “I propose ;-)”, but you have the original version of the debugging output, and @syroezhkin has the original version of the hyphenated text.

jeroen · May 20, 2021, 8:50pm

I have found the problem and fixed it, but in the process I’ve discovered more problems with hyphenating utf8 encoded strings that are a bit trickier to fix.

I’ll see if I can fix both problems, if not I’ll file the first fix and report the second issue.

jeroen · May 20, 2021, 9:57pm

I’ve submitted a patch that fixes all the issues I’ve found, but unfortunately it seems to make the editor kind of sluggish in long paragraphs in Russian. Let’s hope Joris (or someone else who’s a better programmer than me) can find a more efficient solution.

mgubi · May 21, 2021, 10:11am

There should not be utf8 encoded strings in TeXmacs documents, no? As far as I understand all the strings in TeXmacs are encoded via the universal TeXmacs encoding. Could this be a nice topic for the next hacker’s meeting?

jeroen · May 21, 2021, 11:40am

Yes, the input string to be hyphenated is transformed to utf8 with cork_to_utf8 at the start of the get_hyphens routine. I don’t know why it was done this way and not the other way around, transforming the hyphenation tables to cork when it is loaded, as it is done for other languages. It’d be an interesting topic to discuss.
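
As a general illustration of why UTF-8 strings need care here (this is not a reconstruction of the actual TeXmacs bug or of the patch), the Python sketch below shows that each Cyrillic letter takes two bytes in UTF-8, so a position counted in characters and a position counted in bytes are different numbers. The helper names are hypothetical.

```python
def char_start_bytes(text):
    """Return the UTF-8 byte offset at which each character of text starts."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        position += len(ch.encode("utf-8"))
    return offsets


def split_by_chars(text, char_index):
    """Split text after char_index characters."""
    return text[:char_index], text[char_index:]


def prefix_by_bytes(text, byte_index):
    """Take the first byte_index bytes of the UTF-8 encoding and decode them.

    An incomplete multi-byte sequence at the cut is silently dropped.
    """
    return text.encode("utf-8")[:byte_index].decode("utf-8", errors="ignore")


word = "известными"
print(len(word), len(word.encode("utf-8")))  # characters versus bytes
print(char_start_bytes(word))
print(split_by_chars(word, 2))               # index 2 counted in characters
print(prefix_by_bytes(word, 2))              # index 2 counted in bytes
```

Any code that mixes the two kinds of index, for example by computing break positions per character and then applying them per byte, will place breaks in the wrong spot for non-ASCII text while still working for plain English.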

vdhoeven · June 5, 2021, 9:16pm

Please retry with TeXmacs 1.99.21. With the help of Jeroen, I think that the problem should be fixed now.

jeroen · June 7, 2021, 10:17am

@syroezhkin One more fix was needed to solve the issue. This was integrated in the latest SVN commit, which will go into the next version after 1.99.21.

panpav · June 7, 2021, 1:16pm

I see that the problem with Russian language is much more subtle than just hyphenation. For example, if I open a new document, set the document language as Russian, and try to type almost any formula like:

<\equation*>
<gamma><frak-A><sqrt|>
</equation*>

Then instead of the formula I see:

[Image: texmacs_russian]

If I change from the focus bar the font from Cyrillic to Roman, then everything is fine.

I suppose that all this strange behavior is related to the special support of Cyrillic input methods in TeXmacs and some internal conversions related to that. Maybe it is better to turn this support off by default and treat Russian as any other language like French or Spanish? Nowadays almost everyone uses Unicode to write papers. I’m not sure whether this support is very relevant now.

jeroen · June 7, 2021, 1:47pm

Thanks for reporting this @panpav. I can confirm this. On the terminal there is an error message missing 'rm-cyrillic' master. Somehow when Russian or Ukrainian are selected, a non-existent font rm-cyrillic is selected. I’ve had a quick look at the source, but can’t immediately figure out where it happens. I’ll have a closer look at it later.

I’ve filed a bug report:
https://savannah.gnu.org/bugs/index.php?60745

panpav · June 23, 2021, 12:29pm

In the newest version 2.1, I still see the problem with math symbols and the Russian language. However in the corresponding bug report https://savannah.gnu.org/bugs/index.php?60745 this bug is considered as fixed.

At the same time, I see that the Russian hyphenation works as expected. Thanks for fixing that!

panpav · June 24, 2021, 10:18am

I added a comment in this bug report
https://savannah.gnu.org/bugs/index.php?60745

that the problem with formulas in Russian documents is still not fixed in TeXmacs 2.1. Do I need to submit a new bug report? It seems that the original one is now considered fixed.