Yep! Love Manjaro for that.
Hyphenation doesn't work with Russian language (only)
So, I was just able to test the latest version, 1.99.20. Hyphenation now works, but it does not follow the rules of the Russian language. If you look closely, you will notice that the break points seem to be shifted one letter to the right (as if an array index were increased by one).
| TeXmacs | Should be |
|---|---|
| рек-онструкции | ре-конструкции |
| изв-естными | из-вестными |
| паралл-ельного | парал-лельного |
| перс-пективы | пер-спективы |
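The "shifted by one letter" observation can be checked mechanically on the four pairs above. This is only a small sketch; the helper names are made up for illustration and it looks at the first break of each word only.

```python
# Pairs from the comparison above: (what TeXmacs produced, what Russian rules expect).
PAIRS = [
    ("рек-онструкции", "ре-конструкции"),
    ("изв-естными", "из-вестными"),
    ("паралл-ельного", "парал-лельного"),
    ("перс-пективы", "пер-спективы"),
]


def break_position(hyphenated):
    """Return the number of letters before the first hyphen."""
    return hyphenated.index("-")


def shifts(pairs):
    """For each pair, how many letters too far to the right the actual break is."""
    return [break_position(got) - break_position(expected) for got, expected in pairs]


print(shifts(PAIRS))  # [1, 1, 1, 1]
```

Every break is exactly one letter late, which is consistent with an off-by-one index rather than with wrong patterns.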
But if I specify the hyphenation manually in Format / Hyphenate as…, then everything works correctly.
I also checked whether the words from the exception list in the hyphen.english file are handled correctly. Interestingly, these words are hyphenated correctly, but after a while the program just crashed.
This one, for example, I cannot explain by looking at the hyphenation patterns.
I find из1в2 and no other rule that would stop the hyphenation there. I also find -в8е8 and 8в8е-, which I think apply only when the word has a hyphen of its own, and nothing else that would break ве.
I am interpreting the patterns above as “hyphenate as з-в”.
Let us see if anyone else posts some insight.
The crash is a bug I think, worth reporting.
Edit: read this sentence now
What I wrote seems to me compatible with the idea.
Yes, something odd is going on.
I’ve turned on some debugging output in src/System/Language/hyphenate.cpp to see the applied hyphenation rules, the resulting penalties and the end result.
The applied rules:
.известными.
ы => ы1
м => 1м
тн => 2т1н
ны => 1ны
изв => из1в2
зве => з1ве
вес => 1вес
ест => е1ст
стн => 2стн
Penalties:
.известными. --> [ 100000000, 100000000, 10000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000 ]
The end result:
Hyphen известными, 2
Yields изв-, естными
Edit: the penalties indicate a split after the letter в.
Considering the rules shown and the algorithm explained here, it seems that the result should be и0з1в2е2с2т1н0ы1м0и, which gives the feasible hyphenation из-вест-ны-ми. So, something is definitely wrong in the penalties.
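The tally above can be reproduced with a few lines of Python. This is a minimal sketch of Liang-style pattern matching, assuming that only the nine rules from the debugging output apply and ignoring any minimum prefix or suffix length; the function names are made up for illustration and this is not the TeXmacs code.

```python
# The nine patterns reported as applied to ".известными." in the debug output.
PATTERNS = ["ы1", "1м", "2т1н", "1ны", "из1в2", "з1ве", "1вес", "е1ст", "2стн"]


def parse_pattern(pattern):
    """Split a pattern such as 'из1в2' into its letters and per-gap digits."""
    letters = ""
    digits = [0]  # digits[k] is the value in the gap before letters[k]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters += ch
            digits.append(0)
    return letters, digits


def liang_levels(word, patterns):
    """Return the maximum pattern value for each gap between adjacent letters."""
    text = "." + word + "."
    levels = [0] * (len(text) + 1)  # levels[i] is the gap before text[i]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = text.find(letters)
        while start != -1:
            for k, value in enumerate(digits):
                levels[start + k] = max(levels[start + k], value)
            start = text.find(letters, start + 1)
    # Keep only the gaps strictly inside the word.
    return levels[2:len(text) - 1]


def hyphenate(word, patterns):
    """Insert a hyphen at every gap whose level is odd."""
    pieces, last = [], 0
    for gap, level in enumerate(liang_levels(word, patterns)):
        if level % 2 == 1:
            pieces.append(word[last:gap + 1])
            last = gap + 1
    pieces.append(word[last:])
    return "-".join(pieces)


print(liang_levels("известными", PATTERNS))  # [0, 1, 2, 2, 2, 1, 0, 1, 0]
print(hyphenate("известными", PATTERNS))     # из-вест-ны-ми
```

The odd values sit in the gaps з|в, т|н and ы|м, so the first allowed break is after the second letter, not after the third as in the penalties printed by TeXmacs.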
So: we should file a bug report. I propose that either you or @syroezhkin does that. This is a bit “I propose ;-)”, but you have the original version of the debugging output, and @syroezhkin has the original version of the hyphenated text.
I have found the problem and fixed it, but in the process I’ve discovered more problems with hyphenating UTF-8 encoded strings that are a bit trickier to fix.
I’ll see if I can fix both problems; if not, I’ll file the first fix and report the second issue.
I’ve submitted a patch that fixes all the issues I’ve found, but unfortunately it seems to make the editor kind of sluggish in long paragraphs in Russian. Let’s hope Joris (or someone else who’s a better programmer than me) can find a more efficient solution.
There should not be utf8 encoded strings in TeXmacs documents, no? As far as I understand all the strings in TeXmacs are encoded via the universal TeXmacs encoding. Could this be a nice topic for the next hacker’s meeting?
Yes, the input string to be hyphenated is transformed to utf8 with cork_to_utf8 at the start of the get_hyphens routine. I don’t know why it was done this way and not the other way around, transforming the hyphenation tables to cork when they are loaded, as is done for other languages. It’d be an interesting topic to discuss.
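One reason indexing gets tricky once the string is UTF-8: each Cyrillic letter occupies two bytes, so byte offsets and letter positions no longer coincide. The sketch below only illustrates that mismatch. It is not the code from the patch, and the helper names are made up for the example.

```python
def char_start_offsets(text):
    """Byte offsets at which each character of text starts in its UTF-8 encoding."""
    data = text.encode("utf-8")
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte starts a character.
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


def break_to_byte_offset(text, letters_before_break):
    """Convert 'break after N letters' into a byte offset in the UTF-8 encoding."""
    return len(text[:letters_before_break].encode("utf-8"))


word = "известными"
print(len(word), len(word.encode("utf-8")))  # 10 20
print(char_start_offsets(word)[:4])          # [0, 2, 4, 6]
print(break_to_byte_offset(word, 2))         # 4
```

A penalty array sized by letters and a string walked by bytes (or the other way around) will disagree for Russian while still agreeing for plain ASCII words, which may be why the issue only showed up with Cyrillic text.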
Please retry with TeXmacs 1.99.21. With the help of Jeroen, I think that the problem should be fixed now.
@syroezhkin One more fix was needed to solve the issue. This was integrated in the latest SVN commit, which will go into the next version after 1.99.21.
I see that the problem with Russian language is much more subtle than just hyphenation. For example, if I open a new document, set the document language as Russian, and try to type almost any formula like:
<\equation*>
<gamma><frak-A><sqrt|>
</equation*>
Then instead of the formula I see:
If I change from the focus bar the font from Cyrillic to Roman, then everything is fine.
I suppose that all this strange behavior is related to the special support of Cyrillic input methods in TeXmacs and some internal conversions related to that. Maybe it is better to turn this support off by default and treat Russian as any other language like French or Spanish? Nowadays almost everyone uses Unicode to write papers. I’m not sure whether this support is still relevant.
Thanks for reporting this @panpav. I can confirm this. On the terminal there is an error message: missing 'rm-cyrillic' master. Somehow, when Russian or Ukrainian is selected, a non-existent font rm-cyrillic is selected. I’ve had a quick look at the source, but can’t immediately figure out where it happens. I’ll have a closer look at it later.
I’ve filed a bug report:
https://savannah.gnu.org/bugs/index.php?60745
In the newest version, 2.1, I still see the problem with math symbols and the Russian language. However, in the corresponding bug report https://savannah.gnu.org/bugs/index.php?60745 this bug is marked as fixed.
At the same time, I see that the Russian hyphenation works as expected. Thanks for fixing that!
I added a comment to this bug report, https://savannah.gnu.org/bugs/index.php?60745, saying that the problem with formulas in Russian documents is still not fixed in TeXmacs 2.1. Do I need to submit a new bug report? It seems that it is now considered fixed.