Hyphenation doesn't work with Russian language (only)

syroezhkin · June 23, 2021, 2:59pm

I did everything described in Help → Notes for users of Cyrillic languages:

selected Russian as default language in Edit → Preferences → Language → Russian;
selected Russian for an entire document using Document → Language → Russian;
I also checked that the patterns file exists at /usr/share/TeXmacs/langs/natural/hyphen/hyphen.russian.

Despite this, hyphenation doesn’t work for Russian. If I set any other language (I tried English and German), everything works correctly.

At the same time, spell checking for Russian works.

Is this a bug? How can I fix it?

Linux
5.10.23-1-MANJARO
TeXmacs version 1.99.19

mgubi · March 24, 2021, 9:02am

This seems to be a bug. Can you provide a minimal example that reproduces the problem? That would help with debugging. You can also file a bug at https://savannah.gnu.org/bugs/?group=texmacs . In the meantime, I think you can force hyphenation by hand with Format->Hyphenate as. Does this workaround work for you?

syroezhkin · March 24, 2021, 10:56am

Yes, I will duplicate this issue in the tracker.

You can download my example from this link: https://www.dropbox.com/s/tmph2iwk74xsqiu/test.tm?dl=1

I tried to force hyphenation by hand as you suggested, but nothing changed.

syroezhkin · March 24, 2021, 12:20pm

Link to the bug report: https://savannah.gnu.org/bugs/index.php?60284

jeroen · March 24, 2021, 2:07pm

Welcome (Добро пожаловать), @syroezhkin!

For me the hyphenation in the hyphen.russian seems to work:
[Screenshot from 2021-03-24 14-05-20]

The file doesn’t contain many words, though. Perhaps you can try adding some words to it and see if they get hyphenated.

“Hyphenate as” doesn’t work for me either; that’s probably a bug.

pireddag · March 24, 2021, 2:34pm

I do not understand how hyphenation works, but the entries should be patterns, not words, so the list might be sufficient for a large number of words.

syroezhkin · March 24, 2021, 3:02pm

You are right! The word “наоборот” works for me. But none of the other words from my text are in the hyphen.russian file. I added a few new words there and now everything works. So the problem is that the dictionary is too small.

So I need to extend the dictionary. Hence the question: where can I find documentation on this file format, or on how to generate such a file myself?

jeroen · March 24, 2021, 3:07pm

@pireddag made a very valid comment. The first part of the hyphenation file contains patterns that should be used to hyphenate most common words, and the words at the end of the list should be exceptions. So adding words there would be a hack, but it may at least serve as a workaround for now.
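For readers unfamiliar with TeX-style patterns, here is a minimal sketch of how they are usually applied (Liang's scheme): digits inside a pattern are weights for the gap they sit in, a dot marks a word boundary, the highest weight wins in each gap, and an odd result allows a break while an even one forbids it. The function names are hypothetical and this is not the TeXmacs implementation; it also ignores minimum fragment lengths and the exception list.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split a pattern such as ".аб1р" into its letters (".абр") and the
// digit weights between them. weights[k] is the weight before letters[k].
static void parse_pattern(const std::u32string& pat, std::u32string& letters,
                          std::vector<int>& weights) {
  letters.clear();
  weights.assign(1, 0);
  for (char32_t c : pat) {
    if (c >= U'0' && c <= U'9') {
      weights.back() = static_cast<int>(c - U'0');
    } else {
      letters.push_back(c);
      weights.push_back(0);
    }
  }
}

// Return every index i (0 < i < word.size()) at which a break is allowed
// between word[i-1] and word[i]. Hypothetical helper, for illustration only.
std::vector<std::size_t> hyphen_points(
    const std::u32string& word, const std::vector<std::u32string>& patterns) {
  const std::u32string w = U"." + word + U".";  // mark the word boundaries
  std::vector<int> score(w.size() + 1, 0);      // score[j]: weight before w[j]
  for (const auto& pat : patterns) {
    std::u32string letters;
    std::vector<int> weights;
    parse_pattern(pat, letters, weights);
    // Apply the pattern at every place where its letters occur in the word.
    for (std::size_t pos = w.find(letters); pos != std::u32string::npos;
         pos = w.find(letters, pos + 1)) {
      for (std::size_t k = 0; k < weights.size(); ++k)
        score[pos + k] = std::max(score[pos + k], weights[k]);
    }
  }
  std::vector<std::size_t> breaks;
  for (std::size_t i = 1; i < word.size(); ++i)
    if (score[i + 1] % 2 == 1) breaks.push_back(i);  // odd weight: break allowed
  return breaks;
}
```

With only the pattern `.аб1р`, `hyphen_points(U"абрис", ...)` returns the single index 2, that is, a break after "аб". Adding a made-up pattern `б2р` raises the weight in that gap to 2 and the break disappears.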

I have to say I know little about these patterns and how they are generated. This Stack Exchange post has some interesting background on hyphenation in LaTeX. As far as I can tell, TeXmacs uses the same patterns:

pireddag · March 24, 2021, 3:23pm

Maybe something is still not working. Out of curiosity I combined two of the patterns (I chose two that contain odd numbers) into a word and pasted it several times; TeXmacs did not hyphenate it.

jeroen · March 24, 2021, 3:37pm

Yes, the patterns don’t seem to work. I’ve copied this example to TeXmacs:

TeXmacs can’t hyphenate that text at all.

jeroen · March 24, 2021, 10:02pm

I’ve been investigating this a bit more.

The hyphenation is done in src/System/Language/hyphenate.cpp. There are some commented-out debugging statements in there that give very useful information. If you uncomment them, you can see on the terminal how the hyphenation patterns are loaded and how words are checked for their hyphenation.

So, what does this tell us? It seems that the tables are properly loaded:

TeXmacs] Loading hyphen.russian
.абр ==> .аб1р
.агро ==> .аг1ро
.ади ==> .ади2
.аи ==> .аи2
.акр ==> .ак1р

and so on.

Now, let’s enter some text in Russian. The hyphenation algorithm prints unintelligible characters on the terminal:

.������������.
.������������. --> [ 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000 ]
.������������.

and so on.

Let’s now try some words that are in the list of explicitly hyphenated words at the end of hyphen.russian:

понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
Hyphen <#43F><#43E><#43D><#430><#434><#43E><#431><#43B><#44E><#441><#44C>, 1
Yields <#43F><#43E>-, <#43D><#430><#434><#43E><#431><#43B><#44E><#441><#44C>

Ta-da: all of a sudden we have intelligible characters and a successful hyphenation!
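The `<#43F>`-style tokens in the last two log lines can be read by hand, but a small decoder makes the check easier. The sketch below assumes each token is a hexadecimal Unicode code point, which is consistent with the word shown in the log; `decode_hex_tokens` is a hypothetical helper, not a TeXmacs function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Replace every "<#HEX>" token with the UTF-8 form of that code point.
// Anything that is not a well-formed token is copied through unchanged.
std::string decode_hex_tokens(const std::string& s) {
  std::string out;
  std::size_t i = 0;
  while (i < s.size()) {
    if (s.compare(i, 2, "<#") == 0) {
      const std::size_t end = s.find('>', i + 2);
      if (end != std::string::npos && end > i + 2) {
        const std::string hex = s.substr(i + 2, end - (i + 2));
        const bool is_hex =
            hex.find_first_not_of("0123456789ABCDEFabcdef") == std::string::npos;
        if (is_hex && hex.size() <= 6) {
          append_utf8(out, static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16)));
          i = end + 1;
          continue;
        }
      }
    }
    out += s[i++];
  }
  return out;
}
```

Applied to the "Yields" line, `decode_hex_tokens("<#43F><#43E>-")` gives the UTF-8 bytes for "по-", and the second fragment decodes to "надоблюсь", so the log describes the split по-надоблюсь.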

EDIT: And we’ve got the culprit, it’s the function locase_all in the line

 s= "." * locase_all (s) * ".";

changing this to

s= "." * s * ".";

results in pattern hyphenation working:
[Screenshot from 2021-03-24 21-58-09]

So we’ll need to modify locase_all to work with Cyrillic characters.

pireddag · March 24, 2021, 10:22pm

Taking advantage of the work you did, I looked at locase_all in src/Data/String/analyze.cpp:

r[i]= (char) (((int) ((unsigned char) s[i]))+32);

At a very quick look, if the relevant table is https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024, the function needs a case for the Cyrillic alphabet.
Also, some characters come at the beginning of the uppercase block but at the end of the lowercase block.

jeroen · March 24, 2021, 10:32pm

Awesome, thanks @pireddag!

That has led me to uni_locase_char in src/Data/String/universal.cpp, which has this:

else if (code >= 0x400 && code <= 0x40F) code += 0x50;
else if (code >= 0x410 && code <= 0x42F) code += 0x20;
else if (code >= 0x460 && code <= 0x4FF) {
  if ((code & 1) == 0) code += 1;
}

It seems to work with uni_locase_all! I’ll post a patch to Savannah.
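To make the arithmetic in those branches concrete, here is a self-contained sketch that applies only the three Cyrillic ranges quoted above to UTF-32 code points. The names `cyrillic_locase_char` and `cyrillic_locase_all` are hypothetical stand-ins, not the TeXmacs functions, and the even/odd rule for 0x460 to 0x4FF is copied from the thread as is rather than checked against the full Unicode case tables.

```cpp
#include <string>

// Lowercase one code point using only the three branches quoted in the
// thread. Every other code point is returned unchanged.
char32_t cyrillic_locase_char(char32_t code) {
  if (code >= 0x400 && code <= 0x40F) code += 0x50;       // Ѐ..Џ -> ѐ..џ
  else if (code >= 0x410 && code <= 0x42F) code += 0x20;  // А..Я -> а..я
  else if (code >= 0x460 && code <= 0x4FF) {
    if ((code & 1) == 0) code += 1;  // even code points map to the next odd one
  }
  return code;
}

// Apply the per-character mapping to a whole UTF-32 string.
std::u32string cyrillic_locase_all(std::u32string s) {
  for (char32_t& c : s) c = cyrillic_locase_char(c);
  return s;
}
```

For example, `cyrillic_locase_char(0x410)` (А) returns 0x430 (а), and `cyrillic_locase_char(0x401)` (Ё) returns 0x451 (ё). The second case is the one noted earlier in the thread: the letters at the start of the uppercase block land at the end of the lowercase block, so a single fixed offset cannot cover both ranges.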

pireddag · March 24, 2021, 10:50pm

It could be that the locase_all function needs to be short and therefore cannot use uni_locase_char (I did not figure out the details of how uni_locase_char works). I do not know if it is better to take that code and paste it into locase_all or to rewrite the line

r[i]= (char) (((int) ((unsigned char) s[i]))+32);

using uni_locase_char

mgubi · March 25, 2021, 8:19am

Great work! If you have a patch that seems to work, I think you can attach it to a bug report on Savannah (for documentation) and make Joris aware of it for review and integration.

jeroen · March 25, 2021, 10:09am

There are still some problems. The cursor makes crazy jumps when moving through a hyphenated word. I’ll have to investigate further.

syroezhkin · May 7, 2021, 12:41pm

Hi!

I just noticed a commit appeared in the repository that fixes this bug. I made sure that I have version 1.99.19 and decided to check it out.

Unfortunately, hyphenation still doesn’t work. If I manually specify how to hyphenate a word, I get these incomprehensible characters (this did not happen before).

jeroen · May 7, 2021, 1:18pm

Hi (Привет)!

I’m sorry to hear it’s not working for you. I can’t reproduce this on the latest version on GitHub.

Are you certain that this commit is in the version you are using? Version 1.99.20 was released just yesterday, so if you want to make sure, you can download it from www.texmacs.org

If that doesn’t work, could you please upload an example document somewhere?

syroezhkin · May 7, 2021, 2:05pm

Oh, I didn’t notice that there is a newer version. So everything should work for me, too! Thanks!

(I have some problems running the program downloaded from the site [some trouble with fonts], so I’d rather wait for the maintainer to update it in my distribution’s repository.)

jeroen · May 7, 2021, 3:54pm

It looks like Manjaro is keeping up very well with new TeXmacs releases, so I guess you won’t have to wait too long.

https://repology.org/project/texmacs/versions