Hyphenation doesn't work with Russian language (only)

I did everything described in Help → Notes for users of Cyrillic languages:

  • selected Russian as default language in Edit → Preferences → Language → Russian;
  • selected Russian for an entire document using Document → Language → Russian;
  • I also checked that the patterns file exists at /usr/share/TeXmacs/langs/natural/hyphen/hyphen.russian.

Despite this, hyphenation doesn’t work for Russian. But if I set any other language (I tried English and German), everything works correctly.

At the same time, spell checking for Russian works.

Is it a bug? How can I fix this issue?

Linux
5.10.23-1-MANJARO
TeXmacs version 1.99.19

This seems to be a bug. Can you provide a minimal example that reproduces the problem you see? That would be helpful for debugging. You can also file a bug at https://savannah.gnu.org/bugs/?group=texmacs . In the meantime, I think there are ways to force hyphenation by hand with Format->Hyphenate as. Does this workaround work for you?

Yes, I will duplicate this issue in the tracker.

You can download my example from this link: https://www.dropbox.com/s/tmph2iwk74xsqiu/test.tm?dl=1

I tried to force hyphenation by hand like you said. But nothing has changed.

Link to the report: https://savannah.gnu.org/bugs/index.php?60284

Welcome, @syroezhkin!

For me, hyphenation using hyphen.russian seems to work:
[Screenshot from 2021-03-24 14-05-20]

The file doesn’t contain many words, though. Perhaps you can try adding some words to it and see if they get hyphenated.

Hyphenate as doesn’t work for me either; that’s probably a bug.

I do not understand how hyphenation works, but the entries should be patterns, not words, so the list might be sufficient for a large number of words.

You are right! The word “наоборот” works for me. But no other words from my text are in the hyphen.russian file. I added a few new words there and now everything works. So the problem is that the dictionary is too small.

So I need to extend the dictionary. Hence the question: where can I find documentation on this file format, or on how to generate such a file myself?

@pireddag made a very valid comment. The first part of the hyphenation file contains patterns that should be used to hyphenate most common words, and the words at the end of the list should be exceptions. So adding words to that list would be a hack, but at least it may serve as a workaround for now.

I have to say I know little about these patterns and how they are generated. This stackexchange has some interesting background on hyphenation in LaTeX. As far as I can tell, TeXmacs uses the same patterns:

Maybe something is still not working. Out of curiosity I combined two of the patterns (I chose two that contain odd numbers) into a word and pasted it several times; TeXmacs did not hyphenate it.

Yes, the patterns don’t seem to work. I’ve copied this example to TeXmacs:

TeXmacs can’t hyphenate that text at all.

I’ve been investigating this a bit more.

The hyphenation is done in src/System/Language/hyphenate.cpp. There are some commented out debugging statements in there that give some very useful information. If you uncomment those you can see on the terminal how the hyphenation patterns are loaded and how words are checked for their hyphenation.

So, what does this tell us? It seems that the tables are properly loaded:

TeXmacs] Loading hyphen.russian
.абр ==> .аб1р
.агро ==> .аг1ро
.ади ==> .ади2
.аи ==> .аи2
.акр ==> .ак1р

and so on.
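For readers unfamiliar with the format: each line of the listing above maps the letters of a pattern to the pattern itself. In TeX-style (Liang) patterns, a digit between two letters is a priority; when several patterns match a word, the highest digit at each position wins, and an odd result permits a break there. Below is a minimal sketch of that lookup, not the code in hyphenate.cpp. The names `Pattern`, `parse_pattern` and `break_points` are made up for illustration, and the sketch ignores minimum prefix/suffix lengths and the exception list.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper type: a pattern split into its letters (the lookup key)
// and the digit found before each letter position.
struct Pattern {
  std::u32string key;       // pattern with the digits removed, e.g. U".абр"
  std::vector<int> levels;  // levels[i] = digit before key[i]; size key.size()+1
};

Pattern parse_pattern(const std::u32string& p) {
  Pattern out;
  out.levels.push_back(0);
  for (char32_t c : p) {
    if (c >= U'0' && c <= U'9') {
      out.levels.back() = static_cast<int>(c - U'0');
    } else {
      out.key.push_back(c);
      out.levels.push_back(0);
    }
  }
  return out;
}

// result[i] == true means a break is allowed between word[i] and word[i+1].
std::vector<bool> break_points(
    const std::u32string& word,
    const std::map<std::u32string, std::vector<int>>& table) {
  std::u32string w = U"." + word + U".";      // dots mark the word boundaries
  std::vector<int> score(w.size() + 1, 0);    // score[j] = level before w[j]
  for (std::size_t start = 0; start < w.size(); ++start) {
    for (std::size_t len = 1; start + len <= w.size(); ++len) {
      auto it = table.find(w.substr(start, len));
      if (it == table.end()) continue;
      for (std::size_t k = 0; k < it->second.size(); ++k)
        score[start + k] = std::max(score[start + k], it->second[k]);
    }
  }
  std::vector<bool> result;
  for (std::size_t i = 0; i + 1 < word.size(); ++i)
    result.push_back(score[i + 2] % 2 == 1);  // odd level: break allowed
  return result;
}
```

With only the pattern `.аб1р` loaded, the word «абрис» gets a single permitted break, after «аб». The lookup depends on the word and the pattern keys being compared in the same encoding and the same letter case.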

Now, let’s enter some text in Russian. The hyphenation algorithm prints unintelligible characters on the terminal:

.������������.
.������������. --> [ 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000, 100000000 ]
.������������.

and so on.

Let’s now try some words that are in the list of explicitly hyphenated words at the end of hyphen.russian:

понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
понадоблюсь --> [ 100000000, 10000, 100000000, 10000, 100000000, 100000000, 10000, 100000000, 100000000, 100000000 ]
Hyphen <#43F><#43E><#43D><#430><#434><#43E><#431><#43B><#44E><#441><#44C>, 1
Yields <#43F><#43E>-, <#43D><#430><#434><#43E><#431><#43B><#44E><#441><#44C>

Ta-da, all of a sudden we have intelligible characters and a successful hyphenation!

EDIT: And we’ve got the culprit, it’s the function locase_all in the line

 s= "." * locase_all (s) * ".";

changing this to

s= "." * s * ".";

results in pattern hyphenation working:
[Screenshot from 2021-03-24 21-58-09]

So we’ll need to modify locase_all to work with Cyrillic characters.

Taking advantage of the work you did :)

In src/Data/String/analyze.cpp, locase_all has this line:

r[i]= (char) (((int) ((unsigned char) s[i]))+32);

At a very quick look, if this is the relevant table, https://www.utf8-chartable.de/unicode-utf8-table.pl?start=1024, the function needs a case for the Cyrillic alphabet.
Note also that some characters sit at the beginning of the uppercase list but at the end of the lowercase list.
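
To make the point concrete, here is a small sketch of why adding 32 to a single byte cannot lowercase Cyrillic text in general. It assumes the word is UTF-8 encoded at that point, which has not been verified in this thread; `utf8_encode` and `ascii_style_lower` are hypothetical helpers, the latter just wrapping the expression quoted above.

```cpp
#include <string>

// Encode a code point below U+0800 as UTF-8 (enough for the Cyrillic block).
std::string utf8_encode(unsigned int code) {
  std::string out;
  if (code < 0x80) {
    out.push_back(static_cast<char>(code));
  } else {
    out.push_back(static_cast<char>(0xC0 | (code >> 6)));
    out.push_back(static_cast<char>(0x80 | (code & 0x3F)));
  }
  return out;
}

// The ASCII-style rule from the quoted line: add 32 to the byte value.
char ascii_style_lower(char c) {
  return (char) (((int) ((unsigned char) c)) + 32);
}
```

For example, «Р» (U+0420) encodes as the bytes D0 A0, while «р» (U+0440) encodes as D1 80. Adding 32 to A0 gives C0, which is neither the right byte nor a valid continuation byte, and the lead byte would have to change as well.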

Awesome, thanks @pireddag!

That has led me to uni_locase_char in src/Data/String/universal.cpp, which has this:

else if (code >= 0x400 && code <= 0x40F) code += 0x50;
else if (code >= 0x410 && code <= 0x42F) code += 0x20;
else if (code >= 0x460 && code <= 0x4FF) {
  if ((code & 1) == 0) code += 1;
}

It seems to work with uni_locase_all! I’ll post a patch to Savannah.
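
For anyone who wants to try the mapping outside TeXmacs, here is a standalone sketch of the branch quoted above, operating on Unicode code points. The function name `cyrillic_locase` is made up for illustration; it is not the signature of uni_locase_char, and I have not checked every pair in the 0x460 to 0x4FF range.

```cpp
// Hypothetical standalone copy of the Cyrillic branch quoted above.
// Takes a Unicode code point and returns its lowercase counterpart,
// or the input unchanged if none of the ranges apply.
unsigned int cyrillic_locase(unsigned int code) {
  if (code >= 0x400 && code <= 0x40F) code += 0x50;       // Ѐ..Џ -> ѐ..џ
  else if (code >= 0x410 && code <= 0x42F) code += 0x20;  // А..Я -> а..я
  else if (code >= 0x460 && code <= 0x4FF) {
    if ((code & 1) == 0) code += 1;  // even code point -> next (odd) one
  }
  return code;
}
```

So «П» (0x41F) maps to «п» (0x43F), «Ё» (0x401) maps to «ё» (0x451), and letters that are already lowercase, such as 0x43F, come back unchanged.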

It could be that the locase_all function needs to be short, and therefore cannot use uni_locase_char (I did not figure out the details of how uni_locase_char works). I do not know whether it is better to take that code and paste it into locase_all or to rewrite the line

r[i]= (char) (((int) ((unsigned char) s[i]))+32);

using uni_locase_char

Great work! If you have a patch that seems to work, I think you can attach it to a bug report on Savannah (for documentation) and make Joris aware of it for review and integration.

There are still some problems :( The cursor makes crazy jumps when moving through a hyphenated word. I’ll have to investigate further.