Replace glyphs using reference shapes #5

AlexeyAlexeew · 2024-03-04T02:58:16Z

AlexeyAlexeew
Mar 4, 2024

Could you extend your djvudict with the following feature:
compare each resulting shape with a "reference" shape from a separate folder and replace it. I would find and prepare BMPs of the book's main alphabets and type sizes in advance, and the program would pick the new shape, using a configurable match level/coefficient, and replace the old one with it, both in the dictionary and in the sjbz chunk. As a first step, the refinement shapes (the ones stored as deviations) could even be left untouched.
I don't expect any variability or guesswork; there is no need to hunt for the "и/н effect" ("инь-эффект") in order to replace erroneous "н" with "и". What I want is an "improver" for poorly binarized books, where the book as a whole is readable but the letters don't look very nice after binarization. It is possible, though, that a letter broken into pieces will have several shapes rather than one. In that case I'm stuck.
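
To make the request above concrete, here is a minimal sketch of matching a shape against reference bitmaps with a configurable threshold. It is not djvudict code: the `Bitmap` type, the function names, and the choice of intersection-over-union as the similarity measure are all assumptions made for illustration, and bitmaps are assumed to be non-empty, rectangular, and aligned at the top-left corner.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (hypothetical representation)


def pad_to(bitmap: Bitmap, height: int, width: int) -> Bitmap:
    """Pad a bitmap with background pixels (right and bottom) to the given size."""
    padded = [row + [0] * (width - len(row)) for row in bitmap]
    padded += [[0] * width for _ in range(height - len(bitmap))]
    return padded


def similarity(a: Bitmap, b: Bitmap) -> float:
    """Intersection-over-union of the ink pixels, in the range 0.0..1.0."""
    height = max(len(a), len(b))
    width = max(len(a[0]), len(b[0]))
    a, b = pad_to(a, height, width), pad_to(b, height, width)
    both = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    either = sum(pa | pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return both / either if either else 1.0


def best_reference(shape: Bitmap, references: Dict[str, Bitmap],
                   threshold: float) -> Optional[Tuple[str, float]]:
    """Return (name, score) of the best reference at or above threshold, else None."""
    best = None
    for name, ref in references.items():
        score = similarity(shape, ref)
        if score >= threshold and (best is None or score > best[1]):
            best = (name, score)
    return best
```

A shape that matches no reference at the chosen threshold yields `None` and would simply be left as it is. A real matcher would also have to deal with alignment and size differences, which this sketch ignores.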

trufanov-nok · 2024-03-08T08:53:32Z

trufanov-nok
Mar 8, 2024
Maintainer

Right now I really have no time for DjVu-related projects. I can't even get around to putting together and publishing the djvu module for stu, not even as an alpha version. And for the coming year I won't be able to take on serious tasks in these projects.
As for editing dictionaries, I did think about it at some point... I no longer remember well what exactly I thought back then, so this will be somewhat unstructured.
IIRC, my view was that nothing should be replaced automatically. The symbol classifiers in DjVu encoders are tuned to preserve the look and feel of the input image. If the image is poorly binarized, that should properly be fixed at the binarization stage. Replacing a poorly binarized shape with a well-binarized one is no longer a matter of replacing like with like; it means replacing a shape with something that differs from it more than the existing classifier expects. So one would have to play with weights or comparison methods to build a new classifier, and I have already had my fill of that in the minidjvu-mod project and am not entirely happy with the result. That is why I am against automation in this sense: a manual editor would be simpler and more reliable. Then all that djvudict would need to do is replace a dictionary symbol, specified by id, with one supplied by the user. That is a task more within my reach.

And it would be worth checking on a small example whether pages edited this way look too clumsy... because I suspect they might. A dictionary covering 20 pages may contain 10 poorly binarized variants of the same letter "A", and replacing them all with a single reference shape may end up being conspicuous. For example, the page geometry may have drifted (or there may even be a defect in the original printing), so that all letters near the page edge are 2 pixels larger than the letters near the spine. The position of letters within a line will start to wander, and the replacements will stand out.
In general, if you are building a reference dictionary of symbols, you could simply make a font out of them, OCR the document, and reassemble everything into doc/pdf with that font. In the limit that is where this is heading, but the result will no longer be the original document.
And only if such editing turns out to be worthwhile, then:

For the GUI one could use the Polish djview4shapes. It is a fork of djview4 that was used to study the typefaces of medieval books. It can display the dictionary glyphs for the current page, highlight them, and so on. But its code base is outdated even compared with current djview; the fork is old. I helped them with it a little. So first its state would have to be brought up to date. But not this year. Let's get to the next one and then we'll see.

P.S. And if anyone does use this, it will be a person and a half once a year... In short, I look at this without enthusiasm and with a lot of skepticism.

0 replies

jsbien · 2024-03-08T09:22:21Z

jsbien
Mar 8, 2024

I understand Russian a little, but I'm afraid I don't fully understand your discussion.
I made a trivial extension to djview4shapes, I can publish it at https://github.com/jsbien/djview4shapes: when you open a glyph in djview4, the glyph coordinates are written to the log. You can convert it easily into a djview4poliqarp index.
In recent months I have used djvudict intensively; some of the outputs are here: https://github.com/jsbien/early_fonts_inventory. The interpretation of its results is not always clear to me, and I'm afraid it may have some bugs.
exportshapes (the klf-uw branch) is a tool somewhat similar to djvudict. It was never used, but I intend to look into it.
At https://szukajwslownikach.uw.edu.pl/slownik-lindego/ you can find an example where overly aggressive compression replaced all "1" with "l" (or vice versa, I don't remember).
BTW, it seems all the DjVu patents have expired by now. I have a (hopefully complete) list of them, which I can publish somewhere.

9 replies

jsbien Apr 5, 2024

Great! I tried to make a fork as I intend to introduce some changes (https://github.com/jsbien/export_djvu_shapes) but it seems something went wrong (my knowledge of git and GitHub is rudimentary). I will check it later.
What I am trying to do is to replace the export to database (https://github.com/jsbien/ndt/wiki/z_shapes) by export in the form of djview4poliqarp index. Unfortunately I have to learn a lot to make some reasonable progress :-(

rmast Apr 5, 2024

I created a new branch revealshapes, stripping off the database-part, just revealing the coordinates on the screen. I just want to extend python-djvulibre with getting the coordinates of the blits and comparable shapes on the original image, so I need a template to start from...

jsbien Apr 6, 2024

Please log the revealed coordinates to a file, I will convert them to djview4poliqarp index :-)

BTW, issues are not enabled in your repository. I suggest enabling both issues and discussions.

rmast Apr 6, 2024

Looking at this example: https://github.com/jsbien/early_fonts_inventory/blob/main/font_tables/indexes/Augezdecki-01/Augezdecki-01.csv you also need the OCR interpretation of the glyph found. I only intend to output x 47 y 925 h 64 w 73 per line, plus an index number for the matched shape, without distinguishing between the local and the main dictionary. Will you then be able to link the rest?

jsbien Apr 6, 2024

My main problem with djvudict was the identification of the page number. Can you provide it?

AlexeyAlexeew · 2024-03-08T09:41:38Z

AlexeyAlexeew
Mar 8, 2024
Author

When a single letter is replaced, it stands out. When many are replaced, the pages start to look much nicer.
On the whole I understand your doubts; I thought the same myself. But Adobe's ClearScan works, and works quite decently. It performs recognition and substitutes vector shapes in such a way that the book looks like a raster one, but very well smoothed.

0 replies

jsbien · 2024-03-08T09:56:35Z

jsbien
Mar 8, 2024

https://www.reddit.com/r/Acrobat/comments/1avsvf5/i_miss_and_need_the_clearscan_feature_it_hasnt/
https://community.adobe.com/t5/acrobat-discussions/clearscan-no-longer-available-in-acrobat-dc/m-p/7036679

0 replies

trufanov-nok · 2024-03-08T10:44:34Z

trufanov-nok
Mar 8, 2024
Maintainer

I haven't tried ClearScan; I'll need to take a look at it.

0 replies

jsbien · 2024-03-10T11:31:22Z

jsbien
Mar 10, 2024

@AlexeyAlexeew What exactly does ClearScan ("Editable text and images") do? Can you demonstrate it on one of the images at https://github.com/jsbien/early_fonts_inventory/tree/main/font_tables/PNG?

2 replies

AlexeyAlexeew Mar 10, 2024
Author

I cannot demonstrate it; I am not a specialist. The technology works as described in Adobe's documentation, and I saw good results some years ago. It creates new fonts and replaces rasterized glyphs with font symbols.

rmast Apr 4, 2024

I've seen that behavior in Acrobat Pro. I have a cheap licence till 2026.

jsbien · 2024-03-10T12:18:39Z

jsbien
Mar 10, 2024

OK, I will download the trial version some time in the future.

1 reply

AlexeyAlexeew Mar 10, 2024
Author

As far as I know, it is better to use an old version of Acrobat. A very old one.

rmast · 2024-04-05T05:40:38Z

rmast
Apr 5, 2024

You forked the wrong branch. The easiest way is to drop the fork and fork the branch master-exportshapes, as you can only fork one branch. But if you intend to learn, learning git is very helpful. I used chatbots, which led me to the idea of replacing libboost with a guard. For me, programming in C++ was mostly 30 years ago.

1 reply

jsbien Apr 5, 2024

> You forked the wrong branch

That's what I suspected.

> fork branch master-exportshapes as you can only fork one branch.

That's what I wanted to do, but I didn't know how... As a temporary solution I forked the whole repository with all the branches.

> I used chatbots

I guess I should also use them :-) Any hints on how to start?

rmast · 2024-04-05T15:50:18Z

rmast
Apr 5, 2024

https://chat.openai.com/ is now advertised not to require a login. If that's what kept you from using it that might not be a problem anymore.

0 replies

rmast · 2024-04-06T08:34:06Z

rmast
Apr 6, 2024

I have not decided on the repo yet. The big dependencies on djvulibre (even as a fork, since the internal details are not published in the regular apt library) make me consider using this repo. If the resulting coordinates are equal, the error in the resulting .bmp of this repo might be negligible.

3 replies

jsbien Apr 6, 2024

> Those big dependencies of djvulibre

In the old days Bottou made some suggestions to my colleagues. They were followed by Michał Rudolf in djview4poliqarp and djview4shapes, but others used quick-and-dirty approaches. So it definitely can be done in a more elegant and convenient way.

> If the resulting coordinates are equal the error in the resulting .bmp of this repo might be negligible.

I'm afraid I don't follow you :-(

rmast Apr 7, 2024

After reading some code I found a way to reduce the dependencies to the standard shared djvulibre library, with only a few .h files from djvulibre included, so I could wget them separately in a git action during the build. I'll recreate djvurevealshapes with minimal source. Unfortunately automake still makes a big directory of it, and I had to copy and strip lots of those configurations to get the build going.

jsbien Apr 7, 2024

Good luck!

rmast · 2024-04-06T15:49:27Z

rmast
Apr 6, 2024

Reading the code of exportshapes, I get the idea that a multipage DjVu can also have different main dictionaries with the same glyph-number ranges. Am I right? Then there should also be a main dictionary index number and, as main dictionaries can differ in size, an indicator telling whether a glyph index points into some main dictionary or into the page's individual dictionary. So the desired output line per glyph becomes:

1. Page number
2. Dictionary id
3. Glyph id
4. x
5. y
6. w
7. h

Am I right?
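
As a rough illustration of that seven-field line, here is a sketch that writes one CSV row per glyph. The record type, the field names, and the choice of CSV are assumptions, not the format of exportshapes or djvudict; the coordinates in the usage note are the ones quoted earlier in this thread, while the page, dictionary, and glyph numbers are made up.

```python
import csv
from dataclasses import astuple, dataclass, fields
from typing import Iterable, TextIO


@dataclass(frozen=True)
class GlyphRecord:
    """One glyph occurrence; field order follows the list proposed above."""
    page: int
    dictionary_id: int
    glyph_id: int
    x: int
    y: int
    w: int
    h: int


def write_records(records: Iterable[GlyphRecord], out: TextIO) -> None:
    """Write a header row followed by one CSV row per glyph record."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([f.name for f in fields(GlyphRecord)])
    for record in records:
        writer.writerow(astuple(record))
```

For example, `write_records([GlyphRecord(1, 0, 12, 47, 925, 73, 64)], sys.stdout)` (after `import sys`) would emit the header line and then `1,0,12,47,925,73,64`.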


1 reply

jsbien Apr 6, 2024

Yes, it looks OK.

rmast · 2024-04-06T15:54:18Z

rmast
Apr 6, 2024

>> If the resulting coordinates are equal the error in the resulting .bmp of this repo might be negligible.
>
> I'm afraid I don't follow you :-(

djvudict reconstructs an image in which the glyphs of one row jump up and down, so there is some wrong y calculation.
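
The thread does not establish where the y error comes from. Purely as an illustration of one conversion where such errors can arise: DjVuLibre's JB2 blits store a `left` and a `bottom` coordinate, while most raster formats count rows from the top. The sketch below assumes that `bottom` is the distance from the page bottom to the shape's lower edge; the function name is hypothetical.

```python
def blit_top(page_height: int, bottom: int, shape_height: int) -> int:
    """Top row (top-left origin) of a shape whose lower edge sits `bottom` rows above the page bottom."""
    return page_height - bottom - shape_height
```

If the shape height were left out of this conversion, glyphs of different heights on the same baseline would land at different vertical positions, which would look like the jumping described above.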

1 reply

jsbien Apr 6, 2024

I had the impression that something was wrong with the coordinates, but I was not sure :-)

rmast · 2024-04-14T21:46:07Z

rmast
Apr 14, 2024

> Could you extend your djvudict with the following feature: compare each resulting shape with a "reference" shape from a separate folder and replace it. I would find and prepare BMPs of the book's main alphabets and type sizes in advance, and the program would pick the new shape, using a configurable match level/coefficient, and replace the old one with it, both in the dictionary and in the sjbz chunk. As a first step, the refinement shapes (the ones stored as deviations) could even be left untouched. I don't expect any variability or guesswork; there is no need to hunt for the "и/н effect" ("инь-эффект") in order to replace erroneous "н" with "и". What I want is an "improver" for poorly binarized books, where the book as a whole is readable but the letters don't look very nice after binarization. It is possible, though, that a letter broken into pieces will have several shapes rather than one. In that case I'm stuck.

Glyphs within a DjVu dictionary often have duplicates, probably due to the same subpixel shift that makes characters jump up and down in the restored image. The same jumping will also occur horizontally, from left to right, and it too carries extra information about those characters. So if a book has been scanned, many shared dictionaries might hold multiple copies of each character, together with their positions on the originals, which carry a lot of extra information about the original font.

Taking the center of mass (weight point) of all those copies, which could be matched further, probably with OCR, might make it possible to enlarge them by, for example, 8 or 16 times and undo the subpixel shift using their centers of mass, as far as each center of mass is credible in the context of the glyph's original position relative to the rest of the line it was on.

I could imagine that trying to restore those characters, perhaps even with something like DualVector, might be an approach that can be automated.
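
A minimal sketch of the enlarge-and-align idea above, under stated assumptions: "weight point" is read as the centroid of the ink pixels, the copies are already known to be the same character, enlargement is plain pixel replication, and the result is a grayscale average rather than a restored outline. The `Bitmap` type and function names are hypothetical.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (hypothetical representation)


def centroid(bitmap: Bitmap) -> Tuple[float, float]:
    """Center of mass (x, y) of the ink pixels, measured at pixel centers."""
    total = sx = sy = 0
    for y, row in enumerate(bitmap):
        for x, pixel in enumerate(row):
            if pixel:
                total += 1
                sx += x
                sy += y
    if total == 0:
        raise ValueError("bitmap contains no ink")
    return sx / total + 0.5, sy / total + 0.5


def average_aligned(copies: List[Bitmap], scale: int, size: int) -> List[List[float]]:
    """Upscale each copy by `scale`, move its centroid to the canvas center, and average."""
    canvas = [[0.0] * size for _ in range(size)]
    for bitmap in copies:
        cx, cy = centroid(bitmap)
        # Offset in upscaled pixels that moves this copy's centroid to the canvas center.
        dx = round(size / 2 - cx * scale)
        dy = round(size / 2 - cy * scale)
        for y, row in enumerate(bitmap):
            for x, pixel in enumerate(row):
                if not pixel:
                    continue
                # Replicate the source pixel into a scale x scale block on the canvas.
                for v in range(scale):
                    for u in range(scale):
                        ty, tx = y * scale + v + dy, x * scale + u + dx
                        if 0 <= ty < size and 0 <= tx < size:
                            canvas[ty][tx] += 1.0 / len(copies)
    return canvas
```

Because the centroid is fractional, the alignment resolves shifts down to 1/`scale` of an original pixel. Canvas values near 1.0 are pixels on which all copies agree, while intermediate values mark the uncertain edge region.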

0 replies
