Proposal: mechanism to append documents to and augment the generated PDF

jeroen · May 3, 2023, 8:52am

This could also be a good tool to experiment with, so as to get a better understanding of the PDF format:
https://pypdf.readthedocs.io/en/stable/user/reading-pdf-annotations.html

mgubi · May 3, 2023, 11:31am

Even better would be to add bindings to our PDF Hummus library from Scheme, so that we do not need Python to experiment. It could also be a nice feature to have in TeXmacs, and one on which to implement the proposed attachment mechanism (and much more).

sharkc · May 4, 2023, 3:30am

Directly adding an entire tm doc might just be the first step. A more intricate problem is that large media files in a tm doc, such as images, cause the size of the PDF file to more than double. It’s a bit nontrivial to resolve that without a decent PDF reader.

mgubi · May 4, 2023, 8:04am

Technically it does not look like a difficult problem to me: one extracts the images from the PDF and re-embeds them in the TeXmacs file. The problem is that they might not be the original images (maybe). But it is a feature which can be opt-in. However, this requires a bit more work. PDF Hummus should have enough infrastructure to allow this.

tangdouer · May 10, 2023, 12:36pm

Recently I have repeatedly read the PDF-Writer wiki, especially the part on Extensibility (I think this is what I need to understand in depth), and carefully read the author’s related blog, which is very useful. I have learned how to create and insert a PDF object and make it work. Based on this, I am going to implement a small program that embeds a .txt file into a PDF (I’m going to start with .txt, which I think is the easiest).

I tried pymupdf to generate a PDF with an embedded .txt file, and learned that Embedded File Streams can definitely accomplish our task. So I’m trying my best to read Adobe’s PDF documentation, which is a bit difficult for me; I cannot quite grasp the main points yet.
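
For readers following along, here is a rough sketch of the PDF objects that an attachment is usually made of, as far as I understand the PDF specification: an embedded file stream holding the bytes, a file specification dictionary pointing at that stream, and an entry in the catalog's EmbeddedFiles name tree. The helper names and object numbers below are made up for illustration. This only builds the raw object text with the Python standard library; it is not how PDFHummus or pymupdf do it internally, and it does not produce a complete PDF file.

```python
import zlib


def pdf_string(text: str) -> bytes:
    """Escape text as a PDF literal string, e.g. (notes.txt)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return b"(" + escaped.encode("latin-1") + b")"


def embedded_file_object(obj_num: int, data: bytes) -> bytes:
    """Build an indirect object holding the attachment bytes (Flate-compressed)."""
    compressed = zlib.compress(data)
    header = (
        f"{obj_num} 0 obj\n"
        f"<< /Type /EmbeddedFile /Filter /FlateDecode "
        f"/Length {len(compressed)} /Params << /Size {len(data)} >> >>\n"
        "stream\n"
    ).encode("ascii")
    return header + compressed + b"\nendstream\nendobj\n"


def filespec_object(obj_num: int, filename: str, stream_obj_num: int) -> bytes:
    """Build the file specification dictionary that names the attachment."""
    return (
        f"{obj_num} 0 obj\n<< /Type /Filespec /F ".encode("ascii")
        + pdf_string(filename)
        + f" /EF << /F {stream_obj_num} 0 R >> >>\nendobj\n".encode("ascii")
    )


def names_entry(filename: str, filespec_obj_num: int) -> bytes:
    """Build the catalog entry that makes the attachment discoverable."""
    return (
        b"/Names << /EmbeddedFiles << /Names ["
        + pdf_string(filename)
        + f" {filespec_obj_num} 0 R] >> >>".encode("ascii")
    )
```

A library such as PDFHummus still has to write these objects into the file, register them in the cross-reference table and update the catalog; the sketch only shows what ends up inside.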

tangdouer · May 13, 2023, 5:05pm

I just successfully created a PDF with a .txt file attachment using PDFHummus! But this code is not perfect: it can only add a single .txt file. I think the next step should be to make it able to add other types of files, such as .tm and .jpg files, and to attach multiple files.
But before that, I want to submit my OSPP project application, because there is a deadline for submission. I have written my project application but it is not perfect, and I will improve it in the coming days. Then I will send the application to you by email. I hope you can give some suggestions for improvement, and if there is no problem, I will submit the application to the OSPP official website.

mgubi · May 13, 2023, 5:17pm

Good news. Let’s discuss via email.

mgubi · May 20, 2023, 8:50pm

Just to keep track of the discussion, here are some steps towards the goal. First, I do not think you need to understand all of TeXmacs (I don’t myself, it is a very large program and quite complex).

For the task at hand, I think the first objective is to ensure you are able to write C++ code which embeds data in a preexisting PDF file and can extract it. I think the data can be a text file, since TeXmacs stores documents in plain text files (with extension .tm). They are quite similar to HTML files but with a lightweight notation for the structures.
Once this step is done, one should proceed to integrate the code with the TeXmacs codebase, so that the relevant C++ functions can be called via the Scheme interface. For this you need to write C++ code which integrates well with the preexisting code. TeXmacs does not use the STL; it has its own data structures, see src/Kernel for details. In general, just have a look at how the code is written and try to follow the same style. We do not have a style guide. If you want, you can try to write one to guide you. The gluing with Scheme is done in src/Scheme/Glue. One should then be able to perform the same embedding/extraction steps from Scheme inside a running TeXmacs program.
The last part is to write UI code to allow the user to control the embedding/extraction procedure, and also to check for embedded TeXmacs documents in PDFs so that the extraction operation is called automatically.
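
To make the last step a bit more concrete, here is a small Python sketch of how one might decide whether a PDF carries an embedded TeXmacs document. It assumes the attachments have already been extracted into a mapping from file name to bytes (whatever PDF library is used would provide this), and it assumes that a .tm file starts with a `<TeXmacs|version>` header. The function names are hypothetical; the real check would live in C++ or Scheme.

```python
from typing import List, Mapping

# Assumption: TeXmacs documents begin with a header such as <TeXmacs|2.1.2>.
TEXMACS_MAGIC = b"<TeXmacs|"


def looks_like_texmacs(name: str, data: bytes) -> bool:
    """Return True if the attachment has a .tm name and a TeXmacs header."""
    has_tm_extension = name.lower().endswith(".tm")
    has_header = data.lstrip().startswith(TEXMACS_MAGIC)
    return has_tm_extension and has_header


def find_texmacs_attachments(attachments: Mapping[str, bytes]) -> List[str]:
    """List the attachment names that look like TeXmacs documents."""
    return [
        name
        for name, data in attachments.items()
        if looks_like_texmacs(name, data)
    ]
```

If the returned list is non-empty, the program could offer to open the embedded document instead of the PDF itself; if it is empty, the PDF is handled as before.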

tangdouer · May 28, 2023, 6:12pm

@mgubi
If I embed additional resources such as pictures into the PDF as attachments, the PDF will contain each of these resources twice. I think this is a serious waste of space. Maybe when we extract the tm file, we can take the pictures directly from the content of the PDF instead of from the attachments.

At the same time, I think it is better to integrate the code into TeXmacs only after completing both functions. Embedding the tm file into the PDF and extracting it from the PDF affect each other, so I may need to write both pieces of code at the same time.

I also think these functions can be provided to the user as tools in the toolbar instead of a separate UI. Then I can devote more energy to performance optimization after the tools are completed.

I just sent the second version of the application to your mailbox; I hope you can give some advice. If there is no problem, I can submit it to the official website of OSPP. I only have a week left.

mgubi · May 29, 2023, 11:55am

I agree this is a critical design decision. You can implement a simple solution in a first round and then think about options for a less wasteful one. It is not clear what to do. It is possible that the format of the images in the PDF is different from that in the TeXmacs file; in this case using these images will change the actual content of the TeXmacs document and it will not be possible to store it faithfully. So I guess one should leave the option to the user in a preference setting. A basic setting will embed the whole document and its data faithfully, while the other will try to optimise for space.
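
As a rough illustration of such a preference, the Python sketch below chooses which files to attach depending on a mode. Everything here is hypothetical (the `EmbedMode` names, the `Resource` record, the `plan_attachments` function); it only shows the trade-off, assuming that in the space-saving mode images are left out of the attachments and would later be recovered from the PDF page content, possibly in a different format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class EmbedMode(Enum):
    FAITHFUL = "faithful"  # attach the document and every linked resource
    COMPACT = "compact"    # skip images, which the PDF already contains


@dataclass
class Resource:
    name: str
    data: bytes
    is_image: bool


def plan_attachments(
    document: Resource, resources: List[Resource], mode: EmbedMode
) -> Dict[str, bytes]:
    """Choose what to attach to the PDF for the given preference."""
    attachments = {document.name: document.data}
    for res in resources:
        # Images are only attached when the user asks for a faithful copy.
        if mode is EmbedMode.FAITHFUL or not res.is_image:
            attachments[res.name] = res.data
    return attachments


def total_size(attachments: Dict[str, bytes]) -> int:
    """Number of bytes the attachments add before compression."""
    return sum(len(data) for data in attachments.values())
```

Comparing `total_size` for the two modes on a document with large figures gives a quick estimate of how much the faithful setting costs.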

tangdouer · May 31, 2023, 6:01pm

Yes, I plan to improve efficiency after implementing the basic functions, and to make this feature follow the user’s preferences. I have modified my application.

darcy · June 9, 2023, 8:39am

@mgubi Please complete the Mentor Review in time (before 2023/06/10):

Summer of Code Stage 4: Project Application Review

Otherwise, all students for this project will be rejected.

tangdouer · August 13, 2023, 12:59pm

@mgubi
Hi Max, I have now completed the most basic tasks of this project: embedding the tm document as an attachment when exporting to PDF, and loading the tm attachment from a PDF. The code for these two functions has been submitted to Mogan as PRs. Below are their links; can you review them and give me some suggestions for improvement?

tangdouer · August 14, 2023, 11:40am

I resubmitted [29_2]. Here is the new PR link

tangdouer · August 21, 2023, 11:51am

@mgubi Hi Max, I have completed the functions of this project:

When exporting a tm document to PDF, embed the tm document as an attachment, including the case of multiple-file tm documents.
Load the tm document embedded as an attachment in the exported PDF.

Can you review it if you have time? I know that my code definitely has some deficiencies. Here are the two PRs of this project:

darcy · September 27, 2023, 2:42pm

@mgubi Your proposal has been completed by @tangdouer

Try https://github.com/XmacsLabs/mogan/releases/tag/v1.2.0-beta14

mgubi · September 27, 2023, 9:35pm

Thanks. I’ve been quite busy and could not check progress regularly. I will look into it now.

darcy · September 28, 2023, 1:20am

https://summer-ospp.ac.cn/help/en/mentor/

Please spare some time for Evaluations – Mentor Final Term Review and PR/MR Merge during 2023/10/1~2023/10/31. If there is no evaluation from the mentor, the student will fail to complete the project.