Proposal: mechanism to append documents to and augment the generated PDF

Over the past few days I first learned about the PDF format. I didn’t read the Adobe PDF specification: it is too complicated to give me a quick understanding of PDF, though I may refer to it later. Instead I read the book PDF Explained and learned the basics of the format. I guess the best result would be to add our file as an attachment to the PDF.

Then I read https://github.com/galkahana/PDF-Writer/wiki/PDF-Embedding more carefully, but it only describes three ways to use PDF-Writer to embed all or part of a PDF into another PDF. We want to embed the tm file and other resource files into the PDF, so we may need a different approach.

The attachfile package (https://mirror.ibcp.fr/pub/CTAN/macros/latex/contrib/attachfile/attachfile.pdf) provides a tool for adding attachments in LaTeX. It is indeed useful, but I have only learned how to use it; how to build such a tool is still a mystery to me.

What is my next step? The two programs you have given seem to be my ultimate goal, and reading them seems to be of great help to me.

The important part seems to be at the end of page 14, the definition of the \atfi@insert@file@annot macro. It says that the PDF format supports a FileAttachment annotation. This should be explained in the PDF specification (it is a specification, not a book: it is meant to be referred to, not read as a whole). I guess the pdfattach.c command uses the same mechanism; it would be interesting to know.

As for poppler, it seems that the work is done here:
void Catalog::addEmbeddedFile(GooFile *file, const std::string &fileName)
in
https://fossies.org/linux/poppler/poppler/Catalog.cc

Some more info on EmbeddedFiles is here:

https://pymupdf.readthedocs.io/en/latest/app2.html

which says that the mechanism is explained in chapter “7.11.4 Embedded File Streams”, p. 103 of the Adobe PDF Reference.

I haven’t looked at these documents in more depth.
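To make the embedded-file mechanism concrete, here is a minimal sketch in plain Python (standard library only) that writes a one-page PDF whose catalog lists one attachment through the EmbeddedFiles name tree. It is not how PDF Hummus, poppler or pymupdf do it; the function names, the fixed object numbering, the uncompressed stream and the restriction to ASCII file names are all assumptions made for illustration.

```python
def pdf_string(text: str) -> bytes:
    """Encode ASCII text as a PDF literal string, escaping backslash and parentheses."""
    raw = text.encode("ascii")  # assumption: ASCII-only attachment names
    for ch in (b"\\", b"(", b")"):
        raw = raw.replace(ch, b"\\" + ch)
    return b"(" + raw + b")"


def build_pdf_with_attachment(name: str, data: bytes) -> bytes:
    """Return a one-page PDF that carries `data` as an embedded file called `name`."""
    key = pdf_string(name)
    objects = [
        # 1: catalog, with the EmbeddedFiles name tree mapping name -> file specification
        b"<< /Type /Catalog /Pages 2 0 R "
        b"/Names << /EmbeddedFiles << /Names [" + key + b" 4 0 R] >> >> >>",
        # 2: page tree
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: a single empty page
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        # 4: file specification, pointing at the embedded file stream
        b"<< /Type /Filespec /F " + key + b" /EF << /F 5 0 R >> >>",
        # 5: embedded file stream, stored uncompressed
        b"<< /Type /EmbeddedFile /Length %d /Params << /Size %d >> >>\nstream\n"
        % (len(data), len(data)) + data + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "num 0 obj", needed by the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0 (always free)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)


if __name__ == "__main__":
    with open("with-attachment.pdf", "wb") as fh:
        fh.write(build_pdf_with_attachment("notes.txt", b"hello from an attachment\n"))
```

Opening the resulting file in a viewer with an attachments panel should list notes.txt, although that has not been checked against every reader.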

This could also be a good tool to experiment with, so as to get a better understanding of the PDF format:
https://pypdf.readthedocs.io/en/stable/user/reading-pdf-annotations.html

It would be even better if we could add Scheme bindings to our PDF Hummus library, so that we do not need Python to experiment. They would also be a nice feature to have in TeXmacs, and a base on which to implement the proposed attachment mechanism (and much more).

Directly adding an entire tm doc might just be the first step. A more intricate problem is that large media files in a tm doc, such as images, make the PDF more than double in size. It’s a bit nontrivial to resolve that without a decent PDF reader.

Technically this does not look like a difficult problem to me: one extracts the images from the PDF and re-embeds them in the TeXmacs file. The problem is that they might not be the original images. But this can be an opt-in feature. It does require a bit more work, though. PDF Hummus should have enough infrastructure to allow this.

Recently I have repeatedly read the PDF-Writer wiki, especially the parts on Extensibility (I think this is what I need to understand in depth), and carefully read the author’s related blog posts, which are very useful. I have learned how to create and insert a PDF object and make it work. Based on this I am going to implement a small program that embeds a .txt file into a PDF (I’m starting with .txt, which I think is the easiest).

I used pymupdf to generate a PDF with an embedded .txt file and learned that Embedded File Streams can definitely accomplish our task, so I’m trying my best to read the Adobe PDF specification. It is a bit difficult for me and I cannot quite grasp the main points yet.

I just successfully created a PDF with PDFHummus that has a .txt file attachment! The code is not perfect yet: it can only add a single .txt file. I think the next step should be to make it handle other file types, such as .tm and .jpg files, and to attach multiple files.
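Two details come up when going from one .txt file to several files of different types: the keys in the EmbeddedFiles name tree have to be sorted, and the optional /Subtype of each embedded file stream is a MIME type written as a PDF name, where "/" becomes "#2F". The sketch below only builds those two text fragments; the helper names are hypothetical, the suffix table is an assumption (in particular, treating .tm as text/plain), and it is plain Python rather than PDFHummus code.

```python
import os.path

# Assumption: a tiny hand-written table; .tm is treated as plain text here.
MIME_BY_SUFFIX = {
    ".txt": "text/plain",
    ".tm": "text/plain",
    ".jpg": "image/jpeg",
}


def pdf_name(text: str) -> str:
    """Write `text` as a PDF name, hex-escaping every byte outside a safe ASCII set."""
    out = "/"
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if byte < 128 and (ch.isalnum() or ch in "-_.+"):
            out += ch
        else:
            out += f"#{byte:02X}"  # e.g. "/" becomes "#2F"
    return out


def embedded_file_subtype(filename: str) -> str:
    """Return the /Subtype value for an embedded file stream, based on its suffix."""
    suffix = os.path.splitext(filename)[1].lower()
    return pdf_name(MIME_BY_SUFFIX.get(suffix, "application/octet-stream"))


def pdf_string(text: str) -> str:
    """Write `text` as a PDF literal string, escaping backslash and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def names_array(filespec_refs: dict[str, int]) -> str:
    """Build the /Names array of the EmbeddedFiles name tree.

    `filespec_refs` maps each attachment name to the object number of its
    file specification dictionary. Keys are emitted in sorted byte order.
    """
    ordered = sorted(filespec_refs, key=lambda name: name.encode("utf-8"))
    items = [f"{pdf_string(name)} {filespec_refs[name]} 0 R" for name in ordered]
    return "[" + " ".join(items) + "]"


if __name__ == "__main__":
    refs = {"notes.txt": 7, "figure.jpg": 5, "article.tm": 9}
    print(names_array(refs))
    for name in refs:
        print(name, embedded_file_subtype(name))
```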
But before that, I want to submit my OSPP project application, because there is a submission deadline. I have written the application but it is not perfect. I will improve it in the next few days and then send it to you by email. I hope you can suggest improvements, and if there is no problem I will submit the application on the official OSPP website.

Good news. Let’s discuss via email.

Just to keep track of the discussion, here are some steps towards the goal. First, I do not think you need to understand all of TeXmacs (I don’t myself; it is a very large and quite complex program).

  1. For the task at hand I think the first objective is to make sure you are able to write C++ code which embeds data in a preexisting PDF file and can extract it again. I think the data can be a text file, since TeXmacs stores documents in plain text files (with extension .tm). They are quite similar to HTML files but with a lightweight notation for the structures.

  2. Once this step is done, one should proceed to integrate the code with the TeXmacs codebase, so that the relevant C++ functions can be called via the Scheme interface. For this you need to write C++ code which integrates well with the preexisting code. TeXmacs does not use the STL; it has its own data structures, see src/Kernel for details. In general, just have a look at how the code is written and try to follow the same style. We do not have a style guide. If you want, you can try to write one to guide you. The gluing with Scheme is done in src/Scheme/Glue. One should then be able to perform the same embedding/extraction steps from Scheme inside a running TeXmacs program.

  3. The last part is to write UI code to allow the user to control the embedding/extraction procedure and also to check for embedded TeXmacs documents in PDFs so that the extraction operation is called automatically.
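The automatic check in step 3 can be prototyped before any C++ is written. The sketch below is a deliberately naive byte scan in Python: it looks for file specification dictionaries and reads their /F name. It is an assumption-laden experiment, not the TeXmacs implementation: it only sees objects stored uncompressed (not those inside object streams), expects /F to appear before any nested dictionary such as /EF, ignores octal escapes, and the function names are made up here. A real implementation would go through a PDF parser such as PDF Hummus.

```python
import re

# Matches "/Type /Filespec ... /F (name)" within one dictionary, as long as
# /F comes before any ">" (so before a nested /EF << ... >> dictionary).
FILESPEC_NAME = re.compile(
    rb"/Type\s*/Filespec\b[^>]*?/F\s*\(((?:\\.|[^\\)])*)\)", re.S
)


def attached_file_names(pdf_bytes: bytes) -> list[str]:
    """Return the /F names of all file specifications found by a plain byte scan."""
    names = []
    for match in FILESPEC_NAME.finditer(pdf_bytes):
        # Drop the backslash of simple escapes such as \( and \)
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1), flags=re.S)
        names.append(raw.decode("latin-1"))
    return names


def has_texmacs_attachment(pdf_bytes: bytes) -> bool:
    """True if at least one attachment name ends in .tm (case-insensitive)."""
    return any(name.lower().endswith(".tm") for name in attached_file_names(pdf_bytes))


if __name__ == "__main__":
    sample = (
        b"4 0 obj\n"
        b"<< /Type /Filespec /F (paper.tm) /EF << /F 5 0 R >> >>\n"
        b"endobj\n"
    )
    print(attached_file_names(sample))
    print(has_texmacs_attachment(sample))
```

A check like this would decide whether opening a PDF should offer to load the embedded document instead of importing the PDF itself.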

@mgubi
If I embed additional resources such as pictures into the PDF as attachments, the PDF will contain each of these resources twice. I think this is a serious waste of space. Maybe, when we extract the tm file, we can recover the pictures directly from the content of the PDF instead of from its attachments.

At the same time, I think it is better to integrate the code into TeXmacs only after completing both functions. Embedding the tm file into the PDF and extracting it from the PDF affect each other, so I may need to write both pieces of code at the same time.

And I think these functions can be provided to the user in the form of tools in the toolbar instead of the UI, then I may devote more energy to performance optimization after the tools are completed.

I just sent the second version of the application to your mailbox and hope you can give me some advice. If there is no problem, I can submit it on the official OSPP website. I only have a week left.

I agree this is a critical design decision. You can implement a simple solution in a first round and then think about options for a less wasteful one. It is not clear what the right approach is. It is possible that the format of the images in the PDF differs from that in the TeXmacs file; in that case, using these images would change the actual content of the TeXmacs document and it would not be possible to store it faithfully. So I guess one should leave the choice to the user in a preference setting. A basic setting will embed the whole document and its data faithfully, while the other will try to optimise for space.
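As a rough illustration of such a preference, the sketch below picks which files get attached depending on a two-valued setting. Everything in it is hypothetical (the mode names, the Resource record and the in_pdf_content flag); it only shows the shape of the decision, written in Python for quick experimentation.

```python
from dataclasses import dataclass
from enum import Enum


class EmbedMode(Enum):
    FAITHFUL = "faithful"  # attach the document and all linked data unchanged
    COMPACT = "compact"    # skip data that the PDF page content already carries


@dataclass
class Resource:
    name: str
    size: int              # size in bytes
    in_pdf_content: bool   # True if the PDF pages already contain it (e.g. an image)


def select_attachments(main_doc: Resource, linked: list[Resource],
                       mode: EmbedMode) -> list[Resource]:
    """Return the files to attach; the main document is attached in both modes."""
    chosen = [main_doc]
    for res in linked:
        if mode is EmbedMode.COMPACT and res.in_pdf_content:
            continue  # would be stored twice; may not round-trip faithfully
        chosen.append(res)
    return chosen


if __name__ == "__main__":
    doc = Resource("article.tm", 12_000, in_pdf_content=False)
    linked = [
        Resource("plot.jpg", 900_000, in_pdf_content=True),
        Resource("data.csv", 4_000, in_pdf_content=False),
    ]
    for mode in EmbedMode:
        picked = select_attachments(doc, linked, mode)
        total = sum(r.size for r in picked)
        print(mode.value, [r.name for r in picked], total)
```

In the compact mode the skipped images would have to be recovered from the PDF content on extraction, which is exactly where the result may differ from the originals.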

Yes, I plan to improve efficiency after implementing the basic functions, and the behaviour should follow the user’s preferences. I have modified my application accordingly.

@mgubi Please complete the Mentor Review in time (before 2023/06/10):

Summer of Code Stage 4: Project Application Review

Otherwise, all students for this project will be rejected.

@mgubi
Hi Max, I have now completed the most basic tasks of this project: embedding tm attachments when exporting a PDF, and loading the tm attachment from a PDF. The code for these two functions has been submitted to mogan as PRs. Below are their links; can you review them and give me some suggestions for improvement?


I retweeted [29_2]. Here is the new pr link

@mgubi Hi Max, I have completed the functions of this project:

  1. When exporting a tm document to a PDF, embed the tm document as an attachment, including the case of multiple-file tm documents.
  2. When loading an exported PDF, load the tm document embedded as an attachment.

Can you review it if you have time? I know that my code certainly has some deficiencies. Here are the two PRs for this project:


@mgubi Your proposal has been completed by @tangdouer

Try https://github.com/XmacsLabs/mogan/releases/tag/v1.2.0-beta14

Thanks. I’ve been quite busy and could not check progress regularly. I will look into it now.

https://summer-ospp.ac.cn/help/en/mentor/

Please set aside some time for Evaluations – Mentor Final Term Review and PR/MR Merge during 2023/10/1~2023/10/31. If there is no evaluation from the mentor, the student will fail to complete the project.