Proposal: mechanism to append documents to and augment the generated PDF

mgubi · April 8, 2023, 9:52am

What

Develop and implement a protocol to include whole TeXmacs documents within the exported PDF, together with mechanisms to retrieve the document by opening the PDF in TeXmacs.
Develop optional augmentation of the generated PDF by adding popups for link previews, and possibly create style files for generating more interactive PDFs (e.g. show answers to questions, popups with arbitrary information, …). See this tutorial to see how this might be implemented in the PDF format:

Why

In some circumstances it is useful to transmit both a PDF version of a document and the associated TeXmacs file (possibly with all the needed support files like custom styles and images). This allows usual PDF viewers to open the file and at the same time lets TeXmacs users retrieve an editable version of the document for modification.
It is useful to augment the PDF with additional information to improve the user experience on the variety of platforms which do not give access to the original TeXmacs documents.

How

Write code to include a copy of the document and the associated information in the generated PDF. Investigate possible protocols to allow this in a way that is transparent to standard PDF viewers.
Write code which allows TeXmacs to probe for the presence of a TeXmacs document attached to a PDF file and, if one is present, open and modify it.
Write code which inserts interactive features in the PDF, like popups (for previews or for additional content), better tables of contents, etc.
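
The probing step could start from something like the following Python sketch. It is only an illustration: the function names are hypothetical, and it assumes the file specification dictionaries are stored uncompressed in the PDF body. Real files may keep them inside compressed object streams, in which case a proper PDF parser is needed.

```python
import re

# Matches the /F(name) entry of a file specification dictionary, e.g.
# <</EF<</F 47 0 R>>/F(doc.tm)/UF(doc.tm)/Type/Filespec>>.
# Assumption: names contain no parentheses or backslash escapes.
FILESPEC_NAME = re.compile(rb"/F\s*\(([^()\\]*)\)")


def attached_names(pdf_bytes: bytes) -> list[str]:
    """Return the /F names found, provided the file has a /Filespec at all."""
    if not re.search(rb"/Type\s*/Filespec", pdf_bytes):
        return []
    return [m.group(1).decode("latin-1")
            for m in FILESPEC_NAME.finditer(pdf_bytes)]


def has_texmacs_source(pdf_bytes: bytes) -> bool:
    """Heuristic probe: is there an attachment whose name ends in .tm?"""
    return any(name.endswith(".tm") for name in attached_names(pdf_bytes))
```

A heuristic like this is only good enough to decide whether to offer "open the embedded source"; the actual extraction should go through a PDF library.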

sharkc · April 6, 2023, 5:17am

Useful: https://github.com/ArtifexSoftware/mupdf/blob/233926048d7647c8daed493c4372656db9f9b2b5/source/pdf/pdf-link.c#L255

mgubi · April 6, 2023, 8:33am

Yes, this seems pertinent. I was wondering if we can have a mechanism which does not require polluting the document with links targeting the sources. We could also think of adding some more meta-information to the PDF file, like an interface (a PDF form) to record metadata explicitly and maybe a modification log, some more information about the bibliography, preview links (like in the UI), and LaTeX translations of the formulas to allow correct copy/paste in the PDF.

sharkc · April 6, 2023, 4:36pm

if we can have a mechanism which does not require polluting the document with links targeting the sources.

It is truly embedding instead of linking. I just confirmed that using the Python binding.

exa7.tex embedded in a PDF file looks like this:

45 0 obj
<</Type/Catalog/Pages 11 0 R/Names<</EmbeddedFiles<</Names[(exa7.tex)<</CI<<>>/EF<</F 47 0 R>>/F(exa7.tex)/UF(exa7.tex)/Desc(exa7.tex)/Type/Filespec>>]>>>>/PageMode/UseAttachments>>
endobj
...
47 0 obj
<</Length 365/Filter/FlateDecode/DL 600/Params<</Size 600/CreationDate(D:20230406121241-04'00')/ModDate(D:20230406121241-04'00')>>/Type/EmbeddedFile>>
stream
(some binary data)
endstream
endobj
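
For illustration, the two objects in this dump can be produced by hand. The Python sketch below is not how MuPDF does it: the function name and the default object number are assumptions, and it omits the optional /Desc, /CI and date entries. It only shows how /Length, /DL and /Params /Size relate to the compressed and uncompressed data.

```python
import zlib


def embedded_file_objects(name: str, data: bytes,
                          stream_num: int = 47) -> tuple[bytes, bytes]:
    """Serialize an EmbeddedFile stream object and the Filespec pointing at it.

    Assumption: `name` is Latin-1 and has no parentheses or backslashes.
    """
    packed = zlib.compress(data)  # /Filter /FlateDecode is zlib/deflate
    stream_obj = (
        b"%d 0 obj\n" % stream_num
        # /Length is the compressed size; /DL and /Size the original size.
        + b"<</Type/EmbeddedFile/Filter/FlateDecode/Length %d/DL %d"
          b"/Params<</Size %d>>>>\n" % (len(packed), len(data), len(data))
        + b"stream\n" + packed + b"\nendstream\nendobj\n"
    )
    pdf_name = name.encode("latin-1")
    filespec = b"<</Type/Filespec/F(%s)/UF(%s)/EF<</F %d 0 R>>>>" % (
        pdf_name, pdf_name, stream_num)
    return stream_obj, filespec
```

The Filespec dictionary still has to be registered under /Names /EmbeddedFiles in the catalog, as object 45 does above, and the cross-reference table updated; a PDF library normally takes care of both.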

Preview is the job of the PDF reader; how can that happen at the PDF file level?

mgubi · April 6, 2023, 4:49pm

Yes, this seems to be what we need. Well spotted.

Preview is the job of the PDF reader; how can that happen at the PDF file level?

I imagine you refer to my comment about previews of hyperlinks and references. I think PDF has some interactive capabilities, see e.g.: https://www.youtube.com/watch?v=nxj-V31w2Aw&t=202s

sharkc · April 7, 2023, 6:31pm

I think a better way to allow previews in PDF is to distribute/embed a PDF reader with TeXmacs. That won’t be too hard once your work on mupdf is done. Moreover, it gives us another marketing feature. I’ve seen at least two such readers in this tier.

sioyek - mupdf+preview: https://sioyek.info/
eaf-pdf-viewer - mupdf+PyQt+XEmbed in an emacs buffer: https://github.com/emacs-eaf/eaf-pdf-viewer

In contrast, an animation formula view is way more complicated, if not impossible, since a formula is not simple text. Yet without PDF reader support, our work means nothing.

mgubi · April 8, 2023, 9:47am

If you already can run TeXmacs, then there is no need to view the generated PDF (well you might be interested, but this is another matter). Augmenting PDF with additional information (like previews, metadata, etc…) is only useful when you cannot run TeXmacs. PDF viewers are ubiquitous so my idea was to leverage them to improve user experience.

In contrast, animation formula view is way more complicated, if not impossible, since formula is not simple text.

I’m not sure what you mean. The PDF format is not about text: PDF objects can be general graphical content, so I do not see any problem representing formulas. Modern PDF is quite a complex format with many interactive capabilities, see e.g. the following tutorial:

mgubi · April 10, 2023, 8:16am

I think this could be an option, if we decide to integrate mupdf. At that point we will have more powerful access to editing and viewing PDF files, including maybe removing the need to call external GS to do conversions. But for the moment it is unclear to me what a good design for a tighter integration with PDF should be.

sharkc · April 10, 2023, 9:12pm

I’m not opposed to improving user experience. I just think that’s fancy nitty-gritty that isn’t worth our time, considering there’s a much simpler way to do it. Not to mention that few TeXmacs users pay hundreds of dollars every year for Adobe Reader.

My advice on the PDF side is to keep it as simple as possible. The only necessary components, from my point of view, are (formula/link/image) preview and a navigation panel (TOC, history ring, etc.). PDFs should only be used for reading.

tangdouer · April 17, 2023, 4:05pm

Hello Max, I am interested in this proposal and I’m ready to learn more about it. Do you have any suggestions or requests?

mgubi · April 17, 2023, 7:20pm

Hello. I guess there are various possible starting points for the basic goal of providing a way to attach a full document to its generated PDF file:

Learn about the PDF format (see the Adobe PDF specification) and identify possible mechanisms to embed information which can then be extracted. Maybe write a proof of concept. As a side benefit, this step could produce a generic command-line tool to do this job, independent of TeXmacs but depending on some external PDF library (e.g. MuPDF is quite good). If it is good enough, one could aim to standardise this so that other viewers can implement the protocol and allow the extraction of the embedded documents.
Design and implement a tool which allows embedding/extracting the information (e.g. in C++). The embedding can possibly be done with the PDF library that TeXmacs already has (PDF Hummus); I’m not sure about the extraction part. In any case, we do not want to import a full new library for this.
Propose a way to integrate the above in TeXmacs in a user-friendly way. Note that a TeXmacs document can be composed of multiple files; in this case you might want to allow embedding all the required files (images, parts, user-defined styles).
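
For the multi-file case, one option (an assumption for illustration, not something decided in this thread) is to pack the main .tm file and its dependencies into a single archive with a small manifest, and embed that archive as one attachment. A minimal Python sketch, with hypothetical function names and a hypothetical manifest.json layout:

```python
import io
import json
import zipfile


def bundle_document(main_name: str, files: dict[str, bytes]) -> bytes:
    """Pack a document and its support files into one zip blob.

    The manifest records which entry is the main .tm file. Assumption:
    no support file is itself called manifest.json.
    """
    if main_name not in files:
        raise KeyError(f"main file {main_name!r} is not among the files")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        manifest = {"main": main_name, "files": sorted(files)}
        zf.writestr("manifest.json", json.dumps(manifest))
        for name in sorted(files):
            zf.writestr(name, files[name])
    return buf.getvalue()


def unbundle_document(blob: bytes) -> tuple[str, dict[str, bytes]]:
    """Inverse of bundle_document: return (main file name, all files)."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        return manifest["main"], {n: zf.read(n) for n in manifest["files"]}
```

The alternative is to attach every file separately, since PDF allows several embedded files. That keeps each file visible in a viewer's attachment panel but loses the explicit record of which one is the main document.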

tangdouer · April 18, 2023, 3:12am

Thanks, I will study them in the coming days.

mgubi · April 18, 2023, 7:46am

As a starting point: the Poppler library has two utilities which perform attachment/detachment of documents, see

It might be possible to reimplement them using the PDFWriter API. We do not want to depend on Poppler for the moment.

Addendum: I’ve just found some useful notes in the PDF Hummus wiki

https://github.com/galkahana/PDF-Writer/wiki/PDF-Embedding

mgubi · April 21, 2023, 10:15am

There is also this LaTeX package: https://mirror.ibcp.fr/pub/CTAN/macros/latex/contrib/attachfile/attachfile.pdf

tangdouer · April 29, 2023, 7:13am

Last week we had midterm exams, so I didn’t spend much time studying this proposal, but luckily the exams are now over.

I tried the MuPDF library. This powerful PDF library can easily embed the md source file into the generated PDF, and it can also extract it again cleanly, but as you said, we should try not to add new libraries.

I did a little research on our existing library (PDF Hummus). It took me a lot of time to set up a build environment with CMake. It seems that it can only embed PDF files into PDF files. I don’t know how to embed md files into PDF files; that is exactly what I’m trying to do, right?

Today OSPP opened student registration and project applications, and I will start writing my application for the project “Mechanism to append documents to and augment the PDF generated by GNU TeXmacs”. I may need your help during this time.

mgubi · April 29, 2023, 10:02am

Why do you talk about md? We want to embed the tm document which generated the PDF, at the very least. If the document depends on some external things (like images) then we also want to embed those. I cited some references. I think the way to proceed is to first understand how the PDF format supports these additional data, and then how to include the data in the PDF. Even if PDF Hummus does not have a ready-made function for this, it might be possible to write one with the low-level PDF interface it provides.

tangdouer · April 29, 2023, 12:48pm

Sorry, I mistakenly wrote .md instead of .tm. What you said is correct; I will learn about the PDF format in the near future.

mgubi · April 29, 2023, 2:34pm

I think it is also useful to understand what they say here

https://github.com/galkahana/PDF-Writer/wiki/PDF-Embedding

and there

https://mirror.ibcp.fr/pub/CTAN/macros/latex/contrib/attachfile/attachfile.pdf

tangdouer · May 3, 2023, 3:25am

In recent days I first learned about the PDF format. I didn’t read the Adobe PDF specification: it is too complicated and I don’t think it would help me understand PDF quickly, though I may refer to it later. Instead I read the book PDF Explained and learned the basics of the PDF format. I guess the best result would be to add our file as an attachment to the PDF.

Then I read https://github.com/galkahana/PDF-Writer/wiki/PDF-Embedding more carefully, but it only introduces three methods for using PDF-Writer to embed all or part of a PDF into another PDF, whereas we want to embed the tm file and other resource files into the PDF. Maybe there are other ways.

And this package, https://mirror.ibcp.fr/pub/CTAN/macros/latex/contrib/attachfile/attachfile.pdf, provides a tool for adding attachments in LaTeX. It is indeed a useful tool, but I only learned how to use it; how to build such a tool is still a mystery to me.

What is my next step? The two programs you mentioned seem to be my ultimate goal, and reading them would probably be of great help to me.

mgubi · May 3, 2023, 8:42am

The important part seems to be at the end of page 14, the definition of the \atfi@insert@file@annot macro. It says that the PDF format supports a FileAttachment annotation. This should be explained in the PDF specification (it is a specification, not a book: it is meant to be consulted, not read as a whole). I guess the pdfattach.c command probably uses the same mechanism; it would be interesting to know.
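
As a rough idea of what such an annotation looks like in the file, here is a Python sketch that serializes a FileAttachment annotation dictionary. The function and parameter names are hypothetical. The keys (/Subtype /FileAttachment, /FS pointing at a file specification, /Name for the icon, /Contents for the description) are the ones I believe the PDF specification defines for this annotation type, but check them against the spec before relying on this.

```python
# Icon names that, to my understanding, the PDF specification lists for
# FileAttachment annotations (PushPin being the default).
ICONS = {"Graph", "PushPin", "Paperclip", "Tag"}


def pdf_string(text: str) -> bytes:
    """Encode a literal PDF string, escaping backslashes and parentheses.

    Assumption: the text is representable in Latin-1.
    """
    raw = text.encode("latin-1")
    raw = raw.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


def file_attachment_annot(filespec_ref: int,
                          rect: tuple[float, float, float, float],
                          description: str,
                          icon: str = "PushPin") -> bytes:
    """Serialize the annotation dictionary; /FS refers to the Filespec object."""
    if icon not in ICONS:
        raise ValueError(f"unknown icon name: {icon}")
    return (
        b"<</Type/Annot/Subtype/FileAttachment"
        + b"/Rect[%g %g %g %g]" % rect
        + b"/FS %d 0 R" % filespec_ref
        + b"/Name/" + icon.encode("ascii")
        + b"/Contents" + pdf_string(description)
        + b">>"
    )
```

Unlike a document-level entry in the /EmbeddedFiles name tree, this annotation has to be listed in the /Annots array of some page and shows up as a visible icon there, which may or may not be what we want for the TeXmacs source.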

As for poppler, it seems that the work is done here:
void Catalog::addEmbeddedFile(GooFile *file, const std::string &fileName)
in
https://fossies.org/linux/poppler/poppler/Catalog.cc

Some more info on EmbeddedFiles is here:

https://pymupdf.readthedocs.io/en/latest/app2.html

which says that the mechanism is explained in chapter “7.11.4 Embedded File Streams”, p. 103 of the Adobe PDF Reference.

I haven’t checked these documents in more depth.