Extended samples related test code #277
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #277 |
|---|---|---|
| Coverage | 52.55% | 52.55% |
| Complexity | 1685 | 1685 |
| Files | 99 | 99 |
| Lines | 3048 | 3048 |
| Hits | 1602 | 1602 |
| Misses | 1446 | 1446 |
creationDate: 2025-10-06T10:48:53.000000+0200
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
  - Liposuktion bei Lipödem
$creationDate = null;
if (is_string($content->creationDate)) {
    $creationDate = new DateTimeImmutable($content->creationDate);
}

$modificationDate = null;
if (is_string($content->modificationDate)) {
    $modificationDate = new DateTimeImmutable($content->modificationDate);
}
Without these I receive the following error in my local test setup:
Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12
* Implement ASCII85Decode
* Add sample from issue PrinsFrank#153 with ASCII85Decode filter
Force-pushed from bc523a9 to b0185aa
@PrinsFrank Friendly ping: is this helpful and going in the right direction? If not, could you please give me an indication of which direction you are heading?
@k00ni Thanks again for your contribution! Sorry I didn't reply to you before, I'm a bit overwhelmed with the work that needs to be done on this project and spent some time on other things. ;)

I'm a bit torn on these changes. On one hand, this will make it easier for people to open PRs with samples. On the other hand, it slightly decreases the usability of the samples folder: after this change it can no longer be used to fully check parse differences between revisions.

I'm still not sure where I want to go with the samples. Right now I use them to validate any code changes I make with regard to text sorting based on x/y values and the resulting text output, and to validate that images are exported correctly. To do that with the test samples we have, we need a detailed "current" state. That can only be the total state, because otherwise we cannot see when, for example, slight reordering happens.

One option I have been thinking about is to automatically parse PDFs in issues that have been marked as "I agree that this file can be used for testing and have the authority to do so", and to automatically open a PR with both that file and the current output as a test file. That does require some work, especially because it needs to be safe and must not allow token extraction in GitHub Actions. What do you think?
Sorry for the late response. I will go into some detail and hope it helps you. Feel free to get back to me here or via the other channels.

Maintaining an open-source project can be overwhelming sometimes. It is always a struggle between the objectives on one hand and reality (aka professional and private life) on the other 😆 Personally, I tend to set lower goals at the beginning to get the show on the road. After everything works (for the most part), such as workflows and releases on a (somewhat) regular basis, and the repository is set up properly, the objectives get refined.
To what extent is that currently implemented? If implemented, it would definitely make releasing more reliable, because regressions could be caught earlier. But it seems very ambitious from my limited perspective.
To be brutally honest, and only with good intentions: the current objectives are too ambitious with regard to the time you have put in so far, for both the library and the samples. Quite frankly, no one cares why something wasn't done in the past; it's only about the now and the current project. If your library helps, good; if not, people improvise or use something else. If you want to offer something which people value, I would initially align the repository with the typical needs of developers:
From what I have seen so far in my own projects, you already did a good job! (1.) is covered already, because I had better results with your library than with https://github.com/smalot/pdfparser (I haven't checked https://github.com/tecnickcom/TCPDF and its successor for a while). The library might be far from complete, but that's a good foundation. Unfortunately, there were a few problems which "came back" after they were reported and fixed. So there is uncertainty whether a new release is at least "as good as" the last one. I think you were talking about this when you wrote:
The mentioned needs (2.) and (3.) depend on each other quite a bit: if I report something and it gets fixed, that's wonderful. But when I see the same problem a "few iterations" later again, I get doubts and don't trust my extracted texts anymore. Additionally, because of the vast complexity of the PDF specifications, there are many edge cases to cover. So changes behind the curtain might break code which worked before, but no one notices, because no production code covered these edge cases yet.

You should aim a bit lower and try to keep it at a certain level for quite some time. That means frequent commits with fixes/new features and a release from time to time. And some progress with the "report an issue and provide test examples" thing: implement it in a way that lets people report a bug and provide an example. The example should be at least the original PDF + some kind of reproducer. This could be a failing unit test, or two files which represent the current and the expected state, or ... It could be done in various ways. A very lightweight one could be to use this PR (at least as inspiration).

When doing it with unit tests, I can compile a bunch which outline the problem pretty clearly. They would also fail if the implementation changes in the future. Besides, I could keep the error description in the issue shorter, because the reproducer demonstrates it more clearly. Yes, it would be bound to this library, but is your plan really to provide test samples so that libraries such as https://github.com/tecnickcom/tc-lib-pdf or https://github.com/smalot/pdfparser etc. could theoretically use them to verify their output? At least for smalot/pdfparser, I have no such plans.

Besides, a simple web service could also assist when creating an issue. But it would have to be developed first and maintained afterwards. When all that works consistently, further refinements should be made. But not earlier.
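To make the "failing unit test as reproducer" idea concrete, a sketch could look like the following. The parser entry point, method names, and sample path are assumptions for illustration only, not this repository's actual API:

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

/**
 * Hypothetical reproducer a bug reporter could attach to an issue:
 * it pins down the missing text part without a long written description.
 */
final class Issue273ReproducerTest extends TestCase
{
    public function testExpectedTextPartIsExtracted(): void
    {
        // Assumed entry point; adjust to the library's real parser API
        $document = (new \PrinsFrank\PdfParser\PdfParser())
            ->parseFile(__DIR__ . '/samples/issue-273.pdf');

        // The reported bug: this snippet is missing from the extracted text
        self::assertStringContainsString('Liposuktion bei Lipödem', $document->getText());
    }
}
```

Such a test fails today, documents the expected behavior, and fails loudly again if a future change reintroduces the regression.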
@PrinsFrank do you know https://github.com/veraPDF/veraPDF-corpus? It's a corpus of PDF files provided with basically the same intent as the stuff we are talking about here. Might be worth a look.
Summary
This PR contains a few suggestions regarding the samples part of the test environment. You might have had further thoughts since we spoke last time. These changes would help me when providing "broken" sample files in the future, because the amount and kind of information I have to provide is reduced and I can pinpoint "problems" more precisely.
Feel free to adapt/remove the code provided.
Adaptations
Added textPartsExpectedSomewhereInTheExtractedText
I added `textPartsExpectedSomewhereInTheExtractedText` as a new property to contents.yml. It is either empty/absent or a non-empty list of strings which are expected to appear somewhere in the extracted text. This structure helps me to pinpoint certain text parts which are not extracted properly.

I've added the content of #273 to demonstrate the usage of this property. The related test part currently fails:
https://github.com/PrinsFrank/pdfparser/actions/runs/18876963633/job/53869189967?pr=277#step:5:21
#273 can be closed if we use its content in this PR.
contents.yml: Only check if a value is not null
I suggest only checking values if they differ from null. In many cases I have no idea what to insert into the metadata, such as the creation date or producer, because the PDF in question is not mine. Only checking when a value differs from null allows us to construct real-life cases without having to bother when this information is not available.
Once this particular online service we talked about is available, this might become obsolete :)
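The null-guarded checks could be sketched roughly as follows. The `$expected`/`$actual` names are assumptions based on the diff above, not the repository's actual test code:

```php
// Only verify metadata fields the sample file actually specifies;
// null in contents.yml means "unknown", so the check is skipped.
if ($expected->creationDate !== null) {
    self::assertEquals($expected->creationDate, $actual->creationDate);
}
if ($expected->producer !== null) {
    self::assertSame($expected->producer, $actual->producer);
}
```

This keeps real-life samples usable even when their metadata cannot be determined.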
FileInfo
I also had to "fix" the date-related part of FileInfo.php because of the `?DateTimeImmutable` constructor error quoted in the review discussion above.
Pages property made optional
I also made the pages property optional, so I don't necessarily have to input the raw content when creating such a file. Especially for PDFs with many pages this becomes cumbersome.
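Taken together, a contents.yml entry under these suggestions could be as small as this sketch (only fields mentioned in this PR are shown; the exact schema may differ):

```yaml
# Unknown metadata stays null and is skipped by the checks;
# the pages raw content can be omitted entirely.
creationDate: null
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
  - Liposuktion bei Lipödem
```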
Questions
contents.yml)?