Skip to content

Extended samples related test code#277

Open
k00ni wants to merge 5 commits intoPrinsFrank:mainfrom
k00ni:feature/pdf-samples-extended
Open

Extended samples related test code#277
k00ni wants to merge 5 commits intoPrinsFrank:mainfrom
k00ni:feature/pdf-samples-extended

Conversation

@k00ni
Copy link
Copy Markdown
Contributor

@k00ni k00ni commented Oct 28, 2025

Summary

This PR contains a few suggestions regarding the samples part of the test environment. You might had further thoughts since we spoke last time. These changes would help me when providing "broken" sample files in the future, because the amount and kind of information I have to provide was reduced and I can pinpoint "problems" more precisely.

Feel free to adapt/remove the code provided.

Adaptions

Added textPartsExpectedSomewhereInTheExtractedText

I added textPartsExpectedSomewhereInTheExtractedText as a new property to contents.yml. It is either empty/not there or a non-empty list of strings which are expected to be somewhere in the extracted text. This structure helps me to pinpoint certain text parts which are not extracted properly.

I've added the content of #273 to demonstrate the usage of this property. The related test-part fails currently:
https://github.com/PrinsFrank/pdfparser/actions/runs/18876963633/job/53869189967?pr=277#step:5:21

#273 can be closed if we use its content in this PR.

contents.yml: Only check if a value is not null

I suggest only checking values if they are different from null. In many cases I have no idea what I have to insert into the metadata, such as creation date or producer because the PDF in question is not mine. Only checking when its different from null allows us to construct real life cases but also don't have to bother, if these information are not available.

After this particular online service we talked about is available this might be obsolete :)

FileInfo

I also had to "fix" the date related part of FileInfo.php because I got the following error:

Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12

Pages property made optional

Also made pages-property optional, so I don't have to input the raw content necessarily when creating such a file. Especially for PDFs with many pages this becomes cumbersome.

Questions

  1. What is the difference between a producer and a creator of PDF (contents.yml)?

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.55%. Comparing base (5fa35ac) to head (b0185aa).

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #277   +/-   ##
=========================================
  Coverage     52.55%   52.55%           
  Complexity     1685     1685           
=========================================
  Files            99       99           
  Lines          3048     3048           
=========================================
  Hits           1602     1602           
  Misses         1446     1446           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

creationDate: 2025-10-06T10:48:53.000000+0200
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
- Liposuktion bei Lipödem
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text part is from the related issue #272

Current only ipödem is extracted.

Image

Comment on lines +33 to +42
$creationDate = null;
if (is_string($content->creationDate)) {
$creationDate = new DateTimeImmutable($content->creationDate);
}

$modificationDate = null;
if (is_string($content->modificationDate)) {
$modificationDate = new DateTimeImmutable($content->modificationDate);
}

Copy link
Copy Markdown
Contributor Author

@k00ni k00ni Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without these I receive the following error in my local test setup:

Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12

@k00ni k00ni force-pushed the feature/pdf-samples-extended branch from bc523a9 to b0185aa Compare November 7, 2025 07:35
@k00ni
Copy link
Copy Markdown
Contributor Author

k00ni commented Nov 10, 2025

@PrinsFrank Friendly ping: is that helpful and goes in the right direction? If not, could you please give me an indication of which direction you are going?

@PrinsFrank
Copy link
Copy Markdown
Owner

@k00ni Thanks again for your contribution! Sorry I didn't reply to you before, I'm a bit overwhelmed with the work that needs to be done on this project and spent some time on other things. ;)

I'm a bit torn on these changes. On one hand this will make it easier for people to open PRs with samples. On the other hand, it decreases the usability of the samples folder slightly, where after this change it cannot be used to fully check parse differences between revisions.

I'm still not sure where I want to go with the samples. Right now I use them to validate any code changes I do with regards to text sorting based on x/y values and the resulting text output, and to validate that images are correctly exported. To do that with the test samples we have, we'll need to have a detailed "current" state. That can only be the total state, because otherwise we cannot see when slight reordering happens for example.

One option I have been thinking about, is to automatically parse PDFs in issues that have been marked as "I agree that this file can be used for testing and have the authority to do so", to automatically open a PR with both that file and the current output as a test file. That does require some work, especially because it needs to be safe and not allow token extraction in Github Actions.

What do you think?

@k00ni
Copy link
Copy Markdown
Contributor Author

k00ni commented Mar 18, 2026

Sorry for the late response. I will go into some detail and hope it helps you. Feel free to get back to me here or via the other channels.


Maintaining an OS project can be overwhelming sometimes. It is always a struggle between the objectives on one hand and the reality (aka professional and private life) on the other 😆 Personally, I tend to set lower goals at the beginning to get the show on the road. After everything works (for the most part), such as workflows, releases on a (somehow) regular basis and the repository was set up properly, objectives are getting refined.

On the other hand, it decreases the usability of the samples folder slightly, where after this change it cannot be used to fully check parse differences between revisions. [...] I'm still not sure where I want to go with the samples.

To which extent is that currently implemented? If implemented would definitely make releasing more reliable, because regressions could be caught earlier. But it seems very ambitious from my limited perspective.

What do you think?

To be brutally honest and only with good intentions, the current objectives are too ambitious with regards to the time you have put in so far. For both: the library and the samples. To be quite frank, no one cares why there wasn't something done in the past, its only about now and the current project. If your library helps, good, if not, people improvise or use something else.

If you wanna offer something which people value, I would initially align the repository with the typical needs of developers:

  1. A somewhat reliable piece of software, which does what it promotes
  2. An easy way to get in touch with you when problems/questions appear
  3. Reoccurring releases with fixes of reported issues and/or new features (at least once per quarter, a few times a year)

What I have seen so far in my own projects, you already did a good job! (1.) is covered already, because I had better results with your library than https://github.com/smalot/pdfparser (didn't check https://github.com/tecnickcom/TCPDF and its successor for a while). The library might be far from complete, but that's a good basement.

Unfortunately, there were a few problems which "came back" after they were reported and fixed. So there is uncertainty if a new release is at least "as good as" the last one. I think you were talking about this when you wrote:

Right now I use them to validate any code changes I do with regards to text sorting based on x/y values and the resulting text output, and to validate that images are correctly exported. To do that with the test samples we have, we'll need to have a detailed "current" state. That can only be the total state, because otherwise we cannot see when slight reordering happens for example.

The mentioned needs (2.) and (3.) depend on each other quite a bit, because if I report something and it gets fixed, that's wonderful. But when I see the same problem "few iterations" later again, I get doubts and don't trust my extracted texts anymore. Additionally, because of the vast complexity of the PDF specifications there are many edge cases to cover too. So changes behind the curtain might break code which worked before, but no one noticed, because no production code covered these of edge cases yet.


You should aim a bit lower and try to keep it at a certain level for quite some time. That means frequent commits with fixes/new features and a release from time to time. And some progress with the "report an issue and provide test examples"-thing.

Implement it in a way to report a bug and provide an example. The example should be at least the original PDF + some kind of a reproducer. This could be a failing unit test or 2 files, which represent the current and the expected state or ...

It could be done in various ways. A very lightweight one could be use this PR (at least as inspiration). When doing it with unit tests I can compile a bunch which outline the problem pretty clearly. They would also fail if the implementation changes in the future. Besides, I could keep the error description in the issue shorter, because the reproducer can demonstrate it more clearly.

Yes, it would be bond to this library, but is your plan really to provide test samples so libraries such as https://github.com/tecnickcom/tc-lib-pdf or https://github.com/smalot/pdfparser etc. use it to verify their output, theoretically? At least for smalot/pdfparser, I don't have no such plans. Besides, a simple web service could also assist when creating an issue. But it had to be developed first and maintained afterwards.

When all that works consistently, further refinements should be made. But not earlier.

@k00ni
Copy link
Copy Markdown
Contributor Author

k00ni commented Apr 30, 2026

@PrinsFrank do you know https://github.com/veraPDF/veraPDF-corpus? Its a corpus of PDF files provided with basically the same intend as the stuff we are talking about here. Might worth a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants