Extended samples related test code #277
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

| Coverage Diff | main | #277 |
|---|---|---|
| Coverage | 52.55% | 52.55% |
| Complexity | 1685 | 1685 |
| Files | 99 | 99 |
| Lines | 3048 | 3048 |
| Hits | 1602 | 1602 |
| Misses | 1446 | 1446 |
creationDate: 2025-10-06T10:48:53.000000+0200
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
  - Liposuktion bei Lipödem
$creationDate = null;
if (is_string($content->creationDate)) {
    $creationDate = new DateTimeImmutable($content->creationDate);
}

$modificationDate = null;
if (is_string($content->modificationDate)) {
    $modificationDate = new DateTimeImmutable($content->modificationDate);
}
Without these I receive the following error in my local test setup:
Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12
* Implement ASCII85Decode
* Add sample from issue PrinsFrank#153 with ASCII85Decode filter
Force-pushed from bc523a9 to b0185aa
@PrinsFrank Friendly ping: is this helpful and going in the right direction? If not, could you please give me an indication of which direction you are heading?
@k00ni Thanks again for your contribution! Sorry I didn't reply to you before, I'm a bit overwhelmed with the work that needs to be done on this project and spent some time on other things. ;)

I'm a bit torn on these changes. On one hand, this will make it easier for people to open PRs with samples. On the other hand, it slightly decreases the usability of the samples folder: after this change it can no longer be used to fully check parse differences between revisions.

I'm still not sure where I want to go with the samples. Right now I use them to validate any code changes I make with regard to text sorting based on x/y values and the resulting text output, and to validate that images are exported correctly. To do that with the test samples we have, we need a detailed "current" state. That can only be the total state, because otherwise we cannot see when, for example, slight reordering happens.

One option I have been thinking about is to automatically parse PDFs in issues that have been marked as "I agree that this file can be used for testing and have the authority to do so", and to automatically open a PR with both that file and the current output as a test file. That does require some work, especially because it needs to be safe and must not allow token extraction in GitHub Actions. What do you think?
Sorry for the late response. I will go into some detail and hope it helps you. Feel free to get back to me here or via the other channels.

Maintaining an open-source project can be overwhelming sometimes. It is always a struggle between the objectives on one hand and reality (aka professional and private life) on the other 😆 Personally, I tend to set lower goals at the beginning to get the show on the road. After everything works (for the most part), such as workflows and releases on a (somewhat) regular basis, and the repository is set up properly, the objectives get refined.
To what extent is that currently implemented? If implemented, it would definitely make releasing more reliable, because regressions could be caught earlier. But it seems very ambitious from my limited perspective.
To be brutally honest, and only with good intentions: the current objectives are too ambitious with regard to the time you have put in so far, for both the library and the samples. Quite frankly, no one cares why something wasn't done in the past; it's only about the now and the current project. If your library helps, good; if not, people improvise or use something else. If you want to offer something which people value, I would initially align the repository with the typical needs of developers:
From what I have seen so far in my own projects, you already did a good job! (1.) is covered already, because I had better results with your library than with https://github.com/smalot/pdfparser (I haven't checked https://github.com/tecnickcom/TCPDF and its successor for a while). The library might be far from complete, but that's a good foundation. Unfortunately, there were a few problems which "came back" after they were reported and fixed. So there is uncertainty whether a new release is at least "as good as" the last one. I think you were talking about this when you wrote:
The mentioned needs (2.) and (3.) depend on each other quite a bit: if I report something and it gets fixed, that's wonderful. But when I see the same problem a "few iterations" later again, I get doubts and don't trust my extracted texts anymore. Additionally, because of the vast complexity of the PDF specifications, there are many edge cases to cover. So changes behind the curtain might break code which worked before, but no one notices, because no production code covered these edge cases yet.

You should aim a bit lower and try to keep it at a certain level for quite some time. That means frequent commits with fixes/new features and a release from time to time. And some progress with the "report an issue and provide test examples" thing: implement it in a way that lets people report a bug and provide an example. The example should be at least the original PDF + some kind of reproducer. This could be a failing unit test, or two files which represent the current and the expected state, or ... It could be done in various ways. A very lightweight one could be to use this PR (at least as inspiration).

When doing it with unit tests, I can compile a bunch which outline the problem pretty clearly. They would also fail if the implementation changes in the future. Besides, I could keep the error description in the issue shorter, because the reproducer demonstrates it more clearly. Yes, it would be bound to this library, but is your plan really to provide test samples so that libraries such as https://github.com/tecnickcom/tc-lib-pdf or https://github.com/smalot/pdfparser etc. could theoretically use them to verify their output? At least for smalot/pdfparser, I have no such plans.

Besides, a simple web service could also assist when creating an issue. But it would have to be developed first and maintained afterwards. When all that works consistently, further refinements should be made. But not earlier.
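To make the "failing unit test as reproducer" idea concrete, a sketch could look like the following. The parser entry point, method names, and sample path are assumptions for illustration only, not this repository's actual API:

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

/**
 * Hypothetical reproducer a bug reporter could attach to an issue:
 * it pins down the missing text part without a long written description.
 */
final class Issue273ReproducerTest extends TestCase
{
    public function testExpectedTextPartIsExtracted(): void
    {
        // Assumed entry point; adjust to the library's real parser API
        $document = (new \PrinsFrank\PdfParser\PdfParser())
            ->parseFile(__DIR__ . '/samples/issue-273.pdf');

        // The reported bug: this snippet is missing from the extracted text
        self::assertStringContainsString('Liposuktion bei Lipödem', $document->getText());
    }
}
```

Such a test fails today, documents the expected behavior, and fails loudly again if a future change reintroduces the regression.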
@PrinsFrank do you know https://github.com/veraPDF/veraPDF-corpus? It's a corpus of PDF files provided with basically the same intent as the stuff we are talking about here. Might be worth a look.
Summary
This PR contains a few suggestions regarding the samples part of the test environment. You might have had further thoughts since we spoke last time. These changes would help me when providing "broken" sample files in the future, because the amount and kind of information I have to provide is reduced and I can pinpoint "problems" more precisely.
Feel free to adapt/remove the code provided.
Adaptations
Added textPartsExpectedSomewhereInTheExtractedText
I added `textPartsExpectedSomewhereInTheExtractedText` as a new property to contents.yml. It is either empty/absent or a non-empty list of strings which are expected to appear somewhere in the extracted text. This structure helps me to pinpoint certain text parts which are not extracted properly.

I've added the content of #273 to demonstrate the usage of this property. The related test part currently fails:
https://github.com/PrinsFrank/pdfparser/actions/runs/18876963633/job/53869189967?pr=277#step:5:21
#273 can be closed if we use its content in this PR.
contents.yml: Only check if a value is not null
I suggest only checking values if they differ from null. In many cases I have no idea what to insert into the metadata, such as the creation date or producer, because the PDF in question is not mine. Only checking when a value differs from null allows us to construct real-life cases without having to bother when this information is not available.
Once this particular online service we talked about is available, this might become obsolete :)
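The null-guarded checks could be sketched roughly as follows. The `$expected`/`$actual` names are assumptions based on the diff above, not the repository's actual test code:

```php
// Only verify metadata fields the sample file actually specifies;
// null in contents.yml means "unknown", so the check is skipped.
if ($expected->creationDate !== null) {
    self::assertEquals($expected->creationDate, $actual->creationDate);
}
if ($expected->producer !== null) {
    self::assertSame($expected->producer, $actual->producer);
}
```

This keeps real-life samples usable even when their metadata cannot be determined.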
FileInfo
I also had to "fix" the date-related part of FileInfo.php because of the `?DateTimeImmutable` constructor error quoted in the review discussion above.
Pages property made optional
I also made the pages property optional, so I don't necessarily have to input the raw content when creating such a file. Especially for PDFs with many pages this becomes cumbersome.
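Taken together, a contents.yml entry under these suggestions could be as small as this sketch (only fields mentioned in this PR are shown; the exact schema may differ):

```yaml
# Unknown metadata stays null and is skipped by the checks;
# the pages raw content can be omitted entirely.
creationDate: null
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
  - Liposuktion bei Lipödem
```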
Questions
contents.yml)?