This repository contains the DocHighlight dataset for the paper "Towards Real-World Document Specular Highlight Removal: The DocHighlight Dataset and DocSHRNet Method", published in Pattern Recognition and Computer Vision (PRCV 2025).
DocHighlight is a large-scale, high-resolution dataset specifically designed for document specular highlight removal. The dataset comprises 2,201 rigorously aligned paired images captured under diverse real-world conditions using a polarization-based acquisition pipeline, featuring:
- Various document types: books, magazines, receipts, and graphical content
- Diverse illumination conditions: varying color temperatures, brightness levels, and lighting angles
- Multiple capture devices: different camera types to ensure diversity
- High resolution: average 2924×3672 pixels (range: 1034×737 – 3468×4624)
- Real-world highlights: pairs are manually quality-checked to provide reliable ground truth
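
Because the dataset consists of aligned image pairs, a training or evaluation script typically starts by matching each highlight image to its ground-truth counterpart. The following is a minimal sketch of that step, not the official loader. It assumes a hypothetical layout with two folders (for example `highlight/` and `gt/`) whose files share the same filename stem; the function name `pair_images` and the accepted extensions are also assumptions. Check the actual folder structure after downloading and adjust accordingly.

```python
from pathlib import Path
from typing import List, Tuple

# Extensions treated as images (an assumption; adjust to the real files).
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def pair_images(input_dir: str, target_dir: str) -> List[Tuple[Path, Path]]:
    """Match files in two folders by filename stem.

    Hypothetical layout: `input_dir` holds images with highlights and
    `target_dir` holds the highlight-free ground truth under the same stem.
    Files without a counterpart are skipped.
    """
    # Index ground-truth images by stem for constant-time lookup.
    targets = {
        p.stem: p
        for p in Path(target_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    }
    pairs = []
    for p in sorted(Path(input_dir).iterdir()):
        if p.suffix.lower() in IMAGE_EXTS and p.stem in targets:
            pairs.append((p, targets[p.stem]))
    return pairs
```

Calling `pair_images("highlight", "gt")` with these placeholder folder names returns a sorted list of `(input_path, target_path)` tuples that can be fed to any image-loading library.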
The reference implementation, DocSHRNet, including training and inference code, is available at 👉 https://github.com/shallweiwei/DocSHRNet.
The dataset is available via the following links:
- 🔒 Non-commercial use only (CC BY-NC-SA 4.0).
If this dataset is useful in your research or product, please cite our paper:
@InProceedings{xu2026dochighlight,
  author="Xu, Haowei
    and Zhang, Jiaxin
    and Cheng, Hiuyi
    and Zhang, Peirong
    and Zheng, Xuhan
    and Jin, Lianwen",
  title={{Towards Real-World Document Specular Highlight Removal: The DocHighlight Dataset and DocSHRNet Method}},
  booktitle="Pattern Recognition and Computer Vision",
  year="2026",
  publisher="Springer Nature Singapore",
  address="Singapore",
  pages="109--124",
  isbn="978-981-95-5676-2"
}