Image classification using fine-tuned CLIP - for historical document sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training / evaluation of the CLIP model, input file/directory processing, output of top-N class πŸͺ§ (category) predictions, and summarizing of predictions into a tabular format.

Versions 🏁

There are currently 4 versions of the model available for download; all of them share the same set of categories but differ in data annotations. The latest approved version, v1.1.3.7, is considered the default and can be found in the main branch of the HF 😊 hub ^1 πŸ”— (a loading sketch follows the tables below).

| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
| v1.1.3.[1,3,4,6,7] | ViT-B/16 | 38625 | 37328 | smallest and most accurate |
| v1.2.3 | ViT-B/32 | 38625 | 37328 | small and 2nd in accuracy |
| v2.1.3.1 | ViT-L/14 | 38625 | 37328 | large and not too accurate |
| v2.2.3.4 | ViT-L/14@336 | 38625 | 37328 | large and not too accurate |
| Base model | Disk space | Parameters (Millions) |
|---|---|---|
| openai/clip-vit-base-patch16 | 992 MB | 149.62 M |
| openai/clip-vit-base-patch32 | 1008 MB | 151.28 M |
| openai/clip-vit-large-patch14 | 1.5 GB | 427.62 M |
| openai/clip-vit-large-patch14-336 | 1.5 GB | 427.94 M |
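
If you just need to load a released checkpoint, a minimal sketch with the transformers library is shown below. Loading from the main branch yields the default version; the `revision` argument is assumed to select other versions, and the branch name used here is illustrative only (check the HF hub repository ^1 πŸ”— for the actual branches).

```python
from transformers import CLIPModel, CLIPProcessor

# Default (main branch) checkpoint of ufal/clip-historical-page.
model = CLIPModel.from_pretrained("ufal/clip-historical-page")
processor = CLIPProcessor.from_pretrained("ufal/clip-historical-page")

# Other versions are assumed to live on separate branches/tags of the same repo;
# "v1.2.3" is an illustrative revision name, not a confirmed branch.
other_model = CLIPModel.from_pretrained("ufal/clip-historical-page", revision="v1.2.3")
```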

Model description πŸ“‡

architecture_diagram

πŸ”² Fine-tuned model repository: UFAL's clip-historical-page ^1 πŸ”—

πŸ”³ Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 ^2 ^13 ^14 ^15 πŸ”—

The model was trained on a manually ✍️ annotated dataset of historical documents, specifically images of pages from archival paper documents scanned into digital form.

The images contain various combinations of text πŸ“„, tables πŸ“, drawings πŸ“ˆ, and photos πŸŒ„; the categories πŸͺ§ described below were formed based on those archival documents. Page examples can be found in the category_samples πŸ“ directory.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (rendered from a scanned PDF source) into one of the categories, each of which triggers its own content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, plain printed πŸ“„ text, or text structured in a tabular πŸ“ format, as well as to mark the presence of printed πŸŒ„ or drawn πŸ“ˆ graphic materials yet to be extracted from the page images.
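
A minimal classification sketch is shown below. It assumes the fine-tuned checkpoint can be used directly through transformers' CLIPModel/CLIPProcessor and that the plain category πŸͺ§ labels (see the Categories section) serve as text prompts; the released pipeline may instead use its richer category description sets.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

model = CLIPModel.from_pretrained("ufal/clip-historical-page")
processor = CLIPProcessor.from_pretrained("ufal/clip-historical-page")

image = Image.open("page.png")  # hypothetical input: one PNG page rendered from a scanned PDF
inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity of the image to each label prompt

top = logits.softmax(dim=-1).topk(3)              # top-3 category predictions
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{LABELS[idx]}: {score:.3f}")
```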

Data πŸ“œ

The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotation can be found in the LINDAT repository ^10 πŸ”—.

The provided annotation includes 5 different dataset splits for the vX.X.3 model versions, and it is recommended to average the weights of all 5 trained models to obtain a more robust model for prediction. The averaged model usually scores higher accuracy than any of its individual components, but it can also cause a drop in accuracy for the most ambiguous categories πŸͺ§ (for example, TEXT and TEXT_T samples very often look the same, and the accuracy of these problematic categories can drop below 90%, with off-diagonal errors rising above 10%, after averaging); the outcome depends mostly on the choice of base model.
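
Below is a minimal sketch of such weight averaging, assuming the 5 split-specific checkpoints are stored locally and are loadable with transformers' CLIPModel; the paths and the exact averaging recipe behind the released checkpoints are assumptions.

```python
import torch
from transformers import CLIPModel

# Hypothetical local paths to the 5 models trained on the 5 dataset splits.
checkpoints = [f"checkpoints/split_{i}" for i in range(1, 6)]
models = [CLIPModel.from_pretrained(path) for path in checkpoints]

averaged_state = {}
for key, reference in models[0].state_dict().items():
    if reference.dtype.is_floating_point:
        # Average the float weights element-wise across all 5 checkpoints.
        averaged_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    else:
        # Integer buffers (e.g. position ids) are identical across checkpoints.
        averaged_state[key] = reference

averaged_model = CLIPModel.from_pretrained(checkpoints[0])
averaged_model.load_state_dict(averaged_state)
averaged_model.save_pretrained("checkpoints/averaged")
```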

Our dataset is not split using a simple random shuffle. This is because the data contains structured and clustered distributions of page types within many categories. A random shuffle would likely result in subsets with poor representative variability.

Instead, we use a deterministic, periodic sampling method with a randomized offset. To maximize the size of the training πŸ’ͺ set, we select the development and test πŸ† subsets first. The training subset then consists of all remaining pages.

Here's the per-category πŸͺ§ procedure for selecting the development and test πŸ† sets (a minimal Python sketch follows the list):

  1. For a category of size N, compute the desired subset size k as a fixed proportion of N (the test_ratio, which was 10%).
  2. Compute a selection step S β‰ˆ N/k, which serves as the period base for the selection.
  3. For every i-th of the k selection steps, apply a random shift to the nominal position S_i, drawing an integer index from the range [S_i - S/4; S_i + S/4].
  4. Select every (approximately) S-th element from the alphabetically ordered sequence after applying the random shift.
  5. Finally, limit the selected indices to the range of the category size N.
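
A minimal Python sketch of this selection procedure, under the assumption of a fixed random seed and simple clipping at the category boundaries (details of the actual implementation may differ):

```python
import random

def select_eval_indices(category_size: int, test_ratio: float = 0.10, seed: int = 42) -> list[int]:
    """Deterministic periodic sampling with a randomized offset (steps 1-5 above)."""
    rng = random.Random(seed)
    k = max(1, round(category_size * test_ratio))  # step 1: desired subset size
    step = category_size / k                       # step 2: selection period S β‰ˆ N/k
    selected = []
    for i in range(k):
        nominal = int(i * step)                                  # nominal position S_i
        shift = rng.randint(-int(step // 4), int(step // 4))     # step 3: shift within Β±S/4
        index = min(max(nominal + shift, 0), category_size - 1)  # step 5: clip to [0, N-1]
        selected.append(index)                                   # step 4: take the ~S-th element
    return sorted(set(selected))

# Example: pick ~10% of a 1090-page category (indices into its alphabetically sorted pages).
test_indices = select_eval_indices(1090)
```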

This method produces subsets that:

  • Respect the original ordering and local clustering in the data
  • Preserve the proportional representation of each category
  • Introduce controlled randomness, so the selected samples are not strictly periodic

This ensures that our subsets cover the full chronological and structural variability of the collection, leading to a more robust and reliable model evaluation.

Training πŸ’ͺ set of the vX.X models: 14270 images

Training πŸ’ͺ set of the vX.X.3 models: 38625 images

The training subsets above are complemented by the evaluation πŸ† sets below:

Evaluation πŸ† set of the vX.X models: 1290 images

Evaluation πŸ† set of the vX.X.3 models: 4823 images

Manual ✍️ annotation was performed beforehand and took some time βŒ›. The categories πŸͺ§ tabulated below were formed from different sources of archival documents originating in the 1920-2020 time span.

| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total Pages | 11,940 | 15,482 | 15,854 | 48,499 |

The table above shows the category distribution for different model versions. The last column (Dataset 3) corresponds to the data of the latest vX.X.3 models, which actually used 14,000 pages of the TEXT category, while the other columns cover all the used samples; the data was split into 80% for training πŸ’ͺ and 10% each for the development and test πŸ† sets. The early model versions used 90% of the data for training πŸ’ͺ and the remaining 10% as both the development and test πŸ† set, due to the lack of annotated (manually classified) pages.

The disproportion of categories πŸͺ§ in both the training data and the provided category_samples πŸ“ evaluation directory is NOT intentional, but rather a result of the nature of the source data.

The specific content and language of the source data is irrelevant given the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect the drawing-detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW πŸ“).

Versions of the CLIP models are grounded on different sets of textual category descriptions, all illustrated in descriptions_comparison_graph.png πŸ“Ž, a graph containing the separate and averaged results for all category πŸͺ§ description sets.

As our experiments showed, the averaging strategy is not the best one. Moreover, the smallest model, ViT-B/16, showed the best results after fine-tuning on a specific selected category πŸͺ§ description set.

description comparison graph

Categories πŸͺ§

| Label | Description |
|---|---|
| DRAW πŸ“ˆ | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L πŸ“ˆπŸ“ | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW βœοΈπŸ“ | handwritten text organized in a tabular or form-like structure |
| LINE_P πŸ“ | printed text organized in a tabular or form-like structure |
| LINE_T πŸ“ | machine-typed text organized in a tabular or form-like structure |
| PHOTO πŸŒ„ | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L πŸŒ„πŸ“ | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT πŸ“° | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW βœοΈπŸ“„ | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P πŸ“„ | only printed text in paragraph or block form (non-tabular) |
| TEXT_T πŸ“„ | only machine-typed text in paragraph or block form (non-tabular) |

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings πŸ“ˆ OR photos πŸŒ„)
  • type of text πŸ“„ (handwritten ✏️️ OR printed OR typed OR mixed πŸ“°)
  • presence of tabular layout / forms πŸ“

The reason for such a distinction is that different types of pages require different processing pipelines, which are applied after the classification.
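
For illustration only, such routing could be expressed as a simple lookup from predicted category πŸͺ§ to a downstream pipeline name; the pipeline names below are hypothetical, and the actual content-specific pipelines are outside the scope of this repository.

```python
# Hypothetical mapping from predicted category to a downstream processing pipeline.
ROUTING = {
    "TEXT_HW": "handwritten_ocr",
    "TEXT_P":  "printed_ocr",
    "TEXT_T":  "typewritten_ocr",
    "TEXT":    "mixed_text_ocr",
    "LINE_HW": "handwritten_table_extraction",
    "LINE_P":  "printed_table_extraction",
    "LINE_T":  "typewritten_table_extraction",
    "DRAW":    "drawing_extraction",
    "DRAW_L":  "drawing_extraction_with_legend",
    "PHOTO":   "photo_extraction",
    "PHOTO_L": "photo_extraction_with_tabular_layout",
}

pipeline = ROUTING["TEXT_T"]  # e.g. route a typewritten page to the corresponding OCR pipeline
```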

Examples of pages sorted by category πŸͺ§ can be found in the category_samples πŸ“ directory, which is also available as a testing subset of the training data.

dataset_timeline.png

Results πŸ“Š

| Version | Base model | Category set | Accuracy (%) | Comment |
|---|---|---|---|---|
| v1.1.3.1 | ViT-B/16 | init | 99.1 | Very good |
| v1.1.3.2 | ViT-B/16 | details | 99.08 | |
| v1.1.3.3 | ViT-B/16 | extra | 99.12 | 2nd Best |
| v1.1.3.4 | ViT-B/16 | gemini | 99.1 | Very good |
| v1.1.3.5 | ViT-B/16 | gpt | 98.95 | |
| v1.1.3.6 | ViT-B/16 | large | 99.1 | Very good |
| v1.1.3.7 | ViT-B/16 | mid | 99.14 | Best |
| v1.1.3.8 | ViT-B/16 | min | 98.86 | |
| v1.1.3.9 | ViT-B/16 | short | 99.06 | |
| v1.1.3 | ViT-B/16 | average | 99.06 | |
| v1.2.3.1 | ViT-B/32 | init | 98.95 | |
| v1.2.3.3 | ViT-B/32 | extra | 98.92 | |
| v1.2.3.4 | ViT-B/32 | gemini | 98.94 | |
| v1.2.3.6 | ViT-B/32 | large | 98.97 | |
| v1.2.3.7 | ViT-B/32 | mid | 98.86 | |
| v1.2.3 | ViT-B/32 | average | 98.99 | Larger & good |
| v2.2.3.1 | ViT-L/14-336px | init | 98.86 | Large & OK |
| v2.2.3.3 | ViT-L/14-336px | extra | 98.59 | |
| v2.2.3.4 | ViT-L/14-336px | gemini | 98.97 | |
| v2.2.3.6 | ViT-L/14-336px | large | 98.68 | |
| v2.2.3.7 | ViT-L/14-336px | mid | 98.81 | |
| v2.2.3 | ViT-L/14-336px | average | 98.72 | |
| v2.1.3.1 | ViT-L/14 | init | 98.97 | |
| v2.1.3.3 | ViT-L/14 | extra | 98.83 | |
| v2.1.3.4 | ViT-L/14 | gemini | 98.86 | Large & OK |
| v2.1.3.6 | ViT-L/14 | large | 98.92 | |
| v2.1.3.7 | ViT-L/14 | mid | 98.9 | |
| v2.1.3 | ViT-L/14 | average | 98.81 | |

v1.1.3.1 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.3 Evaluation set's accuracy (Top-1): 99.12% πŸ†

TOP-1 confusion matrix

v1.1.3.4 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.6 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.7 Evaluation set's accuracy (Top-1): 99.14% πŸ†

TOP-1 confusion matrix

v1.2.3 Evaluation set's accuracy (Top-1): 98.99% πŸ†

TOP-1 confusion matrix

v2.1.3.1 Evaluation set's accuracy (Top-1): 98.97% πŸ†

TOP-1 confusion matrix

v2.2.3.4 Evaluation set's accuracy (Top-1): 98.97% πŸ†

TOP-1 confusion matrix

The confusion matrices provided above show on the diagonal the matches between gold and predicted categories πŸͺ§, while their off-diagonal elements show inter-class errors. From these graphs you can judge what types of mistakes to expect from the model.
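
If you want to build such a matrix for your own evaluation run, a minimal sketch with scikit-learn is given below; the toy gold/predicted labels only stand in for real results.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

# Toy gold and predicted labels standing in for a real evaluation run.
y_true = ["TEXT", "TEXT_T", "DRAW", "PHOTO", "TEXT", "LINE_T"]
y_pred = ["TEXT", "TEXT",   "DRAW", "PHOTO", "TEXT", "LINE_T"]

cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: gold, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=LABELS).plot(xticks_rotation=90)
plt.show()
```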

Image preprocessing steps πŸ‘€ (see the combined sketch after this list)
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
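
A minimal sketch of how these augmentations could be combined into a single torchvision pipeline; the exact composition and application probabilities used during training are assumptions.

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Color jitter plus random sharpness and blur, as listed above.
train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))),
    transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))),
])
```
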
Training hyperparameters πŸ‘€ (see the TrainingArguments sketch after this list)
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
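
The hyperparameters listed above map directly onto Hugging Face TrainingArguments; a minimal sketch is shown below, with output_dir as a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # placeholder output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```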

Contacts πŸ“§

For support, write to [email protected], the contact responsible for this GitHub repository ^8 πŸ”—.

Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff πŸ“Ž file.

Acknowledgements πŸ™

  • Developed by UFAL ^7 πŸ‘₯
  • Funded by ATRIUM ^4 πŸ’°
  • Shared by ATRIUM ^4 & UFAL ^7 πŸ”—
  • Model type: fine-tuned CLIP-ViT with a 224x224 ^2 πŸ”— or 336x336 ^13 ^14 πŸ”— input resolution

©️ 2022 UFAL & ATRIUM

