Image classification using fine-tuned CLIP - for historical document sorting

Goal: solve the task of sorting archive page images (for their further content-based processing)

Scope: image processing, training / evaluation of the CLIP model, input file/directory processing, output of top-N class πŸͺ§ (category) predictions, and summarizing of predictions into a tabular format.

Versions 🏁

There are currently 4 versions of the model available for download; all of them share the same set of categories but differ in data annotations. The latest approved version, v1.1.3.7, is considered the default and can be found in the main branch of the HF 😊 hub ^1 πŸ”— (a loading sketch follows the tables below).

| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large with highest resolution |
| v1.1.3.[1,3,4,6,7] | ViT-B/16 | 38625 | 37328 | smallest and most accurate |
| v1.2.3 | ViT-B/32 | 38625 | 37328 | small and 2nd in accuracy |
| v2.1.3.1 | ViT-L/14 | 38625 | 37328 | large and not too accurate |
| v2.2.3.4 | ViT-L/14@336 | 38625 | 37328 | large and not too accurate |
| Base model | Disk space | Parameters (Millions) |
|---|---|---|
| openai/clip-vit-base-patch16 | 992 MB | 149.62 M |
| openai/clip-vit-base-patch32 | 1008 MB | 151.28 M |
| openai/clip-vit-large-patch14 | 1.5 GB | 427.62 M |
| openai/clip-vit-large-patch14-336 | 1.5 GB | 427.94 M |
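
If you just need to load a released checkpoint, a minimal sketch with the transformers library is shown below. Loading from the main branch yields the default version; the `revision` argument is assumed to select other versions, and the branch name used here is illustrative only (check the HF hub repository ^1 πŸ”— for the actual branches).

```python
from transformers import CLIPModel, CLIPProcessor

# Default (main branch) checkpoint of ufal/clip-historical-page.
model = CLIPModel.from_pretrained("ufal/clip-historical-page")
processor = CLIPProcessor.from_pretrained("ufal/clip-historical-page")

# Other versions are assumed to live on separate branches/tags of the same repo;
# "v1.2.3" is an illustrative revision name, not a confirmed branch.
other_model = CLIPModel.from_pretrained("ufal/clip-historical-page", revision="v1.2.3")
```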

Model description πŸ“‡

architecture_diagram

πŸ”² Fine-tuned model repository: UFAL's clip-historical-page ^1 πŸ”—

πŸ”³ Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, clip-vit-large-patch14-336 ^2 ^13 ^14 ^15 πŸ”—

The model was trained on a manually ✍️ annotated dataset of historical documents, specifically images of pages from archival paper documents scanned into digital form.

The images contain various combinations of text πŸ“„, tables πŸ“, drawings πŸ“ˆ, and photos πŸŒ„; the categories πŸͺ§ described below were formed based on those archival documents. Page examples can be found in the category_samples πŸ“ directory.

The key use case of the provided model and data processing pipeline is to classify an input PNG image (rendered from a scanned PDF source) into one of the categories, each of which triggers its own content-specific data processing pipeline.

In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten ✏️, plain printed πŸ“„ text, or text structured in a tabular πŸ“ format, as well as to mark the presence of printed πŸŒ„ or drawn πŸ“ˆ graphic materials yet to be extracted from the page images.
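
A minimal classification sketch is shown below. It assumes the fine-tuned checkpoint can be used directly through transformers' CLIPModel/CLIPProcessor and that the plain category πŸͺ§ labels (see the Categories section) serve as text prompts; the released pipeline may instead use its richer category description sets.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

model = CLIPModel.from_pretrained("ufal/clip-historical-page")
processor = CLIPProcessor.from_pretrained("ufal/clip-historical-page")

image = Image.open("page.png")  # hypothetical input: one PNG page rendered from a scanned PDF
inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # similarity of the image to each label prompt

top = logits.softmax(dim=-1).topk(3)              # top-3 category predictions
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{LABELS[idx]}: {score:.3f}")
```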

Data πŸ“œ

The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotation can be found in the LINDAT repository ^10 πŸ”—.

The provided annotation includes 5 different dataset splits for the vX.X.3 model versions, and it is recommended to average the weights of all 5 trained models to obtain a more robust model for prediction. The averaged model usually scores higher accuracy than any of its individual components, but it can also cause a drop in accuracy for the most ambiguous categories πŸͺ§ (for example, TEXT and TEXT_T samples very often look the same, and the accuracy of these problematic categories can drop below 90%, with off-diagonal errors rising above 10%, after averaging); the outcome depends mostly on the choice of base model.
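
Below is a minimal sketch of such weight averaging, assuming the 5 split-specific checkpoints are stored locally and are loadable with transformers' CLIPModel; the paths and the exact averaging recipe behind the released checkpoints are assumptions.

```python
import torch
from transformers import CLIPModel

# Hypothetical local paths to the 5 models trained on the 5 dataset splits.
checkpoints = [f"checkpoints/split_{i}" for i in range(1, 6)]
models = [CLIPModel.from_pretrained(path) for path in checkpoints]

averaged_state = {}
for key, reference in models[0].state_dict().items():
    if reference.dtype.is_floating_point:
        # Average the float weights element-wise across all 5 checkpoints.
        averaged_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    else:
        # Integer buffers (e.g. position ids) are identical across checkpoints.
        averaged_state[key] = reference

averaged_model = CLIPModel.from_pretrained(checkpoints[0])
averaged_model.load_state_dict(averaged_state)
averaged_model.save_pretrained("checkpoints/averaged")
```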

Our dataset is not split using a simple random shuffle. This is because the data contains structured and clustered distributions of page types within many categories. A random shuffle would likely result in subsets with poor representative variability.

Instead, we use a deterministic, periodic sampling method with a randomized offset. To maximize the size of the training πŸ’ͺ set, we select the development and test πŸ† subsets first. The training subset then consists of all remaining pages.

Here's the per-category πŸͺ§ procedure for selecting the development and test πŸ† sets (a minimal Python sketch follows the list):

  1. For a category of size N, compute the desired subset size k as a fixed proportion of N (the test_ratio, which was 10%).
  2. Compute a selection step S β‰ˆ N/k, which serves as the period base for the selection.
  3. For every i-th of the k selection steps, apply a random shift to the nominal position S_i, drawing an integer index from the range [S_i - S/4; S_i + S/4].
  4. Select every (approximately) S-th element from the alphabetically ordered sequence after applying the random shift.
  5. Finally, limit the selected indices to the range of the category size N.
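
A minimal Python sketch of this selection procedure, under the assumption of a fixed random seed and simple clipping at the category boundaries (details of the actual implementation may differ):

```python
import random

def select_eval_indices(category_size: int, test_ratio: float = 0.10, seed: int = 42) -> list[int]:
    """Deterministic periodic sampling with a randomized offset (steps 1-5 above)."""
    rng = random.Random(seed)
    k = max(1, round(category_size * test_ratio))  # step 1: desired subset size
    step = category_size / k                       # step 2: selection period S β‰ˆ N/k
    selected = []
    for i in range(k):
        nominal = int(i * step)                                  # nominal position S_i
        shift = rng.randint(-int(step // 4), int(step // 4))     # step 3: shift within Β±S/4
        index = min(max(nominal + shift, 0), category_size - 1)  # step 5: clip to [0, N-1]
        selected.append(index)                                   # step 4: take the ~S-th element
    return sorted(set(selected))

# Example: pick ~10% of a 1090-page category (indices into its alphabetically sorted pages).
test_indices = select_eval_indices(1090)
```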

This method produces subsets that:

  • Respect the original ordering and local clustering in the data
  • Preserve the proportional representation of each category
  • Introduce controlled randomness, so the selected samples are not strictly periodic

This ensures that our subsets cover the full chronological and structural variability of the collection, leading to a more robust and reliable model evaluation.

Training πŸ’ͺ set of the vX.X models: 14270 images

Training πŸ’ͺ set of the vX.X.3 models: 38625 images

The training subsets above are complemented by the evaluation πŸ† sets below:

Evaluation πŸ† set of the vX.X models: 1290 images

Evaluation πŸ† set of the vX.X.3 models: 4823 images

Manual ✍️ annotation was performed beforehand and took some time βŒ›. The categories πŸͺ§ tabulated below were formed from different sources of archival documents originating in the 1920-2020 time span.

| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total Pages | 11,940 | 15,482 | 15,854 | 48,499 |

The table above shows the category distribution for different model versions. The last column (Dataset 3) corresponds to the data of the latest vX.X.3 models, which actually used 14,000 pages of the TEXT category, while the other columns cover all the used samples; the data was split into 80% for training πŸ’ͺ and 10% each for the development and test πŸ† sets. The early model versions used 90% of the data for training πŸ’ͺ and the remaining 10% as both the development and test πŸ† set, due to the lack of annotated (manually classified) pages.

The disproportion of categories πŸͺ§ in both the training data and the provided category_samples πŸ“ evaluation directory is NOT intentional, but rather a result of the nature of the source data.

The specific content and language of the source data is irrelevant given the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect the drawing-detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW πŸ“).

Versions of the CLIP models are grounded on different sets of textual category descriptions, all illustrated in descriptions_comparison_graph.png πŸ“Ž, a graph containing the separate and averaged results for all category πŸͺ§ description sets.

As our experiments showed, the averaging strategy is not the best one. Moreover, the smallest model, ViT-B/16, showed the best results after fine-tuning on a specific selected category πŸͺ§ description set.

description comparison graph

Categories πŸͺ§

| Label | Description |
|---|---|
| DRAW πŸ“ˆ | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L πŸ“ˆπŸ“ | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW βœοΈπŸ“ | handwritten text organized in a tabular or form-like structure |
| LINE_P πŸ“ | printed text organized in a tabular or form-like structure |
| LINE_T πŸ“ | machine-typed text organized in a tabular or form-like structure |
| PHOTO πŸŒ„ | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L πŸŒ„πŸ“ | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT πŸ“° | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW βœοΈπŸ“„ | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P πŸ“„ | only printed text in paragraph or block form (non-tabular) |
| TEXT_T πŸ“„ | only machine-typed text in paragraph or block form (non-tabular) |

The categories were chosen to sort the pages by the following criteria:

  • presence of graphical elements (drawings πŸ“ˆ OR photos πŸŒ„)
  • type of text πŸ“„ (handwritten ✏️️ OR printed OR typed OR mixed πŸ“°)
  • presence of tabular layout / forms πŸ“

The reason for such a distinction is that different types of pages require different processing pipelines, which are applied after the classification.
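
For illustration only, such routing could be expressed as a simple lookup from predicted category πŸͺ§ to a downstream pipeline name; the pipeline names below are hypothetical, and the actual content-specific pipelines are outside the scope of this repository.

```python
# Hypothetical mapping from predicted category to a downstream processing pipeline.
ROUTING = {
    "TEXT_HW": "handwritten_ocr",
    "TEXT_P":  "printed_ocr",
    "TEXT_T":  "typewritten_ocr",
    "TEXT":    "mixed_text_ocr",
    "LINE_HW": "handwritten_table_extraction",
    "LINE_P":  "printed_table_extraction",
    "LINE_T":  "typewritten_table_extraction",
    "DRAW":    "drawing_extraction",
    "DRAW_L":  "drawing_extraction_with_legend",
    "PHOTO":   "photo_extraction",
    "PHOTO_L": "photo_extraction_with_tabular_layout",
}

pipeline = ROUTING["TEXT_T"]  # e.g. route a typewritten page to the corresponding OCR pipeline
```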

Examples of pages sorted by category πŸͺ§ can be found in the category_samples πŸ“ directory, which is also available as a testing subset of the training data.

dataset_timeline.png

Results πŸ“Š

| Version | Base model | Category set | Accuracy (%) | Comment |
|---|---|---|---|---|
| v1.1.3.1 | ViT-B/16 | init | 99.1 | Very good |
| v1.1.3.2 | ViT-B/16 | details | 99.08 | |
| v1.1.3.3 | ViT-B/16 | extra | 99.12 | 2nd Best |
| v1.1.3.4 | ViT-B/16 | gemini | 99.1 | Very good |
| v1.1.3.5 | ViT-B/16 | gpt | 98.95 | |
| v1.1.3.6 | ViT-B/16 | large | 99.1 | Very good |
| v1.1.3.7 | ViT-B/16 | mid | 99.14 | Best |
| v1.1.3.8 | ViT-B/16 | min | 98.86 | |
| v1.1.3.9 | ViT-B/16 | short | 99.06 | |
| v1.1.3 | ViT-B/16 | average | 99.06 | |
| v1.2.3.1 | ViT-B/32 | init | 98.95 | |
| v1.2.3.3 | ViT-B/32 | extra | 98.92 | |
| v1.2.3.4 | ViT-B/32 | gemini | 98.94 | |
| v1.2.3.6 | ViT-B/32 | large | 98.97 | |
| v1.2.3.7 | ViT-B/32 | mid | 98.86 | |
| v1.2.3 | ViT-B/32 | average | 98.99 | Larger & good |
| v2.2.3.1 | ViT-L/14-336px | init | 98.86 | Large & OK |
| v2.2.3.3 | ViT-L/14-336px | extra | 98.59 | |
| v2.2.3.4 | ViT-L/14-336px | gemini | 98.97 | |
| v2.2.3.6 | ViT-L/14-336px | large | 98.68 | |
| v2.2.3.7 | ViT-L/14-336px | mid | 98.81 | |
| v2.2.3 | ViT-L/14-336px | average | 98.72 | |
| v2.1.3.1 | ViT-L/14 | init | 98.97 | |
| v2.1.3.3 | ViT-L/14 | extra | 98.83 | |
| v2.1.3.4 | ViT-L/14 | gemini | 98.86 | Large & OK |
| v2.1.3.6 | ViT-L/14 | large | 98.92 | |
| v2.1.3.7 | ViT-L/14 | mid | 98.9 | |
| v2.1.3 | ViT-L/14 | average | 98.81 | |

v1.1.3.1 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.3 Evaluation set's accuracy (Top-1): 99.12% πŸ†

TOP-1 confusion matrix

v1.1.3.4 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.6 Evaluation set's accuracy (Top-1): 99.1% πŸ†

TOP-1 confusion matrix

v1.1.3.7 Evaluation set's accuracy (Top-1): 99.14% πŸ†

TOP-1 confusion matrix

v1.2.3 Evaluation set's accuracy (Top-1): 98.99% πŸ†

TOP-1 confusion matrix

v2.1.3.1 Evaluation set's accuracy (Top-1): 98.97% πŸ†

TOP-1 confusion matrix

v2.2.3.4 Evaluation set's accuracy (Top-1): 98.97% πŸ†

TOP-1 confusion matrix

The confusion matrices provided above show on the diagonal the matches between gold and predicted categories πŸͺ§, while their off-diagonal elements show inter-class errors. From these graphs you can judge what types of mistakes to expect from the model.
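
If you want to build such a matrix for your own evaluation run, a minimal sketch with scikit-learn is given below; the toy gold/predicted labels only stand in for real results.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

LABELS = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
          "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

# Toy gold and predicted labels standing in for a real evaluation run.
y_true = ["TEXT", "TEXT_T", "DRAW", "PHOTO", "TEXT", "LINE_T"]
y_pred = ["TEXT", "TEXT",   "DRAW", "PHOTO", "TEXT", "LINE_T"]

cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: gold, columns: predicted
ConfusionMatrixDisplay(cm, display_labels=LABELS).plot(xticks_rotation=90)
plt.show()
```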

Image preprocessing steps πŸ‘€ (see the combined sketch after this list)
  • transforms.ColorJitter(brightness=0.5)
  • transforms.ColorJitter(contrast=0.5)
  • transforms.ColorJitter(saturation=0.5)
  • transforms.ColorJitter(hue=0.5)
  • transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
  • transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
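
A minimal sketch of how these augmentations could be combined into a single torchvision pipeline; the exact composition and application probabilities used during training are assumptions.

```python
import random

from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Color jitter plus random sharpness and blur, as listed above.
train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),
    transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))),
    transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))),
])
```
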
Training hyperparameters πŸ‘€ (see the TrainingArguments sketch after this list)
  • eval_strategy "epoch"
  • save_strategy "epoch"
  • learning_rate 5e-5
  • per_device_train_batch_size 8
  • per_device_eval_batch_size 8
  • num_train_epochs 3
  • warmup_ratio 0.1
  • logging_steps 10
  • load_best_model_at_end True
  • metric_for_best_model "accuracy"
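
The hyperparameters listed above map directly onto Hugging Face TrainingArguments; a minimal sketch is shown below, with output_dir as a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # placeholder output path
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```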

Contacts πŸ“§

For support, write to [email protected], the contact responsible for this GitHub repository ^8 πŸ”—.

Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff πŸ“Ž file.

Acknowledgements πŸ™

  • Developed by UFAL ^7 πŸ‘₯
  • Funded by ATRIUM ^4 πŸ’°
  • Shared by ATRIUM ^4 & UFAL ^7 πŸ”—
  • Model type: fine-tuned CLIP-ViT with a 224x224 ^2 πŸ”— or 336x336 ^13 ^14 πŸ”— input resolution

©️ 2022 UFAL & ATRIUM

