Image classification using fine-tuned CLIP for historical document sorting
Goal: sort scanned archive page images by content type, so that each page can be routed to further content-based processing.
Scope: image processing, training and evaluation of the CLIP model, input file/directory handling, output of the top-N predicted classes (categories), and summarization of the predictions into a tabular format.
Versions
There are currently 4 versions of the model available for download; all of them have the same set of categories,
but different data annotations. The latest approved version, v1.1.3.7, is considered the default and can be found in the main branch
of the HF hub ^1.
| Version | Base | Pages | PDFs | Description |
|---|---|---|---|---|
| v1.1 | ViT-B/16 | 15855 | 5730 | smallest |
| v1.2 | ViT-B/32 | 15855 | 5730 | small, with higher granularity |
| v2.1 | ViT-L/14 | 15855 | 5730 | large |
| v2.2 | ViT-L/14@336 | 15855 | 5730 | large, with the highest resolution |
| v1.1.3.[1,3,4,6,7] | ViT-B/16 | 38625 | 37328 | smallest and most accurate |
| v1.2.3 | ViT-B/32 | 38625 | 37328 | small, 2nd in accuracy |
| v2.1.3.1 | ViT-L/14 | 38625 | 37328 | large, less accurate |
| v2.2.3.4 | ViT-L/14@336 | 38625 | 37328 | large, less accurate |
| Base model | Disk space | Parameters (millions) |
|---|---|---|
| openai/clip-vit-base-patch16 | 992 MB | 149.62 M |
| openai/clip-vit-base-patch32 | 1008 MB | 151.28 M |
| openai/clip-vit-large-patch14 | 1.5 GB | 427.62 M |
| openai/clip-vit-large-patch14-336 | 1.5 GB | 427.94 M |
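The parameter counts above can be reproduced directly from the public base checkpoints; a minimal sketch, assuming the standard Hugging Face `CLIPModel` class, is:

```python
from transformers import CLIPModel

# Load one of the public OpenAI base checkpoints listed in the table
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")

# Count all parameters of the full (image + text) CLIP model
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f} M parameters")  # ~149.62 M for ViT-B/16
```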
Model description
Fine-tuned model repository: UFAL's clip-historical-page ^1
Base model repository: OpenAI's clip-vit-base-patch16, clip-vit-base-patch32, clip-vit-large-patch14, and clip-vit-large-patch14-336 ^2 ^13 ^14 ^15
The model was trained on a manually annotated dataset of historical documents, in particular images of pages from archival documents with paper sources that were scanned into digital form.
The images contain various combinations of texts, tables, drawings, and photos; the categories described below were formed based on those archival documents. Page examples can be found in the category_samples directory.
The key use case of the provided model and data processing pipeline is to classify an input PNG image, obtained from a scanned PDF paper source, into one of the categories, each of which triggers its own content-specific downstream processing pipeline.
In other words, when several APIs for different OCR subtasks are at your disposal, run this classifier first to mark the input data as machine-typed (old-style fonts), handwritten, plain printed text, or text structured in tabular format, as well as to mark the presence of printed or drawn graphic materials yet to be extracted from the page images.
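A minimal inference sketch for this use case is shown below. It assumes the fine-tuned checkpoint can be loaded with the standard Hugging Face CLIP classes; the text prompts and the input file name are illustrative placeholders, not the exact category descriptions used during fine-tuning.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the fine-tuned checkpoint (assumed compatible with the standard CLIP classes)
model = CLIPModel.from_pretrained("ufal/clip-historical-page")
processor = CLIPProcessor.from_pretrained("ufal/clip-historical-page")

categories = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
              "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]
# Illustrative prompts only; the actual category descriptions are not reproduced here
prompts = [f"a scan of a page of type {c}" for c in categories]

image = Image.open("page.png")  # hypothetical input image converted from a PDF scan
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# Print the top-3 predicted categories with their scores
top = probs.topk(3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{categories[idx]}: {score:.3f}")
```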
Data
The dataset is provided under a Public Domain license and consists of 48,499 PNG images of pages from 37,328 archival documents. The source image files and their annotation can be found in the LINDAT repository ^10.
The provided annotation includes 5 different dataset splits, corresponding to the vX.X.3 model versions, and it is recommended to average the weights of all 5 trained models to obtain a more robust
model for prediction. In some cases, such as the TEXT and TEXT_T categories whose samples very often look the same, the accuracy of those
problematic categories can drop below 90%, with off-diagonal errors rising above 10%, after the averaging of the trained models. Still, the
averaged model usually scores higher accuracy than any of its individual components, although it sometimes causes a drop in accuracy for
the most ambiguous categories; this depends mostly on the choice of base model.
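A sketch of such weight averaging, assuming the five split-specific checkpoints are available locally under hypothetical paths, could look like this:

```python
import torch
from transformers import CLIPModel

# Hypothetical local paths to the five split-specific fine-tuned checkpoints
checkpoints = [f"checkpoints/v1.1.3.{i}" for i in (1, 3, 4, 6, 7)]
models = [CLIPModel.from_pretrained(path) for path in checkpoints]

# Average every floating-point parameter tensor across the five checkpoints
base_state = models[0].state_dict()
avg_state = {}
for key, tensor in base_state.items():
    if tensor.is_floating_point():
        avg_state[key] = torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    else:
        avg_state[key] = tensor  # keep integer buffers (e.g. position ids) as-is

averaged = CLIPModel.from_pretrained(checkpoints[0])
averaged.load_state_dict(avg_state)
averaged.save_pretrained("checkpoints/v1.1.3-averaged")  # hypothetical output directory
```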
Our dataset is not split using a simple random shuffle. This is because the data contains structured and clustered distributions of page types within many categories. A random shuffle would likely result in subsets with poor representative variability.
Instead, we use a deterministic, periodic sampling method with a randomized offset. To maximize the size of the training set, we select the development and test subsets first. The training subset then consists of all remaining pages.
Here is the per-category procedure for selecting the development and test sets:
- For a category of size `N`, compute the desired subset size `k` as a fixed proportion (`test_ratio`, which was 10%) of `N`.
- Compute a selection step `S ≈ N/k`, which serves as the period base for the selection.
- Apply a random shift to `S`: an integer index in the range `[S_i - S/4; S_i + S/4]` for every i-th of the `k` steps of `S`.
- Select every `S`-th (approximately `S`-th, in fact) element from the alphabetically ordered sequence after applying the random shift.
- Finally, limit the selected indices to the range of the category size `N`.
This method produces subsets that:
- Respect the original ordering and local clustering in the data
- Preserve the proportional representation of each category
- Introduce controlled randomness, so the selected samples are not strictly periodic
This ensures that our subsets cover the full chronological and structural variability of the collection, leading to a more robust and reliable model evaluation.
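An illustrative sketch of this selection procedure (with our own variable names, not the project's actual code) is:

```python
import random

def select_eval_indices(n: int, test_ratio: float = 0.10, seed: int = 42) -> list[int]:
    """Pick ~test_ratio * n indices from an alphabetically ordered category of size n."""
    rng = random.Random(seed)
    k = max(1, int(n * test_ratio))               # desired subset size
    step = n / k                                  # selection period S ~= N / k
    selected = []
    for i in range(k):
        base = i * step                           # i-th periodic position S_i
        shift = rng.uniform(-step / 4, step / 4)  # randomized offset within [-S/4, S/4]
        idx = int(round(base + shift))
        selected.append(min(max(idx, 0), n - 1))  # clamp to the category range
    return sorted(set(selected))

# Example: pick the test subset for a category with 1,090 pages
test_idx = select_eval_indices(1090)
train_idx = [i for i in range(1090) if i not in set(test_idx)]
```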
Training set of the model: 14270 images for vX.X versions
Training set of the model: 38625 images for vX.X.3 versions
The training subsets above are complemented by the evaluation sets below:
Evaluation set: 1290 images for vX.X models
Evaluation set: 4823 images for vX.X.3 models
Manual annotation was performed beforehand and took considerable time; the categories tabulated below were formed from different sources of archival documents originating in the 1920-2020 span.
| Category | Dataset 0 | Dataset 1 | Dataset 2 | Dataset 3 |
|---|---|---|---|---|
| DRAW | 1090 (9.1%) | 1368 (8.8%) | 1472 (9.3%) | 2709 (5.6%) |
| DRAW_L | 1091 (9.1%) | 1383 (8.9%) | 1402 (8.8%) | 2921 (6.0%) |
| LINE_HW | 1055 (8.8%) | 1113 (7.2%) | 1115 (7.0%) | 2514 (5.2%) |
| LINE_P | 1092 (9.1%) | 1540 (9.9%) | 1580 (10.0%) | 2439 (5.0%) |
| LINE_T | 1098 (9.2%) | 1664 (10.7%) | 1668 (10.5%) | 9883 (20.4%) |
| PHOTO | 1081 (9.1%) | 1632 (10.5%) | 1730 (10.9%) | 2691 (5.5%) |
| PHOTO_L | 1087 (9.1%) | 1087 (7.0%) | 1088 (6.9%) | 2830 (5.8%) |
| TEXT | 1091 (9.1%) | 1587 (10.3%) | 1592 (10.0%) | 14227 (29.3%) |
| TEXT_HW | 1091 (9.1%) | 1092 (7.1%) | 1092 (6.9%) | 2008 (4.1%) |
| TEXT_P | 1083 (9.1%) | 1540 (9.9%) | 1633 (10.3%) | 2312 (4.8%) |
| TEXT_T | 1081 (9.1%) | 1476 (9.5%) | 1482 (9.3%) | 3965 (8.2%) |
| Unique PDFs | 5001 | 5694 | 5729 | 37328 |
| Total Pages | 11,940 | 15,482 | 15,854 | 48,499 |
The table above shows the category distribution for different model versions. The last column
(Dataset 3) corresponds to the data of the latest vX.X.3 models, which actually used 14,000 pages of the
TEXT category, while the other columns cover all the used samples: specifically 80% as the training set
and 10% each as the development and test sets. The early model versions used 90% of the data for training
and the remaining 10% as both the development and test set, due to the lack of annotated (manually
classified) pages.
The disproportion of the categories in both the training data and the provided evaluation category_samples is NOT intentional, but rather a result of the nature of the source data.
The specific content and language of the source data are irrelevant considering the model's vision resolution. However, all of the data samples come from archaeological reports, which may somewhat affect the drawing detection preferences, since the commonly depicted objects are ceramic pieces, arrowheads, and rocks, formerly drawn by hand and later illustrated with digital tools (examples can be found in category_samples/DRAW).
The model versions are grounded in different sets of textual category descriptions, all illustrated in descriptions_comparison_graph.png, a graph containing separate and averaged results for all category description sets.
As our experiments showed, the averaging strategy is not the best one. Moreover, the smallest model, ViT-B/16, showed the best results after fine-tuning on one selected category description set.
Categories
| Label | Description |
|---|---|
| DRAW | drawings, maps, paintings, schematics, or graphics, potentially containing some text labels or captions |
| DRAW_L | drawings, etc., but presented within a table-like layout or including a legend formatted as a table |
| LINE_HW | handwritten text organized in a tabular or form-like structure |
| LINE_P | printed text organized in a tabular or form-like structure |
| LINE_T | machine-typed text organized in a tabular or form-like structure |
| PHOTO | photographs or photographic cutouts, potentially with text captions |
| PHOTO_L | photos presented within a table-like layout or accompanied by tabular annotations |
| TEXT | mixtures of printed, handwritten, and/or typed text, potentially with minor graphical elements |
| TEXT_HW | only handwritten text in paragraph or block form (non-tabular) |
| TEXT_P | only printed text in paragraph or block form (non-tabular) |
| TEXT_T | only machine-typed text in paragraph or block form (non-tabular) |
The categories were chosen to sort the pages by the following criteria:
- presence of graphical elements (drawings OR photos)
- type of text (handwritten OR printed OR typed OR mixed)
- presence of tabular layout / forms
The reason for this distinction is that different types of pages require different processing pipelines, which are applied after the classification.
Examples of pages sorted by category can be found in the category_samples directory, which is also available as a testing subset of the training data.
Results
| Version | Base Model + category set | Accuracy (%) | Comment |
|---|---|---|---|
| v1.1.3.1 | ViT-B/16 init | 99.1 | Very good |
| v1.1.3.2 | ViT-B/16 details | 99.08 | |
| v1.1.3.3 | ViT-B/16 extra | 99.12 | 2nd Best |
| v1.1.3.4 | ViT-B/16 gemini | 99.1 | Very good |
| v1.1.3.5 | ViT-B/16 gpt | 98.95 | |
| v1.1.3.6 | ViT-B/16 large | 99.1 | Very good |
| v1.1.3.7 | ViT-B/16 mid | 99.14 | Best |
| v1.1.3.8 | ViT-B/16 min | 98.86 | |
| v1.1.3.9 | ViT-B/16 short | 99.06 | |
| v1.1.3 | ViT-B/16 average | 99.06 | |
| v1.2.3.1 | ViT-B/32 init | 98.95 | |
| v1.2.3.3 | ViT-B/32 extra | 98.92 | |
| v1.2.3.4 | ViT-B/32 gemini | 98.94 | |
| v1.2.3.6 | ViT-B/32 large | 98.97 | |
| v1.2.3.7 | ViT-B/32 mid | 98.86 | |
| v1.2.3 | ViT-B/32 average | 98.99 | Larger & good |
| v2.2.3.1 | ViT-L/14-336px init | 98.86 | Large & OK |
| v2.2.3.3 | ViT-L/14-336px extra | 98.59 | |
| v2.2.3.4 | ViT-L/14-336px gemini | 98.97 | |
| v2.2.3.6 | ViT-L/14-336px large | 98.68 | |
| v2.2.3.7 | ViT-L/14-336px mid | 98.81 | |
| v2.2.3 | ViT-L/14-336px average | 98.72 | |
| v2.1.3.1 | ViT-L/14 init | 98.97 | |
| v2.1.3.3 | ViT-L/14 extra | 98.83 | |
| v2.1.3.4 | ViT-L/14 gemini | 98.86 | Large & OK |
| v2.1.3.6 | ViT-L/14 large | 98.92 | |
| v2.1.3.7 | ViT-L/14 mid | 98.9 | |
| v2.1.3 | ViT-L/14 average | 98.81 |
v1.1.3.1 Evaluation set accuracy (Top-1): 99.1%
v1.1.3.3 Evaluation set accuracy (Top-1): 99.12%
v1.1.3.4 Evaluation set accuracy (Top-1): 99.1%
v1.1.3.6 Evaluation set accuracy (Top-1): 99.1%
v1.1.3.7 Evaluation set accuracy (Top-1): 99.14%
v1.2.3 Evaluation set accuracy (Top-1): 98.99%
v2.1.3.1 Evaluation set accuracy (Top-1): 98.97%
v2.2.3.4 Evaluation set accuracy (Top-1): 98.97%
The confusion matrices provided above show, on the diagonal, the matching gold and predicted categories, while their off-diagonal elements show inter-class errors. From those graphs you can judge what types of mistakes to expect from your model.
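A minimal sketch for producing such a confusion matrix from collected gold and predicted labels (the label lists below are hypothetical, and scikit-learn is used here only for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

categories = ["DRAW", "DRAW_L", "LINE_HW", "LINE_P", "LINE_T",
              "PHOTO", "PHOTO_L", "TEXT", "TEXT_HW", "TEXT_P", "TEXT_T"]

gold = ["TEXT", "DRAW", "TEXT_T"]   # hypothetical gold labels of evaluation pages
pred = ["TEXT", "DRAW", "TEXT"]     # hypothetical model predictions for the same pages

# Row-normalized confusion matrix: diagonal = per-category accuracy,
# off-diagonal = inter-class errors
cm = confusion_matrix(gold, pred, labels=categories, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=categories).plot(xticks_rotation=45)
plt.show()
```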
Image preprocessing steps
- transforms.ColorJitter(brightness=0.5)
- transforms.ColorJitter(contrast=0.5)
- transforms.ColorJitter(saturation=0.5)
- transforms.ColorJitter(hue=0.5)
- transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5)))
- transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2))))
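A sketch of the augmentation pipeline composed from the steps listed above (the ordering into a single `Compose` is our assumption):

```python
import random
from PIL import ImageEnhance, ImageFilter
from torchvision import transforms

# Augmentations applied to PIL page images during training
train_augmentations = transforms.Compose([
    transforms.ColorJitter(brightness=0.5),
    transforms.ColorJitter(contrast=0.5),
    transforms.ColorJitter(saturation=0.5),
    transforms.ColorJitter(hue=0.5),
    transforms.Lambda(lambda img: ImageEnhance.Sharpness(img).enhance(random.uniform(0.5, 1.5))),
    transforms.Lambda(lambda img: img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0, 2)))),
])
```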
Training hyperparameters
- eval_strategy "epoch"
- save_strategy "epoch"
- learning_rate 5e-5
- per_device_train_batch_size 8
- per_device_eval_batch_size 8
- num_train_epochs 3
- warmup_ratio 0.1
- logging_steps 10
- load_best_model_at_end True
- metric_for_best_model "accuracy"
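A minimal `TrainingArguments` sketch matching the hyperparameters above (assuming a recent transformers version where the argument is named `eval_strategy`; the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="clip-historical-page-finetune",  # hypothetical output directory
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```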
Contacts
For support, write to [email protected], the contact responsible for this GitHub repository ^8.
Information about the authors of this project, including their names and ORCIDs, can be found in the CITATION.cff file.
Acknowledgements
- Developed by UFAL ^7
- Funded by ATRIUM ^4
- Shared by ATRIUM ^4 & UFAL ^7
- Model type: fine-tuned CLIP-ViT with a 224x224 ^2 or 336x336 ^13 ^14 input resolution
© 2022 UFAL & ATRIUM