Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs
BEEspoke Data
community
AI & ML interests
'an LLM is only as good as the dataset it was trained on' - Sun Tzu
smol_llama 220M fine-tunes we did
- BEE-spoke-data/smol_llama-220M-openhermes
  Text Generation • 0.2B • Updated • 1.17k • 5
- BEE-spoke-data/smol_llama-220M-open_instruct
  Text Generation • 0.2B • Updated • 6 • 2
- BEE-spoke-data/beecoder-220M-python
  Text Generation • 0.2B • Updated • 15 • 3
- BEE-spoke-data/zephyr-220m-sft-full
  Text Generation • 0.2B • Updated • 1.05k • 1
models fine-tuned to be knowledgeable about apiary practice
- BEE-spoke-data/TinyLlama-3T-1.1bee
  Text Generation • 1B • Updated • 4 • 2
- BEE-spoke-data/TinyLlama-1.1bee
  Text Generation • 1B • Updated • 6 • 1
- BEE-spoke-data/Meta-Llama-3-8Bee
  Text Generation • 8B • Updated • 10
- BEE-spoke-data/phi-1bee5
  Text Generation • 1B • Updated • 3 • 1
trained and adapted tokenizers - various
🚧 "raw" pretrained smol_llama checkpoints - WIP 🚧
- BEE-spoke-data/smol_llama-101M-GQA
  Text Generation • 0.1B • Updated • 2.82k • 31
- BEE-spoke-data/smol_llama-81M-tied
  Text Generation • 81.3M • Updated • 1.18k • 9
- BEE-spoke-data/smol_llama-220M-GQA
  Text Generation • 0.2B • Updated • 3.32k • 13
- BEE-spoke-data/verysmol_llama-v11-KIx2
  Text Generation • 58.1M • Updated • 1.16k • 4
Pretrained encoder (fill-mask) models we made
text classification models for book genres
- BEE-spoke-data/albert-xxlarge-v2-description2genre
  Text Classification • 0.2B • Updated • 6 • 2
- BEE-spoke-data/mobilebert-uncased-title2genre
  Text Classification • 24.6M • Updated • 4 • 1
- BEE-spoke-data/roberta-large-title2genre
  Text Classification • 0.4B • Updated • 5 • 1
- BEE-spoke-data/roberta-base-description2genre
  Text Classification • 0.1B • Updated • 1
concept datasets extracted from fineweb