
BEEspoke Data

community

https://www.bees.org/

AI & ML interests

'an LLM is only as good as the dataset it was trained on' - Sun Tzu

BEE-spoke-data's collections (8)

Survivor Library Books - OCR

Books from the Survivor Library (mostly ~1920s & earlier) OCR'd with recent VLMs

BEE-spoke-data/SurvivorLib-Nanonets-OCR-s

Viewer • Updated 21 days ago • 14.4k • 19 • 2
BEE-spoke-data/SurvivorLib-rolmOCR

Viewer • Updated 21 days ago • 14.6k • 14 • 1

finetuned smol 220M

smol_llama 220M fine-tunes we did

BEE-spoke-data/smol_llama-220M-openhermes

Text Generation • 0.2B • Updated 21 days ago • 1.17k • 5
BEE-spoke-data/smol_llama-220M-open_instruct

Text Generation • 0.2B • Updated 21 days ago • 6 • 2
BEE-spoke-data/beecoder-220M-python

Text Generation • 0.2B • Updated 21 days ago • 15 • 3
BEE-spoke-data/zephyr-220m-sft-full

Text Generation • 0.2B • Updated 21 days ago • 1.05k • 1

Bee Models 🍯

models fine-tuned to be knowledgeable about apiary practice

BEE-spoke-data/TinyLlama-3T-1.1bee

Text Generation • 1B • Updated 21 days ago • 4 • 2
BEE-spoke-data/TinyLlama-1.1bee

Text Generation • 1B • Updated 21 days ago • 6 • 1
BEE-spoke-data/Meta-Llama-3-8Bee

Text Generation • 8B • Updated 21 days ago • 10
BEE-spoke-data/phi-1bee5

Text Generation • 1B • Updated 21 days ago • 3 • 1

trained and adapted tokenizers - various

BEE-spoke-data/claude-tokenizer

Updated 21 days ago
BEE-spoke-data/claude-tokenizer-forT5

Updated 21 days ago
BEE-spoke-data/slimpajama_tok-48128-BPE-forT5

Updated 21 days ago
BEE-spoke-data/BeeTokenizer

Updated 21 days ago • 1

🚧"raw" pretrained smol_llama checkpoints - WIP 🚧

BEE-spoke-data/smol_llama-101M-GQA

Text Generation • 0.1B • Updated 21 days ago • 2.82k • 31
BEE-spoke-data/smol_llama-81M-tied

Text Generation • 81.3M • Updated 21 days ago • 1.18k • 9
BEE-spoke-data/smol_llama-220M-GQA

Text Generation • 0.2B • Updated 21 days ago • 3.32k • 13
BEE-spoke-data/verysmol_llama-v11-KIx2

Text Generation • 58.1M • Updated 21 days ago • 1.16k • 4

Pretrained Encoders

Pretrained encoder (fill-mask) models we made

BEE-spoke-data/bert-plus-L8-4096-v1.0

Fill-Mask • 88.1M • Updated 21 days ago • 6
BEE-spoke-data/mega-encoder-small-16k-v1

Fill-Mask • 0.1B • Updated 21 days ago • 7 • 4

book genre classifiers

text classification models for book genres

BEE-spoke-data/albert-xxlarge-v2-description2genre

Text Classification • 0.2B • Updated 21 days ago • 6 • 2
BEE-spoke-data/mobilebert-uncased-title2genre

Text Classification • 24.6M • Updated 21 days ago • 4 • 1
BEE-spoke-data/roberta-large-title2genre

Text Classification • 0.4B • Updated 21 days ago • 5 • 1
BEE-spoke-data/roberta-base-description2genre

Text Classification • 0.1B • Updated 21 days ago • 1

FineWeb Concept Datasets

concept datasets extracted from fineweb

BEE-spoke-data/SaunaWeb-50k

Viewer • Updated 21 days ago • 50k • 10
BEE-spoke-data/FineMeme-100k

Viewer • Updated 21 days ago • 100k • 52
BEE-spoke-data/beeweb-5k

Viewer • Updated 21 days ago • 5k • 28
BEE-spoke-data/fineweb-synergy-20k

Viewer • Updated 21 days ago • 20k • 17
