DeepSeek OCR Pipeline on SageMaker Training Jobs
📖 Context: This notebook is part of the VLM-OCR Recipes on GPU Infrastructure article, which explains the architecture and design decisions behind this pipeline.
This notebook runs a three-stage OCR pipeline using SageMaker Training Jobs:
- Extract - Run DeepSeek OCR over a dataset, save Markdown and crop detected figures
- Describe - Generate captions for extracted figures
- Assemble - Enrich Markdown with figure captions
This is the SageMaker equivalent of the HuggingFace Jobs pipeline. It uses SageMaker ModelTrainer V3 with a vLLM container to run GPU-accelerated inference.
Key difference from HF Jobs: This notebook saves datasets to S3 instead of HuggingFace Hub.
Prerequisites
- AWS credentials configured
- SageMaker execution role with S3 access
- HuggingFace token for accessing source models and datasets
- SageMaker SDK V3 installed (`pip install sagemaker --upgrade`)
📑 Table of Contents
- Setup
- Authentication
- Configuration
- Bundle Pipeline Code
- Define Base Environment Variables
- Helper Functions
- Stage 1: Extract
- Stage 2: Describe
- Stage 3: Assemble
- Cost Analysis
- Pipeline Complete
⚙️ Setup
🔧 AWS SageMaker Training Jobs
SageMaker Training Jobs provide managed infrastructure for running compute-intensive workloads. While traditionally used for model training, they're equally well-suited for batch inference.
Why Training Jobs for batch OCR?
Training Jobs are the best option for accessing GPUs on SageMaker for offline/batch workloads:
- Direct GPU access: Spin up powerful instances like `ml.g6e.2xlarge` (L40S GPU) or `ml.p4d.24xlarge` (8x A100) on demand
- Pay per use: Billed per second only while the job runs - no idle costs
- Automatic cleanup: Jobs terminate automatically on completion, releasing resources
- S3 integration: Native support for reading/writing large datasets directly to S3
- vLLM DLC: Pre-built Deep Learning Containers with vLLM for efficient inference
- No infrastructure management: No cluster setup, scaling, or maintenance required
Alternative: SageMaker Endpoints
For other use cases, SageMaker Endpoints may be a better fit:
| Aspect | Training Jobs | Endpoints |
|---|---|---|
| Best for | Batch/offline processing | Real-time inference |
| Billing | Per-second while running | Per-hour while deployed |
| Latency | Minutes to start | Always ready (when deployed) |
| Cost model | Pay only during processing | Pay for uptime |
| Scaling | Single job, fixed resources | Auto-scaling on demand |
For this pipeline's batch OCR workload - processing thousands of documents in one go - Training Jobs are more cost-effective since we only pay for actual compute time rather than keeping an endpoint running.
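To make that trade-off concrete, here is a quick back-of-the-envelope comparison. This is illustrative only: it assumes the ~$2.80/hour `ml.g6e.2xlarge` rate from the cost section below and one two-hour batch run per day - check current SageMaker pricing for your region.

```python
# Rough monthly cost comparison: on-demand Training Jobs vs. an always-on endpoint.
# Assumptions: ~$2.80/hour for ml.g6e.2xlarge, one 2-hour batch run per day.
HOURLY_RATE = 2.80
BATCH_HOURS_PER_DAY = 2
DAYS = 30

training_jobs_monthly = HOURLY_RATE * BATCH_HOURS_PER_DAY * DAYS  # pay only while jobs run
endpoint_monthly = HOURLY_RATE * 24 * DAYS                        # endpoint billed for uptime

print(f"Training Jobs: ~${training_jobs_monthly:.0f}/month")      # ~$168
print(f"Always-on endpoint: ~${endpoint_monthly:.0f}/month")      # ~$2016
```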
📦 How the pipeline code is shipped
This notebook uses the SageMaker Python SDK v3 with the new ModelTrainer API to launch training jobs. The ModelTrainer class provides a simplified, declarative interface for configuring and running jobs.
For every SageMaker Training Job we launch, the logic is similar:
From this notebook, we bundle and upload to S3:
- The entrypoint script (`entry.sh` + `sm_job_runner.py`)
- The pipeline code in `llm_ocr/`
SageMaker automatically makes this code available at /opt/ml/input/data/code inside the container.
Then we launch a SageMaker Training Job using ModelTrainer:
trainer = ModelTrainer(
training_image=TRAINING_IMAGE, # vLLM DLC from ECR
source_code=SourceCode(source_dir), # Code bundle
compute=Compute(instance_type, instance_count),
hyperparameters={...}, # Environment variables
output_data_config=OutputDataConfig(s3_output_path),
)
trainer.train(wait=False)  # Async execution

The job then:
- Pulls the vLLM Deep Learning Container from ECR
- Downloads the code bundle from S3
- Runs `entry.sh`, which installs dependencies via `uv` and executes `sm_job_runner.py`
- The runner starts a vLLM server, then imports `llm_ocr.cli` and calls `main()` to run the pipeline stage
- Results are saved back to S3
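For orientation, the runner's control flow can be sketched as follows. This is a simplified sketch, not the actual `sm_job_runner.py`; the exact vLLM flags and health-check logic are assumptions.

```python
# Sketch of the runner's control flow: start vLLM, wait for it, run the pipeline stage.
import os
import subprocess
import time
import urllib.request

def start_vllm_server() -> subprocess.Popen:
    """Launch a vLLM OpenAI-compatible server configured from the job's env vars."""
    port = os.environ.get("PORT", "8000")
    proc = subprocess.Popen([
        "vllm", "serve", os.environ["MODEL_ID"],
        "--served-model-name", os.environ["SERVED_MODEL_NAME"],
        "--host", os.environ.get("HOST", "0.0.0.0"),
        "--port", port,
        "--max-model-len", os.environ.get("MAX_MODEL_LEN", "8192"),
    ])
    # Block until the server answers its health endpoint
    while True:
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health")
            return proc
        except OSError:
            time.sleep(5)

if __name__ == "__main__":
    server = None
    if os.environ.get("SKIP_SERVER_LAUNCH", "").lower() != "true":
        server = start_vllm_server()

    from llm_ocr.cli import main  # stage selected via the PIPELINE_STAGE env var
    main()

    if server is not None:
        server.terminate()
```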
📚 The dataset
This pipeline uses FineVision (HuggingFaceM4/FineVision) as a large, mixed image+text corpus for vision-language training/evaluation. FineVision aggregates many public sub-datasets into one unified interface, and you select a specific subset/config when loading.
- Dataset: `HuggingFaceM4/FineVision`
- Overview / exploration space: `HuggingFaceM4/FineVision` Space
The olmOCR subsets
The olmOCR-mix-0225 dataset from Allen AI contains 260,000 crawled PDF pages from over 100,000 diverse PDFs - academic papers, legal documents, public domain books, brochures, and more. It includes challenging content: graphics, handwritten text, multi-column layouts, tables, equations, and poor quality scans.
Available configs:
- `olmOCR-mix-0225-documents` - general documents
- `olmOCR-mix-0225-books` - book pages
📌 Note: In this pipeline, one document = one page of a PDF.
These mirror real-world enterprise use cases: contracts, invoices, reports, forms, and scanned documents that organizations need to digitize and extract structured information from.
Licensing note: FineVision is a collection of many datasets, each with its own license/terms. Make sure the subset you use is compatible with your intended downstream use (see the dataset card for details).
⚡ Inference Backend: vLLM
This pipeline uses vLLM as the inference backend for DeepSeek-OCR. vLLM provides:
- High throughput via continuous batching and PagedAttention
- OpenAI-compatible API - easy to integrate with existing code
- Efficient memory management - run large models on limited GPU memory
The SageMaker Training Job uses the official AWS vLLM Deep Learning Container (vllm:0.12.0-gpu-py312-cu129-ubuntu22.04-sagemaker). The pipeline sends batched requests (64 concurrent) to saturate the GPU and maximize throughput (~70 docs/min on ml.g6e.2xlarge).
📝 DeepSeek-OCR Prompts
DeepSeek-OCR supports different prompts for various OCR tasks. See the official config.py for examples:
| Use Case | Prompt |
|---|---|
| Document → Markdown | <image>\n<\|grounding\|>Convert the document to markdown. |
| General OCR | <image>\n<\|grounding\|>OCR this image. |
| Free OCR (no layout) | <image>\nFree OCR. |
| Parse figures | <image>\nParse the figure. |
| Describe image | <image>\nDescribe this image in detail. |
We configure these prompts via environment variables DOC_PROMPT and FIGURE_PROMPT in our job configuration, re-using the special tokens from the official DeepSeek-OCR config.
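To illustrate how these prompts end up in a request, here is roughly what a single OCR call against the vLLM OpenAI-compatible endpoint could look like. This is only a sketch - the pipeline's real request code lives in `llm_ocr`, and concurrency handling is omitted here.

```python
# Sketch: send one page image plus DOC_PROMPT to the vLLM server and collect markdown.
import base64
import io
import os

from openai import OpenAI
from PIL import Image

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def image_to_data_url(image: Image.Image) -> str:
    """Encode a PIL image as a base64 data URL accepted by the chat completions API."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/png;base64,{encoded}"

def ocr_page(image: Image.Image) -> str:
    response = client.chat.completions.create(
        model=os.environ.get("SERVED_MODEL_NAME", "deepseek-ocr"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": os.environ["DOC_PROMPT"]},
                {"type": "image_url", "image_url": {"url": image_to_data_url(image)}},
            ],
        }],
        max_tokens=int(os.environ.get("DOC_MAX_TOKENS", "4096")),
        temperature=float(os.environ.get("DOC_TEMPERATURE", "0.1")),
    )
    return response.choices[0].message.content
```

The describe stage works the same way, using FIGURE_PROMPT, FIGURE_MAX_TOKENS, and FIGURE_TEMPERATURE for the extracted figure crops.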
>> # Preview 3 random document images
>> from datasets import load_dataset
>> from itertools import islice
>> from IPython.display import display
>> ds = load_dataset("HuggingFaceM4/FineVision", "olmOCR-mix-0225-documents", split="train", streaming=True).shuffle(seed=123)
>> for i, s in enumerate(islice(ds, 3)):
... print(f"--- Doc {i} ---")
... img = s["images"][0]
... img.thumbnail((500, 500)) # Resize to max 500px
... display(img)

--- Doc 0 ---
🔐 Authentication
🔐 AWS Authentication
Before running this notebook, ensure your AWS credentials are configured. You can authenticate using one of these methods:
Option 1: AWS SSO (recommended for organizations)
aws configure sso
aws sso login --profile your-profile-name
Option 2: IAM credentials via environment variables
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1

Option 3: AWS CLI configuration
aws configure
For more details, see the AWS CLI Configuration Guide.
Note: When running on SageMaker Studio or EC2 with an IAM role attached, credentials are automatically available via the instance metadata service.
!pip3 install sagemaker --upgrade --quiet
!pip install -U "datasets>=4.0.0" "s3fs" "fsspec"

# 🔑 Authenticate with Hugging Face
# Required for accessing private datasets and pushing results
# Get your token at: https://huggingface.co/settings/tokens
import os
from huggingface_hub import login, get_token
login()
# Store token in env var for SageMaker Jobs
HF_TOKEN = get_token()
os.environ["HF_TOKEN"] = HF_TOKEN
print(f"HF_TOKEN set: {HF_TOKEN[:8]}...")>> import os
>> import json
>> import shutil
>> import tempfile
>> import time
>> from pathlib import Path
>> import boto3
>> import sagemaker
>> from sagemaker.train.model_trainer import ModelTrainer
>> from sagemaker.train.configs import SourceCode, Compute, StoppingCondition, OutputDataConfig
>> from sagemaker.core.helper.session_helper import Session, get_execution_role

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
⚙️ Configuration
# Initialize SageMaker session
sagemaker_session = Session()
iam = boto3.client('iam')
role = iam.get_role(RoleName='<YOUR-ROLE-NAME>')['Role']['Arn']
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]
print(f"Region: {region}")
print(f"Account: {account_id}")
print(f"Role: {role}")>> # Pipeline Configuration
>> PROJECT_NAME = "deepseek-ocr-sagemaker"
>> BUCKET_NAME = sagemaker_session.default_bucket()
>> S3_PREFIX = f"{PROJECT_NAME}"
>> # S3 output path (single location for all stages - dataset gets updated in place)
>> S3_OUTPUT_URI = f"s3://{BUCKET_NAME}/{S3_PREFIX}"
>> # vLLM Container - use SageMaker vLLM DLC
>> TRAINING_IMAGE = f"763104351884.dkr.ecr.{region}.amazonaws.com/vllm:0.12.0-gpu-py312-cu129-ubuntu22.04-sagemaker-v1.0" # GPU stages
>> LIGHTWEIGHT_IMAGE = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:2.4.0-cpu-py311-ubuntu22.04-sagemaker" # CPU-only assemble
>> # Instance configuration
>> INSTANCE_TYPE = "ml.g6e.2xlarge" # GPU instances for extract/describe stages
>> INSTANCE_TYPE_CPU = "ml.c5.xlarge" # CPU-only instance for assemble stage (much cheaper) # Single L40s GPU
>> # INSTANCE_TYPE = "ml.p4d.24xlarge" # 8x A100 GPUs for larger scale
>> VOLUME_SIZE_GB = 100
>> MAX_RUNTIME_SECONDS = 3 * 60 * 60 # 3 hours
>> # Source dataset (from HuggingFace)
>> SOURCE_DATASET = "HuggingFaceM4/FineVision"
>> SOURCE_CONFIG = "olmOCR-mix-0225-documents"
>> MAX_SAMPLES = 1024 # Start small for testing
>> # HuggingFace token for accessing source datasets
>> HF_TOKEN = os.environ.get("HF_TOKEN", "")
>> print(f"S3 Bucket: s3://{BUCKET_NAME}/{S3_PREFIX}")
>> print(f"S3 Output URI: {S3_OUTPUT_URI}")
>> print(f"Instance: {INSTANCE_TYPE}")
>> print(f"Source: {SOURCE_DATASET}/{SOURCE_CONFIG} ({MAX_SAMPLES} samples)")S3 Bucket: s3://sagemaker-us-east-1-754289655784/deepseek-ocr-sagemaker S3 Output URI: s3://sagemaker-us-east-1-754289655784/deepseek-ocr-sagemaker Instance: ml.g6e.2xlarge Source: HuggingFaceM4/FineVision/olmOCR-mix-0225-documents (1024 samples)
📦 Bundle Pipeline Code
SageMaker automatically uploads this bundle to S3 and makes it available at /opt/ml/input/data/code.
>> # Paths to pipeline code
>> CODE_PATHS = [
... Path("entry.sh"),
... Path("sm_job_runner.py"),
... Path("../llm_ocr"),
... ]
>> # Create a source directory bundle
>> source_dir = Path(tempfile.mkdtemp(prefix="sm-ocr-code-"))
>> for path in CODE_PATHS:
... src = Path.cwd() / path if not path.is_absolute() else path
... if src.is_dir():
... shutil.copytree(src, source_dir / path.name, dirs_exist_ok=True)
... else:
... shutil.copy2(src, source_dir / path.name)
>> print(f"Source directory: {source_dir}")
>> print(f"Contents: {list(source_dir.iterdir())}")Source directory: /tmp/sm-ocr-code-xymxcvqk
Contents: [PosixPath('/tmp/sm-ocr-code-xymxcvqk/llm_ocr'), PosixPath('/tmp/sm-ocr-code-xymxcvqk/sm_job_runner.py'), PosixPath('/tmp/sm-ocr-code-xymxcvqk/entry.sh')]
# Dependencies are declared in sm_job_runner.py inline metadata (PEP 723)
# entry.sh installs uv and runs: uv run sm_job_runner.py
# This automatically installs all dependencies
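As a reminder of what PEP 723 inline metadata looks like, the top of sm_job_runner.py declares its dependencies in a block along these lines (the dependency list below is illustrative, not the exact one shipped in the repo):

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "datasets",
#     "openai",
#     "s3fs",
# ]
# ///

# ...rest of sm_job_runner.py follows; `uv run sm_job_runner.py` installs the
# dependencies declared above automatically before executing the script.
```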
🔧 Define Base Environment Variables

# Base environment variables for all stages
# All configuration is passed via environment variables (same as HF Jobs)
BASE_ENV = {
# vLLM configuration
"MODEL_ID": "deepseek-ai/DeepSeek-OCR",
"SERVED_MODEL_NAME": "deepseek-ocr",
"HOST": "0.0.0.0",
"PORT": "8000",
"MAX_MODEL_LEN": "8192",
"GPU_MEMORY_UTILIZATION": "0.90",
"TENSOR_PARALLEL_SIZE": "1",
# HuggingFace authentication (for source datasets)
# Note: For production, consider using AWS Secrets Manager instead of env vars
"HF_TOKEN": os.environ.get("HF_TOKEN", ""),
"HF_HUB_ENABLE_HF_TRANSFER": "1",
# Prompts
"DOC_PROMPT": "<image>\n<|grounding|>Convert this document to Markdown.",
"DOC_MAX_TOKENS": "4096",
"DOC_TEMPERATURE": "0.1",
"FIGURE_PROMPT": "<image>\nDescribe this image in detail.",
"FIGURE_MAX_TOKENS": "512",
"FIGURE_TEMPERATURE": "0.6",
}

🛠️ Helper Functions
# Import IO and rendering utilities from llm_ocr
import sys; sys.path.insert(0, "..")
from llm_ocr.sm_io import load_dataset_from_s3
from llm_ocr.document import render_sample_markdown, display_markdown, display_samples
def launch_stage(stage: str, env: dict = None, use_gpu: bool = True):
"""Launch a pipeline stage as a SageMaker Training Job.
Args:
stage: Pipeline stage (extract, describe, assemble)
env: Stage-specific environment variables (optional)
use_gpu: Whether to use GPU instance and image (default True)
Returns:
Tuple of (ModelTrainer, job_name)
"""
import uuid
# Generate unique base job name
unique_id = uuid.uuid4().hex[:8]
base_name = f"{PROJECT_NAME}-{stage}-{unique_id}"
# Merge base env with stage-specific env
full_env = {**BASE_ENV, "PIPELINE_STAGE": stage}
if env:
full_env.update(env)
# Select image and instance based on GPU usage
if use_gpu:
image_uri = TRAINING_IMAGE
instance_type = INSTANCE_TYPE
else:
# Lightweight config for CPU-only stages (assemble)
image_uri = LIGHTWEIGHT_IMAGE
instance_type = INSTANCE_TYPE_CPU
# Create trainer
trainer = ModelTrainer(
sagemaker_session=sagemaker_session,
role=role,
training_mode="SAGEMAKER_TRAINING_JOB",
source_code=SourceCode(
source_dir=str(source_dir),
entry_script="entry.sh",
),
compute=Compute(
instance_type=instance_type,
instance_count=1,
volume_size_in_gb=VOLUME_SIZE_GB,
),
stopping_condition=StoppingCondition(
max_runtime_in_seconds=MAX_RUNTIME_SECONDS,
),
output_data_config=OutputDataConfig(
s3_output_path=f"s3://{BUCKET_NAME}/{S3_PREFIX}/output/",
),
base_job_name=base_name,
environment=full_env,
training_image=image_uri,
)
print(f"Launching {stage} stage...")
trainer.train(wait=False)
# Find the actual job name using list_training_jobs API
sm_client = sagemaker_session.sagemaker_client
time.sleep(2) # Brief wait for job to register
response = sm_client.list_training_jobs(
NameContains=base_name,
SortBy='CreationTime',
SortOrder='Descending',
MaxResults=1
)
if response['TrainingJobSummaries']:
actual_job_name = response['TrainingJobSummaries'][0]['TrainingJobName']
else:
actual_job_name = base_name # Fallback
print(f"Job started: {actual_job_name}")
return trainer, actual_job_name
def wait_for_job(job_name: str, poll_interval: int = 30, timeout: int = 10800):
"""Wait for a SageMaker Training Job to complete.
Args:
job_name: The exact job name
poll_interval: Seconds between status checks
timeout: Maximum seconds to wait
"""
sm_client = sagemaker_session.sagemaker_client
start_time = time.time()
print(f"Waiting for job {job_name}...")
while time.time() - start_time < timeout:
response = sm_client.describe_training_job(TrainingJobName=job_name)
status = response['TrainingJobStatus']
elapsed = time.time() - start_time
mins, secs = divmod(int(elapsed), 60)
if status == 'Completed':
print(f" {job_name}: Completed β ({mins:02d}:{secs:02d})")
return response
elif status == 'Failed':
print(f" {job_name}: Failed β")
print(f" Reason: {response.get('FailureReason', 'Unknown')}")
return response
elif status == 'Stopped':
print(f" {job_name}: Stopped")
return response
else:
print(f" {job_name}: {status}... ({mins:02d}:{secs:02d})")
time.sleep(poll_interval)
raise TimeoutError(f"Job {job_name} did not complete within {timeout}s")π Stage 1: Extract
Run OCR on the source dataset to extract markdown and figures. Output is saved to S3 (not HF Hub).
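Later cells read results back with load_dataset_from_s3 from llm_ocr.sm_io. Conceptually, persisting a Hugging Face dataset through S3 can be done along these lines - a sketch under assumptions, not the actual helper implementation:

```python
# Sketch: persist a Hugging Face dataset to S3 and read it back, via a local copy.
import s3fs
from datasets import Dataset, load_from_disk

def save_dataset_to_s3(ds: Dataset, s3_uri: str, local_dir: str = "/tmp/dataset_out") -> None:
    ds.save_to_disk(local_dir)                 # write Arrow files locally first
    fs = s3fs.S3FileSystem()
    fs.put(local_dir, s3_uri, recursive=True)  # then upload the directory to S3

def load_dataset_from_s3(s3_uri: str, local_dir: str = "/tmp/dataset_in") -> Dataset:
    fs = s3fs.S3FileSystem()
    fs.get(s3_uri, local_dir, recursive=True)  # download the Arrow files
    return load_from_disk(local_dir)
```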

How to set up batch size for efficient processing
Since we're running batch inference (not serving live users), we can aggressively maximize GPU utilization without worrying about latency SLAs. The goal is to keep the GPU fully saturated by maintaining enough concurrent requests in flight.
Understanding vLLM's KV cache capacity
vLLM allocates GPU memory for its KV cache, which determines how many concurrent requests can be processed. When vLLM starts, it calculates and logs the KV cache capacity for your specific GPU:
INFO [kv_cache_utils.py] GPU KV cache size: 567,488 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 8,192 tokens per request: 69.27x

Check your job logs to find these values for your hardware. The maximum concurrency depends on:
max_concurrency = KV_cache_tokens / tokens_per_request

For the G6E instance (L40S GPU, 48GB) with a sizing length of 8,192 total tokens (prompt + generated):
| GPU | KV Cache Tokens | Hard Cap | Safe Target (70-85%) |
|---|---|---|---|
| L40S (48GB) | ~567,488 | 69 | 50-60 |
Setting safe concurrency
The EXTRACT_BATCH_SIZE parameter controls concurrent requests sent to vLLM. To set it safely:
- Estimate total tokens per request: `L_total = prompt_tokens + generated_tokens`. For OCR, generated markdown can be substantial - use your p95 (not average) to avoid preemption.
- Apply 70-85% headroom: this accounts for variance in document lengths and prevents KV cache pressure.
- If most docs are well below 8,192 tokens, you can push concurrency higher.
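A minimal version of that calculation, using the KV cache size logged by vLLM (the values below are the L40S numbers from the table above):

```python
# Estimate a safe EXTRACT_BATCH_SIZE from the KV cache size reported in the vLLM logs.
KV_CACHE_TOKENS = 567_488       # from "GPU KV cache size" in the job logs (L40S, 48GB)
P95_TOKENS_PER_REQUEST = 8_192  # p95 of prompt + generated tokens for your documents
HEADROOM = 0.80                 # stay at 70-85% of the hard cap

hard_cap = KV_CACHE_TOKENS / P95_TOKENS_PER_REQUEST
safe_concurrency = int(hard_cap * HEADROOM)

print(f"Hard cap: {hard_cap:.0f} concurrent requests")  # ~69
print(f"Safe target: ~{safe_concurrency}")               # ~55
```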
Our dataset: 1 page = 1 request
In this pipeline, each request processes a single PDF page, which typically produces far fewer tokens than the 8,192 sizing length. This allows us to push concurrency well beyond the conservative estimates above. In practice, 128 concurrent requests worked safely on G6E (L40S) - nearly 2x the theoretical hard cap - because actual token usage per page is much lower.
>> # Stage 1: Extract
>> # Output dataset will be saved to S3
>> stage1_env = {
... # Source dataset (from HuggingFace)
... "DATASET_NAME": SOURCE_DATASET,
... "DATASET_CONFIG": SOURCE_CONFIG,
... "DATASET_SPLIT": "train",
... "MAX_SAMPLES": str(MAX_SAMPLES),
... # Local output directory
... "OUTPUT_DIR": "./outputs",
... # Batch settings
... "EXTRACT_BATCH_SIZE": "128",
... # S3 output (single location for all stages)
... "S3_OUTPUT_URI": S3_OUTPUT_URI,
... }
>> stage1_trainer, stage1_job_name = launch_stage("extract", stage1_env)

Launching extract stage...
# Wait for Stage 1 to complete
# Estimated time: ~15-20 min for 1024 samples on ml.g6e.2xlarge (scales linearly)
stage1_result = wait_for_job(stage1_job_name)
print(f"Extract stage completed: {stage1_result['TrainingJobStatus']}")>> # Load and display samples after Extract
>> ds_extract = load_dataset_from_s3(f"{S3_OUTPUT_URI}/dataset")
>> display_samples(ds_extract, num_samples=2)

Dataset: 1023 samples
Columns: ['sample_id', 'dataset_index', 'source_image', 'document_with_boxes_image', 'document_markdown', 'extracted_figures', 'extracted_figures_metadata', 'document_final_markdown']
=== Sample 0: sample_00000 ===
Source image:
🏷️ Stage 2: Describe
Generate captions for extracted figures. Input is read from S3 (output of Stage 1), output is saved to S3.

>> # Stage 2: Describe
>> # Updates dataset in place (same location as extract)
>> stage2_env = {
... # Local output directory
... "OUTPUT_DIR": "./outputs",
... # Batch settings
... "DESCRIBE_BATCH_SIZE": "128",
... # S3 input and output (same location - updates in place)
... "S3_INPUT_URI": f"{S3_OUTPUT_URI}/dataset",
... "S3_OUTPUT_URI": S3_OUTPUT_URI,
... }
>> stage2_trainer, stage2_job_name = launch_stage("describe", stage2_env)

Launching describe stage...
>> # Wait for Stage 2 to complete
>> # Estimated time: ~8-10 min for 1024 samples on ml.g6e.2xlarge
>> stage2_result = wait_for_job(stage2_job_name)
>> print(f"Describe stage completed: {stage2_result['TrainingJobStatus']}")Waiting for job deepseek-ocr-sagemaker-describe-e7a0a2b5-20260115112243... deepseek-ocr-sagemaker-describe-e7a0a2b5-20260115112243: Completed β (00:00) Describe stage completed: Completed
>> # Load and display samples after Describe
>> ds_describe = load_dataset_from_s3(f"{S3_OUTPUT_URI}/dataset")
>> display_samples(ds_describe, num_samples=2)

Dataset: 1023 samples
Columns: ['sample_id', 'dataset_index', 'source_image', 'document_with_boxes_image', 'document_markdown', 'extracted_figures', 'extracted_figures_metadata', 'document_final_markdown']
=== Sample 0: sample_00000 ===
Source image:
🧩 Stage 3: Assemble
Enrich markdown with figure captions to create the final dataset. This stage runs CPU-only with a lightweight image and smaller instance type - no vLLM or GPU needed.
💡 Uses `LIGHTWEIGHT_IMAGE` + `INSTANCE_TYPE_CPU` instead of the full vLLM setup, significantly reducing costs.

>> # Stage 3: Assemble
>> # Updates dataset in place + saves final markdown files
>> stage3_env = {
... # Local output directory
... "OUTPUT_DIR": "./outputs",
... # S3 input and output (same location - updates in place)
... "S3_INPUT_URI": f"{S3_OUTPUT_URI}/dataset",
... "S3_OUTPUT_URI": S3_OUTPUT_URI,
... # Assemble stage doesn't need GPU
... "SKIP_SERVER_LAUNCH": "true",
... }
>> stage3_trainer, stage3_job_name = launch_stage("assemble", stage3_env, use_gpu=False)  # CPU-only

Launching assemble stage...
>> # Wait for Stage 3 to complete
>> # Estimated time: ~3-5 min (CPU-only, just text processing)
>> stage3_result = wait_for_job(stage3_job_name)
>> print(f"Assemble stage completed: {stage3_result['TrainingJobStatus']}")Waiting for job deepseek-ocr-sagemaker-assemble-107f1257-20260115114038... deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (00:00) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (00:30) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (01:00) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (01:30) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (02:00) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: InProgress... (02:30) deepseek-ocr-sagemaker-assemble-107f1257-20260115114038: Completed β (03:00) Assemble stage completed: Completed
>> # Load and display final samples after Assemble
>> ds_final = load_dataset_from_s3(f"{S3_OUTPUT_URI}/dataset")
>> display_samples(ds_final, num_samples=2)

Dataset: 1023 samples
Columns: ['sample_id', 'dataset_index', 'source_image', 'document_with_boxes_image', 'document_markdown', 'extracted_figures', 'extracted_figures_metadata', 'document_final_markdown']
=== Sample 0: sample_00000 ===
Source image:
# Display rendered markdown with images for sample 1
# This properly renders figure: URIs using images from extracted_figures column
display_markdown(ds_final[1])

💰 Cost Analysis (Extract stage only)
| Metric | Value |
|---|---|
| 🖥️ Hardware | ml.g6e.2xlarge (L40S, 48GB) |
| ⚡ Throughput | ~83 pages/min |
| 🔄 Concurrency | 128 parallel requests (saturates GPU batch) |
| 💵 Hourly rate | ~$2.80/hour |

| Scale | ⏱️ Time | 💲 Cost |
|---|---|---|
| 1,000 pages | ~12 min | ~$0.56 |
| 10,000 pages | ~2 hours | ~$5.60 |
| 100,000 pages | ~20 hours | ~$56 |
📌 Note: 1 page = 1 PDF page in these benchmarks. Pricing based on SageMaker on-demand pricing.
💡 Cost optimization: These costs can be further optimized by evaluating the best instance type and hardware utilization based on your dataset characteristics (average page complexity, token lengths, batch sizes). Consider testing different instance types (e.g., ml.g5, ml.p4d) to find the optimal price/performance ratio for your workload.
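The table above comes from simple arithmetic on throughput and the hourly rate; here is a small helper to redo the estimate with your own measured numbers (defaults are the values from the table):

```python
# Reproduce the cost table: pages / throughput gives runtime, runtime * rate gives cost.
def estimate_cost(pages: int, pages_per_min: float = 83, hourly_rate: float = 2.80):
    hours = pages / pages_per_min / 60
    return hours, hours * hourly_rate

for pages in (1_000, 10_000, 100_000):
    hours, cost = estimate_cost(pages)
    print(f"{pages:>7,} pages: ~{hours:.1f} h, ~${cost:.2f}")
```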
✅ Pipeline Complete
The OCR pipeline has finished. Your dataset is available in S3:
print(f"\n" + "="*60)
print("Pipeline Complete!")
print("="*60)
print(f"\nS3 Output Location: {S3_OUTPUT_URI}")
print(f" - Dataset: {S3_OUTPUT_URI}/dataset/")
print(f" - Files: {S3_OUTPUT_URI}/outputs/")
print(f"\nS3 Job Output: s3://{BUCKET_NAME}/{S3_PREFIX}/output/")
print("\nJob Summary:")
for i, (name, result) in enumerate([
("Extract", stage1_result),
("Describe", stage2_result),
("Assemble", stage3_result),
], 1):
status = result["TrainingJobStatus"]
print(f" {i}. {name}: {status}")Update on GitHubπ Find the complete example on GitHub here!