DeepSeek-OCR compresses long text into image form, so that a document can be represented with significantly fewer vision tokens than the equivalent text would require. This improves efficiency on long-context tasks while retaining robust OCR capability. The model is designed for use with Hugging Face Transformers on NVIDIA GPUs and depends on torch, transformers, and flash-attn. Users can adjust parameters such as base size and image size to tune performance, and the model can convert documents into markdown format, among other functions. The project acknowledges models and ideas from Vary, GOT-OCR2.0, and PaddleOCR, as well as benchmarks such as Fox and OmniDocBench.
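
As an illustrative sketch only, the snippet below shows what the described Hugging Face Transformers workflow might look like. The model ID `deepseek-ai/DeepSeek-OCR`, the `<image>` prompt format, and the custom `infer()` helper loaded via `trust_remote_code` are assumptions drawn from the workflow summarized above and should be verified against the upstream repository.

```python
# Illustrative sketch; the model ID, prompt format, and infer() arguments
# are assumptions and should be checked against the upstream README.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"  # assumed Hugging Face model ID

# The repository ships custom modeling code, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # requires flash-attn
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# Convert a document image to markdown; base_size / image_size control
# how the page is tiled and how many vision tokens represent it.
prompt = "<image>\nConvert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="document.png",
    output_path="./output",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```

Lowering base size and image size reduces the number of vision tokens per page (and thus memory and latency), at the cost of fine-grained detail on dense documents.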