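feat: initial commit - scrolling-screenshot OCR tool

- Implement smart region detection (grayscale threshold + consecutive-row test)
- Support Umi-OCR and custom HTTP OCR services
- Add hotkey trigger and mouse drag-to-select region
- Implement automatic scrolling and smart stop logic
- Add full README documentation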
2026-03-06 15:07:51 +08:00
commit 8600c0f576
6 changed files with 1247 additions and 0 deletions

46
.gitignore vendored Normal file

@@ -0,0 +1,46 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Virtual environments
venv/
env/
ENV/
.venv/
.env/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# Output directory
output/
# Logs
*.log
logs/
# OS
.DS_Store
Thumbs.db

215
README.md Normal file

@@ -0,0 +1,215 @@
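# Scrolling-Screenshot OCR Tool
A smart scrolling-screenshot OCR tool that automatically detects the content blocks ("divs") on a page, captures them while scrolling, and runs OCR text recognition on each capture.
## Features
- 🎯 **Smart region detection**: a grayscale threshold plus a consecutive-row test separates content blocks (divs) from blank gaps
- 📜 **Automatic scrolling capture**: computes the scroll distance from the content height and captures screen after screen
- 🔤 **OCR text recognition**: supports Umi-OCR and custom HTTP OCR services
- ⌨️ **Hotkey trigger**: press `Ctrl+F9` to start
- 🖱️ **Region selection**: drag the mouse to select the capture area
- 🛑 **Smart stop**: stops automatically when repeated content is detected
## Use cases
- OCR of long web pages via scrolling screenshots
- Exporting chat histories
- Extracting text from long documents
- Any content that has to be scrolled to be read in full
## Installation
### 1. Clone the repository
```bash
git clone <remote-repository-URL>
cd long-screen-cut
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
Dependencies: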
- opencv-python >= 4.8.0
- numpy >= 1.24.0
- pillow >= 10.0.0
- pyautogui >= 0.9.54
- keyboard >= 0.13.5
- mouse >= 0.7.1
- requests >= 2.31.0
- loguru >= 0.7.0
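### 3. Install an OCR engine (choose one)
#### Option A: Umi-OCR (recommended)
1. Download [Umi-OCR](https://github.com/hiroi-sora/Umi-OCR/releases)
2. Unzip it and run `Umi-OCR.exe`
3. Go to **Settings → HTTP interface**
4. Check **Enable HTTP service**
5. Make sure the port is `1224` (the default)
#### Option B: custom HTTP OCR service
Use `ocr_server_example.py` as a reference for implementing your own OCR service, or change the configuration to use another OCR API.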
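
For a custom service, `main.py` posts a JSON body of the form `{"image": "<base64-encoded PNG>"}` to `OCR_API_URL` and reads a `texts` list from the JSON response; before starting it also probes a `/health` endpoint. The sketch below shows that payload round trip with the standard library only. The helper names `build_ocr_payload`, `decode_ocr_payload` and `parse_ocr_response` are hypothetical and do not exist in the repository.

```python
import base64
import json


def build_ocr_payload(png_bytes: bytes) -> str:
    """Client side: wrap raw PNG bytes in the JSON body the /ocr endpoint expects."""
    return json.dumps({"image": base64.b64encode(png_bytes).decode("ascii")})


def decode_ocr_payload(body: str) -> bytes:
    """Server side: recover the raw image bytes from the JSON body."""
    data = json.loads(body)
    if "image" not in data:
        raise ValueError("missing 'image' field")
    return base64.b64decode(data["image"])


def parse_ocr_response(body: str) -> list:
    """Client side: read the recognized lines; a missing 'texts' key yields []."""
    return json.loads(body).get("texts", [])


if __name__ == "__main__":
    fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8  # placeholder bytes, not a real image
    body = build_ocr_payload(fake_png)
    assert decode_ocr_payload(body) == fake_png
    print(parse_ocr_response('{"texts": ["line 1", "line 2"], "count": 2}'))
```

Any service that accepts this body and answers with a `texts` list should be usable in place of the bundled example.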
## Usage
### Start the program
```bash
python main.py
```
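### Workflow
1. **Wait for the hotkey**: after startup the program prints a prompt saying it is waiting for `Ctrl+F9`
2. **Trigger the capture**: press `Ctrl+F9`
3. **Service check**: the program checks whether the OCR service is running
4. **Select a region**: hold the left mouse button and drag over the area to capture
5. **Automatic processing**: the program then:
   - captures the selected screen region
   - analyzes the content blocks (divs)
   - runs OCR on the text
   - computes the scroll distance
   - scrolls to the next screen
   - repeats the steps above
6. **Automatic stop**: stops when the OCR result is identical to that of the previous capture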
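
The stop rule in step 6 is an exact comparison of the current OCR line list with the previous one (`check_duplicate` in `main.py`). The sketch below restates that rule and, next to it, a more tolerant similarity-based variant. The variant is only an assumption about how such a check might look: it is not implemented in the repository, and both the name `is_near_duplicate` and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def is_exact_duplicate(current: List[str], previous: List[str]) -> bool:
    """Rule used by the tool: stop when the OCR line lists are identical."""
    return bool(previous) and current == previous


def is_near_duplicate(current: List[str], previous: List[str],
                      threshold: float = 0.9) -> bool:
    """Hypothetical tolerant variant: compare the joined text by similarity ratio."""
    if not previous:
        return False
    ratio = SequenceMatcher(None, "\n".join(current), "\n".join(previous)).ratio()
    return ratio >= threshold
```

A tolerant check of this kind could help when OCR output for the same pixels varies slightly between captures, at the cost of possibly stopping early on pages with repetitive content.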
### Output
- Screenshots are saved in the `output/` directory
- OCR results are saved to `output/all_results_<timestamp>.json`
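
The results file is a JSON list with one record per capture, each holding `index`, `timestamp`, `image_path` and `texts` (see `save_result` in `main.py`). A minimal sketch of turning it back into one list of lines follows. The function name `merge_texts` is hypothetical, and skipping a line that equals the one just before it is an assumption about what is useful when consecutive captures overlap.

```python
import json
from pathlib import Path
from typing import List


def merge_texts(results_path: Path) -> List[str]:
    """Concatenate the OCR lines of all captures in index order,
    skipping a line when it equals the line just before it."""
    records = json.loads(results_path.read_text(encoding="utf-8"))
    merged: List[str] = []
    for record in sorted(records, key=lambda r: r["index"]):
        for line in record["texts"]:
            if not merged or merged[-1] != line:
                merged.append(line)
    return merged
```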
## Configuration
Edit the `Config` class in `main.py`:
```python
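class Config:
    # Hotkey
    HOTKEY = "ctrl+f9"
    # Image analysis parameters
    GRAY_THRESHOLD = 240  # grayscale threshold (0-255)
    CONSECUTIVE_LINES = 3  # number of consecutive blank rows that counts as a gap
    WHITE_PIXEL_RATIO = 0.9  # fraction of near-white pixels above which a row is blank
    # OCR settings
    OCR_ENGINE = "umi"  # "umi" or "http"
    OCR_API_URL = "http://localhost:8000/ocr"  # used in HTTP mode
    OCR_TIMEOUT = 30  # OCR request timeout (seconds)
    # Umi-OCR settings
    UMI_OCR_HOST = "127.0.0.1"
    UMI_OCR_PORT = 1224
    # Scroll settings
    SCROLL_DELAY = 0.5  # wait after scrolling so the page can render (seconds)
    MAX_SCROLL_COUNT = 100  # maximum number of scrolls
    # Output settings
    OUTPUT_DIR = "output"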
```
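## Core algorithms
### Content block detection
1. **Grayscale conversion**: convert the screenshot to grayscale
2. **Row-by-row scan**: compute the fraction of near-white pixels in each row
3. **Blank test**: a row is blank if more than `WHITE_PIXEL_RATIO` (default 90%) of its pixels have a gray value > `GRAY_THRESHOLD` (default 240)
4. **Consecutive test**: `CONSECUTIVE_LINES` (default 3) or more consecutive blank rows form a gap
5. **Block split**: each run of rows between two gaps is a content block (div)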
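
The five steps above can be sketched with NumPy alone on an array that is already grayscale. This is an illustration under stated assumptions, not the code in `main.py` (which uses OpenCV for the conversion and returns richer result objects); the function name `find_blocks` and its parameter names are hypothetical.

```python
from typing import List, Optional, Tuple

import numpy as np


def find_blocks(gray: np.ndarray, gray_threshold: int = 240,
                white_ratio: float = 0.9, min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of content blocks, bottom exclusive."""
    # Steps 2-3: a row is blank when enough of its pixels are near white.
    blank = (gray > gray_threshold).mean(axis=1) > white_ratio
    blocks: List[Tuple[int, int]] = []
    top: Optional[int] = None  # first row of the open block, None while in a gap
    blank_run = 0              # length of the current run of blank rows
    for row, is_blank in enumerate(blank):
        if is_blank:
            blank_run += 1
            # Step 4: a long enough blank run closes the open block.
            if blank_run == min_gap and top is not None:
                blocks.append((top, row - min_gap + 1))
                top = None
        else:
            if top is None:
                top = row
            blank_run = 0
    if top is not None:
        # The image ended inside a block; trim a short trailing blank run.
        blocks.append((top, len(blank) - blank_run))
    return blocks


if __name__ == "__main__":
    img = np.full((20, 10), 255, dtype=np.uint8)  # white canvas
    img[4:8] = 0    # first dark block
    img[12:15] = 0  # second dark block
    print(find_blocks(img))
```

A blank run shorter than `min_gap` does not split a block, which is what keeps the spacing between text lines from being mistaken for a gap between blocks.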
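### Scroll distance
```
scroll distance = height of the first div + height of the gap after it - overlap
```
The overlap keeps consecutive captures continuous. In `main.py` it is `min(20, div height // 4)` pixels, and the resulting distance is never less than 50 pixels.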
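
As a worked example of the formula, the sketch below mirrors the arithmetic of `calculate_scroll_distance` in `main.py`, where the overlap is capped at 20 pixels and the result is floored at 50 pixels. The standalone function name `scroll_distance` is hypothetical.

```python
def scroll_distance(div_height: int, gap_height: int) -> int:
    """Distance = div height + following gap - overlap, floored at 50 px."""
    overlap = min(20, div_height // 4)
    return max(div_height + gap_height - overlap, 50)


if __name__ == "__main__":
    # Tall block: overlap is capped at 20, so 200 + 30 - 20 = 210.
    print(scroll_distance(200, 30))
    # Short block: overlap is 40 // 4 = 10, so 40 + 10 - 10 = 40, floored to 50.
    print(scroll_distance(40, 10))
```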
## Project structure
```
long-screen-cut/
├── main.py               # main program
├── umi_ocr_client.py     # Umi-OCR HTTP client
├── ocr_server_example.py # example OCR service (Flask)
├── requirements.txt      # Python dependencies
├── .gitignore            # Git ignore rules
└── README.md             # this file
```
## API reference
### UmiOCRClient
```python
from umi_ocr_client import UmiOCRClient
client = UmiOCRClient(host="127.0.0.1", port=1224)
# Check whether the service is running
if client.is_service_running():
    print("Service is running")
# Recognize a screenshot
text = client.recognize_screenshot()
# Recognize an image file
text = client.recognize_image("/path/to/image.png")
# Recognize a batch of images (the entry for a failed image is None)
texts = client.recognize_images(["1.png", "2.png", "3.png"])
```
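
According to `umi_ocr_client.py`, each of these calls is a POST to `/argv` whose body is a JSON array of command-line arguments. The sketch below only assembles that request without sending it; `build_argv_request` is a hypothetical helper, not part of the client.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def build_argv_request(host: str = "127.0.0.1", port: int = 1224,
                       image_path: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return (url, json_body) for a screenshot or an image-file request."""
    url = f"http://{host}:{port}/argv"
    if image_path is None:
        return url, ["--screenshot"]
    # The client resolves the path to an absolute one before sending it.
    return url, ["--path", str(Path(image_path).resolve())]
```

The pair could then be sent with `requests.post(url, json=body, timeout=30)`; the client treats the response body as the recognized plain text.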
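## FAQ
### Q: The program reports that the Umi-OCR service is not running
A: Make sure that:
1. Umi-OCR is running
2. You have opened **Settings → HTTP interface**
3. **Enable HTTP service** is checked
4. The port is set to `1224`
### Q: Region detection is inaccurate
A: Tune the image analysis parameters in `Config`:
- `GRAY_THRESHOLD`: lower it so that light but not pure-white backgrounds (for example light gray) count as blank
- `CONSECUTIVE_LINES`: raise it to reduce falsely detected gaps
- `WHITE_PIXEL_RATIO`: lower it to tolerate more noise in blank rows
### Q: Scrolling is too fast or too slow
A: Adjust `SCROLL_DELAY`:
- Slow-loading web pages: increase the delay
- Local applications: the delay can be reduced
### Q: How do I stop the program?
A:
- Normal stop: press `Ctrl+C`
- Forced stop: close the terminal window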
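## Roadmap
- [ ] Support more OCR engines (PaddleOCR, Tesseract, etc.)
- [ ] GUI
- [ ] Horizontal scrolling
- [ ] Smart deduplication (similarity-based)
- [ ] Export to multiple formats (Markdown, Word, PDF)
## License
MIT License
## Acknowledgements
- [Umi-OCR](https://github.com/hiroi-sora/Umi-OCR) - an excellent offline OCR application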

604
main.py Normal file

@@ -0,0 +1,604 @@
"""
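Scrolling-screenshot OCR tool
Activated by a hotkey: after a region is selected by hand, the tool scrolls, captures, and runs OCR automatically.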
"""
import json
import time
import base64
import io
import tempfile
from dataclasses import dataclass, field
from typing import List, Tuple, Optional, Callable
from pathlib import Path
import cv2
import numpy as np
import requests
from PIL import Image
import pyautogui
import keyboard
import mouse
from loguru import logger
from umi_ocr_client import UmiOCRClient, check_and_wait_for_service
@dataclass
class DivRegion:
    """A div (content block) region."""
    top: int
    bottom: int
    left: int
    right: int
    text: str = ""

    @property
    def height(self) -> int:
        return self.bottom - self.top

    @property
    def width(self) -> int:
        return self.right - self.left


@dataclass
class GapInfo:
    """A blank gap between content blocks."""
    start_row: int
    end_row: int

    @property
    def height(self) -> int:
        return self.end_row - self.start_row


@dataclass
class AnalysisResult:
    """Result of analyzing one image."""
    divs: List[DivRegion] = field(default_factory=list)
    gaps: List[GapInfo] = field(default_factory=list)
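class Config:
    """Configuration."""
    # Hotkey
    HOTKEY = "ctrl+f9"
    # Image analysis parameters
    GRAY_THRESHOLD = 240  # grayscale threshold; values above it count as near white
    CONSECUTIVE_LINES = 3  # number of consecutive blank rows that counts as a gap
    WHITE_PIXEL_RATIO = 0.9  # a row is blank when more than this fraction of its pixels is near white
    # OCR settings
    OCR_ENGINE = "umi"  # OCR engine: "umi" for Umi-OCR, "http" for an HTTP service
    OCR_API_URL = "http://localhost:8000/ocr"  # HTTP OCR service URL (used when OCR_ENGINE is "http")
    OCR_TIMEOUT = 30  # OCR request timeout (seconds)
    # Umi-OCR settings
    UMI_OCR_HOST = "127.0.0.1"
    UMI_OCR_PORT = 1224
    # Scroll settings
    SCROLL_DELAY = 0.5  # time to wait for rendering after a scroll (seconds)
    MAX_SCROLL_COUNT = 100  # maximum number of scrolls, guards against an infinite loop
    # Output settings
    OUTPUT_DIR = "output"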
class RegionSelector:
"""区域选择器 - 用于手动框选截图区域"""
def __init__(self):
self.start_pos: Optional[Tuple[int, int]] = None
self.end_pos: Optional[Tuple[int, int]] = None
self.is_selecting = False
def select_region(self) -> Tuple[int, int, int, int]:
"""
手动选择区域,返回 (left, top, right, bottom)
点击确定左上角,拖动释放确定右下角
"""
logger.info("请按住鼠标左键拖动选择区域...")
print("\n>>> 请按住鼠标左键拖动选择截图区域,释放后确定 <<<")
# 等待鼠标按下
while not mouse.is_pressed(button='left'):
time.sleep(0.01)
self.start_pos = mouse.get_position()
self.is_selecting = True
logger.info(f"选择开始位置: {self.start_pos}")
# 等待鼠标释放
while mouse.is_pressed(button='left'):
time.sleep(0.01)
self.end_pos = mouse.get_position()
self.is_selecting = False
logger.info(f"选择结束位置: {self.end_pos}")
# 计算边界
left = min(self.start_pos[0], self.end_pos[0])
top = min(self.start_pos[1], self.end_pos[1])
right = max(self.start_pos[0], self.end_pos[0])
bottom = max(self.start_pos[1], self.end_pos[1])
logger.info(f"选定区域: ({left}, {top}, {right}, {bottom}), 尺寸: {right-left}x{bottom-top}")
print(f"已选择区域: 左上角({left}, {top}), 右下角({right}, {bottom})")
return left, top, right, bottom
class ImageAnalyzer:
    """Image analyzer: finds div boundaries and blank gaps."""

    def __init__(self, config: Config):
        self.config = config

    def analyze(self, image: np.ndarray) -> AnalysisResult:
        """
        Analyze the image and locate div boundaries
        using a grayscale threshold plus a consecutive-row test.
        """
        result = AnalysisResult()
        # Convert to grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image
        height, width = gray.shape
        logger.debug(f"Analyzing image of size {width}x{height}")
        min_gap = self.config.CONSECUTIVE_LINES
        div_start: Optional[int] = None  # top row of the open div, None while outside a div
        blank_run = 0  # length of the current run of blank rows

        def close_blank_run(end_row: int) -> None:
            """Record the blank run that ends just before end_row as a gap and close the open div."""
            nonlocal div_start
            gap_start = end_row - blank_run
            if div_start is not None:
                div = DivRegion(top=div_start, bottom=gap_start, left=0, right=width)
                result.divs.append(div)
                logger.debug(f"Found div: rows {div.top}-{div.bottom}, height {div.height}")
                div_start = None
            gap = GapInfo(start_row=gap_start, end_row=end_row)
            result.gaps.append(gap)
            logger.debug(f"Found gap: rows {gap.start_row}-{gap.end_row}, height {gap.height}")

        # Scan row by row
        for row in range(height):
            # Fraction of near-white pixels in this row
            white_ratio = np.sum(gray[row] > self.config.GRAY_THRESHOLD) / width
            if white_ratio > self.config.WHITE_PIXEL_RATIO:
                blank_run += 1
                continue
            # Non-blank row: a long enough blank run before it is a gap
            if blank_run >= min_gap:
                close_blank_run(row)
            if div_start is None:
                div_start = row
            blank_run = 0
        # Handle the end of the image
        if blank_run >= min_gap:
            close_blank_run(height)
        elif div_start is not None:
            # The image does not end with a gap: close the last div
            div = DivRegion(top=div_start, bottom=height - blank_run, left=0, right=width)
            result.divs.append(div)
            logger.debug(f"Found trailing div: rows {div.top}-{div.bottom}, height {div.height}")
        logger.info(f"Analysis done: {len(result.divs)} divs, {len(result.gaps)} gaps")
        return result
    def calculate_scroll_distance(self, result: AnalysisResult) -> int:
        """
        Compute the scroll distance from the analysis result.
        Strategy: scroll to the top of the next div.
        """
        if not result.divs:
            logger.warning("No div detected, using the default scroll distance")
            return 100
        first_div = result.divs[0]
        # Start from the height of the first div
        scroll_distance = first_div.height
        # Add the first gap that follows the first div, if there is one
        for gap in result.gaps:
            if gap.start_row >= first_div.bottom:
                scroll_distance += gap.height
                break
        # Subtract a small overlap to keep consecutive captures continuous
        overlap = min(20, first_div.height // 4)
        scroll_distance = max(scroll_distance - overlap, 50)
        logger.info(f"Scroll distance: {scroll_distance} pixels")
        return int(scroll_distance)
class OCREngine:
"""OCR引擎 - 调用OCR服务识别文字"""
def __init__(self, config: Config):
self.config = config
self.umi_client: Optional[UmiOCRClient] = None
if config.OCR_ENGINE == "umi":
self.umi_client = UmiOCRClient(
host=config.UMI_OCR_HOST,
port=config.UMI_OCR_PORT
)
def _recognize_with_http(self, image: np.ndarray) -> List[str]:
"""使用HTTP接口进行OCR识别"""
try:
# 将numpy数组转换为PIL Image
if len(image.shape) == 3:
pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
else:
pil_image = Image.fromarray(image)
# 转换为base64
buffered = io.BytesIO()
pil_image.save(buffered, format="PNG")
img_base64 = base64.b64encode(buffered.getvalue()).decode()
# 调用OCR API
response = requests.post(
self.config.OCR_API_URL,
json={"image": img_base64},
timeout=self.config.OCR_TIMEOUT
)
response.raise_for_status()
data = response.json()
texts = data.get("texts", [])
return texts
except requests.exceptions.ConnectionError:
logger.error(f"无法连接到OCR服务: {self.config.OCR_API_URL}")
return []
except Exception as e:
logger.error(f"HTTP OCR识别失败: {e}")
return []
def _recognize_with_umi(self, image: np.ndarray) -> List[str]:
"""使用Umi-OCR进行识别"""
if not self.umi_client:
logger.error("Umi-OCR客户端未初始化")
return []
try:
# 将图像保存为临时文件
if len(image.shape) == 3:
pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
else:
pil_image = Image.fromarray(image)
# 创建临时文件
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp_file:
tmp_path = tmp_file.name
pil_image.save(tmp_path, format="PNG")
try:
# 调用Umi-OCR识别
text = self.umi_client.recognize_image(tmp_path, timeout=self.config.OCR_TIMEOUT)
if text:
# 按行分割
lines = [line.strip() for line in text.split('\n') if line.strip()]
return lines
return []
finally:
# 删除临时文件
try:
Path(tmp_path).unlink()
except Exception:
pass
except Exception as e:
logger.error(f"Umi-OCR识别失败: {e}")
return []
def recognize(self, image: np.ndarray) -> List[str]:
"""
对图像进行OCR识别
返回识别到的文字列表
"""
if self.config.OCR_ENGINE == "umi":
texts = self._recognize_with_umi(image)
else:
texts = self._recognize_with_http(image)
logger.info(f"OCR识别完成识别到 {len(texts)} 段文字")
return texts
def recognize_divs(self, image: np.ndarray, divs: List[DivRegion]) -> List[str]:
"""
对每个div区域分别进行OCR识别
"""
all_texts = []
for i, div in enumerate(divs):
# 截取div区域
div_image = image[div.top:div.bottom, div.left:div.right]
texts = self.recognize(div_image)
all_texts.extend(texts)
logger.debug(f"Div {i+1} OCR结果: {texts}")
return all_texts
def check_service(self) -> bool:
"""检查OCR服务是否可用"""
if self.config.OCR_ENGINE == "umi":
if not self.umi_client:
return False
return self.umi_client.is_service_running()
else:
try:
response = requests.get(self.config.OCR_API_URL.replace('/ocr', '/health'), timeout=2)
return response.status_code == 200
except Exception:
return False
class ScrollCaptureOCR:
"""滚动截屏OCR主类"""
def __init__(self):
self.config = Config()
self.region_selector = RegionSelector()
self.image_analyzer = ImageAnalyzer(self.config)
self.ocr_engine = OCREngine(self.config)
self.capture_region: Optional[Tuple[int, int, int, int]] = None
self.previous_ocr_result: List[str] = []
self.scroll_count = 0
self.all_results: List[dict] = []
# 创建输出目录
Path(self.config.OUTPUT_DIR).mkdir(exist_ok=True)
def capture_screen(self) -> np.ndarray:
"""截取指定区域的屏幕"""
if not self.capture_region:
raise ValueError("未设置截图区域")
left, top, right, bottom = self.capture_region
screenshot = pyautogui.screenshot(region=(left, top, right - left, bottom - top))
return cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2BGR)
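    def scroll_screen(self, distance: int):
        """Scroll inside the capture region."""
        if not self.capture_region:
            return
        # Move the mouse to the center of the capture region
        left, top, right, bottom = self.capture_region
        center_x = (left + right) // 2
        center_y = (top + bottom) // 2
        pyautogui.moveTo(center_x, center_y)
        time.sleep(0.1)
        # Scroll down. Note: pyautogui.scroll() takes wheel "clicks", not pixels.
        # How far one click moves the content depends on the platform and the
        # application, so the on-screen distance may differ from the computed value.
        pyautogui.scroll(-distance)
        logger.info(f"Scrolled down by {distance} wheel units")
        # Wait for the page to render
        time.sleep(self.config.SCROLL_DELAY)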
def check_duplicate(self, current_texts: List[str]) -> bool:
"""
检查当前OCR结果是否与上一次相同
用于判断是否到达底部
"""
if not self.previous_ocr_result:
return False
# 简单比较:如果文字列表完全相同,认为是重复
is_duplicate = current_texts == self.previous_ocr_result
if is_duplicate:
logger.info("检测到OCR结果重复可能已到达底部")
return is_duplicate
def save_result(self, scroll_index: int, image: np.ndarray, texts: List[str]):
"""保存截图和OCR结果"""
timestamp = time.strftime("%Y%m%d_%H%M%S")
# 保存图片
image_path = Path(self.config.OUTPUT_DIR) / f"capture_{timestamp}_{scroll_index:03d}.png"
cv2.imwrite(str(image_path), image)
# 保存OCR结果
result = {
"index": scroll_index,
"timestamp": timestamp,
"image_path": str(image_path),
"texts": texts
}
self.all_results.append(result)
logger.info(f"保存结果: {image_path}, 识别文字数: {len(texts)}")
def save_final_result(self):
"""保存所有结果到JSON文件"""
output_path = Path(self.config.OUTPUT_DIR) / f"all_results_{time.strftime('%Y%m%d_%H%M%S')}.json"
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(self.all_results, f, ensure_ascii=False, indent=2)
logger.info(f"所有结果已保存到: {output_path}")
print(f"\n所有结果已保存到: {output_path}")
def process_once(self) -> bool:
"""
执行一次处理循环
返回False表示应该停止
"""
logger.info(f"=== 第 {self.scroll_count + 1} 次截屏 ===")
print(f"\n>>> 第 {self.scroll_count + 1} 次截屏处理中...")
# 1. 截取当前屏幕
image = self.capture_screen()
logger.info(f"截图完成,尺寸: {image.shape[1]}x{image.shape[0]}")
# 2. 分析图像定位div边界
analysis = self.image_analyzer.analyze(image)
if not analysis.divs:
logger.warning("未检测到任何div区域可能已到达底部或区域选择有误")
print("警告: 未检测到内容区域")
return False
# 3. OCR提取文字
current_texts = self.ocr_engine.recognize_divs(image, analysis.divs)
print(f"识别到 {len(current_texts)} 段文字")
for i, text in enumerate(current_texts[:3], 1):
preview = text[:50] + "..." if len(text) > 50 else text
print(f" [{i}] {preview}")
if len(current_texts) > 3:
print(f" ... 还有 {len(current_texts) - 3} 段文字")
# 4. 保存结果
self.save_result(self.scroll_count, image, current_texts)
# 5. 判断是否到达底部OCR结果重复
if self.check_duplicate(current_texts):
print("\n>>> 检测到内容重复,已到达底部,处理完成 <<<")
return False
self.previous_ocr_result = current_texts
# 6. 计算滚动距离
scroll_distance = self.image_analyzer.calculate_scroll_distance(analysis)
# 7. 执行滚动
self.scroll_screen(scroll_distance)
self.scroll_count += 1
# 检查最大滚动次数
if self.scroll_count >= self.config.MAX_SCROLL_COUNT:
logger.warning(f"达到最大滚动次数限制 ({self.config.MAX_SCROLL_COUNT})")
print(f"\n>>> 达到最大滚动次数限制,处理完成 <<<")
return False
return True
def run(self):
"""主运行流程"""
print("=" * 60)
print("滚动截屏OCR工具")
print("=" * 60)
print(f"\n使用说明:")
print(f"1. 按下热键 {self.config.HOTKEY} 启动")
print(f"2. 按住鼠标左键拖动选择截图区域")
print(f"3. 程序将自动滚动截屏并进行OCR识别")
print(f"4. 当检测到重复内容时自动停止")
print(f"5. 结果将保存在 '{self.config.OUTPUT_DIR}' 目录")
print("\n" + "=" * 60)
logger.info("程序启动,等待热键触发...")
print(f"\n>>> 等待热键 {self.config.HOTKEY} 启动... <<<")
# 注册热键
keyboard.add_hotkey(self.config.HOTKEY, self._on_hotkey)
# 保持程序运行
try:
while True:
time.sleep(0.1)
except KeyboardInterrupt:
logger.info("程序被用户中断")
print("\n>>> 程序已停止 <<<")
def _on_hotkey(self):
"""热键回调函数"""
logger.info("热键触发,开始处理")
print(f"\n{'='*60}")
print("热键已触发!")
# 检查OCR服务
print("\n>>> 检查OCR服务... <<<")
if not self.ocr_engine.check_service():
if self.config.OCR_ENGINE == "umi":
print("✗ Umi-OCR服务未运行")
print("请先启动Umi-OCR软件并开启HTTP服务")
print(" 1. 打开Umi-OCR")
print(" 2. 进入 设置 -> HTTP接口")
print(" 3. 勾选 '启用HTTP服务'")
print(f" 4. 确保端口为 {self.config.UMI_OCR_PORT}")
else:
print(f"✗ OCR服务未运行: {self.config.OCR_API_URL}")
return
print("✓ OCR服务运行中")
# 选择区域
try:
self.capture_region = self.region_selector.select_region()
except Exception as e:
logger.error(f"区域选择失败: {e}")
print(f"区域选择失败: {e}")
return
# 重置状态
self.previous_ocr_result = []
self.scroll_count = 0
self.all_results = []
print(f"\n>>> 开始自动滚动截屏和OCR识别... <<<")
        # Processing loop
        try:
            while self.process_once():
                pass
        except Exception as e:
            # loguru has no exc_info argument; logger.exception() logs the traceback
            logger.exception(f"Error during processing: {e}")
            print(f"\nError: {e}")
# 保存最终结果
if self.all_results:
self.save_final_result()
print(f"\n共处理 {len(self.all_results)} 次截屏")
print(f"结果保存在: {Path(self.config.OUTPUT_DIR).absolute()}")
print(f"\n{'='*60}")
print(">>> 等待下一次热键触发... <<<")
logger.info("处理完成,等待下一次热键触发")
def main():
"""入口函数"""
app = ScrollCaptureOCR()
app.run()
if __name__ == "__main__":
main()

146
ocr_server_example.py Normal file

@@ -0,0 +1,146 @@
"""
Example OCR service
A simple example OCR HTTP service that uses PaddleOCR or Tesseract as the backend.
Adapt this file to your needs, or use a different OCR service.
Start with: python ocr_server_example.py
Service address: http://localhost:8000
"""
import base64
import io
from typing import List
try:
from flask import Flask, request, jsonify
except ImportError:
print("请先安装Flask: pip install flask")
raise
try:
from PIL import Image
except ImportError:
print("请先安装Pillow: pip install pillow")
raise
app = Flask(__name__)
# 尝试导入OCR引擎按优先级PaddleOCR > Tesseract > 模拟
ocr_engine = None
ocr_type = None
try:
from paddleocr import PaddleOCR
ocr_engine = PaddleOCR(
use_angle_cls=True,
lang='ch',
show_log=False
)
ocr_type = "paddle"
print("使用 PaddleOCR 引擎")
except ImportError:
try:
import pytesseract
ocr_engine = pytesseract
ocr_type = "tesseract"
print("使用 Tesseract OCR 引擎")
except ImportError:
ocr_type = "mock"
print("警告: 未找到OCR引擎使用模拟模式")
print("建议安装 PaddleOCR: pip install paddleocr")
print("或安装 Tesseract + pytesseract: pip install pytesseract")
def recognize_with_paddle(image: Image.Image) -> List[str]:
"""使用PaddleOCR识别"""
import numpy as np
img_array = np.array(image)
result = ocr_engine.ocr(img_array, cls=True)
texts = []
if result and result[0]:
for line in result[0]:
if line:
text = line[1][0] # 提取文字内容
confidence = line[1][1] # 置信度
if confidence > 0.5: # 过滤低置信度结果
texts.append(text)
return texts
def recognize_with_tesseract(image: Image.Image) -> List[str]:
"""使用Tesseract识别"""
text = ocr_engine.image_to_string(image, lang='chi_sim+eng')
# 按行分割
lines = [line.strip() for line in text.split('\n') if line.strip()]
return lines
def recognize_mock(image: Image.Image) -> List[str]:
"""模拟OCR用于测试"""
return ["[模拟OCR] 请安装实际的OCR引擎"]
def recognize_image(image: Image.Image) -> List[str]:
"""根据配置的引擎进行识别"""
if ocr_type == "paddle":
return recognize_with_paddle(image)
elif ocr_type == "tesseract":
return recognize_with_tesseract(image)
else:
return recognize_mock(image)
@app.route('/ocr', methods=['POST'])
def ocr_endpoint():
"""
OCR API端点
接收JSON: {"image": "base64编码的图片"}
返回JSON: {"texts": ["识别到的文字1", "识别到的文字2", ...]}
"""
try:
data = request.get_json()
if not data or 'image' not in data:
return jsonify({"error": "缺少image字段"}), 400
# 解码base64图片
img_base64 = data['image']
img_data = base64.b64decode(img_base64)
image = Image.open(io.BytesIO(img_data))
# 转换为RGB如果是RGBA或其他模式
if image.mode != 'RGB':
image = image.convert('RGB')
# 执行OCR
texts = recognize_image(image)
return jsonify({
"texts": texts,
"count": len(texts),
"engine": ocr_type
})
except Exception as e:
return jsonify({"error": str(e)}), 500
@app.route('/health', methods=['GET'])
def health_check():
"""健康检查端点"""
return jsonify({
"status": "ok",
"engine": ocr_type
})
if __name__ == '__main__':
print("=" * 60)
print("OCR HTTP 服务")
print("=" * 60)
print(f"OCR引擎: {ocr_type}")
print("API地址: http://localhost:8000/ocr")
print("健康检查: http://localhost:8000/health")
print("=" * 60)
print("\n启动服务中...")
app.run(host='0.0.0.0', port=8000, debug=False)

8
requirements.txt Normal file

@@ -0,0 +1,8 @@
opencv-python>=4.8.0
numpy>=1.24.0
pillow>=10.0.0
pyautogui>=0.9.54
keyboard>=0.13.5
mouse>=0.7.1
requests>=2.31.0
loguru>=0.7.0

228
umi_ocr_client.py Normal file

@@ -0,0 +1,228 @@
"""
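Umi-OCR HTTP client
Calls Umi-OCR's argv interface to run OCR.
Umi-OCR interface notes:
- Service address: http://127.0.0.1:1224
- argv interface: POST /argv
- Request format: a JSON array, e.g. ["--screenshot"] or ["--path", "image path"]
- Response format: plain-text string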
"""
import time
import requests
from typing import List, Optional, Union
from pathlib import Path
from loguru import logger
class UmiOCRClient:
"""Umi-OCR HTTP客户端"""
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 1224
def __init__(self, host: str = DEFAULT_HOST, port: int = DEFAULT_PORT):
self.host = host
self.port = port
self.base_url = f"http://{host}:{port}"
self.argv_url = f"{self.base_url}/argv"
def is_service_running(self, timeout: float = 2.0) -> bool:
"""
检查Umi-OCR HTTP服务是否运行
Args:
timeout: 请求超时时间(秒)
Returns:
服务是否可用
"""
try:
response = requests.get(
self.base_url,
timeout=timeout
)
return response.status_code == 200
except requests.exceptions.ConnectionError:
logger.warning(f"无法连接到Umi-OCR服务: {self.base_url}")
return False
except requests.exceptions.Timeout:
logger.warning(f"连接Umi-OCR服务超时: {self.base_url}")
return False
except Exception as e:
logger.error(f"检查Umi-OCR服务状态时出错: {e}")
return False
def recognize_screenshot(self, timeout: float = 30.0) -> Optional[str]:
"""
调用Umi-OCR进行截图识别
等价于命令行: Umi-OCR --screenshot
Args:
timeout: 请求超时时间(秒)
Returns:
识别到的文字失败返回None
"""
if not self.is_service_running():
logger.error("Umi-OCR服务未运行请先启动Umi-OCR")
return None
try:
data = ["--screenshot"]
response = requests.post(
self.argv_url,
headers={"Content-Type": "application/json"},
json=data,
timeout=timeout
)
response.raise_for_status()
text = response.text
logger.info(f"截图OCR完成识别到 {len(text)} 个字符")
return text
except requests.exceptions.Timeout:
logger.error("Umi-OCR请求超时")
return None
except Exception as e:
logger.error(f"Umi-OCR截图识别失败: {e}")
return None
def recognize_image(self, image_path: Union[str, Path], timeout: float = 30.0) -> Optional[str]:
"""
调用Umi-OCR识别指定图片
等价于命令行: Umi-OCR --path "图片路径"
Args:
image_path: 图片文件路径
timeout: 请求超时时间(秒)
Returns:
识别到的文字失败返回None
"""
if not self.is_service_running():
logger.error("Umi-OCR服务未运行请先启动Umi-OCR")
return None
image_path = Path(image_path)
if not image_path.exists():
logger.error(f"图片文件不存在: {image_path}")
return None
try:
# 转换为绝对路径并标准化
abs_path = str(image_path.resolve())
data = ["--path", abs_path]
response = requests.post(
self.argv_url,
headers={"Content-Type": "application/json"},
json=data,
timeout=timeout
)
response.raise_for_status()
text = response.text
logger.info(f"图片OCR完成: {image_path.name}, 识别到 {len(text)} 个字符")
return text
except requests.exceptions.Timeout:
logger.error("Umi-OCR请求超时")
return None
except Exception as e:
logger.error(f"Umi-OCR图片识别失败: {e}")
return None
    def recognize_images(self, image_paths: List[Union[str, Path]], timeout: float = 30.0) -> List[Optional[str]]:
        """
        Recognize a batch of images.
        Args:
            image_paths: list of image paths
            timeout: request timeout per image (seconds)
        Returns:
            List of results; the entry for a failed image is None
        """
        results: List[Optional[str]] = []
        for path in image_paths:
            result = self.recognize_image(path, timeout)
            results.append(result)
            # Small delay to avoid sending requests too quickly
            time.sleep(0.1)
        return results
def check_and_wait_for_service(client: UmiOCRClient, max_wait: float = 10.0, interval: float = 1.0) -> bool:
"""
检查并等待Umi-OCR服务启动
Args:
client: UmiOCRClient实例
max_wait: 最大等待时间(秒)
interval: 检查间隔(秒)
Returns:
服务是否可用
"""
start_time = time.time()
while time.time() - start_time < max_wait:
if client.is_service_running():
logger.info("Umi-OCR服务已就绪")
return True
logger.info("等待Umi-OCR服务启动...")
time.sleep(interval)
logger.error(f"等待Umi-OCR服务超时{max_wait}秒)")
return False
# 便捷函数
def recognize_screenshot(host: str = UmiOCRClient.DEFAULT_HOST,
port: int = UmiOCRClient.DEFAULT_PORT) -> Optional[str]:
"""便捷函数:截图识别"""
client = UmiOCRClient(host, port)
return client.recognize_screenshot()
def recognize_image(image_path: Union[str, Path],
host: str = UmiOCRClient.DEFAULT_HOST,
port: int = UmiOCRClient.DEFAULT_PORT) -> Optional[str]:
"""便捷函数:图片识别"""
client = UmiOCRClient(host, port)
return client.recognize_image(image_path)
if __name__ == "__main__":
# 测试代码
print("=" * 60)
print("Umi-OCR 客户端测试")
print("=" * 60)
client = UmiOCRClient()
# 检查服务状态
print("\n1. 检查服务状态...")
if client.is_service_running():
print("✓ Umi-OCR服务运行中")
else:
print("✗ Umi-OCR服务未运行")
print("请先启动Umi-OCR软件并开启HTTP服务设置->HTTP接口->启用)")
exit(1)
# 测试截图识别
print("\n2. 测试截图识别...")
print("请在5秒内准备好要截图的内容...")
time.sleep(5)
result = client.recognize_screenshot()
if result:
print(f"✓ 识别成功,内容:\n{result[:200]}...")
else:
print("✗ 识别失败")
print("\n" + "=" * 60)
print("测试完成")
print("=" * 60)