Compare commits
46 Commits
389486ad6e
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 498d5110e9 | | |
| | 851d536b59 | | |
| | adc9c76864 | | |
| | 624e158be9 | | |
| | 5bc40abbc1 | | |
| | bd2c457f54 | | |
| | 179bfa327b | | |
| | c2357ffb67 | | |
| | 0d287e7c1f | | |
| | 674ee1e1e2 | | |
| | 0cf231f9f7 | | |
| | f82da3bab1 | | |
| | 22a50ad5c6 | | |
| | 0d9e427a34 | | |
| | ec68b83827 | | |
| | 130bbfb090 | | |
| | 6e83136dc6 | | |
| | f6f4da7d07 | | |
| | a2be43d42a | | |
| | a4c106fa5a | | |
| | f24ca9aa29 | | |
| | a537d3825b | | |
| | e67931c3ca | | |
| | b7cd03434d | | |
| | a9d6c4699d | | |
| | 3984b81f86 | | |
| | d62cd2fcca | | |
| | d44a294bf7 | | |
| | 57e0029eb1 | | |
| | a2ecc7f451 | | |
| | 6ae10c9d36 | | |
| | 20b2f46533 | | |
| | 43ec564daa | | |
| | 8cc25b7c2e | | |
| | a158e3d6bf | | |
| | 71bef2bd06 | | |
| | b62d4ff40d | | |
| | 272f4440fd | | |
| | 1693c1963f | | |
| | e614bfcf93 | | |
| | 28ea813110 | | |
| | 18aff6b945 | | |
| | 9c48648b26 | | |
| | afeb00ccc4 | | |
| | deea6764cf | | |
| | 9e20d439bf | | |
68
.gitignore
vendored
Normal file
@@ -0,0 +1,68 @@
# Python
__pycache__/
*.py[cod]
*$py.class

# Logs
*.log
integrated_product_system.log

# Databases
*.db
*.sqlite

# IDE
.trae/
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Test files
*test*.py
*Test*.py
pytest_cache/
.tox/
.coverage
coverage.xml

# Temporary files
*.tmp
*.temp
temp*.txt
*.bak
*.swp
*.swo
*~
temp_*.txt

# Bug and debug files
*debug*.png
*bug*.txt

# Batch files
*.bat

# Output files
*.out
*.output

# Environment
.env
.env.local
.env.*.local

# Documentation build
_build/
build/
dist/
*.egg-info/

# Other
2025年12月*.txt
*.png
||||
68
.trae/documents/实现用户关注数转换功能.md
Normal file
@@ -0,0 +1,68 @@
## Implementation Plan

### 1. Database schema update

* **Modify the `init_database` method**: add a `follows` column to the `product_analysis` table to store the converted follower count

### 2. Add a follower-count conversion method

* **Create the `convert_user_count_to_number` method**: use the Ollama API to convert the `user_count` text to a number

  * Handle the different formats: "53 followers" → 53, "1.9K followers" → 1900

  * Call the Ollama API to perform the conversion

  * Return the converted number

### 3. Integrate into the existing analysis flow

* **Modify the `get_product_data` method**: include the `user_count` and `url` columns in the query

* **Update the `analyze_products` method**:

  * Extend the handling of returned rows to include `user_count` and `url`

  * Call the conversion method during analysis to process the follower count

  * Pass the converted number to the save method

### 4. Update the save method

* **Modify the `save_analysis_result` method**: add a `follows` parameter and save the converted follower count to the database

### 5. Add a follower-count update for existing records

* **Create the `analyze_follower_counts` method**:

  * Query all products together with their analysis records

  * For each product, convert `user_count` and update `product_analysis.follows`

  * Update the follower count on analysis records that already exist

### 6. Complete the workflow

* **Update the `run_full_workflow_async` method**: add a step 4 that runs the follower-count update

## Expected Results

* The new `product_analysis` table contains a `follows` column that stores the converted numeric follower count

* Newly analyzed products have their follower count converted and saved automatically

* Existing products get their follower count updated by the extra step

* The Ollama API is used to keep the conversion accurate

## Key Technical Points

* SQLite table schema changes

* Ollama API calls and result parsing

* Intelligent text-to-number conversion

* Seamless integration with the existing code

* Batch data processing and updates

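The formats listed in the plan ("53 followers" → 53, "1.9K followers" → 1900) are regular enough that a deterministic parser can serve as a cross-check on, or fallback for, the model-based conversion. The sketch below is not part of the commit; the name `parse_follower_count` and the optional-comma handling are assumptions made for illustration.

```python
import re
from typing import Optional

# Multipliers for the suffixes named in the plan (K = thousand, M = million).
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000}

# A number, an optional K/M suffix, and then no further letter, digit or dot.
_COUNT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KkMm])?(?![A-Za-z0-9.])")


def parse_follower_count(text: Optional[str]) -> Optional[int]:
    """Convert text such as '1.9K followers' to an int, or return None."""
    if not text:
        return None
    match = _COUNT_RE.match(text.replace(",", ""))  # tolerate '1,234 followers'
    if match is None:
        return None
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[(suffix or "").upper()])


print(parse_follower_count("53 followers"))
print(parse_follower_count("1.9K followers"))
```

Inputs the pattern does not recognise return `None`, so a caller could still hand those to the Ollama-based method.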
5785
2025年11月30日91634.txt
File diff suppressed because it is too large
5830
2025年12月3日18450.txt
File diff suppressed because it is too large
5790
2026年1月15日1991.txt
Normal file
File diff suppressed because it is too large
5820
2026年1月17日16419.txt
Normal file
File diff suppressed because it is too large
5800
2026年1月18日9249.txt
Normal file
File diff suppressed because it is too large
5840
2026年1月21日19238.txt
Normal file
File diff suppressed because it is too large
5795
2026年1月22日18556.txt
Normal file
File diff suppressed because it is too large
5855
2026年1月29日20470.txt
Normal file
File diff suppressed because it is too large
5795
2026年1月31日91239.txt
Normal file
File diff suppressed because it is too large
5800
2026年3月10日183431.txt
Normal file
File diff suppressed because it is too large
5810
2026年3月8日18119.txt
Normal file
File diff suppressed because it is too large
286
README.md
@@ -1,21 +1,60 @@
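# TopHub Data Processing System
# TopHub Data Processing and Product Analysis System

This project processes the temporary files scraped from the TopHub website, classifies the data, and stores it in a SQLite database.
This project contains two core modules:
1. TopHub website scraping and data processing system
2. ProductHunt product scraping and AI analysis system

## Features

1. **File parsing**: reads temporary files (named "date+time.txt") and treats every 5 lines as one data unit
2. **Data extraction**: extracts the title and link from each data unit
3. **Smart classification**: calls a local API (Ollama) to classify titles automatically
4. **Deduplication**: checks whether the title + date pair already exists in the database to avoid duplicate entries
5. **Progress display**: shows processing progress with a progress bar
6. **Category standardization**: merges similar categories into standard categories
### TopHub scraping and processing
- **Website scraping**: scrapes data from tophub.today, with support for iterating over a range of node IDs
- **Smart filtering**: automatically skips the columns named in a filter list
- **Data storage**: saves scraped data to a SQLite database
- **Classification**: calls a local API for smart classification
- **Deduplication**: avoids duplicate entries
- **Category standardization**: merges similar categories automatically

### ProductHunt product analysis
- **Product scraping**: scrapes detailed product information from ProductHunt
- **AI analysis**: calls the Ollama API to analyze product development difficulty
- **Data management**: full management of the product database
- **Follower-count conversion**: converts textual follower counts to numbers
- **Difficulty scoring**: automatically computes a development difficulty score
- **Missing-data backfill**: automatically fills in missing product links and scores

### Data visualization
- **GUI viewer**: a visual data viewer built with PySide6
- **Search and filtering**: keyword search and category filtering
- **Category statistics**: shows category statistics in real time
- **Data operations**: batch deletion, marking items as interesting, and score adjustment

## File Descriptions

### Core scripts

1. **process_temp_files.py** - main processing script
1. **tophub_scraper.py** - TopHub website scraping script
   - Scrapes data from tophub.today
   - Filters content using the filter list
   - Saves data to temporary files
   - Calls the data import script

2. **product/integrated_product_system.py** - all-in-one product scraping and analysis system
   - Combines product scraping and AI analysis
   - Queries ProductHunt links from the tophub database
   - Scrapes detailed product information with Playwright
   - Calls the Ollama API to analyze product development difficulty
   - Manages the product database
   - Provides the complete workflow

3. **db_viewer.py** - TopHub data viewer
   - PySide6 GUI application
   - Displays the scraped data stored in the SQLite database
   - Supports search, filtering, and category statistics
   - Supports clickable links and data operations

### Helper scripts

1. **process_temp_files.py** - temporary file processing script
   - Parses temporary files
   - Calls the API for classification
   - Stores the results in the database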
@@ -28,30 +67,76 @@
   - Merges similar categories into standard categories
   - Provides the category mapping rules

### Helper scripts
4. **run_viewer.py** - database viewer launcher script
   - Checks the required packages
   - Launches the SQLite database viewer

1. **check_db.py** - database structure check script
2. **test_api.py** - API test script
3. **view_categories.py** - script for viewing category examples
5. **check_db.py** - database structure check script
6. **test_api.py** - API test script
7. **view_categories.py** - script for viewing category examples

## Usage

### 1. Process temporary files
### 1. TopHub scraping

```bash
python process_temp_files.py
python tophub_scraper.py
```

The script will:
- Scan the current directory for all temporary files (named "date+time.txt")
- Parse the file contents and extract titles and links
- Call the local API to classify the titles
- Check for and skip duplicate data
- Store the results in the tophub_data.db database
- Scrape data from tophub.today
- Filter content using the filter list (configurable in tophub_ban_column.txt)
- Save the scraped data as a temporary file (format: YYYY年MM月DD日HHMMSS.txt)
- Call the data import script to process the scraped results

### 2. Clean up and standardize categories
### 2. ProductHunt product scraping and analysis

```bash
# Run the full workflow: scraping + analysis + data backfill
python product/integrated_product_system.py

# Analyze only, without scraping
python product/integrated_product_system.py --analyze-only

# Limit the maximum number of products to analyze
python product/integrated_product_system.py --max-products 100
```

Main features:
- Queries ProductHunt links from the tophub database
- Scrapes detailed product information with Playwright
- Calls the Ollama API to analyze product development difficulty
- Computes the difficulty score automatically
- Converts the follower count to a number
- Fills in missing product links
- Re-analyzes invalid difficulty scores

### 3. Data visualization

```bash
# Launch the database viewer
python db_viewer.py
```

Or use the launcher script:

```bash
python run_viewer.py
```

Viewer features:
- Displays the scraped data stored in the database
- Supports keyword search and category filtering
- Shows category statistics in real time
- Opens links in the browser on click
- Supports batch deletion and score adjustment

### 4. Category processing

```bash
# Process temporary files
python process_temp_files.py

# Remove special characters from categories
python cleanup_categories.py

@@ -59,74 +144,118 @@ python cleanup_categories.py
python standardize_categories.py
```

### 3. View data

```bash
# View category examples
python view_categories.py

# Check the database structure
python check_db.py
```

## Database Structure

The database file is `tophub_data.db` and contains the following tables:
### 1. TopHub database (tophub_data.db)

1. **tophub_entries** - main data table
   - id: primary key
   - text_content: title text (not null)
   - link: link
   - category: category
   - scrape_time: scrape time
Contains the raw data scraped from the TopHub website:

2. **classification_progress** - classification progress table
   - id: primary key
   - total_count: total number of items
   - processed_count: number of items processed
   - last_updated: time of last update
- **articles** - main data table
  - id: primary key
  - title: title text
  - url: link
  - category: category
  - source_date: source date
  - score: score
  - is_interested: whether the item is marked as interesting

- **classification_progress** - classification progress table
  - id: primary key
  - total_count: total number of items
  - processed_count: number of items processed
  - last_updated: time of last update

### 2. Product analysis database (products.db)

Contains detailed information and analysis results for ProductHunt products:

- **products** - product information table
  - id: primary key
  - url: product link (unique)
  - name: product name
  - introduction: product introduction
  - user_count: user count
  - maker_link: maker link
  - maker_statement: maker statement
  - created_at: creation time
  - updated_at: update time

- **product_analysis** - product analysis results table
  - id: primary key
  - original_name: original product name
  - product_intro: product introduction
  - development_difficulty: development difficulty description
  - ai_response: raw AI response
  - difficulty_score: difficulty score
  - product_link: product link
  - follows: follower count
  - created_at: creation time

## API Configuration

The scripts use the local Ollama API for classification:
- API endpoint: http://localhost:11434/api/generate
- Model: gemma3:4b
- Request format: JSON
The project uses the local Ollama API for AI-related tasks:
- **API endpoint**: http://localhost:11434/api/generate
- **Model**: qwen3:8b
- **Request format**: JSON

Main uses:
1. **TopHub data classification**: classifies scraped titles
2. **Product development difficulty analysis**: analyzes how hard a ProductHunt product is to build
3. **Follower-count conversion**: converts textual follower counts to numbers
4. **Difficulty score calculation**: computes the product development difficulty score automatically

## Core Dependencies

### Base dependencies
- requests: HTTP requests
- sqlite3: database operations
- loguru: logging
- tqdm: progress bars

### Product analysis dependencies
- asyncio: asynchronous programming
- playwright: web scraping
- PySide6: GUI (viewer only)

## Log Files

The system generates the following log files:
- **tophub_scraper.log** - TopHub scraping log
- **integrated_product_system.log** - product analysis system log
- **process_temp_files.log** - temporary file processing log
- **cleanup_categories.log** - category cleanup log
- **standardize_categories.log** - category standardization log

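## Category Standards

The system supports the following standard categories:

1. Technology (科技) - emerging technology, the internet, etc.
2. Society (社会) - social news, everyday services, etc.
3. Sports (体育) - sports news, football, etc.
4. History (历史) - historical events, historical figures, etc.
5. Security (安全) - security vulnerabilities, security technology, etc.
6. Military (军事) - military news, national defense, etc.
7. Finance (金融) - financial news, market analysis, etc.
8. Shopping (购物) - e-commerce, shopping, etc.
9. Games (游戏) - game news, etc.
10. Entertainment (娱乐) - celebrity gossip, music, etc.
11. Health (健康) - healthcare, healthy living, etc.
1. Technology (科技) - emerging technology, the internet, artificial intelligence, etc.
2. Society (社会) - social news, everyday services, trending events, etc.
3. Sports (体育) - sports news, football, basketball, etc.
4. History (历史) - historical events, historical figures, archaeological finds, etc.
5. Security (安全) - security vulnerabilities, network security, data security, etc.
6. Military (军事) - military news, national defense, weapons and equipment, etc.
7. Finance (金融) - financial news, market analysis, investing, etc.
8. Shopping (购物) - e-commerce, shopping, consumer spending, etc.
9. Games (游戏) - game news, game development, game reviews, etc.
10. Entertainment (娱乐) - celebrity gossip, music, film and television, etc.
11. Health (健康) - healthcare, healthy living, fitness, etc.
12. Other (其他) - anything not otherwise classified

## Notes

1. Make sure the local Ollama service is running and reachable
2. Temporary files must be named "date+time.txt"
3. Each data unit contains 5 lines: node ID, category, title, link, and a separator line
4. The database file is created automatically; there is no need to create it by hand

## Log Files

The system generates the following log files:
- process_temp_files.log - main processing log
- cleanup_categories.log - category cleanup log
- standardize_categories.log - category standardization log
1. **Ollama service**: make sure the local Ollama service is running and reachable (default port 11434)
2. **Chrome browser**: product scraping requires an already running Chrome instance (debugging port 9222)
3. **Temporary file format**: the temporary files produced by TopHub scraping are named "YYYY年MM月DD日HHMMSS.txt"
4. **Data unit structure**: each data unit contains 5 lines: node ID, category, title, link, and a separator line
5. **Automatic database creation**: all database files are created automatically; there is no need to create them by hand
6. **Installing dependencies**: before using the GUI viewer, install its dependencies: `pip install -r requirements_gui.txt`
7. **Filter list configuration**: edit tophub_ban_column.txt to configure the columns to filter out

## Examples

### Temporary file format example
### TopHub scraped temporary file example

```
节点ID: 102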
@@ -141,9 +270,18 @@ python check_db.py
--------------------------------------------------
```

### Processing result example
### Product analysis result example

```
标题 '女机器人' 分类为: 科技
标题 '这个应该属于底盘不行吗' 分类为: 其他
```
产品 'AI Assistant' 分析完成
- 难度描述: 中等难度,需要一定的AI开发经验
- 难度分数: 60/100
- 关注数: 1500
```

### Database viewer interface

- Shows all scraped data, with live search and filtering
- Category statistics are shown at the top
- Clicking a link opens it directly in the browser
- The context menu supports batch operations and score adjustment
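The README notes state that each data unit in a scraped temporary file has 5 lines (node ID, category, title, link, separator). A reader might sketch a parser for that layout as follows. Only the `节点ID:` label and the dashed separator appear in the README excerpt; the other three labels in the sample, the English key names, and the function name `parse_units` are assumptions, and this is not the project's own `process_temp_files.py` logic.

```python
from typing import Dict, List

# Order of the four value lines as listed in the README notes.
FIELD_KEYS = ("node_id", "category", "title", "link")


def parse_units(text: str) -> List[Dict[str, str]]:
    """Split scraper output into units of four 'label: value' lines plus a separator."""
    units: List[Dict[str, str]] = []
    current: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if set(line) == {"-"}:  # a line made only of dashes closes the unit
            if len(current) == len(FIELD_KEYS):
                # Keep everything after the first colon, so URLs stay intact.
                values = [item.partition(":")[2].strip() for item in current]
                units.append(dict(zip(FIELD_KEYS, values)))
            current = []
        else:
            current.append(line)
    return units


sample = "\n".join([
    "节点ID: 102",
    "分类: 科技",                       # label text assumed
    "标题: 女机器人",                   # label text assumed
    "链接: https://example.com/item",   # label text and URL assumed
    "-" * 50,
])
print(parse_units(sample))
```

Units with a different number of lines are dropped rather than guessed at, which keeps a truncated file from shifting fields into the wrong columns.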
BIN
debug_maker_link_failure.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 526 KiB |
File diff suppressed because it is too large
Load Diff
Binary file not shown.
|
Before Width: | Height: | Size: 223 KiB After Width: | Height: | Size: 231 KiB |
Binary file not shown.
@@ -112,10 +112,18 @@ class IntegratedProductSystem:
                    ai_response TEXT,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    difficulty_score INTEGER,
                    product_link TEXT
                    product_link TEXT,
                    follows INTEGER
                )
            ''')

            # Add the follows column to an existing table (if it is missing)
            cursor.execute("PRAGMA table_info(product_analysis)")
            columns = [col[1] for col in cursor.fetchall()]
            if 'follows' not in columns:
                cursor.execute("ALTER TABLE product_analysis ADD COLUMN follows INTEGER")
                logger.info("已为product_analysis表添加follows字段")

            conn.commit()
            conn.close()
            logger.success("产品数据库初始化完成")
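The hunk above checks `PRAGMA table_info` before running `ALTER TABLE`, which makes the migration safe to run repeatedly. The same pattern can be tried in isolation against an in-memory database. The helper name `ensure_column` and the reduced two-column table are illustrative assumptions, not code from the repository.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing; return True when it was added.

    `table`, `column` and `decl` are interpolated into SQL, so they must be
    trusted constants, never user input.
    """
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_analysis (id INTEGER PRIMARY KEY, original_name TEXT)")

first_run = ensure_column(conn, "product_analysis", "follows", "INTEGER")
second_run = ensure_column(conn, "product_analysis", "follows", "INTEGER")
print(first_run, second_run)
```

Running the helper twice shows the point of the check: the second call finds the column and leaves the table alone instead of raising a duplicate-column error.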
@@ -262,9 +270,9 @@ class IntegratedProductSystem:
        try:
            cursor = conn.cursor()

            # Query the name and introduction columns of the products table
            # Query the id, name, introduction, user_count and url columns of the products table
            cursor.execute("""
                SELECT id, name, introduction
                SELECT id, name, introduction, user_count, url
                FROM products
                WHERE name IS NOT NULL AND introduction IS NOT NULL
                AND name != '' AND introduction != ''
@@ -274,8 +282,8 @@ class IntegratedProductSystem:
            logger.info(f"从数据库获取到 {len(products)} 个产品")

            # Show the first few products as examples
            for i, (id, name, intro) in enumerate(products[:3], 1):
                logger.info(f"示例产品{i}: ID={id}, 名称='{name}', 简介='{intro[:50]}...'")
            for i, (id, name, intro, user_count, url) in enumerate(products[:3], 1):
                logger.info(f"示例产品{i}: ID={id}, 名称='{name}', 简介='{intro[:50]}...', 用户数='{user_count}', URL='{url}'")

            return products

@@ -321,6 +329,64 @@ class IntegratedProductSystem:
            logger.error(f"调用Ollama AI API时出错: {e}")
            return None

    def convert_user_count_to_number(self, user_count: str) -> Optional[int]:
        """Use the Ollama API to convert the user_count text to a number.

        Args:
            user_count: user-count text such as "53 followers" or "1.9K followers"

        Returns:
            The converted number, or None if the conversion fails
        """
        if not user_count or user_count.strip() == "":
            logger.info(f"空的用户数量: {user_count}")
            return None

        try:
            logger.info(f"正在转换用户数量: {user_count}")

            # Build the request payload, dedicated to user-count conversion
            prompt = f"请将以下用户数量文本转换为纯数字,不要包含任何其他内容:\n{user_count}\n\n转换规则:\n- 直接数字:如'53 followers' → 53\n- K表示千:如'1.9K followers' → 1900\n- M表示百万:如'2.5M followers' → 2500000\n- 只返回数字,不要添加任何单位或解释"

            data = {
                "model": "qwen3:8b",
                "prompt": prompt,
                "stream": False
            }

            headers = {
                "Content-Type": "application/json"
            }

            # Call the Ollama API
            response = requests.post(
                self.api_url,
                headers=headers,
                data=json.dumps(data, ensure_ascii=False),
                timeout=30
            )

            if response.status_code == 200:
                result = response.json()
                converted = result.get("response", "").strip()
                logger.success(f"成功转换用户数量: {user_count} → {converted}")

                # Extract the bare number
                import re
                number_match = re.search(r'\d+(?:\.\d+)?', converted)
                if number_match:
                    return int(float(number_match.group()))
                else:
                    logger.error(f"无法从转换结果中提取数字: {converted}")
                    return None
            else:
                logger.error(f"Ollama API调用失败: {response.status_code}, {response.text}")
                return None

        except Exception as e:
            logger.error(f"转换用户数量时出错: {e}")
            return None

    def parse_ai_response(self, response: str) -> Tuple[str, str, str, int]:
        """Parse the AI response and extract the product name, introduction, difficulty description and difficulty score."""
        try:
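`convert_user_count_to_number` takes the first number found in the model's reply via `re.search(r'\d+(?:\.\d+)?', ...)`. If the reply is not a bare integer (for example, the model echoes "1.9K", or prefixes reasoning text that contains digits), that first match can be the wrong value. A stricter post-processing step might look like the sketch below. The function name is hypothetical, and the `<think>...</think>` stripping assumes the reply may include such a block, which depends on the model and server settings.

```python
import re
from typing import Optional

# Assumed shape of an inline reasoning block; not confirmed by the source.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_plain_integer(reply: str) -> Optional[int]:
    """Accept the reply only if, after cleanup, it is nothing but digits."""
    cleaned = _THINK_BLOCK.sub("", reply).strip().replace(",", "")
    if re.fullmatch(r"\d+", cleaned):
        return int(cleaned)
    return None


# What the first-number search from the hunk yields on an echoed input:
first_number = int(float(re.search(r"\d+(?:\.\d+)?", "1.9K").group()))

print(first_number)
print(extract_plain_integer("1.9K"))
print(extract_plain_integer("<think>1.9K is 1.9 * 1000</think>\n1900"))
```

Rejecting anything that is not purely numeric trades a few missed conversions for never storing a misread value in `follows`; rows that come back as `None` stay eligible for a later pass.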
@@ -398,8 +464,8 @@ class IntegratedProductSystem:
    def save_analysis_result(self, conn: sqlite3.Connection,
                            original_name: str, difficulty: str,
                            ai_response: str, difficulty_score: int = None,
                            product_link: str = None):
        """Save the analysis result to the database, including the difficulty score and product link"""
                            product_link: str = None, follows: int = None):
        """Save the analysis result to the database, including the difficulty score, product link and follower count"""
        try:
            cursor = conn.cursor()

@@ -409,12 +475,12 @@ class IntegratedProductSystem:

            cursor.execute("""
                INSERT INTO product_analysis
                (original_name, development_difficulty, difficulty_score, ai_response, product_link)
                VALUES (?, ?, ?, ?, ?)
            """, (original_name, difficulty, difficulty_score, ai_response, product_link))
                (original_name, development_difficulty, difficulty_score, ai_response, product_link, follows)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (original_name, difficulty, difficulty_score, ai_response, product_link, follows))

            conn.commit()
            logger.success(f"保存分析结果成功: {original_name}, 难度分数: {difficulty_score}")
            logger.success(f"保存分析结果成功: {original_name}, 难度分数: {difficulty_score}, 关注数: {follows}")

        except Exception as e:
            logger.error(f"保存分析结果失败: {e}")
@@ -450,7 +516,7 @@ class IntegratedProductSystem:
            # Analyze the products one by one
            success_count = 0
            skip_count = 0
            for i, (original_id, name, introduction) in enumerate(products_to_analyze, 1):
            for i, (original_id, name, introduction, user_count, url) in enumerate(products_to_analyze, 1):
                logger.info(f"\n分析进度: {i}/{len(products_to_analyze)} - {name}")

                # Check whether the product already exists
@@ -462,7 +528,7 @@ class IntegratedProductSystem:
                # Show the API call status
                logger.info(f"正在提交API请求... 进度: {i}/{len(products_to_analyze)}")

                # Call the AI API
                # Call the AI API to analyze the product
                ai_response = self.call_ollama_ai_api(name, introduction)

                if ai_response:
@@ -472,8 +538,13 @@ class IntegratedProductSystem:
                    # Parse the response
                    product_intro, difficulty, difficulty_score = self.parse_ai_response(ai_response)

                    # Save the result (product_intro is no longer saved, to avoid duplicating ai_response)
                    self.save_analysis_result(conn, name, difficulty, ai_response, difficulty_score)
                    # Convert the follower count
                    follows = None
                    if user_count:
                        follows = self.convert_user_count_to_number(user_count)

                    # Save the result
                    self.save_analysis_result(conn, name, difficulty, ai_response, difficulty_score, url, follows)
                    success_count += 1

                    # Show the completion status
@@ -660,8 +731,289 @@ class IntegratedProductSystem:
                conn.close()
                logger.info("数据库连接已关闭")

    def analyze_follower_counts(self):
        """Analyze and update product follower counts; a row is updated only when its follows field is empty or missing"""
        logger.info("=== 开始分析产品关注数 ===")

        conn = None
        try:
            # Connect to the database
            conn = self.connect_to_database()
            cursor = conn.cursor()

            # Query all products with their analysis records, limited to records whose follows field is empty or missing
            cursor.execute("""
                SELECT p.id, p.name, p.user_count, pa.id as analysis_id, pa.follows
                FROM products p
                LEFT JOIN product_analysis pa ON p.name = pa.original_name
                WHERE p.user_count IS NOT NULL AND p.user_count != ''
                AND pa.id IS NOT NULL
                AND (pa.follows IS NULL OR pa.follows = '')
            """)

            products = cursor.fetchall()
            logger.info(f"找到 {len(products)} 个需要更新关注数的产品")

            if not products:
                logger.info("没有发现需要更新关注数的产品")
                return

            # Convert user_count for each product and write it to product_analysis.follows
            updated_count = 0
            for i, (product_id, name, user_count, analysis_id, current_follows) in enumerate(products, 1):
                logger.info(f"处理产品关注数 {i}/{len(products)}: {name}, 用户数: {user_count}")

                if not analysis_id:
                    logger.info(f"产品 '{name}' 没有对应的分析记录,跳过")
                    continue

                # Convert the follower count
                follows = self.convert_user_count_to_number(user_count)

                # Update the follower count
                if follows is not None:
                    cursor.execute("""
                        UPDATE product_analysis
                        SET follows = ?
                        WHERE id = ?
                    """, (follows, analysis_id))
                    conn.commit()
                    updated_count += 1
                    logger.success(f"成功更新产品 '{name}' 的关注数为 {follows}")
                else:
                    logger.warning(f"无法为产品 '{name}' 转换关注数")

                # Avoid calling the API too frequently
                if i < len(products):
                    time.sleep(2)

            logger.success(f"关注数分析完成! 成功更新 {updated_count} 个产品的关注数")

        except Exception as e:
            logger.error(f"分析关注数过程中出错: {e}")
        finally:
            if conn:
                conn.close()
                logger.info("数据库连接已关闭")

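The query in `analyze_follower_counts` uses a `LEFT JOIN` and then requires `pa.id IS NOT NULL`, which selects the same rows as an inner join. The sketch below reproduces the selection on an in-memory database with a plain `JOIN`. The three sample products are invented for illustration, and the reduced schema keeps only the columns the query touches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, user_count TEXT);
    CREATE TABLE product_analysis (id INTEGER PRIMARY KEY, original_name TEXT, follows INTEGER);
    INSERT INTO products (name, user_count) VALUES
        ('Alpha', '53 followers'),    -- analyzed, follows still NULL
        ('Beta',  '1.9K followers'),  -- analyzed, follows already set
        ('Gamma', '');                -- empty user_count, never analyzed
    INSERT INTO product_analysis (original_name, follows) VALUES
        ('Alpha', NULL),
        ('Beta', 1900);
""")

# Rows that still need a follower count: an analysis row exists but follows is NULL.
pending = conn.execute("""
    SELECT p.name, p.user_count, pa.id AS analysis_id
    FROM products p
    JOIN product_analysis pa ON pa.original_name = p.name
    WHERE p.user_count IS NOT NULL AND p.user_count != ''
      AND pa.follows IS NULL
""").fetchall()
print(pending)
```

The sketch drops the `pa.follows = ''` test on the assumption that `follows` only ever holds integers or NULL. SQLite does allow text in an INTEGER column, so the original check is still meaningful if empty strings were ever written there.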
    def reanalyze_invalid_difficulty_scores(self):
        """Re-analyze rows whose difficulty_score is 1 so that the difficulty score is accurate"""
        logger.info("=== 开始重新分析无效难度评分 ===")

        conn = None
        try:
            # Connect to the database
            conn = self.connect_to_database()
            cursor = conn.cursor()

            # Query the records whose difficulty_score is 1
            cursor.execute("""
                SELECT id, original_name, product_intro, development_difficulty, ai_response
                FROM product_analysis
                WHERE difficulty_score = 1
            """)

            invalid_records = cursor.fetchall()
            logger.info(f"找到 {len(invalid_records)} 条difficulty_score为1的记录需要重新分析")

            if not invalid_records:
                logger.info("没有发现需要重新分析的无效难度评分记录")
                return

            # Re-analyze the difficulty of each invalid record
            updated_count = 0
            for i, (analysis_id, name, introduction, development_difficulty, ai_response) in enumerate(invalid_records, 1):
                logger.info(f"重新分析记录 {i}/{len(invalid_records)}: {name}")

                # Call the AI API again to analyze the product difficulty
                logger.info(f"重新调用Ollama API分析产品难度: {name}")

                # Build the request payload in the Ollama API format, dedicated to difficulty analysis
                prompt = f"这个是【{name}】,简介内容是【{introduction}】。请重新分析这个产品的开发难度,特别是对于一个人加上AI辅助能否开发这个产品,请详细回答。返回的内容是产品名称/产品简介/开发难度。返回的例子一:notion/这个是笔记产品等等/一个人开发难度较高"

                data = {
                    "model": "qwen3:8b",
                    "prompt": prompt,
                    "stream": False
                }

                headers = {
                    "Content-Type": "application/json"
                }

                try:
                    # Call the Ollama API
                    response = requests.post(
                        self.api_url,
                        headers=headers,
                        data=json.dumps(data, ensure_ascii=False),
                        timeout=60
                    )

                    if response.status_code == 200:
                        result = response.json()
                        new_ai_response = result.get("response", "").strip()
                        logger.success(f"成功重新分析产品 '{name}'")

                        # Parse the new response to get the difficulty score
                        _, new_difficulty, new_difficulty_score = self.parse_ai_response(new_ai_response)

                        # Special handling for very hard products: keep the score between 70 and 90
                        difficulty_lower = new_difficulty.lower()
                        if any(keyword in difficulty_lower for keyword in ['高', '很难', '非常难', '复杂', '困难']):
                            if new_difficulty_score < 70:
                                new_difficulty_score = max(70, min(90, new_difficulty_score + 60))
                                logger.info(f"调整很难产品的难度分数为: {new_difficulty_score} (70-90区间)")

                        # Update the database record
                        cursor.execute("""
                            UPDATE product_analysis
                            SET development_difficulty = ?,
                                difficulty_score = ?,
                                ai_response = ?
                            WHERE id = ?
                        """, (new_difficulty, new_difficulty_score, new_ai_response, analysis_id))

                        conn.commit()
                        updated_count += 1
                        logger.success(f"成功更新产品 '{name}' 的难度分数为 {new_difficulty_score}")
                    else:
                        logger.error(f"API调用失败: {response.status_code}, {response.text}")
                except Exception as e:
                    logger.error(f"重新分析产品 '{name}' 失败: {e}")

                # Avoid calling the API too frequently
                if i < len(invalid_records):
                    time.sleep(2)

            logger.success(f"无效难度评分重新分析完成! 成功更新 {updated_count} 条记录")

        except Exception as e:
            logger.error(f"重新分析无效难度评分过程中出错: {e}")
        finally:
            if conn:
                conn.close()
                logger.info("数据库连接已关闭")

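The score adjustment in `reanalyze_invalid_difficulty_scores` is `max(70, min(90, score + 60))`, applied only when the parsed score is below 70 and the difficulty text contains one of the "hard" keywords. Isolating the arithmetic makes the resulting range easier to see. The function name `adjust_hard_score` is introduced here for illustration only.

```python
def adjust_hard_score(score: int) -> int:
    """Mirror the adjustment from the hunk: lift low scores into the 70-90 band."""
    if score < 70:
        return max(70, min(90, score + 60))
    return score  # scores of 70 or more are left untouched


for raw in (1, 10, 25, 45, 69, 70, 95):
    print(raw, "->", adjust_hard_score(raw))
```

Any input of 10 or less lands on 70 and any input from 30 to 69 lands on 90, so only inputs between 11 and 29 keep their relative order. Scores that already start at 70 or above pass through unchanged, which means a value such as 95 stays outside the 70-90 band.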
    def fill_missing_product_links(self):
        """Check whether product_link in the product_analysis table is empty and, if so, fill it in from tophub_data.db"""
        logger.info("=== 开始补全缺失的product_link字段 ===")

        # Check that tophub_data.db exists
        tophub_db_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "tophub_data.db")
        if not os.path.exists(tophub_db_path):
            logger.error(f"tophub_data.db不存在: {tophub_db_path}")
            return

        conn_product = None
        conn_tophub = None
        try:
            # Connect to both databases
            conn_product = self.connect_to_database()
            cursor_product = conn_product.cursor()

            conn_tophub = sqlite3.connect(tophub_db_path)
            cursor_tophub = conn_tophub.cursor()
            logger.success(f"成功连接到tophub_data.db: {tophub_db_path}")

            # Query the records whose product_link is empty
            cursor_product.execute("""
                SELECT id, original_name
                FROM product_analysis
                WHERE product_link IS NULL OR product_link = ''
            """)

            missing_link_records = cursor_product.fetchall()
            logger.info(f"找到 {len(missing_link_records)} 条product_link为空的记录需要补全")

            if not missing_link_records:
                logger.info("没有发现需要补全product_link的记录")
                return

            # Fetch all producthunt links from tophub_data.db
            cursor_tophub.execute("SELECT url FROM articles WHERE url LIKE '%producthunt.com%'")
            tophub_urls = [row[0] for row in cursor_tophub.fetchall()]
            logger.info(f"从tophub_data.db获取到 {len(tophub_urls)} 个producthunt链接")

            if not tophub_urls:
                logger.error("从tophub_data.db中没有找到producthunt链接")
                return

            # Look for a matching URL for each record that lacks a product_link
            updated_count = 0
            for i, (analysis_id, original_name) in enumerate(missing_link_records, 1):
                logger.info(f"处理记录 {i}/{len(missing_link_records)}: {original_name}")

                # Look for a matching URL
                matched_url = None

                # Convert the product name to a URL-friendly form
                import re
                # Remove special characters and replace spaces and dots with hyphens
                url_friendly_name = original_name.lower()
                # Remove common special characters
                url_friendly_name = re.sub(r'[^a-zA-Z0-9\s.-]', '', url_friendly_name)
                # Replace spaces and dots with hyphens
                url_friendly_name = re.sub(r'[\s.]+', '-', url_friendly_name)
                # Collapse repeated hyphens
                url_friendly_name = re.sub(r'-+', '-', url_friendly_name).strip('-')

                logger.debug(f"URL友好名称: {url_friendly_name}")

                # Try several matching strategies
                for url in tophub_urls:
                    url_lower = url.lower()

                    # Strategy 1: the URL-friendly name matches the product part of the URL path
                    if url_friendly_name in url_lower:
                        matched_url = url
                        logger.debug(f"匹配方式1成功: {url}")
                        break

                    # Strategy 2: check whether the URL contains the main parts of the product name (split on hyphens)
                    name_parts = url_friendly_name.split('-')
                    # If the name has at least 2 parts, check whether the first two are both in the URL
                    if len(name_parts) >= 2:
                        if name_parts[0] in url_lower and name_parts[1] in url_lower:
                            matched_url = url
                            logger.debug(f"匹配方式2成功: {url}")
                            break

                    # Strategy 3: check whether the main words of the product name are in the URL (for longer names)
                    if len(name_parts) > 3:
                        # Check the first 3 main parts
                        if all(part in url_lower for part in name_parts[:3]):
                            matched_url = url
                            logger.debug(f"匹配方式3成功: {url}")
                            break

                if matched_url:
                    # Update the product_link field
                    cursor_product.execute("""
                        UPDATE product_analysis
                        SET product_link = ?
                        WHERE id = ?
                    """, (matched_url, analysis_id))
                    conn_product.commit()
                    updated_count += 1
                    logger.success(f"成功为产品 '{original_name}' 补全链接: {matched_url}")
                else:
                    logger.warning(f"无法为产品 '{original_name}' 找到匹配的链接")

            logger.success(f"product_link补全完成! 成功更新 {updated_count} 条记录")

        except Exception as e:
            logger.error(f"补全product_link过程中出错: {e}")
        finally:
            # Close the database connections
            if conn_product:
                conn_product.close()
            if conn_tophub:
                conn_tophub.close()
            logger.info("数据库连接已关闭")

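`fill_missing_product_links` turns a product name into a slug and then tests whether the slug occurs anywhere in each URL. Substring tests accept short slugs inside longer ones, and a name with no ASCII letters or digits reduces to an empty slug, which is a substring of every URL. The sketch below keeps the same three slug substitutions but compares against the last path segment only. The example URLs and the function names are hypothetical, and the real Product Hunt URL layout is not confirmed by the source.

```python
import re
from typing import Iterable, Optional
from urllib.parse import urlparse


def slugify(name: str) -> str:
    """Same three substitutions as the hunk: strip, hyphenate, collapse."""
    slug = re.sub(r"[^a-z0-9\s.-]", "", name.lower())
    slug = re.sub(r"[\s.]+", "-", slug)
    return re.sub(r"-+", "-", slug).strip("-")


def find_link(name: str, urls: Iterable[str]) -> Optional[str]:
    """Return the first URL whose last path segment equals the slug exactly."""
    slug = slugify(name)
    if not slug:  # e.g. a name written entirely in non-ASCII characters
        return None
    for url in urls:
        last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1].lower()
        if last_segment == slug:
            return url
    return None


urls = [
    "https://www.producthunt.com/posts/ai-notes",  # hypothetical
    "https://www.producthunt.com/posts/ai",        # hypothetical
]
print(slugify("Quick Widgets 2.0"))
print(find_link("AI", urls))
print(find_link("产品", urls))
```

Exact segment matching will miss products whose URL slug differs from the name, so it suits a first pass, with looser strategies reserved for the records it leaves unmatched.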
    async def run_full_workflow_async(self, max_products=None, analyze_only=False):
        """Run the full workflow asynchronously: scraping + analysis + filling in missing scores"""
        """Run the full workflow asynchronously: scraping + analysis + filling in missing scores + updating follower counts + re-analyzing invalid difficulty scores + filling in product_link"""
        logger.info("=== 开始全功能产品系统工作流程 ===")

        # Initialize the database
@@ -682,6 +1034,18 @@ class IntegratedProductSystem:
        logger.info("步骤3: 开始分析并补充缺失的难度分数...")
        self.analyze_missing_scores()

        # Step 4: analyze and update product follower counts
        logger.info("步骤4: 开始分析并更新产品关注数...")
        self.analyze_follower_counts()

        # Step 5: re-analyze invalid difficulty scores
        logger.info("步骤5: 开始重新分析invalid难度评分...")
        self.reanalyze_invalid_difficulty_scores()

        # Step 6: fill in missing product_link fields
        logger.info("步骤6: 开始补全缺失的product_link字段...")
        self.fill_missing_product_links()

        logger.success("=== 全功能产品系统工作流程完成 ===")

    def run_full_workflow(self, max_products=None, analyze_only=False):
@@ -723,11 +1087,11 @@ async def main():
        chrome_bat_path = os.path.join(os.path.dirname(__file__), "start_chrome.bat")
        logger.info(f"正在运行Chrome启动脚本: {chrome_bat_path}")
        try:
            # Run the batch file and wait for it to finish
            subprocess.run([chrome_bat_path], check=True, shell=True)
            logger.success("Chrome启动脚本执行成功")
        except subprocess.CalledProcessError as e:
            logger.error(f"Chrome启动脚本执行失败: {e}")
            # Run the batch file asynchronously, without waiting for it to finish
            subprocess.Popen([chrome_bat_path], shell=True)
            logger.success("Chrome启动脚本已启动")
        except Exception as e:
            logger.error(f"Chrome启动脚本启动失败: {e}")
        except FileNotFoundError:
            logger.error(f"未找到Chrome启动脚本: {chrome_bat_path}")


Binary file not shown.
@@ -137,11 +137,34 @@ class SQLiteWebViewer:
            cursor.execute(count_query, query_params)
            total_count = cursor.fetchone()[0]

            # Check for date-related columns to sort by
            date_columns = []
            # Common date column names
            date_field_names = ['created_at', 'updated_at', 'date', 'publish_date', 'release_date']
            for col_info in columns_info:
                field_name = col_info[1]
                # Check whether the column name contains a date-related keyword
                if any(keyword in field_name.lower() for keyword in date_field_names):
                    date_columns.append(field_name)

            # If a date column was found, sort by newest first
            order_by_clause = ""
            if date_columns:
                # Prefer updated_at, then created_at, otherwise the first date column found
                if 'updated_at' in date_columns:
                    sort_column = 'updated_at'
                elif 'created_at' in date_columns:
                    sort_column = 'created_at'
                else:
                    sort_column = date_columns[0]
                order_by_clause = f" ORDER BY {sort_column} DESC"
                logger.info(f"应用日期排序: {sort_column} DESC")

            # Fetch the page of data
            offset = (page - 1) * per_page
            query_params.extend([per_page, offset])

            data_query = f"SELECT * FROM {table_name}{where_clause} LIMIT ? OFFSET ?"
            data_query = f"SELECT * FROM {table_name}{where_clause}{order_by_clause} LIMIT ? OFFSET ?"
            cursor.execute(data_query, query_params)

            rows = cursor.fetchall()

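The viewer hunk picks a sort column by substring match against a keyword list, preferring `updated_at`, then `created_at`, then the first match. Pulled out as a pure function, the rule can be exercised without a database. The function names and the identifier quoting below are additions for illustration; the hunk itself interpolates the column name unquoted.

```python
from typing import List, Optional

DATE_KEYWORDS = ("created_at", "updated_at", "date", "publish_date", "release_date")


def pick_sort_column(column_names: List[str]) -> Optional[str]:
    """Return the column to sort by, following the preference order in the hunk."""
    candidates = [
        name for name in column_names
        if any(keyword in name.lower() for keyword in DATE_KEYWORDS)
    ]
    for preferred in ("updated_at", "created_at"):
        if preferred in candidates:
            return preferred
    return candidates[0] if candidates else None


def order_by_clause(column_names: List[str]) -> str:
    """Build the ORDER BY fragment, quoting the identifier for SQLite."""
    column = pick_sort_column(column_names)
    if column is None:
        return ""
    quoted = column.replace('"', '""')  # escape embedded double quotes
    return f' ORDER BY "{quoted}" DESC'


print(pick_sort_column(["id", "title", "source_date"]))
print(order_by_clause(["id", "created_at", "updated_at"]))
```

Because the keyword test is a substring match, `date` already covers `publish_date` and `release_date`, and it also matches `source_date` from the `articles` table, while columns such as `scrape_time` would not be picked up.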
Binary file not shown.
|
Before Width: | Height: | Size: 354 KiB After Width: | Height: | Size: 717 KiB |
@@ -1,11 +1,11 @@
=== Product Hunt 产品信息 ===

产品名称: QuickWidgets
产品名称: Greta

产品简介: The Quickwidgets is lightweight and diverse functions widgets
产品简介: 未获取

制作人发言: 未获取
制作人发言: This is first first proposed project. If you want to support Santiago getting his project built, here are the details.https://onemillionlines.com/proj...

用户数: 37 followers
用户数: 664 followers

提取时间: 2025-12-03 18:53:22
提取时间: 2026-03-08 20:40:13

File diff suppressed because it is too large
Load Diff
@@ -205,9 +205,13 @@ def process_temp_files():
                continue

            # Process each article
            for article in tqdm(articles, desc=f"处理 {file_path}"):
            for i, article in tqdm(enumerate(articles), desc=f"处理 {file_path}", total=len(articles)):
                total_processed += 1

                # Log progress once every 10 articles
                if i % 10 == 0 and i > 0:
                    logger.info(f"已处理 {i}/{len(articles)} 篇文章,完成 {i/len(articles)*100:.1f}%")

                # Check for duplicates
                if check_duplicate(article['title'], source_date):
                    logger.info(f"跳过重复文章(最近三天已存在): {article['title']}")

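The new loop logs at the top of iteration `i`, when `i` articles are already complete, so the final count is reported only if it happens to fall on a multiple of 10. A variant that logs after each item and always reports the last one could look like this. It uses plain Python without `tqdm`, and the function name and message wording are illustrative assumptions.

```python
from typing import List


def progress_messages(total: int, every: int = 10) -> List[str]:
    """Collect one progress line per `every` items, plus one for the final item."""
    messages = []
    for done in range(1, total + 1):
        # ... process item number `done` here ...
        if done % every == 0 or done == total:
            messages.append(f"processed {done}/{total} ({done / total * 100:.1f}%)")
    return messages


for line in progress_messages(25):
    print(line)
```

Counting from 1 keeps the reported number equal to the items actually finished, and the `done == total` branch guarantees a closing 100% line for batches whose size is not a multiple of the interval.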
BIN
tophub_data.db
Binary file not shown.
70399
tophub_scraper.log
File diff suppressed because it is too large
Load Diff
@@ -262,7 +262,7 @@ class TopHubScraper:

            # Read the output as it arrives to avoid encoding problems
            try:
                stdout, stderr = process.communicate(timeout=300)  # 5-minute timeout
                stdout, stderr = process.communicate(timeout=3600)  # 1-hour timeout
            except subprocess.TimeoutExpired:
                process.kill()
                logger.error("tophub_add_data_to_db.py执行超时")
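The hunk raises the `communicate` timeout from 5 minutes to 1 hour and kills the child on `TimeoutExpired`. The Python documentation for `Popen.communicate` recommends calling `communicate()` once more after `kill()` so the process is reaped and any buffered output is collected. A self-contained sketch of that pattern follows; the helper name and the short demo commands are illustrative, not taken from `tophub_scraper.py`.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(args: List[str], timeout: float) -> Tuple[Optional[int], str, str]:
    """Run a command; return (returncode, stdout, stderr), with None on timeout."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, stderr = process.communicate(timeout=timeout)
        return process.returncode, stdout, stderr
    except subprocess.TimeoutExpired:
        process.kill()
        # Reap the killed child and drain whatever it wrote before dying.
        stdout, stderr = process.communicate()
        return None, stdout, stderr


quick = run_with_timeout([sys.executable, "-c", "print('done')"], timeout=30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5)
print(quick[0], quick[1].strip())
print(slow[0])
```

Returning `None` as the exit code lets the caller tell a timeout apart from a non-zero exit without parsing log text.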
@@ -287,6 +287,8 @@ class TopHubScraper:

if __name__ == "__main__":
    scraper = TopHubScraper()


    try:
        # Scrape the data
        scraped_data = scraper.scrape_by_node_ids()
