2026-05-19 22:54:43 +08:00

OK, it is time to turn this into a **rollout plan executed block by block**. Stop going around in circles in your head; work through it in this order.

# 知习 Knowledge Base Rollout Plan (Full Version)

> Last checked: 2026-05-20 14:00 | Legend: ✅ done | 🔶 partially done | ⏳ pending | ❌ not started

---

## Part 1: Server Deployment ✅ Done

### 1. Current server roles

```text
4-core / 4 GB lightweight cloud server:
  main business server

8-core / 32 GB CVM:
  knowledge base / RAG / document processing / AI scheduling server

Tencent Cloud COS:
  original file storage
```

### 2. What the 8-core / 32 GB server is for

This server is responsible for:

```text
Qdrant vector store
LlamaIndex / RAG Worker
Document parsing
PDF / DOCX / TXT / Markdown processing
OCR / multimodal task scheduling
Chunking
Embedding calls
Knowledge base retrieval
Part of the AI Gateway scheduling
DocumentImport async jobs
KnowledgeChunk writes
ImportCandidate generation
```

It is not responsible for:

```text
Local LLM inference
Local multimodal models
GPU workloads
Long-term storage of original files
```

---

## Part 2: Base Server Environment ✅ Done

### 1. System configuration

```text
CPU: 8 cores
Memory: 32 GB
OS: Ubuntu Server 22.04 LTS
System disk: 40 GB
Data disk: 70 GB
Public bandwidth: 1 Mbps
Region: Beijing (蜂驰云)
Private IP: 172.21.0.4 ↔ 4-core / 4 GB server (10.2.0.7), ~1.9 ms latency
```

### 2. Data disk mount ✅ Done

Mounted at `/data`; 9 subdirectories created.

---

## Part 3: What Needs to Be Installed on the Server ✅ All done

### 1. Base environment

```text
Docker ✅
Docker Compose ✅
Git ✅
Node.js ✅
pnpm ✅
Python 3.11+ ✅ (3.11.15; systemd uses /usr/bin/python3.11)
pip / poetry ✅
nginx ✅
supervisor / systemd ✅
logrotate ✅
```

### 2. Knowledge base service components

```text
Qdrant (Docker) ✅
RAG Worker (Python) ✅ (systemd unit zhixi-worker started, polling normally)
NestJS API (Node) ✅ (deployed twice: systemd on the 8-core / 32 GB server, Docker on the 4-core / 4 GB server)
COS SDK ✅
AI Gateway ✅
OCR Provider SDK ✅ (Baidu OCR, AppID 7767914)
Embedding Provider SDK ✅ (SiliconFlow bge-m3)
```

### 3. Main Python Worker dependencies

```text
llama-index ✅ (replaced by httpx + pymupdf: parser/chunker/embedder are implemented in-house, so the heavyweight llama-index framework is not needed)
qdrant-client ✅
pymupdf ✅
python-docx ✅
markdown ✅
pandas ✅
openpyxl ✅
pydantic ✅
httpx ✅
tencentcloud-sdk-python ✅ (replaced by calling the COS API directly with httpx; no need to pull in the whole SDK)
Pillow ✅
python-dotenv ✅
```

### 4. Main Node backend dependencies ✅ All installed

```text
NestJS 11.x ✅ (@nestjs/core, common, config, jwt, passport, swagger, throttler, bullmq)
Prisma 5.22 ✅ (@prisma/client)
Redis / BullMQ ✅ (ioredis 5.10 + bullmq 5.76)
COS SDK ✅ (cos-nodejs-sdk-v5, not the AWS S3 SDK)
AI Provider ✅ (httpx calls to DeepSeek / SiliconFlow)
JWT / Auth ✅ (bcryptjs + jose + passport-jwt)
class-validator ✅ 0.15.1
zod ✅ 4.4.3
helmet ✅ 8.1
```

> **Language split:** NestJS (Node) owns the API layer, the AI Gateway, and job enqueueing. Python owns the RAG Worker (document parsing, chunking, embedding, Qdrant writes), because the llama-index ecosystem is most mature on the Python side. The two communicate through Redis/BullMQ (enqueueing) and internal HTTP (heartbeat/result).

---

## Part 4: Security Groups and Networking ✅ Configured

### 1. Security group recommendations

```text
22: allow only your own IP
80 / 443: open only if the knowledge base service needs external access
6333 / 6334: Qdrant, not exposed to the public internet
3306: MySQL, not exposed to the public internet
6379: Redis, not exposed to the public internet
```

### 2. Server-to-server communication ✅ Direct over the private network

| Path | Method | Latency |
|------|--------|---------|
| 4-core / 4 GB (10.2.0.7) → 8-core / 32 GB (172.21.0.4) | Private-network HTTP | ~1.9 ms |
| 8-core / 32 GB → Gitea on 4-core / 4 GB (10.2.0.7:3000) | Private-network HTTP | ~1.9 ms |
| 8-core / 32 GB Runner → Gitea | Private network | Switched over |
| COS (ap-beijing) → both servers | VPC internal endpoint | No traffic fees |

> All server-to-server traffic goes over the private network. Internal knowledge base endpoints are not exposed to the public internet.

### 3. COS access ✅ Implemented

COS is not mounted long-term as a local disk. `import_pipeline.py` already implements:

```text
Worker pulls the file through a COS presigned URL
→ downloads it temporarily to /data/tmp/imports/{jobId}
→ parses and processes it
→ writes to Qdrant / MySQL
→ cleans up the temporary files with shutil.rmtree in a finally block
```

COS handles original file storage; the server handles temporary processing and indexing.
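
The download, process, and clean-up flow can be sketched roughly as follows. This is a minimal illustration, not the actual `import_pipeline.py`: the function name `run_import` and the injected `fetch` / `process` callables are assumptions made so the example stays self-contained.

```python
import shutil
from pathlib import Path
from typing import Callable


def run_import(
    job_id: str,
    fetch: Callable[[Path], object],
    process: Callable[[Path], str],
    tmp_root: Path = Path("/data/tmp/imports"),
) -> str:
    """Download into a per-job directory, process it, and always clean up."""
    job_dir = tmp_root / job_id
    job_dir.mkdir(parents=True, exist_ok=True)
    try:
        local_file = job_dir / "original"
        fetch(local_file)           # hypothetical: stream the presigned URL to disk
        return process(local_file)  # hypothetical: parse, chunk, index
    finally:
        # Runs on success and on failure, so no temporary files are left behind.
        shutil.rmtree(job_dir, ignore_errors=True)
```

Putting the cleanup in `finally` means a parser crash cannot slowly fill the 70 GB data disk with orphaned job directories.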

---

## Part 5: COS Storage Layout ✅ Bucket verified; same-region private-network traffic is free

Core principles:

```text
The database is the source of truth for relationships; COS is the file warehouse
COS paths are organized by userId / knowledgeBaseId / sourceId
Permissions, ownership, and status always follow the database
Never use the user's original filename as the objectKey (use fileId/sourceId to avoid name collisions and special characters)
```

### 1. Bucket

One bucket is enough for phase one:

```text
zhixi-prod
```

### 2. Original files of user knowledge bases

```text
users/{userId}/knowledge-bases/{knowledgeBaseId}/sources/{sourceId}/original/{fileId}.{ext}
```

Examples:

```text
users/user_001/knowledge-bases/kb_001/sources/src_001/original/file_001.pdf
users/user_001/knowledge-bases/kb_001/sources/src_002/original/file_002.docx
```
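
A small helper can build these keys so that user-supplied filenames never reach the objectKey. This is a sketch under assumptions: the function name and the "letters, digits, underscore, hyphen" rule for IDs are illustrative, not taken from the existing code.

```python
import re

# Assumption: IDs are generated server-side and only contain these characters.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def original_object_key(user_id: str, kb_id: str, source_id: str, file_id: str, ext: str) -> str:
    """Build the COS key for an original upload from IDs only."""
    for part in (user_id, kb_id, source_id, file_id):
        if not _SAFE_ID.fullmatch(part):
            raise ValueError(f"unsafe path segment: {part!r}")
    ext = ext.lower().lstrip(".")
    if not ext.isalnum():
        raise ValueError(f"unsafe extension: {ext!r}")
    return (
        f"users/{user_id}/knowledge-bases/{kb_id}"
        f"/sources/{source_id}/original/{file_id}.{ext}"
    )
```

The original filename still lives in the database (`originalFilename`), so it can be shown to the user without ever becoming part of a storage path.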

### 3. Parse results

```text
users/{userId}/knowledge-bases/{knowledgeBaseId}/sources/{sourceId}/parsed/parsed.md
users/{userId}/knowledge-bases/{knowledgeBaseId}/sources/{sourceId}/parsed/metadata.json
```

`parsed.md` keeps the full parsed text. MySQL does not store the full text; it references it through `parsedObjectKey`.

### 4. OCR / multimodal results

```text
users/{userId}/knowledge-bases/{knowledgeBaseId}/sources/{sourceId}/processed/ocr/page_001.json
users/{userId}/knowledge-bases/{knowledgeBaseId}/sources/{sourceId}/processed/vision/page_001.json
```

Early on, saving only the final `parsed.md` is enough; store intermediate results when debugging OCR/multimodal quality.

### 5. User avatars and feedback

```text
users/{userId}/profile/avatar/{fileId}.{ext}
users/{userId}/feedback/{feedbackId}/screenshots/{fileId}.{ext}
```

### 6. Built-in system knowledge bases

```text
system/knowledge-bases/{systemKnowledgeBaseId}/sources/{sourceId}/original/{fileId}.{ext}
system/knowledge-bases/{systemKnowledgeBaseId}/sources/{sourceId}/parsed/parsed.md
system/knowledge-bases/{systemKnowledgeBaseId}/sources/{sourceId}/parsed/metadata.json
```

A system knowledge base is stored only once. When users use it, they "reference" it; do not copy it for each user.

### 7. Backups (synced to COS)

```text
system/backups/qdrant/{yyyy-mm-dd}/zhixi_chunks.snapshot
system/backups/mysql/{yyyy-mm-dd}/zhixi.sql.gz
```

### 8. User exports

```text
exports/users/{userId}/{exportId}/learning_report.pdf
exports/users/{userId}/{exportId}/knowledge_base_export.zip
```

### 9. Temporary files (primarily local to the server)

By default they go under `/data/tmp/imports` on the server and are deleted after processing. If they must go to COS:

```text
temp/imports/{importId}/original.{ext}
temp/imports/{importId}/pages/page_001.png
temp/imports/{importId}/ocr/page_001.json
```

### 10. Full COS directory tree

```text
zhixi-prod/
  users/
    {userId}/
      knowledge-bases/
        {knowledgeBaseId}/
          sources/
            {sourceId}/
              original/
                {fileId}.{ext}
              parsed/
                parsed.md
                metadata.json
              processed/
                ocr/
                  page_001.json
                vision/
                  page_001.json
      profile/
        avatar/
          {fileId}.{ext}
      feedback/
        {feedbackId}/
          screenshots/
            {fileId}.{ext}

  system/
    knowledge-bases/
      {systemKnowledgeBaseId}/
        sources/
          {sourceId}/
            original/
              {fileId}.{ext}
            parsed/
              parsed.md
              metadata.json
    backups/
      qdrant/
        {yyyy-mm-dd}/
          zhixi_chunks.snapshot
      mysql/
        {yyyy-mm-dd}/
          zhixi.sql.gz

  exports/
    users/
      {userId}/
        {exportId}/
          learning_report.pdf
          knowledge_base_export.zip

  temp/
    imports/
      {importId}/
        original.{ext}

  public/
    app-assets/
      icons/
      illustrations/
```

### 11. COS-related fields to store in the database

`files` table: `bucket / objectKey / originalFilename / mimeType / sizeBytes / sha256 / purpose / status`

`knowledge_sources` table: `originalObjectKey / parsedObjectKey / metadataObjectKey`

### 12. COS lifecycle

```text
Keep files for 7 days after soft deletion
A scheduled job in the early hours of each day cleans up expired files
```

---

## Part 6: Supported Upload Formats and Processing Strategy ✅ Parser code done; Baidu OCR enabled

### 1. Must be supported in phase one

```text
PDF  DOCX  TXT  Markdown / MD
PNG  JPG / JPEG  WEBP  HEIC
CSV  XLSX
```

### 2. Detailed processing strategy

| Type | Processing | Tool |
|------|------------|------|
| TXT / Markdown | Parsed locally | Python built-ins |
| DOCX | Parsed locally | python-docx |
| Text-based PDF | PyMuPDF extracts the text layer | pymupdf |
| Scanned PDF (empty text layer) | Baidu OCR, with Qwen3-VL as fallback | Baidu OCR → SiliconFlow |
| Text in images | Baidu OCR | Baidu OCR |
| Table screenshots | Qwen3-VL multimodal | SiliconFlow |
| Mixed text and images | Qwen3-VL multimodal | SiliconFlow |
| CSV / XLSX | Parsed locally into a Markdown table | pandas / openpyxl |
| PPTX | Reserved; text extraction only | python-pptx (later) |
| HEIC | Convert to JPG first, then process | Pillow (needs a HEIC plugin such as pillow-heif) |

### 3. Reserved for later

```text
Audio  Video  Web page scraping  Bulk import from archives
```

### 4. PDF parsing in detail

```text
Try extracting the text layer with PyMuPDF first
If the text layer is empty, or a page has fewer than 50 characters of text
  → treat it as a scanned document
  → send it to Baidu OCR (plain scanned text)
  → send complex layouts / textbook-style pages to Qwen3-VL
```
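
The 50-character rule can be expressed as a small pure function. This is a sketch, not the actual parser: it takes the per-page text as plain strings (with PyMuPDF these would come from `page.get_text()`), and the choice to call a document "scanned" only when every page falls below the threshold is an assumption, since the rule above does not say how mixed documents are handled.

```python
from typing import Sequence

MIN_CHARS_PER_PAGE = 50  # threshold from the rule above


def pages_needing_ocr(page_texts: Sequence[str], min_chars: int = MIN_CHARS_PER_PAGE) -> list[int]:
    """Return 1-based page numbers whose text layer is empty or too short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def is_scanned(page_texts: Sequence[str]) -> bool:
    """Assumption: scanned means every page is below the threshold."""
    return len(page_texts) > 0 and len(pages_needing_ocr(page_texts)) == len(page_texts)
```

Returning page numbers rather than a single flag lets a mixed PDF send only its image-only pages to OCR, which keeps the monthly OCR page count down.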

---

## Part 7: Core Data Model ✅ All 33 tables created

### 1. File

```text
files
- id
- userId
- bucket
- objectKey
- originalFilename
- mimeType
- sizeBytes
- sha256                ← used for duplicate file detection
- purpose
- status
- createdAt
- updatedAt
- deletedAt
```

Duplicate detection: when a sha256 repeats for the same user within the same knowledge base, tell the user "This file already exists" and let them cancel, reference the existing file, or add it anyway.
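
A minimal sketch of the hashing side of duplicate detection. The helper names are illustrative; in practice the lookup would be a database query scoped to the user and knowledge base, represented here by a plain set.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so large uploads do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_duplicate(sha256_hex: str, existing_hashes: set[str]) -> bool:
    """existing_hashes stands in for the hashes already stored for this user and knowledge base."""
    return sha256_hex in existing_hashes
```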

### 2. KnowledgeBase

```text
knowledge_bases
- id
- userId
- title
- description
- icon
- coverColor
- visibility            ← reserved; defaults to private in phase one
- status
- itemCount
- sourceCount
- storageUsedBytes
- lastImportedAt
- lastStudiedAt
- createdAt
- updatedAt
- deletedAt
```

### 3. KnowledgeSource

```text
knowledge_sources
- id
- userId
- knowledgeBaseId
- fileId
- type
- title
- originalFilename
- mimeType
- sizeBytes
- textLength
- parseStatus
- indexStatus
- learningStatus
- parsedObjectKey
- version               ← reserved, default 1
- parentSourceId        ← reserved (version chain)
- replacedBySourceId    ← reserved (which newer version replaced it)
- errorCode
- errorMessage
- createdAt
- updatedAt
- deletedAt
```

### 4. DocumentImport

```text
document_imports
- id
- userId
- knowledgeBaseId
- sourceId
- status
- step
- progress              ← 0~100
- workerId
- retryCount            ← new
- maxRetries            ← new, default 3
- heartbeatAt
- errorCode
- errorMessage
- startedAt
- completedAt
- createdAt
- updatedAt
```

State machine:

```text
QUEUED
CLAIMED
DOWNLOADING
PARSING
OCR_PROCESSING
VISION_PROCESSING
CLEANING
CHUNKING
EMBEDDING
INDEXING
GENERATING_CANDIDATES
WAITING_CONFIRM
COMPLETED
FAILED_RETRYABLE
FAILED_FINAL
CANCELED
```

**Heartbeat mechanism:** the Worker reports every 30 seconds. After more than 5 minutes without a heartbeat, the job's status reverts to QUEUED, workerId is cleared, and retryCount is incremented by 1.

**Stale Job Recovery (scheduled job, every minute):**

```sql
-- Requeue in-progress jobs whose Worker has stopped sending heartbeats.
-- retryCount is incremented to match the heartbeat rule described above.
UPDATE document_imports
SET status = 'QUEUED', workerId = NULL, retryCount = retryCount + 1
WHERE status IN ('CLAIMED', 'DOWNLOADING', 'PARSING', 'OCR_PROCESSING',
                 'VISION_PROCESSING', 'CLEANING', 'CHUNKING', 'EMBEDDING',
                 'INDEXING', 'GENERATING_CANDIDATES')
  AND heartbeatAt < NOW() - INTERVAL 5 MINUTE;
```
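
The same recovery rule, written as an in-memory sketch that also applies the `maxRetries` cap. The `ImportJob` dataclass is a hypothetical stand-in for a `document_imports` row, and treating "retryCount greater than maxRetries" as the point where a job becomes FAILED_FINAL is an assumption about how the two rules combine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

HEARTBEAT_TIMEOUT = timedelta(minutes=5)
IN_PROGRESS = {
    "CLAIMED", "DOWNLOADING", "PARSING", "OCR_PROCESSING", "VISION_PROCESSING",
    "CLEANING", "CHUNKING", "EMBEDDING", "INDEXING", "GENERATING_CANDIDATES",
}


@dataclass
class ImportJob:
    status: str
    worker_id: Optional[str]
    retry_count: int
    max_retries: int
    heartbeat_at: datetime


def recover_if_stale(job: ImportJob, now: datetime) -> bool:
    """Requeue (or finally fail) a job whose heartbeat is older than the timeout."""
    if job.status not in IN_PROGRESS:
        return False
    if now - job.heartbeat_at <= HEARTBEAT_TIMEOUT:
        return False
    job.worker_id = None
    job.retry_count += 1
    # Assumption: once the retry budget is exceeded the job is not requeued again.
    job.status = "FAILED_FINAL" if job.retry_count > job.max_retries else "QUEUED"
    return True
```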

### 5. KnowledgeChunk

```text
knowledge_chunks
- id
- userId
- knowledgeBaseId
- sourceId
- content
- chunkIndex
- pageNumber
- sectionTitle
- tokenCount
- externalVectorId      ← Qdrant point ID
- embeddingModel        ← 'bge-m3'
- embeddingStatus       ← PENDING / COMPLETED / FAILED
- metadataJson          ← { overlapWith, chunkType, ... }
- createdAt
- updatedAt
- deletedAt
```

### 6. ImportCandidate

```text
import_candidates
- id
- userId
- knowledgeBaseId
- sourceId
- importId
- title
- summary
- content
- tagsJson
- recallQuestionsJson
- sourceTextSnippet
- sourceChunkIds        ← new: which chunks it is linked to
- confidence            ← 0.0 ~ 1.0
- difficulty            ← new: easy / medium / hard
- orderIndex
- status                ← PENDING / ACCEPTED / REJECTED / EDITED / IMPORTED
- createdAt
- updatedAt
```

Generation rules (decided):

```text
Generate 1~2 candidates per 2000 Chinese characters
At most 30 per source
At least 3 (even when the document is very short)
Nothing is auto-accepted in phase one; everything stays PENDING until the user confirms
```
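
As a worked example of these rules, the allowed candidate count for a source can be computed from its text length. The function name is hypothetical, and rounding a partial 2000-character block up to a full block is an assumption; the rules above do not specify rounding.

```python
import math

CHARS_PER_BLOCK = 2000
MIN_CANDIDATES = 3
MAX_CANDIDATES = 30


def candidate_range(text_length: int) -> tuple[int, int]:
    """Return the (low, high) number of candidates to request for one source."""
    blocks = max(1, math.ceil(text_length / CHARS_PER_BLOCK))  # assumption: round up
    low = min(MAX_CANDIDATES, max(MIN_CANDIDATES, blocks))       # 1 per block
    high = min(MAX_CANDIDATES, max(MIN_CANDIDATES, blocks * 2))  # 2 per block
    return low, high
```

For a 10,000-character document this gives 5 blocks, so between 5 and 10 candidates; a 500-character note is lifted to the minimum of 3, and a 100,000-character book is capped at 30.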

### 7. KnowledgeItem

```text
knowledge_items
- id
- userId
- knowledgeBaseId
- sourceId
- importId
- title
- summary
- content
- tagsJson
- sourceType
- masteryLevel
- sourceDeleted         ← new: whether the original material has been deleted
- sourceTitleSnapshot   ← new: snapshot of the original material's title
- sourceSnippetSnapshot ← new: snapshot of the originally quoted snippet
- createdAt
- updatedAt
- deletedAt
```

### 8. MembershipPlan

```text
membership_plans
- id
- code                  ← FREE / PRO_TEST / PRO
- name
- priceMonthly          ← unit: fen (1/100 CNY, avoids floating point); configurable, adjustable at any time
- priceYearly
- maxKnowledgeBases
- maxStorageBytes
- maxFileSizeBytes
- monthlyOcrPages
- monthlyVisionPages
- monthlyChatCount
- monthlyAiAnalysisCount
- monthlyRecallCount
- monthlyCardGenCount
- isActive
- createdAt
- updatedAt
```

Prices are not hardcoded; the backend only recognizes planId + quotaConfig. Phase one uses 28 CNY/month as the Pro preset. After one month of real cost data, the price and quotas will be finalized.

### 9. BackupJob

```text
backup_jobs
- id
- type                  ← QDRANT / MYSQL
- status                ← RUNNING / COMPLETED / FAILED
- localPath
- cosObjectKey
- fileSizeBytes
- startedAt
- completedAt
- errorMessage
- createdAt
```

---

## Part 8: Chunking Strategy (decided) ✅ chunker.py implemented

### 1. Default parameters

```text
chunk_size = 512 tokens
overlap    = 64 tokens (~12%)
strategy   = recursive character splitting + Chinese sentence-boundary protection
```

### 2. Rules per document type

| Document type | How it is chunked |
|---------------|-------------------|
| Markdown | Split hierarchically by `#` / `##` / `###` headings first |
| PDF | Keep pageNumber; split at paragraph boundaries |
| DOCX | Split by heading / paragraph hierarchy |
| Tables | **Keep the whole table**; do not force it apart |
| Code blocks | **Keep the whole block** |
| Around formulas | Keep the formula and its context in the same chunk |
| Plain text | 512 tokens |
| Complex explanatory paragraphs | May extend to 768 tokens |

### 3. Metadata injected into every chunk

```text
sourceId
pageNumber
sectionTitle
chunkIndex
chunkType (text / table / code / formula)
```
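
A simplified sketch of sentence-protected chunking with overlap. It is not the actual `chunker.py`: it counts characters instead of tokens (an assumption made to avoid a tokenizer dependency), only handles plain text, and takes the overlap as a raw character tail of the previous chunk.

```python
import re

# Split right after Chinese or Western sentence terminators, keeping them attached.
_SENTENCE_END = re.compile(r"(?<=[。!?;.!?])")


def split_sentences(text: str) -> list[str]:
    return [part for part in _SENTENCE_END.split(text) if part.strip()]


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole sentences into chunks; each new chunk starts with the previous tail."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = current[-overlap:] if overlap else ""
        current += sentence  # a single over-long sentence still becomes one chunk
    if current.strip():
        chunks.append(current)
    return chunks


def build_chunks(text: str, source_id: str, chunk_size: int = 512, overlap: int = 64) -> list[dict]:
    """Attach part of the per-chunk metadata listed above (page and section omitted here)."""
    return [
        {"sourceId": source_id, "chunkIndex": index, "chunkType": "text", "content": content}
        for index, content in enumerate(chunk_text(text, chunk_size, overlap))
    ]
```

Because chunk boundaries only fall between sentences, no chunk ends in the middle of a Chinese sentence; the trade-off is that chunk lengths vary and a sentence longer than `chunk_size` is kept whole.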

---

## Part 9: Qdrant Design (decided) ✅ Deployed and running

### 1. Deployment parameters

```text
Deployment mode: single-node Docker
Collection: zhixi_chunks
vector_size: 1024
distance: Cosine
```

### 2. Collection creation parameters

```json
{
  "vectors": {
    "size": 1024,
    "distance": "Cosine"
  },
  "shard_number": 1,
  "replication_factor": 1,
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100
  },
  "optimizers_config": {
    "default_segment_number": 2
  },
  "on_disk_payload": true
}
```

> **Note:** do not set `replication_factor = 2` on a single node; it is pointless and wastes resources.

### 3. Payload

```json
{
  "userId": "user_xxx",
  "knowledgeBaseId": "kb_xxx",
  "sourceId": "src_xxx",
  "chunkId": "chunk_xxx",
  "pageNumber": 3,
  "sectionTitle": "Section title",
  "deleted": false
}
```

### 4. Payload indexes (required)

```text
userId          → keyword index
knowledgeBaseId → keyword index
sourceId        → keyword index
chunkId         → keyword index
deleted         → bool index
```

### 5. Retrieval filters

Every search must include:

```text
userId = current user
knowledgeBaseId = current knowledge base
deleted = false
```
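
One way to make the mandatory filter hard to forget is to build it in a single helper and refuse to search without it. The sketch below produces a filter in Qdrant's REST JSON shape (`must` with `key` / `match` conditions); the helper name is an assumption.

```python
def build_search_filter(user_id: str, knowledge_base_id: str) -> dict:
    """Return the filter every search request must carry (tenant isolation + soft delete)."""
    if not user_id or not knowledge_base_id:
        raise ValueError("userId and knowledgeBaseId are both required for every search")
    return {
        "must": [
            {"key": "userId", "match": {"value": user_id}},
            {"key": "knowledgeBaseId", "match": {"value": knowledge_base_id}},
            {"key": "deleted", "match": {"value": False}},
        ]
    }
```

All three keys are covered by the payload indexes listed above, so the filter is applied against indexed fields rather than scanning payloads.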

### 6. Backup strategy (decided: local + synced to COS)

```text
Generate a Qdrant snapshot at 3 a.m. daily → /data/backups/qdrant/
Upload it to COS afterwards → system/backups/qdrant/{yyyy-mm-dd}/zhixi_chunks.snapshot
Keep the last 7 days locally
Keep the last 30 days in COS
```

Recovery dependency chain: COS original files + MySQL metadata + Qdrant snapshots → the three together guarantee recoverability.

### 7. When to migrate to a Qdrant cluster

```text
Phase one: single-node Qdrant, 1 shard, replication_factor = 1
```

Migration triggers (evaluate as soon as any one is met):

```text
The collection exceeds 1 million points
Qdrant memory usage stays above 70% over the long run
Search p95 latency exceeds 1.5 seconds
Snapshot / restore takes too long
```

Migration path:

```text
Stage 1: single-node Qdrant (now)
Stage 2: add a data disk to the current server
Stage 3: move Qdrant to its own new server
Stage 4: Qdrant cluster
```

---

## Part 10: Embedding & Rerank (decided) ✅ SiliconFlow key configured

### 1. Embedding

```text
Model: BAAI/bge-m3
Provider: SiliconFlow
Dimensions: 1024
Batch size: 50~100 chunks per batch
```

Note: do not push an entire large document into a single embedding call; process it in batches. When one batch fails, retry only that batch instead of rerunning everything.
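
A minimal sketch of batch-level retry. The provider call is injected as `embed_batch` (a hypothetical callable that takes a list of texts and returns one vector per text), so nothing here depends on a specific SDK; the defaults of 2 retries with a fixed 2-second delay are illustrative.

```python
import time
from typing import Callable, Sequence


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    batch_size: int = 64,
    retries: int = 2,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Embed chunks batch by batch; a failed batch is retried without redoing earlier ones."""
    vectors: list = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(retries + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == retries:
                    raise  # retry budget spent; let the caller mark the job as failed
                sleep(delay_s)
    return vectors
```

Results from earlier batches stay in `vectors` while a later batch is being retried, which is what keeps a transient provider error from multiplying embedding cost.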

Fallback (if bge-m3's cost or speed is unsatisfactory):

```text
→ switch to bge-large-zh-v1.5 (also 1024d)
```

The first version, however, uses bge-m3 exclusively; do not support multiple embedding models at the same time.

### 2. Rerank

```text
Model: BAAI/bge-reranker-v2-m3
Provider: SiliconFlow
Input: query + Top-50 candidate chunks
Output: reranked Top-5~8
```

---

## Part 11: RAG Retrieval Flow (decided) ✅ indexer.py implemented; end-to-end verification pending

```text
User asks a question
→ generate the query embedding (bge-m3)
→ Qdrant ANN recall of Top-50 (filtered by userId / kbId / deleted)
→ rerank with bge-reranker-v2-m3
→ take Top-5 (ordinary questions) ~ Top-8 (complex questions)
→ assemble the context
→ DeepSeek generates the answer + source citations
```

### Context assembly format

```text
[Source: {sourceTitle}, section: {sectionTitle}, page {pageNumber}]
{chunkContent}
```
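
The context assembly step can be sketched as a small formatter. It uses an English rendering of the header format above; the `hits` structure (dicts with `sourceTitle`, `sectionTitle`, `pageNumber`, `content`) is an assumption about what the rerank step hands over.

```python
from typing import Mapping, Sequence


def format_context(hits: Sequence[Mapping]) -> str:
    """Join reranked chunks into one prompt context, each with a citation header."""
    blocks = []
    for hit in hits:
        header = (
            f"[Source: {hit['sourceTitle']}, section: {hit['sectionTitle']}, "
            f"page {hit['pageNumber']}]"
        )
        blocks.append(f"{header}\n{hit['content']}")
    return "\n\n".join(blocks)
```

Keeping the citation header next to each chunk is what lets the model attribute statements to a source and page in its answer.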

### Multi-turn strategy for knowledge base chat

Phase one:

```text
Save chat sessions + messages: ✅
Include the last 3 conversation turns in the prompt: ✅ in a limited way
Rewrite the retrieval query based on history: ❌ later
General AI Q&A (not based on knowledge base content): ❌ not supported in phase one
```

The retrieval query is still the user's current question as typed, but the prompt context carries the last 3 turns of history.

When nothing relevant is retrieved, return:

```text
No sufficiently relevant material was found in the current knowledge base.
You can upload more material or rephrase the question.
```

In phase one 知习 only does **RAG Q&A grounded in the current knowledge base**, not general-purpose AI chat. This avoids drifting from the product positioning, paying for pointless tokens, and users treating it as ChatGPT.

---

## Part 12: AI Provider Strategy ✅ All three providers configured

### 1. DeepSeek (official): core text intelligence

```text
Knowledge point extraction      default: V4 Flash (non-thinking)
Summaries / tags                default: V4 Flash
Active recall questions         default: V4 Flash
Review card generation          default: V4 Flash
Knowledge base Q&A              default: V4 Flash
Active recall diagnosis         V4 Flash thinking
Needs-reinforcement analysis    V4 Flash thinking
Learning reports                V4 Flash thinking
High-value deep analysis        V4 Pro
```

### 2. SiliconFlow: utility models

```text
Qwen3-VL multimodal        Qwen/Qwen3-VL-32B-Instruct
Complex vision fallback    Qwen/Qwen3-VL-32B-Thinking
embedding                  BAAI/bge-m3
rerank                     BAAI/bge-reranker-v2-m3
Backup model pool          marketing / customer service / copy-polishing tests
```

### 3. Baidu OCR

```text
Plain scanned text
Text in images
Text in screenshots
→ do not force OCR on complex pages; hand them to Qwen3-VL
```

---

## Part 13: AI Gateway ✅ Three-layer architecture in place

All AI calls must go through the AI Gateway. It is responsible for:

```text
Model routing
Prompt version management
JSON Schema validation
Retry on failure + timeout control
Token accounting + cost estimation
AIUsageLog
Membership quota deduction
Model downgrade
```

Business modules only call:

```text
AIGateway.run("extract_knowledge_candidates")
AIGateway.run("analyze_active_recall")
AIGateway.run("knowledge_chat")
AIGateway.run("parse_complex_page")
AIGateway.run("embed_chunks")
AIGateway.run("rerank_chunks")
```

---

## Part 14: Worker Deployment and Concurrency Control ✅ Deployed

### 1. Process count and concurrency (decided: start with a single Worker)

- ✅ Code: all 10 files in place
- ✅ systemd: zhixi-worker.service enabled and active
- ✅ pip dependencies: 28+ packages, all installed

Starting concurrency on the 8-core / 32 GB server:

```text
Document import concurrency             1
Embedding batch                         50~100 chunks
OCR concurrency                         1~2
Multimodal concurrency                  1
Candidate knowledge point generation    1~2
```

Reason: document parsing, embedding, OCR, multimodal calls, and Qdrant upserts can all consume CPU, memory, or network. Get it stable first, then raise concurrency.

### 2. Scaling conditions

```text
CPU stays below 50% over the long run
Memory stays below 60% over the long run
The job queue is clearly backing up
```

When the conditions above hold → scale to 2~3 Worker processes. With multiple Workers, an atomic database update prevents the same job from being processed twice:

```sql
UPDATE document_imports
SET status = 'CLAIMED', workerId = ?, heartbeatAt = NOW()
WHERE id = ?
  AND status = 'QUEUED';
```
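
The claim only succeeds for the Worker whose UPDATE actually changed a row, so the Worker must check the affected-row count. The sketch below demonstrates the pattern with Python's built-in `sqlite3` purely so that it runs anywhere; production uses MySQL, and `CURRENT_TIMESTAMP` stands in for `NOW()` here.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, job_id: str, worker_id: str) -> bool:
    """Atomically move a job from QUEUED to CLAIMED; True only for the winning Worker."""
    cursor = conn.execute(
        "UPDATE document_imports "
        "SET status = 'CLAIMED', workerId = ?, heartbeatAt = CURRENT_TIMESTAMP "
        "WHERE id = ? AND status = 'QUEUED'",
        (worker_id, job_id),
    )
    conn.commit()
    # rowcount is 1 if this Worker changed the row, 0 if another Worker got there first.
    return cursor.rowcount == 1
```

A Worker that gets `False` simply moves on to the next queued job; no locks or extra coordination are needed because the `status = 'QUEUED'` condition is evaluated inside the same statement as the update.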

### 3. Retry strategy per stage

| Stage | Retries | Backoff | Behavior after failure |
|-------|---------|---------|------------------------|
| COS download | 3 | Exponential backoff 1s/4s/16s | FAILED_RETRYABLE |
| OCR API | 2 | Fixed 2s | Fall back to multimodal |
| Multimodal API | 2 | Fixed 2s | FAILED_RETRYABLE |
| Embedding batch | 2 | Fixed 2s | Retry only that batch |
| Qdrant upsert | 2 | Fixed 2s | Roll back that batch |
| Worker crash | n/a | Heartbeat timeout, 5 min | Automatically requeued |
| DeepSeek call | 2 | Fixed 1s | FAILED_RETRYABLE |
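
The 1s/4s/16s schedule corresponds to a base of 1 second multiplied by 4 on each retry. A minimal sketch of both the schedule and a generic retry wrapper follows; the helper names are hypothetical, and a fixed schedule is just a list of equal delays such as `[2.0, 2.0]`.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base_s: float = 1.0, factor: float = 4.0) -> list[float]:
    """Exponential schedule: base, base*factor, base*factor^2, ..."""
    return [base_s * factor ** attempt for attempt in range(retries)]


def call_with_retry(
    fn: Callable[[], T],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then once more after each delay for as long as it keeps failing."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt; its exception propagates to the caller
```

With `backoff_delays(3)` the wrapper makes one initial attempt plus three retries; what happens after the last failure (FAILED_RETRYABLE, falling back to multimodal, and so on) is decided by the caller per the table.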

### 4. Maximum retries

```text
maxRetries = 3
Exceeded → FAILED_FINAL
  → notify the user (iOS parse-failure page)
  → raise a backend alert
```

---

## Part 15: Deletion and Cleanup Strategy (decided) ✅ Prisma schema already includes soft delete + retention-day configuration

### 1. Deleting a single KnowledgeSource

```text
source.deletedAt = now()
chunks.deletedAt = now()
Qdrant points → deleted = true (no physical deletion)
COS original file → enters the cleanup queue (purged after 7 days)
KnowledgeItem → kept by default
```

Knowledge points the user has confirmed, edited, or studied should not be lost because the original file was deleted. KnowledgeItem records the state of its source:

```text
knowledge_items.sourceDeleted = true
knowledge_items.sourceTitleSnapshot = title of the original material
knowledge_items.sourceSnippetSnapshot = originally quoted snippet
```

This way, even after the original material is deleted, the knowledge point page can still show "This knowledge point comes from deleted material".

### 2. Two options when a user deletes a source

```text
Default: delete only the original material, keep confirmed knowledge points
Advanced: also delete the knowledge points generated from that material
```

### 3. Deleting an entire KnowledgeBase (unlike deleting a single source)

```text
knowledgeBase.deletedAt = now()
→ cascade soft delete to all sources / chunks / candidates / items / review cards / learning records under that KB
→ mark Qdrant points deleted = true
→ clean up COS asynchronously
```

The user is deleting an entire learning space, so the deletion cascades.

### 4. Background physical cleanup (daily, early hours)

All retention periods are configurable through environment variables:

```text
DATA_RETENTION_DAYS=30              ← physical deletion of ordinary business data
SOURCE_PURGE_DAYS=7                 ← physical deletion of COS original files
QDRANT_PURGE_DAYS=7                 ← physical deletion of Qdrant points with deleted=true
COS_PURGE_DAYS=7                    ← physical deletion of COS files marked as deleted
AI_USAGE_LOG_RETENTION_DAYS=180     ← AI cost logs are kept longer
ADMIN_AUDIT_RETENTION_DAYS=365      ← audit logs are kept longer
```

Rules:

```text
Ordinary business data is physically deleted 30 days after soft delete
COS original files are physically deleted 7 days after soft delete
Qdrant points are physically deleted 7 days after deleted=true
AI cost logs are kept for 180 days
Admin operation audit logs are kept for 365 days
User learning records / payment records must not be purged on the same fast schedule as ordinary sources
```
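
A sketch of how the nightly cleanup job might read these settings and decide whether a soft-deleted record is due for physical deletion. The helper names are hypothetical; the variable names and defaults are the ones listed above.

```python
import os
from datetime import datetime, timedelta
from typing import Mapping, Optional

DEFAULT_RETENTION_DAYS = {
    "DATA_RETENTION_DAYS": 30,
    "SOURCE_PURGE_DAYS": 7,
    "QDRANT_PURGE_DAYS": 7,
    "COS_PURGE_DAYS": 7,
    "AI_USAGE_LOG_RETENTION_DAYS": 180,
    "ADMIN_AUDIT_RETENTION_DAYS": 365,
}


def retention_days(name: str, env: Mapping[str, str] = os.environ) -> int:
    """Read a retention setting from the environment, falling back to the documented default."""
    return int(env.get(name, DEFAULT_RETENTION_DAYS[name]))


def is_purge_due(deleted_at: Optional[datetime], now: datetime, days: int) -> bool:
    """A record is purged only if it was soft-deleted at least `days` days ago."""
    return deleted_at is not None and now - deleted_at >= timedelta(days=days)
```

Records that were never soft-deleted (`deleted_at is None`) are never eligible, which keeps the purge job from touching live data even if a query is written too broadly.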
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
2026-05-20 18:10:44 +08:00
|
|
|
|
## 十六、备份与灾备策略(已拍板)⏳ 表已建,备份脚本待写
|
2026-05-19 22:54:43 +08:00
|
|
|
|
|
|
|
|
|
|
### 1. MySQL 备份
|
|
|
|
|
|
|
|
|
|
|
|
```text
|
|
|
|
|
|
优先级最高:MySQL > Qdrant > COS
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
理由:Qdrant 可重建,COS 有原文件,但 MySQL 丢了业务关系、用户数据、学习记录会非常麻烦。
|
|
|
|
|
|
|
|
|
|
|
|
```text
|
|
|
|
|
|
每日凌晨备份 → /data/backups/mysql/zhixi_{yyyy-mm-dd}.sql.gz
|
|
|
|
|
|
备份后上传 COS → system/backups/mysql/{yyyy-mm-dd}/zhixi.sql.gz
|
|
|
|
|
|
本地保留 7 天
|
|
|
|
|
|
COS 保留 30 天
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### 2. Qdrant 备份
|
|
|
|
|
|
|
|
|
|
|
|
```text
|
|
|
|
|
|
每日凌晨 3 点生成 snapshot → /data/backups/qdrant/
|
|
|
|
|
|
生成后上传 COS → system/backups/qdrant/{yyyy-mm-dd}/zhixi_chunks.snapshot
|
|
|
|
|
|
本地保留 7 天
|
|
|
|
|
|
COS 保留 30 天
|
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### 3. 备份任务记录表
|
|
|
|
|
|
|
|
|
|
|
|
```text
|
|
|
|
|
|
backup_jobs
|
|
|
|
|
|
- id
|
|
|
|
|
|
- type ← QDRANT / MYSQL
|
|
|
|
|
|
- status ← RUNNING / COMPLETED / FAILED
|
|
|
|
|
|
- localPath
|
|
|
|
|
|
- cosObjectKey
|
|
|
|
|
|
- fileSizeBytes
|
|
|
|
|
|
- startedAt
|
|
|
|
|
|
- completedAt
|
|
|
|
|
|
- errorMessage
|
|
|
|
|
|
- createdAt
|
|
|
|
|
|
```

### 4. Recovery Dependency Chain

```text
COS original files + MySQL metadata + Qdrant snapshots → the three together guarantee recoverability
```

---

## 17. Document Version Management ✅ Reserved fields are in the schema; phase one does not implement full version management

Phase one does not implement full version management, but reserves these fields:

```text
KnowledgeSource.version = 1
KnowledgeSource.parentSourceId = nullable
KnowledgeSource.replacedBySourceId = nullable
```

Current behavior: when a user re-uploads a file with the same name, it is processed as a new, independent source.

Planned upgrade: once the new version is ready, mark the old version's Qdrant points deleted=true and keep the old version's COS file for 7 days.
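
The planned upgrade could look roughly like the sketch below. It is not implemented in phase one; `qdrantDeleted` and `cosPurgeAfter` are hypothetical fields added only to illustrate the hand-over, while `version`, `parentSourceId`, and `replacedBySourceId` are the reserved fields listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

OLD_VERSION_COS_KEEP_DAYS = 7  # old COS file is kept for 7 days


@dataclass
class SourceVersion:
    id: str
    version: int = 1
    parentSourceId: Optional[str] = None
    replacedBySourceId: Optional[str] = None
    qdrantDeleted: bool = False                # hypothetical: mirrors deleted=true in Qdrant
    cosPurgeAfter: Optional[datetime] = None   # hypothetical: when the COS file may be purged


def supersede(old: SourceVersion, new_id: str, now: datetime) -> SourceVersion:
    """Link a ready new version to the old one and retire the old one."""
    new = SourceVersion(id=new_id, version=old.version + 1, parentSourceId=old.id)
    old.replacedBySourceId = new.id
    old.qdrantDeleted = True
    old.cosPurgeAfter = now + timedelta(days=OLD_VERSION_COS_KEEP_DAYS)
    return new
```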

---

## 18. Admin Console ⏳ To be developed

The admin console must show:

```text
User list
Knowledge base list
File list
DocumentImport jobs (filterable by status)
Failed jobs (with error reason)
OCR call records (by user, by time)
Multimodal call records
DeepSeek call records
AI cost (aggregated by provider)
Qdrant index status (collection size, point count)
Top N high-cost users
Membership quota usage
Feedback records
Admin operation audit
```

The core goal is **cost visibility and anomaly detection**.

---

## 19. Quota System ⏳ Tables created (membership_plans + quota_usage), check logic to be implemented

### 1. Dimensions That Must Be Controlled

```text
Number of knowledge bases
Total storage
Single-file size
Files uploaded per month
Regular parsed pages per month
OCR pages per month
Multimodal pages per month
Knowledge base chats per month
AI analyses per month
Active recall sessions per month
Review card generations per month
```

### 2. Checkpoints (before every call)

```text
Before upload                     → storage + count
Before parsing                    → pages
Before OCR                        → OCR pages
Before multimodal                 → multimodal pages
Before knowledge item generation  → AI analyses
Before knowledge base chat        → chats
Before AI diagnosis               → AI analyses
Before review card generation     → review cards
```

### 3. Recommended Initial Quotas (decided: configurable, adjustable at any time)

| Dimension | Free user | Pro user |
|------|---------|---------|
| Knowledge bases | 3 | 30 |
| Total storage | 100 MB | 5 GB |
| Single-file size | 20 MB | 100 MB |
| OCR pages per month | 20 | 500 |
| Multimodal pages per month | 5 | 100 |
| Knowledge base chats per month | 20 | 1000 |
| AI diagnoses per month | 20 | 500 |

### 4. Pricing Strategy

```text
Prices are not hard-coded; the backend only recognizes planId + quotaConfig
Phase one uses 28 CNY/month as the Pro preset
Finalize price and quotas after one month of real cost data
Key metrics to record: DeepSeek token cost / SiliconFlow vision cost / Baidu OCR pages / COS storage / Qdrant growth / high-cost user behavior
```
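
Because the backend only recognizes planId + quotaConfig, the pre-call check can stay a small lookup. The sketch below assumes usage counters are read elsewhere (for example from quota_usage) and passed in; the dimension names and the `check_quota` helper are hypothetical, and the limits are copied from the initial quota table.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical quotaConfig per planId; limits copied from the initial quota table.
# Dimension names are illustrative, not the real column names.
QUOTA_CONFIG = {
    "free": {
        "knowledgeBases": 3, "storageBytes": 100 * MB, "maxFileBytes": 20 * MB,
        "ocrPages": 20, "multimodalPages": 5, "kbChats": 20, "aiDiagnoses": 20,
    },
    "pro": {
        "knowledgeBases": 30, "storageBytes": 5 * GB, "maxFileBytes": 100 * MB,
        "ocrPages": 500, "multimodalPages": 100, "kbChats": 1000, "aiDiagnoses": 500,
    },
}


class QuotaExceeded(Exception):
    """Raised when a request would push usage past the plan limit."""


def check_quota(plan_id: str, dimension: str, used: int, requested: int = 1) -> int:
    """Return the remaining allowance after the request, or raise QuotaExceeded."""
    limit = QUOTA_CONFIG[plan_id][dimension]
    if used + requested > limit:
        raise QuotaExceeded(f"{dimension}: {used} + {requested} exceeds {limit}")
    return limit - used - requested
```

A caller would run `check_quota(plan_id, "ocrPages", used, requested=page_count)` before sending pages to OCR, and treat `QuotaExceeded` as the trigger for the quota notice on iOS. Changing a limit then means editing configuration, not code.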

---

## 20. Core API Design ✅ Internal RAG API (7 endpoints) + KnowledgeSource + ImportCandidate implemented

### 1. File Upload

```http
POST /api/files/upload-url
```

Response:

```json
{
  "fileId": "file_xxx",
  "uploadUrl": "...",
  "objectKey": "...",
  "headers": {},
  "duplicateOf": null
}
```

If the sha256 matches an existing file, `duplicateOf` is set to that file's fileId and the iOS app notifies the user.
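
Duplicate detection comes down to hashing the file and looking the digest up. The sketch below assumes the digest is available when the upload URL is requested and uses a plain dictionary as a stand-in for the database lookup; both helper names are hypothetical, and only the two response fields relevant to deduplication are shown.

```python
import hashlib
from typing import BinaryIO, Dict, Optional


def sha256_of_stream(stream: BinaryIO, block_size: int = 1024 * 1024) -> str:
    """Hash a file-like object in blocks so large uploads are not loaded at once."""
    digest = hashlib.sha256()
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def build_upload_response(file_id: str, sha256: str, known_files: Dict[str, str]) -> dict:
    """known_files maps sha256 -> existing fileId (a stand-in for a database lookup)."""
    duplicate_of: Optional[str] = known_files.get(sha256)
    return {"fileId": file_id, "duplicateOf": duplicate_of}
```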

### 2. Knowledge Base CRUD

```http
POST /api/knowledge-bases
GET /api/knowledge-bases
GET /api/knowledge-bases/:id
PATCH /api/knowledge-bases/:id
DELETE /api/knowledge-bases/:id
```

### 3. Sources

```http
POST /api/knowledge-bases/:id/sources
GET /api/knowledge-bases/:id/sources
GET /api/knowledge-sources/:sourceId
DELETE /api/knowledge-sources/:sourceId
```

Creating a source automatically creates a KnowledgeSource and a DocumentImport.

### 4. Import Jobs

```http
GET /api/document-imports/:id
GET /api/knowledge-sources/:sourceId/imports/latest
POST /api/document-imports/:id/retry
POST /api/document-imports/:id/cancel
```

The iOS app uses these endpoints to show import progress (queued → parsing → indexing → generating knowledge items → awaiting confirmation → completed / failed, retryable).

### 5. Worker Internal API

```http
GET /internal/rag/jobs/next
POST /internal/rag/jobs/:id/heartbeat
POST /internal/rag/jobs/:id/result
POST /internal/rag/jobs/:id/fail
```
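
One polling iteration against these four endpoints might be structured as follows. This is a sketch under assumptions: `RagJobApi` is a hypothetical client whose methods each wrap one endpoint, the job payload is assumed to carry an `id` key, and a real Worker would also send heartbeats periodically during long processing rather than once.

```python
from typing import Any, Callable, Dict, Optional, Protocol


class RagJobApi(Protocol):
    """Hypothetical client; each method wraps one of the internal endpoints."""

    def next_job(self) -> Optional[Dict[str, Any]]: ...   # GET  /internal/rag/jobs/next
    def heartbeat(self, job_id: str) -> None: ...          # POST /internal/rag/jobs/:id/heartbeat
    def result(self, job_id: str, payload: Dict[str, Any]) -> None: ...  # POST /internal/rag/jobs/:id/result
    def fail(self, job_id: str, message: str) -> None: ...  # POST /internal/rag/jobs/:id/fail


def run_once(api: RagJobApi, process: Callable[[Dict[str, Any]], Dict[str, Any]]) -> str:
    """Claim at most one job, process it, and report the outcome."""
    job = api.next_job()
    if job is None:
        return "idle"
    job_id = job["id"]  # assumed key in the job payload
    api.heartbeat(job_id)
    try:
        payload = process(job)
    except Exception as exc:
        # Report the failure so the import can be retried from the app.
        api.fail(job_id, str(exc))
        return "failed"
    api.result(job_id, payload)
    return "completed"
```

Passing the client and the processing function in as arguments keeps the loop testable without a running backend.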

### 6. Candidate Knowledge Items

```http
GET /api/knowledge-sources/:sourceId/import-candidates
PATCH /api/import-candidates/:id
POST /api/import-candidates/:id/accept
POST /api/import-candidates/:id/reject
POST /api/import-candidates/batch-accept
```

A KnowledgeItem is created once the user confirms a candidate.

### 7. Confirmed Knowledge Items

```http
GET /api/knowledge-bases/:id/items
GET /api/knowledge-items/:id
POST /api/knowledge-items ← manual creation
PATCH /api/knowledge-items/:id
DELETE /api/knowledge-items/:id
```

### 8. Knowledge Base Chat

```http
POST /api/knowledge-bases/:id/chat
GET /api/knowledge-bases/:id/chat-sessions
GET /api/knowledge-chat-sessions/:id/messages
```

Response:

```json
{
  "answer": "...",
  "citations": [
    {
      "sourceId": "src_xxx",
      "chunkId": "chunk_xxx",
      "title": "Source title",
      "snippet": "Quoted snippet",
      "pageNumber": 3
    }
  ],
  "suggestedActions": [
    "CREATE_KNOWLEDGE_ITEM",
    "GENERATE_ACTIVE_RECALL",
    "ADD_TO_FOCUS_ITEM"
  ]
}
```

### 9. Single-File Learning

```http
POST /api/knowledge-sources/:sourceId/prepare-learning
GET /api/knowledge-sources/:sourceId/learning-view
POST /api/knowledge-items/:id/active-recall
POST /api/active-recall-answers/:id/analyze
GET /api/knowledge-items/:id/review-cards
```

Flow:

```text
source → ImportCandidate → KnowledgeItem
→ ActiveRecall → AIAnalysis → FocusItem → ReviewCard
```

---

## 21. Knowledge Base Main Flow 🔶 Indexing pipeline code complete (import_pipeline.py), full system run pending

### 1. Indexing Flow

```text
iOS uploads the file to COS
→ Backend creates File (with sha256 duplicate detection)
→ Backend creates KnowledgeSource
→ Create DocumentImport (status = QUEUED)
→ Worker claims the job (QUEUED → CLAIMED)
→ Worker pulls the file from COS to /data/tmp/imports/{jobId}
→ Local parsing / OCR / multimodal
→ Write parsed.md to COS
→ Clean the text
→ chunking (512 tokens + 64 overlap)
→ embedding (bge-m3, batch 50~100)
→ Qdrant upsert
→ Save KnowledgeChunk to MySQL
→ source.indexStatus = INDEXED
→ import.status = COMPLETED
```
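
The chunking step (512 tokens + 64 overlap) is easiest to see as a sliding window. The sketch below is a deliberately simplified stand-in: it operates on an already tokenized list and ignores the recursive splitting and Chinese sentence-boundary protection that the plan assigns to chunker.py, so it illustrates only the size and overlap arithmetic.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int = 512, overlap: int = 64) -> List[List[T]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens later
    chunks: List[List[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks
```

With the default parameters each window advances by 448 tokens, so a 1000-token document yields three chunks, the last one shorter than the others.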

### 2. Learning Flow

```text
User opens the file
→ DeepSeek generates ImportCandidates (max 30)
→ User confirms / edits / rejects
→ KnowledgeItem is created
→ User performs active recall
→ DeepSeek diagnosis (thinking mode)
→ FocusItem is created (item to reinforce)
→ ReviewCard is created (review card)
→ LearningActivity is recorded
```

---

## 22. Pages Required on iOS ⏳ Design complete, iOS implementation pending

### Knowledge Base

```text
Knowledge base list page
Knowledge base detail page
Create knowledge base page
Upload source page
Source list page
Source detail page
Import progress page (steps + progress bar)
Parse failure page (error reason + retry button)
Candidate confirmation page (batch accept / reject / edit)
Knowledge item detail page
Knowledge base chat page
```

### Learning

```text
Single-file learning home page
Active recall input page
AI analysis result page
Focus item page
Review card page
Learning history page
```

### Quota

```text
Quota overview page
Membership upgrade page
OCR / multimodal quota notice
File too large notice
Insufficient parsing quota notice
Duplicate file notice
```

---

## 23. Execution Order (status update)

### Phase 1: Server foundation ✅ Done
### Phase 2: Base data model ✅ Done (all 33 tables created)
### Phase 3: File import loop ✅ Worker deployed and running, polling works
### Phase 4: RAG indexing ✅ Code complete (chunker/embedder/indexer), end-to-end verification pending
### Phase 5: AI learning features ⏳ ImportCandidate module done, end-to-end run pending
### Phase 6: Single-file learning ⏳ Core modules built (ActiveRecall/AIAnalysis/FocusItem/Review), integration pending
### Phase 7: Knowledge base enhancements ⏳ To be developed
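
---

## 24. Final Implementation Principles

```text
COS stores original files                     → pulled temporarily before retrieval, never kept locally
Qdrant stores the vector index                → single node, 1024d Cosine, deleted flag instead of immediate physical deletion
MySQL stores business state                   → Prisma ORM, soft delete + background cleanup
DeepSeek handles core text intelligence       → Flash for daily use + thinking for diagnosis + Pro for high-value tasks
SiliconFlow provides tool models              → embedding / rerank / multimodal
Baidu OCR handles ordinary scanned text       → complex pages go to Qwen3-VL
Chunking: 512 tokens + 64 overlap             → recursive splitting + Chinese sentence-boundary protection
Candidate knowledge items: max 30 per source  → never auto-accepted, the user confirms
Deletion: soft delete + cooldown (7 days for COS and Qdrant, 30 days for MySQL business data) → scheduled background physical cleanup
Every cost goes through the quota system      → check quota before every call
```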

---

## 25. Summary of All Decisions (implementation status)

| # | Decision | Final decision | Status |
|---|--------|---------|------|
| 1 | Chunk size | 512 tokens | ✅ Implemented in chunker.py |
| 2 | Overlap | 64 tokens (~12%) | ✅ Implemented in chunker.py |
| 3 | Chunking strategy | Recursive character splitting + Chinese sentence-boundary protection | ✅ Implemented in chunker.py |
| 4 | Embedding model | BAAI/bge-m3 via SiliconFlow | ✅ embedder.py + key configured |
| 5 | Vector dimension | 1024 | ✅ Qdrant collection created |
| 6 | Qdrant distance | Cosine | ✅ Configured |
| 7 | Qdrant deployment | Single-node Docker, 1 shard | ✅ Deployed and running |
| 8 | When to cluster Qdrant | Evaluate after 1 million points | ⏳ Long term |
| 9 | Rerank model | BAAI/bge-reranker-v2-m3 | ✅ Key configured, code pending |
| 10 | RAG recall | Top-50 ANN → rerank → Top-5~8 | ✅ Implemented in indexer.py |
| 11 | Knowledge base chat | Retrieval restricted to the KB | ⏳ To be developed |
| 12 | Multi-turn chat | Save session + last 3 turns of context | ⏳ To be developed |
| 13 | Candidate knowledge item count | Max 30, min 3 | ✅ Implemented in candidate_generator.py |
| 14 | Auto-accept | None; all stay PENDING until confirmed | ✅ ImportCandidate module implemented |
| 15 | OCR | Baidu OCR + Qwen3-VL | ✅ Baidu OCR AppID 7767914 |
| 16 | Multimodal fallback | Qwen3-VL-32B-Thinking | ✅ Key configured |
| 17 | Delete source → KI | Keep by default + sourceDeleted snapshot | ✅ Included in Prisma schema |
| 18 | Delete KB → all objects | Cascade delete | ⏳ To be implemented |
| 19 | Qdrant snapshots | Local + synced to COS | ⏳ Backup script pending |
| 20 | Qdrant local snapshot retention | 7 days | ⏳ |
| 21 | Qdrant COS snapshot retention | 30 days | ⏳ |
| 22 | MySQL backup | Daily early morning + synced to COS | ⏳ Backup script pending |
| 23 | COS file cleanup | 7 days after soft delete | ⏳ |
| 24 | MySQL physical deletion | 30 days by default | ⏳ Cleanup script pending |
| 25 | AI cost log retention | 180 days | ✅ AiUsageLog table created |
| 26 | Audit log retention | 365 days | ✅ AdminAuditLog table created |
| 27 | Pro pricing | 28 CNY/month preset, configurable | ✅ MembershipPlan table created |
| 28 | Worker process count | Start with a single Worker | ✅ systemd zhixi-worker running |
| 29 | Worker scaling | 2~3 Workers once load increases | ⏳ Long term |
| 30 | Document version management | Reserved version field | ✅ Reserved in schema |
| 31 | Duplicate files | sha256 detection + user notice | ✅ UploadedFile sha256 implemented |
| 32 | Language split | Node=API+Gateway, Python=RAG | ✅ Followed |
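

Decision 10 (Top-50 ANN → rerank → Top-5~8) describes a two-stage retrieval. The sketch below illustrates only the shape of that pipeline: a brute-force cosine ranking stands in for the Qdrant ANN search, and the `rerank` callable stands in for a call to bge-reranker-v2-m3. All names here are hypothetical and this is not the code in indexer.py.

```python
import math
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float], str]  # (chunkId, vector, text)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_then_rerank(
    query_vec: Sequence[float],
    candidates: List[Candidate],
    rerank: Callable[[str], float],
    ann_k: int = 50,
    final_k: int = 8,
) -> List[str]:
    """Stage 1 keeps the ann_k nearest chunks; stage 2 reorders them by rerank score."""
    shortlist = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:ann_k]
    reranked = sorted(shortlist, key=lambda c: rerank(c[2]), reverse=True)
    return [chunk_id for chunk_id, _, _ in reranked[:final_k]]
```

The point of the split is cost: the vector stage is cheap and wide, while the reranker is more expensive and only sees the shortlist.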

---

From here, proceed in this order: **deploy the server environment first, then build the data model, then implement upload and import jobs, then connect Qdrant, and finally wire up the learning loop and knowledge base chat.**