feat: AI Job state machine and task scheduling design (API-AI-002)
Some checks failed
Deploy API Server / build-and-deploy (push) Has been cancelled
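Defines 5 job types, 7 states, the full state transition diagram, the database fields, the concurrency-safe locking mechanism, the retryable/non-retryable classification, timeout release, idempotency rules, and the poll scheduling strategy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>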
parent eea9e3e7c6
commit 045e0b2501

233  docs/ai-job-state-machine.md  (Normal file)
@@ -0,0 +1,233 @@

# AI Job State Machine and Task Scheduling Design

## 1. Job Types

| jobType | Description | Input | Output |
|---------|-------------|-------|--------|
| `learning_state_analysis` | Learning state analysis | Snapshot | AiLearningAnalysis |
| `weak_point_analysis` | Weak point analysis | Snapshot | WeakPointCandidate[] |
| `next_action_planning` | Next-step recommendations | Snapshot | NextActionRecommendation[] |
| `quiz_generation` | Quiz question candidate generation | Snapshot + params | QuizQuestion[] |
| `flashcard_generation` | Flashcard candidate generation | Snapshot + params | Flashcard[] |

## 2. State Definitions

| State | Meaning | Entry condition | Exit condition |
|-------|---------|-----------------|----------------|
| `pending` | Waiting to be consumed | Created by the API, or returned after a retryable failure | Locked by a Runtime |
| `locked` | Acquired by a Runtime | Runtime `POST /lock` succeeds | lockUntil lapses → expired / Runtime starts execution |
| `running` | Executing | Runtime starts execution (heartbeat or implicit) | Execution completes → succeeded/failed |
| `succeeded` | Execution succeeded | API finishes processing `POST /result` | Terminal |
| `failed` | Execution failed | Non-retryable error, or retryCount has reached maxRetryCount | Terminal (unless an Admin reruns it) |
| `cancelled` | Cancelled | User/Admin cancels a pending job | Terminal |
| `expired` | Timed out | lockUntil lapses without a heartbeat, or execution times out | Can be polled again by a Runtime (retryable) |

## 3. State Transitions

```
                 ┌──────────┐
                 │ pending  │ ←─────────────────────┐
                 └────┬─────┘                       │
                      │                             │
                  POST /lock                        │
                      │                             │
                 ┌────▼─────┐                       │
            ┌───→│  locked  │──→ expired ───────────┤
            │    └────┬─────┘  (lockUntil timeout)  │
            │         │                             │
            │     heartbeat                         │
            │         │                             │
            │    ┌────▼─────┐                       │
            │    │ running  │──→ expired ───────────┘
            │    └────┬─────┘  (timeoutSeconds timeout)
            │         │
    ┌───────┼─────────┼──────────┐
    │       │         │          │
succeeded failed    failed   cancelled
(result)  (non-     (retry-  (user/Admin
          retry)    able)    cancels pending)
                      │
                      └──→ pending (retryCount++)
```

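The transitions above can be written down as a lookup table, which makes illegal moves easy to reject in one place. The following is a minimal sketch, not the project's implementation: `AiJobStatus`, `ALLOWED_TRANSITIONS`, `canTransition`, and `isTerminal` are hypothetical names, and the `expired → failed` edge is an assumption inferred from the retry limit in section 6.2.

```typescript
// Hypothetical names throughout; the table is read off the state definitions above.
type AiJobStatus =
  | "pending"
  | "locked"
  | "running"
  | "succeeded"
  | "failed"
  | "cancelled"
  | "expired";

// For each state, the states it may move to directly.
const ALLOWED_TRANSITIONS: Record<AiJobStatus, AiJobStatus[]> = {
  pending: ["locked", "cancelled"],
  locked: ["running", "expired"],
  // "pending" covers a retryable failure (retryCount++).
  running: ["succeeded", "failed", "pending", "expired"],
  // Assumption: an expired job with no retries left ends in "failed".
  expired: ["pending", "failed"],
  succeeded: [],
  failed: [],
  cancelled: [],
};

function canTransition(from: AiJobStatus, to: AiJobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// A state with no outgoing edges is terminal.
function isTerminal(status: AiJobStatus): boolean {
  return ALLOWED_TRANSITIONS[status].length === 0;
}
```

A caller would check `canTransition(job.status, next)` before writing the new status and reject the request otherwise.
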
## 4. Database Fields

```prisma
model AiRuntimeJob {
  id                  String   @id @default(cuid())
  userId              String
  jobType             String   // learning_state_analysis | weak_point_analysis | next_action_planning | quiz_generation | flashcard_generation
  targetType          String   // user | material | knowledge_point
  targetId            String
  snapshotId          String?
  status              String   @default("pending") // pending | locked | running | succeeded | failed | cancelled | expired
  priority            Int      @default(0) // 0 = highest
  idempotencyKey      String?  @unique
  apiKeyMode          String   @default("platform_key") // platform_key | user_deepseek_key
  credentialId        String?
  modelProvider       String   @default("deepseek")
  modelName           String   @default("deepseek-chat")
  promptVersion       String?
  outputSchemaVersion String?
  attemptNo           Int      @default(0)
  retriedFromJobId    String?

  // Locking
  lockedBy            String?  // runtimeInstanceId
  lockedAt            DateTime?
  lockUntil           DateTime?

  // Timestamps
  startedAt           DateTime?
  finishedAt          DateTime?

  // Retries
  retryCount          Int      @default(0)
  maxRetryCount       Int      @default(3)
  timeoutSeconds      Int      @default(120)

  // Errors
  errorCode           String?
  errorMessage        String?

  createdAt           DateTime @default(now())
  updatedAt           DateTime @updatedAt

  result              AiRuntimeResult?

  @@index([status])
  @@index([jobType])
  @@index([userId])
  @@index([targetType, targetId])
  @@index([lockUntil])
}
```

## 5. Locking Mechanism

### 5.1 Lock Flow

```
Runtime POST /internal/runtime/jobs/{jobId}/lock
  → API checks job.status === pending
  → API checks job.lockUntil is null or < now (not held by another Runtime)
  → API sets lockedBy, lockedAt, lockUntil=now+60s, status=locked
  → returns lockUntil
```

### 5.2 Concurrency Protection

A row-level conditional write in the database guarantees that only one Runtime acquires the lock:
- `UPDATE ... WHERE status='pending' AND (lockUntil IS NULL OR lockUntil < NOW())`
- If the number of affected rows is 0, the lock attempt failed (JOB_ALREADY_LOCKED)

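The conditional `UPDATE` above behaves like a compare-and-set: the write only happens if the row still matches the `WHERE` clause, and the affected-row count tells the caller who won. The sketch below imitates that with an in-memory array instead of a real database, purely to show the decision logic. `LockableJobRow`, `tryLockRow`, and `lockJob` are hypothetical names, and timestamps are plain epoch milliseconds for simplicity.

```typescript
// In-memory stand-in for the AiRuntimeJob row; only the fields the lock touches.
interface LockableJobRow {
  id: string;
  status: string;
  lockedBy: string | null;
  lockedAt: number | null; // epoch milliseconds
  lockUntil: number | null; // epoch milliseconds
}

const LOCK_LEASE_MS = 60 * 1000; // lockUntil = now + 60s, per section 5.1

// Mimics: UPDATE ... SET status='locked', ...
//         WHERE id = ? AND status = 'pending'
//           AND (lockUntil IS NULL OR lockUntil < NOW())
// Returns the number of affected rows, as a SQL driver would.
function tryLockRow(
  rows: LockableJobRow[],
  jobId: string,
  runtimeInstanceId: string,
  nowMs: number
): number {
  let affected = 0;
  for (const row of rows) {
    const free = row.lockUntil === null || row.lockUntil < nowMs;
    if (row.id === jobId && row.status === "pending" && free) {
      row.status = "locked";
      row.lockedBy = runtimeInstanceId;
      row.lockedAt = nowMs;
      row.lockUntil = nowMs + LOCK_LEASE_MS;
      affected += 1;
    }
  }
  return affected;
}

// Zero affected rows means another Runtime got there first.
function lockJob(
  rows: LockableJobRow[],
  jobId: string,
  runtimeInstanceId: string,
  nowMs: number
): "LOCKED" | "JOB_ALREADY_LOCKED" {
  return tryLockRow(rows, jobId, runtimeInstanceId, nowMs) === 1
    ? "LOCKED"
    : "JOB_ALREADY_LOCKED";
}
```

In a real database the check and the write are a single statement, so two Runtimes racing on the same job cannot both see one affected row; the in-memory loop only illustrates the condition being tested.
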
### 5.3 Heartbeat

```
Runtime POST /internal/runtime/jobs/{jobId}/heartbeat
  → API checks lockedBy === runtimeInstanceId
  → API extends lockUntil = now + 60s
  → 204 No Content
```

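A heartbeat handler following the flow above might look like the sketch below. It is illustrative only: `HeartbeatJobRow` and `handleHeartbeat` are hypothetical names, the 409 response for a caller that does not own the lock is an assumption (the flow only specifies the 204 success case), and the `locked → running` switch reflects the "heartbeat or implicit" entry condition in the state table.

```typescript
interface HeartbeatJobRow {
  id: string;
  status: string;
  lockedBy: string | null;
  lockUntil: number | null; // epoch milliseconds
}

const HEARTBEAT_LEASE_MS = 60 * 1000; // lockUntil = now + 60s

// Returns the HTTP status the handler would send.
// 204 comes from the flow above; 409 for a foreign caller is an assumption.
function handleHeartbeat(
  job: HeartbeatJobRow,
  runtimeInstanceId: string,
  nowMs: number
): 204 | 409 {
  if (job.lockedBy !== runtimeInstanceId) {
    return 409; // the caller does not hold this lock
  }
  job.lockUntil = nowMs + HEARTBEAT_LEASE_MS;
  if (job.status === "locked") {
    job.status = "running"; // the first heartbeat marks the start of execution
  }
  return 204;
}
```
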
### 5.4 Timeout Release

Once `lockUntil` has passed:
- The original Runtime's lock is no longer valid
- The job's status becomes `expired`
- Another Runtime can acquire it again on a later poll (retryable)
- If retryCount < maxRetryCount, the job automatically returns to `pending`

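One way to apply the release rules above is a periodic sweep over held jobs. The sketch below is an assumption about how that could be done, not something the design prescribes: `ExpirableJobRow` and `releaseExpiredLocks` are hypothetical names, the `expired` state is treated as a transient step and collapsed into the follow-up status, and jobs that have used all retries go to `failed` per section 6.2.

```typescript
interface ExpirableJobRow {
  id: string;
  status: string;
  lockedBy: string | null;
  lockUntil: number | null; // epoch milliseconds
  retryCount: number;
  maxRetryCount: number;
}

// One pass of a hypothetical sweeper. A locked or running job whose lease has
// lapsed passes through "expired" and lands in "pending" (retryCount++) while
// retries remain, otherwise in "failed". Returns the ids that were released.
function releaseExpiredLocks(rows: ExpirableJobRow[], nowMs: number): string[] {
  const released: string[] = [];
  for (const row of rows) {
    const held = row.status === "locked" || row.status === "running";
    if (!held || row.lockUntil === null || row.lockUntil >= nowMs) {
      continue; // not held, or the lease is still valid
    }
    row.lockedBy = null; // the original Runtime's lock is no longer valid
    row.lockUntil = null;
    if (row.retryCount < row.maxRetryCount) {
      row.retryCount += 1;
      row.status = "pending";
    } else {
      row.status = "failed";
    }
    released.push(row.id);
  }
  return released;
}
```
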
## 6. Retry Strategy

### 6.1 Retry Triggers

| Scenario | Handling |
|----------|----------|
| Runtime submits a retryable failure | job → pending, retryCount++ |
| No heartbeat after lock, lock times out | job → expired → pending, retryCount++ |
| Runtime execution times out | job → expired → pending, retryCount++ |

### 6.2 Retry Limit

- `retryCount >= maxRetryCount`: job → failed (terminal)
- `maxRetryCount` defaults to 3 and is configurable
- An Admin can manually rerun a failed job (a new job is created and retriedFromJobId is recorded)

### 6.3 retryable vs non-retryable

| Error type | retryable | Example |
|------------|-----------|---------|
| MODEL_TIMEOUT | true | DeepSeek timeout |
| MODEL_RATE_LIMIT | true | Rate limited |
| NETWORK_ERROR | true | Network interruption |
| TEMPORARY_PROVIDER_ERROR | true | 5xx |
| INVALID_SNAPSHOT | false | Malformed snapshot structure |
| INVALID_SCHEMA | false | Output schema mismatch |
| INVALID_CREDENTIAL | false | Invalid key |
| JOB_TIMEOUT | true | Execution timeout |

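The classification table and the retry limit from section 6.2 combine into a single decision. The sketch below shows one way to express it; `RETRYABLE_BY_ERROR_CODE`, `RetryCounters`, and `nextStatusAfterFailure` are hypothetical names, and treating an unlisted error code as non-retryable is an assumption the design does not state.

```typescript
// Error codes and retryable flags copied from the table above.
const RETRYABLE_BY_ERROR_CODE: Record<string, boolean> = {
  MODEL_TIMEOUT: true,
  MODEL_RATE_LIMIT: true,
  NETWORK_ERROR: true,
  TEMPORARY_PROVIDER_ERROR: true,
  JOB_TIMEOUT: true,
  INVALID_SNAPSHOT: false,
  INVALID_SCHEMA: false,
  INVALID_CREDENTIAL: false,
};

interface RetryCounters {
  retryCount: number;
  maxRetryCount: number;
}

interface FailureOutcome {
  status: "pending" | "failed";
  retryCount: number;
}

// Decide where a job goes after a failed attempt.
// Assumption: error codes missing from the table are non-retryable.
function nextStatusAfterFailure(errorCode: string, job: RetryCounters): FailureOutcome {
  const retryable = RETRYABLE_BY_ERROR_CODE[errorCode] === true;
  if (retryable && job.retryCount < job.maxRetryCount) {
    return { status: "pending", retryCount: job.retryCount + 1 };
  }
  // Non-retryable, or retryCount >= maxRetryCount: terminal failure.
  return { status: "failed", retryCount: job.retryCount };
}
```

For example, a `MODEL_RATE_LIMIT` failure on a job with `retryCount = 2` and `maxRetryCount = 3` goes back to `pending` with `retryCount = 3`; the next retryable failure on that job ends in `failed`.
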
## 7. Timeouts

| Timeout | Default | Description |
|---------|---------|-------------|
| lockUntil | 60s | The lock is released automatically if no heartbeat arrives after locking |
| timeoutSeconds | 120s | Total execution timeout |
| Heartbeat interval | Decided by the Runtime | 15-30s recommended |

## 8. Idempotency

### 8.1 Job Creation Idempotency

`idempotencyKey` has a unique index: a job with the same `userId + jobType + targetType + targetId + idempotencyKey` is not created twice. If no idempotencyKey is supplied, duplicate creation is allowed.

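The creation rule above amounts to "look up before insert". A minimal sketch follows, with an in-memory array standing in for the table; `CreateJobInput`, `StoredJob`, and `createJobIdempotent` are hypothetical names and the generated `id` format is illustrative. Note that the schema in section 4 marks `idempotencyKey` as `@unique` on its own, so a real insert that reuses a key with different `userId`/`jobType`/target fields would hit the unique constraint rather than create a second row.

```typescript
interface CreateJobInput {
  userId: string;
  jobType: string;
  targetType: string;
  targetId: string;
  idempotencyKey?: string;
}

interface StoredJob extends CreateJobInput {
  id: string;
  status: string;
}

// Returns the existing job when the same five-field combination was already
// used; otherwise creates a new pending job. Without a key, always creates.
function createJobIdempotent(
  store: StoredJob[],
  input: CreateJobInput
): { job: StoredJob; created: boolean } {
  if (input.idempotencyKey !== undefined) {
    for (const existing of store) {
      const sameJob =
        existing.idempotencyKey === input.idempotencyKey &&
        existing.userId === input.userId &&
        existing.jobType === input.jobType &&
        existing.targetType === input.targetType &&
        existing.targetId === input.targetId;
      if (sameJob) {
        return { job: existing, created: false };
      }
    }
  }
  const job: StoredJob = {
    ...input,
    id: "job-" + (store.length + 1), // illustrative id, not cuid()
    status: "pending",
  };
  store.push(job);
  return { job, created: true };
}
```
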
### 8.2 Result Submission Idempotency

```
resultIdempotencyKey = jobId + ":" + attemptNo + ":" + outputHash
```

- Resubmission with the same key: return 200 (idempotent, nothing is persisted again)
- A succeeded result already exists but the outputHash differs: return 409 RESULT_ALREADY_EXISTS

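The two rules above can be sketched as a small decision function. This is illustrative only: `StoredResult`, `ResultDecision`, and `decideResultSubmission` are hypothetical names; returning 200 for the very first submission is an assumption, as is answering 409 for any key mismatch other than a differing `outputHash` (for example, a different `attemptNo` with the same hash), which the design does not cover.

```typescript
interface StoredResult {
  jobId: string;
  attemptNo: number;
  outputHash: string;
}

interface ResultDecision {
  httpStatus: 200 | 409;
  persist: boolean; // whether the result should be written to the database
  code: string | null;
}

function resultIdempotencyKey(jobId: string, attemptNo: number, outputHash: string): string {
  return jobId + ":" + attemptNo + ":" + outputHash;
}

// `existing` is the succeeded result already stored for the job, or null.
function decideResultSubmission(
  existing: StoredResult | null,
  incoming: StoredResult
): ResultDecision {
  if (existing === null) {
    // Assumption: the first submission is accepted with 200.
    return { httpStatus: 200, persist: true, code: null };
  }
  const existingKey = resultIdempotencyKey(existing.jobId, existing.attemptNo, existing.outputHash);
  const incomingKey = resultIdempotencyKey(incoming.jobId, incoming.attemptNo, incoming.outputHash);
  if (existingKey === incomingKey) {
    return { httpStatus: 200, persist: false, code: null }; // idempotent replay
  }
  // Assumption: every other mismatch is treated like a differing outputHash.
  return { httpStatus: 409, persist: false, code: "RESULT_ALREADY_EXISTS" };
}
```
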
### 8.3 Admin Rerun

An Admin rerun creates a new job and records `retriedFromJobId`; the old job is not reused.

## 9. Cancelled / Expired

| State | Consumable by a Runtime? | Handling |
|-------|--------------------------|----------|
| cancelled | No | Set directly by the API; excluded from poll results |
| expired | Yes (if retryable) | Becomes expired automatically once lockUntil lapses; returns to pending when retryable |

When a user revokes AI authorization:
- All pending jobs → cancelled
- All running jobs → cancelRequested (the Runtime learns of it on its next heartbeat)

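The revocation rules above can be sketched as a single pass over a user's jobs. This is an assumption-laden illustration: `CancellableJobRow` and `onAiAuthorizationRevoked` are hypothetical names, and `cancelRequested` is modelled as a boolean flag even though the schema in section 4 has no such column. Jobs in `locked` are left untouched because the design does not say what happens to them.

```typescript
interface CancellableJobRow {
  id: string;
  userId: string;
  status: string;
  cancelRequested: boolean; // assumed flag, not part of the schema in section 4
}

// Apply the revocation rules to every job owned by `userId`.
function onAiAuthorizationRevoked(rows: CancellableJobRow[], userId: string): void {
  for (const row of rows) {
    if (row.userId !== userId) {
      continue;
    }
    if (row.status === "pending") {
      row.status = "cancelled"; // never handed to a Runtime
    } else if (row.status === "running") {
      row.cancelRequested = true; // the Runtime is expected to notice and stop
    }
    // Other states (including "locked") are not covered by the design.
  }
}
```
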
## 10. Task Scheduling

### Poll Rules

```
POST /internal/runtime/jobs/poll
  → returns jobs with status=pending
  → ordered by priority ASC, createdAt ASC
  → only returns jobTypes that the Runtime's capabilities support
  → limit is at most 50
```

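The poll rules above reduce to a filter, a two-key sort, and a capped slice. The sketch below applies them to an in-memory list; `PollableJobRow` and `pollJobs` are hypothetical names, `createdAt` is an epoch-millisecond number for simplicity, and clamping a negative `limit` to 0 is an assumption.

```typescript
interface PollableJobRow {
  id: string;
  jobType: string;
  status: string;
  priority: number; // 0 = highest
  createdAt: number; // epoch milliseconds
}

const MAX_POLL_LIMIT = 50;

// Select the jobs a poll request would return.
function pollJobs(
  rows: PollableJobRow[],
  capabilities: string[],
  limit: number
): PollableJobRow[] {
  const capped = Math.max(0, Math.min(limit, MAX_POLL_LIMIT));
  return rows
    // filter() returns a new array, so the sort below does not reorder `rows`.
    .filter((row) => row.status === "pending" && capabilities.indexOf(row.jobType) !== -1)
    // priority ASC first, then createdAt ASC as the tie-breaker.
    .sort((a, b) => a.priority - b.priority || a.createdAt - b.createdAt)
    .slice(0, capped);
}
```

Because `priority` 0 is the highest, ascending order serves the most urgent jobs first, and the `createdAt` tie-breaker keeps jobs of equal priority in arrival order.
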
### When No Job Is Available

An empty array is returned. The Runtime waits for pollIntervalMs and then polls again.

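## 11. Acceptance Checklist

- [x] Produce the Job state machine design document
- [x] Specify the entry and exit conditions of every state
- [x] Specify how a Runtime locks a job (DB row-level write + lockUntil)
- [x] Specify how the lock is released after lockUntil lapses
- [x] Specify the retryCount / maxRetryCount rules
- [x] Specify how idempotencyKey prevents duplicates
- [x] Specify that an Admin can rerun a failed job
- [x] Specify that cancelled / expired jobs must not be consumed again by a Runtime (an expired job is only picked up again after it returns to `pending`)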