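# API-AI-R01: resolveSnapshot concurrency race
## Basic information
| Field | Value |
|------|-----|
| Issue ID | API-AI-R01 |
| Type | Non-blocking / optimization |
| Repository | api-server |
| Related issue | API-AI-016 (Snapshot Builder) |
| Date found | 2026-06-17 |
| Priority | P2 |
## Problem description
`resolveSnapshot()` in `runtime-internal.service.ts` has a concurrency race window.
### Scenario
Two Runtime instances call `getSnapshot()` on the same job at the same time, and the job currently has no valid snapshot (either not yet generated or already expired):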
```
Timeline →
Instance A                          Instance B
│                                   │
├─ resolveSnapshot(job)             │
│   snapshotId=null → else branch   │
│                                   ├─ resolveSnapshot(job)
│                                   │   snapshotId=null → else branch
│                                   │
├─ buildSnapshot() → snap-A         │
│                                   ├─ buildSnapshot() → snap-B
│                                   │
├─ job.update(snapshotId=A)         │
│                                   ├─ job.update(snapshotId=B)  ← overwrites A
```
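The interleaving above can be reproduced with a small in-memory simulation. This is only a sketch: the `Job` and `Snapshot` types, the `snapshots` array, and `resolveSnapshotUnsafe` are hypothetical stand-ins for the Prisma models and the service method, and `await Promise.resolve()` stands in for the slow aggregation query.

```typescript
// Hypothetical in-memory stand-ins for the job row and the snapshot table.
type Job = { id: string; snapshotId: string | null };
type Snapshot = { id: string };

const snapshots: Snapshot[] = [];
let nextId = 0;

// Stand-in for buildSnapshot(): yields once, as a real database query would.
async function buildSnapshot(): Promise<Snapshot> {
  await Promise.resolve();
  const snap: Snapshot = { id: `snap-${++nextId}` };
  snapshots.push(snap);
  return snap;
}

// Unprotected read-check-write, mirroring the race in the timeline.
async function resolveSnapshotUnsafe(job: Job): Promise<Snapshot> {
  if (job.snapshotId === null) {
    // Both callers pass the check before either one writes.
    const snap = await buildSnapshot();
    job.snapshotId = snap.id; // last writer wins
    return snap;
  }
  return snapshots.find((s) => s.id === job.snapshotId)!;
}

// Two "instances" resolve the same job concurrently.
async function simulateRace(): Promise<{ job: Job; orphans: Snapshot[] }> {
  const job: Job = { id: "job-1", snapshotId: null };
  await Promise.all([resolveSnapshotUnsafe(job), resolveSnapshotUnsafe(job)]);
  const orphans = snapshots.filter((s) => s.id !== job.snapshotId);
  return { job, orphans };
}
```

Running `simulateRace()` builds two snapshots while the job ends up referencing only one of them, leaving the other as the orphan described in this issue.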
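### Consequences
1. An orphan snap-A row is left in the database (no job references it).
2. One full aggregation query (`buildSnapshot`) is wasted.
3. snap-A is cleaned up automatically once its 24h TTL expires.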
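To make the orphan and TTL behaviour concrete, here is a minimal sketch of how such rows could be identified. The row shapes and the helper names `findOrphans` and `findExpired` are assumptions for illustration; they are not taken from the actual cleanup code.

```typescript
// Hypothetical row shapes; the real Prisma models may carry more fields.
type SnapshotRow = { id: string; expiresAt: Date | null };
type JobRow = { id: string; snapshotId: string | null };

// Snapshots that no job points at (snap-A after being overwritten).
function findOrphans(snapshots: SnapshotRow[], jobs: JobRow[]): SnapshotRow[] {
  const referenced = new Set(jobs.map((j) => j.snapshotId));
  return snapshots.filter((s) => !referenced.has(s.id));
}

// Snapshots whose TTL has elapsed and that a cleanup pass could delete.
function findExpired(snapshots: SnapshotRow[], now: Date): SnapshotRow[] {
  return snapshots.filter(
    (s) => s.expiresAt !== null && s.expiresAt.getTime() < now.getTime(),
  );
}
```

An orphan is harmless until it expires: it only occupies a row, and a TTL-based cleanup removes it without needing to know that it was ever orphaned.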
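### Why the current impact is acceptable
- No data is lost and no error is returned.
- Snapshot construction is idempotent, so the two results are highly consistent.
- The trigger condition is strict: two Runtime instances must poll the same job at the same time (the lock taken during poll greatly reduces the likelihood).
- Even when it happens, the extra cost is only one aggregation query.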
## Suggested fix
Approach: take a pessimistic lock on the job row before checking the snapshot state.
```typescript
// Proposed shape of resolveSnapshot:
private async resolveSnapshot(job) {
  // Lock the job row first (SELECT ... FOR UPDATE).
  // Prisma's findUnique cannot issue FOR UPDATE, so the real implementation
  // needs $queryRaw inside a transaction; findUnique is a placeholder here.
  const locked = await this.prisma.aiRuntimeJob.findUnique({
    where: { id: job.id },
  });

  // Reuse the snapshot if the job already points at one that has not expired.
  if (locked.snapshotId) {
    const existing = await this.prisma.learningAnalysisSnapshot.findUnique({
      where: { id: locked.snapshotId },
    });
    if (existing && (!existing.expiresAt || new Date(existing.expiresAt) >= new Date())) {
      return existing;
    }
  }

  // Otherwise build a new snapshot and attach it to the job.
  const snapshot = await this.snapshotBuilder.buildSnapshot(...);
  await this.prisma.aiRuntimeJob.update({
    where: { id: job.id },
    data: { snapshotId: snapshot.id },
  });
  return snapshot;
}
```
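Alternatively, wrap the read-check-write logic in `$transaction` and rely on the database isolation level for protection. This only helps at an isolation level that detects the conflicting write (for example Serializable); a plain transaction at Read Committed still lets both instances read `snapshotId=null` and build.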
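The effect of serializing the read-check-write step can be illustrated with an in-memory sketch. The `withRowLock` helper, the `snapshotTable` map, and the `builds` counter are hypothetical; the in-process promise chain only mimics what a database row lock would do, and it would not protect across separate Runtime processes, which is why the real fix needs the lock in the database.

```typescript
// Hypothetical in-memory stand-ins for the job row and the snapshot table.
type Job = { id: string; snapshotId: string | null };
type Snapshot = { id: string; expiresAt: Date | null };

const snapshotTable = new Map<string, Snapshot>();
const rowLocks = new Map<string, Promise<unknown>>();
let builds = 0;

// Run `critical` only after every earlier holder of the same key has finished.
function withRowLock<T>(key: string, critical: () => Promise<T>): Promise<T> {
  const previous: Promise<unknown> = rowLocks.get(key) ?? Promise.resolve();
  const run = previous.then(critical);
  // Store a never-rejecting tail so one failure does not block later callers.
  rowLocks.set(key, run.catch(() => undefined));
  return run;
}

// Stand-in for buildSnapshot(): yields once, as a real database query would.
async function buildSnapshot(): Promise<Snapshot> {
  await Promise.resolve();
  builds += 1;
  const snap: Snapshot = { id: `snap-${builds}`, expiresAt: null };
  snapshotTable.set(snap.id, snap);
  return snap;
}

// Read-check-write executed entirely while holding the per-job lock.
async function resolveSnapshot(job: Job, now: Date = new Date()): Promise<Snapshot> {
  return withRowLock(job.id, async () => {
    if (job.snapshotId) {
      const existing = snapshotTable.get(job.snapshotId);
      if (existing && (!existing.expiresAt || existing.expiresAt.getTime() >= now.getTime())) {
        return existing; // a concurrent caller already attached a valid snapshot
      }
    }
    const snapshot = await buildSnapshot();
    job.snapshotId = snapshot.id;
    return snapshot;
  });
}
```

With the lock in place, two concurrent `resolveSnapshot(job)` calls trigger a single build: the second caller waits, then finds the snapshot the first one attached and returns it.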
## Related files
- `src/modules/ai-runtime/internal/runtime-internal.service.ts:resolveSnapshot()`
- `src/modules/ai-runtime/snapshot-builder.service.ts:buildSnapshot()`