mdcraft.ai Benchmark Suite

Goal

Create a repeatable benchmark set that determines whether the MVP is genuinely better than common free converters and manual copy-paste workflows.

Run the benchmark before launch and after any major rendering or extraction change.

Target size

  • 20 to 30 total documents
  • Split across forward conversion and reverse conversion
  • Include both everyday examples and failure-prone edge cases

Benchmark categories

A. Markdown -> PDF (+ shared HTML export checks)

  1. Developer documentation
    • API tables
    • code fences with long lines
    • Mermaid diagrams
    • nested lists
    • footnotes
  2. AI-generated product documents
    • PRDs
    • strategy memos
    • meeting summaries
    • callouts and task lists
  3. Consulting and business reports
    • title page
    • section dividers
    • images
    • quotes
    • executive summary layouts
  4. Student and educator documents
    • math
    • citations
    • lecture notes
    • dense headings
    • print-focused page counts
  5. Layout stress tests
    • wide tables
    • side-by-side image cases
    • code-plus-table on the same page
    • long TOCs

B. PDF -> Markdown

  1. Text-first reports
  2. Whitepapers with headings and lists
  3. Documentation exports with code blocks
  4. Moderate table-heavy PDFs
  5. A few known-bad layout-heavy PDFs used to test graceful failure

Quality rubric

Each benchmark document should be scored on a 1-5 scale across these dimensions.

Forward conversion rubric

  1. Visual quality
    • typography looks polished
    • spacing feels deliberate
    • hierarchy is obvious
  2. Layout stability
    • no broken page breaks
    • no clipped tables or code
    • images remain aligned
  3. Syntax fidelity
    • headings, lists, tables, code, Mermaid, math, and footnotes render correctly
  4. Preview/export parity
    • preview and final PDF/HTML match closely
  5. Share-readiness
    • output looks safe to send externally without extra cleanup

Reverse conversion rubric

  1. Structural accuracy
    • headings, lists, tables, quotes, and code fences are reconstructed correctly
  2. Readability
    • markdown is easy to read and edit
  3. Cleanup burden
    • low manual cleanup required for text-first documents
  4. Failure clarity
    • ambiguous extraction is surfaced for review instead of silently mangling content
  5. Reusability
    • output is genuinely useful for editors, repos, AI tools, or docs systems
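To keep scoring consistent across reviewers, the two rubrics can be captured in a small shared type. This is a hypothetical sketch, not the repository's actual schema; the interface and field names are assumptions chosen to mirror the dimensions above:

```typescript
// Hypothetical score record for one benchmark document.
// Dimension names mirror the rubrics above; 1 = poor, 5 = excellent.
type Score = 1 | 2 | 3 | 4 | 5;

interface ForwardScores {
  visualQuality: Score;
  layoutStability: Score;
  syntaxFidelity: Score;
  previewExportParity: Score;
  shareReadiness: Score;
}

interface ReverseScores {
  structuralAccuracy: Score;
  readability: Score;
  cleanupBurden: Score;
  failureClarity: Score;
  reusability: Score;
}

// Average a single rubric dimension across the scored corpus.
function averageDimension<T>(rows: T[], pick: (row: T) => number): number {
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, row) => sum + pick(row), 0);
  return total / rows.length;
}
```

Keeping each dimension as its own field (rather than a free-form map) makes it impossible to forget a dimension when scoring a document.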

Minimum pass bar for MVP

Markdown -> PDF (+ shared HTML export checks)

  • Average score of 4 or better in visual quality and syntax fidelity
  • No catastrophic failures on code, tables, images, Mermaid, or math in the benchmark set
  • At least 80 percent of benchmark docs rated "ready to share" without manual restyling

PDF -> Markdown

  • At least 65 percent of text-first PDFs must convert into usable markdown with only light cleanup
  • Known-bad layout-heavy samples must fail gracefully and clearly
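The pass bar above can be checked mechanically. A minimal sketch of the threshold logic, assuming scores have already been aggregated; the function and field names here are illustrative, not the repository's actual gate implementation:

```typescript
// Hypothetical corpus-level summary of one benchmark run.
interface BenchmarkSummary {
  forwardVisualQualityAvg: number;     // 1-5 average across forward docs
  forwardSyntaxFidelityAvg: number;    // 1-5 average across forward docs
  forwardShareReadyRatio: number;      // 0-1 fraction rated "ready to share"
  forwardCatastrophicFailures: number; // count of catastrophic failures
  reverseUsableTextFirstRatio: number; // 0-1 fraction of text-first PDFs usable
}

// Returns the violated release criteria; an empty list means the gate passes.
function gateFailures(s: BenchmarkSummary): string[] {
  const failures: string[] = [];
  if (s.forwardVisualQualityAvg < 4) failures.push("visual quality average below 4");
  if (s.forwardSyntaxFidelityAvg < 4) failures.push("syntax fidelity average below 4");
  if (s.forwardCatastrophicFailures > 0) failures.push("catastrophic forward failures present");
  if (s.forwardShareReadyRatio < 0.8) failures.push("fewer than 80% of docs ready to share");
  if (s.reverseUsableTextFirstRatio < 0.65) failures.push("fewer than 65% of text-first PDFs usable");
  return failures;
}
```

Returning the full list of violated criteria, rather than a single boolean, gives the release log a concrete reason for every blocked ship.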

Reference competitors to compare against

  • Pandoc-based export workflow
  • Typora export
  • RenderMark
  • markdown-to-pdf.org
  • naive copy-paste from browser or PDF

Test execution process

  1. Run each input through mdcraft.ai
  2. Run the same input through baseline competitors
  3. Score outputs using the rubric
  4. Capture screenshots or output files for visual comparison
  5. Log issues by category: tables, code, pagination, math, Mermaid, extraction
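Step 5 can be backed by a tiny tally helper so per-category counts fall out of the issue log automatically. A sketch under the assumption that issues are logged as simple records; the category union and field names are assumptions:

```typescript
// Hypothetical issue log entry produced while scoring a document.
// Categories match step 5 of the execution process.
type IssueCategory = "tables" | "code" | "pagination" | "math" | "mermaid" | "extraction";

interface Issue {
  doc: string;             // benchmark document id
  category: IssueCategory; // which failure bucket it falls into
  note: string;            // short human-readable description
}

// Count issues per category to see where failures cluster.
function tallyIssues(issues: Issue[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const issue of issues) {
    counts[issue.category] = (counts[issue.category] ?? 0) + 1;
  }
  return counts;
}
```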

Files to collect next

  • 6 markdown-heavy technical docs
  • 5 AI-generated markdown docs
  • 4 consulting-style reports
  • 4 student notes or math-heavy docs
  • 5 PDFs with varying complexity

Release rule

If mdcraft.ai is not clearly better on polish and at least comparable on correctness, do not expand scope. Fix the benchmark failures first.

Automation notes

  • The repository includes npm run test:benchmark.
  • The runner executes the full benchmark corpus and writes JSON results to handoff/benchmark-results/latest.json.
  • The report includes:
    • forward and reverse score summaries
    • corpus coverage by fixture category
    • issue category counts (for example: tables, lists, code, mermaid, math, layout, ocr)
  • Use issue category counts to prioritize rendering and extraction fixes each sprint.
  • Run npm run test:benchmark:gate to enforce benchmark release thresholds.
  • Run npm run test:benchmark:golden to refresh forward-fixture visual goldens in handoff/benchmark-goldens.
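Since the runner writes JSON to handoff/benchmark-results/latest.json, sprint prioritization can be a small transform over the issue-category counts. A sketch that sorts categories from most to least frequent; the report interface shown is an assumption about the file's layout, not its confirmed schema:

```typescript
import { readFileSync } from "node:fs";

// Assumed shape of the issue-count section of latest.json.
interface BenchmarkReport {
  issueCounts: Record<string, number>; // e.g. { tables: 7, mermaid: 2 }
}

// Return issue categories sorted from most to least frequent,
// i.e. a suggested order for rendering/extraction fixes next sprint.
function prioritize(report: BenchmarkReport): string[] {
  return Object.entries(report.issueCounts)
    .sort(([, a], [, b]) => b - a)
    .map(([category]) => category);
}

// Example usage against the path from the automation notes above:
// const report = JSON.parse(
//   readFileSync("handoff/benchmark-results/latest.json", "utf8"),
// ) as BenchmarkReport;
// console.log(prioritize(report));
```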