mdcraft.ai Benchmark Suite

Goal

Create a repeatable benchmark set that determines whether the MVP is genuinely better than common free converters and manual copy-paste workflows.

Run the benchmark before launch and after any major rendering or extraction change.

Target size

  • 20 to 30 total documents
  • Split across forward conversion and reverse conversion
  • Include both everyday examples and failure-prone edge cases

Benchmark categories

A. Markdown -> PDF (+ shared HTML export checks)

  1. Developer documentation
    • API tables
    • code fences with long lines
    • Mermaid diagrams
    • nested lists
    • footnotes
  2. AI-generated product documents
    • PRDs
    • strategy memos
    • meeting summaries
    • callouts and task lists
  3. Consulting and business reports
    • title page
    • section dividers
    • images
    • quotes
    • executive summary layouts
  4. Student and educator documents
    • math
    • citations
    • lecture notes
    • dense headings
    • print-focused page counts
  5. Layout stress tests
    • wide tables
    • side-by-side image cases
    • code-plus-table on the same page
    • long TOCs

B. PDF -> Markdown

  1. Text-first reports
  2. Whitepapers with headings and lists
  3. Documentation exports with code blocks
  4. Moderate table-heavy PDFs
  5. A few known-bad layout-heavy PDFs used to test graceful failure

Quality rubric

Each benchmark document should be scored on a 1-5 scale across these dimensions.

Forward conversion rubric

  1. Visual quality
    • typography looks polished
    • spacing feels deliberate
    • hierarchy is obvious
  2. Layout stability
    • no broken page breaks
    • no clipped tables or code
    • images remain aligned
  3. Syntax fidelity
    • headings, lists, tables, code, Mermaid, math, and footnotes render correctly
  4. Preview/export parity
    • preview and final PDF/HTML match closely
  5. Share-readiness
    • output looks safe to send externally without extra cleanup

Reverse conversion rubric

  1. Structural accuracy
    • headings, lists, tables, quotes, and code fences are reconstructed correctly
  2. Readability
    • markdown is easy to read and edit
  3. Cleanup burden
    • low manual cleanup required for text-first documents
  4. Failure clarity
    • ambiguous extraction is surfaced for review instead of silently mangling content
  5. Reusability
    • output is genuinely useful for editors, repos, AI tools, or docs systems
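To keep scoring consistent across reviewers, the two rubrics can be captured in a small shared type. This is a hypothetical sketch, not the repository's actual schema; the interface and field names are assumptions chosen to mirror the dimensions above:

```typescript
// Hypothetical score record for one benchmark document.
// Dimension names mirror the rubrics above; 1 = poor, 5 = excellent.
type Score = 1 | 2 | 3 | 4 | 5;

interface ForwardScores {
  visualQuality: Score;
  layoutStability: Score;
  syntaxFidelity: Score;
  previewExportParity: Score;
  shareReadiness: Score;
}

interface ReverseScores {
  structuralAccuracy: Score;
  readability: Score;
  cleanupBurden: Score;
  failureClarity: Score;
  reusability: Score;
}

// Average a single rubric dimension across the scored corpus.
function averageDimension<T>(rows: T[], pick: (row: T) => number): number {
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, row) => sum + pick(row), 0);
  return total / rows.length;
}
```

Keeping each dimension as its own field (rather than a free-form map) makes it impossible to forget a dimension when scoring a document.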

Minimum pass bar for MVP

Markdown -> PDF (+ shared HTML export checks)

  • Average score of 4 or better in visual quality and syntax fidelity
  • No catastrophic failures on code, tables, images, Mermaid, or math in the benchmark set
  • At least 80 percent of benchmark docs rated "ready to share" without manual restyling

PDF -> Markdown

  • At least 65 percent of text-first PDFs must convert into usable markdown with only light cleanup
  • Known-bad layout-heavy samples must fail gracefully and clearly
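The pass bar above can be checked mechanically. A minimal sketch of the threshold logic, assuming scores have already been aggregated; the function and field names here are illustrative, not the repository's actual gate implementation:

```typescript
// Hypothetical corpus-level summary of one benchmark run.
interface BenchmarkSummary {
  forwardVisualQualityAvg: number;     // 1-5 average across forward docs
  forwardSyntaxFidelityAvg: number;    // 1-5 average across forward docs
  forwardShareReadyRatio: number;      // 0-1 fraction rated "ready to share"
  forwardCatastrophicFailures: number; // count of catastrophic failures
  reverseUsableTextFirstRatio: number; // 0-1 fraction of text-first PDFs usable
}

// Returns the violated release criteria; an empty list means the gate passes.
function gateFailures(s: BenchmarkSummary): string[] {
  const failures: string[] = [];
  if (s.forwardVisualQualityAvg < 4) failures.push("visual quality average below 4");
  if (s.forwardSyntaxFidelityAvg < 4) failures.push("syntax fidelity average below 4");
  if (s.forwardCatastrophicFailures > 0) failures.push("catastrophic forward failures present");
  if (s.forwardShareReadyRatio < 0.8) failures.push("fewer than 80% of docs ready to share");
  if (s.reverseUsableTextFirstRatio < 0.65) failures.push("fewer than 65% of text-first PDFs usable");
  return failures;
}
```

Returning the full list of violated criteria, rather than a single boolean, gives the release log a concrete reason for every blocked ship.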

Reference competitors to compare against

  • Pandoc-based export workflow
  • Typora export
  • RenderMark
  • markdown-to-pdf.org
  • naive copy-paste from browser or PDF

Test execution process

  1. Run each input through mdcraft.ai
  2. Run the same input through baseline competitors
  3. Score outputs using the rubric
  4. Capture screenshots or output files for visual comparison
  5. Log issues by category: tables, code, pagination, math, Mermaid, extraction
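Step 5 can be backed by a tiny tally helper so per-category counts fall out of the issue log automatically. A sketch under the assumption that issues are logged as simple records; the category union and field names are assumptions:

```typescript
// Hypothetical issue log entry produced while scoring a document.
// Categories match step 5 of the execution process.
type IssueCategory = "tables" | "code" | "pagination" | "math" | "mermaid" | "extraction";

interface Issue {
  doc: string;             // benchmark document id
  category: IssueCategory; // which failure bucket it falls into
  note: string;            // short human-readable description
}

// Count issues per category to see where failures cluster.
function tallyIssues(issues: Issue[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const issue of issues) {
    counts[issue.category] = (counts[issue.category] ?? 0) + 1;
  }
  return counts;
}
```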

Files to collect next

  • 6 markdown-heavy technical docs
  • 5 AI-generated markdown docs
  • 4 consulting-style reports
  • 4 student notes or math-heavy docs
  • 5 PDFs with varying complexity

Release rule

If mdcraft.ai is not clearly better on polish and at least comparable on correctness, do not expand scope. Fix the benchmark failures first.

Automation notes

  • The repository includes npm run test:benchmark.
  • The runner executes the full benchmark corpus and writes JSON results to handoff/benchmark-results/latest.json.
  • The report includes:
    • forward and reverse score summaries
    • corpus coverage by fixture category
    • issue category counts (for example: tables, lists, code, mermaid, math, layout, ocr)
  • Use issue category counts to prioritize rendering and extraction fixes each sprint.
  • Run npm run test:benchmark:gate to enforce benchmark release thresholds.
  • Run npm run test:benchmark:golden to refresh forward-fixture visual goldens in handoff/benchmark-goldens.
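Since the runner writes JSON to handoff/benchmark-results/latest.json, sprint prioritization can be a small transform over the issue-category counts. A sketch that sorts categories from most to least frequent; the report interface shown is an assumption about the file's layout, not its confirmed schema:

```typescript
import { readFileSync } from "node:fs";

// Assumed shape of the issue-count section of latest.json.
interface BenchmarkReport {
  issueCounts: Record<string, number>; // e.g. { tables: 7, mermaid: 2 }
}

// Return issue categories sorted from most to least frequent,
// i.e. a suggested order for rendering/extraction fixes next sprint.
function prioritize(report: BenchmarkReport): string[] {
  return Object.entries(report.issueCounts)
    .sort(([, a], [, b]) => b - a)
    .map(([category]) => category);
}

// Example usage against the path from the automation notes above:
// const report = JSON.parse(
//   readFileSync("handoff/benchmark-results/latest.json", "utf8"),
// ) as BenchmarkReport;
// console.log(prioritize(report));
```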