hwp2md


hwp2md converts between the Korean 한글 (Hangul) document formats, HWP 5.0 (binary OLE2) and HWPX (XML/ZIP), and CommonMark-compatible Markdown. Both formats can be read into Markdown; Markdown can be written back as HWPX only. It ships as a command-line tool and as a Rust library, so document conversion can be integrated into build pipelines, static-site generators, or document-management workflows that exchange content with the Korean public-sector ecosystem.

Features

  • HWP 5.0 binary format (OLE2/CFB container) to Markdown
  • HWPX (ZIP + XML) to Markdown
  • Markdown to HWPX (write path; binary HWP output is not yet supported)
  • Headings levels 1-6, paragraphs, bold, italic, underline, strikethrough, inline code
  • Superscript and subscript inline styles
  • Ordered and unordered lists with nested items
  • Tables (GFM pipe syntax; colspan/rowspan fall back to HTML)
  • Fenced code blocks with language annotation
  • Block quotes
  • Images with alt text; optional extraction to an assets directory
  • Footnotes ([^id] syntax)
  • Math expressions — HWP EqEdit equations converted to LaTeX ($...$ / $$...$$)
  • Document metadata (title, author, date) optionally emitted as YAML front matter
  • info subcommand for quick document inspection without conversion
  • YAML style templates for Markdown-to-HWPX output (interface defined; implementation in progress)
  • Structured intermediate representation (IR) exposed as a public library API
  • Release builds with LTO and symbol stripping for minimal binary size
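
The two input containers listed above can be told apart by their leading bytes: OLE2/CFB files start with an 8-byte signature, and ZIP archives start with `PK\x03\x04`. The following is a minimal sketch of such a check using only the standard library. The names `Container`, `sniff_bytes`, and `sniff_file` are hypothetical and are not part of the hwp2md API; how hwp2md itself selects a reader is not shown in this README.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Container formats distinguished by their leading bytes (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Container {
    /// OLE2 / Compound File Binary, the container used by HWP 5.0.
    Cfb,
    /// ZIP archive, the container used by HWPX.
    Zip,
    /// Anything else, including empty input.
    Unknown,
}

const CFB_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04]; // "PK\x03\x04"

/// Classify a buffer by its signature bytes.
pub fn sniff_bytes(head: &[u8]) -> Container {
    if head.starts_with(&CFB_MAGIC) {
        Container::Cfb
    } else if head.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else {
        Container::Unknown
    }
}

/// Read at most the first 8 bytes of a file and classify them.
pub fn sniff_file(path: &Path) -> io::Result<Container> {
    let mut head = Vec::with_capacity(8);
    File::open(path)?.take(8).read_to_end(&mut head)?;
    Ok(sniff_bytes(&head))
}
```

A signature match only identifies the container. A CFB file is not necessarily an HWP document, and a ZIP file is not necessarily HWPX, so a real reader still has to validate the contents.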

Installation

From crates.io

cargo install hwp2md

From source

git clone https://github.com/hephaex/hwp2md.git
cd hwp2md
cargo build --release
# Binary at: target/release/hwp2md

Minimum supported Rust version: 1.75.

CLI Usage

Convert HWP or HWPX to Markdown

# Write Markdown to stdout
hwp2md to-md report.hwp

# Write to a file
hwp2md to-md report.hwp -o report.md

# Extract embedded images alongside the Markdown
hwp2md to-md report.hwpx -o report.md --assets-dir ./images

# Include document metadata as YAML front matter
hwp2md to-md report.hwp -o report.md --frontmatter
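
To illustrate what a YAML front matter block built from document metadata can look like, here is a small standard-library sketch. The `Meta` struct, the field names `title`, `author`, and `date`, and the quoting strategy are assumptions made for this example. The exact keys and formatting that `--frontmatter` emits are not specified in this README.

```rust
/// Hypothetical metadata holder; not the hwp2md `ir::Metadata` type.
#[derive(Default)]
pub struct Meta {
    pub title: Option<String>,
    pub author: Option<String>,
    pub date: Option<String>,
}

/// Wrap a value in YAML double quotes, escaping backslash, quote, and newline.
fn yaml_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Render the fields that are present as a front matter block.
/// Returns an empty string when no field is set.
pub fn front_matter(meta: &Meta) -> String {
    let fields = [
        ("title", &meta.title),
        ("author", &meta.author),
        ("date", &meta.date),
    ];
    let mut body = String::new();
    for (key, value) in fields {
        if let Some(v) = value {
            body.push_str(&format!("{key}: {}\n", yaml_quote(v)));
        }
    }
    if body.is_empty() {
        String::new()
    } else {
        format!("---\n{body}---\n\n")
    }
}
```

Quoting every value is a conservative choice: titles containing colons, quotes, or leading special characters stay valid YAML without a per-character analysis.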

Convert Markdown to HWPX

# Output file defaults to input name with .hwpx extension
hwp2md to-hwpx draft.md

# Specify output path
hwp2md to-hwpx draft.md -o final.hwpx

# Apply a YAML style template (interface available; full implementation pending)
hwp2md to-hwpx draft.md -o final.hwpx --style corporate.yaml
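
The default output name described above (input name with a `.hwpx` extension) maps directly onto `Path::with_extension` from the standard library. The sketch below shows one way to resolve the output path. `resolve_output` is a hypothetical helper written for this example, not a function exported by hwp2md.

```rust
use std::path::{Path, PathBuf};

/// Pick the output path: an explicit `-o` value wins, otherwise the
/// input path with its last extension replaced by `hwpx`.
pub fn resolve_output(input: &Path, explicit: Option<&Path>) -> PathBuf {
    match explicit {
        Some(path) => path.to_path_buf(),
        None => input.with_extension("hwpx"),
    }
}
```

Note that `with_extension` replaces only the final extension, so `v1.draft.md` becomes `v1.draft.hwpx`, and a file with no extension simply gains `.hwpx`.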

Inspect a document without converting

hwp2md info report.hwp
# File: report.hwp
# Format: hwp
# Title: Annual Report 2025
# Author: Jane Doe
# Sections: 4
# Blocks: 87
# Characters: ~12430
# Assets: 6
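
Summary figures like the ones above can be computed in a single pass over a parsed document. The sketch below uses a simplified, hypothetical model (each block reduced to its text) rather than the real `ir` types, and it is only a guess at how such counts might be derived. The point it illustrates is that character counts for Korean text should use `chars().count()`, since each Hangul syllable occupies three bytes in UTF-8.

```rust
/// Simplified stand-in for a document section: each block is just its text.
pub struct Section {
    pub blocks: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub struct Summary {
    pub sections: usize,
    pub blocks: usize,
    pub characters: usize,
}

/// Count sections, blocks, and Unicode scalar values across all blocks.
pub fn summarize(sections: &[Section]) -> Summary {
    let blocks: usize = sections.iter().map(|s| s.blocks.len()).sum();
    let characters: usize = sections
        .iter()
        .flat_map(|s| &s.blocks)
        .map(|b| b.chars().count()) // chars, not bytes
        .sum();
    Summary {
        sections: sections.len(),
        blocks,
        characters,
    }
}
```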

Logging verbosity

The --log-level flag accepts any tracing filter string (default: info):

hwp2md --log-level debug to-md report.hwp
hwp2md --log-level warn  to-md report.hwp -o report.md

Library Usage

Add to Cargo.toml:

[dependencies]
hwp2md = "0.2"

Convert a file

use hwp2md::convert;

fn main() -> anyhow::Result<()> {
    // HWP or HWPX to Markdown (written to stdout when output is None)
    convert::to_markdown(
        "report.hwpx".as_ref(),
        Some("report.md".as_ref()),
        Some("assets/".as_ref()),
        true, // emit YAML front matter
    )?;

    // Markdown to HWPX
    convert::to_hwpx(
        "draft.md".as_ref(),
        Some("draft.hwpx".as_ref()),
        None, // style template
    )?;

    Ok(())
}

Work with the intermediate representation

use hwp2md::{hwpx, ir, md};

fn main() -> anyhow::Result<()> {
    // Parse a document into the IR
    let doc: ir::Document = hwpx::read_hwpx("report.hwpx".as_ref())?;

    // Inspect metadata
    if let Some(title) = &doc.metadata.title {
        println!("Title: {title}");
    }

    // Iterate blocks in the first section, if the document has one
    if let Some(section) = doc.sections.first() {
        for block in &section.blocks {
            if let ir::Block::Heading { level, inlines } = block {
                // Concatenate the text of every inline run in the heading
                let text: String = inlines.iter().map(|i| i.text.as_str()).collect();
                println!("H{level}: {text}");
            } else if let ir::Block::Table { rows, col_count } = block {
                println!("Table {col_count} cols x {} rows", rows.len());
            }
        }
    }

    // Render back to Markdown (second argument: emit YAML front matter)
    let markdown = md::write_markdown(&doc, false);
    println!("{markdown}");

    Ok(())
}

Parse Markdown into the IR

use hwp2md::md;

fn main() -> std::io::Result<()> {
    let source = std::fs::read_to_string("document.md")?;
    let doc = md::parse_markdown(&source);
    println!("{} sections", doc.sections.len());
    Ok(())
}

Format Support Matrix

| Feature | HWP 5.0 -> MD | HWPX -> MD | MD -> HWPX |
| --- | --- | --- | --- |
| Headings (H1-H6) | yes | yes | yes |
| Paragraphs | yes | yes | yes |
| Bold / Italic | yes | yes | yes |
| Underline | yes | yes | yes |
| Strikethrough | yes | yes | yes |
| Inline code | yes | yes | yes |
| Superscript / Subscript | yes | yes | yes |
| Hyperlinks | yes | yes | yes |
| Ordered lists | yes | yes | yes |
| Unordered lists | yes | yes | yes |
| Nested lists | yes | yes | yes |
| Tables | yes | yes | yes |
| Images (extract) | yes | yes | yes |
| Fenced code blocks | yes | yes | yes |
| Block quotes | yes | yes | yes |
| Footnotes | yes | yes | yes |
| Math (LaTeX) | yes | yes | yes |
| YAML front matter | yes | yes | n/a |
| Multi-column layout | flattened | flattened | n/a |
| Headers / footers | skipped | skipped | planned |
| DRM-protected HWP | no | no | n/a |
| MD -> HWP binary | n/a | n/a | no |

Architecture

HWP 5.0 (.hwp)  ──── hwp::read_hwp()   ──┐
                                           ├──> ir::Document ──> md::write_markdown() ──> Markdown
HWPX (.hwpx)    ──── hwpx::read_hwpx() ──┘
                                           ┌── ir::Document <── md::parse_markdown() <── Markdown
                                           └──> hwpx::write_hwpx() ──> HWPX (.hwpx)

The conversion pipeline is decoupled through a format-neutral intermediate representation (ir::Document). Every reader produces an ir::Document; every writer consumes one. This keeps format-specific code isolated and makes it straightforward to add new input or output formats in the future.
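
The reader/writer decoupling described above can be sketched with two traits and a shared document type. Everything in this block is illustrative: the `Reader` and `Writer` traits, the toy `Document`, and both implementations are hypothetical and far simpler than the crate's `ir::Document` and its format modules, which this README exposes as free functions such as `hwpx::read_hwpx` and `md::write_markdown`.

```rust
/// Toy IR: a flat list of blocks.
#[derive(Debug, PartialEq)]
pub enum Block {
    Heading { level: u8, text: String },
    Paragraph(String),
}

#[derive(Debug, Default, PartialEq)]
pub struct Document {
    pub blocks: Vec<Block>,
}

/// Every input format implements this and produces the shared IR.
pub trait Reader {
    fn read(&self, input: &str) -> Document;
}

/// Every output format implements this and consumes the shared IR.
pub trait Writer {
    fn write(&self, doc: &Document) -> String;
}

/// Toy reader: "# Title" style lines become headings, other non-empty
/// lines become paragraphs.
pub struct ToyHeadingReader;

impl Reader for ToyHeadingReader {
    fn read(&self, input: &str) -> Document {
        let blocks = input
            .lines()
            .filter(|line| !line.trim().is_empty())
            .map(|line| {
                let hashes = line.chars().take_while(|&c| c == '#').count();
                if (1..=6).contains(&hashes) && line[hashes..].starts_with(' ') {
                    Block::Heading {
                        level: hashes as u8,
                        text: line[hashes..].trim().to_string(),
                    }
                } else {
                    Block::Paragraph(line.trim().to_string())
                }
            })
            .collect();
        Document { blocks }
    }
}

/// Toy writer: headings in upper case, blocks separated by blank lines.
pub struct PlainTextWriter;

impl Writer for PlainTextWriter {
    fn write(&self, doc: &Document) -> String {
        doc.blocks
            .iter()
            .map(|block| match block {
                Block::Heading { text, .. } => text.to_uppercase(),
                Block::Paragraph(text) => text.clone(),
            })
            .collect::<Vec<_>>()
            .join("\n\n")
    }
}

/// Any reader can be paired with any writer through the IR.
pub fn convert(reader: &dyn Reader, writer: &dyn Writer, input: &str) -> String {
    writer.write(&reader.read(input))
}
```

With this shape, supporting a new format means adding one reader or one writer, not one converter per format pair.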

Key types in ir

| Type | Description |
| --- | --- |
| Document | Root: metadata + sections + extracted assets |
| Metadata | Title, author, creation/modification date, subject, keywords |
| Section | Ordered sequence of Block values |
| Block | Heading, Paragraph, Table, CodeBlock, BlockQuote, List, Image, HorizontalRule, Footnote, Math |
| Inline | Leaf text with style flags (bold, italic, underline, strikethrough, code, superscript, subscript, link, footnote reference) |
| Asset | Embedded binary (image or other media) with MIME type |
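
As a rough illustration of how an inline run with style flags might be turned into Markdown, the sketch below renders a subset of the flags. The `Inline` struct here is a hypothetical simplification: only the `text` field is confirmed by the usage examples in this README, and the boolean flag names and the nesting order of the markers are assumptions.

```rust
/// Simplified inline run with a few style flags (hypothetical layout).
#[derive(Default, Clone)]
pub struct Inline {
    pub text: String,
    pub bold: bool,
    pub italic: bool,
    pub strikethrough: bool,
    pub code: bool,
}

/// Render one inline run as Markdown.
/// Code spans are emitted verbatim and ignore the other flags.
pub fn render_inline(inline: &Inline) -> String {
    if inline.code {
        // Assumes the text itself contains no backticks.
        return format!("`{}`", inline.text);
    }
    let mut out = inline.text.clone();
    if inline.italic {
        out = format!("*{out}*");
    }
    if inline.bold {
        out = format!("**{out}**");
    }
    if inline.strikethrough {
        // `~~` is a GFM extension, not core CommonMark.
        out = format!("~~{out}~~");
    }
    out
}
```

Underline, superscript, and subscript have no native CommonMark syntax, so a real writer would need an HTML or extension-based fallback for those flags.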

Crate layout

src/
  main.rs          CLI entry point (clap)
  lib.rs           Public re-exports
  convert.rs       High-level convert::to_markdown / to_hwpx / show_info
  ir.rs            Intermediate representation types
  error.rs         Hwp2MdError enum (thiserror)
  hwp/             HWP 5.0 reader (CFB container, record parser, EqEdit)
  hwpx/            HWPX reader + writer (ZIP + quick-xml)
  md/              Markdown parser (comrak) + writer
tests/             Integration tests

Known Limitations

  • DRM-protected (배포용) HWP files are not supported.
  • Multi-column (다단) layouts are flattened to a single column.
  • Tables with non-trivial colspan/rowspan fall back to raw HTML in the Markdown output.
  • Headers and footers are currently skipped.
  • Writing back to the binary HWP 5.0 format (MD -> HWP) is not supported; only HWPX output is available.
  • The --style YAML template option for to-hwpx is accepted by the CLI but not yet applied.
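
The table fallback mentioned in the list above can be pictured as a simple decision: if any cell spans more than one row or column, emit HTML, otherwise emit a GFM pipe table. The following sketch shows that decision with a hypothetical `Cell` type. It is not the hwp2md table writer, and what the crate counts as "non-trivial" spans may differ.

```rust
/// Hypothetical table cell with span information.
#[derive(Clone)]
pub struct Cell {
    pub text: String,
    pub colspan: u32,
    pub rowspan: u32,
}

/// A table needs HTML when any cell spans multiple rows or columns.
pub fn needs_html(rows: &[Vec<Cell>]) -> bool {
    rows.iter().flatten().any(|c| c.colspan > 1 || c.rowspan > 1)
}

pub fn render_table(rows: &[Vec<Cell>]) -> String {
    if needs_html(rows) {
        render_html(rows)
    } else {
        render_pipe(rows)
    }
}

/// GFM pipe table; the first row is treated as the header.
fn render_pipe(rows: &[Vec<Cell>]) -> String {
    let mut out = String::new();
    for (idx, row) in rows.iter().enumerate() {
        let cells: Vec<String> = row.iter().map(|c| c.text.replace('|', "\\|")).collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
        if idx == 0 {
            out.push_str(&format!("|{}\n", " --- |".repeat(row.len())));
        }
    }
    out
}

/// Raw HTML table that preserves colspan and rowspan attributes.
fn render_html(rows: &[Vec<Cell>]) -> String {
    let mut out = String::from("<table>\n");
    for row in rows {
        out.push_str("  <tr>");
        for c in row {
            out.push_str("<td");
            if c.colspan > 1 {
                out.push_str(&format!(" colspan=\"{}\"", c.colspan));
            }
            if c.rowspan > 1 {
                out.push_str(&format!(" rowspan=\"{}\"", c.rowspan));
            }
            out.push_str(&format!(">{}</td>", escape_html(&c.text)));
        }
        out.push_str("</tr>\n");
    }
    out.push_str("</table>\n");
    out
}

fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}
```

Pipe tables have no way to express merged cells, which is why the fallback is all-or-nothing per table in this sketch.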

Contributing

Bug reports and pull requests are welcome at https://github.com/hephaex/hwp2md.

Before submitting a patch:

  1. Run cargo fmt and cargo clippy -- -D warnings.
  2. Ensure cargo test --all-targets passes.
  3. Add or update tests for any changed behaviour.

License

Copyright (c) 2026 Mario Cho <hephaex@gmail.com>

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 only.

See LICENSE for the full text.
