HTML Reducer

Sanitizes. Flattens. Writes Markdown. OSCAL-friendly.

Drop your HTML onto the page. Your browser will try to load it; whatever it can read is converted into plain, flat, Markdown-compatible HTML. The file's source code is also placed in a form field, where it can be edited and run again.

This is a deep clean: structural information is filtered away, except as it may be reflected implicitly in the header elements (h1 through h6). All styling is dropped. The filter passes all text through, but of the tagging it keeps only what works safely in reduced environments.

Accordingly, it also discards any script or style that might perturb downstream consumers.

Keeping what works ensures that regular tagging at the block and inline level comes through: paragraphs, very basic inline formatting (such as italics or bold), lists, and simple tables. Other tagging is removed, but no data is lost. Whether the information screened out constitutes noise or signal will depend on the case.
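The filtering described above amounts to a tag whitelist: keep a small set of block and inline elements, strip their attributes, discard script and style wholesale, and pass all other text through. The application itself runs in the browser, but the idea can be sketched in Python with the standard library's `html.parser`; the element sets here are illustrative, not the tool's actual configuration.

```python
from html.parser import HTMLParser

# Illustrative whitelist: block and inline tags that "work safely
# in reduced environments". The real tool's set may differ.
KEEP = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
        "ul", "ol", "li", "table", "tr", "th", "td",
        "em", "i", "strong", "b"}
DROP_WITH_CONTENT = {"script", "style"}  # discarded entirely, text included

class Reducer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip = 0  # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip += 1
        elif tag in KEEP and not self.skip:
            self.out.append(f"<{tag}>")  # attributes (styling) are dropped

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT and self.skip:
            self.skip -= 1
        elif tag in KEEP and not self.skip:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)  # all text passes through

def reduce_html(source: str) -> str:
    r = Reducer()
    r.feed(source)
    return "".join(r.out)
```

For example, `reduce_html('<div class="x"><p>Hi <b>there</b></p><script>alert(1)</script></div>')` drops the div, the class attribute, and the entire script, leaving `<p>Hi <b>there</b></p>`.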

All operations are performed entirely in the browser; no user data, or data derived from it, is uploaded, transmitted, or otherwise communicated to the server (which only delivers the application) or to any other system. Once cached, the application can run on new inputs offline or air-gapped, with no live network access. Likewise, the only model for data persistence is the user's option to save out results.

Tasks / TODO
  • Save As and Copy to Clipboard buttons
  • Refine L/F
    • elements to drop? e.g. buttons
    • images!
  • Stress testing with broken HTML
    • Support flat element sequences? (now draws a blank)
    • Test across browsers
    • Character sets?
  • Unit testing in the back to demonstrate HTML subset coverage?

Limitations

Of course, the results of these transformations will not be 508-compliant. Documents put through these filters are not necessarily fit for use in any downstream application. While all the data is kept, tagging may have to be restored or enhanced to make your data set ready for ingest into another system.

All inputs are now passed to the HTML parser, not the XML parser. Good luck! Note that data not already appearing in p tags or other plain element markup (lists, headers, tables) will spill out into the containing div, which is schema-valid only because divs can contain basically anything.
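The practical difference between the two parsers is error tolerance: an XML parser rejects malformed input outright, while an HTML parser tokenizes what it can and recovers. The application uses the browser's parsers; the same contrast can be demonstrated with Python's standard library (`xml.etree` for strict parsing, `html.parser` for lenient).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<p>unclosed paragraph<p>another"

# A strict XML parser rejects this input outright...
try:
    ET.fromstring(broken)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# ...while an HTML parser tokenizes it without complaint,
# recovering whatever structure it can.
class Collector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

c = Collector()
c.feed(broken)
print(xml_ok)   # False: strict parsing fails
print(c.tags)   # ['p', 'p']: lenient parsing recovers both paragraphs
```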

Mozilla Developer Network documentation for XSLT cautions that this isn't tested (by them) and they can't promise it won't change. It seems to work, however.

Note that while the application saves out flat (unwrapped) HTML-like elements, it does not read flat inputs, due to limitations in the parser. If your input doesn't come through, try adding wrapper <body> tagging (start and end tags around the entire text contents) in the Source editor.
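Why the wrapper helps: a flat element sequence has no single root, so a strict parser cannot treat it as a document, while the same content wrapped in a single enclosing element parses cleanly. A minimal demonstration of the failure mode and the workaround, using Python's strict `xml.etree` parser as a stand-in:

```python
import xml.etree.ElementTree as ET

# A flat element sequence has no single root, so strict parsing fails.
fragment = "<p>one</p><p>two</p>"
try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError:
    parsed_bare = False

# Wrapping the whole text in <body>...</body> supplies a root element.
wrapped = ET.fromstring("<body>" + fragment + "</body>")
print(parsed_bare)                       # False
print([child.tag for child in wrapped])  # ['p', 'p']
```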

Source (editable)

HTML code (cleaned)

In display

Markdown