
Hybrid Xanalogical Word Processor (HXWP)

Thoughts by cxw starting in 2018

Storing multiple documents

  • A single file on disk can store any number of docs.
  • Documents: Git-like structure --- each doc links to its parent(s).
  • Documents are identified by Merkle hashes of their contents. That way, when you edit a document or create a new version, you only have to recalculate the hashes for the parts that changed. Maybe one hash per n kilobytes? (Yeah, yeah, kibibytes.)
    • This is so you can merge files when they are being emailed around, and so you can reduce excessive data transfer --- same benefits as Git.
    • However, the Merkle hash also covers a UUID created when a new document is initialized. If two people happen to type the same text into two separate documents, those will have different IDs. This is because you don't want your document to suddenly collapse together with someone else's document. After all, the two may have completely different histories and relationships.
    • The UUID does not change when you create a new version of a document. In that way, it is something like a family ID or ancestry ID.
  • Ent structure (forest of enfilades aka K-trees) with doc hashes at the top and codepoints (not bytes) at the bottom. U+0000 through U+00FF all exist, so they can be used to store arbitrary binary data. (NUL is exposed to upper levels as U+0000.)
  • Use something like the Bert canopy for "what links here?" queries? This is the big performance question --- in order to support the layout described below, WLH queries have to be very fast.
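The chunked-hash-plus-UUID scheme above can be sketched as follows. This is a minimal illustration, not a spec: the chunk size, the use of SHA-256, and the function names (`chunk_hashes`, `doc_id`) are all assumptions for demonstration.

```python
import hashlib
import uuid

CHUNK_SIZE = 4 * 1024  # assumption: one hash per 4 KiB of content

def chunk_hashes(text: str) -> list:
    """Hash the content in fixed-size chunks, so an edit only
    forces rehashing of the chunks it touches."""
    data = text.encode("utf-8")
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
        for i in range(0, max(len(data), 1), CHUNK_SIZE)
    ]

def doc_id(ancestry_id: uuid.UUID, text: str) -> str:
    """Top-level Merkle hash: covers the chunk hashes *and* the
    ancestry UUID, so identical text typed into two unrelated
    documents still yields distinct IDs."""
    top = hashlib.sha256()
    top.update(ancestry_id.bytes)
    for h in chunk_hashes(text):
        top.update(h)
    return top.hexdigest()

# Two docs with identical text but different ancestry UUIDs get
# different IDs; the same doc's ID is stable across recomputation.
a, b = uuid.uuid4(), uuid.uuid4()
assert doc_id(a, "same text") != doc_id(b, "same text")
assert doc_id(a, "same text") == doc_id(a, "same text")
```

Because the UUID survives across versions, two IDs computed with the same UUID belong to the same family even when their contents differ.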

Links

  • Each link is represented as a doc. That way every codepoint in a link is connected with every other codepoint in that link through the link doc.
  • The first codepoint of every link doc links to type information for that link. (Maybe?)
  • Each codepoint can be the From of exactly one link. That way a user can have a reasonable expectation of clicking on one character and getting one result.
    • Composing characters and other multi-codepoint graphemes must all be the From of the same link.
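The one-From-per-codepoint rule could be enforced at link-creation time, as in this toy registry. The class and method names are hypothetical; the point is that a multi-codepoint grapheme registers all of its offsets under a single link, and any collision is rejected up front.

```python
class LinkIndex:
    """Toy registry enforcing that each codepoint offset is the
    From end of at most one link (hypothetical names)."""

    def __init__(self):
        self.from_of = {}  # codepoint offset -> link id

    def add_link(self, link_id, from_offsets):
        # A multi-codepoint grapheme passes all its offsets here,
        # so the whole cluster shares one From link.
        for off in from_offsets:
            if off in self.from_of:
                raise ValueError(
                    f"offset {off} is already the From of "
                    f"link {self.from_of[off]}")
        for off in from_offsets:
            self.from_of[off] = link_id

idx = LinkIndex()
idx.add_link("L1", [0, 1])   # e.g. base char + combining mark
try:
    idx.add_link("L2", [1])  # rejected: offset 1 already has a From
except ValueError:
    pass
```

With this invariant, clicking any single character resolves to exactly one link, which is the user expectation stated above.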

Document structure

  • Each document is a stream of codepoints.
  • In non-binary data, each paragraph begins with a dedicated codepoint, each section begins with a dedicated codepoint, &c. That dedicated codepoint links to the text of the relevant portion. That way a literal \n can be used as a hard return, a literal FF can be used as a hard page break, &c. I think FS/GS/RS/US may be useful for this.
  • Formatting is stored as links to the special characters (for whole-paragraph or whole-section formatting) or as links to the affected characters.
  • Link docs are associated with a range of revisions for which they are reachable, similar to a Git commit range. E.g., a link can have the most recent doc as its parent commit (as it were), and can store an earliest commit. Link docs can be updated as the parent changes so they always point to the latest.
    • This way you can tell, of all the links that point to the codepoints in a document, which links are active for any particular version.
    • The reachability graph can be cached for performance.
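The revision-range idea above can be sketched with a toy parent chain. Assuming (hypothetically) that a link doc records an `earliest` and a `latest` version, a link is active at a version exactly when that version lies between the two along the Git-like parent chain.

```python
# Sketch: versions form a Git-like chain, child -> parent.
parents = {"v3": "v2", "v2": "v1", "v1": None}

def version_chain(version):
    """Yield versions from `version` back to the root, newest first."""
    while version is not None:
        yield version
        version = parents[version]

def link_active(link, version):
    """A link is active at `version` if `version` falls between the
    link's earliest and latest commits on the parent chain."""
    chain = list(version_chain(link["latest"]))
    if version not in chain:
        return False  # version is newer than the link's latest
    return link["earliest"] in chain[chain.index(version):]

link = {"earliest": "v1", "latest": "v2"}
assert link_active(link, "v2")
assert link_active(link, "v1")
assert not link_active(link, "v3")  # link not yet updated to v3
```

A real implementation would cache these chain walks, per the note above about caching the reachability graph.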

API

  • The Extent is the basic interface. It points to codepoints, not between codepoints.
  • The start-of-x characters (start of section, para, ...) are part of the text stream, so can be searched (similar to searching for ^p in Word).
  • An Extent can be (is always?) mapped into a particular view, so that it can handle up/down/home/end &c. operations. This way Extent combines some of the functionality of a Word Range and a Word Selection.
  • You can ask WLH for any Extent and get back a collection of Extents with their respective WLHs (only one Extent if everything in the input is linked to by the same docs/links).
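The WLH partitioning described above amounts to splitting an extent into maximal runs of codepoints that share the same what-links-here set. A minimal sketch, assuming extents are inclusive (start, end) offset pairs per the "points to codepoints" rule, and a hypothetical `links_at` callback that reports the links targeting an offset:

```python
from itertools import groupby

def wlh_partition(extent, links_at):
    """Split an inclusive (start, end) extent into maximal
    sub-extents whose codepoints share the same WLH set.
    `links_at(off)` returns a frozenset of link ids targeting
    that offset (hypothetical callback)."""
    offsets = range(extent[0], extent[1] + 1)
    result = []
    for wlh, group in groupby(offsets, key=links_at):
        g = list(group)
        result.append(((g[0], g[-1]), wlh))
    return result

# Offsets 0-2 share one WLH set, 3-4 another: two sub-extents.
links = {0: frozenset({"L1"}), 1: frozenset({"L1"}),
         2: frozenset({"L1"}), 3: frozenset({"L1", "L2"}),
         4: frozenset({"L1", "L2"})}
parts = wlh_partition((0, 4), links.__getitem__)
assert parts == [((0, 2), frozenset({"L1"})),
                 ((3, 4), frozenset({"L1", "L2"}))]
```

When every codepoint in the input shares the same WLH set, the result is a single sub-extent, matching the parenthetical above.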
Page last modified on June 27, 2018, at 07:49 PM