Documentation Design for Glossaries and Acronyms

This page initially serves as the location for a draft design of how glossary- and acronym-like information for the Data Management System (and LSST more generally) will be made available to users and documentation consumers. The draft design will then migrate to a Technical Note, which will be linked from here.

Context and History

In the context of recent (mid-2018) user workshops such as the Commissioning Bootcamp, there was considerable demand for access to glossary-like information, in view of the large amount of LSST-specific terminology that users were required to understand. This has been discussed in various forums, including the DM-SST. (See the minutes of the 2018-06-22 meeting.) In that context, Melissa Graham has been asked to assemble a list of terminology, initially in response to the concerns expressed by the workshop users (see DM-14877), and Gregory Dubois-Felsmann has been asked to coordinate the development of a technical strategy for delivering this information to users and developers.
This is only the latest round in a long history of concern about the availability of documentation of this nature. A number of attempts at assembling lists of terms for LSST and for the DMS have been made. Most recently, in 2017, DM-9807 identified the need to merge a number of glossary-like documentation sources. Many of them are summarized below:

Existing sources of glossary/acronym information

Melissa Graham's recently started page: DM Glossary - superseded
https://confluence.lsstcorp.org/display/LSWUG/DMS+Glossary
https://confluence.lsstcorp.org/display/LSWUG/Astro+Glossary
https://dev.lsstcorp.org/trac/wiki/glossary
https://dev.lsstcorp.org/trac/attachment/wiki/Glossary/EAGlossary.pdf - acronym list from Enterprise Architect (pre-MagicDraw-migration)
Project-level acronyms: http://ls.st/Document-11921*
Project-level glossary: http://ls.st/Document-14412*
Acronym and glossary lists in the lsst-texmf package

Proposed Requirements and Preferences

Users wish to be able to have a one-stop shop for looking up information on a term, and are inclined to a search-based model for doing so, rather than looking up a term in an alphabetic list.
Nonetheless, the ability to produce an alphabetic list of acronyms and/or glossary terms should be retained, as some users may prefer it and it is a very useful basis for internal review of the correctness and readability of the definitions.
The system adopted should permit the generation of both a global list and of lists specific to different areas of the project, which may have partially overlapping content.
Acronyms and glossary terms should be distinguished.
We wish to be able to include a specific "Definitions" section in all of our formal documents that provides definitions and/or expansions of the glossary terms and acronyms actually used in that document. This has to work for our LaTeX-based document production mechanism (i.e., via the lsst-texmf package) and should also be made to work for the reST-based technical notes documentation.
It would be desirable to have the "one sentence" definitions of a term, suitable for inclusion in a glossary, provided from the same source text from which complete documentation of that term is available. It would also be desirable to be able to navigate with something like "one click" from the glossary entry to the complete documentation.
The system should also support cases where a term has only a glossary definition and no backing documentation in the LSST environment.
The system should take into account the possibility that a term may be defined in more than one context - e.g., "visit" may be documented from an observatory-operations perspective as well as from a data-processing perspective. It should persist enough metadata to allow both a) choosing a single "one-sentence" definition as the default, and b) providing access to all of the definitions and backing documentation.
The system should support the use of MagicDraw's glossary and acronym table capabilities - whether by making MagicDraw the authoritative source of this information or by periodic import into MagicDraw of information from some other authoritative system.
However, it would be preferable for the system to permit the editing of this content by people with the general developer skill set, rather than relying on the use of a proprietary tool.

Proposed Design Elements

The Sphinx-based DM documentation-generation system will permit documentation source code to declare terms for inclusion in a glossary or acronym list. At the place of definition, the system will permit a "one-sentence" definition to be provided in addition to the main content. The system will permit the decoration of a term with one or more tags (e.g., "DM", "Observatory Operations") that permit area-specific glossary lists to be generated.
The DM documentation-generation system will process all its source files to produce a table of terms, one-sentence definitions, and references (URLs) to detailed documentation, which can then be used to drive other documentation-generation and -delivery tools.
A format for this table will be defined with a view to making it usable in a variety of contexts, and readily generated from other sources of documentation (including manual generation where appropriate).
- Jonathan Sick will do some thinking and research into this and make a more specific proposal. 03 Jul 2018
Tooling will be developed to allow the use of multiple tables in this format as inputs to the lsst-texmf glossary and acronym processing. This may be configuration-controlled by making PRs against the lsst-texmf package itself, or a separate package may be used.
Tooling will be developed to allow the export of an aggregation of glossary and acronym information from this scheme to a form that can be ingested into MagicDraw. (It looks like even .csv may be a suitable form for this.) If the actual ingest cannot be automated, it can be made a ~monthly human task to apply an update.
- Tim Jenness, with help from Austin Roberts, will look into this. 03 Jul 2018
A means will be devised to allow resolving conflicts between multiple sources of information on the same term.
- Gregory Dubois-Felsmann will make a more specific proposal on this. 29 Jun 2018
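
As a rough illustration of the table described in the design elements above, the following Python sketch models one row as a term, a one-sentence definition, an optional URL, and a set of tags, then filters rows by tag and writes them out as CSV. The class name `GlossaryRow`, the column names, and the semicolon-joined tag column are all assumptions made for illustration; the actual table format is still to be proposed.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class GlossaryRow:
    """One row of a hypothetical term table (field names are assumptions)."""
    term: str
    definition: str
    abbr: str = ""
    url: str = ""  # left empty when a term has no backing documentation
    tags: frozenset = field(default_factory=frozenset)


def rows_for_tag(rows, tag):
    """Return the rows carrying the given tag, sorted alphabetically by term."""
    return sorted((r for r in rows if tag in r.tags), key=lambda r: r.term.lower())


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "abbr", "definition", "url", "tags"])
    for r in rows:
        writer.writerow([r.term, r.abbr, r.definition, r.url, ";".join(sorted(r.tags))])
    return buffer.getvalue()


# Placeholder rows; the definitions are not real glossary text.
rows = [
    GlossaryRow("Visit", "Placeholder definition.",
                tags=frozenset({"DM", "Observatory Operations"})),
    GlossaryRow("Instrument Signature Removal", "Placeholder definition.",
                abbr="ISR", tags=frozenset({"DM"})),
]
dm_csv = to_csv(rows_for_tag(rows, "DM"))
```

The same rows can feed both a global list (no filter) and area-specific lists (one call to `rows_for_tag` per tag), which is the overlap behavior the requirements ask for.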

Gregory Dubois-Felsmann will pull together a more concrete writeup and describe it at the DM-SST meeting on 06 Jul 2018.

Glossary Integration for User Documentation

User documentation websites can interact bidirectionally with the LSST glossary system. First, topics (web pages) in user documentation can be authoritative sources for terms. As such, user documentation becomes a decentralized generator of terms for the glossary. Second, user documentation can leverage the LSST glossary system to add mouse-over definitions and hyperlinks to terms and acronyms found in the documentation.

This initial design discussion describes the components that help to contribute glossary information from user documentation into the LSST glossary system:

reStructuredText directives to help authors efficiently mark up terms in user documentation source.
Metadata to expose glossary information in HTML.
Web crawler that monitors published LSST user documentation sites and adds terms into the LSST glossary system.

Lastly, this discussion covers the second aspect:

Displaying glossary information in user documentation

reStructuredText syntax for documentation authors

The term directive

For Sphinx-based documentation, authors can mark pages that describe a term. Using a term directive on a page implies that the page is closely related to the term and can serve as an extended discussion of it.

The most flexible syntax for this is a reStructuredText directive that is written outside the flow of the text (at the top or bottom of an rst file, for instance). All of LSST's custom reStructuredText and Sphinx extensions are packaged in Documenteer, a Python package managed by SQuaRE.

A terminology (term) directive might look like this:

.. term:: Instrument Signature Removal
   :abbr: ISR
   :primary: true

   A pipeline that applies calibration reference data in the course of raw data
   processing, to remove artifacts of the instrument or detector electronics,
   such as removal of overscan pixels, bias correction, and the application of
   a flat-field to correct for pixel-to-pixel variations in sensitivity.

The argument of the term directive is the term itself ("Instrument Signature Removal").
term directives can have an optional abbr field to record a common abbreviation for the term. The naming "abbr" comes from the <abbr> HTML tag.
A second field might be called primary, which takes a boolean flag indicating whether the author of this page believes that the page should be the primary extended discussion of a term (primary true) or a secondary discussion of a term (primary false). We can discuss whether this field is useful, and what the default value should be if the field is not declared.
The content of the term directive is a succinct, generally single-sentence, definition of the term. This is the definition that would appear in a glossary listing, for example.

Automatic terminology

Many pages in the user documentation follow rigid content structures (a system known as topic-based technical writing). In some of these topic types, terminology can be automatically identified.

For example, a Python API reference page can identify a class or function name as a term. The single-sentence summary of that class or function can be used as the term definition. In this case, it should be possible to automatically export terminology information from the page without requiring the author to write separate term directives.
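
A minimal sketch of this idea, assuming the API object is importable at build time and that its docstring starts with a summary paragraph. The helper name `term_from_object` and the output keys are hypothetical; `example_task` is a stand-in, not a real LSST API.

```python
import inspect


def term_from_object(obj):
    """Build a term record from a class or function and its docstring summary."""
    doc = inspect.getdoc(obj) or ""
    # Treat the first paragraph of the docstring as the one-sentence definition.
    summary = " ".join(doc.split("\n\n", 1)[0].split())
    return {"name": obj.__name__, "description": summary}


def example_task(exposure):
    """Apply a hypothetical processing step to an exposure.

    Extended discussion that would not appear in the glossary.
    """
    return exposure


record = term_from_object(example_task)
```

Only the summary paragraph is exported; the extended discussion stays on the API reference page, which the glossary entry can link to.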

The Sphinx build process

During the build process, the term directives are parsed and transformed into metadata content that's inserted into the HTML header of that page.

Marking up glossary metadata in HTML

The terminology covered by an LSST webpage is described by metadata in that webpage's header. The reStructuredText term directive, described above, is a convenient way to populate this metadata, but in principle this metadata is HTML-native, and can work with any website, not just Sphinx sites.

As much as possible, we should take a standards-based approach to encoding this metadata. The most flexible standards-based approach is JSON-LD with a schema.org context. LSST is already using JSON-LD in production to ship metadata with lander-based landing pages of PDF documents (SQR-020).

Unfortunately, there is not a specific schema for glossaries. With JSON-LD it is possible to create an extension of schema.org to describe a glossary. Most likely the base type would be derived from https://schema.org/WebPage. The type would add a terminology field containing an array of term objects. For example:

<script type="application/ld+json">
{
  "@context": "http://example.org/lsst-webpage-schema.jsonld",
  "@type": "LsstWebPage",
  "name": "...",
  "description": "...",
  "publisher": { ... },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "terminology": [
    {"name": "Instrument Signature Removal",
     "abbr": "ISR",
     "isMainSubject": true,
     "description": "A pipeline that applies calibration reference data in the course of raw data processing, to remove artifacts of the instrument or detector electronics, such as removal of overscan pixels, bias correction, and the application of a flat-field to correct for pixel-to-pixel variations in sensitivity."}
  ]
}
</script>
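
The build step that turns parsed term data into a metadata block of this shape could be sketched as follows. This is an illustration only: the function name `render_terminology_script`, the input keys, and the choice of `False` as the default for an undeclared primary flag are assumptions. The output field names and the example.org context URL are taken from the example above.

```python
import json


def render_terminology_script(page_name, terms,
                              context="http://example.org/lsst-webpage-schema.jsonld"):
    """Return a JSON-LD <script> element describing the terms covered by a page."""
    metadata = {
        "@context": context,
        "@type": "LsstWebPage",
        "name": page_name,
        "terminology": [
            {
                "name": t["name"],
                "abbr": t.get("abbr"),
                # Assumption: an undeclared "primary" flag is treated as false.
                "isMainSubject": bool(t.get("primary", False)),
                "description": t["description"],
            }
            for t in terms
        ],
    }
    # Escape "</" so text inside a definition cannot close the script element early.
    body = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Serializing with `json.dumps` rather than string templates guarantees balanced quotes and no trailing commas in the emitted metadata.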

Web crawler for aggregating glossary metadata from user documentation

This system of exporting terminology information from documentation is decentralized. As subject matter experts work on documentation, new terms are introduced, definitions may be refined, and new URLs may become available as supporting reference material for terms. We need a way to aggregate this decentralized information into the centralized LSST glossary. This will be done with an automated web crawler (bot).

The bot

The bot will identify root URLs for documentation sites from the LSST the Docs API (GET https://keeper.lsst.codes/products/). Then the bot will pull HTML for each lsst.io page it encounters, both to get links to further pages and to get JSON-LD metadata. Many Python frameworks are available for creating web crawlers (BeautifulSoup, aiohttp, and Redis are one possible stack).
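
The per-page step (collect links to follow and pull out JSON-LD metadata) can be sketched with the Python standard library alone. Note that this uses `html.parser` in place of BeautifulSoup to stay dependency-free, and it leaves out fetching and queueing entirely. The names `PageScanner` and `scan_page` are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScanner(HTMLParser):
    """Collect JSON-LD blocks and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.jsonld = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.jsonld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed metadata rather than stop the crawl


def scan_page(html, base_url):
    """Return (list of JSON-LD objects, list of absolute link URLs) for a page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.jsonld, scanner.links
```

A real crawler would additionally restrict the returned links to the documentation site being crawled and track which URLs have already been visited.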

The bot aggregates web pages into a data structure that might look like this:

[
  {"name": "Instrument Signature Removal",
   "urls": [
     {"url": "https://xyz.lsst.io/...",
      "title": "Web page title",
      "dateUpdated": "2018-07-03",
      "primary": true,
      "definition": "...",
      "abbr": "ISR"},
     {"url": "...",
      "title": "...",
      "dateUpdated": "...",
      "primary": false}
   ]},
  {"name": "...",
   "urls": [...]}
]
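
One way the bot might build this structure from per-page metadata is sketched below. The input shape (a dict per page with `url`, `title`, `dateUpdated`, and a `terminology` list of term objects) is an assumption, as is the function name `aggregate`; the output keys follow the structure shown above.

```python
def aggregate(pages):
    """Group per-page term metadata by term name.

    ``pages`` is an iterable of dicts with the hypothetical keys ``url``,
    ``title``, ``dateUpdated`` and ``terminology`` (a list of term objects).
    """
    by_name = {}
    for page in pages:
        for term in page.get("terminology", []):
            entry = by_name.setdefault(term["name"], {"name": term["name"], "urls": []})
            ref = {
                "url": page["url"],
                "title": page["title"],
                "dateUpdated": page["dateUpdated"],
                "primary": bool(term.get("isMainSubject", False)),
            }
            # Carry the definition and abbreviation only when the page supplies them.
            if term.get("description"):
                ref["definition"] = term["description"]
            if term.get("abbr"):
                ref["abbr"] = term["abbr"]
            entry["urls"].append(ref)
    return sorted(by_name.values(), key=lambda e: e["name"].lower())
```

Keying on the exact term name is the simplest choice; a fuller version would need to decide how to treat differences in capitalization or spelling between pages.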

Integration with the centralized glossary

What is done with the aggregated glossary data depends on the level of automation desired:

A human could merge the glossary information manually.
A human could use a tool that helps highlight differences between the aggregated glossary and the centralized glossary.
The bot could use heuristics to update the centralized glossary (for example, updating a definition if a "primary" source is newer) and then submit a GitHub pull request.
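
The "newer primary source" heuristic from the last option could look roughly like this. The shape of the centralized glossary (a mapping from term name to `definition` and `dateUpdated`) is an assumption, as is proposing terms that are missing from it. The function only computes proposed changes; opening a pull request is out of scope for the sketch.

```python
from datetime import date


def propose_updates(aggregated, central):
    """Return {term: new definition} where a newer primary source differs.

    ``aggregated`` follows the bot's structure; ``central`` is a hypothetical
    mapping of term name to {"definition": ..., "dateUpdated": "YYYY-MM-DD"}.
    """
    updates = {}
    for entry in aggregated:
        primaries = [u for u in entry["urls"]
                     if u.get("primary") and u.get("definition")]
        if not primaries:
            continue  # secondary pages never override the central definition
        newest = max(primaries, key=lambda u: date.fromisoformat(u["dateUpdated"]))
        current = central.get(entry["name"])
        if current is None:
            # Assumption: unknown terms are proposed as additions.
            updates[entry["name"]] = newest["definition"]
        elif (date.fromisoformat(newest["dateUpdated"])
              > date.fromisoformat(current["dateUpdated"])
              and newest["definition"] != current["definition"]):
            updates[entry["name"]] = newest["definition"]
    return updates
```

Because the result is a plain mapping of proposed changes, the same function could also back the second option above, where a human reviews the differences before anything is merged.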

A key question is whether the centralized glossary has a rich enough structure to encompass all the information aggregated by the bot. Not only does the bot produce terms, abbreviations and definitions, it also produces a list of web pages, each with: title, URL, date updated, whether it's a primary reference, and even a description of the page.

Displaying glossary information in user documentation

User documentation should display glossary information in situ to help make LSST jargon more approachable and transparent.

The centralized glossary database can be downloaded during the documentation build.

As a page is built, terms and acronyms defined in the glossary can be annotated. This annotation is a link that brings up a pop-up with the term's definition. The pop-up could also contain links to a centralized glossary web application (see below).

Highlights shouldn't be generated for terminology in these cases:

If a page is the primary source for a term, it shouldn't highlight the term and distract the reader away.
If a term is already linked, a highlight won't be generated because presumably the link created by the author is more useful.
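
The two exclusion rules above amount to a small predicate that the build could evaluate for each occurrence of a term. The function name and the `primary_urls` mapping (term to the set of page URLs that are primary sources for it) are hypothetical, and detecting whether an occurrence sits inside an existing link is left to the caller.

```python
def should_highlight(term, page_url, primary_urls, already_linked):
    """Decide whether to annotate one occurrence of a term on a page.

    ``primary_urls`` maps each term to the set of page URLs that are
    primary sources for it.
    """
    if page_url in primary_urls.get(term, set()):
        return False  # the page is itself the primary source for the term
    if already_linked:
        return False  # the author's own link takes precedence
    return True
```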

Glossary web application

It may be useful to have a glossary web application that provides access to the full set of terms, their definitions, and the full set of URLs and documents related to that term.

For example, it may not be ideal to pull the full list of URLs related to a term in the terminology pop-overs. Instead, the reader might want to click a "More info" link that brings up a separate page with the full set of information available for a term (definition, URLs).

This site could realistically be implemented as a React app.

This page is being developed under DM-14911.
