Commit 5db4286

feat(taxonomy): start proposal on revamp of taxonomy concept
Signed-off-by: Laura Santamaria <[email protected]>
1 parent f6b1e5c commit 5db4286

1 file changed, +72 -0 lines changed

docs/taxonomy-revamp-2025.md

@@ -0,0 +1,72 @@
---
author: Laura Santamaria (@nimbinatus)
date: 05 February 2025
status: proposed
---

## Issues

Our taxonomy tree structure and knowledge/skill file structure were designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, which makes their work more complex.

The end user typically gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^bikeshed]. The only requirement for the SDG process is sorting things into `knowledge` and `skills`.

The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Most of the fields in the `qna.yaml` file are unnecessary except for use in the upstream taxonomy.
- YAML is a notoriously complex, loose format with a lot of potholes.
  - YAML files written against different versions of the specification can parse differently (e.g., 1.2 vs 1.1).
    - Note that PyYAML, our base tool, parses YAML 1.1, not 1.2. There is a long way to go[^PyYAML] to support 1.2, which has been the latest spec since 2009. As such, someone unfamiliar with YAML who searches the Internet for a solution will likely stumble across 1.2 solutions that don't work for 1.1.
  - There are at least 9 different ways to write a multi-line string in YAML[^9ways]. The two block scalar styles[^blockscalar] combine with the three block chomping indicators[^blockchomping] to give six (this does **not** count the indentation indicator[^blockindentation]!). Plain, double-quoted[^doublequotedflowscalar], and single-quoted[^singlequotedflowscalar] flow scalars can also span lines, each with its own folding rules, which can cause more problems.
- The linting system, intended to ensure the YAML file is readable by the SDG process, adds more burden on the non-technical user.
  - The linter for YAML enforces an 80-character line length by default. That makes sense for code read in a terminal, but not for a typical end user who is used to writing paragraphs in a rich text editor.
  - The linter also complains about trailing whitespace, another common failure that the typical end user won't understand.
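
To make the version mismatch in the list above concrete, the following minimal sketch shows how the same unquoted value can resolve to a different type depending on the spec version. It is not a YAML parser and does not call PyYAML; it only encodes the boolean spellings listed in the YAML 1.1 bool type and in the YAML 1.2 core schema. The set names and `resolve_plain_scalar` are made up for this illustration.

```python
# Illustrative only: a tiny model of plain-scalar boolean resolution,
# not a YAML parser. Spellings follow the YAML 1.1 bool type and the
# YAML 1.2 core schema. All names here are hypothetical.
YAML_1_1_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
YAML_1_1_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}
YAML_1_2_TRUE = {"true", "True", "TRUE"}
YAML_1_2_FALSE = {"false", "False", "FALSE"}


def resolve_plain_scalar(text: str, version: str):
    """Return a bool if `text` is a boolean under `version`, else the string itself."""
    if version == "1.1":
        true_set, false_set = YAML_1_1_TRUE, YAML_1_1_FALSE
    elif version == "1.2":
        true_set, false_set = YAML_1_2_TRUE, YAML_1_2_FALSE
    else:
        raise ValueError(f"unsupported version: {version}")
    if text in true_set:
        return True
    if text in false_set:
        return False
    return text


if __name__ == "__main__":
    # A line such as `answer: no` changes type between spec versions.
    for value in ("no", "on", "true", "maybe"):
        print(value, resolve_plain_scalar(value, "1.1"), resolve_plain_scalar(value, "1.2"))
```

Under these rules an unquoted `no` is the boolean false for a 1.1 reader and the string `no` under the 1.2 core schema, which is one reason advice written for one version can silently misbehave on the other.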

From a code perspective:

- We are already using JSON in the datamixing process in SDG[^datamixing].
- Docling also accepts JSON as input and exports it as output[^docling].
- JSON is also much more friendly to UI work, which is a primary path we would like people to use.
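
As a rough illustration of why JSON has fewer pitfalls for paragraph text, the sketch below uses Python's standard `json` module. JSON has a single string syntax, so a multi-line answer is always written with `\n` escapes and survives a round trip unchanged, including long lines and trailing spaces. The field names `question` and `answer` are assumptions for the example, not a proposed schema.

```python
import json

# Paragraph-style text with a trailing space, a line break, and a long line.
answer = (
    "The first line ends with a space. \n"
    "The second line is deliberately long enough to run past eighty characters without any wrapping."
)

# Hypothetical field names, used only for this illustration.
record = {"question": "What does the sample text contain?", "answer": answer}

encoded = json.dumps(record, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)

# The text comes back identical; there are no folding or chomping rules to apply.
assert decoded["answer"] == answer
print(encoded)
```

Because the serializer writes the escapes, the person entering the text never has to choose between string styles or think about indentation.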

Overall, the `qna.yaml` file needs to have fewer knobs and fewer pitfalls.

Writing question and answer sets is also much like writing reading comprehension sets for a standardized exam. It would be better to frame this hands-on part of the process as similar to the passage and question sets from English reading comprehension exams.

## Proposed solution

- Drop the folder structure in favor of a schema field for submission type and even domain, if necessary.
  - The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`.
- Streamline the schema.
  - Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy through documentation, CI, and review rather than requiring them for everyone.
- Switch to JSON and Markdown for the `qna.yaml` document.
  - Allow the user to write Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format.
  - Markdown is very user-friendly, and converters handle a lot of the issues with encoding and special characters that come up in situations like working in other languages. We wouldn't have to worry about a linter arguing about line length with the end user, and we wouldn't have to think about whether the user used tabs or spaces or forgot to strip whitespace at the end of a line.
- Frame the Q&A writing process as a reading comprehension process.
  - Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.
  - Most people can understand the difference between "reading to learn" and "learning to read" questions. The new, streamlined schema that matches the simplest needs could help here, along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the free, battle-tested tutorials that already exist for writing standardized exams.
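
One way the ideas in the list above could fit together is sketched below: a small converter that takes Markdown written by the user and produces a JSON-ready record with a type field and only the optional metadata that was supplied. Everything specific here is an assumption for illustration: the Markdown layout (`# Passage`, `## Question`, `## Answer` headings), the record keys `type`, `passage`, and `questions_and_answers`, and the function name `markdown_to_record`. It is not an existing InstructLab API or an agreed schema.

```python
import json

# Optional per the proposal; upstream could still require them in CI and review.
OPTIONAL_FIELDS = ("created_by", "domain", "document_outline")


def markdown_to_record(markdown: str, submission_type: str, **optional) -> dict:
    """Convert a passage plus Q&A pairs (hypothetical layout) into a JSON-ready dict."""
    if submission_type not in ("knowledge", "skill"):
        raise ValueError("submission_type must be 'knowledge' or 'skill'")
    unknown = set(optional) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    # Split the document into [heading, lines] sections at each heading line.
    sections = []
    for line in markdown.splitlines():
        if line.startswith("#"):
            sections.append([line.lstrip("#").strip().lower(), []])
        elif sections:
            sections[-1][1].append(line)

    record = {"type": submission_type, "passage": "", "questions_and_answers": []}
    record.update(optional)
    for heading, lines in sections:
        text = "\n".join(lines).strip()
        if heading == "passage":
            record["passage"] = text
        elif heading == "question":
            record["questions_and_answers"].append({"question": text, "answer": ""})
        elif heading == "answer" and record["questions_and_answers"]:
            # Attach the answer to the most recent question.
            record["questions_and_answers"][-1]["answer"] = text
    return record


SAMPLE = """# Passage
The library opens at nine on weekdays.
It is closed on public holidays.

## Question
When does the library open on weekdays?

## Answer
At nine.
"""

record = markdown_to_record(SAMPLE, "knowledge", domain="example")
print(json.dumps(record, indent=2))
```

In this sketch the submission type is passed in as an argument, standing in for a UI selection, and line length or trailing whitespace in the Markdown never reaches the user as an error.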

[^bikeshed]: The story of the bikeshed is a common metaphor. The story goes that a group working on approvals for the construction plan of a nuclear power plant gets stuck on what color to paint the bike shed at one of the entrances to the plant. Multiple meetings are scheduled to hash out the color of the bike shed, with heated arguments, while the rest of the plan for the power plant is not examined in detail or critiqued. People have an easier time forming an opinion on something as trivial as a bike shed's color than on complex decisions about other systems. https://en.wiktionary.org/wiki/bikeshedding
[^9ways]: You can experience this issue in action with the interactive experience on https://yaml-multiline.info/.
[^blockscalar]: https://yaml.org/spec/1.2.2/#81-block-scalar-styles
    > YAML provides two block scalar styles, literal and folded. Each provides a different trade-off between readability and expressive power.
[^blockchomping]: https://yaml.org/spec/1.2.2/#8112-block-chomping-indicator
    > Chomping controls how final line breaks and trailing empty lines are interpreted. YAML provides three chomping methods:
[^blockindentation]: https://yaml.org/spec/1.2.2/#8111-block-indentation-indicator
    > Every block scalar has a content indentation level. The content of the block scalar excludes a number of leading spaces on each line up to the content indentation level.
    >
    > If a block scalar has an indentation indicator, then the content indentation level of the block scalar is equal to the indentation level of the block scalar plus the integer value of the indentation indicator character.
    >
    > If no indentation indicator is given, then the content indentation level is equal to the number of leading spaces on the first non-empty line of the contents. If there is no non-empty line then the content indentation level is equal to the number of spaces on the longest line.
    >
    > It is an error if any non-empty line does not begin with a number of spaces greater than or equal to the content indentation level.
    >
    > It is an error for any of the leading empty lines to contain more spaces than the first non-empty line.
    >
    > A YAML processor should only emit an explicit indentation indicator for cases where detection will fail.
[^doublequotedflowscalar]: https://yaml.org/spec/1.2.2/#double-quoted-style
    > In a multi-line double-quoted scalar, line breaks are subject to flow line folding, which discards any trailing white space characters. It is also possible to escape the line break character. In this case, the escaped line break is excluded from the content and any trailing white space characters that precede the escaped line break are preserved. Combined with the ability to escape white space characters, this allows double-quoted lines to be broken at arbitrary positions.
[^singlequotedflowscalar]: https://yaml.org/spec/1.2.2/#single-quoted-style
    > In addition, it is only possible to break a long single-quoted line where a space character is surrounded by non-spaces. [...] All leading and trailing white space characters are excluded from the content. Each continuation line must therefore contain at least one non-space character. Empty lines, if any, are consumed as part of the line folding.
[^datamixing]: stuff
[^docling]: According to https://ds4sd.github.io/docling/supported_formats/, Docling supports JSON-serialized Docling Documents and Markdown as input, and JSON and Markdown as outputs.
[^PyYAML]: https://github.com/yaml/pyyaml/issues/486
