Data Document
Preamble
Data documents or data records stored in a GenboreeKB are sets of properties, where a "property" is a name-value pair.
- Properties themselves can have subordinate properties (nesting). This means that a property can have one or more sub-properties in addition to its "value".
- Thus, hierarchical/tree-like document models are supported through nesting.
- Alternatively, a property can have a list/array of sub-properties.
- While not strictly hierarchical or tree-like (the list's size is open ended), this allows a property to contain a list of zero or more sub-properties--possibly themselves complex, nested, hierarchical properties.
- The list is homogeneous in that ALL sub-properties (items) in the list have the SAME property definition [in the document model or schema].
- Such lists are good for accumulating uniform records. Such accumulation is not possible without lists. As mentioned, list size is open-ended.
- Note that the property definition in a list must be singly-rooted, but may have any level of nesting.
- Note that documents need to be singly rooted and the root property needs to be the document identifier property (the unique name by which you refer to the document, hopefully in some meaningful way).
- Case matters for property names and acceptable values.
- "chr12" != "Chr12"
- "pathogenic" != "Pathogenic" <-- this is important for enums; if "pathogenic" was one of five acceptable values, "PathoGeniC" will be rejected.
- Domains matter for property values.
- 111222 != "111222" (i.e. using String in an Int field is incorrect)
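The case and domain rules above can be illustrated with a minimal Python sketch. The enum values and the helper names (`STATUS_VALUES`, `check_enum`, `check_int`) are hypothetical and only stand in for whatever a real document model defines; this is not GenboreeKB's own validation code.

```python
# Hypothetical enum values; a real model would define its own list.
STATUS_VALUES = {"benign", "pathogenic", "unknown"}


def check_enum(value, allowed):
    """Exact, case-sensitive membership test for an enum-like property."""
    return value in allowed


def check_int(value):
    """Accept only real integers; digit strings and booleans are rejected."""
    return isinstance(value, int) and not isinstance(value, bool)


print("chr12" == "Chr12")                        # names and values are case sensitive
print(check_enum("pathogenic", STATUS_VALUES))   # exact match is accepted
print(check_enum("PathoGeniC", STATUS_VALUES))   # wrong case is rejected
print(check_int(111222))                         # an Int value
print(check_int("111222"))                       # a String in an Int field
```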
Syntax for Properties in a Document
Given that:
- A document is a collection of properties, possibly hierarchical.
- Properties are name-value pairs.
- Properties have a value, and may have sub-properties or a list/array of sub-property items.
Then: What's the syntax for representing such a property in a document?
- The representation of a property needs its name, its value, and any sub-properties or a homogeneous list of sub-property items.
- The name is implicit, as it's just a field name.
- i.e. a field within an object (JSON, Javascript perspective)
- i.e. a key within a map or hash (YAML, Ruby, Perl perspective)
- The value stored at that field is always an object--or a Map or Hash if you prefer--with its own specific fields:
- "value" - the actual value for the property is stored in this field.
- "properties" - [Optional] the sub-properties, if any; provided as an object/hash; mutually exclusive with "items".
- "items" - [Optional] a list/array of sub-property items; mutually exclusive with "properties".
- Regarding the "properties" and "items" keys:
- They are mutually exclusive.
- There is no need to provide these for leaf properties that have no sub-properties (or if there are currently no sub-properties available). Just "value" is fine.
- Regarding the "value" field:
- You should make a habit of providing it, even when it is "" or when you want the default value.
- But if you don't, the defined default value for the property will be assumed and used.
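As a rough illustration of the syntax rules above, the following Python sketch builds property objects and refuses to combine "properties" with "items". The helper name `make_prop` is hypothetical and is not part of any GenboreeKB client library; it only mirrors the field layout described in this section.

```python
import json


def make_prop(value="", properties=None, items=None):
    """Build the object stored under a property name.

    "value" is always written, following the habit recommended above.
    "properties" and "items" are optional and mutually exclusive.
    """
    if properties is not None and items is not None:
        raise ValueError('"properties" and "items" are mutually exclusive')
    prop = {"value": value}
    if properties is not None:
        prop["properties"] = properties
    if items is not None:
        prop["items"] = items
    return prop


# A singly-rooted document: one root property holding two leaf sub-properties.
doc = {
    "rcvID": make_prop("RCV000037626", properties={
        "chr": make_prop("chr12"),
        "start": make_prop(112888156),
    })
}
print(json.dumps(doc, indent=2))
```

Leaf properties come out as just `{ "value": ... }`, which matches the note that "properties" and "items" can be omitted when there are no sub-properties.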
Example A. Singly-Rooted, Otherwise Flat Document
JSON Notation
{
  "rcvID" : {
    "value" : "RCV000037626",
    "properties" : {
      "chr" : { "value" : "chr12" },
      "start" : { "value" : 112888156 },
      "end" : { "value" : 112888156 },
      "status" : { "value" : "current" }
    }
  }
}

// --- or bit more compactly (complex/dense docs can be hard to read though) ---

{
  "rcvID" : { "value" : "RCV000037626", "properties" : {
    "chr" : { "value" : "chr12" },
    "start" : { "value" : 112888156 },
    "end" : { "value" : 112888156 },
    "status" : { "value" : "current" }
  }}
}
Tabbed Notation
rcvID | RCV000037626 |
- chr | chr12 |
- start | 112888156 |
- end | 112888156 |
- status | current |
Example B. Document with Nesting
- Note the propName : { value, properties } pattern.
JSON Notation
{
  "Variant-Phenotype ID" : { "value" : "rs11540652 - Li Fraumeni syndrome 1", "properties" : {
    "Allele Information" : { "value" : "", "properties" : {
      "dbSNP ID" : { "value" : "rs11540652", "properties" : {
        "Build" : { "value" : "SNP 137" },
        "Source" : { "value" : "in-house-pipeline" }
      }},
      "Genomic location" : { "value" : "chr17:7577538-7577538", "properties" : {
        "Assembly version" : { "value" : "hg19" },
        "Source" : { "value" : "ClinVar" }
      }},
      "Gene" : { "value" : "TP53" }
    }}
  }}
}
Tabbed Notation
Variant-Phenotype ID | rs11540652 - Li Fraumeni syndrome 1 |
- Allele Information | |
-- dbSNP ID | rs11540652 |
--- Build | SNP 137 |
--- Source | in-house-pipeline |
-- Genomic location | chr17:7577538-7577538 |
--- Assembly version | hg19 |
--- Source | ClinVar |
-- Gene | TP53 |
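The relationship between the JSON and tabbed notations for nested documents can be sketched in Python. The function name `to_tabbed` is hypothetical, and the prefix rule (one "-" per nesting level, none for the root) is inferred from Examples A and B rather than taken from a format specification. List-valued properties ("items") are not handled in this sketch.

```python
def to_tabbed(props, depth=0):
    """Render a {name: {"value": ..., "properties": {...}}} mapping as tabbed lines."""
    lines = []
    for name, prop in props.items():
        # Root level has no prefix; each nesting level adds one "-".
        label = ("-" * depth + " " + name) if depth else name
        # rstrip() keeps an empty value rendered as "name | |".
        row = " | ".join([label, str(prop.get("value", ""))]).rstrip() + " |"
        lines.append(row)
        lines.extend(to_tabbed(prop.get("properties", {}), depth + 1))
    return lines


# A trimmed-down version of the Example B document.
doc = {
    "Variant-Phenotype ID": {
        "value": "rs11540652 - Li Fraumeni syndrome 1",
        "properties": {
            "Allele Information": {
                "value": "",
                "properties": {
                    "dbSNP ID": {"value": "rs11540652"},
                    "Gene": {"value": "TP53"},
                },
            }
        },
    }
}

for line in to_tabbed(doc):
    print(line)
```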
Example C. Nested Document Containing a List
- Note that "items" for the "COSMIC Records" field is a list/array.
- Note that "COSMIC Records" has "items" instead of "properties".
- Note that all the "COSMIC Record" items in the "COSMIC Records" list/array are the same kind of property (homogeneous), are identifier properties, and are singly rooted, albeit of a complex & nested kind.
JSON Notation
{
  "Evidence" : { "value" : "Frequency of variant", "properties" : {
    "COSMIC Records" : { "value" : "tissue evidence", "items" : [
      { "COSMIC Record" : { "value" : "evidence1", "properties" : {
        "Tissue" : { "value" : "lung" },
        "Frequency" : { "value" : 0.0179 },
        "Population" : { "value" : 168, "properties" : {
          "Carriers" : { "value" : 3 }
        }}
      }} },
      { "COSMIC Record" : { "value" : "evidence2", "properties" : {
        "Tissue" : { "value" : "plasma" },
        "Frequency" : { "value" : 0.0153 },
        "Population" : { "value" : 261, "properties" : {
          "Carriers" : { "value" : 4 }
        }}
      }} },
      { "COSMIC Record" : { "value" : "evidence3", "properties" : {
        "Tissue" : { "value" : "colon polyps" },
        "Frequency" : { "value" : 0.3158 },
        "Population" : { "value" : 19, "properties" : {
          "Carriers" : { "value" : 6 }
        }}
      }} }
    ]}
  }}
}
Tabbed Notation
Evidence | Frequency of variant |
* COSMIC Records | tissue evidence |
*- COSMIC Record | evidence1 |
*-- Tissue | lung |
*-- Frequency | 0.0179 |
*-- Population | 168 |
*--- Carriers | 3 |
*- COSMIC Record | evidence2 |
*-- Tissue | plasma |
*-- Frequency | 0.0153 |
*-- Population | 261 |
*--- Carriers | 4 |
*- COSMIC Record | evidence3 |
*-- Tissue | colon polyps |
*-- Frequency | 0.3158 |
*-- Population | 19 |
*--- Carriers | 6 |
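The structural rules illustrated by Examples A to C (single root, "properties" and "items" mutually exclusive, singly-rooted and homogeneous list items) can be checked with a short Python sketch. The function names are hypothetical. Homogeneity is approximated here by comparing item property names only, which is an assumption: a real check would compare each item against the property definition in the document model.

```python
def check_prop(name, prop, path=""):
    """Return a list of structural problems found at or below one property."""
    here = f"{path}/{name}"
    if not isinstance(prop, dict):
        return [f"{here}: property must be an object"]
    problems = []
    if "properties" in prop and "items" in prop:
        problems.append(f'{here}: has both "properties" and "items"')
    for sub_name, sub_prop in prop.get("properties", {}).items():
        problems.extend(check_prop(sub_name, sub_prop, here))
    item_names = set()
    for item in prop.get("items", []):
        # Each list item must itself be singly rooted.
        if not isinstance(item, dict) or len(item) != 1:
            problems.append(f"{here}: list item is not singly rooted")
            continue
        (item_name, item_prop), = item.items()
        item_names.add(item_name)
        problems.extend(check_prop(item_name, item_prop, here))
    if len(item_names) > 1:
        problems.append(f"{here}: list items are not homogeneous")
    return problems


def check_document(doc):
    """A document must have exactly one root property."""
    if not isinstance(doc, dict) or len(doc) != 1:
        return ["document must be singly rooted"]
    (root_name, root_prop), = doc.items()
    return check_prop(root_name, root_prop)


good = {"Evidence": {"value": "Frequency of variant", "properties": {
    "COSMIC Records": {"value": "tissue evidence", "items": [
        {"COSMIC Record": {"value": "evidence1"}},
        {"COSMIC Record": {"value": "evidence2"}},
    ]}}}}
mixed = {"Evidence": {"value": "", "items": [
    {"COSMIC Record": {"value": "evidence1"}},
    {"Other Record": {"value": "evidence2"}},
]}}

print(check_document(good))
print(check_document(mixed))
```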