Data Document

Preamble

Data documents or data records stored in a GenboreeKB are sets of properties, where a "property" is a name-value pair.
  • Properties themselves can have subordinate properties (nesting). This means that a property can have one or more sub-properties in addition to its "value".
    • Thus, hierarchical/tree-like document models are supported through nesting.
  • Alternatively, a property can have a list/array of sub-properties.
    • While not strictly hierarchical or tree-like (the list's size is open ended), this allows a property to contain a list of zero or more sub-properties--possibly themselves complex, nested, hierarchical properties.
    • The list is homogeneous in that ALL sub-properties (items) in the list have the SAME property definition [in the document model or schema].
    • Such lists are good for accumulating uniform records, which is not possible with nested sub-properties alone. As mentioned, list size is open-ended.
    • Note that the property definition in a list must be singly-rooted, but may have any level of nesting.
  • Note that documents need to be singly rooted and the root property needs to be the document identifier property (unique name by which you refer to the document, hopefully in some meaningful way).
Note:
  • Case matters for property names and acceptable values.
    • "chr12" != "Chr12"
    • "pathogenic" != "Pathogenic" <-- this is important for enums; if "pathogenic" is one of five acceptable values, "PathoGeniC" will be rejected.
  • Domains matter for property values.
    • 111222 != "111222" (i.e. supplying a String in an Int field is incorrect)
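
The two notes above can be illustrated with a minimal Python sketch. The helper names and the `ALLOWED_SIGNIFICANCE` set are hypothetical and only serve the illustration; they are not part of GenboreeKB. The point is that comparisons are exact with respect to both case and type.

```python
# Hypothetical enum of acceptable values (illustration only, not a GenboreeKB API).
ALLOWED_SIGNIFICANCE = {"pathogenic", "benign", "uncertain"}


def is_acceptable_enum(value, allowed):
    """Exact, case-sensitive membership test."""
    return value in allowed


def is_int_value(value):
    """True only for real integers; a string of digits does not count.

    bool is excluded because Python treats it as a subclass of int.
    """
    return isinstance(value, int) and not isinstance(value, bool)


print(is_acceptable_enum("pathogenic", ALLOWED_SIGNIFICANCE))  # True
print(is_acceptable_enum("PathoGeniC", ALLOWED_SIGNIFICANCE))  # False
print("chr12" == "Chr12")                                       # False
print(is_int_value(111222))                                     # True
print(is_int_value("111222"))                                   # False
```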

Syntax for Properties in a Document

Given that:
  • A document is a collection of properties, possibly hierarchical.
  • Properties are name-value pairs.
  • Properties have a value, and may have sub-properties or a list/array of sub-property items.

Then: What's the syntax for representing such a property in a document?

  • The representation of a property needs its name, its value, and any sub-properties or a homogeneous list of sub-property items.
  • The name is implicit, as it's just a field name.
    • i.e. a field within an object (JSON, Javascript perspective)
    • i.e. a key within a map or hash (YAML, Ruby, Perl perspective)
  • The value stored at that field is always an object--or a Map or Hash if you prefer--with its own specific fields:
    1. "value" - the actual value for the property is stored in this field.
    2. "properties" - [Optional] the sub-properties, if any; provided as an object/hash; mutually exclusive with "items".
    3. "items" - [Optional] a list/array of sub-property items; mutually exclusive with "properties".
  • Regarding "properties" and "items" keys:
    • They are mutually exclusive: a property object may carry one or the other, but never both.
    • There is no need to provide these for leaf properties that have no sub-properties (or if there are currently no sub-properties available). Just "value" is fine.
Aside: All property objects have the "value" field.
  • You should make a habit of providing it, even when it is "" or when you want the default value.
  • But if you don't, the defined default value for the property will be assumed and used.
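
The shape rules above can be sketched as a small recursive check in Python. This is only an illustration under the assumptions stated in the comments; `check_property_object` is a hypothetical helper, not a GenboreeKB function, and it checks structure only (not value domains or the document model).

```python
def check_property_object(name, prop):
    """Return a list of structural problems found in one property object."""
    problems = []
    if not isinstance(prop, dict):
        return [f"{name}: a property must be an object/hash"]
    # A missing "value" is not reported: the default value is assumed instead.
    if "properties" in prop and "items" in prop:
        problems.append(f"{name}: 'properties' and 'items' are mutually exclusive")

    sub_props = prop.get("properties", {})
    if isinstance(sub_props, dict):
        for sub_name, sub_prop in sub_props.items():
            problems.extend(check_property_object(f"{name}.{sub_name}", sub_prop))
    else:
        problems.append(f"{name}: 'properties' must be an object/hash")

    items = prop.get("items", [])
    if isinstance(items, list):
        for index, item in enumerate(items):
            label = f"{name}[{index}]"
            # Assumption: each list item is an object with exactly one root property.
            if not isinstance(item, dict) or len(item) != 1:
                problems.append(f"{label}: each item must be singly rooted")
                continue
            (item_name, item_prop), = item.items()
            problems.extend(check_property_object(f"{label}.{item_name}", item_prop))
    else:
        problems.append(f"{name}: 'items' must be a list/array")
    return problems


good = {"value": "RCV000037626", "properties": {"chr": {"value": "chr12"}}}
bad = {"value": "x", "properties": {}, "items": []}
print(check_property_object("rcvID", good))  # []
print(check_property_object("rcvID", bad))
```

The dotted and indexed labels (for example `rcvID.chr`) are just a convenient way to report where a problem was found in this sketch.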

Example A. Singly-Rooted, Otherwise Flat Document

JSON Notation

{
  "rcvID"   :
  {
    "value" : "RCV000037626",
    "properties" : 
    {
      "chr"     : { "value" : "chr12" },
      "start"   : { "value" : 112888156 },
      "end"     : { "value" : 112888156 },
      "status"  : { "value" : "current" }
    }
  }
}
// --- or a bit more compactly (complex/dense docs can be hard to read though) ---
{
  "rcvID"   : { "value" : "RCV000037626",  "properties" : {
    "chr"     : { "value" : "chr12" },
    "start"   : { "value" : 112888156 },
    "end"     : { "value" : 112888156 },
    "status"  : { "value" : "current" }
  }}
}

Tabbed Notation

rcvID RCV000037626
- chr chr12
- start 112888156
- end 112888156
- status current

Example B. Document with Nesting

  • Note the propName : { value, properties } pattern.

JSON Notation

{
  "Variant-Phenotype ID" : { "value" : "rs11540652 - Li Fraumeni syndrome 1", "properties" :
  {
    "Allele Information" : { "value" : "", "properties" :
    {
      "dbSNP ID" : { "value" : "rs11540652", "properties" :
      {
        "Build" : { "value" : "SNP 137" },
        "Source" : { "value" : "in-house-pipeline" }
      }},
      "Genomic location" : { "value" : "chr17:7577538-7577538", "properties" :
      {
        "Assembly version" : { "value" : "hg19" },
        "Source" : { "value" : "ClinVar" }
      }},
      "Gene" : { "value" : "TP53" }
    }}
  }}
}

Tabbed Notation

Variant-Phenotype ID rs11540652 - Li Fraumeni syndrome 1
- Allele Information
-- dbSNP ID rs11540652
--- Build SNP 137
--- Source in-house-pipeline
-- Genomic location chr17:7577538-7577538
--- Assembly version hg19
--- Source ClinVar
-- Gene TP53
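
The correspondence between the two notations can be sketched in Python. This is a hypothetical helper, not a GenboreeKB tool, and it rests on assumptions read off the examples here: one "-" per nesting level, a single space as the separator, and an empty value simply omitted. It handles "properties" only, not "items" lists.

```python
def to_tabbed_lines(name, prop, depth=0):
    """Render one property and its nested "properties" as tabbed-notation lines."""
    value = prop.get("value", "")
    parts = ["-" * depth, name, str(value)]
    # Skip empty parts so the root has no prefix and "" values leave no trailing space.
    lines = [" ".join(part for part in parts if part != "")]
    for sub_name, sub_prop in prop.get("properties", {}).items():
        lines.extend(to_tabbed_lines(sub_name, sub_prop, depth + 1))
    return lines


doc = {
    "Variant-Phenotype ID": {"value": "rs11540652 - Li Fraumeni syndrome 1", "properties": {
        "Allele Information": {"value": "", "properties": {
            "dbSNP ID": {"value": "rs11540652", "properties": {
                "Build": {"value": "SNP 137"},
                "Source": {"value": "in-house-pipeline"},
            }},
            "Genomic location": {"value": "chr17:7577538-7577538", "properties": {
                "Assembly version": {"value": "hg19"},
                "Source": {"value": "ClinVar"},
            }},
            "Gene": {"value": "TP53"},
        }},
    }},
}

# A document is singly rooted, so it unpacks to exactly one (name, property) pair.
(root_name, root_prop), = doc.items()
tabbed = to_tabbed_lines(root_name, root_prop)
print("\n".join(tabbed))
```

Run on the Example B document, this reproduces the nine tabbed lines shown above, relying on Python dictionaries preserving insertion order.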

Example C. Nested Document Containing a List

  • Note that "items" for the "COSMIC Records" field is a list/array.
  • Note that "COSMIC Records" has "items" instead of "properties".
  • Note that all the "COSMIC Record" items in the "COSMIC Records" list/array are the same kind of property (homogeneous), are identifier properties, and are singly rooted, albeit of a complex and nested kind.

JSON Notation

{
  "Evidence" : { "value" : "Frequency of variant", "properties" :
  {
    "COSMIC Records" : { "value" : "tissue evidence", "items" :
    [
      {
        "COSMIC Record" : { "value" : "evidence1", "properties" :
        {
          "Tissue"     : { "value" : "lung" },
          "Frequency"  : { "value" : 0.0179 },
          "Population" : { "value" : 168, "properties" :
          {
            "Carriers" : { "value" : 3 }
          }}
        }}
      },
      {
        "COSMIC Record" : { "value" : "evidence2", "properties" :
        {
          "Tissue"     : { "value" : "plasma" },
          "Frequency"  : { "value" : 0.0153 },
          "Population" : { "value" : 261, "properties" :
          {
            "Carriers" : { "value" : 4 }
          }}
        }}
      },
      {
        "COSMIC Record" : { "value" : "evidence3", "properties" :
        {
          "Tissue"     : { "value" : "colon polyps" },
          "Frequency"  : { "value" : 0.3158 },
          "Population" : { "value" : 19, "properties" :
          {
            "Carriers" : { "value" : 6 }
          }}
        }}
      }
    ]}
  }}
}

Tabbed Notation

Evidence Frequency of variant
* COSMIC Records tissue evidence
*- COSMIC Record evidence1
*-- Tissue lung
*-- Frequency 0.0179
*-- Population 168
*--- Carriers 3
*- COSMIC Record evidence2
*-- Tissue plasma
*-- Frequency 0.0153
*-- Population 261
*--- Carriers 4
*- COSMIC Record evidence3
*-- Tissue colon polyps
*-- Frequency 0.3158
*-- Population 19
*--- Carriers 6
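
The homogeneity and single-root requirements for list items can be sketched in Python. The helpers below are hypothetical, not part of GenboreeKB, and they use the root property name of each item as a stand-in for "same property definition"; a real check would consult the document model.

```python
def item_root_names(list_prop):
    """Return the root property name of each item in a list-holding property."""
    names = []
    for item in list_prop.get("items", []):
        if len(item) != 1:
            raise ValueError("each item must be singly rooted")
        names.append(next(iter(item)))
    return names


def is_homogeneous(list_prop):
    """True if all items share one root property name (an empty list also passes)."""
    return len(set(item_root_names(list_prop))) <= 1


# Trimmed copy of the "COSMIC Records" property from Example C.
cosmic_records = {"value": "tissue evidence", "items": [
    {"COSMIC Record": {"value": "evidence1", "properties": {"Tissue": {"value": "lung"}}}},
    {"COSMIC Record": {"value": "evidence2", "properties": {"Tissue": {"value": "plasma"}}}},
    {"COSMIC Record": {"value": "evidence3", "properties": {"Tissue": {"value": "colon polyps"}}}},
]}

print(item_root_names(cosmic_records))
print(is_homogeneous(cosmic_records))  # True
```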