type: resource
resource: MolecularDefinition

MolecularDefinition

Introduction

Scope and Usage

The MolecularDefinition resource represents molecular entities (e.g., nucleotide or protein sequences) for both clinical and non-clinical use cases, including translational research. The resource is definitional, in that it focuses on discrete, computable, and semantically expressive data structures that reflect the genomic domain. Because the resource focuses on the molecular entities rather than specimen source or annotated knowledge, it supports both patient/participant-specific use cases and population-based data, and both human and non-human data.

The MolecularDefinition resource itself is abstract, but it supports profiles for core molecular concepts, including Sequence (nucleotide and protein), Allele, Variation, Haplotype, and Genotype. Support for additional molecular types, such as structural variation, fusions, and biomarkers, will be considered in the future.

Use cases supported by this resource include but are not limited to:

Sequence Representation

Use cases often require expression of the same genomic concept in different ways. Since the concept is the same and only the serialization of it differs, the Molecular Definition resource supports multiple approaches to representing molecular sequences. This allows senders and receivers of messages to choose a sequence representation that is most intuitive for the particular use case.

It is important to note that all representations of a given sequence MUST resolve to the exact same primary sequence. Therefore, if a single instance of MolecularDefinition contains one literal, two resolvable files, and a code, all four of those representations must represent the same sequence. Note that this equivalence does not apply to metadata or annotations that are outside the scope of the Molecular Definition resource, since those data are not definitional to the molecule.

Boundaries and Relationships

As a definitional resource, MolecularDefinition should be profiled to consistently represent molecular concepts. Profiles for structured representation of the fundamental concepts can be found in the Molecular Definition Implementation Guide for Molecular Data Types, which includes the Sequence, Allele, Variation, Haplotype, and Genotype profiles. This guide is a work in progress and will evolve to represent additional concepts.

This resource does not capture workflow (e.g., test ordering/resulting process), the method of obtaining or specifying the molecular content (e.g., the test or assay), or the interpretation of the results (e.g., clinical impact). Those concepts will be captured by profiles of Observation and by the Genomic Study resource. In particular, the Genomics Reporting Implementation Guide contains extensive support for the observation and reporting of clinical genomic results.

Background and Context

Provides additional detail on exactly how the resource is to be used

Notes


Encodings

Molecular Definitions are represented using numerous encodings, which are not always explicitly specified. The representation.literal.encoding attribute captures this information directly, so that implementors can validate the content of messages and computationally determine how a particular sequence should be interpreted.

The examples below illustrate different encodings, which could be used to create terms for this attribute. They are based on the IUPAC symbols for nucleotide and amino acid sequences.

Nucleotide Symbols (1-letter, no ambiguity, DNA residues)

| Symbol | Meaning | Origin of designation |
| --- | --- | --- |
| G | Guanine | G |
| A | Adenine | A |
| T | Thymine | T |
| C | Cytosine | C |

Nucleotide Symbols (1-letter, no ambiguity, RNA residues)

| Symbol | Meaning | Origin of designation |
| --- | --- | --- |
| G | Guanine | G |
| A | Adenine | A |
| U | Uracil | U |
| C | Cytosine | C |

Nucleotide Symbols (1-letter, no ambiguity except N, DNA residues)

| Symbol | Meaning | Origin of designation |
| --- | --- | --- |
| G | Guanine | G |
| A | Adenine | A |
| T | Thymine | T |
| C | Cytosine | C |
| N | G or A or T or C | aNy |

Nucleotide Symbols (1-letter, with ambiguity, DNA residues)

| Symbol | Meaning | Origin of designation |
| --- | --- | --- |
| G | Guanine | G |
| A | Adenine | A |
| T | Thymine | T |
| C | Cytosine | C |
| R | G or A | puRine |
| Y | T or C | pYrimidine |
| M | A or C | aMino |
| K | G or T | Keto |
| S | G or C | Strong interaction (3 H bonds) |
| W | A or T | Weak interaction (2 H bonds) |
| H | A or C or T | not-G, H follows G in the alphabet |
| B | G or T or C | not-A, B follows A |
| V | G or C or A | not-T (not-U), V follows U |
| D | G or A or T | not-C, D follows C |
| N | G or A or T or C | aNy |

Amino Acid Symbols (1-letter, no ambiguity, 20 common)

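| Symbol | Amino acid |
| --- | --- |
| A | alanine |
| C | cysteine |
| D | aspartic acid |
| E | glutamic acid |
| F | phenylalanine |
| G | glycine |
| H | histidine |
| I | isoleucine |
| K | lysine |
| L | leucine |
| M | methionine |
| N | asparagine |
| P | proline |
| Q | glutamine |
| R | arginine |
| S | serine |
| T | threonine |
| V | valine |
| W | tryptophan |
| Y | tyrosine |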

Amino Acid Symbols (3-letter, no ambiguity, 20 common)

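| Symbol | Amino acid |
| --- | --- |
| Ala | alanine |
| Cys | cysteine |
| Asp | aspartic acid |
| Glu | glutamic acid |
| Phe | phenylalanine |
| Gly | glycine |
| His | histidine |
| Ile | isoleucine |
| Lys | lysine |
| Leu | leucine |
| Met | methionine |
| Asn | asparagine |
| Pro | proline |
| Gln | glutamine |
| Arg | arginine |
| Ser | serine |
| Thr | threonine |
| Val | valine |
| Trp | tryptophan |
| Tyr | tyrosine |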

Amino Acid Symbols (1-letter, with ambiguity)

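| Symbol | Amino acid |
| --- | --- |
| A | alanine |
| B | aspartic acid or asparagine |
| C | cysteine |
| D | aspartic acid |
| E | glutamic acid |
| F | phenylalanine |
| G | glycine |
| H | histidine |
| I | isoleucine |
| K | lysine |
| L | leucine |
| M | methionine |
| N | asparagine |
| P | proline |
| Q | glutamine |
| R | arginine |
| S | serine |
| T | threonine |
| U | selenocysteine |
| V | valine |
| W | tryptophan |
| X | unknown or 'other' amino acid |
| Y | tyrosine |
| Z | glutamic acid or glutamine |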

Amino Acid Symbols (3-letter, with ambiguity)

| Symbol | Amino acid |
| --- | --- |
| Ala | alanine |
| Asx | aspartic acid or asparagine |
| Cys | cysteine |
| Asp | aspartic acid |
| Glu | glutamic acid |
| Phe | phenylalanine |
| Gly | glycine |
| His | histidine |
| Ile | isoleucine |
| Lys | lysine |
| Leu | leucine |
| Met | methionine |
| Asn | asparagine |
| Pro | proline |
| Gln | glutamine |
| Arg | arginine |
| Ser | serine |
| Thr | threonine |
| Sec | selenocysteine |
| Val | valine |
| Trp | tryptophan |
| Xaa | unknown or 'other' amino acid |
| Tyr | tyrosine |
| Glx | glutamic acid or glutamine |
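
As an illustration of how a declared encoding could support validation, the following minimal sketch checks a sequence string against the 1-letter alphabets tabulated above. The encoding labels (for example `dna-unambiguous`) and the function name are hypothetical, invented for this example; they are not terms defined by the resource.

```python
# Hypothetical encoding labels mapped to the 1-letter alphabets tabulated above.
# The labels are illustrative only and are not defined by the resource.
ALPHABETS = {
    "dna-unambiguous": set("GATC"),
    "rna-unambiguous": set("GAUC"),
    "dna-unambiguous-or-n": set("GATCN"),
    "dna-ambiguous": set("GATCRYMKSWHBVDN"),
    "protein-unambiguous": set("ACDEFGHIKLMNPQRSTVWY"),
    "protein-ambiguous": set("ABCDEFGHIKLMNPQRSTUVWXYZ"),
}


def is_valid_literal(value: str, encoding: str) -> bool:
    """Return True if value is non-empty and every symbol is in the named alphabet."""
    alphabet = ALPHABETS[encoding]
    return len(value) > 0 and all(symbol in alphabet for symbol in value)


print(is_valid_literal("ATG", "dna-unambiguous"))      # valid as DNA
print(is_valid_literal("ATG", "protein-unambiguous"))  # also valid as protein
print(is_valid_literal("AUG", "dna-unambiguous"))      # U is not a DNA symbol
```

A string such as `ATG` passes both the DNA and the protein check, which is why the encoding has to be stated rather than inferred from the characters alone.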

Molecular Representations

The Molecular Definition resource supports several different methods for representing a molecule. Some of the elements described below may apply only to sequences, and different elements may be added to support other types of molecular concepts.

Native representations: The literal, code, and resolvable are native representations, meaning they represent a sequence “as-is” without any additional computation.

Derived representations: The extracted, concatenated, repeated, and relative representations are derived representations, meaning they require one or more computational operations to be performed to create the sequence that is being represented.

Literal

The literal element can be used to represent a sequence as a string of characters. By convention, nucleotide sequences are expressed 5’ to 3’ and protein sequences are expressed N to C terminus. The encoding element can optionally be used to specify the encoding used for the sequence literal. The encoding can be important in disambiguating sequences that share alphabets (for example, ATG might represent a translation start codon in DNA, but it could also represent a peptide containing 3 amino acids).

Code

The code element can be used to represent a sequence by reference, using an accession number that identifies a specific sequence within a repository. The code, system, and version elements of the Coding data type can be used to fully disambiguate one code from another. Note that the code element does not guarantee that the repository is publicly accessible or that the sequence referenced by the code can be retrieved; it only specifies the sequence using a code that could be exchanged. Thus, this element could be used for both a public sequence repository (e.g., GenBank) and a private database (e.g., biobank).
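
The role of system, code, and version in disambiguation can be sketched as follows. This is a simplified stand-in for the Coding data type, not a full model of it, and the system URI and accession values are placeholders invented for the example.

```python
from typing import NamedTuple, Optional


class Coding(NamedTuple):
    """Simplified subset of the Coding data type used to identify a sequence."""
    system: str
    code: str
    version: Optional[str] = None


def same_coded_sequence(a: Coding, b: Coding) -> bool:
    """Treat two codings as the same record only if system, code, and version all match."""
    return a == b


# Placeholder values: the repository URI and accession are hypothetical.
record_v1 = Coding(system="http://example.org/sequence-repository", code="SEQ0001", version="1")
record_v2 = Coding(system="http://example.org/sequence-repository", code="SEQ0001", version="2")

print(same_coded_sequence(record_v1, record_v2))  # versions differ
```

Comparing all three elements matters because an accession may point to different primary sequences across versions of the same record.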

Resolvable

The resolvable element can be used to represent a sequence by reference, but it also implies that the sequence is accessible and SHOULD be resolvable (although a security layer may be present). This element makes use of the DocumentReference resource, which contains the content.attachment element. The Attachment datatype can be used to represent sequences that are captured as a formatted file (using .contentType and .data) or as a URL (using .contentType and .url).
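
The two Attachment styles can be sketched with plain dictionaries. The sketch assumes FASTA as the file format and `text/x-fasta` as its media type, neither of which is mandated here; the helper names and the URL are hypothetical. Attachment.data carries base64-encoded content.

```python
import base64

FASTA_MEDIA_TYPE = "text/x-fasta"  # assumed media type, not mandated by the resource


def attachment_from_fasta(fasta_text: str) -> dict:
    """Build an Attachment-like dict that embeds a FASTA file inline via .data."""
    encoded = base64.b64encode(fasta_text.encode("utf-8")).decode("ascii")
    return {"contentType": FASTA_MEDIA_TYPE, "data": encoded}


def attachment_from_url(url: str) -> dict:
    """Build an Attachment-like dict that points at a retrievable file via .url."""
    return {"contentType": FASTA_MEDIA_TYPE, "url": url}


def sequence_from_attachment(attachment: dict) -> str:
    """Decode inline data and drop FASTA header lines to recover the primary sequence."""
    text = base64.b64decode(attachment["data"]).decode("utf-8")
    return "".join(line.strip() for line in text.splitlines() if not line.startswith(">"))


inline = attachment_from_fasta(">example\nACGT\nTTGA\n")
remote = attachment_from_url("https://example.org/sequences/example.fasta")
print(sequence_from_attachment(inline))
```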

Extracted

The extracted element can be used to represent a sequence that is derived from another, longer sequence. The startingMolecule element refers to the “parent” sequence, and is itself an instance of Molecular Definition (with its own representation). The coordinateInterval element specifies a precise interval on the “parent” sequence, which is to be extracted (conceptually or literally) and optionally reverse-complemented. This element provides a way to conveniently reference regions of very long molecules (e.g., chromosomes) without requiring either the “parent” or the extracted sequence to be serialized. Conceptually, this representation is the inverse operation of the concatenated representation.
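
The extraction operation can be sketched on plain strings. The sketch assumes a 0-based, end-exclusive interval and an unambiguous DNA alphabet; the coordinate convention actually in force is whatever the coordinateInterval declares, and the function name is hypothetical.

```python
# Complement table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract(parent: str, start: int, end: int, reverse_complement: bool = False) -> str:
    """Return parent[start:end], optionally reverse-complemented.

    Coordinates are 0-based and end-exclusive, an assumed convention.
    """
    if not 0 <= start <= end <= len(parent):
        raise ValueError("interval lies outside the parent sequence")
    region = parent[start:end]
    if reverse_complement:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


print(extract("AAACGT", 1, 4))                           # forward strand
print(extract("AAACGT", 1, 4, reverse_complement=True))  # opposite strand
```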

Concatenated

The concatenated element can be used to represent a sequence that is composed of other sequences concatenated together to form the intended sequence. Each sequenceElement is specified as an instance of Molecular Definition (and each has its own representation). The order of concatenation is explicitly defined using the ordinalIndex element. Conceptually, this representation is the inverse operation of the extracted representation.
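
Ordering by ordinalIndex can be sketched as below. Each element is modeled as a dictionary whose `sequence` key holds an already-resolved literal; that key and the uniqueness check are assumptions made for the example, not requirements stated by the resource.

```python
def concatenate(sequence_elements: list[dict]) -> str:
    """Join element sequences in ascending ordinalIndex order.

    Each element is a dict with 'ordinalIndex' and a hypothetical 'sequence'
    key holding the already-resolved literal for that element.
    """
    indices = [element["ordinalIndex"] for element in sequence_elements]
    if len(set(indices)) != len(indices):
        raise ValueError("ordinalIndex values must be unique in this sketch")
    ordered = sorted(sequence_elements, key=lambda element: element["ordinalIndex"])
    return "".join(element["sequence"] for element in ordered)


elements = [
    {"ordinalIndex": 2, "sequence": "GGG"},
    {"ordinalIndex": 1, "sequence": "AAA"},
    {"ordinalIndex": 3, "sequence": "TT"},
]
print(concatenate(elements))
```

Because the order comes from ordinalIndex rather than from position in the list, the result does not depend on how the elements happen to be serialized.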

Repeated

The repeated element can be used to represent a sequence that is composed of a sequence motif repeated a specified number of times. The sequenceMotif is an instance of Molecular Definition (and has its own representation), and copyCount specifies the number of times the motif is copied in tandem. Conceptually, this representation is a special case of the concatenated representation, where each element is an identical copy of a given motif.
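
A minimal sketch of the repeat operation, with the motif given as an already-resolved string (the function name is hypothetical):

```python
def repeat(sequence_motif: str, copy_count: int) -> str:
    """Return the motif copied in tandem copy_count times."""
    if copy_count < 0:
        raise ValueError("copyCount must not be negative")
    return sequence_motif * copy_count


expanded = repeat("CAG", 4)
# The same result obtained as a concatenation of identical elements.
as_concatenation = "".join(["CAG"] * 4)
print(expanded, expanded == as_concatenation)
```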

Relative

The relative element can be used to represent a sequence in relation to another sequence, where the difference between the two sequences can be expressed as an ordered series of edit operations. This representation can be used to conveniently represent minor but meaningful differences between long or complex sequences (e.g., HLA alleles). Algorithmically, the relative representation defines a sequence by beginning with a startingMolecule (an instance of Molecular Definition) and performing at least one edit operation on it. Each edit operation is performed in order and includes replacing the sequence (the replacedMolecule) at a defined coordinateInterval with the sequence specified by the replacementMolecule. The resulting sequence after all edits have been performed is the sequence referenced by this representation element.

Note that the edits specified in this representation are operations and NOT variations. Variations are defined as a specific comparison between two states (a reference and an alternative). Although variations are sometimes called “changes” and might therefore be confused with edit operations, the two are semantically distinct concepts.
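
The ordered edit procedure can be sketched on plain strings. Two assumptions are made here that the text above does not settle: intervals are 0-based and end-exclusive, and each interval refers to the sequence as it stands after the preceding edits. The `Edit` type and its field names are hypothetical.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    start: int        # 0-based, end-exclusive interval (assumed convention)
    end: int
    replaced: str     # sequence expected at the interval before the edit
    replacement: str  # sequence put in its place


def apply_edits(starting: str, edits: list[Edit]) -> str:
    """Apply each edit in order and return the resulting sequence.

    Coordinates refer to the sequence as modified by earlier edits,
    which is an assumption of this sketch.
    """
    if not edits:
        raise ValueError("at least one edit operation is required")
    current = starting
    for edit in edits:
        if current[edit.start:edit.end] != edit.replaced:
            raise ValueError("replaced sequence does not match the interval")
        current = current[:edit.start] + edit.replacement + current[edit.end:]
    return current


result = apply_edits(
    "ACGTACGT",
    [
        Edit(2, 3, "G", "T"),   # substitution
        Edit(4, 4, "", "CC"),   # insertion at an empty interval
    ],
)
print(result)
```

Checking the replaced sequence against the interval before each edit catches a mismatch between the edit and the starting molecule early, rather than silently producing a different sequence.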

Combining Representations

Since the derived representations (extracted, concatenated, repeated, and relative) each reference Molecular Definition, representations can be combined to support complex use cases. For example:

It is possible to create arbitrarily deep structures using derived representations. While there might be a rationale for doing so, implementations should avoid overly complex representation structures.
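
Nesting can be illustrated with a small recursive resolver. The dictionary layout is a simplified, hypothetical stand-in for the resource structure, and only the literal, repeated, and concatenated representations are handled.

```python
def resolve(representation: dict) -> str:
    """Recursively resolve a simplified representation dict to a literal string.

    The dict layout is hypothetical and covers only three representation types.
    """
    if "literal" in representation:
        return representation["literal"]
    if "repeated" in representation:
        repeated = representation["repeated"]
        return resolve(repeated["sequenceMotif"]) * repeated["copyCount"]
    if "concatenated" in representation:
        elements = sorted(representation["concatenated"], key=lambda e: e["ordinalIndex"])
        return "".join(resolve(e["sequenceElement"]) for e in elements)
    raise ValueError("representation type not supported by this sketch")


# A repeat region with flanking sequence: concatenated(literal, repeated(literal), literal).
flanked_repeat = {
    "concatenated": [
        {"ordinalIndex": 1, "sequenceElement": {"literal": "GCTG"}},
        {
            "ordinalIndex": 2,
            "sequenceElement": {
                "repeated": {"sequenceMotif": {"literal": "CAG"}, "copyCount": 3}
            },
        },
        {"ordinalIndex": 3, "sequenceElement": {"literal": "CAAC"}},
    ]
}
print(resolve(flanked_repeat))
```

Each additional level of nesting adds another recursive call on resolution, which is one practical reason to keep structures shallow.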

Equivalence and Identity

Every representation, regardless of its complexity, can be resolved to a literal. Two instances of MolecularDefinition are considered equivalent if they define the same entity. For molecular sequences, this means that for two instances of MolecularDefinition to be equivalent they must resolve to the same literal sequence. Two instances are identical if their serializations are identical: they must contain the same elements, and each corresponding element must have the same value.
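
The distinction between equivalence and identity can be sketched as follows. Instances are modeled as simplified, hypothetical dictionaries, identity is approximated by comparing the parsed structures, and the resolver handles only the literal and repeated representations.

```python
def resolve(representation: dict) -> str:
    """Resolve a simplified, hypothetical representation dict to a literal."""
    if "literal" in representation:
        return representation["literal"]
    if "repeated" in representation:
        repeated = representation["repeated"]
        return resolve(repeated["sequenceMotif"]) * repeated["copyCount"]
    raise ValueError("representation type not supported by this sketch")


def are_equivalent(a: dict, b: dict) -> bool:
    """Equivalent: both instances resolve to the same literal sequence."""
    return resolve(a) == resolve(b)


def are_identical(a: dict, b: dict) -> bool:
    """Identical: same elements with the same values (structural comparison)."""
    return a == b


as_literal = {"literal": "CAGCAG"}
as_repeat = {"repeated": {"sequenceMotif": {"literal": "CAG"}, "copyCount": 2}}

print(are_equivalent(as_literal, as_repeat))  # same resolved sequence
print(are_identical(as_literal, as_repeat))   # different elements
```

Identical instances are always equivalent, but the reverse does not hold: the two instances above define the same sequence through different elements.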

Profiling MolecularDefinition

Support for Molecular Concepts

The Molecular Definition resource supports several profiles that represent molecular concepts:

In addition, profiles have been drafted to represent the concepts of Haplotype and Genotype, although they have not been exercised as deeply as the profiles listed above. Finally, preliminary work has demonstrated that the Molecular Definition resource could be used to represent concepts related to structural variation, including Adjacency and Fusion. It is anticipated that profiles to support these concepts will be developed over time.

Modular Semantics and Schemas

The MolecularDefinition resource is an abstract resource that provides building blocks for creating semantically robust, computable structures that define molecular entities. The two most complex backbone elements, location and representation, support the concept of molecular sequences but they might not be relevant to other types of entities. Conversely, other entities may require different backbone elements. As such, it is expected that these high-level backbone elements will serve as modular schemas that can be profiled as needed for a given molecular entity. Profiling could include constraints on cardinality (e.g., the Sequence profile has 0..0 location, while Allele has 1..1 location) and slicing.

Slicing the Representation Element

The representation backbone element provides a series of methods for specifying the value of a sequence. As a result, the entire structure can be used any time a sequence is referenced, and this is accomplished by slicing. For example, the current sequence-based profiles of MolecularDefinition slice the representation element as follows:

| Profile | Cardinality | Focus (slice) | Semantic meaning |
| --- | --- | --- | --- |
| Sequence | 1..1 | Primary Sequence | The primary sequence of the molecule |
| Allele | 1..1 | Allele | The sequence of the Allele at the specified Location |
| Allele | 0..1 | Context | The sequence of the contextual sequence at the specified Location |
| Variation | 1..1 | Reference | The sequence defined as the reference allele (at the specified Location) |
| Variation | 1..1 | Alternate | The sequence defined as the alternate allele (at the specified Location) |
| Variation | 0..1 | Context | The sequence of the contextual sequence at the specified Location |

StructureDefinition

Elements (Simplified)

Mappings

Resource Packs

list-MolecularDefinition-packs.xml

<?xml version="1.0" encoding="UTF-8"?>

<List xmlns="http://hl7.org/fhir">
  <id value="MolecularDefinition-packs"/>
  <status value="current"/>
  <mode value="working"/>
</List>

Search Parameters

Full Search Parameters

Examples

Full Examples

Mapping Exceptions

moleculardefinition-fivews-mapping-exceptions.xml

Unmapped Elements