PFB Files

Overview of the Portable Format for Bioinformatics (PFB) file type

What is a Portable Format for Bioinformatics?

A Portable Format for Bioinformatics (PFB) allows users to transfer both the metadata from the the Data Dictionary as well as the Data Dictionary itself. As a result, data can be transferred while keeping the structure from the original source. Specifically, a PFB consists of three parts:

  • A schema

  • Metadata

  • Data

For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB github page and install the newest version.

Note The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Due to the large amount of data stored within PFB files, only small sections are shown with breaks (displayed as ... ) occurring in the output.

Schema

A schema is a JSON formatted Data Dictionary containing information about the properties, such as value types, descriptions, and so on.

To view the PFB schema, use the following command:

pfb show -i PFB_file.avro schema

Example Output

...
  {
    "type": "record",
    "name": "gene_expression",
    "fields": [
      {
        "default": null,
        "name": "data_category",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_category",
            "symbols": [
              "Transcriptome Profiling"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_type",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_type",
            "symbols": [
              "Gene Expression Quantification"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_format",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_format",
            "symbols": [
              "TXT",
              "TSV",
              "CSV",
              "GCT"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "experimental_strategy",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_experimental_strategy",
            "symbols": [
              "RNA-Seq",
              "Total RNA-Seq"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "file_name",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "name": "file_size",
        "type": [
          "null",
          "long"
        ]
      },
      {
        "default": null,
        "name": "md5sum",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "doc": "The GUID of the object in the index service.",
        "name": "object_id",
        "type": [
          "null",
          "string"
        ]
      }
...

NOTE: To make the outputs more human-readable, the above information was then piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq

Metadata

The metadata in a PFB contains all of the information explaining the linkage between nodes and external references for each of the properties.

To view the PFB metadata, use the following command:

pfb show -i PFB_file.avro metadata

Example Output

...
    {
      "name": "exposure",
      "ontology_reference": "",
      "values": {},
      "links": [
        {
          "multiplicity": "MANY_TO_ONE",
          "dst": "subject",
          "name": "subjects"
        }
      ],
      "properties": [
        {
          "name": "years_smoked",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "years_smoked_gt89",
          "ontology_reference": "Person Smoking Duration Year Count",
          "values": {
            "source": "caDSR",
            "cde_id": "3137957",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3137957&version=1.0"
          }
        },
        {
          "name": "alcohol_history",
          "ontology_reference": "Alcohol Lifetime History Indicator",
          "values": {
            "source": "caDSR",
            "cde_id": "2201918",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=2201918&version=1.0"
          }
        },
        {
          "name": "alcohol_intensity",
          "ontology_reference": "Person Self-Report Alcoholic Beverage Exposure Category",
          "values": {
            "source": "caDSR",
            "cde_id": "3457767",
            "cde_version": "1.0",
            "term_url": "https://cdebrowser.nci.nih.gov/CDEBrowser/search?elementDetails=9&FirstTimer=0&PageId=ElementDetailsGroup&publicId=3457767&version=1.0"
          }
        },
...

Data

The data in the PFB are the values for the properties in the format of the Data Dictionary.

To view the data within the PFB, use the following command:

pfb show -i PFB_file.avro

To view at a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:

pfb show -i PFB_file.avro -n 10

Example Output

...
{
  "id": "6c5e21d5-da76-49a5-9f82-7e3a726d44c6",
  "name": "lab_result",
  "object": {
    "cer451q1": null,
    "oxldl1": null,
    "f81c": null,
    "renins1c": null,
    "cystatc1": null,
    "triglycerides": -0.40415245294570923,
    "glucos1c": 6.5463337898254395,
    "glucos1u": null,
    "ldl": 2.0789523124694824,
    "hdl": 2.7123606204986572,
    "creatin1": null,
    "total_cholesterol": 3.039848566055298,
    "chlcat1c": null,
    
...

    "uabcat1c": null,
    "inslnr1t": 1.8090298175811768,
    "vldlp31c": null,
    
...

    "unit_hematocrit_vfr_bld": null,
    "age_at_total_cholesterol": 80,
    "unit_total_cholesterol": null,
    "age_at_triglycerides": 80,
    "unit_triglycerides": null,
    "age_at_hdl": 80,
    "unit_hdl": null,
    "age_at_ldl": 80,
    "unit_ldl": null,
    
...

    "unit_mcv_entvol_rbc": null,
    "submitter_id": "HG00325_lab_res",
    "state": "validated",
    "project_id": "tutorial-synthetic_data_set_1",
    "created_datetime": "2020-01-27T13:54:06.745386+00:00",
    "updated_datetime": "2020-01-27T13:54:06.745386+00:00"
  },
  "relations": [
    {
      "dst_id": "f4fdda57-80f4-4995-bea2-161c3242c525",
      "dst_name": "subject"
    }
  ]
}

Last updated