PFB Files

Overview of the Portable Format for Bioinformatics (PFB) file type

What is a Portable Format for Bioinformatics?

A Portable Format for Bioinformatics (PFB) file allows users to transfer both the metadata from the Data Dictionary and the Data Dictionary itself. As a result, data can be transferred while keeping the structure of the original source. Specifically, a PFB consists of three parts:

  • A schema

  • Metadata

  • Data

For more information and an in-depth review that includes Python tools for PFB creation and exploration, refer to the PyPFB GitHub page and install the newest version.

NOTE: The following PFB example is a direct PFB export from the tutorial-synthetic_data_set_1 found on BioData Catalyst Powered by Gen3. Because PFB files store a large amount of data, only small sections are shown, with breaks in the output displayed as ... .

Schema

The schema is a JSON-formatted Data Dictionary that describes each property, including its value type and description.

To view the PFB schema, use the following command:

pfb show -i PFB_file.avro schema

Example Output

...
  {
    "type": "record",
    "name": "gene_expression",
    "fields": [
      {
        "default": null,
        "name": "data_category",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_category",
            "symbols": [
              "Transcriptome Profiling"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_type",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_type",
            "symbols": [
              "Gene Expression Quantification"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "data_format",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_data_format",
            "symbols": [
              "TXT",
              "TSV",
              "CSV",
              "GCT"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "experimental_strategy",
        "type": [
          "null",
          {
            "type": "enum",
            "name": "gene_expression_experimental_strategy",
            "symbols": [
              "RNA-Seq",
              "Total RNA-Seq"
            ]
          }
        ]
      },
      {
        "default": null,
        "name": "file_name",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "name": "file_size",
        "type": [
          "null",
          "long"
        ]
      },
      {
        "default": null,
        "name": "md5sum",
        "type": [
          "null",
          "string"
        ]
      },
      {
        "default": null,
        "doc": "The GUID of the object in the index service.",
        "name": "object_id",
        "type": [
          "null",
          "string"
        ]
      }
...

NOTE: To make the output more human-readable, the output above was piped through the program jq. Example: pfb show -i PFB_file.avro schema | jq
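
In the excerpt above, each field's "type" is a list that pairs "null" with either a primitive type (such as "string" or "long") or an enum with a fixed set of symbols. The following is a minimal Python sketch, not part of PyPFB, showing one way to summarize such fields using only the standard library. It assumes the schema has already been obtained as JSON; the two fields are copied from the excerpt, and the helper name describe_field is hypothetical.

```python
import json

# Two fields copied from the schema excerpt above.
SCHEMA_EXCERPT = """
{
  "type": "record",
  "name": "gene_expression",
  "fields": [
    {
      "default": null,
      "name": "data_format",
      "type": [
        "null",
        {
          "type": "enum",
          "name": "gene_expression_data_format",
          "symbols": ["TXT", "TSV", "CSV", "GCT"]
        }
      ]
    },
    {
      "default": null,
      "name": "file_size",
      "type": ["null", "long"]
    }
  ]
}
"""


def describe_field(field):
    """Return (name, type label, enum symbols or None) for one schema field.

    Hypothetical helper for illustration; it is not a PyPFB function.
    """
    field_type = field["type"]
    branches = field_type if isinstance(field_type, list) else [field_type]
    # Drop the "null" branch, which marks the property as optional.
    non_null = [b for b in branches if b != "null"]
    kind = non_null[0] if non_null else "null"
    if isinstance(kind, dict) and kind.get("type") == "enum":
        return field["name"], "enum", kind["symbols"]
    return field["name"], kind, None


record = json.loads(SCHEMA_EXCERPT)
summary = [describe_field(f) for f in record["fields"]]
for name, kind, symbols in summary:
    print(name, kind, symbols)
```

For these two fields, the sketch reports data_format as an enum with its four allowed symbols and file_size as a long with no symbol list.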

Metadata

The metadata in a PFB describes how the nodes are linked to one another and lists the external references for each of the properties.

To view the PFB metadata, use the following command:

pfb show -i PFB_file.avro metadata

Example Output

Data

The data in the PFB are the values for the properties in the format of the Data Dictionary.

To view the data within the PFB, use the following command:

pfb show -i PFB_file.avro

To view a certain number of entries in the PFB file, use the flag -n to designate a number. For example, to view the first 10 data entries within the PFB, use the following command:

pfb show -i PFB_file.avro -n 10
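
When these commands are called from a script, it can help to assemble the arguments in one place. The following is a minimal Python sketch under stated assumptions: build_show_command is a hypothetical helper (not part of PyPFB), it only builds the argument list, and it assumes that the -n flag is placed after the input file, following the pattern of the schema command shown earlier on this page.

```python
import shlex


def build_show_command(pfb_path, limit=None, section=None):
    """Assemble a `pfb show` argument list (hypothetical helper).

    pfb_path: path to the PFB (.avro) file.
    limit:    optional number of entries, passed with the -n flag.
    section:  optional sub-command such as "schema"; None shows data entries.
    """
    cmd = ["pfb", "show", "-i", pfb_path]
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be a positive integer")
        cmd += ["-n", str(limit)]
    if section is not None:
        cmd.append(section)
    return cmd


first_ten = build_show_command("PFB_file.avro", limit=10)
schema_cmd = build_show_command("PFB_file.avro", section="schema")
print(shlex.join(first_ten))
print(shlex.join(schema_cmd))
```

To actually run a command, the returned list could be passed to subprocess.run, assuming the pfb executable from PyPFB is installed and available on the PATH.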

Example Output
