NIH RECOVER Release Notes

Adult Observational Cohort Study: Dataset Release Notes

November 2024

RECOVER Adult and Pregnancy Observational Cohort Studies

Release file: phs003463.v3.p2

This release contains data collected from the (RECOVER-Adult) between October 29, 2021 and June 15, 2024. These data were obtained from 15,183 adult participants (including a sub-cohort of 2,332 pregnant women) attending 135,791 study interactions across 83 geographically dispersed enrolling sites. The dataset also includes an inventory of biospecimen samples (935,596 aliquots) collected during baseline and follow-up visits, plus wearable sensor metadata from 2,932 participants in the Digital Health Program. Overall, this release comprises approximately 5,386 data elements and 37 million datapoints. Please refer to the for a Data Dictionary organized as a list of all surveys/forms and their respective data fields.

RECOVER-Adult is a combined retrospective and prospective, longitudinal, observational meta-cohort study of individuals aged ≥ 18 years who enter the cohort with and without SARS-CoV-2 infection. Individuals with a prior SARS-CoV-2 infection enter the study at varying times after their infection. Individuals with and without SARS-CoV-2 infection, and with or without symptoms of post-acute sequelae of SARS-CoV-2 infection (PASC), are followed to identify risk factors and occurrence of PASC. This study is being conducted in the United States, with subjects recruited through inpatient, outpatient, and community-based settings. Study data, including age, demographics, social determinants of health, medical history, vaccination history, details of acute SARS-CoV-2 infection, overall health and physical function, and PASC symptoms, are reported at quarterly intervals. Biologic specimens are also collected at specified intervals, with some tests performed in local clinical laboratories and others performed by centralized research centers or banked in the Biospecimen Repository. Advanced clinical examinations and radiologic examinations are performed at local study sites with cross-site standardization. Please refer to this for details on the RECOVER-Adult study rationale, design and objectives, and to this for more on the design of the RECOVER-Pregnancy study.

Importantly, this release includes data underlying the first publication of primary results from the RECOVER-Adult study: Thaweethai et al., Development of a definition of post-acute sequelae of SARS-CoV-2 Infection. Jun 13;329(22):1934-1946.

Please note that this release does not contain any genomic or metabolomic data.

RELEASE NOTES FOR THE RECOVER Adult and Pregnancy Observational Cohort Studies

Data Quality

The RECOVER-Adult and RECOVER-Pregnancy Observational Cohort Studies consist of approximately 37 million datapoints (31 million Adult Study datapoints and 6 million Pregnancy Study datapoints) that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.
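
As an illustration only, a validation query of the kind described above could be sketched as follows. The field names, plausible ranges, and rule structure are hypothetical and are not taken from the RECOVER pipeline.

```python
from datetime import date

# Hypothetical rules: field name -> (minimum, maximum) plausible values.
RANGES = {"age_at_enrollment": (18, 120), "heart_rate_bpm": (20, 250)}


def validate_record(record):
    """Return a list of (field, issue) pairs for one record (a dict)."""
    issues = []
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append((field, "missing"))
        elif not low <= value <= high:
            issues.append((field, "out of range"))
    # Consistency check (assumed rule): a visit cannot precede enrollment.
    enroll, visit = record.get("enroll_date"), record.get("visit_date")
    if enroll is not None and visit is not None and visit < enroll:
        issues.append(("visit_date", "inconsistent"))
    return issues
```

A production check would also need to recognize legitimately missing values, for example fields skipped by branching logic, before raising an alert.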

Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

Masking of PARTICIPANT_IDs was performed according to the following protocol:

  1. Maintain PARTICIPANT_IDs assigned in the previous data release (202403.1).

  2. Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.

  3. Randomly reorder PARTICIPANT_IDs.
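
A minimal sketch of how steps 1 to 3 might fit together, assuming the earlier release's lookup is available as a dictionary and that new participants receive the next unused integers in shuffled order. The function name `extend_id_lookup` and the example IDs are hypothetical; the notes do not specify how new numbers were chosen beyond random reordering.

```python
import random


def extend_id_lookup(previous_lookup, current_ids, seed=None):
    """Keep earlier masked IDs and give new participants unused integers."""
    rng = random.Random(seed)
    lookup = dict(previous_lookup)                     # step 1: keep earlier assignments
    new_ids = sorted(set(current_ids) - set(lookup))   # step 2: de-duplicate
    rng.shuffle(new_ids)                               # step 3: random order
    # Assumption: new masked IDs continue after the largest one already used.
    next_masked = max(lookup.values(), default=0) + 1
    for offset, source_id in enumerate(new_ids):
        lookup[source_id] = next_masked + offset
    return lookup
```

Because earlier assignments are copied unchanged, a participant keeps the same masked ID across releases, which lets analysts link records between versions.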

ZIP Codes

ZIP Codes were truncated to 3 digits according to the following protocol:

  1. Keep a substring of ZIP codes from the first to the third characters.
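
The truncation step can be sketched as a simple string slice. This assumes ZIP codes are stored as text; a numeric column would already have lost leading zeros and would need to be re-padded first. The function name is hypothetical.

```python
def truncate_zip(zip_code):
    """Return the first three characters of a ZIP code, or None if missing."""
    if zip_code is None:
        return None
    text = str(zip_code).strip()
    # Characters one through three; ZIP+4 suffixes are dropped as well.
    return text[:3] if text else None
```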

Dates of Birth and Ages

Date of birth and deceased dates were truncated according to the following protocol:

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

  2. Top capping age at enrollment at 89

  3. Bottom capping date of birth at 1933 or 1934 (based on enrollment date)

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.
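
One possible reading of this protocol is sketched below. It assumes that every participant's anchor date is mapped to 2023/01/01 and that all of that participant's other dates move by the same offset, so intervals between events are preserved. The function names are hypothetical, and the offset interpretation is an assumption rather than a documented detail.

```python
import random
from datetime import date, timedelta

TARGET = date(2023, 1, 1)  # assumed common target for every anchor date


def pick_anchor(enroll_date, rng):
    """Use ENROLL_DATE when present, else a random day in 2023."""
    if enroll_date is not None:
        return enroll_date
    # 2023 has 365 days, so this covers 2023/01/01 through 2023/12/31.
    return TARGET + timedelta(days=rng.randrange(365))


def shift_dates(enroll_date, dates, rng=None):
    """Shift one participant's dates so that the anchor lands on TARGET."""
    rng = rng or random.Random()
    offset = TARGET - pick_anchor(enroll_date, rng)
    # Missing dates stay missing (null) after shifting.
    return [None if d is None else d + offset for d in dates]
```

Under this reading, a visit seven days after enrollment appears as 2023/01/08 regardless of the true enrollment date.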

Summary

The following table lists the variables that were de-identified, along with the de-identification protocol that was applied.

Data Table | Variable | De-Identification Protocol

The following table provides a brief description of the 6 BDC files in this release.

Filename | Rows | Columns | File Size (MB) | Participants

The following table provides a brief description of the biospecimens included in this release.

Collected Biospecimens
Stored Biospecimens

The following table lists all 86 REDCap forms in the dataset, and the corresponding number of variables/rows of data associated with each form.

Form name | Total Variables per Form

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study and ; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v3.p2, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

October 2024

RECOVER Pediatric Observational Cohort Study

Release file: phs003461.v1.p1

This release contains data collected between March 17, 2022 and June 15, 2024 from the (“RECOVER-Pediatrics”) of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative. These data were obtained from 24,621 participants attending 62,921 study visits across 105 geographically dispersed enrolling sites. The dataset includes descriptions of 85,858 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from approximately 400 participants in the RECOVER Digital Health Program. Taken together, this release consists of 8,604 data elements (variables) and approximately 13.6 million datapoints.

RECOVER-Pediatrics is an observational meta-cohort study of caregiver-child pairs (birth through 17 years) and young adults (18 through 25 years). As described in the , the pediatric meta-cohort consists of four distinct cohorts: (1) a de novo RECOVER prospective cohort including children and young adults ages birth through 25 years, with or without a known history of SARS-CoV-2 infection, and their respective caregivers; (2) an extant cohort from the , the largest long-term US study of brain development in adolescence; (3) an in utero exposure cohort, including children less than 3 years old born to individuals with and without a SARS-CoV-2 infection during pregnancy; and (4) an extant cohort from the NHLBI Study on .

Importantly, this release includes data underlying the first publication of primary results from the RECOVER-Pediatrics study: Gross et al., Characterizing Long COVID in Children and Adolescents. , August 21, 2024.

Detailed descriptions of the data elements used in RECOVER-Pediatrics can be found in three separate data dictionaries:

  • REDCap Codebook for the Age 13-17 and 18-25 sub-studies (including ABCD and MUSIC extant cohorts); due to the number of pages, this document has been divided into two files:

These codebooks are organized as a list of all surveys/forms with their respective data elements. Also available is a in RECOVER-Pediatrics.

Please note that this release does not contain any genomic or metabolomic data.

RELEASE NOTES FOR THE “PEDIATRIC-MAIN” STUDY (Includes Age 13-17 and 18-25 sub-studies, plus data from the ABCD and MUSIC extant cohorts)

Data Quality

The Pediatric-Main study consists of approximately 9.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.

Due to the volume of data and the complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric-Main observational cohort data:

  • Masking of IDs using a randomly assigned number between 1 and 14097 (number of unique PARTICIPANT_IDs within the Pediatric cohort data)

  • Truncation of ZIP codes to 3 digits

  • Truncation of participant date of birth and deceased date to the year

Participant IDs

Masking of PARTICIPANT_IDs was performed according to the following protocol:

  1. Participant IDs assigned in the previous data release (202309.1) are maintained.

  2. Extract PARTICIPANT_IDs from each table within the Pediatric cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.

  3. Randomly reorder PARTICIPANT_IDs.

ZIP Codes

ZIP Codes were truncated to 3 digits according to the following protocol:

  1. Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Date of birth and deceased dates were truncated according to the following protocol:

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

  2. Top capping age at enrollment at 89

  3. Bottom capping date of birth at 1933 or 1934 (based on enrollment date)

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

Table | Variable | De-Identification Protocol

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study , , and ; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

RELEASE NOTES FOR THE “PEDIATRIC-CAREGIVER” STUDY

Data Quality

The Pediatric-Caregiver study consists of approximately 2.8 million datapoints that were generated from surveys and biospecimen collections (saliva and Tasso blood spot). Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, de-identification and preparation for release to the public.

These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.

Due to the volume of data and the complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric Caregiver cohort data:

  • Masking of IDs using a randomly assigned number between 1 and 8820 (number of unique PARTICIPANT_IDs within the Pediatric Caregiver cohort data)

  • Truncation of ZIP codes to 3 digits

  • Truncation of participant date of birth and deceased date to the year

Participant IDs

Masking of PARTICIPANT_IDs was performed according to the following protocol:

  1. Participant IDs assigned in the previous data release (202309.1) are maintained.

  2. Extract PARTICIPANT_IDs from each table within the Pediatric Caregiver cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.

  3. Randomly reorder PARTICIPANT_IDs.

ZIP Codes

ZIP codes were truncated to 3 digits according to the following protocol:

  1. Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Participant date of birth and deceased date were truncated according to the following protocol:

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

  2. Top capping age at enrollment at 89

  3. Bottom capping date of birth at 1933 or 1934 (based on enrollment date)

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

Table | Variable | De-Identification Protocol

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study , , and ; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

RELEASE NOTES FOR THE “PEDIATRIC-CONGENITAL” STUDY

Data Quality

The Pediatric-Congenital study consists of approximately 0.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.

Due to the volume of data and the complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric Congenital cohort data:

  • Masking of IDs using a randomly assigned number between 1 and 1702 (number of unique PARTICIPANT_IDs within the Pediatric Congenital cohort data)

  • Truncation of ZIP codes to 3 digits

  • Truncation of participant date of birth and deceased date to the year

Participant IDs

Masking of PARTICIPANT_IDs was performed according to the following protocol:

  1. Participant IDs assigned in the previous data release (202309.1) are maintained.

  2. Extract PARTICIPANT_IDs from each table within the Pediatric Congenital cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.

  3. Randomly reorder PARTICIPANT_IDs.

ZIP Codes

Truncation of ZIP codes to 3 digits was performed according to the following protocol:

  1. Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Participant date of birth and deceased date were truncated according to the following protocol:

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

  2. Top capping age at enrollment at 89

  3. Bottom capping date of birth at 1933 or 1934 (based on enrollment date)

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

Table | Variable | De-Identification Protocol

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study , , and ; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

June 2024

Adult Observational Cohort Study: Release Notes

Release file: phs003463.v2.p2

This release contains data collected from the between October 29, 2021 and March 15, 2024. These data were obtained from 15,204 adult participants (including a sub-cohort of 2,192 pregnant women) attending 122,029 study visits across 79 geographically dispersed enrolling sites. The dataset also includes a description of 822,310 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from 2,602 participants in the Digital Health Program. Overall this release comprises approximately 35 million rows of data and a total of 5,278 data elements. Please refer to the accompanying for a Data Dictionary organized as a list of all surveys/forms and their respective data fields. Also, please note that this release does not contain any genomic or metabolomic data.

Data De-identification Protocols

The following steps were undertaken to de-identify the dataset:

  • Masking of IDs using a randomly assigned number between 1 and 15,204 (the number of unique PARTICIPANT_IDs within the Adult cohort data). NOTE: Participant IDs assigned in the previous data release (202309.1) were maintained in the current release.

  • Truncation of ZIP codes to 3 digits.

  • Truncation of participant date of birth and deceased date to the year.

These steps are described in more detail below.

Participant_ID masking

  1. Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.

  2. Randomly reorder PARTICIPANT_IDs.

  3. Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (15204) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
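
The three steps above can be sketched as follows. This is an illustrative outline only: the function name `build_id_lookup` and the representation of each table as a list of IDs are assumptions, and the actual tooling used for the release is not described in these notes.

```python
import random


def build_id_lookup(tables, seed=None):
    """Map each unique PARTICIPANT_ID to a number from 1 to N."""
    # Step 1: collect IDs from every table and de-duplicate them.
    unique_ids = sorted({pid for ids in tables.values() for pid in ids})
    # Step 2: randomly reorder the unique IDs.
    random.Random(seed).shuffle(unique_ids)
    # Step 3: join the shuffled IDs to the sequence 1..N (the lookup table).
    return {pid: n for n, pid in enumerate(unique_ids, start=1)}
```

Sorting before shuffling makes the result reproducible for a given seed; without a seed, the assignment differs on every run.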

Truncation of ZIP Codes

  1. Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).

Truncation of participant date of birth and deceased date

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

  2. Top capping age at enrollment at 89.

  3. Bottom capping date of birth at 1933 or 1934 (based on enrollment date).


NOTE: To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years, and participant date of birth was bottom-capped at either 1933 or 1934, depending on the year of enrollment (2022 or 2023). An additional column flags participants whose age was over 89, to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
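
A minimal sketch of the year truncation, capping, and flagging described above. It assumes dates of birth are stored as `YYYY-MM-DD` text and computes the earliest allowed birth year as the enrollment year minus 89, which reproduces the stated values (1933 for 2022, 1934 for 2023) but is an inference, not a documented rule. The function name is hypothetical.

```python
MAX_AGE = 89


def deidentify_birth(date_of_birth, age_at_enrollment, enroll_year):
    """Return (birth_year, capped_age, over_89_flag) for one participant."""
    # Keep characters one through four: the year portion of the date.
    birth_year = int(date_of_birth[:4])
    over_89 = age_at_enrollment > MAX_AGE
    capped_age = min(age_at_enrollment, MAX_AGE)
    # Assumed floor: 1933 for 2022 enrollment, 1934 for 2023 enrollment.
    floor_year = enroll_year - MAX_AGE
    return max(birth_year, floor_year), capped_age, over_89
```

The flag is what lets an analyst tell a participant who is exactly 89 from one whose age was capped.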

Date shifting (All other dates)

Date shifting of all other dates was performed as follows:

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.

The following table lists specific variables requiring de-identification, and the de-identification protocol that was applied.

Data Table | Variable | De-identification Protocol

The following table provides a brief description of the 6 BDC files in this release.

Filename | Rows | Columns | File Size (MB) | Participants

The following table provides a brief description of the biospecimens included in this release.

Collected Biospecimens
Stored Biospecimens

The following table lists all 85 REDCap forms in the dataset, and the corresponding number of variables/rows of data associated with each form.

FORM_NAME | Total Variables per Form

Data Quality

A detailed list of data quality issues found in this dataset is summarized below.

File | Error Type | N Rows

Information for Authors

The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study and ; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v2.p2, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

April 2024

Adult Observational Cohort Study: Dataset Release Notes

Release file: phs003463.v1.p1

This release contains subsets of data collected from the between October 29, 2021 and September 15, 2023. These data were obtained from 14,662 participants attending 92,355 study visits across 79 geographically dispersed enrolling sites. The dataset also includes an inventory of 611,882 biospecimens collected at various timepoints, wearable sensor data from the digital health program for 195 participants, and a total of 3,175 data elements. Please refer to the for this release for a list of all surveys/forms and their respective data fields.

Data De-identification Protocols

The following steps were undertaken to de-identify the dataset:

  • Masking of IDs using a randomly assigned number between 1 and 14702 (number of unique PARTICIPANT_IDs within the Adult cohort data)

  • Truncation of ZIP codes to 3 digits

  • Truncation of participant date of birth and deceased date to the year

These steps are discussed in more detail below.

Participant_ID masking

  1. Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform deduplication to ensure uniqueness.

  2. Randomly reorder PARTICIPANT_IDs.

  3. Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.

Truncation of ZIP Codes

  1. Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).

Truncation of participant date of birth and deceased date

  1. Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

Date shifting (All other dates)

  1. Select ENROLL_DATE as the anchor date for each participant.

  2. For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).

  3. Shift the anchor date to the first day of the year, i.e., 2023/01/01.

Winsorization of participant ages

To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years, and participant date of birth was bottom-capped at either 1933 or 1934, depending on the year of enrollment (2022 or 2023). An additional column flags participants whose age was over 89, to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.

The following table details the variables that were de-identified, and the de-identification protocol that was applied.

Data Table | Variable | De-Identification Protocol

Included Data

The forms in this release are inclusive of baseline (first) enrollment visits and all subsequent follow-up visits through September 15, 2023. Collectively they represent 8.9 million rows of data (54% of the REDCap data) and were selected as they have a limited number of outstanding data queries (outside of missingness). When combined with the wearable sensor data and biospecimen inventory data, the release includes 10.1 million rows of data.

FORM_NAME | n

Important Information for Authors

The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:

Data Completeness

This is a partial dataset that includes information on adult cohort participants for whom data were collected on or before September 15, 2023. Additionally, some variables with a high degree of missingness or requiring further quality control have been removed. Future releases will restore these redactions.

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v1.p1, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14097) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
  1. Note: From the 202309.1 release, participant IDs were assigned to 14,662 individuals in a random and not contiguous fashion from integers between 1 – 14702, including an excess of 40 integers left unassigned at random. From the 202403.1 release, additional IDs were assigned at random from integers between 14703 – 15211.

  2. For this release, no new IDs were assigned.

  • Add protocol codes as the prefix for randomly assigned Participant IDs.

    1. Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)

    2. Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.

  • Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.

  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • Calculate the difference in days from the original to the shifted anchor date.
  • Recalculate all other dates by adding that date difference to those dates.

  • Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol. The concepts that correspond to date values within the Answerdata table are provided in the file “RECOVERAdult_BDC_202406_concepts_deID.csv”.
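The ID-masking bullets above (prefixing, merging, and dropping the original column) can be sketched as below. The output column name `MASKED_ID`, the sample identifier, and the plain concatenation of prefix and integer are assumptions made for illustration; the notes do not specify the exact string format.

```python
def attach_masked_ids(rows, lookup, prefix="RA1"):
    """Replace the original PARTICIPANT_ID in each row with a prefixed masked ID."""
    masked_rows = []
    for row in rows:
        masked = f"{prefix}{lookup[row['PARTICIPANT_ID']]}"  # e.g. RA1 + 205
        # Copy every column except the original identifier, then add the masked one.
        new_row = {key: value for key, value in row.items() if key != "PARTICIPANT_ID"}
        new_row["MASKED_ID"] = masked
        masked_rows.append(new_row)
    return masked_rows

lookup = {"site07-000123": 205}
visits = [{"PARTICIPANT_ID": "site07-000123", "VISIT_START_DATE": "2023-01-31"}]
print(attach_masked_ids(visits, lookup))
# [{'VISIT_START_DATE': '2023-01-31', 'MASKED_ID': 'RA1205'}]
```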

Data Table | Variable | De-Identification Protocol
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Fitbit | SUMMARY_DATE | Date shifting

RECOVERAdult_BDC_202406_concepts_BDC.tsv | 191,528 | 6 | 91
RECOVERAdult_BDC_202406_demographics_BDC.tsv | 15,183 | 20 | 3 | 15,183
RECOVERAdult_BDC_202406_fitbit_BDC.tsv | 6,415,774 | 6 | 704 | 2,932
RECOVERAdult_BDC_202406_visits_BDC.tsv | 135,791 | 7 | 11 | 15,182
Total | 36,672,940 | 66 | 8,542

PAXgene RNA – Whole Blood | Whole Blood
Cell Preservation Tube (CPT) – Whole Blood | Peripheral Blood Mononuclear Cells (PBMCs)
EDTA – Whole Blood | Plasma; White Blood Cells (WBCs)

biospecimens | 13792
brain_mri_quality_confirmation | 723
brain_mri_with_gadolinium | 6057
cardiac_mri | 601
cardiac_mri_reading_center | 29
cardiopulmonary_exercise_testing | 1588
cardiovagal_innervation_testing | 310
change_in_symptoms_since_infection | 5044
chest_ct | 6759
chest_ct_reading_center | 29
clinical_labs | 1808
colonoscopy | 31
comorbidities | 14427
comprehensive_audiometry | 4431
covid_treatment | 11554
cpet_reading_center | 2
demographics | 14668
disability | 14034
drc_data | 7868
echocardiogram_with_strain | 5692
electrocardiogram | 7389
electromyography | 75
end_of_participation | 1737
endopat_testing | 1336
enrollment | 15177
facility_sleep_questionnaire_morning_after | 131
facility_sleep_questionnaire_night_before | 140
facility_sleep_study | 1696
fibroscan | 3742
formal_neuropsychological_testing | 1421
full_ent_examination | 373
gastric_emptying_study | 51
hepatitis_tests | 4475
home_polysomnography_with_ess_and_isi | 4621
home_sleep_assessment | 1456
long_covid_treatment_trial | 11554
lumbar_puncture | 27
medication_changes | 13464
medications | 8573
mhp_data | 14828
mini | 5518
mini_prequestionnaire | 1720
neonatal_delivery_and_outcome_form | 333
nerve_conduction_study | 104
neuropathy_examination | 4387
new_covid_infection | 9994
nih_toolbox | 5366
oral_glucose_test | 5155
pasc_symptoms | 14535
pcl5 | 1418
pft_reading_center | 249
pg13r | 148
plasma_catecholamine_testing | 307
pregnancy | 10256
pregnancy_followup | 9568
psg_data | 109
psg_quality_summary_form | 134
pulmonary_function_tests | 6601
recent_covid_treatment | 4111
rehabilitation_testing | 6782
renal_ultrasound | 2755
research_labs | 14147
serum_b12_and_methylmalonic_acid | 4006
six_minute_walk_test | 7184
skin_biopsy | 57
sleep_reading_center | 1265
social_determinants_of_health | 14113
social_determinants_of_health_followup | 13470
study_termination | 57
tier_12_consent_tracking | 13417
tier_1_consent_tracking | 2537
tier_1_office_visit | 14464
tier_2_consent_tracking | 1068
tilt_table_test | 307
upsit_smell_test | 5264
vaccine_status | 14490
vision_testing | 6775
visit_form | 14968
wearable_data | 978
withdrawal | 603

  • REDCap Codebook for the Caregiver sub-study
  • REDCap Codebook for the Congenital sub-study

  • Date shifting of all other dates within the data within a range of 1 year
  • Winsorization of date of birth and age at enrollment

  • Masking of Pediatric Caregiver IDs on the Answerdata table according to corresponding masked IDs from the Pediatric Caregiver cohort

  • Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14097) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
    • Note: From the previous release, participant IDs were assigned to 10549 individuals in a random fashion from integers between 1 – 10549. 16 individuals were dropped from that release, so there are 16 random integers that are unassigned in the lookup table for this release.

    • For this release, new IDs were assigned to participants covering 10550 - 14097.

  • Add protocol codes as the prefix for randomly assigned Participant IDs.

    • Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)

    • Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.

  • Merge the ID lookup table with each of the tables in the Pediatric cohort by PARTICIPANT_ID to attach the masked IDs.

  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • VISIT_TYPE on the Visits table contains dates. Those dates have been removed in the de-identified data, with the rest of VISIT_TYPE retained.

  • Calculate the difference in days from the original to the shifted anchor date.
  • Recalculate all other dates by adding that date difference to those dates.

  • Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol. Provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatric_BDC_202406_concepts_deID.csv”
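The note above about carrying ID assignments from one release to the next can be sketched as follows. This is an illustration under stated assumptions rather than the documented procedure: the function name `extend_lookup`, the `previous_max` argument, and the sample IDs are hypothetical.

```python
import random

def extend_lookup(previous, previous_max, current_ids, seed=None):
    """Keep earlier assignments and give new participants integers above previous_max."""
    current = set(current_ids)
    # Participants dropped since the last release leave their integers unassigned.
    lookup = {pid: number for pid, number in previous.items() if pid in current}
    new_ids = sorted(current - set(previous))
    random.Random(seed).shuffle(new_ids)          # random order within the new range
    for number, pid in enumerate(new_ids, start=previous_max + 1):
        lookup[pid] = number
    return lookup

previous = {"a": 1, "b": 2, "c": 3}
updated = extend_lookup(previous, previous_max=3, current_ids=["a", "c", "d", "e"], seed=0)
print(sorted(updated.values()))  # [1, 3, 4, 5]
```

Keeping earlier assignments fixed means a participant's masked ID stays the same across releases, while the gap at 2 mirrors the unassigned integers the note describes.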

Data Table | Variable | De-Identification Protocol
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Pediatric Caregiver IDs
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | VISIT_TYPE | Removal of dates
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Fitbit | SUMMARY_DATE | Date shifting

    Date shifting of all other dates within the data within a range of 1 year
  • Winsorization of date of birth and age at enrollment

  • Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (8820) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
    • Note: From the previous release, participant IDs were assigned to 6714 individuals in a random fashion from integers between 1 – 6714. 6 individuals were dropped from that release, so there are 6 random integers that are unassigned in the lookup table for this release.

    • For this release, new IDs were assigned to participants covering 6715 - 8820.

  • Add protocol codes as the prefix for randomly assigned Participant IDs.

    • Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)

    • Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.

  • Merge the ID lookup table with each of the tables in the Pediatric Caregiver cohort by PARTICIPANT_ID to attach the masked IDs.

  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • Calculate the difference in days from the original to the shifted anchor date.
  • Recalculate all other dates by adding that date difference to those dates.

  • All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCaregiver_BDC_202406_concepts_deID.csv”

Data Table | Variable | De-Identification Protocol
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Visits | VISIT_START_DATE | Date shifting

    Date shifting of all other dates within the data within a range of 1 year
  • Winsorization of date of birth and age at enrollment

  • Masking of Adult IDs on the Answerdata table according to corresponding masked IDs from the Adult cohort

  • Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (1702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
    • Note: From the previous release, participant IDs were assigned to 995 individuals in a random fashion from integers between 1 – 995.

    • For this release, new IDs were assigned to participants covering 996 - 1702.

  • Add protocol codes as the prefix for randomly assigned Participant IDs.

    • Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)

    • Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.

  • Merge the ID lookup table with each of the tables in the Pediatric cohort by PARTICIPANT_ID to attach the masked IDs.

  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • We calculate the difference in days from the original to the shifted anchor date.
  • We recalculate all other dates by adding that date difference to those dates.

  • All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCongential_BDC_202406_concepts_deID.csv”

Data Table | Variable | De-Identification Protocol
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Adult IDs
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Visits | VISIT_START_DATE | Date shifting

    Date shifting of all other dates within the data within a range of 1 year.
  • Winsorization of Adult Cohort age at enrollment and date of birth information.

  • Removal of free text fields that did not have data entry validation in REDCap.

  • Note: From the previous release, participant IDs were assigned to 14,662 individuals in a random and not contiguous fashion from integers between 1 – 14702, including an excess of 40 integers left unassigned at random. Of those 14,662 individuals, 7 have withdrawn from the study since the previous release and left their previously assigned integers vacant, for a final tally of 14,655 individuals with randomly assigned integers between 1 – 14702 continuing into the current release.

  • For the current release, 549 new participants were added to the study with randomly assigned integers between 14703 – 15251.

  • Add protocol codes as the prefix for randomly assigned Participant IDs.

    • Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)

    • Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.

  • Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.

  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • Calculate the difference in days from the original to the shifted anchor date.
  • Recalculate all other dates by adding that date difference to those dates.

  • Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol.

Data Table | Variable | De-Identification Protocol
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Fitbit | SUMMARY_DATE | Date shifting

RECOVERAdult_BDC_202403_concepts_deid.tsv | 173,835 | 8 | 106 | ---
RECOVERAdult_BDC_202403_demographics_deid.tsv | 15,204 | 20 | 3 | 15,204
RECOVERAdult_BDC_202403_fitbit_deid.tsv | 9,889,039 | 6 | 1,030 | 2,602
RECOVERAdult_BDC_202403_visits_deid.tsv | 122,029 | 7 | 8 | 15,185
Total | 35,066,262 | 67 | 7,112

PAXgene RNA – Whole Blood | Whole Blood
Cell Preservation Tube (CPT) – Whole Blood | Peripheral Blood Mononuclear Cells (PBMCs)
EDTA – Whole Blood | Plasma; White Blood Cells (WBCs)

audiometry_survey | 1,188
Biospecimens | 485,333
bmri_import | 39,678
brain_mri_quality_confirmation | 2,703
brain_mri_with_gadolinium | 23,510
cardiac_mri | 789
cardiac_mri_reading_center | 51
cardiopulmonary_exercise_testing | 3,283
cardiovagal_innervation_testing | 638
change_in_symptoms_since_infection | 21,439
chest_ct | 47,387
chest_ct_reading_center | 185
clinical_labs | 92,771
Colonoscopy | 90
Comorbidities | 1,923,579
comprehensive_audiometry | 51,936
covid_treatment | 191,414
cpet_reading_center | 4
Demographics | 115,510
Disability | 97,810
drc_data | 9,909
echocardiogram_with_strain | 73,980
electrocardiogram | 91,215
electromyography | 112
end_of_participation | 5,862
endopat_testing | 8,164
Enrollment | 160,289
facility_sleep_questionnaire_morning_after | 416
facility_sleep_questionnaire_night_before | 1,640
facility_sleep_study | 2,152
Fibroscan | 17,927
formal_neuropsychological_testing | 12,294
full_ent_examination | 1,253
gastric_emptying_study | 115
hepatitis_tests | 37,936
home_polysomnography_with_ess_and_isi | 22,170
home_sleep_assessment | 32,266
hsat_data | 85,222
long_covid_treatment_trial | 90,359
medication_changes | 122,181
Medications | 64,630
mhp_data | 60,104
Mini | 83,729
mini_prequestionnaire | 4,447
neonatal_delivery_and_outcome_form | 3,754
nerve_conduction_study | 157
neuropathy_examination | 130,245
new_covid_infection | 41,621
nih_toolbox | 59,473
oral_glucose_test | 36,264
pasc_symptoms | 7,714,508
pcl5 | 21,548
pft_reading_center | 6,768
pg13r | 1,084
plasma_catecholamine_testing | 630
Pregnancy | 109,716
pregnancy_followup | 86,875
psg_data | 6,474
psg_quality_summary_form | 2,164
pulmonary_function_tests | 85,186
recent_covid_treatment | 60,207
rehabilitation_testing | 82,645
renal_ultrasound | 19,464
research_labs | 4,012,350
serum_b12_and_methylmalonic_acid | 31,639
six_minute_walk_test | 101,931
skin_biopsy | 131
sleep_reading_center | 25,471
social_determinants_of_health | 801,331
social_determinants_of_health_followup | 678,872
study_termination | 96
tier_12_consent_tracking | 113,913
tier_1_office_visit | 1,719,894
tilt_table_test | 636
upsit_smell_test | 184,618
vaccine_status | 276,537
vision_testing | 64,756
visit_form | 978,660
wearable_data | 1,910
Withdrawal | 953
Total Variables | 24,043,845

RECOVERAdult_BDC_202403_fitbit_deID | 5. Data entry error in summary_date, reported as a future event | 6
RECOVERAdult_BDC_202403_answerdata_deID | 6. Data entry error in data_entry_date, reported as a future event | 870
RECOVERAdult_BDC_202403_answerdata_deID | 7. Tab characters present in CONCEPT_NAME variable, causing issues when reading file as tab-separated values | 53
RECOVERAdult_BDC_202403_concepts | 8. Concept codes repeated across multiple entries | 4
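Data users affected by the embedded-tab issue listed above may want to locate the offending rows before parsing. The sketch below is one possible check, not an official tool: it flags lines whose field count differs from the header, and assumes no field contains a line break. The column name `CONCEPT_CD` and the sample text are hypothetical.

```python
import csv
import io

def rows_with_field_mismatch(tsv_text):
    """Return 1-based line numbers whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Data lines start at line 2 because line 1 is the header.
    return [line_no for line_no, row in enumerate(reader, start=2)
            if len(row) != len(header)]

sample = "CONCEPT_CD\tCONCEPT_NAME\nA1\tfever\nA2\tloss of\tsmell\n"
print(rows_with_field_mismatch(sample))  # [3]
```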

    Date shifting of all other dates within the data within a range of 1 year
  • Winsorization of Adult Cohort age at enrollment and date of birth information

  • Removal of free text fields that did not have data entry validation in REDCap

  • Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
  • Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.

  • VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

  • Calculate the difference in days from the original to the shifted anchor date.
  • Recalculate all other dates by adding that date difference to those dates.

  • All date values within ANSWER_TEXT_VAL on the Answerdata table were shifted according to this protocol. The concepts that correspond to date values within the Answerdata table are provided in the file “RECOVERAdult_i2b2_concepts_BDC_dateshift.xlsx”.

Data Table | Variable | De-Identification Protocol
Demographics | DOB | Truncation to year; bottom capping at 1933 or 1934
Demographics | AGE_AT_ENROLLMENT | Top capping at 89
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Fitbit | SUMMARY_DATE | Date shifting

end_of_participation | 3,771
enrollment | 154,305
long_covid_treatment_trial | 38,801
new_covid_infection | 26,366
pasc_symptoms | 5,081,366
pregnancy | 103,109
pregnancy_followup | 54,437
recent_covid_treatment | 41,019
social_determinants_of_health | 769,434
social_determinants_of_health_followup | 445,687
study_termination | 96
tier_12_consent_tracking | 97,222
visit_form | 749
withdrawal | 961

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting

RECOVERAdult_BDC_202406_answerdata_BDC.tsv | 28,979,068 | 14 | 7,629 | 15,177
RECOVERAdult_BDC_202406_biospecimens_BDC.tsv | 935,596 | 13 | 104 | 13,503

Stool | Stool
Urine | Urine
Nasal/NP swab | Nasal or NP Cells
Oragene 600 – Saliva | Saliva
Serum Separator Tube (SST) – Whole Blood | Serum
Sodium Citrate – Whole Blood | Plasma

acth_and_cortisol_test | 7264
adult_delivery_and_outcome_form | 341
alcohol_and_tobacco | 14025
alcohol_and_tobacco_followup | 13460
assessment_scores | 14967
audiometry_survey | 421

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting

RECOVERAdult_BDC_202403_answerdata_deid.tsv | 24,043,845 | 14 | 5,887 | 15,176
RECOVERAdult_BDC_202403_biospecimens_deid.tsv | 822,310 | 12 | 77 | 13,460

Stool | Stool
Urine | Urine
Nasal/NP swab | Nasal or NP Cells
Oragene 600 – Saliva | Saliva
Serum Separator Tube (SST) – Whole Blood | Serum
Sodium Citrate – Whole Blood | Plasma

acth_and_cortisol_test | 70,243
adult_delivery_and_outcome_form | 5,749
adult_echo_data | 6,173
alcohol_and_tobacco | 162,890
alcohol_and_tobacco_followup | 381,796
assessment_scores | 1,866,843

RECOVERAdult_BDC_202403_demographics_deID | 1. Missing enroll_date | 39
RECOVERAdult_BDC_202403_demographics_deID | 2. Missing DOB | 82
RECOVERAdult_BDC_202403_biospecimens_deID | 3. Data entry error in collection_date, reported as a future event | 7
RECOVERAdult_BDC_202403_visits_deID | 4. Data entry error in visit_start_date, reported as a future event | 17

Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting

alcohol_and_tobacco | 156,454
alcohol_and_tobacco_followup | 251,039
assessment_scores | 1,281,009
covid_treatment | 183,355
demographics | 110,778
disability | 93,956

    NIH RECOVER Adult Cohort Observational Study
    REDCap Codebook
    link
    link
    JAMA. 2023
    BioData Catalyst Support
    protocol
    design publication
    Pediatric Observational Cohort Study
    RECOVER
    study protocol
    Adolescent Brain Cognitive Development (ABCD) study
    Long-terM OUtcomes after the Multisystem Inflammatory Syndrome In Children (MUSIC)
    JAMA 2024.12747
    Part 1
    Part 2
    listing of all surveys and questions
    BioData Catalyst Support
    protocol
    design publication
    participant surveys
    BioData Catalyst Support
    protocol
    design publication
    participant surveys
    BioData Catalyst Support
    protocol
    design publication
    participant surveys
    NIH RECOVER Adult Cohort Observational Study
    REDCap Codebook
    protocol
    design publication
    NIH RECOVER Adult Cohort Observational Study
    RECOVER Data Dictionary/REDCap Codebook
    protocol
    design publication

    Date shifting

    RECOVERAdult_BDC_202406_concepts_BDC.tsv

    Truncation to year

    Truncation to year

    Truncation to year

    Date shifting

    RECOVERAdult_BDC_202403_concepts_deid.tsv

    Date shifting

    end_of_participation