NIH RECOVER Release Notes

Adult Observational Cohort Study: Dataset Release Notes

November 2024

RECOVER Adult and Pregnancy Observational Cohort Studies

Release file: phs003463.v3.p2

This release contains data collected from the NIH RECOVER Adult Cohort Observational Study (RECOVER-Adult) between October 29, 2021 and June 15, 2024. These data were obtained from 15,183 adult participants (including a sub-cohort of 2,332 pregnant women) attending 135,791 study interactions across 83 geographically dispersed enrolling sites. The dataset also includes an inventory of biospecimen samples (935,596 aliquots) collected during baseline and follow-up visits, plus wearable sensor metadata from 2,932 participants in the Digital Health Program. Overall, this release comprises approximately 5,386 data elements and 37 million datapoints. Please refer to the REDCap Codebook for a Data Dictionary organized as a list of all surveys/forms and their respective data fields.

RECOVER-Adult is a combined retrospective and prospective, longitudinal, observational meta-cohort study of individuals aged ≥ 18 who enter the cohort with and without SARS-CoV-2. Individuals with a prior SARS-CoV-2 infection enter the study at varying times after their infection. Individuals with and without SARS-CoV-2 infection, and with or without PASC symptoms, are followed to identify risk factors and occurrence of PASC. This study is being conducted in the United States, with subjects recruited through inpatient, outpatient, and community-based settings. Study data including age, demographics, social determinants of health, medical history, vaccination history, details of acute SARS-CoV-2 infection, overall health and physical function, and PASC symptoms are reported at quarterly intervals. Biologic specimens are also collected at specified intervals, with some tests performed in local clinical laboratories and others performed by centralized research centers or banked in the Biospecimen Repository. Advanced clinical examinations and radiologic examinations are performed at local study sites with cross-site standardization. Please refer to this link for details on the RECOVER-Adult study rationale, design and objectives, and to this link for more on the design of the RECOVER-Pregnancy study.

Importantly, this release includes data underlying the first publication of primary results from the RECOVER-Adult study: Thaweethai et al., Development of a definition of post-acute sequelae of SARS-CoV-2 Infection. JAMA. 2023 Jun 13;329(22):1934-1946.

Please note that this release does not contain any genomic or metabolomic data.

RELEASE NOTES FOR THE RECOVER Adult and Pregnancy Observational Cohort Studies

Data Quality

The RECOVER-Adult and RECOVER-Pregnancy Observational Cohort Studies consist of approximately 37 million datapoints (31 million Adult Study datapoints and 6 million Pregnancy Study datapoints) that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.

Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact BioData Catalyst Support with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

Masking of PARTICIPANT_IDs was performed according to the following protocol:

Maintain PARTICIPANT_IDs assigned in the previous data release (202403.1).
Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14097) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
1. Note: From the 202309.1 release, participant IDs were assigned to 14,662 individuals in a random and not contiguous fashion from integers between 1 – 14702, including an excess of 40 integers left unassigned at random. From the 202403.1 release, additional IDs were assigned at random from integers between 14703 – 15211.
2. For this release, no new IDs were assigned.
Add protocol codes as the prefix for randomly assigned Participant IDs.
1. Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
2. Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
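
The masking steps above can be sketched in a few lines of Python. This is an illustration only, not the pipeline used to produce the release: the function names, the "<prefix>-<n>" layout of the masked ID, and the assumption that VISIT_ID begins with the original participant ID followed by an underscore are all hypothetical.

```python
import random


def build_lookup(original_ids, prefix, seed=None):
    """Map each unique original ID to a prefixed sequential number."""
    unique_ids = sorted(set(original_ids))   # de-duplicate
    random.Random(seed).shuffle(unique_ids)  # random reorder
    # Join the sequence 1..N to the shuffled IDs.
    # The "<prefix>-<n>" layout is hypothetical, not the release format.
    return {orig: f"{prefix}-{n}" for n, orig in enumerate(unique_ids, start=1)}


def mask_rows(rows, lookup):
    """Replace PARTICIPANT_ID with its masked value in each row dict."""
    masked = []
    for row in rows:
        new_row = dict(row)
        original = new_row["PARTICIPANT_ID"]
        new_row["PARTICIPANT_ID"] = lookup[original]  # original value is dropped
        visit_id = new_row.get("VISIT_ID")
        marker = f"{original}_"  # assumed layout: "<participant id>_<rest>"
        if visit_id is not None and visit_id.startswith(marker):
            new_row["VISIT_ID"] = visit_id[len(marker):]
        masked.append(new_row)
    return masked
```

Because one lookup table is built from the de-duplicated IDs of all tables and then applied to each table, a participant receives the same masked ID everywhere, so the de-identified tables can still be joined on PARTICIPANT_ID.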

ZIP Codes

ZIP Codes were truncated to 3 digits according to the following protocol:

Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Date of birth and deceased dates were truncated according to the following protocol:

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Cap age at enrollment at a maximum of 89 (top capping).
Floor the year of birth at 1933 or 1934, depending on the enrollment date (bottom capping).
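
As a rough illustration of the truncation and capping steps above, the following Python sketch assumes that dates are stored as strings beginning with a four-digit year (for example "1950-07-04") and that the caller supplies the floor year (1933 or 1934). The release notes do not state the rule for choosing between the two floor years, so that choice is left as a parameter. All function names are hypothetical.

```python
MAX_AGE = 89  # top cap for age at enrollment, as stated in the protocol


def truncate_to_year(date_text):
    """Keep the first four characters (the year); missing values stay missing."""
    return date_text[:4] if date_text else None


def cap_age(age):
    """Top-cap age at enrollment at MAX_AGE."""
    return None if age is None else min(age, MAX_AGE)


def floor_birth_year(year_text, floor_year):
    """Bottom-cap a year of birth at floor_year (1933 or 1934, chosen by the caller)."""
    if year_text is None:
        return None
    return str(max(int(year_text), floor_year))
```

For example, a date of birth of "1920-02-11" would be truncated to "1920" and then raised to the floor year, while "1980-05-30" would simply become "1980".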

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol. Provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERAdult_BDC_202406_concepts_deID.csv”
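
The date-shifting protocol above can be sketched as follows. This is a minimal illustration under one stated assumption: "the first day of the year" is read here as January 1 of the anchor date's own year, which keeps every shift within a range of one year. The function names are hypothetical and are not part of the release tooling.

```python
import random
from datetime import date, timedelta


def pick_anchor(enroll_date, rng):
    """Use ENROLL_DATE when present, otherwise a random date within 2023."""
    if enroll_date is not None:
        return enroll_date
    return date(2023, 1, 1) + timedelta(days=rng.randrange(365))  # 2023 has 365 days


def shift_offset(anchor):
    """Difference from the original anchor to January 1 of the same year (assumed)."""
    return date(anchor.year, 1, 1) - anchor  # zero or negative timedelta


def shift_dates(dates, offset):
    """Apply one participant's offset to all of their dates; nulls stay null."""
    return [None if d is None else d + offset for d in dates]
```

Because a single offset is applied to all of a participant's dates, the intervals between that participant's events are unchanged after shifting; only the calendar position moves.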

Summary

The following table lists the variables that were de-identified, along with the de-identification protocol that was applied.

Data Table | Variable | De-Identification Protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Fitbit | SUMMARY_DATE | Date shifting

The following table provides a brief description of the 6 BDC files in this release.

Filename | Rows | Columns | File Size (MB) | Participants
RECOVERAdult_BDC_202406_answerdata_BDC.tsv | 28,979,068 |  | 7,629 | 15,177
RECOVERAdult_BDC_202406_biospecimens_BDC.tsv | 935,596 |  | 104 | 13,503
RECOVERAdult_BDC_202406_concepts_BDC.tsv | 191,528 |  |  |
RECOVERAdult_BDC_202406_demographics_BDC.tsv | 15,183 |  |  |
RECOVERAdult_BDC_202406_fitbit_BDC.tsv | 6,415,774 |  | 704 | 2,932
RECOVERAdult_BDC_202406_visits_BDC.tsv | 135,791 |  |  | 15,182
Total | 36,672,940 |  | 8,542 |
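
As a quick consistency check, the six row counts listed above do add up to the stated total of 36,672,940. The short Python sketch below reproduces that sum and includes a hypothetical helper for counting the data rows of a downloaded file. Whether the published row counts exclude the header line is an assumption here, not something the release notes state.

```python
ROW_COUNTS = {
    "answerdata": 28_979_068,
    "biospecimens": 935_596,
    "concepts": 191_528,
    "demographics": 15_183,
    "fitbit": 6_415_774,
    "visits": 135_791,
}
STATED_TOTAL = 36_672_940


def count_data_rows(lines):
    """Count non-empty lines after the first (header) line of a TSV."""
    iterator = iter(lines)
    next(iterator, None)  # skip the header row (assumed to be present)
    return sum(1 for line in iterator if line.strip())


def totals_agree():
    """True when the per-file row counts sum to the stated total."""
    return sum(ROW_COUNTS.values()) == STATED_TOTAL
```

An open file object can be passed directly to `count_data_rows`, since iterating over a text file yields one line at a time without loading the whole file into memory.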

The following table provides a brief description of the biospecimens included in this release.

Collected Biospecimens

Stored Biospecimens

Stool

Urine

Nasal/NP swab

Nasal or NP Cells

Oragene 600 – Saliva

Saliva

Serum Separator Tube (SST) – Whole Blood

Serum

Sodium Citrate – Whole Blood

Plasma

PAXgene RNA – Whole Blood

Whole Blood

Cell Preservation Tube (CPT) – Whole Blood

Peripheral Blood Mononuclear Cells (PBMCs)

EDTA – Whole Blood

Plasma

White Blood Cells (WBCs)

The following table lists all 86 REDCap forms in the dataset, and the corresponding number of variables/rows of data associated with each form.

Form name | Total Variables per Form
acth_and_cortisol_test | 7264
adult_delivery_and_outcome_form | 341
alcohol_and_tobacco | 14025
alcohol_and_tobacco_followup | 13460
assessment_scores | 14967
audiometry_survey | 421
biospecimens | 13792
brain_mri_quality_confirmation | 723
brain_mri_with_gadolinium | 6057
cardiac_mri | 601
cardiac_mri_reading_center |
cardiopulmonary_exercise_testing | 1588
cardiovagal_innervation_testing | 310
change_in_symptoms_since_infection | 5044
chest_ct | 6759
chest_ct_reading_center |
clinical_labs | 1808
colonoscopy |
comorbidities | 14427
comprehensive_audiometry | 4431
covid_treatment | 11554
cpet_reading_center |
demographics | 14668
disability | 14034
drc_data | 7868
echocardiogram_with_strain | 5692
electrocardiogram | 7389
electromyography |
end_of_participation | 1737
endopat_testing | 1336
enrollment | 15177
facility_sleep_questionnaire_morning_after | 131
facility_sleep_questionnaire_night_before | 140
facility_sleep_study | 1696
fibroscan | 3742
formal_neuropsychological_testing | 1421
full_ent_examination | 373
gastric_emptying_study |
hepatitis_tests | 4475
home_polysomnography_with_ess_and_isi | 4621
home_sleep_assessment | 1456
long_covid_treatment_trial | 11554
lumbar_puncture |
medication_changes | 13464
medications | 8573
mhp_data | 14828
mini | 5518
mini_prequestionnaire | 1720
neonatal_delivery_and_outcome_form | 333
nerve_conduction_study | 104
neuropathy_examination | 4387
new_covid_infection | 9994
nih_toolbox | 5366
oral_glucose_test | 5155
pasc_symptoms | 14535
pcl5 | 1418
pft_reading_center | 249
pg13r | 148
plasma_catecholamine_testing | 307
pregnancy | 10256
pregnancy_followup | 9568
psg_data | 109
psg_quality_summary_form | 134
pulmonary_function_tests | 6601
recent_covid_treatment | 4111
rehabilitation_testing | 6782
renal_ultrasound | 2755
research_labs | 14147
serum_b12_and_methylmalonic_acid | 4006
six_minute_walk_test | 7184
skin_biopsy |
sleep_reading_center | 1265
social_determinants_of_health | 14113
social_determinants_of_health_followup | 13470
study_termination |
tier_12_consent_tracking | 13417
tier_1_consent_tracking | 2537
tier_1_office_visit | 14464
tier_2_consent_tracking | 1068
tilt_table_test | 307
upsit_smell_test | 5264
vaccine_status | 14490
vision_testing | 6775
visit_form | 14968
wearable_data | 978
withdrawal | 603

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v3.p2, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

October 2024

RECOVER Pediatric Observational Cohort Study

Release file: phs003461.v1.p1

This release contains data collected between March 17, 2022 and June 15, 2024 from the Pediatric Observational Cohort Study (“RECOVER-Pediatrics”) of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative. These data were obtained from 24,621 participants attending 62,921 study visits across 105 geographically dispersed enrolling sites. The dataset includes descriptions of 85,858 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from approximately 400 participants in the RECOVER Digital Health Program. Taken together, this release consists of 8,604 data elements (variables) and approximately 13.6 million datapoints.

RECOVER-Pediatrics is an observational meta-cohort study of caregiver-child pairs (birth through 17 years) and young adults (18 through 25 years). As described in the study protocol, the pediatric meta-cohort consists of four distinct cohorts: (1) a de novo RECOVER prospective cohort including children and young adults ages birth through 25 years, with or without a known history of SARS-CoV-2 infection, and their respective caregivers; (2) an extant cohort from the Adolescent Brain Cognitive Development (ABCD) study, the largest long-term US study of brain development in adolescence; (3) an in utero exposure cohort, including children less than 3 years old born to individuals with and without a SARS-CoV-2 infection during pregnancy; and (4) an extant cohort from the NHLBI Study on Long-terM OUtcomes after the Multisystem Inflammatory Syndrome In Children (MUSIC).

Importantly, this release includes data underlying the first publication of primary results from the RECOVER-Pediatrics study: Gross et al., Characterizing Long COVID in Children and Adolescents. JAMA. Published online August 21, 2024. doi:10.1001/jama.2024.12747.

Detailed descriptions of the data elements used in RECOVER-Pediatrics can be found in three separate data dictionaries:

REDCap Codebook for the Age 13-17 and 18-25 sub-studies (including ABCD and MUSIC extant cohorts); due to the number of pages, this document has been divided into two files:
- Part 1
- Part 2
REDCap Codebook for the Caregiver sub-study
REDCap Codebook for the Congenital sub-study

These codebooks are organized as a list of all surveys/forms with their respective data elements. Also available is a listing of all surveys and questions in RECOVER-Pediatrics.

Please note that this release does not contain any genomic or metabolomic data.

RELEASE NOTES FOR THE “PEDIATRIC-MAIN” STUDY (Includes Age 13-17 and 18-25 sub-studies, plus data from the ABCD and MUSIC extant cohorts)

Data Quality

The Pediatric-Main study consists of approximately 9.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact BioData Catalyst Support with any questions or concerns regarding this release of RECOVER data.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric-Main observational cohort data:

Masking of IDs using a randomly assigned number between 1 and 14097 (number of unique PARTICIPANT_IDs within the Pediatric cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment
Masking of Pediatric Caregiver IDs on the Answerdata table according to corresponding masked IDs from the Pediatric Caregiver cohort

Participant IDs

Masking of PARTICIPANT_IDs was performed according to the following protocol:

Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14097) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
- Note: From the previous release, participant IDs were assigned to 10549 individuals in a random fashion from integers between 1 – 10549. 16 individuals were dropped from that release, so there are 16 random integers that are unassigned in the lookup table for this release.
- For this release, new IDs were assigned to participants covering 10550 - 14097.
Add protocol codes as the prefix for randomly assigned Participant IDs.
- Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
- Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
VISIT_TYPE on the Visits table contains dates. Those dates have been removed in the de-identified data, with the rest of VISIT_TYPE retained.
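
The way earlier ID assignments are carried forward while new participants receive numbers from a fresh range can be sketched as below. This is an assumption-laden illustration rather than the actual procedure: `extend_lookup` and its parameters are hypothetical names, and the first new number (10550 in this release) is passed in explicitly rather than derived.

```python
import random


def extend_lookup(previous, current_ids, first_new_number, seed=None):
    """Keep earlier assignments and give new participants fresh numbers.

    previous: dict of original ID -> number assigned in an earlier release.
    current_ids: original IDs present in the current release.
    """
    current = set(current_ids)
    # Participants still present keep their number; numbers that belonged to
    # dropped participants are never reused, so they remain unassigned.
    lookup = {orig: num for orig, num in previous.items() if orig in current}
    new_ids = sorted(current - set(previous))
    random.Random(seed).shuffle(new_ids)  # random order for the new block
    for num, orig in enumerate(new_ids, start=first_new_number):
        lookup[orig] = num
    return lookup
```

Keeping the earlier numbers stable means that analyses built on a previous release can be matched to the same participants in this one, at the cost of leaving gaps where participants were dropped.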

ZIP Codes

ZIP Codes were truncated to 3 digits according to the following protocol:

Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Date of birth and deceased dates were truncated according to the following protocol:

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Cap age at enrollment at a maximum of 89 (top capping).
Floor the year of birth at 1933 or 1934, depending on the enrollment date (bottom capping).

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol. Provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatric_BDC_202406_concepts_deID.csv”

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

Table | Variable | De-Identification Protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Pediatric Caregiver IDs
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | VISIT_TYPE | Removal of dates
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Fitbit | SUMMARY_DATE | Date shifting

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.

RELEASE NOTES FOR THE “PEDIATRIC-CAREGIVER” STUDY

Data Quality

The Pediatric-Caregiver study consists of approximately 2.8 million datapoints that were generated from surveys and biospecimen collections (Saliva and Tasso blood spot). Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric Caregiver cohort data:

Masking of IDs using a randomly assigned number between 1 and 8820 (number of unique PARTICIPANT_IDs within the Pediatric Caregiver cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment

Participant IDs

Masking of PARTICIPANT_IDs was performed according to the following protocol:

Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric Caregiver cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (8820) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
- Note: From the previous release, participant IDs were assigned to 6714 individuals in a random fashion from integers between 1 – 6714. 6 individuals were dropped from that release, so there are 6 random integers that are unassigned in the lookup table for this release.
- For this release, new IDs were assigned to participants covering 6715 - 8820.
Add protocol codes as the prefix for randomly assigned Participant IDs.
- Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
- Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric Caregiver cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

ZIP Codes

ZIP codes were truncated to 3 digits according to the following protocol:

Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

Participant date of birth and deceased date were truncated according to the following protocol:

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Cap age at enrollment at a maximum of 89 (top capping).
Floor the year of birth at 1933 or 1934, depending on the enrollment date (bottom capping).

Remaining Dates

Date shifting of all other dates was performed according to the following protocol:

Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCaregiver_BDC_202406_concepts_deID.csv”

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

Table | Variable | De-Identification Protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Visits | VISIT_START_DATE | Date shifting

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

RELEASE NOTES FOR THE “PEDIATRIC-CONGENITAL” STUDY

Data Quality

The Pediatric-Congenital study consists of approximately 0.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.

De-Identification Process

The following steps were taken to de-identify the RECOVER Pediatric Congenital cohort data:

Masking of IDs using a randomly assigned number between 1 and 1702 (number of unique PARTICIPANT_IDs within the Pediatric Congenital cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment
Masking of Adult IDs on the Answerdata table according to corresponding masked IDs from the Adult cohort

Participant IDs

We perform masking of PARTICIPANT_IDs according to the following protocol:

Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric Congenital cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (1702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
- Note: From the previous release, participant IDs were assigned to 995 individuals in a random fashion from integers between 1 – 995.
- For this release, new IDs were assigned to participants covering 996 - 1702.
Add protocol codes as the prefix for randomly assigned Participant IDs.
- Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
- Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric Congenital cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

ZIP Codes

Truncation of ZIP codes to 3 digits was performed according to the following protocol:

Keep a substring of ZIP codes from the first to the third characters.

Dates of Birth and Ages

We perform truncation of participant date of birth and deceased date according to the following protocol:

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Cap age at enrollment at a maximum of 89 (top capping).
Floor the year of birth at 1933 or 1934, depending on the enrollment date (bottom capping).

Remaining Dates

We perform date shifting of all other dates according to the following protocol:

We select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, we generate a random anchor date within the year 2023 (2023/01/01 to 2023/12/31).
We shift the anchor date to the first day of the year, i.e., 2023/01/01.
We calculate the difference in days from the original to the shifted anchor date.
We recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCongential_BDC_202406_concepts_deID.csv”.
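
The date-shifting protocol above can be sketched as follows. This is an illustration under stated assumptions rather than the production code: the function names are hypothetical, and the sketch assumes the anchor is moved to January 1 of the anchor's own year (so a 2023 anchor lands on 2023/01/01).

```python
import random
from datetime import date, timedelta


def shift_offset(enroll_date, rng=random):
    """Return the timedelta that moves the anchor date to January 1 of its year."""
    if enroll_date is None:
        # No ENROLL_DATE: draw a random anchor between 2023/01/01 and 2023/12/31.
        anchor = date(2023, 1, 1) + timedelta(days=rng.randrange(365))
    else:
        anchor = enroll_date
    # Assumption: the shifted anchor is the first day of the anchor's own year.
    shifted_anchor = date(anchor.year, 1, 1)
    return shifted_anchor - anchor  # zero or negative number of days


def shift_dates(dates, offset):
    """Apply one participant's offset to all of that participant's other dates."""
    return [d + offset if d is not None else None for d in dates]


if __name__ == "__main__":
    offset = shift_offset(date(2022, 3, 15))
    print(offset.days)
    print(shift_dates([date(2022, 3, 15), date(2022, 6, 1)], offset))
```

Because every date for a participant moves by the same number of days, intervals between that participant's events are preserved even though the calendar dates are not.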

Summary

The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.

| Table | Variable | De-Identification Protocol |
| --- | --- | --- |
| Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 1702 |
| Demographics | ENROLL_DATE | Date shifting |
| Demographics | ENROLL_INDEX_DATE | Date shifting |
| Demographics | DOB | Truncation to year |
| Demographics | WITHDRAW_DATE | Date shifting |
| Demographics | DECEASED_DATE | Truncation to year |
| Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits |
| Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 1702 |
| Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Adult IDs |
| Answerdata | DATA_ENTRY_DATE | Date shifting |
| Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 1702 |
| Biospecimens | COLLECTION_DATE | Date shifting |
| Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 1702 |
| Visits | VISIT_START_DATE | Date shifting |

Information for Authors

Protocol Complexity

The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

June 2024

Adult Observational Cohort Study: Release Notes

Release file: phs003463.v2.p2

This release contains data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and March 15, 2024. These data were obtained from 15,204 adult participants (including a sub-cohort of 2,192 pregnant women) attending 122,029 study visits across 79 geographically dispersed enrolling sites. The dataset also includes a description of 822,310 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from 2,602 participants in the Digital Health Program. Overall this release comprises approximately 35 million rows of data and a total of 5,278 data elements. Please refer to the accompanying REDCap Codebook for a Data Dictionary organized as a list of all surveys/forms and their respective data fields. Also, please note that this release does not contain any genomic or metabolomic data.

Data De-identification Protocols

The following steps were undertaken to de-identify the dataset:

Masking of IDs using a randomly assigned number between 1 and 15,251 (the Adult cohort data contain 15,204 unique PARTICIPANT_IDs); NOTE: Participant IDs assigned in the previous data release (202309.1) were maintained in the current release.
Truncation of ZIP codes to 3 digits.
Truncation of participant date of birth and deceased date to the year.
Date shifting of all other dates within the data within a range of 1 year.
Winsorization of Adult Cohort age at enrollment and date of birth information.
Removal of free text fields that did not have data entry validation in REDCap.

These steps are described in more detail below.

Participant_ID masking

Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (15204) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
- Note: In the previous release, participant IDs were assigned to 14,662 individuals at random, and not contiguously, from the integers 1 – 14702, leaving 40 integers unassigned. Seven of those 14,662 individuals have since withdrawn from the study, and their previously assigned integers remain vacant, so 14,655 individuals with randomly assigned integers between 1 – 14702 continue into the current release.
- For the current release, 549 new participants were added to the study with randomly assigned integers between 14703 – 15251.
Add protocol codes as the prefix for randomly assigned Participant IDs.
- Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
- Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
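
The carry-over of previously assigned integers described in the notes above can be sketched as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, and the lookup is represented as a plain dictionary from original ID to assigned integer.

```python
import random


def extend_lookup(previous, current_ids, first_new_number, seed=None):
    """Keep earlier ID assignments and give new participants the next integers.

    previous: dict mapping original ID -> integer assigned in the prior release.
    current_ids: original IDs present in the current release.
    first_new_number: first integer available for new participants (e.g. 14703).
    """
    current = set(current_ids)
    # Participants still in the study keep their integer; integers belonging to
    # withdrawn participants are simply left vacant.
    lookup = {pid: n for pid, n in previous.items() if pid in current}
    # New participants are shuffled, then numbered consecutively.
    new_ids = sorted(current - set(previous))
    random.Random(seed).shuffle(new_ids)
    for n, pid in enumerate(new_ids, start=first_new_number):
        lookup[pid] = n
    return lookup


if __name__ == "__main__":
    prior = {"p1": 1, "p2": 3, "p3": 5}
    print(extend_lookup(prior, ["p1", "p3", "p4", "p5"], first_new_number=7, seed=0))
```

With the figures quoted above, 14,655 continuing participants plus 549 new ones gives 15,204 unique IDs, and numbering the new participants from 14703 ends at 15251.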

Truncation of ZIP Codes

Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).

Truncation of participant date of birth and deceased date

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Top capping age at enrollment at 89.
Bottom capping date of birth at 1933 or 1934 (based on enrollment date).

NOTE: To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
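
A minimal sketch of this capping logic is shown below. The function name and the flag it returns are hypothetical, and deriving the birth-year floor as enrollment year minus 89 is an assumption that happens to reproduce the two stated pairs (1933 for 2022, 1934 for 2023); the notes do not give the rule in that form.

```python
MAX_AGE = 89  # Safe Harbor top cap on reported age


def winsorize_age(age_at_enrollment, birth_year, enroll_year):
    """Return (capped_age, capped_birth_year, over_cap_flag) for one participant."""
    # Flag participants older than 89 so they can be told apart from those aged 89.
    over_cap = age_at_enrollment > MAX_AGE
    capped_age = min(age_at_enrollment, MAX_AGE)
    # Assumed floor: enrollment year minus 89 (2022 -> 1933, 2023 -> 1934).
    earliest_birth_year = enroll_year - MAX_AGE
    capped_birth_year = max(birth_year, earliest_birth_year)
    return capped_age, capped_birth_year, over_cap


if __name__ == "__main__":
    print(winsorize_age(93, 1929, 2022))
    print(winsorize_age(45, 1978, 2023))
```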

Date shifting (All other dates)

Date shifting of all other dates was performed as follows:

Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol.

The following table lists specific variables requiring de-identification, and the de-identification protocol that was applied.

| Data Table | Variable | De-identification Protocol |
| --- | --- | --- |
| Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
| Demographics | ENROLL_DATE | Date shifting |
| Demographics | ENROLL_INDEX_DATE | Date shifting |
| Demographics | CROSSOVER_INDEX_DATE | Date shifting |
| Demographics | DOB | Truncation to year |
| Demographics | WITHDRAW_DATE | Date shifting |
| Demographics | DECEASED_DATE | Truncation to year |
| Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits |
| Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
| Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Answerdata | ANSWER_TEXT_VAL | Date shifting |
| Answerdata | DATA_ENTRY_DATE | Date shifting |
| Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
| Biospecimens | COLLECTION_DATE | Date shifting |
| Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
| Visits | VISIT_START_DATE | Date shifting |
| Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
| Fitbit | SUMMARY_DATE | Date shifting |

The following table provides a brief description of the 6 BDC files in this release.

| Filename | Rows | Columns | File Size (MB) | Participants |
| --- | --- | --- | --- | --- |
| RECOVERAdult_BDC_202403_answerdata_deid.tsv | 24,043,845 | | 5,887 | 15,176 |
| RECOVERAdult_BDC_202403_biospecimens_deid.tsv | 822,310 | | | 13,460 |
| RECOVERAdult_BDC_202403_concepts_deid.tsv | 173,835 | | 106 | --- |
| RECOVERAdult_BDC_202403_demographics_deid.tsv | 15,204 | | | |
| RECOVERAdult_BDC_202403_fitbit_deid.tsv | 9,889,039 | | 1,030 | 2,602 |
| RECOVERAdult_BDC_202403_visits_deid.tsv | 122,029 | | | 15,185 |
| Total | 35,066,262 | | 7,112 | |

The following table provides a brief description of the biospecimens included in this release.

Collected Biospecimens

Stored Biospecimens

Stool

Urine

Nasal/NP swab

Nasal or NP Cells

Oragene 600 – Saliva

Saliva

Serum Separator Tube (SST) – Whole Blood

Serum

Sodium Citrate – Whole Blood

Plasma

PAXgene RNA – Whole Blood

Whole Blood

Cell Preservation Tube (CPT) – Whole Blood

Peripheral Blood Mononuclear Cells (PBMCs)

EDTA – Whole Blood

Plasma

White Blood Cells (WBCs)

The following table lists all 85 REDCap forms in the dataset, and the corresponding number of variables/rows of data associated with each form.

| FORM_NAME | Total Variables per Form |
| --- | --- |
| acth_and_cortisol_test | 70,243 |
| adult_delivery_and_outcome_form | 5,749 |
| adult_echo_data | 6,173 |
| alcohol_and_tobacco | 162,890 |
| alcohol_and_tobacco_followup | 381,796 |
| assessment_scores | 1,866,843 |
| audiometry_survey | 1,188 |
| Biospecimens | 485,333 |
| bmri_import | 39,678 |
| brain_mri_quality_confirmation | 2,703 |
| brain_mri_with_gadolinium | 23,510 |
| cardiac_mri | 789 |
| cardiac_mri_reading_center | |
| cardiopulmonary_exercise_testing | 3,283 |
| cardiovagal_innervation_testing | 638 |
| change_in_symptoms_since_infection | 21,439 |
| chest_ct | 47,387 |
| chest_ct_reading_center | 185 |
| clinical_labs | 92,771 |
| Colonoscopy | |
| Comorbidities | 1,923,579 |
| comprehensive_audiometry | 51,936 |
| covid_treatment | 191,414 |
| cpet_reading_center | |
| Demographics | 115,510 |
| Disability | 97,810 |
| drc_data | 9,909 |
| echocardiogram_with_strain | 73,980 |
| electrocardiogram | 91,215 |
| electromyography | 112 |
| end_of_participation | 5,862 |
| endopat_testing | 8,164 |
| Enrollment | 160,289 |
| facility_sleep_questionnaire_morning_after | 416 |
| facility_sleep_questionnaire_night_before | 1,640 |
| facility_sleep_study | 2,152 |
| Fibroscan | 17,927 |
| formal_neuropsychological_testing | 12,294 |
| full_ent_examination | 1,253 |
| gastric_emptying_study | 115 |
| hepatitis_tests | 37,936 |
| home_polysomnography_with_ess_and_isi | 22,170 |
| home_sleep_assessment | 32,266 |
| hsat_data | 85,222 |
| long_covid_treatment_trial | 90,359 |
| medication_changes | 122,181 |
| Medications | 64,630 |
| mhp_data | 60,104 |
| Mini | 83,729 |
| mini_prequestionnaire | 4,447 |
| neonatal_delivery_and_outcome_form | 3,754 |
| nerve_conduction_study | 157 |
| neuropathy_examination | 130,245 |
| new_covid_infection | 41,621 |
| nih_toolbox | 59,473 |
| oral_glucose_test | 36,264 |
| pasc_symptoms | 7,714,508 |
| pcl5 | 21,548 |
| pft_reading_center | 6,768 |
| pg13r | 1,084 |
| plasma_catecholamine_testing | 630 |
| Pregnancy | 109,716 |
| pregnancy_followup | 86,875 |
| psg_data | 6,474 |
| psg_quality_summary_form | 2,164 |
| pulmonary_function_tests | 85,186 |
| recent_covid_treatment | 60,207 |
| rehabilitation_testing | 82,645 |
| renal_ultrasound | 19,464 |
| research_labs | 4,012,350 |
| serum_b12_and_methylmalonic_acid | 31,639 |
| six_minute_walk_test | 101,931 |
| skin_biopsy | 131 |
| sleep_reading_center | 25,471 |
| social_determinants_of_health | 801,331 |
| social_determinants_of_health_followup | 678,872 |
| study_termination | |
| tier_12_consent_tracking | 113,913 |
| tier_1_office_visit | 1,719,894 |
| tilt_table_test | 636 |
| upsit_smell_test | 184,618 |
| vaccine_status | 276,537 |
| vision_testing | 64,756 |
| visit_form | 978,660 |
| wearable_data | 1,910 |
| Withdrawal | 953 |
| Total Variables | 24,043,845 |

Data Quality

Data quality issues found in this dataset are listed below.

| File | Error Type | N Rows |
| --- | --- | --- |
| RECOVERAdult_BDC_202403_demographics_deID | 1. Missing enroll_date | |
| RECOVERAdult_BDC_202403_demographics_deID | 2. Missing DOB | |
| RECOVERAdult_BDC_202403_biospecimens_deID | 3. Data entry error in collection_date, reported as a future event | |
| RECOVERAdult_BDC_202403_visits_deID | 4. Data entry error in visit_start_date, reported as a future event | |
| RECOVERAdult_BDC_202403_fitbit_deID | 5. Data entry error in summary_date, reported as a future event | |
| RECOVERAdult_BDC_202403_answerdata_deID | 6. Data entry error in data_entry_date, reported as a future event | 870 |
| RECOVERAdult_BDC_202403_answerdata_deID | 7. Tab characters present in CONCEPT_NAME variable, causing issues when reading file as tab-separated values | |
| RECOVERAdult_BDC_202403_concepts | 8. Concept codes repeated across multiple entries | |

Information for Authors

The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:

Protocol Complexity

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

April 2024

Adult Observational Cohort Study: Dataset Release Notes

Release file: phs003463.v1.p1

This release contains subsets of data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and September 15, 2023. These data were obtained from 14,662 participants attending 92,355 study visits across 79 geographically dispersed enrolling sites. The dataset also includes an inventory of 611,882 biospecimens collected at various timepoints, wearable sensor data from the digital health program for 195 participants, and a total of 3,175 data elements. Please refer to the RECOVER Data Dictionary/REDCap Codebook for this release for a list of all surveys/forms and their respective data fields.

Data De-identification Protocols

The following steps were undertaken to de-identify the dataset:

Masking of IDs using a randomly assigned number between 1 and 14702 (number of unique PARTICIPANT_IDs within the Adult cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of Adult Cohort age at enrollment and date of birth information
Removal of free text fields that did not have data entry validation in REDCap

These steps are discussed in more detail below.

Participant_ID masking

Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform deduplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.

Truncation of ZIP Codes

Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).

Truncation of participant date of birth and deceased date

Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.

Date shifting (All other dates)

Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table were shifted according to this protocol. The concepts that correspond to date values within the Answerdata table are provided in the file “RECOVERAdult_i2b2_concepts_BDC_dateshift.xlsx”.

Winsorization of participant ages

To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.

The following table details the variables that were de-identified, and the de-identification protocol that was applied.

| Data Table | Variable | De-Identification Protocol |
| --- | --- | --- |
| Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
| Demographics | ENROLL_DATE | Date shifting |
| Demographics | ENROLL_INDEX_DATE | Date shifting |
| Demographics | CROSSOVER_INDEX_DATE | Date shifting |
| Demographics | DOB | Truncation to year; bottom capping at 1933 or 1934 |
| Demographics | AGE_AT_ENROLLMENT | Top capping at 89 |
| Demographics | WITHDRAW_DATE | Date shifting |
| Demographics | DECEASED_DATE | Truncation to year |
| Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits |
| Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
| Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Answerdata | ANSWER_TEXT_VAL | Date shifting |
| Answerdata | DATA_ENTRY_DATE | Date shifting |
| Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
| Biospecimens | COLLECTION_DATE | Date shifting |
| Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
| Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
| Visits | VISIT_START_DATE | Date shifting |
| Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
| Fitbit | SUMMARY_DATE | Date shifting |

Included Data

The forms in this release are inclusive of baseline (first) enrollment visits and all subsequent follow-up visits through September 15, 2023. Collectively they represent 8.9 million rows of data (54% of the REDCap data) and were selected as they have a limited number of outstanding data queries (outside of missingness). When combined with the wearable sensor data and biospecimen inventory data, the release includes 10.1 million rows of data.

| FORM_NAME | Rows |
| --- | --- |
| alcohol_and_tobacco | 156,454 |
| alcohol_and_tobacco_followup | 251,039 |
| assessment_scores | 1,281,009 |
| covid_treatment | 183,355 |
| demographics | 110,778 |
| disability | 93,956 |
| end_of_participation | 3,771 |
| enrollment | 154,305 |
| long_covid_treatment_trial | 38,801 |
| new_covid_infection | 26,366 |
| pasc_symptoms | 5,081,366 |
| pregnancy | 103,109 |
| pregnancy_followup | 54,437 |
| recent_covid_treatment | 41,019 |
| social_determinants_of_health | 769,434 |
| social_determinants_of_health_followup | 445,687 |
| study_termination | |
| tier_12_consent_tracking | 97,222 |
| visit_form | 749 |
| withdrawal | 961 |

Important Information for Authors

Data Completeness

This is a partial dataset that includes information on adult cohort participants for whom data were collected on or before September 15, 2023. Additionally, some variables with a high degree of missingness or requiring further quality control have been removed. Future releases will restore the removed variables.

Protocol Complexity

Author Acknowledgements

RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:

The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v1.p1, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
