NIH RECOVER Release Notes
Adult Observational Cohort Study: Dataset Release Notes
June 2024
Release file: phs003463.v2.p2
This release contains data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and March 15, 2024. These data were obtained from 15,204 adult participants (including a sub-cohort of 2,192 pregnant women) attending 122,029 study visits across 79 geographically dispersed enrolling sites. The dataset also includes a description of 822,310 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from 2,602 participants in the Digital Health Program. Overall this release comprises approximately 35 million rows of data and a total of 5,278 data elements. Please refer to the accompanying REDCap Codebook for a Data Dictionary organized as a list of all surveys/forms and their respective data fields. Also, please note that this release does not contain any genomic or metabolomic data.
Data De-identification Protocols
The following steps were undertaken to de-identify the dataset:
Masking of IDs using a randomly assigned number between 1 and 15,251 (the Adult cohort data contain 15,204 unique PARTICIPANT_IDs); NOTE: Participant IDs assigned in the previous data release (202309.1) were maintained in the current release.
Truncation of ZIP codes to 3 digits.
Truncation of participant date of birth and deceased date to the year.
Date shifting of all other dates in the data by up to 1 year.
Winsorization of Adult Cohort age at enrollment and date of birth information.
Removal of free text fields that did not have data entry validation in REDCap.
These steps are described in more detail below.
Participant_ID masking
Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (15204) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Note: In the previous release, participant IDs were assigned to 14,662 individuals in a random, non-contiguous fashion from the integers 1 to 14,702, leaving 40 integers unassigned at random. Of those 14,662 individuals, 7 have withdrawn from the study since the previous release, leaving their previously assigned integers vacant, for a final tally of 14,655 individuals with randomly assigned integers between 1 and 14,702 continuing into the current release.
For the current release, 549 new participants were added to the study with randomly assigned integers between 14,703 and 15,251.
Add protocol codes as the prefix for randomly assigned Participant IDs.
Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
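The carry-over of masked IDs between releases can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline RECOVER used: the function names (`extend_lookup`, `masked_id`, `mask_table`), the seed parameter, and the exact masked-ID format (protocol prefix concatenated directly with the integer) are hypothetical.

```python
import random

ADULT_PREFIX = "RA1"  # protocol code for the Adult cohort, as listed above


def extend_lookup(existing, new_ids, first_new, seed=None):
    """Keep prior masked integers; give new participants random integers
    from the contiguous block that starts at `first_new`.

    existing: dict of original PARTICIPANT_ID -> masked integer (prior release)
    new_ids:  iterable of original PARTICIPANT_IDs from the current tables
    """
    # De-duplicate and drop IDs that already have a masked integer.
    fresh = sorted(set(new_ids) - set(existing))
    rng = random.Random(seed)
    rng.shuffle(fresh)  # random reordering before numbering
    lookup = dict(existing)
    for offset, pid in enumerate(fresh):
        lookup[pid] = first_new + offset
    return lookup


def masked_id(number, prefix=ADULT_PREFIX):
    # Assumed format: prefix immediately followed by the integer.
    return f"{prefix}{number}"


def mask_table(rows, lookup):
    """Return copies of the rows with PARTICIPANT_ID replaced by the masked ID."""
    masked = []
    for row in rows:
        new_row = dict(row)
        new_row["PARTICIPANT_ID"] = masked_id(lookup[row["PARTICIPANT_ID"]])
        masked.append(new_row)
    return masked
```

The counts in the note are consistent with this scheme: 14,662 − 7 = 14,655 continuing participants plus 549 new participants gives 15,204, and the block 14,703 to 15,251 contains exactly 549 integers.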
Truncation of ZIP Codes
Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).
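A short sketch of the truncation, assuming ZIP codes are handled as strings so that leading zeros survive. The function name `truncate_zip` and the choice to treat incomplete values as missing are assumptions, not part of the documented protocol.

```python
def truncate_zip(zip_code):
    """Keep the first three digits of a ZIP code; return None if unusable."""
    if zip_code is None:
        return None
    # Ignore separators so ZIP+4 values such as "10027-1234" also work.
    digits = "".join(ch for ch in str(zip_code) if ch.isdigit())
    if len(digits) < 5:
        return None  # assumption: incomplete ZIP codes are treated as missing
    return digits[:3]
```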
Truncation of participant date of birth and deceased date
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Winsorization of participant ages
Top capping age at enrollment at 89.
Bottom capping date of birth at 1933 or 1934 (based on enrollment date).
NOTE: To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
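The truncation and capping rules above can be illustrated with a small sketch. It assumes dates are stored as year-first strings (consistent with keeping the first four characters) and that the birth-year floor maps to enrollment year as 2022 to 1933 and 2023 to 1934; that mapping, and the names `year_only` and `winsorize`, are assumptions made for illustration.

```python
AGE_CAP = 89
# Assumed mapping from enrollment year to the earliest reportable birth year.
DOB_FLOOR_BY_ENROLL_YEAR = {2022: 1933, 2023: 1934}


def year_only(date_text):
    """Keep the first four characters of a year-first date string."""
    return date_text[:4] if date_text else None


def winsorize(age, dob_text, enroll_year):
    """Return (capped_age, capped_birth_year, over_89_flag)."""
    over_cap = age is not None and age > AGE_CAP
    capped_age = min(age, AGE_CAP) if age is not None else None
    birth_year = year_only(dob_text)
    floor = DOB_FLOOR_BY_ENROLL_YEAR.get(enroll_year)
    if birth_year is not None and floor is not None:
        # Bottom-cap the birth year at the floor for this enrollment year.
        birth_year = str(max(int(birth_year), floor))
    return capped_age, birth_year, over_cap
```

The flag is what lets an analyst tell a participant who is exactly 89 from one whose age was capped at 89.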
Date shifting (All other dates)
Date shifting of all other dates was performed as follows:
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol.
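The date-shifting steps can be sketched as below. This is an illustration rather than the production code: it assumes the shifted anchor is January 1 of the anchor date's own year, and the names `anchor_offset` and `shift` are hypothetical.

```python
import random
from datetime import date, timedelta


def anchor_offset(enroll_date, rng=None):
    """Days to add to every date for one participant (zero or negative)."""
    if enroll_date is None:
        # No ENROLL_DATE: draw a random anchor within calendar year 2023.
        rng = rng or random.Random()
        enroll_date = date(2023, 1, 1) + timedelta(days=rng.randrange(365))
    # Assumption: the anchor moves to January 1 of its own year.
    shifted_anchor = date(enroll_date.year, 1, 1)
    return (shifted_anchor - enroll_date).days


def shift(d, offset_days):
    """Apply a participant's offset to one date; missing dates stay missing."""
    return None if d is None else d + timedelta(days=offset_days)
```

Because the same offset is applied to every date for a given participant, intervals between that participant's events are preserved, while calendar dates are not comparable across participants.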
The following table lists specific variables requiring de-identification, and the de-identification protocol that was applied.
Data Table | Variable | De-identification Protocol |
---|---|---|
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
Demographics | ENROLL_DATE | Date shifting |
Demographics | ENROLL_INDEX_DATE | Date shifting |
Demographics | CROSSOVER_INDEX_DATE | Date shifting |
Demographics | DOB | Truncation to year |
Demographics | WITHDRAW_DATE | Date shifting |
Demographics | DECEASED_DATE | Truncation to year |
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits |
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
Answerdata | ANSWER_TEXT_VAL | Date shifting |
Answerdata | DATA_ENTRY_DATE | Date shifting |
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
Biospecimens | COLLECTION_DATE | Date shifting |
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
Visits | VISIT_START_DATE | Date shifting |
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251 |
Fitbit | SUMMARY_DATE | Date shifting |
The following table provides a brief description of the 6 BDC files in this release.
Filename | Rows | Columns | File Size (MB) | Participants |
---|---|---|---|---|
RECOVERAdult_BDC_202403_answerdata_deid.tsv | 24,043,845 | 14 | 5,887 | 15,176 |
RECOVERAdult_BDC_202403_biospecimens_deid.tsv | 822,310 | 12 | 77 | 13,460 |
RECOVERAdult_BDC_202403_concepts_deid.tsv | 173,835 | 8 | 106 | --- |
RECOVERAdult_BDC_202403_demographics_deid.tsv | 15,204 | 20 | 3 | 15,204 |
RECOVERAdult_BDC_202403_fitbit_deid.tsv | 9,889,039 | 6 | 1,030 | 2,602 |
RECOVERAdult_BDC_202403_visits_deid.tsv | 122,029 | 7 | 8 | 15,185 |
Total | 35,066,262 | 67 | 7,112 | |
The following table provides a brief description of the biospecimens included in this release.
Collected Biospecimens | Stored Biospecimens |
---|---|
Stool | Stool |
Urine | Urine |
Nasal/NP swab | Nasal or NP Cells |
Oragene 600 – Saliva | Saliva |
Serum Separator Tube (SST) – Whole Blood | Serum |
Sodium Citrate – Whole Blood | Plasma |
PAXgene RNA – Whole Blood | Whole Blood |
Cell Preservation Tube (CPT) – Whole Blood | Peripheral Blood Mononuclear Cells (PBMCs) |
EDTA – Whole Blood | Plasma, White Blood Cells (WBCs) |
The following table lists all 85 REDCap forms in the dataset and the number of rows of data (recorded variable values) associated with each form. The total equals the row count of the answerdata file.
FORM_NAME | Total Variables per Form |
---|---|
acth_and_cortisol_test | 70,243 |
adult_delivery_and_outcome_form | 5,749 |
adult_echo_data | 6,173 |
alcohol_and_tobacco | 162,890 |
alcohol_and_tobacco_followup | 381,796 |
assessment_scores | 1,866,843 |
audiometry_survey | 1,188 |
biospecimens | 485,333 |
bmri_import | 39,678 |
brain_mri_quality_confirmation | 2,703 |
brain_mri_with_gadolinium | 23,510 |
cardiac_mri | 789 |
cardiac_mri_reading_center | 51 |
cardiopulmonary_exercise_testing | 3,283 |
cardiovagal_innervation_testing | 638 |
change_in_symptoms_since_infection | 21,439 |
chest_ct | 47,387 |
chest_ct_reading_center | 185 |
clinical_labs | 92,771 |
colonoscopy | 90 |
comorbidities | 1,923,579 |
comprehensive_audiometry | 51,936 |
covid_treatment | 191,414 |
cpet_reading_center | 4 |
demographics | 115,510 |
disability | 97,810 |
drc_data | 9,909 |
echocardiogram_with_strain | 73,980 |
electrocardiogram | 91,215 |
electromyography | 112 |
end_of_participation | 5,862 |
endopat_testing | 8,164 |
enrollment | 160,289 |
facility_sleep_questionnaire_morning_after | 416 |
facility_sleep_questionnaire_night_before | 1,640 |
facility_sleep_study | 2,152 |
fibroscan | 17,927 |
formal_neuropsychological_testing | 12,294 |
full_ent_examination | 1,253 |
gastric_emptying_study | 115 |
hepatitis_tests | 37,936 |
home_polysomnography_with_ess_and_isi | 22,170 |
home_sleep_assessment | 32,266 |
hsat_data | 85,222 |
long_covid_treatment_trial | 90,359 |
medication_changes | 122,181 |
medications | 64,630 |
mhp_data | 60,104 |
mini | 83,729 |
mini_prequestionnaire | 4,447 |
neonatal_delivery_and_outcome_form | 3,754 |
nerve_conduction_study | 157 |
neuropathy_examination | 130,245 |
new_covid_infection | 41,621 |
nih_toolbox | 59,473 |
oral_glucose_test | 36,264 |
pasc_symptoms | 7,714,508 |
pcl5 | 21,548 |
pft_reading_center | 6,768 |
pg13r | 1,084 |
plasma_catecholamine_testing | 630 |
pregnancy | 109,716 |
pregnancy_followup | 86,875 |
psg_data | 6,474 |
psg_quality_summary_form | 2,164 |
pulmonary_function_tests | 85,186 |
recent_covid_treatment | 60,207 |
rehabilitation_testing | 82,645 |
renal_ultrasound | 19,464 |
research_labs | 4,012,350 |
serum_b12_and_methylmalonic_acid | 31,639 |
six_minute_walk_test | 101,931 |
skin_biopsy | 131 |
sleep_reading_center | 25,471 |
social_determinants_of_health | 801,331 |
social_determinants_of_health_followup | 678,872 |
study_termination | 96 |
tier_12_consent_tracking | 113,913 |
tier_1_office_visit | 1,719,894 |
tilt_table_test | 636 |
upsit_smell_test | 184,618 |
vaccine_status | 276,537 |
vision_testing | 64,756 |
visit_form | 978,660 |
wearable_data | 1,910 |
withdrawal | 953 |
Total Variables | 24,043,845 |
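A per-form tally like the one above could be reproduced from the answerdata file with a streaming count. This sketch assumes the file has a header row and a column literally named `FORM_NAME`; the column name and the function `rows_per_form` are assumptions and should be checked against the codebook.

```python
import csv
from collections import Counter


def rows_per_form(tsv_lines, form_column="FORM_NAME"):
    """Count data rows per form from an iterable of TSV lines (header first)."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        counts[row[form_column]] += 1
    return counts
```

An open file handle can be passed directly, for example `rows_per_form(open(path, newline="", encoding="utf-8"))`, so the file, which is several gigabytes, is never loaded into memory at once.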
Data Quality
Data quality issues found in this dataset are summarized below.
File | Error Type | N Rows |
---|---|---|
RECOVERAdult_BDC_202403_demographics_deid.tsv | 1. Missing enroll_date | 39 |
RECOVERAdult_BDC_202403_demographics_deid.tsv | 2. Missing DOB | 82 |
RECOVERAdult_BDC_202403_biospecimens_deid.tsv | 3. Data entry error in collection_date, reported as a future event | 7 |
RECOVERAdult_BDC_202403_visits_deid.tsv | 4. Data entry error in visit_start_date, reported as a future event | 17 |
RECOVERAdult_BDC_202403_fitbit_deid.tsv | 5. Data entry error in summary_date, reported as a future event | 6 |
RECOVERAdult_BDC_202403_answerdata_deid.tsv | 6. Data entry error in data_entry_date, reported as a future event | 870 |
RECOVERAdult_BDC_202403_answerdata_deid.tsv | 7. Tab characters present in CONCEPT_NAME variable, causing issues when reading file as tab-separated values | 53 |
RECOVERAdult_BDC_202403_concepts_deid.tsv | 8. Concept codes repeated across multiple entries | 4 |
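Rows affected by embedded tab characters (issue 7) can be located before parsing by checking the field count of each line. The sketch below assumes the file uses no quoting and has no embedded newlines; the function name `find_malformed_rows` is hypothetical. For the answerdata file, the expected count would be 14, the number of columns listed for that file.

```python
def find_malformed_rows(tsv_lines, expected_fields):
    """Yield (line_number, field_count) for lines whose tab-split field
    count differs from expected_fields. Line numbers are 1-based and
    include the header line."""
    for line_number, line in enumerate(tsv_lines, start=1):
        n_fields = len(line.rstrip("\r\n").split("\t"))
        if n_fields != expected_fields:
            yield line_number, n_fields
```

Flagged lines can then be inspected or repaired by hand rather than silently shifting values into the wrong columns.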
Information for Authors
The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v2.p2, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
April 2024
Adult Observational Cohort Study: Dataset Release Notes
Release file: phs003463.v1.p1
This release contains subsets of data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and September 15, 2023. These data were obtained from 14,662 participants attending 92,355 study visits across 79 geographically dispersed enrolling sites. The dataset also includes an inventory of 611,882 biospecimens collected at various timepoints, wearable sensor data from the digital health program for 195 participants, and a total of 3,175 data elements. Please refer to the RECOVER Data Dictionary/REDCap Codebook for this release for a list of all surveys/forms and their respective data fields.
Data De-identification Protocols
The following steps were undertaken to de-identify the dataset:
Masking of IDs using a randomly assigned number between 1 and 14702 (number of unique PARTICIPANT_IDs within the Adult cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates in the data by up to 1 year
Winsorization of Adult Cohort age at enrollment and date of birth information
Removal of free text fields that did not have data entry validation in REDCap
These steps are discussed in more detail below.
Participant_ID masking
Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform deduplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
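The initial construction of the lookup table and the VISIT_ID clean-up can be sketched as follows. The layout of VISIT_ID is not described in these notes, so the assumption that it begins with the PARTICIPANT_ID followed by an underscore is purely illustrative, as are the names `build_lookup` and `strip_participant`.

```python
import random


def build_lookup(tables, seed=None):
    """Build original PARTICIPANT_ID -> sequential integer after a shuffle.

    tables: iterable of row lists; each row is a dict with PARTICIPANT_ID.
    """
    # Collect IDs from every table and de-duplicate them.
    unique_ids = sorted({row["PARTICIPANT_ID"] for rows in tables for row in rows})
    random.Random(seed).shuffle(unique_ids)  # random reordering
    # Join the shuffled IDs to the sequence 1..N.
    return {pid: number for number, pid in enumerate(unique_ids, start=1)}


def strip_participant(visit_id, participant_id, sep="_"):
    """Remove a leading '<participant_id><sep>' from VISIT_ID if present."""
    prefix = f"{participant_id}{sep}"  # assumed VISIT_ID layout
    return visit_id[len(prefix):] if visit_id.startswith(prefix) else visit_id
```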
Truncation of ZIP Codes
Retain a substring of ZIP codes from the first to the third numbers (removing the fourth and fifth numbers).
Truncation of participant date of birth and deceased date
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Date shifting (All other dates)
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table were shifted according to this protocol. The concepts that correspond to date values within the Answerdata table are provided in the file “RECOVERAdult_i2b2_concepts_BDC_dateshift.xlsx”.
Winsorization of participant ages
To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
The following table details the variables that were de-identified, and the de-identification protocol that was applied.
Data Table | Variable | De-Identification Protocol |
---|---|---|
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
Demographics | ENROLL_DATE | Date shifting |
Demographics | ENROLL_INDEX_DATE | Date shifting |
Demographics | CROSSOVER_INDEX_DATE | Date shifting |
Demographics | DOB | Truncation to year; bottom capping at 1933 or 1934 |
Demographics | AGE_AT_ENROLLMENT | Top capping at 89 |
Demographics | WITHDRAW_DATE | Date shifting |
Demographics | DECEASED_DATE | Truncation to year |
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits |
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
Answerdata | ANSWER_TEXT_VAL | Date shifting |
Answerdata | DATA_ENTRY_DATE | Date shifting |
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
Biospecimens | COLLECTION_DATE | Date shifting |
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID |
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
Visits | VISIT_START_DATE | Date shifting |
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14702 |
Fitbit | SUMMARY_DATE | Date shifting |
Included Data
The forms in this release are inclusive of baseline (first) enrollment visits and all subsequent follow-up visits through September 15, 2023. Collectively they represent 8.9 million rows of data (54% of the REDCap data) and were selected as they have a limited number of outstanding data queries (outside of missingness). When combined with the wearable sensor data and biospecimen inventory data, the release includes 10.1 million rows of data.
FORM_NAME | n |
---|---|
alcohol_and_tobacco | 156,454 |
alcohol_and_tobacco_followup | 251,039 |
assessment_scores | 1,281,009 |
covid_treatment | 183,355 |
demographics | 110,778 |
disability | 93,956 |
end_of_participation | 3,771 |
enrollment | 154,305 |
long_covid_treatment_trial | 38,801 |
new_covid_infection | 26,366 |
pasc_symptoms | 5,081,366 |
pregnancy | 103,109 |
pregnancy_followup | 54,437 |
recent_covid_treatment | 41,019 |
social_determinants_of_health | 769,434 |
social_determinants_of_health_followup | 445,687 |
study_termination | 96 |
tier_12_consent_tracking | 97,222 |
visit_form | 749 |
withdrawal | 961 |
Important Information for Authors
The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:
Data Completeness
This is a partial dataset that includes information on adult cohort participants for whom data were collected on or before September 15, 2023. Additionally, some variables with a high degree of missingness or requiring further quality control have been removed. Future releases will restore these redactions.
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v1.p1, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.