NIH RECOVER Release Notes
Adult Observational Cohort Study: Dataset Release Notes
October 2024
RECOVER Pediatric Observational Cohort Study
Release file: phs003461.v1.p1
This release contains data collected between March 17, 2022 and June 15, 2024 from the Pediatric Observational Cohort Study (“RECOVER-Pediatrics”) of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative. These data were obtained from 24,621 participants attending 62,921 study visits across 105 geographically dispersed enrolling sites. The dataset includes descriptions of 85,858 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from approximately 400 participants in the RECOVER Digital Health Program. Taken together, this release consists of 8,604 data elements (variables) and approximately 13.6 million datapoints.
RECOVER-Pediatrics is an observational meta-cohort study of caregiver-child pairs (birth through 17 years) and young adults (18 through 25 years). As described in the study protocol, the pediatric meta-cohort consists of four distinct cohorts: (1) a de novo RECOVER prospective cohort including children and young adults ages birth through 25 years, with or without a known history of SARS-CoV-2 infection, and their respective caregivers; (2) an extant cohort from the Adolescent Brain Cognitive Development (ABCD) study, the largest long-term US study of brain development in adolescence; (3) an in utero exposure cohort, including children less than 3 years old born to individuals with and without a SARS-CoV-2 infection during pregnancy; and (4) an extant cohort from the NHLBI Study on Long-terM OUtcomes after the Multisystem Inflammatory Syndrome In Children (MUSIC).
Importantly, this release includes data underlying the first publication of primary results from the RECOVER-Pediatrics study: Gross et al., Characterizing Long COVID in Children and Adolescents. JAMA 2024.12747, August 21, 2024.
Detailed descriptions of the data elements used in RECOVER-Pediatrics can be found in three separate data dictionaries.
These codebooks are organized as a list of all surveys/forms with their respective data elements. Also available is a listing of all surveys and questions in RECOVER-Pediatrics.
Please note that this release does not contain any genomic or metabolomic data.
RELEASE NOTES FOR THE “PEDIATRIC-MAIN” STUDY (Includes Age 13-17 and 18-25 sub-studies, plus data from the ABCD and MUSIC extant cohorts)
Data Quality
The Pediatric-Main study consists of approximately 9.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.
These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.
Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact BioData Catalyst Support with any questions or concerns regarding this release of RECOVER data.
De-Identification Process
The following steps were taken to de-identify the RECOVER Pediatric-Main observational cohort data:
Masking of IDs using a randomly assigned number between 1 and 14097 (number of unique PARTICIPANT_IDs within the Pediatric cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment
Masking of Pediatric Caregiver IDs on the Answerdata table according to corresponding masked IDs from the Pediatric Caregiver cohort
Participant IDs
Masking of PARTICIPANT_IDs was performed according to the following protocol:
Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14097) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Note: From the previous release, participant IDs were assigned to 10549 individuals in a random fashion from integers between 1 – 10549. 16 individuals were dropped from that release, so there are 16 random integers that are unassigned in the lookup table for this release.
For this release, new IDs were assigned to participants covering 10550 - 14097.
Add protocol codes as the prefix for randomly assigned Participant IDs.
Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
VISIT_TYPE on the Visits table contains dates. Those dates have been removed in the de-identified data, with the rest of VISIT_TYPE retained.
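The ID masking steps above can be sketched in Python as follows. This is a minimal illustration, not the RECOVER pipeline itself: the function names `extend_lookup` and `mask_rows`, the `seed` parameter, and the exact way the protocol code is joined to the number (plain concatenation, no padding) are assumptions made for the example.

```python
import random


def extend_lookup(previous, all_ids, prefix, seed=None):
    """Build {original_id: masked_id}, keeping numbers from an earlier release.

    previous: dict mapping original ID -> integer assigned in an earlier release
    all_ids:  original IDs gathered from every table (duplicates allowed)
    prefix:   protocol code such as "RP2"
    """
    rng = random.Random(seed)
    unique_ids = sorted(set(all_ids))  # de-duplicate across tables

    # Participants seen before keep their numbers. Anyone who dropped out is
    # simply absent, so their number stays unassigned.
    numbers = {pid: previous[pid] for pid in unique_ids if pid in previous}

    # New participants are randomly reordered, then numbered sequentially
    # starting after the highest number used so far.
    new_ids = [pid for pid in unique_ids if pid not in previous]
    rng.shuffle(new_ids)
    start = max(previous.values(), default=0) + 1
    for offset, pid in enumerate(new_ids):
        numbers[pid] = start + offset

    # Assumed format: protocol code immediately followed by the number.
    return {pid: f"{prefix}{number}" for pid, number in numbers.items()}


def mask_rows(rows, lookup):
    """Replace the original PARTICIPANT_ID in each row with its masked ID."""
    masked = []
    for row in rows:
        new_row = dict(row)
        new_row["PARTICIPANT_ID"] = lookup[new_row["PARTICIPANT_ID"]]
        masked.append(new_row)
    return masked
```

Overwriting the PARTICIPANT_ID column in place has the same effect as merging the lookup table and then dropping the original column: only the masked ID remains in the output.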
ZIP Codes
ZIP Codes were truncated to 3 digits according to the following protocol:
Keep a substring of ZIP codes from the first to the third characters.
Dates of Birth and Ages
Date of birth and deceased dates were truncated according to the following protocol:
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Top capping age at enrollment at 89
Bottom capping date of birth at 1933 or 1934 (based on enrollment date)
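The two substring truncations can be illustrated with a short Python sketch. It assumes that ZIP codes and dates are stored as text and that dates begin with a four-digit year (for example `YYYY-MM-DD`); the function names are hypothetical.

```python
def truncate_zip(zip_code):
    """Keep the first three characters of a ZIP code.

    ZIP codes are assumed to be strings so that leading zeros survive;
    missing or empty values are returned as None.
    """
    if not zip_code:
        return None
    return str(zip_code)[:3]


def truncate_to_year(date_text):
    """Keep the first four characters of a date string (the year portion)."""
    if not date_text:
        return None
    return str(date_text)[:4]
```

For example, `truncate_zip("02139-4307")` keeps `"021"` and `truncate_to_year("2015-06-30")` keeps `"2015"`.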
Remaining Dates
Date shifting of all other dates was performed according to the following protocol:
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table were shifted according to this protocol. The concepts that correspond to date values within the Answerdata table are listed in the file “RECOVERPediatric_BDC_202406_concepts_deID.csv”.
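A minimal Python sketch of the date-shifting protocol is given below. One point is an interpretation rather than something stated explicitly: the anchor is moved to January 1 of its own calendar year, which keeps every shift under one year. The function names and the optional `rng` argument are hypothetical.

```python
import random
from datetime import date, timedelta


def anchor_offset(enroll_date, rng=None):
    """Return the timedelta that moves the anchor date to January 1 of its year."""
    if enroll_date is None:
        # No ENROLL_DATE: draw a random anchor date within 2023 (365 days).
        rng = rng or random.Random()
        enroll_date = date(2023, 1, 1) + timedelta(days=rng.randrange(365))
    shifted_anchor = date(enroll_date.year, 1, 1)  # assumed target of the shift
    return shifted_anchor - enroll_date            # zero or negative


def shift_date(value, offset):
    """Apply a participant's offset to one date; missing dates stay missing."""
    if value is None:
        return None
    return value + offset
```

Because every date for a participant moves by the same number of days, intervals between that participant's events (for example, days from enrollment to a follow-up visit) are unchanged by the shift.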
Summary
The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.
Table | Variable | De-identification protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Pediatric Caregiver IDs
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | VISIT_TYPE | Removal of dates
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14097
Fitbit | SUMMARY_DATE | Date shifting
Information for Authors
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
RELEASE NOTES FOR THE “PEDIATRIC-CAREGIVER” STUDY
Data Quality
The Pediatric-Caregiver study consists of approximately 2.8 million datapoints that were generated from surveys and biospecimen collections (Saliva and Tasso blood spot). Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.
These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.
Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact BioData Catalyst Support with any questions or concerns regarding this release of RECOVER data.
De-Identification Process
The following steps were taken to de-identify the RECOVER Pediatric Caregiver cohort data:
Masking of IDs using a randomly assigned number between 1 and 8820 (number of unique PARTICIPANT_IDs within the Pediatric Caregiver cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment
Participant IDs
Masking of PARTICIPANT_IDs was performed according to the following protocol:
Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric Caregiver cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (8820) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Note: From the previous release, participant IDs were assigned to 6714 individuals in a random fashion from integers between 1 – 6714. 6 individuals were dropped from that release, so there are 6 random integers that are unassigned in the lookup table for this release.
For this release, new IDs were assigned to participants covering 6715 - 8820.
Add protocol codes as the prefix for randomly assigned Participant IDs.
Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric Caregiver cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
ZIP Codes
ZIP codes were truncated to 3 digits according to the following protocol:
Keep a substring of ZIP codes from the first to the third characters.
Dates of Birth and Ages
Participant date of birth and deceased date were truncated according to the following protocol:
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Top capping age at enrollment at 89
Bottom capping date of birth at 1933 or 1934 (based on enrollment date)
Remaining Dates
Date shifting of all other dates was performed according to the following protocol:
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCaregiver_BDC_202406_concepts_deID.csv”
Summary
The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.
Table | Variable | De-identification protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 8820
Visits | VISIT_START_DATE | Date shifting
Information for Authors
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
RELEASE NOTES FOR THE “PEDIATRIC-CONGENITAL” STUDY
Data Quality
The Pediatric-Congenital study consists of approximately 0.9 million datapoints that were generated from surveys and laboratory tests, biospecimens, and metadata from diagnostic studies such as electrocardiograms, cardiac MRIs and other imaging modalities. Data quality management processes (automated queries and periodic reviews with study staff) have been employed throughout data collection, ingestion, deidentification and preparation for release to the public.
These processes have focused on evaluating data completeness and validity. Source data may contain legitimate “missing” variables due to branching logic or participants opting to not answer, and missing dates will propagate to the date-shifted variables as null entries. Invalid data may be due to manual entry errors (including incorrect formatting) or flawed data checks. Central to the data quality management effort has been a validation querying process that is run against all ingested data, and which triggers automated alerts to study sites if a datapoint is missing, out of range, inconsistent, or potentially erroneous.
Due to the volume of data and complexity of data types, it is difficult to assess the overall completeness and validity of RECOVER datasets. Based on various analytics and observations, we estimate the overall rate of data missingness to be less than 5%, and the rate of data validation issues to be less than 0.01%. Users are encouraged to contact BioData Catalyst Support with any questions or concerns regarding this release of RECOVER data.
De-Identification Process
The following steps were taken to de-identify the RECOVER Pediatric Congenital cohort data:
Masking of IDs using a randomly assigned number between 1 and 1702 (number of unique PARTICIPANT_IDs within the Pediatric Congenital cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates within the data within a range of 1 year
Winsorization of date of birth and age at enrollment
Masking of Adult IDs on the Answerdata table according to corresponding masked IDs from the Adult cohort
Participant IDs
We perform masking of PARTICIPANT_IDs according to the following protocol:
Participant IDs assigned in the previous data release (202309.1) are maintained.
Extract PARTICIPANT_IDs from each table within the Pediatric Congenital cohort (currently Answerdata, Biospecimens, Concepts, Demographics, and Visits) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (1702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Note: From the previous release, participant IDs were assigned to 995 individuals in a random fashion from integers between 1 – 995.
For this release, new IDs were assigned to participants covering 996 - 1702.
Add protocol codes as the prefix for randomly assigned Participant IDs.
Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Pediatric Congenital cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
ZIP Codes
Truncation of ZIP codes to 3 digits was performed according to the following protocol:
Keep a substring of ZIP codes from the first to the third characters.
Dates of Birth and Ages
We perform truncation of participant date of birth and deceased date according to the following protocol:
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Top capping age at enrollment at 89
Bottom capping date of birth at 1933 or 1934 (based on enrollment date)
Remaining Dates
We perform date shifting of all other dates according to the following protocol:
We select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, we generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
We shift the anchor date to the first day of the year, i.e., 2023/01/01.
We calculate the difference in days from the original to the shifted anchor date.
We recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table are shifted according to this protocol. We provide the concepts that correspond to date values within the Answerdata table in the file “RECOVERPediatricCongential_BDC_202406_concepts_deID.csv”
Summary
The following table lists the variables that are de-identified, along with the de-identification protocol that was applied.
Table | Variable | De-identification protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting; ID masking for values that correspond with Adult IDs
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 1702
Visits | VISIT_START_DATE | Date shifting
Information for Authors
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Pediatric Observational Cohort Study protocol, design publication, and participant surveys; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Pediatric Observational Cohort dataset version phs003461.v1.p1, supported by 1OT2HL156812, OT2HL161847, and 1OT2HL161841 awards from the NIH. [If applicable:] This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
June 2024
Adult Observational Cohort Study: Release Notes
Release file: phs003463.v2.p2
This release contains data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and March 15, 2024. These data were obtained from 15,204 adult participants (including a sub-cohort of 2,192 pregnant women) attending 122,029 study visits across 79 geographically dispersed enrolling sites. The dataset also includes a description of 822,310 biospecimens collected during baseline and follow-up visits, plus wearable sensor metadata from 2,602 participants in the Digital Health Program. Overall this release comprises approximately 35 million rows of data and a total of 5,278 data elements. Please refer to the accompanying REDCap Codebook for a Data Dictionary organized as a list of all surveys/forms and their respective data fields. Also, please note that this release does not contain any genomic or metabolomic data.
Data De-identification Protocols
The following steps were undertaken to de-identify the dataset:
Masking of IDs using a randomly assigned number between 1 and 15,204 (number of unique PARTICIPANT_IDs within the Adult cohort data); NOTE: Participant IDs assigned to the previous data release (202309.1) were maintained in the current release.
Truncation of ZIP codes to 3 digits.
Truncation of participant date of birth and deceased date to the year.
Date shifting of all other dates within the data within a range of 1 year.
Winsorization of Adult Cohort age at enrollment and date of birth information.
Removal of free text fields that did not have data entry validation in REDCap.
These steps are described in more detail below.
Participant_ID masking
Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform de-duplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (15204) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Note: In the previous release, participant IDs were assigned to 14,662 individuals in a random, non-contiguous fashion from integers between 1 – 14702, leaving an excess of 40 integers unassigned at random. Of those 14,662 individuals, 7 have withdrawn from the study since the previous release, leaving their previously assigned integers vacant, for a final tally of 14,655 individuals with randomly assigned integers between 1 – 14702 continuing into the current release.
For the current release, 549 new participants were added to the study with randomly assigned integers between 14703 – 15251.
Add protocol codes as the prefix for randomly assigned Participant IDs.
Adult (RA1), Pediatric (RP2), Autopsy (RD3), Congenital (RG4), Caregiver (RC5)
Where R=Recover, A and 1 = Adult, P and 2 = Pediatric, D and 3 = Autopsy, G and 4 = Congenital, C and 5 = Caregiver.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
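The counts quoted in the note on previous and new ID assignments can be reconciled with a few lines of arithmetic. The sketch below uses only figures stated in these notes; the variable names are illustrative. It shows why masked IDs run up to 15251 while the release contains 15,204 unique participants.

```python
previous_assigned = 14_662     # individuals given IDs in the previous release
previous_range_max = 14_702    # highest integer available in the previous release
withdrawn = 7                  # withdrew since the previous release
new_first, new_last = 14_703, 15_251   # integer range used for new participants

unassigned_before = previous_range_max - previous_assigned  # integers never used
continuing = previous_assigned - withdrawn                  # carried into this release
new_participants = new_last - new_first + 1                 # size of the new range
total_participants = continuing + new_participants          # unique participants now
vacant_now = new_last - total_participants                  # gaps in 1..15251

print(unassigned_before, continuing, new_participants, total_participants, vacant_now)
# prints: 40 14655 549 15204 47
```

The 47 vacant integers are the 40 that were never assigned plus the 7 freed by withdrawals.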
Truncation of ZIP Codes
Retain a substring of ZIP codes from the first to the third digits (removing the fourth and fifth digits).
Truncation of participant date of birth and deceased date
Keep a substring of date of birth and deceased date from the first to the fourth characters, corresponding to the year portion of the date.
Top capping age at enrollment at 89.
Bottom capping date of birth at 1933 or 1934 (based on enrollment date).
NOTE: To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
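The capping rule in the note above can be sketched as follows. The function name and the returned tuple layout are hypothetical. The floor on birth year is computed as enrollment year minus 89, which is an inference that reproduces the two stated values (1933 for 2022 and 1934 for 2023), not a formula given in these notes.

```python
def cap_age_and_birth_year(age_at_enrollment, birth_year, enroll_year, cap=89):
    """Return (capped_age, capped_birth_year, over_cap_flag).

    The flag records whether the true age exceeded the cap, so that a capped
    participant can be told apart from one who is exactly 89.
    """
    over_cap = age_at_enrollment > cap
    capped_age = min(age_at_enrollment, cap)
    # Assumed rule: earliest reportable birth year is enroll_year - cap,
    # i.e. 1933 for 2022 enrollments and 1934 for 2023 enrollments.
    earliest_birth_year = enroll_year - cap
    capped_birth_year = max(birth_year, earliest_birth_year)
    return capped_age, capped_birth_year, over_cap
```

For instance, a participant aged 93 who enrolled in 2022 would be reported as age 89, birth year 1933, with the over-89 flag set.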
Date shifting (All other dates)
Date shifting of all other dates was performed as follows:
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year of 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
Shift all date values within ANSWER_TEXT_VAL on the Answerdata table according to this protocol.
The following table lists specific variables requiring de-identification, and the de-identification protocol that was applied.
Table | Variable | De-identification protocol
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 15251
Fitbit | SUMMARY_DATE | Date shifting
The following table provides a brief description of the 6 BDC files in this release.
RECOVERAdult_BDC_202403_answerdata_deid.tsv | 24,043,845 | 14 | 5,887 | 15,176
RECOVERAdult_BDC_202403_biospecimens_deid.tsv | 822,310 | 12 | 77 | 13,460
RECOVERAdult_BDC_202403_concepts_deid.tsv | 173,835 | 8 | 106 | ---
RECOVERAdult_BDC_202403_demographics_deid.tsv | 15,204 | 20 | 3 | 15,204
RECOVERAdult_BDC_202403_fitbit_deid.tsv | 9,889,039 | 6 | 1,030 | 2,602
RECOVERAdult_BDC_202403_visits_deid.tsv | 122,029 | 7 | 8 | 15,185
Total | 35,066,262 | 67 | 7,112 |
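A quick sanity check after downloading one of these files is to count its rows, columns, and distinct masked participant IDs. The sketch below is one way to do that with the Python standard library. It assumes each file is tab-delimited with a header row and that participant-level files carry a PARTICIPANT_ID column; the function name and the sample rows are made up for illustration.

```python
import csv
import io


def summarize_tsv(handle):
    """Count data rows, columns, and distinct PARTICIPANT_ID values in a TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = 0
    participants = set()
    for record in reader:
        rows += 1
        pid = record.get("PARTICIPANT_ID")
        if pid:
            participants.add(pid)
    columns = len(reader.fieldnames or [])
    return rows, columns, len(participants)


# Made-up sample standing in for a real file such as the visits table.
sample = io.StringIO(
    "PARTICIPANT_ID\tVISIT_ID\tVISIT_START_DATE\n"
    "RA11\tv1\t2023-01-01\n"
    "RA11\tv2\t2023-04-02\n"
    "RA12\tv1\t2023-01-01\n"
)
print(summarize_tsv(sample))  # prints: (3, 3, 2)
```

For a real file, the handle would come from something like `open("RECOVERAdult_BDC_202403_visits_deid.tsv", newline="", encoding="utf-8")`; the encoding is an assumption. Reading row by row avoids loading the larger files fully into memory.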
The following table provides a brief description of the biospecimens included in this release.
Stool | Stool
Urine | Urine
Nasal/NP swab | Nasal or NP Cells
Oragene 600 – Saliva | Saliva
Serum Separator Tube (SST) – Whole Blood | Serum
Sodium Citrate – Whole Blood | Plasma
PAXgene RNA – Whole Blood | Whole Blood
Cell Preservation Tube (CPT) – Whole Blood | Peripheral Blood Mononuclear Cells (PBMCs)
EDTA – Whole Blood | Plasma; White Blood Cells (WBCs)
The following table lists all 85 REDCap forms in the dataset, and the corresponding number of variables/rows of data associated with each form.
REDCap form | Variables/rows of data
acth_and_cortisol_test | 70,243
adult_delivery_and_outcome_form | 5,749
adult_echo_data | 6,173
alcohol_and_tobacco | 162,890
alcohol_and_tobacco_followup | 381,796
assessment_scores | 1,866,843
audiometry_survey | 1,188
Biospecimens | 485,333
bmri_import | 39,678
brain_mri_quality_confirmation | 2,703
brain_mri_with_gadolinium | 23,510
cardiac_mri | 789
cardiac_mri_reading_center | 51
cardiopulmonary_exercise_testing | 3,283
cardiovagal_innervation_testing | 638
change_in_symptoms_since_infection | 21,439
chest_ct | 47,387
chest_ct_reading_center | 185
clinical_labs | 92,771
Colonoscopy | 90
Comorbidities | 1,923,579
comprehensive_audiometry | 51,936
covid_treatment | 191,414
cpet_reading_center | 4
Demographics | 115,510
Disability | 97,810
drc_data | 9,909
echocardiogram_with_strain | 73,980
electrocardiogram | 91,215
electromyography | 112
end_of_participation | 5,862
endopat_testing | 8,164
Enrollment | 160,289
facility_sleep_questionnaire_morning_after | 416
facility_sleep_questionnaire_night_before | 1,640
facility_sleep_study | 2,152
Fibroscan | 17,927
formal_neuropsychological_testing | 12,294
full_ent_examination | 1,253
gastric_emptying_study | 115
hepatitis_tests | 37,936
home_polysomnography_with_ess_and_isi | 22,170
home_sleep_assessment | 32,266
hsat_data | 85,222
long_covid_treatment_trial | 90,359
medication_changes | 122,181
Medications | 64,630
mhp_data | 60,104
Mini | 83,729
mini_prequestionnaire | 4,447
neonatal_delivery_and_outcome_form | 3,754
nerve_conduction_study | 157
neuropathy_examination | 130,245
new_covid_infection | 41,621
nih_toolbox | 59,473
oral_glucose_test | 36,264
pasc_symptoms | 7,714,508
pcl5 | 21,548
pft_reading_center | 6,768
pg13r | 1,084
plasma_catecholamine_testing | 630
Pregnancy | 109,716
pregnancy_followup | 86,875
psg_data | 6,474
psg_quality_summary_form | 2,164
pulmonary_function_tests | 85,186
recent_covid_treatment | 60,207
rehabilitation_testing | 82,645
renal_ultrasound | 19,464
research_labs | 4,012,350
serum_b12_and_methylmalonic_acid | 31,639
six_minute_walk_test | 101,931
skin_biopsy | 131
sleep_reading_center | 25,471
social_determinants_of_health | 801,331
social_determinants_of_health_followup | 678,872
study_termination | 96
tier_12_consent_tracking | 113,913
tier_1_office_visit | 1,719,894
tilt_table_test | 636
upsit_smell_test | 184,618
vaccine_status | 276,537
vision_testing | 64,756
visit_form | 978,660
wearable_data | 1,910
Withdrawal | 953
Total Variables | 24,043,845
Data Quality
The data quality issues found in this dataset are summarized below.
RECOVERAdult_BDC_202403_demographics_deID | 1. Missing enroll_date | 39
RECOVERAdult_BDC_202403_demographics_deID | 2. Missing DOB | 82
RECOVERAdult_BDC_202403_biospecimens_deID | 3. Data entry error in collection_date, reported as a future event | 7
RECOVERAdult_BDC_202403_visits_deID | 4. Data entry error in visit_start_date, reported as a future event | 17
RECOVERAdult_BDC_202403_fitbit_deID | 5. Data entry error in summary_date, reported as a future event | 6
RECOVERAdult_BDC_202403_answerdata_deID | 6. Data entry error in data_entry_date, reported as a future event | 870
RECOVERAdult_BDC_202403_answerdata_deID | 7. Tab characters present in CONCEPT_NAME variable, causing issues when reading file as tab-separated values | 53
RECOVERAdult_BDC_202403_concepts | 8. Concept codes repeated across multiple entries | 4
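Issue 7 can be screened for before the Answerdata file is parsed. The following is a minimal sketch, not part of any RECOVER tooling: it counts the tab-separated fields on each line and reports the lines whose count differs from the header row. The function name find_misaligned_rows and the three-column sample, including its concept names, are invented for illustration; the released file has more columns.

```python
import io

def find_misaligned_rows(lines, delimiter="\t"):
    """Return the header field count and a list of (line_number, field_count)
    for every data line whose field count differs from the header."""
    lines = iter(lines)
    header = next(lines).rstrip("\r\n").split(delimiter)
    expected = len(header)
    misaligned = []
    # Data lines start at line 2 because line 1 is the header.
    for line_number, line in enumerate(lines, start=2):
        count = len(line.rstrip("\r\n").split(delimiter))
        if count != expected:
            misaligned.append((line_number, count))
    return expected, misaligned

# Toy sample (invented values): the second data row has a stray tab
# inside CONCEPT_NAME, so it splits into four fields instead of three.
sample = io.StringIO(
    "PARTICIPANT_ID\tCONCEPT_NAME\tANSWER_TEXT_VAL\n"
    "1\tFatigue\tYes\n"
    "2\tLoss of\tsmell\tNo\n"
)
expected, misaligned = find_misaligned_rows(sample)
```

Rows flagged this way can then be inspected or repaired by hand before the file is loaded as tab-separated values.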
Information for Authors
The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v2.p2, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
April 2024
Adult Observational Cohort Study: Dataset Release Notes
Release file: phs003463.v1.p1
This release contains subsets of data collected from the NIH RECOVER Adult Cohort Observational Study between October 29, 2021 and September 15, 2023. These data were obtained from 14,662 participants attending 92,355 study visits across 79 geographically dispersed enrolling sites. The dataset also includes an inventory of 611,882 biospecimens collected at various timepoints, wearable sensor data from the digital health program for 195 participants, and a total of 3,175 data elements. Please refer to the RECOVER Data Dictionary/REDCap Codebook for this release for a list of all surveys/forms and their respective data fields.
Data De-identification Protocols
The following steps were undertaken to de-identify the dataset:
Masking of IDs using a randomly assigned number between 1 and 14702 (the number of unique PARTICIPANT_IDs in the Adult cohort data)
Truncation of ZIP codes to 3 digits
Truncation of participant date of birth and deceased date to the year
Date shifting of all other dates in the data within a range of 1 year
Winsorization of Adult Cohort age at enrollment and date of birth information
Removal of free text fields that did not have data entry validation in REDCap
These steps are discussed in more detail below.
Participant_ID masking
Extract PARTICIPANT_IDs from each table within the Adult cohort (currently Answerdata, Biospecimens, Concepts, Demographics, Visits, and Fitbit) and perform deduplication to ensure uniqueness.
Randomly reorder PARTICIPANT_IDs.
Create a sequential list of numbers from 1 to the total number of unique PARTICIPANT_IDs (14702) and join that list to the list of randomly reordered unique PARTICIPANT_IDs, creating the ID lookup table.
Merge the ID lookup table with each of the tables in the Adult cohort by PARTICIPANT_ID to attach the masked IDs.
Drop the column of original PARTICIPANT_IDs from the merged tables, leaving just the masked ID to uniquely identify participants.
VISIT_ID on the Answerdata and Visits tables also contains PARTICIPANT_ID. Drop that portion of the VISIT_ID within the de-identified data, retaining the rest of the VISIT_ID.
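The masking steps above can be sketched as follows. This is an illustration under stated assumptions rather than the RECOVER implementation: tables are modeled as lists of dictionaries, the function names and the MASKED_ID column name are hypothetical, and the sample IDs are invented. The VISIT_ID step is omitted because the layout of VISIT_ID is not described here.

```python
import random

def build_id_lookup(tables, seed=None):
    """Map each original PARTICIPANT_ID to a masked integer in 1..N."""
    # Extract IDs from every table and deduplicate them.
    unique_ids = sorted(
        {row["PARTICIPANT_ID"] for rows in tables.values() for row in rows}
    )
    # Randomly reorder the unique IDs.
    random.Random(seed).shuffle(unique_ids)
    # Pair the shuffled IDs with the sequence 1..N to form the lookup table.
    return {original: masked for masked, original in enumerate(unique_ids, start=1)}

def mask_table(rows, lookup):
    """Attach the masked ID to each row and drop the original ID column."""
    masked_rows = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "PARTICIPANT_ID"}
        new_row["MASKED_ID"] = lookup[row["PARTICIPANT_ID"]]  # hypothetical column name
        masked_rows.append(new_row)
    return masked_rows

# Invented sample data: two tables sharing some participants.
tables = {
    "demographics": [
        {"PARTICIPANT_ID": "A-001"},
        {"PARTICIPANT_ID": "B-002"},
        {"PARTICIPANT_ID": "C-003"},
    ],
    "visits": [
        {"PARTICIPANT_ID": "A-001", "VISIT_START_DATE": "2022-03-01"},
        {"PARTICIPANT_ID": "C-003", "VISIT_START_DATE": "2022-05-09"},
    ],
}
lookup = build_id_lookup(tables, seed=0)
masked = {name: mask_table(rows, lookup) for name, rows in tables.items()}
```

Building a single lookup across all tables before masking is what keeps a participant's masked ID consistent from one file to the next.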
Truncation of ZIP Codes
Retain the first three digits of each ZIP code (removing the fourth and fifth digits).
Truncation of participant date of birth and deceased date
Keep the first four characters of date of birth and deceased date, corresponding to the year portion of the date.
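Both truncation steps amount to keeping a leading substring. A minimal sketch, assuming ZIP codes are stored as text (so leading zeros survive) and that dates are stored as text beginning with a four-digit year; the function names are hypothetical:

```python
def truncate_zip(zip_code):
    """Keep the first three characters of a ZIP code stored as text."""
    if not zip_code:
        return zip_code  # leave missing values untouched
    return zip_code[:3]

def truncate_to_year(date_text):
    """Keep the first four characters of a year-first date string."""
    if not date_text:
        return date_text  # leave missing values untouched
    return date_text[:4]

short_zip = truncate_zip("02139")
birth_year = truncate_to_year("1975-08-14")
```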
Date shifting (All other dates)
Select ENROLL_DATE as the anchor date for each participant.
For participants who do not have an ENROLL_DATE, generate a random anchor date within the year 2023 (2023/01/01 to 2023/12/31).
Shift the anchor date to the first day of the year, i.e., 2023/01/01.
Calculate the difference in days from the original to the shifted anchor date.
Recalculate all other dates by adding that date difference to those dates.
All date values within ANSWER_TEXT_VAL on the Answerdata table were shifted according to this protocol. The concepts that correspond to date values within the Answerdata table are provided in the file “RECOVERAdult_i2b2_concepts_BDC_dateshift.xlsx”.
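The date-shifting steps can be sketched as follows. This is one illustrative reading of the protocol, not the RECOVER pipeline: the function names are hypothetical, and the code assumes that "the first day of the year" means January 1 of the anchor date's own calendar year.

```python
import random
from datetime import date, timedelta

def pick_anchor(enroll_date, rng):
    """Use ENROLL_DATE as the anchor, or a random date in 2023 if it is missing."""
    if enroll_date is not None:
        return enroll_date
    # 2023 is not a leap year, so offsets 0..364 cover 2023/01/01 to 2023/12/31.
    return date(2023, 1, 1) + timedelta(days=rng.randrange(365))

def shift_offset(anchor):
    """Difference from the original anchor to the shifted anchor.

    ASSUMPTION: the shifted anchor is January 1 of the anchor's own year.
    """
    return date(anchor.year, 1, 1) - anchor

def shift_dates(dates, enroll_date, rng=None):
    """Add the participant's anchor offset to every date in the list."""
    if rng is None:
        rng = random.Random()
    offset = shift_offset(pick_anchor(enroll_date, rng))
    return [d + offset for d in dates]

# A participant enrolled on 2022-03-15 with a later visit on 2022-04-01.
original = [date(2022, 3, 15), date(2022, 4, 1)]
shifted = shift_dates(original, enroll_date=date(2022, 3, 15))
```

Because every date for a participant moves by the same offset, the intervals between that participant's events are preserved; in the example the two dates remain 17 days apart.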
Winsorization of participant ages
To maintain compliance with HIPAA Safe Harbor requirements, participant age at enrollment was top-capped at 89 years old, and participant date of birth was bottom-capped at either 1933 or 1934, depending on year of enrollment (2022 or 2023). An additional column containing a flag for whether participants had an age over 89 was added to distinguish them from participants whose age is 89. Data from seven participants were affected by this de-identification protocol.
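The capping rule can be sketched as follows. This is a hedged illustration, not the RECOVER code: the function name is hypothetical, and the pairing of enrollment year with birth-year floor (2022 with 1933, 2023 with 1934) is an assumption read from the order in which the years are listed above.

```python
AGE_CAP = 89

# ASSUMPTION: earliest birth year released for each enrollment year.
DOB_FLOOR_BY_ENROLL_YEAR = {2022: 1933, 2023: 1934}

def winsorize(age_at_enrollment, birth_year, enroll_year):
    """Return (capped_age, capped_birth_year, over_cap_flag)."""
    over_cap = age_at_enrollment > AGE_CAP
    capped_age = min(age_at_enrollment, AGE_CAP)
    floor = DOB_FLOOR_BY_ENROLL_YEAR.get(enroll_year)
    # Leave the birth year unchanged when no floor is defined for that year.
    capped_birth_year = birth_year if floor is None else max(birth_year, floor)
    return capped_age, capped_birth_year, over_cap

older = winsorize(93, 1929, 2022)
exactly_89 = winsorize(89, 1933, 2022)
```

The flag is what lets an analyst tell a capped participant (older) apart from one who was genuinely 89 at enrollment (exactly_89), since both carry the same released age.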
The following table details the variables that were de-identified, and the de-identification protocol that was applied.
Demographics | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Demographics | ENROLL_DATE | Date shifting
Demographics | ENROLL_INDEX_DATE | Date shifting
Demographics | CROSSOVER_INDEX_DATE | Date shifting
Demographics | DOB | Truncation to year; bottom capping at 1933 or 1934
Demographics | AGE_AT_ENROLLMENT | Top capping at 89
Demographics | WITHDRAW_DATE | Date shifting
Demographics | DECEASED_DATE | Truncation to year
Demographics | ENROLL_ZIP_CODE | Truncation to 3 digits
Answerdata | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Answerdata | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Answerdata | ANSWER_TEXT_VAL | Date shifting
Answerdata | DATA_ENTRY_DATE | Date shifting
Biospecimens | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Biospecimens | COLLECTION_DATE | Date shifting
Visits | VISIT_ID | Removal of PARTICIPANT_ID portion of VISIT_ID
Visits | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Visits | VISIT_START_DATE | Date shifting
Fitbit | PARTICIPANT_ID | ID masking with a random number between 1 and 14702
Fitbit | SUMMARY_DATE | Date shifting
Included Data
The forms in this release cover baseline (first) enrollment visits and all subsequent follow-up visits through September 15, 2023. Collectively they represent 8.9 million rows of data (54% of the REDCap data) and were selected because they have a limited number of outstanding data queries (outside of missingness). When combined with the wearable sensor data and biospecimen inventory data, the release includes 10.1 million rows of data.
alcohol_and_tobacco | 156,454
alcohol_and_tobacco_followup | 251,039
assessment_scores | 1,281,009
covid_treatment | 183,355
demographics | 110,778
disability | 93,956
end_of_participation | 3,771
enrollment | 154,305
long_covid_treatment_trial | 38,801
new_covid_infection | 26,366
pasc_symptoms | 5,081,366
pregnancy | 103,109
pregnancy_followup | 54,437
recent_covid_treatment | 41,019
social_determinants_of_health | 769,434
social_determinants_of_health_followup | 445,687
study_termination | 96
tier_12_consent_tracking | 97,222
visit_form | 749
withdrawal | 961
Important Information for Authors
The RECOVER Initiative is committed to sharing data with the broader research community in a manner that is both timely and transparent. Because the processing required to make the data useful is complex, the following information is offered to inform responsible use of this dataset:
Data Completeness
This is a partial dataset that includes information on adult cohort participants for whom data were collected on or before September 15, 2023. Additionally, some variables with a high degree of missingness or requiring further quality control have been removed. Future releases will restore these redacted variables.
Protocol Complexity
The RECOVER protocol underlying this dataset is complex. It employs a tiered approach to administer certain tests, evaluations, and data collection procedures to specific subsets of participants. The criteria for triggering these tests and other important information are described in the Adult Cohort Study protocol and design publication; these should be considered when analyzing and interpreting the data.
Author Acknowledgements
RECOVER investigators request that publications based on the information in this dataset include the following acknowledgement:
The authors of this publication wish to acknowledge that the data utilized in this study were obtained from RECOVER Adult Cohort dataset version phs003463.v1.p1, supported by 1OT2HL156812-01, OT2HL161847-01, and 1OT2HL161841-01 awards from the NIH. This research was conducted independently of RECOVER, and the authors did not collaborate with RECOVER investigators, patient community or caregiver representatives during the course of this study.
Last updated