DEEP/DAG1

Lifelines-DEEP, also known as DAG1, is one of Lifelines' additional assessments, performed in collaboration with the UMCG Department of Genetics (see also: DAG2 and DAG3). DAG is short for DArmGezondheid, Dutch for “gastrointestinal health”, a research project in which the microbiome is analysed in faecal samples.
The primary goal of the Lifelines-DEEP project is to gain insight into the relations between the genome, epigenome, transcriptome, microbiome, metabolome, and other biological and phenotypic parameters. Lifelines-DEEP is an example of a ‘next-generation’ population cohort study, in which multiple molecular data levels are combined with observational research methods 1).

Subcohort

From April 2012 to August 2013, all adult participants registered at the Lifelines location in Groningen were invited to participate in Lifelines-DEEP, in addition to the regular Lifelines programme. Inclusion stopped when the target group size of n=~1500 was reached.

Protocol

From the 1500 DEEP-participants, a variety of additional data and samples were collected, including:

Data availability

Type                  N        Available?
Genomics (CytoSNP)    ~1400    Yes
Methylation           ~800     Yes
RNAseq                ~1300    Yes
MGS                   ~1200    Yes*
16S                   ~1000    Yes*
Metabolomics          ~1400    Yes
Cytokines             ~1100    Yes
Proteomics            ~1100    Yes
WES                   ~1000    Yes

*Raw data is available; the processed data is not available yet

Genomics

Genotyping of genomic DNA was performed using both the HumanCytoSNP-12 BeadChip [15] and the ImmunoChip, a customised Illumina Infinium array [16]. Genotyping was successful for 1385 samples (CytoSNP) and 1374 samples (ImmunoChip). First, SNP quality control was applied independently for both platforms. SNPs were filtered on MAF above 0.001, a HWE p value >1e−4 and a call rate of 0.98 using Plink [17]. The genotypes from both platforms were merged into one data set. For SNPs present on both platforms, genotypes were set to missing in the case of non-concordant calls. After merging, SNPs were filtered again on MAF 0.05 and a call rate of 0.98, resulting in a total of 379 885 genotyped SNPs. Next, these data were imputed based on the Genome of the Netherlands (GoNL) reference panel [18–20]. The merged genotypes were prephased using SHAPEIT2 [21] and aligned to the GoNL reference panel using Genotype Harmonizer [22] in order to resolve strand issues. The imputation was performed using IMPUTE2 [23] V.2.3.0 against the GoNL reference panel. We used a MOLGENIS compute [24] imputation pipeline to generate our scripts and monitor the imputation. Imputation yielded 8 606 371 variants with Info score ≥0.8. In addition, HLA type was established via the Broad SNP2HLA imputation pipeline [25].
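The per-platform filtering and the concordance rule described above can be illustrated with a minimal Python sketch. This is not the Plink-based pipeline that was actually run: the `SnpStats` record, the function names and the string encoding of genotypes are hypothetical, and treating the call-rate threshold as inclusive (≥0.98) is an assumption, since the text only states "call rate of 0.98".

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics for one genotyping platform."""
    snp_id: str
    maf: float        # minor allele frequency
    hwe_p: float      # Hardy-Weinberg equilibrium p value
    call_rate: float  # fraction of samples with a non-missing genotype


def passes_qc(snp, min_maf=0.001, min_hwe_p=1e-4, min_call_rate=0.98):
    """Apply the pre-merge thresholds quoted in the text.

    Assumption: MAF and HWE thresholds are strict, call rate is inclusive.
    """
    return (snp.maf > min_maf
            and snp.hwe_p > min_hwe_p
            and snp.call_rate >= min_call_rate)


def merge_calls(call_a, call_b):
    """Merge one sample's genotype at a SNP present on both platforms.

    Genotypes are two-letter strings such as "AG"; None means missing.
    Allele order is ignored, and discordant calls are set to missing.
    """
    if call_a is None:
        return call_b
    if call_b is None:
        return call_a
    norm_a = "".join(sorted(call_a))
    norm_b = "".join(sorted(call_b))
    return norm_a if norm_a == norm_b else None


snps = [
    SnpStats("snp_ok", 0.20, 0.5, 0.99),
    SnpStats("snp_rare", 0.0005, 0.5, 0.99),
    SnpStats("snp_hwe", 0.20, 1e-6, 0.99),
    SnpStats("snp_missing", 0.20, 0.5, 0.97),
]
kept = [s.snp_id for s in snps if passes_qc(s)]
```

For the post-merge filter, the same function could be called with `min_maf=0.05`.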

Methylation

Excerpt from Shah et al. (2015) 2). We isolated total DNA from EDTA tubes and profiled genome-wide methylation using the Infinium HumanMethylation450 BeadChip, as previously described [11]. In short, 500 ng of genomic DNA was bisulfite modified and used for hybridisation on Infinium HumanMethylation450 BeadChips, according to the Illumina Infinium HD Methylation protocol.
Details of DNA extraction and methylation profiling are described elsewhere [19]. Probe QC, background correction, color correction, and normalization were performed with a custom pipeline based on the pipeline by Tost and Touleimat [24]. All methylation probes were re-mapped to the human genome (hg37, UCSC Genome Browser) [25], and both poorly mapping probes and probes with a SNP in the single-base extension site (according to GoNL [26]) were removed in the same step. Data were normalized with DASEN.
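As a rough illustration of the probe-removal step, the sketch below drops probes that map poorly or that carry a SNP at the single-base extension site. The record layout, the flag names and the probe IDs are hypothetical; the actual pipeline and its mapping criteria are described in the cited publications, not here.

```python
def keep_probe(probe):
    """Keep a probe only if it maps well and has no SNP at the extension site.

    `maps_well` and `snp_at_sbe_site` are hypothetical boolean flags that a
    re-mapping and SNP-annotation step would have produced upstream.
    """
    return probe["maps_well"] and not probe["snp_at_sbe_site"]


# Hypothetical probe annotations, for illustration only.
probes = [
    {"id": "probe_A", "maps_well": True, "snp_at_sbe_site": False},
    {"id": "probe_B", "maps_well": False, "snp_at_sbe_site": False},
    {"id": "probe_C", "maps_well": True, "snp_at_sbe_site": True},
]

# Both exclusion criteria are applied in a single pass, as in the text.
kept_probe_ids = [p["id"] for p in probes if keep_probe(p)]
```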

Transcriptomics / RNAseq

Excerpt from Tigchelaar et al. (2015). RNA sequencing was used to measure genome-wide gene expression. RNA was isolated from whole blood collected in a PAXgene tube using the PAXgene Blood miRNA Kit (Qiagen, California, USA). The RNA samples were quantified and assessed for integrity before sequencing. Total RNA from whole blood was depleted of globin transcripts using the GLOBINclear kit (Ambion, Austin, Texas, USA), and subsequently processed for sequencing using the TruSeq V.2 library preparation kit (Illumina Inc, San Diego, California, USA). An Illumina HiSeq 2000 was used for paired-end sequencing of 2×50 bp, i.e. 50 base pairs were read from each end of every fragment. Ten samples were pooled per lane. Finally, read sets per sample were generated using CASAVA, retaining only reads passing Illumina’s Chastity Filter for further processing. On average, the number of raw reads per individual after QC was 44.3 million. After adapter trimming, the reads were mapped to human genome build 37 using STAR (https://code.google.com/p/rna-star/). Of these, 96% of reads were successfully mapped to the genome. Transcription was quantified on the gene and meta-exon level using BEDTools (https://code.google.com/p/bedtools/) and custom scripts, and on the transcript level using FluxCapacitor (http://sammeth.net/confluence/display/FLUX/Home).

Microbiome / 16S analysis

Excerpt from Tigchelaar et al. (2015). Faecal samples were collected in order to study the gut microbiome. Gut microbial composition was assessed by 16S rRNA gene sequencing of the V4 variable region on the Illumina MiSeq platform according to the manufacturer's specifications [27]. Reads were quality filtered and taxonomy was inferred using a closed-reference operational taxonomic unit (OTU) picking protocol against a preclustered GreenGenes database, as implemented by QIIME (V.1.7.0 and V.1.8.0) [28, 29]. Moreover, faecal aliquots were stored for future analysis of GI-health-related biomarkers.

Metagenome Shotgun analysis

Excerpt from Zhernakova et al. (2016) 3). The gut microbiome was analyzed using paired-end metagenomic shotgun sequencing (MGS) on a HiSeq2000, generating an average of 3.0 Gb of data (about 32.3 million reads) per sample. After excluding 44 samples with low read counts, 1,135 participants (474 males and 661 females) remained for further analysis. We tested 207 factors with respect to the microbiomes of these participants: 41 intrinsic factors of various physiological and biomedical measures, 39 self-reported diseases, 44 categories of drugs, 5 categories of smoking status and 78 dietary factors (fig. S1 and table S1).
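The numbers quoted above can be cross-checked with a few lines of Python. The sketch only re-adds the reported figures; reading "Gb" as gigabases (10^9 bases) is an assumption, and the variable names are illustrative.

```python
# Factor categories as listed in the excerpt.
factors = {
    "intrinsic": 41,
    "self_reported_diseases": 39,
    "drug_categories": 44,
    "smoking_categories": 5,
    "dietary": 78,
}
total_factors = sum(factors.values())

# Participants remaining after exclusion, split by sex.
participants = {"male": 474, "female": 661}
analysed = sum(participants.values())

# Samples sequenced before the 44 low-read-count samples were excluded.
sequenced = analysed + 44

# Average read length implied by the reported yield,
# assuming Gb means gigabases (1e9 bases).
mean_bases_per_read = 3.0e9 / 32.3e6

print(total_factors, analysed, sequenced, round(mean_bases_per_read))
```

The category counts add up to the 207 factors reported, and the sex split adds up to the 1,135 participants analysed.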

Metabolomics

Excerpt from Tigchelaar et al. (2015). We determined metabolites in exhaled air and blood. Metabolites from exhaled air were measured by a combination of gas chromatography and time-of-flight mass spectrometry (GC-TOF-MS), as described previously. In short, the exhaled air sample was introduced into a GC that separates the different compounds in the mixture. Subsequently, the compounds were introduced into the MS to detect and identify the separated volatile organic compounds. The metabolites in plasma were measured using nuclear magnetic resonance (NMR), as described by Kettunen et al.

Proteomics

Excerpt from Zhernakova et al. (2018) 4): To estimate the effect of host genetics on protein levels, we first performed a local protein quantitative trait loci (cis-pQTL) analysis by testing SNPs located within 250 kb of the genes coding for the 92 proteins. This yielded 129 significant cis-pQTLs for 66 proteins at a genome-wide false discovery rate (FDR) of 0.05 (Supplementary Table 2 and Supplementary Fig. 2). We then regressed out the cis-pQTL effects and conducted a trans-pQTL mapping in a genome-wide manner and then separately on disease- and trait-associated SNPs only, which together yielded 85 independent trans-pQTLs for 36 proteins (Supplementary Table 3 and Supplementary Fig. 3). Of these, 19 cis-pQTLs and 74 trans-pQTLs were associated with complex traits and diseases, including 10 cis- and 7 trans-regulated proteins known to be relevant for CVD (Supplementary Tables 2 and 3). In addition, we separately assessed associations to 422 putative CVD-associated SNPs [7] and detected pQTL associations for 14 proteins (Supplementary Table 4 and Supplementary Note 1). These pQTLs could point to driver genes in CVD (Supplementary Note 2); for example, as can be seen in the pleiotropic trans-pQTL effect observed at the KLKB1 gene (Supplementary Fig. 4).
Next, we examined the power of our study by assessing the replication rate of previously reported pQTLs [8, 9, 10, 11] and identified a 95% replication rate for cis effects and an 88% replication rate for trans effects, all with the same allelic direction (Supplementary Table 5). Our data also revealed novel pQTL associations including 36 cis-pQTLs for 25 proteins and 48 trans-pQTLs for 27 proteins (Supplementary Tables 2 and 3).
We found that only 64% of cis-pQTLs had at least one corresponding significant cis-eQTL, and 76% of these had the same allelic direction in blood from the same individuals or in other tissue types from the GTEx project [12] (Supplementary Note 3, Supplementary Tables 2 and 6, and Supplementary Fig. 5). In contrast, none of the 85 trans-pQTLs were detectable at expression level, but this may be due to limited power, as the effect sizes of trans-eQTLs are known to be very modest. Despite this, our data do provide evidence that a large amount of trans-regulation can happen at translation or protein level; for example, through regulation of translational rate and protein secretion to blood, through post-translational modification, or through protein–protein interactions (PPIs), and these trans effects are not necessarily detectable at transcription level.
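The distinction between cis and trans tests in the excerpt above rests on a 250 kb window around each protein-coding gene. A minimal sketch of such a window test follows. It assumes that distance is measured to the nearest gene boundary, which the excerpt does not specify (a gene midpoint or transcription start site is equally plausible), and all function names and coordinates are hypothetical.

```python
def distance_to_gene(pos, gene_start, gene_end):
    """Distance in base pairs from a position to the nearest gene boundary.

    Returns 0 when the position lies inside the gene.
    """
    if pos < gene_start:
        return gene_start - pos
    if pos > gene_end:
        return pos - gene_end
    return 0


def is_cis(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
           window=250_000):
    """True if the SNP is on the gene's chromosome and within the window.

    Assumption: the window is measured from the gene boundaries and the
    250 kb limit is inclusive.
    """
    if snp_chrom != gene_chrom:
        return False
    return distance_to_gene(snp_pos, gene_start, gene_end) <= window


# Hypothetical gene on chromosome 1 spanning 1,200,000 to 1,300,000.
near = is_cis("1", 1_000_000, "1", 1_200_000, 1_300_000)       # 200 kb upstream
far = is_cis("1", 900_000, "1", 1_200_000, 1_300_000)          # 300 kb upstream
other_chrom = is_cis("2", 1_250_000, "1", 1_200_000, 1_300_000)
```

SNP-protein pairs for which `is_cis` is false would fall under the trans analysis in this simplified picture.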

Publications using DEEP-data

Requesting access

1) Tigchelaar EF et al. (2015) Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open 5(8): e006772
2) Shah S et al. (2015) Improving Phenotypic Prediction by Combining Genetic and Epigenetic Associations. Am J Hum Genet 97(1): 75-85
3) Zhernakova A et al. (2016) Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science 352(6285): 565-569
4) Zhernakova DV et al. (2018) Individual variations in cardiovascular-disease-related protein levels are driven by genetics and gut microbiome. Nat Genet 50(11): 1524-1532