Cindy Boer

208 | Chapter 5.1

Data pre-processing, OTU picking and quality control

Phylogenetic multi-sample profiling was performed using an in-house developed analysis pipeline (microRapTor) based on the QIIME (version 1.9.0)[27] and UPARSE (version 8.1)[28] software packages. Briefly, index sequences (12bp) were removed from each read and concatenated to generate a unique index of 24bp for each read-pair. Spacer and primer sequences were removed using TAGCleaner (version 0.16)[29]. Paired reads were merged using PEAR (version 0.9.6)[30] with the following settings: a minimum overlap of 10bp (default) and an average read quality phred-score of 20 over a 30bp sliding window. Merged reads shorter than 200bp were discarded. Reads were de-multiplexed using QIIME with extra quality-filtering steps: merged reads were truncated before three consecutive low-quality bases, and ambiguous bases were not allowed. Chimeric reads were removed using UCHIME (version 8.1)[28]. Duplicate samples, samples with fewer than 10,000 reads, and samples from participants who had used antibiotics (self-reported) in the year prior to sample production were excluded (Supplementary Figure 1). The 16S sequence reads of the remaining samples (2,214 for GenR and 1,544 for RS) were randomly subsampled at 10,000 reads per sample (after rarefaction analysis). Combined reads of all samples, in each cohort separately, were clustered into operational taxonomic units (OTUs) using UPARSE at a minimum cluster identity of 97%. The representative read from each OTU was then mapped to the SILVA rRNA database version 128[31] using the RDP Naïve Bayesian Classifier version 2.12[32]. OTUs containing fewer than 40 reads were removed as described by Benson et al.[33]. This threshold was established based on a correlation analysis of the OTU tables of 5 pairs of technical replicates, whose DNA was amplified, sequenced and profiled twice (Supplementary Figure 2).
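The subsampling and minimum-read filtering steps described above can be illustrated with a minimal Python sketch. It is not the microRapTor pipeline itself: the function names, the dictionary-based data layout and the fixed random seed are assumptions made for illustration, and the sketch assumes that the 40-read threshold applies to the total read count of an OTU across all samples of a cohort, which the text does not state explicitly.

```python
import random
from collections import Counter

DEPTH = 10_000      # reads kept per sample (value taken from the text)
MIN_OTU_READS = 40  # OTUs with fewer reads are removed (value taken from the text)


def subsample_reads(reads_by_sample, depth=DEPTH, seed=0):
    """Randomly keep `depth` reads per sample, without replacement.

    reads_by_sample: dict mapping sample id -> list of read identifiers
    (hypothetical layout). Samples with fewer than `depth` reads are dropped.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    kept = {}
    for sample, reads in sorted(reads_by_sample.items()):
        if len(reads) < depth:
            continue  # too few reads: sample is excluded
        kept[sample] = rng.sample(reads, depth)
    return kept


def drop_small_otus(otu_table, min_reads=MIN_OTU_READS):
    """Remove OTUs whose total read count across samples is below `min_reads`.

    otu_table: dict mapping sample id -> dict of OTU id -> read count
    (hypothetical layout). Assumes the threshold applies to the summed count.
    """
    totals = Counter()
    for counts in otu_table.values():
        totals.update(counts)  # adds per-sample counts to the running totals
    keep = {otu for otu, n in totals.items() if n >= min_reads}
    return {
        sample: {otu: n for otu, n in counts.items() if otu in keep}
        for sample, counts in otu_table.items()
    }
```

In the workflow described in the text, subsampling happens on the reads before OTU clustering and the read-count filter is applied to the resulting OTU table, which is why the two functions operate on different inputs.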
The sequence data was then analyzed for α-diversity metrics (Shannon diversity Index, species richness and Inverse Simpson Index). Final OTU filtering was performed by removing OTUs with a total read count of less than 0.005% of all reads and OTUs observed in fewer than 1% of the samples of each cohort, as described previously[34]. The final OTU table was divided into 5 sub-tables at different taxonomic levels (in the QIIME environment): phylum, class, order, family, and genus.

Statistical analyses

All statistical analyses were performed in R[35] using the vegan[36], phyloseq[37] and MaAsLin[38] packages. As MaAsLin performs multiple analyses in parallel, a q-value < 0.05 (false discovery rate (FDR) corrected for multiple testing) was used as the significance threshold. All MaAsLin models were adjusted for technical covariates and other confounders as described in the sections below.
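To make the q-value threshold concrete, the following Python sketch converts a list of p-values into FDR-adjusted q-values. It assumes the Benjamini-Hochberg procedure; the text only says "FDR corrected" and does not name the method, so this choice, like the function name `bh_qvalues`, is an assumption for illustration and not a description of MaAsLin's internals. The analyses in the text were run in R, not Python.

```python
def bh_qvalues(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    The i-th smallest of m p-values is scaled by m / i, and the results are
    made monotone by taking a running minimum from the largest p-value down.
    Output order matches input order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def significant(pvalues, threshold=0.05):
    """Flag tests whose q-value is below the threshold (0.05 in the text)."""
    return [q < threshold for q in bh_qvalues(pvalues)]
```

For example, `significant([0.001, 0.02, 0.04, 0.5])` flags only the first two tests: the third p-value is below 0.05 on its own but its adjusted value is not.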
