python - How to aggregate values over a bigger than RAM gzip'ed csv file?
For starters: I am new to bioinformatics and to programming. I have built a script that goes through a so-called VCF file (only the individuals are included; 1 column = 1 individual) and uses a search string to find out, for every variant (line), whether the individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to run it on very large zipped files (of whole genomes), but I do not know how to transform the script into one that works line by line (because I want to count through whole columns, I just do not see how to solve that).

So the output is 5 things per individual (total variants, number of homozygotes, number of heterozygotes, and the proportions of homo- and heterozygotes). See the code below:
#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'
gz_infile = gzip.GzipFile(subset_cols, "r")
#gz_outfile = gzip.GzipFile(nuc_div, "w")

# make a dictionary of the header line for easy retrieval of elements later on
headers = gz_infile.readline().rstrip().split('\t')
print headers
column_dict = {}
for header in headers:
    column_dict[header] = []
for line in gz_infile:
    columns = line.rstrip().split('\t')
    for i in range(len(columns)):
        c_header = headers[i]
        column_dict[c_header].append(columns[i])
#print column_dict
for key in column_dict:
    number_homozygotes = 0
    number_heterozygotes = 0
    for values in column_dict[key]:
        # this search string contains the regexp (this regexp was tested)
        searchstr = '(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
        result = re.search(searchstr, values)
        if result:
            # here we skip missing genotypes ./.
            variant_one = int(result.group(1))
            variant_two = int(result.group(2))
            if variant_one == 0 and variant_two == 0:
                continue
            elif variant_one == variant_two:
                # count +1 in case variant 1 and 2 are equal (so 0/0, 1/1, etc.)
                number_homozygotes += 1
            elif variant_one != variant_two:
                # count +1 in case variant 1 is not equal to variant 2 (so 1/0, 0/1, etc.)
                number_heterozygotes += 1

    print "%s homozygotes %s" % (number_homozygotes, key)
    print "%s heterozygotes %s" % (number_heterozygotes, key)
    variants = number_homozygotes + number_heterozygotes
    print "%s variants" % variants
    prop_homozygotes = (1.0 * number_homozygotes / variants) * 100
    prop_heterozygotes = (1.0 * number_heterozygotes / variants) * 100
    print "%s %% homozygous %s" % (prop_homozygotes, key)
    print "%s %% heterozygous %s" % (prop_heterozygotes, key)
Any help is much appreciated so I can go on investigating large datasets, thank you :)
The VCF file, by the way, looks like this:

individual_1             individual_2            individual_3
0/0:9,0:9:24:0,24,221    1/0:5,4:9:25:25,0,26    1/1:0,13:13:33:347,33,0

The header line holds the individual ID names (I have 33 individuals in total, with more complicated ID tags; I simplified here), and then I have a lot of these information lines with the same specific pattern. I am only interested in the first part with the slash, hence the regular expression.
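To show what the regular expression captures, here is a quick check (Python 3; the pattern is the one from my script, the first two genotype fields come from the sample line above, and the third is a hypothetical missing-genotype field added for illustration):

import re

# two allele digits around a slash, followed by the depth/quality fields,
# which are matched but not captured
genotype_re = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

fields = ['0/0:9,0:9:24:0,24,221',
          '1/0:5,4:9:25:25,0,26',
          './.:0,0:0:.:0,0,0']  # hypothetical missing genotype
for field in fields:
    match = genotype_re.search(field)
    if match:
        print(field, '-> alleles', match.group(1), 'and', match.group(2))
    else:
        print(field, '-> no match (missing genotype)')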
Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some sort of "streaming" or "line-by-line" algorithm to handle data that is too large to fit in the RAM of your machine. Unfortunately, if you are limited to Python without libraries, you have to manually chunk the file and handle the parsing of the VCF.
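To make the streaming idea concrete before turning to Hail, here is a minimal line-by-line sketch in plain Python 3 under the assumptions of your script (same file name, same regular expression, 0/0 skipped). Instead of collecting whole columns into a dictionary, it keeps one pair of counters per individual and updates them as each line streams past, so memory use grows with the number of individuals (33 here), not with the number of variants:

import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
genotype_re = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

with gzip.open(subset_cols, 'rt') as gz_infile:
    headers = gz_infile.readline().rstrip().split('\t')
    # one homozygote counter and one heterozygote counter per individual
    n_hom = [0] * len(headers)
    n_het = [0] * len(headers)
    for line in gz_infile:
        for i, field in enumerate(line.rstrip().split('\t')):
            result = genotype_re.search(field)
            if not result:
                continue  # skip missing genotypes ./.
            a, b = int(result.group(1)), int(result.group(2))
            if a == 0 and b == 0:
                continue  # skip homozygous reference, as in your script
            if a == b:
                n_hom[i] += 1
            else:
                n_het[i] += 1

for key, hom, het in zip(headers, n_hom, n_het):
    variants = hom + het
    if variants == 0:
        continue  # avoid dividing by zero for empty columns
    print('%s: %d variants, %.2f%% homozygous, %.2f%% heterozygous'
          % (key, variants, 100.0 * hom / variants, 100.0 * het / variants))

gzip decompresses as a stream, so only one line is ever held in memory at a time. Still, for real whole-genome work you will want a proper tool, which brings me to Hail.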
The Hail project is a free, open-source tool for scientists with genetic data that is too big to fit in RAM, all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical genetics domain-specific language which makes your question much shorter to express.
The simplest answer
The most faithful translation of your Python code to Hail is this:
/path/to/hail importvcf -f your_file.vcf.gz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
I ran the above command on my dual-core laptop on a 2.0GB file:
# ls -alh profile225.vcf.bgz
-rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
    annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]
hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
    sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
    sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
    sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
    file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 34.211s
  annotatesamples expr: 6m52.4s
  annotatesamples expr: 21.399ms
  exportsamples: 121.786ms
  total: 7m26.8s
# head sampleInfo.tsv
sample   pHomRef      pHet         nHom    nHet   nCalled
HG00096  9.49219e-01  5.07815e-02  212325  11359  223684
HG00097  9.28419e-01  7.15807e-02  214035  16502  230537
HG00099  9.27182e-01  7.28184e-02  211619  16620  228239
HG00100  9.19605e-01  8.03948e-02  214554  18757  233311
HG00101  9.28714e-01  7.12865e-02  214283  16448  230731
HG00102  9.24274e-01  7.57260e-02  212095  17377  229472
HG00103  9.36543e-01  6.34566e-02  209944  14225  224169
HG00105  9.29944e-01  7.00564e-02  214153  16133  230286
HG00106  9.25831e-01  7.41687e-02  213805  17128  230933
Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCFs aren't a great format for data analysis!
Optimizing the storage format
Let's convert to Hail's optimized storage format, the VDS, and re-run the command:
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]
hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:>                                                           (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
    annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:>                                                            (0 + 0) / 4]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>               (3 + 1) / 4]
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
    sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
    sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
    sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
  read: 2.969s
  annotatesamples expr: 1m20.4s
  annotatesamples expr: 21.868ms
  exportsamples: 151.829ms
  total: 1m23.5s
About five times faster! With regard to larger scales: running the same command on Google Cloud, on the VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, roughly 315GB compressed), took 3m42s using 328 worker cores.
Using the Hail built-ins

Hail has a sampleqc command that computes most of what you want (and more!):
../hail/build/install/hail/bin/hail read -i profile225.vds \
    sampleqc \
    annotatesamples expr -c \
    'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
     sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
    exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar,
        nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:>                                                            (0 + 0) / 4]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>               (3 + 1) / 4]
hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
    sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
  read: 2.928s
  sampleqc: 1m27.0s
  annotatesamples expr: 229.653ms
  exportsamples: 353.942ms
  total: 1m30.5s
Installing Hail
Installing Hail is pretty easy and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail Discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).
The near future
In the very near future (less than one month from today), the Hail team will complete a Python API which will allow you to express the first snippet as:
result = importvcf("your_file.vcf.gz")
  .annotatesamples('sa.nCalled = gs.filter(g => g.isCalled).count(),
                    sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                    sa.nHet = gs.filter(g => g.isHet).count()')
  .annotatesamples('sa.pHom = sa.nHom / sa.nCalled,
                    sa.pHet = sa.nHet / sa.nCalled')

for x in result.sampleannotations:
    print("sample " + x +
          " nCalled " + x.nCalled +
          " nHom " + x.nHom +
          " nHet " + x.nHet +
          " percent hom " + x.pHom * 100 +
          " percent het " + x.pHet * 100)

result.sampleannotations.write("sampleInfo.tsv")
Edit: added the output of head on the TSV file.
Edit 2: the latest Hail doesn't need biallelic variants for sampleqc.
Edit 3: added a note about scaling on the cloud with hundreds of cores.