python - How to aggregate values over a bigger-than-RAM gzip'ed CSV file?


For starters: I am new to bioinformatics and programming. I have built a script that goes through a so-called VCF file (only the individuals are included, 1 column = 1 individual) and uses a search string to find out, for every variant (line), whether the individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to do this on very large zipped files (of whole genomes), but I do not know how to transform the script into one that works line by line (because I want to count through whole columns, I just don't see how to solve that).

So the output should be 5 things per individual (total variants, number of homozygotes, number of heterozygotes, and the proportions of homo- and heterozygotes). See the code below:

#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'

gz_infile = gzip.GzipFile(subset_cols, "r")
#gz_outfile = gzip.GzipFile(nuc_div, "w")

# make a dictionary of the header line for easy retrieval of elements later on
headers = gz_infile.readline().rstrip().split('\t')
print headers
column_dict = {}
for header in headers:
    column_dict[header] = []
for line in gz_infile:
    columns = line.rstrip().split('\t')
    for i in range(len(columns)):
        c_header = headers[i]
        column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:
    number_homozygotes = 0
    number_heterozygotes = 0
    for values in column_dict[key]:
        # this search string contains the regexp (this regexp has been tested)
        searchstr = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
        result = re.search(searchstr, values)
        if result:  # here we skip missing genotypes ./.
            variant_one = int(result.group(1))
            variant_two = int(result.group(2))
            if variant_one == 0 and variant_two == 0:
                continue
            elif variant_one == variant_two:
                # count +1 in case variant 1 and variant 2 are equal (so 0/0, 1/1, etc.)
                number_homozygotes += 1
            elif variant_one != variant_two:
                # count +1 in case variant 1 is not equal to variant 2 (so 1/0, 0/1, etc.)
                number_heterozygotes += 1

    print "%s homozygotes %s" % (number_homozygotes, key)
    print "%s heterozygotes %s" % (number_heterozygotes, key)

    variants = number_homozygotes + number_heterozygotes
    print "%s variants" % variants

    prop_homozygotes = (1.0 * number_homozygotes / variants) * 100
    prop_heterozygotes = (1.0 * number_heterozygotes / variants) * 100

    print "%s %% homozygous %s" % (prop_homozygotes, key)
    print "%s %% heterozygous %s" % (prop_heterozygotes, key)

Any help would be much appreciated so I can go on investigating large datasets, thank you :)

The VCF file, by the way, looks like this:

individual_1    individual_2    individual_3
0/0:9,0:9:24:0,24,221    1/0:5,4:9:25:25,0,26    1/1:0,13:13:33:347,33,0

The first line is the header line with the individual ID names (I have 33 individuals in total, with more complicated ID tags; I simplified them here), and then I have a lot of these information lines with the same specific pattern. I am only interested in the first part with the slash (the genotype), hence the regular expression.
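For illustration (this snippet is not from the original post), here is how that search string picks the two alleles out of one of the genotype fields above, checked in a Python 3 shell:

import re

searchstr = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'   # same pattern as in the script
result = re.search(searchstr, '1/0:5,4:9:25:25,0,26')  # genotype field from the example line
print(result.group(1), result.group(2))                # prints: 1 0  -> heterozygous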

Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some sort of "streaming" or "line-by-line" algorithm to handle data that is too large to fit in the RAM of your machine. Unfortunately, if you are limited to Python without libraries, you have to manually chunk the file and handle the parsing of the VCF.
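For reference, here is a minimal sketch of what such a line-by-line version of the original script could look like in plain Python (Python 3 assumed; the file name and regular expression are taken from the question). Instead of storing every column in a dictionary of lists, it keeps only two counters per individual and updates them as each line is read, so memory usage grows with the number of individuals rather than the number of lines:

#!/usr/bin/env python3
import gzip
import re

subset_cols = 'subset_cols_chr18.vcf.gz'  # same input file as in the question

# same search string as in the question: captures the two alleles of the genotype
genotype_re = re.compile(r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+')

with gzip.open(subset_cols, 'rt') as gz_infile:
    headers = gz_infile.readline().rstrip().split('\t')

    # two counters per individual instead of whole columns in memory
    n_hom = {h: 0 for h in headers}
    n_het = {h: 0 for h in headers}

    for line in gz_infile:
        columns = line.rstrip().split('\t')
        for header, value in zip(headers, columns):
            result = genotype_re.search(value)
            if not result:            # skip missing genotypes such as ./.
                continue
            a1, a2 = int(result.group(1)), int(result.group(2))
            if a1 == 0 and a2 == 0:   # skip 0/0, as the original script does
                continue
            if a1 == a2:
                n_hom[header] += 1
            else:
                n_het[header] += 1

for header in headers:
    variants = n_hom[header] + n_het[header]
    if variants == 0:
        continue
    print("%s: %d variants, %d homozygotes (%.2f%%), %d heterozygotes (%.2f%%)"
          % (header, variants,
             n_hom[header], 100.0 * n_hom[header] / variants,
             n_het[header], 100.0 * n_het[header] / variants))

That said, a dedicated tool saves you from writing and maintaining this kind of parsing code yourself.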

The Hail project is a free, open-source tool for scientists with genetic data ranging from too big to fit in RAM all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical genetics domain-specific language which makes your question much shorter to express.

The simplest answer

The most faithful translation of your Python code to Hail is this:

/path/to/hail importvcf -f your_file.vcf.gz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleinfo.tsv

I ran the above command on my dual-core laptop on a 2.0GB file:

# ls -alh profile225.vcf.bgz
-rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleinfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]
hail: info: coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleinfo.tsv
hail: info: while importing:
    file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 34.211s
  annotatesamples expr: 6m52.4s
  annotatesamples expr: 21.399ms
  exportsamples: 121.786ms
  total: 7m26.8s
# head sampleinfo.tsv
sample  pHomRef pHet    nHom    nHet    nCalled
HG00096 9.49219e-01 5.07815e-02 212325  11359   223684
HG00097 9.28419e-01 7.15807e-02 214035  16502   230537
HG00099 9.27182e-01 7.28184e-02 211619  16620   228239
HG00100 9.19605e-01 8.03948e-02 214554  18757   233311
HG00101 9.28714e-01 7.12865e-02 214283  16448   230731
HG00102 9.24274e-01 7.57260e-02 212095  17377   229472
HG00103 9.36543e-01 6.34566e-02 209944  14225   224169
HG00105 9.29944e-01 7.00564e-02 214153  16133   230286
HG00106 9.25831e-01 7.41687e-02 213805  17128   230933

Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCFs aren't a great format for data analysis!

Optimizing the storage format

Let's convert to Hail's optimized storage format, the VDS, and re-run the command:

# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]
hail: info: coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:>                                                         (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
    annotatesamples expr -c \
      'sa.nCalled = gs.filter(g => g.isCalled).count(),
       sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
       sa.nHet = gs.filter(g => g.isHet).count()' \
    annotatesamples expr -c \
      'sa.pHom = sa.nHom / sa.nCalled,
       sa.pHet = sa.nHet / sa.nCalled' \
    exportsamples -c 'sample = s, sa.*' -o sampleinfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:>                                                          (0 + 0) / 4]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
     sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleinfo.tsv
hail: info: timing:
  read: 2.969s
  annotatesamples expr: 1m20.4s
  annotatesamples expr: 21.868ms
  exportsamples: 151.829ms
  total: 1m23.5s

About 5 times faster! With regard to larger scale: running the same command on Google Cloud, on the VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, roughly 315GB compressed), took 3m42s using 328 worker cores.

Using the Hail built-in

Hail also has a sampleqc command that computes most of what you want (and more!):

../hail/build/install/hail/bin/hail read -i profile225.vds \
    sampleqc \
    annotatesamples expr -c \
      'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
       sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
    exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleinfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:>                                                          (0 + 0) / 4]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]
hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]
hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
     sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleinfo.tsv
hail: info: timing:
  read: 2.928s
  sampleqc: 1m27.0s
  annotatesamples expr: 229.653ms
  exportsamples: 353.942ms
  total: 1m30.5s

Installing Hail

Installing Hail is pretty easy, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail Discourse (both are linked from the home page; unfortunately, I don't have enough reputation to create real links).

The near future

In the very near future (less than one month from today), the Hail team will complete a Python API which will allow you to express the first snippet as:

result = importvcf("your_file.vcf.gz")
  .annotatesamples('sa.nCalled = gs.filter(g => g.isCalled).count(),
                    sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                    sa.nHet = gs.filter(g => g.isHet).count()')
  .annotatesamples('sa.pHom = sa.nHom / sa.nCalled,
                    sa.pHet = sa.nHet / sa.nCalled')

for (x in result.sampleAnnotations):
  print("sample " + x +
        " nCalled " + x.nCalled +
        " nHom " + x.nHom +
        " nHet " + x.nHet +
        " percent hom " + x.pHom * 100 +
        " percent het " + x.pHet * 100)

result.sampleAnnotations.write("sampleinfo.tsv")

Edit: added the output of head on the TSV file.

Edit 2: the latest Hail doesn't need biallelic variants for sampleqc.

Edit 3: added a note about scaling to the cloud with hundreds of cores.

