python - Uncompressed size of a webpage using chunked transfer encoding and gzip compression


I'm writing an application that calculates the savings obtained by using gzip on a web page. When the user inputs the URL of a web page that uses gzip, the application should spit out the savings in size due to gzip.

How should I approach this problem?

This is what I'm getting as the headers from a request to the page:

{'X-Powered-By': 'PHP/5.5.9-1ubuntu4.19',
 'Transfer-Encoding': 'chunked',
 'Content-Encoding': 'gzip',
 'Vary': 'Accept-Encoding',
 'Server': 'nginx/1.4.6 (Ubuntu)',
 'Connection': 'keep-alive',
 'Date': 'Thu, 10 Nov 2016 09:49:58 GMT',
 'Content-Type': 'text/html'}

I am retrieving the page using requests:

r = requests.get(url, headers=headers)
data = r.text
print "webpage size : ", len(data)/1024

If you downloaded the URL (using a GET request without the stream option), you already have both sizes available, as the whole response is downloaded and decompressed, and the original (compressed) length is available in the headers:

from __future__ import division

r = requests.get(url, headers=headers)
compressed_length = int(r.headers['content-length'])
decompressed_length = len(r.content)

ratio = compressed_length / decompressed_length
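To turn that into the savings figure the question asks for, here is a minimal sketch (the example URL is mine, and it assumes the server sends a Content-Length header, which, as noted below, is not guaranteed):

from __future__ import division
import requests

url = 'http://httpbin.org/gzip'  # example URL; substitute the user's input
r = requests.get(url)
compressed_length = int(r.headers['content-length'])
decompressed_length = len(r.content)
# savings expressed as a percentage of the uncompressed size
savings = (1 - compressed_length / decompressed_length) * 100
print('gzip saves {0:.1f}% ({1} -> {2} bytes)'.format(
    savings, decompressed_length, compressed_length))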

You could also compare the Content-Length header of an Accept-Encoding: identity HEAD request with one setting Accept-Encoding: gzip instead:

no_gzip = {'Accept-Encoding': 'identity'}
no_gzip.update(headers)
uncompressed_length = int(requests.head(url, headers=no_gzip).headers['content-length'])
force_gzip = {'Accept-Encoding': 'gzip'}
force_gzip.update(headers)
compressed_length = int(requests.head(url, headers=force_gzip).headers['content-length'])

However, this may not work for all servers, as servers for dynamically-generated content routinely futz with the Content-Length header in such cases to avoid having to render the content first.

If requesting a chunked transfer-encoding resource, there won't be a Content-Length header, in which case a HEAD request may or may not provide the correct information either.
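You can detect that situation up front before deciding which approach to use; a small sketch (the example URL is mine):

import requests

url = 'http://httpbin.org/gzip'  # example URL
r = requests.head(url, headers={'Accept-Encoding': 'gzip'})
if ('content-length' not in r.headers
        or r.headers.get('Transfer-Encoding') == 'chunked'):
    # no trustworthy Content-Length; fall back to streaming (see below)
    print('Content-Length unavailable; stream the response instead')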

In that case you'd have to stream the whole response and extract the decompressed size from the end of the stream (the gzip format includes a little-endian 4-byte unsigned int at the end holding the uncompressed size). Use the stream() method on the raw urllib3 response object:

import requests
from collections import deque

if hasattr(int, 'from_bytes'):
    # Python 3.2 and up
    _extract_size = lambda q: int.from_bytes(bytes(q), 'little')
else:
    import struct
    _le_int = struct.Struct('<I').unpack
    _extract_size = lambda q: _le_int(b''.join(q))[0]

def get_content_lengths(url, headers=None, chunk_size=2048):
    """Return the compressed and uncompressed lengths for a given URL

    Works for all resources accessible by GET, regardless of transfer-encoding
    and discrepancies between HEAD and GET responses. This does have to
    download the full request (streamed) to determine the sizes.

    """
    only_gzip = {'Accept-Encoding': 'gzip'}
    only_gzip.update(headers or {})
    # Set `stream=True` to ensure we can access the original stream:
    r = requests.get(url, headers=only_gzip, stream=True)
    r.raise_for_status()
    if r.headers.get('Content-Encoding') != 'gzip':
        raise ValueError('Response is not gzip-compressed')
    # we only need the last 4 bytes of the data stream
    last_data = deque(maxlen=4)
    compressed_length = 0
    # stream directly from the urllib3 response so we can ensure the
    # data is not decompressed as we iterate
    for chunk in r.raw.stream(chunk_size, decode_content=False):
        compressed_length += len(chunk)
        last_data.extend(chunk)
    if compressed_length < 4:
        raise ValueError('Not enough data loaded to determine uncompressed size')
    return compressed_length, _extract_size(last_data)

Demo:

>>> compressed_length, decompressed_length = get_content_lengths('http://httpbin.org/gzip')
>>> compressed_length
179
>>> decompressed_length
226
>>> compressed_length / decompressed_length
0.7920353982300885
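As a sanity check on the trailer trick, you can verify the gzip ISIZE field locally (a sketch using only the standard library; gzip.compress() requires Python 3):

import gzip
import struct

data = b'x' * 1000
blob = gzip.compress(data)
# the last 4 bytes of a gzip stream are ISIZE: the uncompressed
# size modulo 2**32, as a little-endian unsigned int
isize = struct.unpack('<I', blob[-4:])[0]
assert isize == len(data) % 2**32  # passes: 1000

Because ISIZE stores the size modulo 2**32, the value extracted this way wraps around for response bodies of 4 GiB or more.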
