python - Uncompressed size of a webpage using chunked transfer encoding and gzip compression
I'm writing an application that calculates the savings obtained by using gzip on a web page. When the user inputs the URL of a web page that uses gzip, the application should spit out the savings in size due to gzip.
How should I approach this problem?
These are the headers I get when requesting the page:
{
    'X-Powered-By': 'PHP/5.5.9-1ubuntu4.19',
    'Transfer-Encoding': 'chunked',
    'Content-Encoding': 'gzip',
    'Vary': 'Accept-Encoding',
    'Server': 'nginx/1.4.6 (Ubuntu)',
    'Connection': 'keep-alive',
    'Date': 'Thu, 10 Nov 2016 09:49:58 GMT',
    'Content-Type': 'text/html'
}
I am retrieving the page with requests:

r = requests.get(url, headers=headers)
data = r.text
print "Webpage size : ", len(data) / 1024
If you downloaded the URL with a requests GET request without the stream option, you already have both sizes available: the whole response is downloaded and decompressed, and the original transferred (compressed) length is available in the headers:
from __future__ import division

r = requests.get(url, headers=headers)
compressed_length = int(r.headers['content-length'])
decompressed_length = len(r.content)
ratio = compressed_length / decompressed_length
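To turn those two numbers into the savings figure the question asks for, a small follow-up sketch (not part of the original snippet), reusing the variables computed above:

# Savings: the fraction of bytes avoided thanks to gzip compression
savings = 1 - compressed_length / decompressed_length
print("gzip saved {:.1%} ({} -> {} bytes)".format(
    savings, decompressed_length, compressed_length))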
You could also compare the Content-Length header of an Accept-Encoding: identity HEAD request with one setting Accept-Encoding: gzip instead:
no_gzip = {'Accept-Encoding': 'identity'}
no_gzip.update(headers)
uncompressed_length = int(requests.get(url, headers=no_gzip).headers['content-length'])
force_gzip = {'Accept-Encoding': 'gzip'}
force_gzip.update(headers)
compressed_length = int(requests.get(url, headers=force_gzip).headers['content-length'])
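The snippet above uses GET for brevity; since only the headers are needed, the same comparison can also be sketched with actual HEAD requests. This assumes the server answers HEAD with the same Content-Length it would send for GET, and the helper name head_content_length is made up for illustration:

import requests

def head_content_length(url, encoding, headers=None):
    # Content-Length the server reports for the given Accept-Encoding
    probe = {'Accept-Encoding': encoding}
    probe.update(headers or {})
    response = requests.head(url, headers=probe)
    response.raise_for_status()
    return int(response.headers['Content-Length'])

uncompressed_length = head_content_length(url, 'identity')
compressed_length = head_content_length(url, 'gzip')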
However, this may not work for all servers, as servers for dynamically-generated content routinely futz the Content-Length header in such cases to avoid having to render the content first.
If you are requesting a chunked transfer encoding resource, there won't be a Content-Length header, in which case a HEAD request may or may not provide the correct information either.
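If you are unsure which situation applies, one way to find out is to peek at the response headers before committing to an approach; a rough sketch (not from the original answer), using stream=True so the body is not fetched yet:

import requests

r = requests.get(url, stream=True)  # headers only; the body is not consumed yet
if 'Content-Length' in r.headers:
    print('Content-Length: {}'.format(r.headers['Content-Length']))
else:
    print('No Content-Length; Transfer-Encoding: {}'.format(
        r.headers.get('Transfer-Encoding')))
r.close()  # release the connection without reading the body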
In that case you'd have to stream the whole response and extract the decompressed size from the end of the stream (the gzip format includes a little-endian 4-byte unsigned int at the end). Use the stream() method on the raw urllib3 response object:
import requests
from collections import deque

if hasattr(int, 'from_bytes'):
    # Python 3.2 and up
    _extract_size = lambda q: int.from_bytes(bytes(q), 'little')
else:
    import struct
    _le_int = struct.Struct('<I').unpack
    _extract_size = lambda q: _le_int(b''.join(q))[0]

def get_content_lengths(url, headers=None, chunk_size=2048):
    """Return the compressed and uncompressed lengths for a given URL.

    Works for all resources accessible by GET, regardless of the
    transfer-encoding used and discrepancies between HEAD and GET
    responses. This does have to download the full request (streamed)
    to determine the sizes.

    """
    only_gzip = {'Accept-Encoding': 'gzip'}
    only_gzip.update(headers or {})
    # Set `stream=True` to ensure we can access the original stream:
    r = requests.get(url, headers=only_gzip, stream=True)
    r.raise_for_status()
    if r.headers.get('Content-Encoding') != 'gzip':
        raise ValueError('Response not gzip-compressed')
    # We only need the last 4 bytes of the data stream
    last_data = deque(maxlen=4)
    compressed_length = 0
    # Stream directly from the urllib3 response so we can ensure the
    # data is not decompressed as we iterate
    for chunk in r.raw.stream(chunk_size, decode_content=False):
        compressed_length += len(chunk)
        last_data.extend(chunk)
    if compressed_length < 4:
        raise ValueError('Not enough data loaded to determine uncompressed size')
    return compressed_length, _extract_size(last_data)
Demo:
>>> compressed_length, decompressed_length = get_content_lengths('http://httpbin.org/gzip')
>>> compressed_length
179
>>> decompressed_length
226
>>> compressed_length / decompressed_length
0.7920353982300885
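As a standalone sanity check of the trailer trick used above: per RFC 1952 the last four bytes of a gzip stream (the ISIZE field) hold the uncompressed size as a little-endian unsigned 32-bit integer, modulo 2**32, so the value wraps for payloads of 4 GiB or more. A minimal sketch, assuming Python 3 for gzip.compress:

import gzip
import struct

payload = b'x' * 1000                  # known uncompressed size
blob = gzip.compress(payload)          # gzip-compressed bytes
# ISIZE: last 4 bytes, little-endian unsigned 32-bit integer
isize = struct.unpack('<I', blob[-4:])[0]
print(len(payload), isize)             # 1000 1000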