Detecting character encoding on direct file uploads in PHP
On my site I allow direct text file uploads. These files are stored on the server and displayed on the website. I use UTF-8 on the site.
Now I run into trouble when people upload non-UTF-8 files that contain special characters, such as é.
I've been doing some testing. I made 2 text files, both containing the same word: fiancée. One is encoded as UTF-8 and the other as ISO 8859-2.
The UTF-8 one uploads fine and shows the text correctly, but the ISO 8859-2 one shows fianc�e.
Now I've tried to detect the encoding of the uploaded file content with mb_detect_encoding, but whatever file I throw at it, it detects UTF-8.
I noticed I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF-8 files. Since I cannot detect the non-UTF-8 files, I cannot use the utf8_encode function, because it messes up files that are already valid UTF-8.
I hope this makes sense :)
So my question is: how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them?
You cannot. Welcome to encodings.
Seriously though, files are just binary blobs. The bits and bytes in a file could mean anything at all: images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files, that means which encoding you interpret them with. There's nothing in the files that tells you the correct encoding; you have to know it. Typically you would want to know it from metadata accompanying the file. In the case of random user uploads, though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".
The next step is to guess, which is not foolproof. You can rule out some encodings: for example, if the file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), it cannot be UTF-8. However, any file in a single-byte encoding will also validate as any other single-byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way, because the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically, you would need a statistical language analyser that can tell you that this character shouldn't show up in that word to be grammatical. For that to work you need to know which language is used in the file, or you need to detect that first… and it is hardly foolproof.
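The "rule out UTF-8 first" step can be sketched roughly as follows. This is only a minimal sketch: the helper name to_utf8 and the $fallback parameter are made up for illustration, and the fallback encoding is a guess, not something detected from the file. It uses mb_convert_encoding rather than utf8_encode so that the assumed source encoding is spelled out.

```php
<?php
// Hypothetical helper: return UTF-8 text, converting only when the bytes
// are NOT already valid UTF-8. $fallback is an assumption about the
// uploader's encoding; nothing in the file confirms it.
function to_utf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave it untouched
    }
    // Not valid UTF-8: interpret the bytes as the fallback encoding.
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}

$singleByte = "fianc\xE9e";     // "fiancée" with é stored as the single byte 0xE9
$utf8       = "fianc\xC3\xA9e"; // "fiancée" with é stored as UTF-8 (0xC3 0xA9)

var_dump(to_utf8($singleByte) === $utf8); // converted
var_dump(to_utf8($utf8) === $utf8);       // left alone, no double encoding
```

This avoids mangling files that are already UTF-8, but it says nothing about which single-byte encoding a non-UTF-8 file really uses.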
The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing on which encodings can be ruled out, then ask the user which of a bunch of possible encodings the file is in. Present them with the result, i.e. what the file looks like when interpreted in the chosen encoding, and let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.
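One way the "show the user a preview per encoding" idea might look, as a rough sketch. The function name encoding_previews, the candidate list and the preview length are all assumptions chosen for illustration; a real form would render these previews and let the uploader pick one.

```php
<?php
// Hypothetical helper: decode the same bytes under each candidate encoding
// so the uploader can choose the preview that reads correctly.
function encoding_previews(string $data, array $candidates, int $length = 200): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        if (!mb_check_encoding($data, $encoding)) {
            continue; // the bytes are not valid in this encoding, rule it out
        }
        $text = mb_convert_encoding($data, 'UTF-8', $encoding);
        // Keep only the first $length characters as the preview.
        $previews[$encoding] = mb_substr($text, 0, $length, 'UTF-8');
    }
    return $previews;
}

// Example candidate list (an assumption; pick what fits your audience).
$previews = encoding_previews("fianc\xE9e", ['UTF-8', 'ISO-8859-1', 'ISO-8859-2']);
foreach ($previews as $encoding => $preview) {
    echo $encoding, ': ', $preview, "\n";
}
```

UTF-8 drops out of the list here because the bytes do not validate, while both single-byte candidates remain. Only the user can settle which of those is right.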