C# - Slow execution of U-SQL
I have created a simple script that computes match scores between two strings. Please find the U-SQL script and the backend .NET code below.
cn_matcher.usql:
REFERENCE ASSEMBLY master.fuzzystring;

@searchlog =
    EXTRACT id int,
            input_cn string,
            output_cn string
    FROM "/cn_matcher/input/sample.txt"
    USING Extractors.Tsv();

@cleanscheck =
    SELECT id,
           input_cn,
           output_cn,
           cn_validator.trial.cleanser(input_cn) AS input_cn_cleansed,
           cn_validator.trial.cleanser(output_cn) AS output_cn_cleansed
    FROM @searchlog;

@checkdata =
    SELECT id,
           input_cn,
           output_cn,
           input_cn_cleansed,
           output_cn_cleansed,
           cn_validator.trial.hamming(input_cn_cleansed, output_cn_cleansed) AS hammingscore,
           cn_validator.trial.levinstiendistance(input_cn_cleansed, output_cn_cleansed) AS levinstiendistance,
           FuzzyString.ComparisonMetrics.JaroWinklerDistance(input_cn_cleansed, output_cn_cleansed) AS jarowinklerdistance
    FROM @cleanscheck;

OUTPUT @checkdata
TO "/cn_matcher/cn_full_run.txt"
USING Outputters.Tsv();
cn_matcher.usql.cs:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace cn_validator
{
    public static class trial
    {
        public static string cleanser(string val)
        {
            List<string> wordstoremove = "l.p. registered pc bldg pllc lp. l.c. div. national l p l.l.c international r. limited school azioni joint co-op corporation corp., (corp) inc., societa company llp liability l.l.l.p llc bancorporation manufacturing c dst (inc) jv ltd. llc. technology ltd., s.a. mfg rllp incorporated per venture l.l.p c. p.l.l.c l.p.. p. partnership corp co-operative s.p.a tech schl bancorp association lllp n r ltd inc. l.l.p. p.c. co district int intl assn. sa inc l.p co, co. division lc intl. lp professional corp. l. l.l.c. building r.l.l.p co.,".Split(' ').ToList();
            return string.Join(" ", val.ToLower().Split(' ').Except(wordstoremove));
        }

        public static int hamming(string source, string target)
        {
            int distance = 0;
            if (source.Length == target.Length)
            {
                for (int i = 0; i < source.Length; i++)
                {
                    if (!source[i].Equals(target[i]))
                    {
                        distance++;
                    }
                }
                return distance;
            }
            else
            {
                return 99999;
            }
        }

        public static int levinstiendistance(string source, string target)
        {
            int n = source.Length;
            int m = target.Length;
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost

            // step 1
            if (n == 0) return m;
            if (m == 0) return n;

            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;

            for (int i = 1; i <= n; i++)
            {
                for (int j = 1; j <= m; j++)
                {
                    cost = (target.Substring(j - 1, 1) == source.Substring(i - 1, 1) ? 0 : 1);
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
                }
            }
            return d[n, m];
        }
    }
}
I ran a sample batch of 100 inputs, with parallelism set to 1 and priority set to 1000. The job completed in 1.6 minutes.
I then wanted to test the same job with 1000 inputs, again with parallelism 1 and priority 1000. Since 100 inputs took 1.6 minutes, I expected 1000 inputs to take around 20 minutes, but the job has been running for more than 50 minutes and I did not see any progress.
So I added another 100-input job and tested it; it ran the same as the previous time. I then thought of increasing the parallelism, so I increased it to 3 and ran the 1000-input job again, but it did not complete even after 1 hour.
job_id=07c0850d-0770-4430-a288-5cddcfc26699
The main issue is that I am not able to see any progress or status.
Please let me know if I am doing something wrong.
Is there any way to use a constructor in U-SQL? If I could, I would not need to repeat the same cleansing steps again and again.
I assume you are using the file set syntax to specify the 1000 files? Unfortunately, the current default implementation of file sets does not scale well, and the compilation (preparation) phase is going to take a long time (as will the execution). We have a better implementation in preview. Can you please send me a mail at usql @ microsoft dot com, and I will tell you how you can try out the preview implementation.
Thanks, Michael
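On the constructor question raised above, here is a minimal sketch of one possible approach. It assumes that U-SQL code-behind is compiled as ordinary C#, so that a static constructor runs once when the type is first used rather than once per row. The namespace `cn_validator_sketch`, the class `CachedCleanser`, and the shortened word list are hypothetical and only for illustration; the full list from the question would go in its place.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace cn_validator_sketch // hypothetical namespace, not from the original post
{
    public static class CachedCleanser // hypothetical class name
    {
        // Built once in the static constructor instead of on every call.
        private static readonly HashSet<string> WordsToRemove;

        static CachedCleanser()
        {
            // Shortened word list for illustration only.
            WordsToRemove = new HashSet<string>(
                "l.p. registered pc bldg pllc inc. ltd llc corp".Split(' '));
        }

        public static string Cleanser(string val)
        {
            // Same logic as the cleanser in the question: lowercase, split on
            // spaces, drop the listed words (Except also drops duplicate words).
            return string.Join(" ", val.ToLower().Split(' ').Except(WordsToRemove));
        }
    }
}
```

With this shortened list, `CachedCleanser.Cleanser("Acme Corp")` returns `"acme"`. The saving is limited to not re-splitting the word list for every row, so it is unlikely to be what keeps the job from finishing.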