google cloud ml - In distributed TensorFlow, how to write summaries from workers as well?


I am using the Google Cloud ML distributed sample for training a model on a cluster of computers. Input and output (i.e. TFRecords, checkpoints, tfevents) are on gs:// (Google Storage).

Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use it for hyperparameter tuning, either within Cloud ML or with my own stack of tools.

But rather than performing a single evaluation on one large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information about the spread of the performance. In particular, the variance of the performance is important to me. I'd rather select a model with a lower average performance but better worst cases.

I therefore run several evaluation steps. I would like to parallelize these evaluation steps, because right now only the master is evaluating. When using large clusters, this is a source of inefficiency, and I would like the task workers to evaluate as well.
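To make the intent concrete, here is a minimal sketch of the statistic I am after, reusing the same evaluator/model calls as in the snippet further down (this is an illustration, not the actual training loop):

    import numpy as np

    # Sketch only: collect per-batch accuracies so that both the mean and
    # the variance can be reported, instead of a single aggregated value.
    def evaluation_statistics(evaluator, model, num_batches=10):
        accuracies = []
        for _ in range(num_batches):
            evaluation_values = evaluator.evaluate()
            accuracies.append(model.accuracy_value(evaluation_values))
        return np.mean(accuracies), np.var(accuracies)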

Basically, the supervisor is created like this:

    self.sv = tf.train.Supervisor(
        graph,
        is_chief=self.is_master,
        logdir=train_dir(self.args.output_path),
        init_op=init_op,
        saver=self.saver,
        # Write summary_ops by hand.
        summary_op=None,
        global_step=self.tensors.global_step,
        # No saving; do it manually in order to evaluate afterwards.
        save_model_secs=0)

At the end of training, I call the summary writer as follows:

    # Only on the master for now; this is what I want to change
    if self.is_master and not self.should_stop:

        # I want to have an idea of the statistics of the accuracy,
        # not only the mean, hence I run it on 10 batches
        for i in range(10):
            self.global_step += 1

            # Call the evaluator, and extract the accuracy
            evaluation_values = self.evaluator.evaluate()
            accuracy_value = self.model.accuracy_value(evaluation_values)

            # Dump the accuracy, ready to use within hptune
            eval_summary = tf.Summary(value=[
                tf.Summary.Value(
                    tag='training/hptuning/metric',
                    simple_value=accuracy_value)
            ])

            self.sv.summary_computed(session, eval_summary, self.global_step)

I tried to write summaries from the workers as well and got an error saying that summaries can be written by the master only. Is there an easy way to work around this? The error is: "Writing a summary requires a summary writer."

My guess is that you'd have to create a separate summary writer on each worker yourself, and write out the summaries directly rather than going through the supervisor.
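Roughly along these lines, as a sketch only (the worker-specific log directory and the `task_index` argument are assumptions, not part of the sample):

    import os
    import tensorflow as tf

    # Each worker owns its own FileWriter pointed at a worker-specific
    # subdirectory, so summaries never go through the supervisor.
    worker_logdir = os.path.join(train_dir(self.args.output_path),
                                 'eval_worker_%d' % self.args.task_index)
    summary_writer = tf.summary.FileWriter(worker_logdir)

    eval_summary = tf.Summary(value=[
        tf.Summary.Value(tag='training/hptuning/metric',
                         simple_value=accuracy_value)
    ])
    summary_writer.add_summary(eval_summary, global_step=self.global_step)
    summary_writer.flush()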

I suspect I wouldn't use the supervisor for the eval processing either. Instead, load a session on each worker doing eval from the latest checkpoint, and write out independent summaries.
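Something like the following sketch, reusing the saver and evaluator from the question and the worker-local `summary_writer` from the previous snippet (the session/restore flow here is an assumption, not tested against the Cloud ML sample):

    import tensorflow as tf

    # Supervisor-free eval loop on a worker: restore the latest checkpoint
    # written by the chief, run the evaluation batches, and emit summaries
    # through the worker-local writer.
    checkpoint_path = tf.train.latest_checkpoint(train_dir(self.args.output_path))

    with tf.Session(graph=graph) as session:
        self.saver.restore(session, checkpoint_path)
        for _ in range(10):
            evaluation_values = self.evaluator.evaluate()
            accuracy_value = self.model.accuracy_value(evaluation_values)
            eval_summary = tf.Summary(value=[
                tf.Summary.Value(tag='training/hptuning/metric',
                                 simple_value=accuracy_value)
            ])
            summary_writer.add_summary(eval_summary, self.global_step)
    summary_writer.flush()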

