google cloud ml - In distributed TensorFlow, how to write summaries from workers as well?


I am using the Google Cloud ML distributed sample for training a model on a cluster of computers. Input and output (i.e. TFRecords, checkpoints, tfevents) are on gs:// (Google Storage).

Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use it for hyperparameter tuning, either within Cloud ML or with my own stack of tools.

But rather than performing a single evaluation on one large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criteria, because I don't want to be limited to a single value. I want information about the spread of the performance. In particular, the variance of the performance is important to me. I'd rather select a model with a lower average performance but better worst cases.

I therefore run several evaluation steps. I would like to parallelize these evaluation steps, because right now only the master is evaluating. When using large clusters, this is a source of inefficiency, and I would like the task workers to evaluate as well.
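To make the intent concrete, here is a minimal sketch of the statistic I am after, reusing the same evaluator/model calls as in the snippet further down (this is an illustration, not the actual training loop):

    import numpy as np

    # Sketch only: collect per-batch accuracies so that both the mean and
    # the variance can be reported, instead of a single aggregated value.
    def evaluation_statistics(evaluator, model, num_batches=10):
        accuracies = []
        for _ in range(num_batches):
            evaluation_values = evaluator.evaluate()
            accuracies.append(model.accuracy_value(evaluation_values))
        return np.mean(accuracies), np.var(accuracies)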

Basically, the supervisor is created like this:

    self.sv = tf.train.Supervisor(
        graph,
        is_chief=self.is_master,
        logdir=train_dir(self.args.output_path),
        init_op=init_op,
        saver=self.saver,
        # Write summary_ops by hand.
        summary_op=None,
        global_step=self.tensors.global_step,
        # No saving; do it manually in order to evaluate afterwards.
        save_model_secs=0)

At the end of training, I call the summary writer as follows:

    # Only on the master for now; this is what I want to change
    if self.is_master and not self.should_stop:

        # I want to have an idea of the statistics of the accuracy,
        # not only the mean, hence I run it on 10 batches
        for i in range(10):
            self.global_step += 1

            # Call the evaluator, and extract the accuracy
            evaluation_values = self.evaluator.evaluate()
            accuracy_value = self.model.accuracy_value(evaluation_values)

            # Dump the accuracy, ready to use within hptune
            eval_summary = tf.Summary(value=[
                tf.Summary.Value(
                    tag='training/hptuning/metric',
                    simple_value=accuracy_value)
            ])

            self.sv.summary_computed(session, eval_summary, self.global_step)

I tried to write summaries from the workers as well and got an error saying that summaries can be written by the master only. Is there an easy way to work around this? The error is: "Writing a summary requires a summary writer."

My guess is that you'd have to create a separate summary writer on each worker yourself, and write out the summaries directly rather than going through the supervisor.
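Roughly along these lines, as a sketch only (the worker-specific log directory and the `task_index` argument are assumptions, not part of the sample):

    import os
    import tensorflow as tf

    # Each worker owns its own FileWriter pointed at a worker-specific
    # subdirectory, so summaries never go through the supervisor.
    worker_logdir = os.path.join(train_dir(self.args.output_path),
                                 'eval_worker_%d' % self.args.task_index)
    summary_writer = tf.summary.FileWriter(worker_logdir)

    eval_summary = tf.Summary(value=[
        tf.Summary.Value(tag='training/hptuning/metric',
                         simple_value=accuracy_value)
    ])
    summary_writer.add_summary(eval_summary, global_step=self.global_step)
    summary_writer.flush()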

I suspect I wouldn't use the supervisor for the eval processing either. Instead, load a session on each worker doing eval from the latest checkpoint, and write out independent summaries.
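Something like the following sketch, reusing the saver and evaluator from the question and the worker-local `summary_writer` from the previous snippet (the session/restore flow here is an assumption, not tested against the Cloud ML sample):

    import tensorflow as tf

    # Supervisor-free eval loop on a worker: restore the latest checkpoint
    # written by the chief, run the evaluation batches, and emit summaries
    # through the worker-local writer.
    checkpoint_path = tf.train.latest_checkpoint(train_dir(self.args.output_path))

    with tf.Session(graph=graph) as session:
        self.saver.restore(session, checkpoint_path)
        for _ in range(10):
            evaluation_values = self.evaluator.evaluate()
            accuracy_value = self.model.accuracy_value(evaluation_values)
            eval_summary = tf.Summary(value=[
                tf.Summary.Value(tag='training/hptuning/metric',
                                 simple_value=accuracy_value)
            ])
            summary_writer.add_summary(eval_summary, self.global_step)
    summary_writer.flush()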

