google cloud ml - In distributed TensorFlow, how to write summaries from the workers as well
I am using the Google Cloud ML distributed sample for training a model on a cluster of machines. Input and output (i.e. tfrecords, checkpoints, tfevents) are on gs:// (Google Storage).
Similarly to the distributed sample, I use an evaluation step that is called at the end, and the result is written as a summary, in order to use it for parameter hypertuning, either within Cloud ML or using my own stack of tools.
But rather than performing a single evaluation on one large batch of data, I am running several evaluation steps, in order to retrieve statistics on the performance criterion, because I don't want to be limited to a single value. I want information about the performance over an interval. In particular, the variance of the performance is important to me. I'd rather select a model with a lower average performance but better worst cases.
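For example, this is the kind of selection criterion I have in mind (a minimal sketch with illustrative values):

    import numpy as np

    # Accuracies collected over several evaluation steps (illustrative values).
    accuracies = np.array([0.91, 0.88, 0.93, 0.85, 0.90])

    mean_acc = accuracies.mean()    # average performance
    var_acc = accuracies.var()      # spread across evaluation batches
    worst_acc = accuracies.min()    # worst-case performance

    # A model with a slightly lower mean but a smaller variance / higher
    # worst case may be preferable to one that only maximizes the mean.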
I therefore run several evaluation steps. I would like to parallelize these evaluation steps, because right now only the master is evaluating. When using large clusters, this is a source of inefficiency, and I would like the task workers to evaluate as well.
Basically, the supervisor is created as follows:
    self.sv = tf.train.Supervisor(
        graph,
        is_chief=self.is_master,
        logdir=train_dir(self.args.output_path),
        init_op=init_op,
        saver=self.saver,
        # We write the summary_ops by hand.
        summary_op=None,
        global_step=self.tensors.global_step,
        # No automatic saving; we do it manually in order to easily
        # evaluate afterwards.
        save_model_secs=0)
At the end of the training I call the summary writer, like this:
    # Runs on the master only; this is the condition I want to remove.
    if self.is_master and not self.should_stop:
        # I want to have an idea of the statistics of the accuracy,
        # not only its mean, hence I run it on 10 batches.
        for i in range(10):
            self.global_step += 1
            # Call the evaluator, and extract the accuracy.
            evaluation_values = self.evaluator.evaluate()
            accuracy_value = self.model.accuracy_value(evaluation_values)
            # Dump the accuracy in a summary, ready to use within hptune.
            eval_summary = tf.Summary(value=[
                tf.Summary.Value(
                    tag='training/hptuning/metric',
                    simple_value=accuracy_value)
            ])
            self.sv.summary_computed(session, eval_summary, self.global_step)
I tried to write summaries from the workers as well, and I got an error: summaries can be written by the master only. Is there an easy way to work around this? The exact error is: "Writing a summary requires a summary writer."
My guess is that you'd have to create a separate summary writer on each worker yourself, and write out the summaries directly rather than going through the supervisor. Something like the sketch below.
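A minimal sketch of that idea, assuming TF 1.x; `output_path`, `task_index`, `accuracy_value`, and `global_step` stand in for whatever your code already has:

    import tensorflow as tf

    # Give each worker its own event directory so the workers don't
    # clobber the master's event file; the path scheme is illustrative.
    worker_logdir = '%s/eval_worker_%d' % (output_path, task_index)
    summary_writer = tf.summary.FileWriter(worker_logdir)

    eval_summary = tf.Summary(value=[
        tf.Summary.Value(tag='training/hptuning/metric',
                         simple_value=accuracy_value)
    ])
    summary_writer.add_summary(eval_summary, global_step)
    summary_writer.flush()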
I also suspect I wouldn't use the supervisor for the eval processing at all. Instead, load a session on each worker that is doing eval from the latest checkpoint, and write out independent summaries, roughly as sketched below.
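Roughly like this (a sketch, not tested; it assumes you have built the eval graph with an `accuracy_op`, a `saver` over the model variables, a `checkpoint_dir` matching the training output, and the per-worker `summary_writer` from above):

    import tensorflow as tf

    # Restore the latest checkpoint into a plain session, outside the
    # supervisor, and run the evaluation there.
    checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)
    with tf.Session() as session:
        saver.restore(session, checkpoint_path)
        accuracy_value = session.run(accuracy_op)
        eval_summary = tf.Summary(value=[
            tf.Summary.Value(tag='training/hptuning/metric',
                             simple_value=accuracy_value)
        ])
        summary_writer.add_summary(eval_summary, global_step)
        summary_writer.flush()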