python - Save data from Dataproc to Datastore
I have implemented a recommendation engine using Python 2.7 on Google Dataproc/Spark, and I need to store the output records in Datastore for subsequent use by App Engine APIs. However, there doesn't seem to be a way to do this directly.
There is no Python Datastore connector for Dataproc as far as I can see. The Python Dataflow SDK doesn't support writing to Datastore (although the Java one does), and MapReduce doesn't have an output writer for Datastore.
That doesn't appear to leave many options. At the moment I think I will have to write the records to Google Cloud Storage and have a separate task running in App Engine harvest them and store them in Datastore. That is not ideal: aligning the two processes has its own difficulties. Roughly, the fallback would look like the sketch below.
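For reference, the Dataproc side of that fallback would be something like this in PySpark (the bucket name and the JSON record format are my own placeholders, not anything I have settled on):

    import json

    # 'records' is the RDD of output records produced by the engine.
    # Dataproc's preinstalled GCS connector lets Spark write straight
    # to gs:// paths, one part-file per partition.
    records.map(json.dumps).saveAsTextFile('gs://my-bucket/recommendations')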
Is there a better way to get data from Dataproc into Datastore?
I succeeded in saving Datastore records from Dataproc. It involved installing additional components on the master VM (via the SSH console).
The App Engine SDK was installed and initialised using:
    sudo apt-get install google-cloud-sdk-app-engine-python
    sudo gcloud init
This places a new google directory under /usr/lib/google-cloud-sdk/platform/google_appengine/.
The Datastore library was then installed via:
    sudo apt-get install python-dev
    sudo apt-get install python-pip
    sudo pip install -t /usr/lib/google-cloud-sdk/platform/google_appengine/ google-cloud-datastore
For reasons I have yet to understand, this installed one level lower than expected, i.e. in /usr/lib/google-cloud-sdk/platform/google_appengine/google/google, and for my purposes it was necessary to manually move the components up one level in the path.
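If I recall the layout correctly, the move was along these lines (check the contents of the doubled google directory first, since they may differ on your VM):

    sudo mv /usr/lib/google-cloud-sdk/platform/google_appengine/google/google/* \
            /usr/lib/google-cloud-sdk/platform/google_appengine/google/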
To enable the interpreter to find the code, I had to add /usr/lib/google-cloud-sdk/platform/google_appengine/ to the path. The usual bash tricks weren't being sustained, so I ended up doing it at the start of the recommendation engine itself, as shown below.
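A minimal sketch of that in Python (the path is the one from the install step above):

    import sys

    # Make the App Engine copy of the libraries importable before
    # anything else tries to import google.*
    sys.path.insert(0, '/usr/lib/google-cloud-sdk/platform/google_appengine/')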
Because of the large amount of data to be stored, I spent a lot of time attempting to save via MapReduce, but came to the conclusion that too many of the required services are missing on Dataproc. Instead I am using a multiprocessing pool, which achieves acceptable performance. A sketch of that approach follows.
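For illustration, a minimal sketch of the pool-based save, assuming the records are plain dicts and using a hypothetical Recommendation kind; it batches writes with put_multi, which accepts at most 500 entities per call:

    from multiprocessing import Pool
    from google.cloud import datastore

    def save_batch(records):
        # Each worker process creates its own client; clients should
        # not be shared across fork boundaries.
        client = datastore.Client()
        entities = []
        for record in records:
            entity = datastore.Entity(key=client.key('Recommendation'))
            entity.update(record)
            entities.append(entity)
        # Datastore allows up to 500 entities per commit.
        for i in range(0, len(entities), 500):
            client.put_multi(entities[i:i + 500])

    def save_all(records, batch_size=1000, processes=8):
        batches = [records[i:i + batch_size]
                   for i in range(0, len(records), batch_size)]
        pool = Pool(processes)
        pool.map(save_batch, batches)
        pool.close()
        pool.join()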