python - Save data from Dataproc to Datastore


I have implemented a recommendation engine using Python 2.7 on Google Dataproc/Spark, and I need to store the output records in Datastore for subsequent use by App Engine APIs. However, there doesn't seem to be a way to do this directly.

There is no Python Datastore connector for Dataproc as far as I can see. The Python Dataflow SDK doesn't support writing to Datastore (although the Java one does), and MapReduce doesn't have an output writer for Datastore.

That doesn't appear to leave many options. At the moment I think I will have to write the records to Google Cloud Storage and have a separate task running in App Engine harvest them and store them in Datastore. That is not ideal: aligning the two processes has its own difficulties.
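The Cloud Storage hand-off described above could be sketched as below. The bucket name and record fields are hypothetical, and the Spark call is shown only in a comment so the sketch runs offline: each record becomes one JSON line that the App Engine harvesting task would later parse.

```python
import json

def to_json_line(record):
    """Serialize one recommendation record as a JSON line for the GCS hand-off."""
    return json.dumps(record, sort_keys=True)

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"user_id": 1, "item_id": 42, "score": 0.9},
    {"user_id": 2, "item_id": 7, "score": 0.4},
]

lines = [to_json_line(r) for r in records]

# In the Spark job this would be roughly:
#   sc.parallelize(records).map(to_json_line).saveAsTextFile("gs://my-bucket/recs/")
# and a separate App Engine task would read the gs:// files and write the entities.
```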

Is there a better way to get data from Dataproc into Datastore?

I succeeded in saving Datastore records from Dataproc. It involved installing additional components on the master VM (via an SSH console).

The App Engine SDK was installed and initialised using:

sudo apt-get install google-cloud-sdk-app-engine-python
sudo gcloud init

This places a new google directory under /usr/lib/google-cloud-sdk/platform/google_appengine/.

The Datastore library was installed via:

sudo apt-get install python-dev
sudo apt-get install python-pip
sudo pip install -t /usr/lib/google-cloud-sdk/platform/google_appengine/ google-cloud-datastore

For reasons I have yet to understand, this installed one level lower than expected, i.e. in /usr/lib/google-cloud-sdk/platform/google_appengine/google/google, so for my purposes it was necessary to manually move the components up one level in the path.

To enable the interpreter to find the code, I had to add /usr/lib/google-cloud-sdk/platform/google_appengine/ to the path. The usual bash tricks weren't being sustained, so I ended up doing it at the start of the recommendation engine.
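Adjusting the path at the start of the engine can be done with sys.path; a minimal sketch (whether to prepend or append, and where exactly in the script to do it, are judgment calls):

```python
import sys

# Directory where the App Engine SDK and the relocated datastore library live.
APPENGINE_SDK_PATH = "/usr/lib/google-cloud-sdk/platform/google_appengine/"

# Prepend rather than append so this copy wins over any system-installed one.
if APPENGINE_SDK_PATH not in sys.path:
    sys.path.insert(0, APPENGINE_SDK_PATH)

# After this, `from google.cloud import datastore` should resolve against
# the copy installed under the SDK directory.
```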

Because of the large amount of data to be stored, I spent a lot of time attempting to save it via MapReduce, but came to the conclusion that too many of the required services are missing on Dataproc. Instead I am using a multiprocessing pool, achieving acceptable performance.

