scala - How to map on a Spark Dataset partition elements? -


i have list of event log entries this: (user_id, timestamp). entries stored on hive table, , partitioned date of timestamp.

now, want create sessions out of events. session collection of events belong single user. if there gap in user activity of 30 minutes, assume there's new session. have method looks this:

def sessionize(events: list[trackingevent]): map[integer, list[usersession]] = {     val eventsbyuser = events.sortwith((a, b) => a.timestamp.before(b.timestamp)).groupby(_.userid)     val sessionsbyuser: mutablemap[integer, list[usersession]] = mutablemap()     ((userid, eventlist) <- eventsbyuser) {         val sessions: mutablelist[usersession] = mutablelist()         (event <- eventlist) {             sessions.lastoption match {                 case none => sessions += usersession.fromevent(event)                 case some(lastsession) if event.belongstosession(lastsession) => lastsession.includeevent(event)                 case some(_) => sessions += usersession.fromevent(event)             }             sessionsbyuser(userid) = sessions.tolist         }     }     sessionsbyuser.tomap } 

the problem code needs all events of single day work, should fine, because files partitioned this. nevertheless, spark still doing lot of shuffling. there better way this?

thanks!


Comments

Popular posts from this blog

aws api gateway - SerializationException in posting new Records via Dynamodb Proxy Service in API -

depending on nth recurrence of job in control M -

asp.net - Problems sending emails from forum -