python - finding the earliest occurrence


I'm running into trouble with this: I need to find the first time a user clicks on an email (variable sending), and put a 1 in the respective row when that occurs.

The dataset has several thousand users (hashed) who click on part of an email in a newsletter. I tried to group them by sending and hash and find the earliest date, but could not make it work.

So I went for a slightly nasty solution, which returns a strange thing:

My dataset (relevant variables):

    >>> clicks[['datetime','hash','sending']].head()
                 datetime                              hash  sending
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5
    3 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5
    4 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5

There are 6 sending rounds, and datetime is datetime64[ns].

My way of doing it is as follows:

    clicks['first'] = 0

    for hash in clicks['hash'].unique():
        t = clicks.ix[clicks.hash==hash, ['hash','datetime','sending']]
        part = t['sending'].unique()

        for i in part:
            temp = t.ix[t.sending == i,'datetime']
            clicks.ix[t[t.datetime == np.min(temp)].index.values,'first']=1

First of all, I don't think this is Pythonic, and it is quite slow. But it also returns a weird type! There are 0.0 and 1.0 values, but I cannot work with them:

    >>> type(clicks.first)
    <type 'instancemethod'>

    >>> clicks.loc[clicks.first==1]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1296, in __getitem__
        return self._getitem_axis(key, axis=0)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1467, in _getitem_axis
        return self._get_label(key, axis=axis)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 93, in _get_label
        return self.obj._xs(label, axis=axis)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1749, in xs
        loc = self.index.get_loc(key)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
      File "pandas/index.pyx", line 156, in pandas.index.IndexEngine.get_loc (pandas/index.c:3977)
      File "pandas/index.pyx", line 373, in pandas.index.Int64Engine._check_type (pandas/index.c:7634)
    KeyError: False

So, any ideas, please? Thanks a lot!

----- update: ------

    installed versions
    ------------------
    commit: none
    python: 2.7.12.final.0
    python-bits: 64
    os: darwin
    os-release: 15.6.0
    machine: x86_64
    processor: i386
    byteorder: little
    lc_all: none
    lang: en_us.utf-8

    pandas: 0.18.1
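On the KeyError in the traceback above: the likely cause is that `first` is also the name of a DataFrame method in this pandas version. That would make `clicks.first` return the bound method (hence `instancemethod`), so `clicks.first == 1` evaluates to a plain `False` and `clicks.loc[False]` looks for a label `False` that does not exist. A minimal sketch, with made-up data, of using bracket access so the column is reached reliably:

```python
import pandas as pd

# Hypothetical toy frame; only the column name 'first' matters here.
clicks = pd.DataFrame({
    "hash": ["a", "a", "b"],
    "first": [1, 0, 1],
})

# Bracket access always returns the column as a Series.
col = clicks["first"]
print(type(col))

# Attribute access can be shadowed by a DataFrame method of the same name
# (as DataFrame.first does in pandas 0.18.1), so it may not be the column.
attr = getattr(clicks, "first")
print(callable(attr))  # True means a method came back, not the column

# Filter through bracket access instead of clicks.first == 1.
first_clicks = clicks.loc[clicks["first"] == 1]
print(first_clicks)
```

Renaming the column to something that is not a DataFrame attribute (for example `is_first`, a name chosen here only for illustration) avoids the clash altogether.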

I think you need groupby with apply to compare each value with the group minimum, which outputs a boolean; then you need to cast it to int (0 and 1) with astype:

    import pandas as pd

    # sample data: several clicks per user (hash) within one sending round
    clicks = pd.DataFrame({
        'hash': {0: '0b1f4745df5925dfb1c8f53a56c43995',
                 1: '0a73d5953ebf5826fbb7f3935bad026d',
                 2: '605cebbabe0ba1b4248b3c54c280b477',
                 3: '0b1f4745df5925dfb1c8f53a56c43995',
                 4: '0a73d5953ebf5826fbb7f3935bad026d',
                 5: '605cebbabe0ba1b4248b3c54c280b477',
                 6: 'd26d61fb10c834292803b247a05b6cb7',
                 7: '48f8ab83e8790d80af628e391f3325ad'},
        'sending': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5},
        'datetime': {0: pd.Timestamp('2016-11-01 19:13:34'),
                     1: pd.Timestamp('2016-11-01 10:47:14'),
                     2: pd.Timestamp('2016-10-31 19:09:21'),
                     3: pd.Timestamp('2016-11-01 19:13:34'),
                     4: pd.Timestamp('2016-11-01 11:47:14'),
                     5: pd.Timestamp('2016-10-31 19:09:20'),
                     6: pd.Timestamp('2016-10-31 13:42:36'),
                     7: pd.Timestamp('2016-10-31 10:46:30')}})
    print (clicks)
                 datetime                              hash  sending
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5
    3 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    4 2016-11-01 11:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    5 2016-10-31 19:09:20  605cebbabe0ba1b4248b3c54c280b477        5
    6 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5
    7 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5
    # if the dtype of column datetime is not datetime (not necessary with this sample)
    clicks.datetime = pd.to_datetime(clicks.datetime)

    # flag rows whose datetime equals the minimum of their (hash, sending) group
    clicks['first'] = clicks.groupby(['hash','sending'])['datetime'] \
                            .apply(lambda x: x == x.min()) \
                            .astype(int)
    print (clicks)
                 datetime                              hash  sending  first
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5      1
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5      1
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5      0
    3 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5      1
    4 2016-11-01 11:47:14  0a73d5953ebf5826fbb7f3935bad026d        5      0
    5 2016-10-31 19:09:20  605cebbabe0ba1b4248b3c54c280b477        5      1
    6 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5      1
    7 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5      1
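Note that in the output above rows 0 and 3 both get a 1, because their timestamps are identical. The same comparison can also be written with `transform('min')`, which returns a result aligned to the original index, and if only one row per group should be flagged, `idxmin` is one option. The following is a sketch under assumptions: the data is a made-up miniature with shortened hashes, and the column name `first_single` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the clicks data (shortened hashes).
clicks = pd.DataFrame({
    "hash": ["u1", "u2", "u1", "u2", "u1"],
    "sending": [5, 5, 5, 5, 6],
    "datetime": pd.to_datetime([
        "2016-11-01 19:13:34",
        "2016-11-01 10:47:14",
        "2016-11-01 19:13:34",  # ties with row 0
        "2016-11-01 11:47:14",
        "2016-11-02 08:00:00",
    ]),
})

grouped = clicks.groupby(["hash", "sending"])["datetime"]

# Variant 1: flag every row equal to its group's minimum (ties all get 1).
clicks["first"] = (clicks["datetime"] == grouped.transform("min")).astype(int)

# Variant 2: flag exactly one row per group, the one idxmin points at.
clicks["first_single"] = 0
clicks.loc[grouped.idxmin(), "first_single"] = 1

print(clicks)
```

Which variant fits depends on whether duplicate timestamps for the same user and sending should count as one first click or several.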

