Python - finding the earliest occurrence
I'm running into trouble with this: I need to find the first time a user clicks on an email (variable sending), and put a 1 in the respective row when that occurs.
The dataset has several thousand users (hashed) who click on part of an email in a newsletter. I tried to group them by sending and hash and find the earliest date, but I could not make it work.
So I went for a slightly nasty solution, which returns a strange thing.
My dataset (relevant variables):
    >>> clicks[['datetime','hash','sending']].head()
                 datetime                              hash  sending
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5
    3 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5
    4 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5
There are 6 sending rounds, and datetime has dtype datetime64[ns].
My way of doing it is as follows:
    clicks['first'] = 0
    for hash in clicks['hash'].unique():
        t = clicks.ix[clicks.hash==hash, ['hash','datetime','sending']]
        part = t['sending'].unique()
        for i in part:
            temp = t.ix[t.sending == i,'datetime']
            clicks.ix[t[t.datetime == np.min(temp)].index.values,'first']=1
First of all, I don't think it is Pythonic, and it is quite slow. But it also returns a weird type! There are 0.0 and 1.0 values, but I cannot work with them:
    >>> type(clicks.first)
    <type 'instancemethod'>
    >>> clicks.loc[clicks.first==1]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1296, in __getitem__
        return self._getitem_axis(key, axis=0)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 1467, in _getitem_axis
        return self._get_label(key, axis=axis)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/indexing.py", line 93, in _get_label
        return self.obj._xs(label, axis=axis)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1749, in xs
        loc = self.index.get_loc(key)
      File "/Users/air/anaconda/lib/python2.7/site-packages/pandas/indexes/base.py", line 1947, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)
      File "pandas/index.pyx", line 156, in pandas.index.IndexEngine.get_loc (pandas/index.c:3977)
      File "pandas/index.pyx", line 373, in pandas.index.Int64Engine._check_type (pandas/index.c:7634)
    KeyError: False
So, any ideas, please? Thanks a lot!
----- update: ------
    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 2.7.12.final.0
    python-bits: 64
    OS: Darwin
    OS-release: 15.6.0
    machine: x86_64
    processor: i386
    byteorder: little
    LC_ALL: None
    LANG: en_US.UTF-8

    pandas: 0.18.1
I think you need groupby with apply, comparing the values with the minimal one in each group. The output is boolean, so you need to cast it to int (0 and 1) with astype:
    import pandas as pd

    clicks = pd.DataFrame({
        'hash': {0: '0b1f4745df5925dfb1c8f53a56c43995',
                 1: '0a73d5953ebf5826fbb7f3935bad026d',
                 2: '605cebbabe0ba1b4248b3c54c280b477',
                 3: '0b1f4745df5925dfb1c8f53a56c43995',
                 4: '0a73d5953ebf5826fbb7f3935bad026d',
                 5: '605cebbabe0ba1b4248b3c54c280b477',
                 6: 'd26d61fb10c834292803b247a05b6cb7',
                 7: '48f8ab83e8790d80af628e391f3325ad'},
        'sending': {0: 5, 1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5},
        'datetime': {0: pd.Timestamp('2016-11-01 19:13:34'),
                     1: pd.Timestamp('2016-11-01 10:47:14'),
                     2: pd.Timestamp('2016-10-31 19:09:21'),
                     3: pd.Timestamp('2016-11-01 19:13:34'),
                     4: pd.Timestamp('2016-11-01 11:47:14'),
                     5: pd.Timestamp('2016-10-31 19:09:20'),
                     6: pd.Timestamp('2016-10-31 13:42:36'),
                     7: pd.Timestamp('2016-10-31 10:46:30')}})

    print (clicks)
                 datetime                              hash  sending
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5
    3 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5
    4 2016-11-01 11:47:14  0a73d5953ebf5826fbb7f3935bad026d        5
    5 2016-10-31 19:09:20  605cebbabe0ba1b4248b3c54c280b477        5
    6 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5
    7 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5
    # if the dtype of column datetime is not datetime (not necessary with this sample)
    clicks.datetime = pd.to_datetime(clicks.datetime)

    # compare each click with the earliest one in its (hash, sending) group
    clicks['first'] = clicks.groupby(['hash','sending'])['datetime'] \
                            .apply(lambda x: x == x.min()) \
                            .astype(int)

    print (clicks)
                 datetime                              hash  sending  first
    0 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5      1
    1 2016-11-01 10:47:14  0a73d5953ebf5826fbb7f3935bad026d        5      1
    2 2016-10-31 19:09:21  605cebbabe0ba1b4248b3c54c280b477        5      0
    3 2016-11-01 19:13:34  0b1f4745df5925dfb1c8f53a56c43995        5      1
    4 2016-11-01 11:47:14  0a73d5953ebf5826fbb7f3935bad026d        5      0
    5 2016-10-31 19:09:20  605cebbabe0ba1b4248b3c54c280b477        5      1
    6 2016-10-31 13:42:36  d26d61fb10c834292803b247a05b6cb7        5      1
    7 2016-10-31 10:46:30  48f8ab83e8790d80af628e391f3325ad        5      1
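A possible variation on the groupby idea, as a minimal sketch rather than a definitive solution: transform('min') broadcasts each group's earliest datetime back onto the original rows, so the comparison stays aligned with the frame's index. The sample rows and the shortened hash values below are made up for illustration. The sketch also selects the flagged rows with clicks['first'] instead of clicks.first, since the question's output shows that clicks.first resolved to a method rather than the column, which makes clicks.first == 1 evaluate to plain False and explains the KeyError: False.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data
# (hash values shortened for readability; not the asker's real data).
clicks = pd.DataFrame({
    'hash': ['a', 'b', 'a', 'b', 'c'],
    'sending': [5, 5, 5, 5, 5],
    'datetime': pd.to_datetime([
        '2016-11-01 19:13:34',
        '2016-11-01 10:47:14',
        '2016-10-31 19:09:21',
        '2016-11-01 11:47:14',
        '2016-10-31 13:42:36',
    ]),
})

# Earliest click per (hash, sending), repeated on every row of its group.
earliest = clicks.groupby(['hash', 'sending'])['datetime'].transform('min')

# Boolean comparison cast to 0/1.
clicks['first'] = (clicks['datetime'] == earliest).astype(int)

# Bracket access always returns the column named 'first';
# attribute access (clicks.first) can collide with a DataFrame method.
first_clicks = clicks.loc[clicks['first'] == 1]
print(first_clicks)
```

If two clicks in the same group share the earliest timestamp, both rows are flagged with 1, which matches the behaviour of the apply version above.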