python - Pandas: Querying a DataFrame - multiple criteria -
i have large dataframe 2 of columns are:
- a question code (e.g q453)
- true or false denote whether or not question answered correctly
i not know possible question codes , each question has been answered multiple times. like:
q666 true
q463 true
q056 true
q666 false
i wish know question easiest, precisely question has highest ratio of attempts correct answers. steps achieve using pandas?
i made sample of data post. you're looking for?
i made true/false indicators lowercase strings; not sure stored in file. if you're on python2, believe can change from io import stringio from stringio import stringio.
in [105]: import pandas pd ...: io import stringio ...: ...: data = """q666 true ...: ...: q463 true ...: ...: q056 true ...: ...: q666 false ...: q666 true ...: ...: ...: q463 true ...: ...: q056 true ...: q666 false ...: q463 false ...: q666 false""" ...: in [106]: df = pd.read_csv(stringio(data), sep=" ", header=none) in [107]: df.columns = [['question', 'answer']] in [108]: df['answer'] = df['answer'].astype("str").apply(lambda x: x.lower()) in [109]: df out[109]: question answer 0 q666 true 1 q463 true 2 q056 true 3 q666 false 4 q666 true 5 q463 true 6 q056 true 7 q666 false 8 q463 false 9 q666 false assign constant value to dataframe can count groupby our conditions, move multiindex need it.
in [110]: piv = (df ...: .assign(val = 1) ...: .groupby(["question", "answer"]) ...: .count() ...: .unstack()['val']) it's easy here...
in [111]: piv['ratio'] = piv['true'] / piv.sum(axis=1) in [112]: piv.sort_values("ratio") out[112]: answer false true ratio question q666 3.0 2.0 0.400000 q463 1.0 2.0 0.666667 q056 nan 2.0 1.000000
Comments
Post a Comment