@hainingwyx 2016-11-27T13:50:57.000000Z

Introduction to pandas

Python pandas

Getting started with pandas

from __future__ import division

import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.random import randn
from pandas import Series, DataFrame

np.random.seed(12345)  # make the random examples reproducible
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4)

%pwd

u'C:\\Users\\WangYixin\\Desktop\\pydata-book-master'

Introduction to pandas data structures

Series

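A Series consists of an array of data and an associated array of data labels, called its index. The simplest Series is formed from an array of data alone.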

obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

obj.values
obj.index

RangeIndex(start=0, stop=4, step=1)

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2.index

Index([u'd', u'b', u'a', u'c'], dtype='object')

obj2['a']  # select a value by index label

-5

obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

obj2 * 2  # NumPy array operations preserve the index

d    12
b    14
a   -10
c     6
dtype: int64

np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

'b' in obj2  # a Series maps index labels to data values, like a dict

True

'e' in obj2

False

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}  # a Series can be created directly from a dict
obj3 = Series(sdata)
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

pd.isnull(obj4)  # detect missing data

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj3 + obj4  # data alignment: values are matched by index label

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
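
The alignment shown above can be reproduced in a small self-contained sketch. The names `a`, `b`, `total` and `total_filled` are illustrative and do not appear in the original notes; the `add(..., fill_value=0)` call is included only to contrast with plain `+`.

```python
import pandas as pd

# Two Series whose indexes only partly overlap.
a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'Ohio': 35000, 'Texas': 71000, 'California': 1000})

# Addition aligns on index labels; a label missing from either side gives NaN.
total = a + b

# Series.add with fill_value treats a label missing on one side as 0 instead.
total_filled = a.add(b, fill_value=0)

print(total)
print(total_filled)
```

The result index is the union of both indexes, which is why `Utah` and `California` both appear in `total` even though each exists on only one side.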

obj4.name = 'population'  # both the Series and its index have a name attribute
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

DataFrame

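A DataFrame is a tabular data structure containing an ordered collection of columns. It has both a row index and a column index.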

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

frame

DataFrame(data, columns=['year', 'state', 'pop'])  # columns follow the given order

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])  # a column missing from data ('debt') is filled with NaN
frame2

frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

frame2['state']  # returns a Series

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

frame2.year  # attribute access also returns a Series

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

frame2.ix['three']  # select a row by its label

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

frame2['debt'] = 16.5  # assign a scalar to the whole column
frame2

frame2['debt'] = np.arange(5.)
frame2

val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val  # aligned on the index; unmatched positions become NaN
frame2

frame2['eastern'] = frame2.state == 'Ohio'
frame2

del frame2['eastern']  # delete a column
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

# Nested dict: outer keys become the columns, inner keys become the row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = DataFrame(pop)
frame3

frame3.T  # transpose

frame3

DataFrame(pop, index=[2001, 2002, 2003])

pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}  # both slices leave out the last row
DataFrame(pdata)

frame3.index.name = 'year'; frame3.columns.name = 'state'  # index and columns each have a name attribute
frame3

frame3.values  # the values attribute returns the data as a 2D ndarray

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

frame2

frame2.values

array([[2000L, 'Ohio', 1.5, nan],
       [2001L, 'Ohio', 1.7, -1.2],
       [2002L, 'Ohio', 3.6, nan],
       [2001L, 'Nevada', 2.4, -1.5],
       [2002L, 'Nevada', 2.9, -1.7]], dtype=object)

Index objects

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index([u'a', u'b', u'c'], dtype='object')

index[1:]

Index([u'b', u'c'], dtype='object')

index[1] = 'd'  # Index objects are immutable, so this raises TypeError

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-65-c44a2554ac58> in <module>()
----> 1 index[1] = 'd'  # Index objects are immutable, so this raises TypeError


C:\Users\WangYixin\Anaconda2\lib\site-packages\pandas\indexes\base.pyc in __setitem__(self, key, value)
   1235 
   1236     def __setitem__(self, key, value):
-> 1237         raise TypeError("Index does not support mutable operations")
   1238 
   1239     def __getitem__(self, key):


TypeError: Index does not support mutable operations

index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True

frame3

'Ohio' in frame3.columns

True

2003 in frame3.index

False

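Index methods and properties
append: concatenate with another Index, producing a new Index
diff: compute the set difference as a new Index (named difference in newer pandas releases)
intersection: compute the set intersection
union: compute the set union
isin: compute a boolean array indicating whether each value is contained in the passed collection
delete: remove the element at position i, returning a new Index
drop: remove the passed values, returning a new Index
insert: insert an element at position i, returning a new Index
is_monotonic: True if each element is greater than or equal to the previous one
is_unique: True if the Index has no duplicate values
unique: compute the array of unique values in the Index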
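
A minimal sketch of several of the Index methods listed above. The names `left` and `right` are made up for illustration, and `difference` is used for the set difference on the assumption of a reasonably recent pandas. Every call returns a new object, since an Index cannot be modified in place.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

common = left.intersection(right)     # labels present in both
combined = left.union(right)          # labels present in either
only_left = left.difference(right)    # labels in left but not in right
appended = left.append(right)         # concatenation; duplicates are kept
member_mask = left.isin(['a', 'c'])   # boolean array, one entry per label
without_first = left.delete(0)        # new Index without position 0
with_extra = left.insert(1, 'z')      # new Index with 'z' at position 1

print(appended.is_unique)  # False, because 'b' and 'c' occur twice
```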

Essential functionality

Reindexing

obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')  # forward-fill values

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)  # reindex the columns

frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
              columns=states)  # reindex rows and columns together

frame.ix[['a', 'b', 'c', 'd'], states]  # the same reindexing via label indexing

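Values for the reindex method option
ffill or pad: fill (carry values) forward
bfill or backfill: fill (carry values) backward

reindex function arguments
index: new sequence to use as the index
method: fill method (see the options above)
fill_value: substitute value to use when reindexing introduces missing data
limit: maximum number of elements to fill when forward- or backward-filling
level: match a simple Index on the given level of a MultiIndex; otherwise select a subset
copy: defaults to True (always copy); if False, the data is not copied when the new index equals the old one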
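
The `method`, `limit` and `fill_value` arguments can be compared side by side. This is a small sketch that reuses the colour Series from above under the illustrative name `s`; the other variable names are also made up.

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward fill, but carry each value forward by at most one position.
limited = s.reindex(range(8), method='ffill', limit=1)

# Backward fill pulls the next valid value back into each gap.
backward = s.reindex(range(5), method='bfill')

# fill_value replaces the NaN that would otherwise mark new labels.
filled = s.reindex(range(5), fill_value='missing')

print(limited)
```

With `limit=1`, label 5 is still filled from label 4, but labels 6 and 7 stay NaN because they are more than one step past the last original label.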

Dropping entries from an axis

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])

data.drop(['Colorado', 'Ohio'])

data.drop('two', axis=1)

data.drop(['two', 'four'], axis=1)

Indexing, selection, and filtering

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']

1.0

obj[1]

1.0

obj[2:4]

c    2.0
d    3.0
dtype: float64

obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

obj[obj < 2]

a    0.0
b    1.0
dtype: float64

obj['b':'c']  # slicing with labels includes the endpoint

b    1.0
c    2.0
dtype: float64

obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

data[['three', 'one']]

data[:2]

data[data['three'] > 5]

data < 5

data[data < 5] = 0

data.ix['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

data.ix[['Colorado', 'Utah'], [3, 0, 1]]

data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

data.ix[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

data.ix[data.three > 5, :3]

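Indexing options with DataFrame
obj[val]: select a single column or a set of columns
obj.ix[val]: select a single row or a set of rows
obj.ix[:, val]: select a single column or a subset of columns
obj.ix[val1, val2]: select both rows and columns
reindex method: conform one or more axes to new indexes
xs method: select a single row or column by label, returned as a Series
icol, irow methods: select a single column or row by integer position, returned as a Series
get_value, set_value methods: select a single value by row label and column label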
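
The same selections can also be written with `.loc` (labels only), `.iloc` (integer positions only) and `.at` (single scalar). This sketch assumes a newer pandas than the one used in these notes, where `.ix`, `irow`/`icol` and `get_value`/`set_value` were deprecated and later removed; it is not the form the original cells use.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

col = data['two']                                          # single column as a Series
row_by_label = data.loc['Colorado']                        # single row by label
row_by_pos = data.iloc[2]                                  # single row by integer position
block = data.loc[['Colorado', 'Utah'], ['two', 'three']]   # rows and columns by label
cell = data.at['Utah', 'three']                            # single scalar by row and column label
filtered = data.loc[data['three'] > 5, 'one':'three']      # boolean rows, label slice of columns

print(filtered)
```

Splitting label-based and position-based access into separate accessors avoids the ambiguity `.ix` has when the index itself contains integers.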

Arithmetic and data alignment

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

df2

df1 + df2

Arithmetic methods with fill values

df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

df2

df1 + df2

df1.add(df2, fill_value=0)

df1.reindex(columns=df2.columns, fill_value=0)
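
To make the difference between `+` and `add(..., fill_value=0)` concrete, here is a smaller sketch with deliberately tiny frames (the shapes are chosen for illustration and differ from the cells above).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape((2, 2)), columns=list('ab'))
df2 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('abc'))

plain = df1 + df2                     # a cell missing from either frame becomes NaN
filled = df1.add(df2, fill_value=0)   # a cell missing from one frame counts as 0

print(plain)
print(filled)
```

A cell stays NaN under `fill_value` only when it is missing from both operands, which does not happen here because `df2` covers every row and column.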

Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

arr[0]

array([ 0.,  1.,  2.,  3.])

arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]  # first row
frame

series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

frame - series

series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2

series3 = frame['d']
frame

series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

frame.sub(series3, axis=0)  # broadcast across the columns, matching on the row index
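
The two broadcasting directions can be checked against each other in a short sketch. It uses `iloc[0]` for the first row, which is an assumption about a newer pandas; the names `by_row` and `by_col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: match the Series index against the columns, broadcast down the rows.
by_row = frame - frame.iloc[0]

# axis=0: match the Series index against the row labels, broadcast across the columns.
by_col = frame.sub(frame['d'], axis=0)

print(by_row)
print(by_col)
```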

Function application and mapping

frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

np.abs(frame)

f = lambda x: x.max() - x.min()

frame.apply(f)  # applied to each column by default

b    1.802165
d    1.684034
e    2.689627
dtype: float64

frame.apply(f, axis=1)  # applied to each row

Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64

def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

format = lambda x: '%.2f' % x
frame.applymap(format)  # element-wise on a DataFrame

frame['e'].map(format)  # element-wise on a Series

Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
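
A compact sketch of the distinction between column-wise or row-wise `apply` and element-wise mapping, using fixed numbers so the results are easy to check by hand. The element-wise step over the whole frame is written here as `apply` plus `Series.map`, an illustrative alternative to `applymap`; the data and names are made up.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.5], 'd': [0.25, 4.0]}, index=['Utah', 'Ohio'])

# apply: the function receives one column (axis=0, the default) or one row (axis=1).
col_range = frame.apply(lambda x: x.max() - x.min())
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# Series.map calls the function once per value.
fmt = lambda v: '%.2f' % v
formatted_col = frame['b'].map(fmt)

# Mapping every column in turn gives an element-wise result for the whole frame.
formatted_all = frame.apply(lambda col: col.map(fmt))

print(col_range)
print(formatted_all)
```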

Sorting and ranking

obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()

frame.sort_index(axis=1)  # sort by column labels

frame.sort_index(axis=1, ascending=False)  # descending order

obj = Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

obj = Series([4, np.nan, 7, np.nan, -3, 2])  # when sorting a Series, missing values go to the end
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

frame.sort_values(by='b')  # sort by the values in one column

frame.sort_values(by=['a', 'b'])  # sort by several columns

obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()  # tied values receive the average rank of their group

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='first')  # ties are ranked in the order the values appear in the data

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(ascending=False, method='max')  # descending order; ties take the group's maximum rank

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame

frame.rank(axis=1)

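Tie-breaking methods for rank
average: assign the average rank to each entry in a tied group
min: use the minimum rank for the whole group
max: use the maximum rank for the whole group
first: assign ranks in the order the values appear in the data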
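
The four tie-breaking options are easiest to compare on one short Series. The following sketch collects them in a single DataFrame; the name `ranks` and the four-element input are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4])

# One column per tie-breaking method; the two 7s are the tied group.
ranks = pd.DataFrame({
    'average': obj.rank(method='average'),
    'min': obj.rank(method='min'),
    'max': obj.rank(method='max'),
    'first': obj.rank(method='first'),
})

print(ranks)
```

The two 7s occupy ranks 3 and 4, so they get 3.5 under `average`, 3 under `min`, 4 under `max`, and 3 then 4 under `first`.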

Axis indexes with duplicate values

obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])  # duplicate index labels
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

obj.index.is_unique

False

obj['a']

a    0
a    1
dtype: int64

obj['c']

df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

df.ix['b']
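
With duplicate labels, the type of the result depends on how often the label occurs. A minimal sketch (the names `repeated` and `single` are illustrative):

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # label occurs twice: the result is a Series with both entries
single = obj['c']     # label occurs once: the result is a scalar

print(type(repeated).__name__, type(single).__name__)
```

Code that indexes by label may therefore need to check `obj.index.is_unique` before assuming a scalar comes back.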

Summarizing and computing descriptive statistics

df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df

df.sum()  # NA values are excluded automatically

one    9.25
two   -5.80
dtype: float64

df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

df.idxmax()  # index label of each column's maximum

one    b
two    d
dtype: object

df.cumsum()  # cumulative sum down each column

df.describe()

obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

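Descriptive and summary statistics
count: number of non-NA values
describe: compute summary statistics for a Series or for each DataFrame column
min, max: compute the minimum and maximum values
argmin, argmax: compute the integer index positions of the minimum and maximum values
idxmin, idxmax: compute the index labels of the minimum and maximum values
quantile: compute a sample quantile, ranging from 0 to 1
sum: sum of the values
mean: mean of the values
median: arithmetic median
mad: mean absolute deviation from the mean
var: variance
std: standard deviation
skew: sample skewness
kurt: kurtosis
cumsum: cumulative sum
cummin, cummax: cumulative minimum and cumulative maximum
cumprod: cumulative product
diff: first difference
pct_change: percent change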
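
A few of the statistics listed above, applied to a small frame with missing values so the NA handling is visible. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, np.nan, 5.0],
                   'two': [np.nan, -4.0, np.nan, -1.0]},
                  index=['a', 'b', 'c', 'd'])

counts = df.count()                               # non-NA values per column
col_sum = df.sum()                                # NA values are skipped by default
row_mean_strict = df.mean(axis=1, skipna=False)   # NaN if any value in the row is NA
label_of_max = df.idxmax()                        # index label of each column's maximum
running = df['one'].cumsum()                      # NA positions stay NA, the total continues after them
step = df['one'].diff()                           # difference between consecutive rows

print(row_mean_strict)
print(running)
```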

Correlation and covariance

# pandas.io.data has moved to the separate pandas-datareader package
import pandas_datareader.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)  # requires network access
price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})

returns = price.pct_change()  # percent change in price
returns.tail()

returns.MSFT.corr(returns.IBM)  # correlation between two Series

0.49616105806910621

returns.MSFT.cov(returns.IBM)  # covariance

8.7745727843692117e-05

returns.corr()  # correlation matrix of the DataFrame

returns.cov()  # covariance matrix of the DataFrame

returns.corrwith(returns.IBM)  # correlation of each column (or row) with another Series or DataFrame

AAPL    0.383470
GOOG    0.401322
IBM     1.000000
MSFT    0.496161
dtype: float64

returns.corrwith(volume)

AAPL   -0.073558
GOOG   -0.007108
IBM    -0.202749
MSFT   -0.092586
dtype: float64
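
Because the stock data above needs a network download, the same calls can be tried offline on synthetic prices. Everything in this sketch is made up for illustration: the tickers `AAA`, `BBB` and `CCC`, the random seed, and the price model. The numbers it produces have no relation to the outputs shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical price paths: cumulative products of small random daily changes.
prices = pd.DataFrame(
    100 * np.cumprod(1 + 0.01 * rng.randn(250, 3), axis=0),
    columns=['AAA', 'BBB', 'CCC'])

returns = prices.pct_change().dropna()   # the first row has no previous price

pair_corr = returns['AAA'].corr(returns['BBB'])   # correlation of two Series
corr_matrix = returns.corr()                      # full correlation matrix
cov_matrix = returns.cov()                        # full covariance matrix
vs_aaa = returns.corrwith(returns['AAA'])         # each column against one Series

print(corr_matrix)
```

The pairwise value equals the matching entry of the matrix, and the diagonal of the covariance matrix holds each column's variance.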

Unique values, value counts, and membership

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = obj.unique()  # array of unique values
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()  # frequency of each value, sorted in descending order

c    3
a    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values, sort=False)  # sort=False leaves the counts unsorted

a    3
c    3
b    2
d    1
dtype: int64

mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data

result = data.apply(pd.value_counts).fillna(0)
result

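isin: compute a boolean array indicating whether each Series value is contained in the passed sequence of values
unique: compute the array of unique values in a Series, returned in the order observed
value_counts: return a Series with the unique values as its index and their frequencies as its values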
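
The per-column histogram built above with `data.apply(pd.value_counts)` can be sketched with the `Series.value_counts` method instead, so the result is easy to inspect. The lambda form is an illustrative alternative to the top-level `pd.value_counts` function.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count the values of each column separately; the row labels of the result are
# the distinct values found anywhere in the frame, and a value that never
# occurs in a given column gets 0 after fillna.
result = data.apply(lambda col: col.value_counts()).fillna(0)

print(result)
```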

Handling missing data

string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

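NA handling methods
dropna: filter axis labels based on whether the values for each label contain missing data
fillna: fill in missing data with a specified value or an interpolation method
isnull: return an object of boolean values indicating which values are missing
notnull: negation of isnull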

Filtering out missing data

from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

cleaned

data.dropna(how='all')

data[4] = NA
data

data.dropna(axis=1, how='all')

df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA  # note that the end row of the slice is included
df

df.dropna(thresh=2)  # keep rows with at least two non-NA values
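
The `how` and `thresh` options differ only in how many missing values a row may contain before it is dropped. A small sketch on the four-row frame used earlier (with `np.nan` written out instead of the `NA` alias; the result names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

any_na = data.dropna()                 # drop rows containing any NA
all_na = data.dropna(how='all')        # drop only rows that are entirely NA
at_least_two = data.dropna(thresh=2)   # keep rows with at least 2 non-NA values

print(list(any_na.index), list(all_na.index), list(at_least_two.index))
```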

Filling in missing data

df.fillna(0)

df.fillna({1: 0.5, 2: -1})  # fill each column with a different value

# inplace=True modifies df in place; the return value is discarded
_ = df.fillna(0, inplace=True)
df

df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df

df.fillna(method='ffill')

df.fillna(method='ffill', limit=2)  # fill at most two consecutive values

data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

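fillna function arguments
value: scalar value or dict-like object used to fill missing values
method: interpolation method
axis: axis to fill on
inplace: modify the calling object without producing a copy
limit: maximum number of consecutive values to fill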
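
The filling strategies can be compared on a frame small enough to verify by hand. This sketch uses `ffill(limit=...)`, the method form of `fillna(method='ffill', limit=...)`; the column names `x` and `y` and all values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan],
                   'y': [2.0, 4.0, np.nan, np.nan]})

constant = df.fillna(0)                          # one scalar for every gap
per_column = df.fillna({'x': -1.0, 'y': 0.5})    # a dict gives one value per column
forward = df.ffill(limit=2)                      # carry the last valid value forward, at most 2 rows
with_mean = df.fillna(df.mean())                 # each column filled with its own mean

print(forward)
```

In `forward`, the last row of `x` stays NaN because it is the third consecutive gap, which exceeds the limit of two.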

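Hierarchical indexing

Hierarchical indexing allows multiple index levels on a single axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.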

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1    0.029610
   2    0.795253
   3    0.118110
b  1   -0.748532
   2    0.584970
   3    0.152677
c  1   -1.565657
   2   -0.562540
d  2   -0.032664
   3   -0.929006
dtype: float64

data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

data['b']

1   -0.748532
2    0.584970
3    0.152677
dtype: float64

data['b':'c']

b  1   -0.748532
   2    0.584970
   3    0.152677
c  1   -1.565657
   2   -0.562540
dtype: float64

data.ix[['b', 'd']]

b  1   -0.748532
   2    0.584970
   3    0.152677
d  2   -0.032664
   3   -0.929006
dtype: float64

data[:, 2]  # select by the inner level

a    0.795253
b    0.584970
c   -0.562540
d   -0.032664
dtype: float64

data.unstack()  # rearrange into a DataFrame

data.unstack().stack()

a  1    0.029610
   2    0.795253
   3    0.118110
b  1   -0.748532
   2    0.584970
   3    0.152677
c  1   -1.565657
   2   -0.562540
d  2   -0.032664
   3   -0.929006
dtype: float64
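
The selection and reshaping steps above are easier to follow with integer values instead of random ones. A minimal sketch with illustrative names; the inner-level selection is written with `.loc`, which is an assumption about style rather than the form used in the cells above.

```python
import pandas as pd

data = pd.Series(range(6),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

outer = data['b']        # select by the outer level: a Series indexed by the inner level
inner = data.loc[:, 2]   # select by the inner level across all outer labels
wide = data.unstack()    # the inner level becomes the columns of a DataFrame
back = wide.stack()      # stack() reverses unstack()

print(wide)
```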

frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

frame['Ohio']

pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                          names=['state', 'color'])  # a MultiIndex can also be created on its own

Reordering and sorting levels

frame.swaplevel('key1', 'key2')

frame

frame.sortlevel(1)

frame.swaplevel(0, 1).sortlevel(0)
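
Swapping and sorting levels can be sketched with `sort_index(level=...)`, which sorts by a chosen index level. This is offered as an alternative spelling of the `sortlevel` calls above, on the assumption of a pandas version that supports the `level` argument; the two-column frame is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=['x', 'y'])
frame.index.names = ['key1', 'key2']

swapped = frame.swaplevel('key1', 'key2')   # the levels trade places; row order is unchanged
ordered = swapped.sort_index(level=0)       # sort the rows by the new outer level

print(ordered)
```

`swaplevel` only relabels the rows, so the data stays in its original order until `sort_index` is called.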

Summary statistics by level

frame.sum(level='key2')

frame.sum(level='color', axis=1)
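
The same per-level totals can be computed with `groupby(level=...)`. This is a sketch of an equivalent formulation, not the call used above; for the column level it transposes, groups and transposes back, which is one possible approach among several.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same 'key2' label.
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same 'color' label.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```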

Using a DataFrame's columns

frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame

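frame2 = frame.set_index(['c', 'd'])  # move columns into the row index
frame2

frame.set_index(['c', 'd'], drop=False)  # keep the index columns in the frame as well

frame2.reset_index()  # move the hierarchical index levels back into columns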
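
A short round trip shows what each call does to the columns. The four-row frame and the names `indexed`, `kept` and `restored` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

indexed = frame.set_index(['c', 'd'])            # c and d become a two-level row index
kept = frame.set_index(['c', 'd'], drop=False)   # same index, but c and d also stay as columns
restored = indexed.reset_index()                 # the index levels move back into columns

print(indexed)
print(list(restored.columns))
```

After `reset_index`, the former index levels come first in the column order, so `restored` has the columns `c`, `d`, `a`.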
