[关闭]
@jk88876594 2017-08-07T18:27:50.000000Z 字数 7666 阅读 5843

DataFrame—数据处理

阿雷边学边教python数据分析第3期——pandas与numpy


一.数据的增删改

1.增加行

(1)手动输入新增行的内容

  1. #示例数据:
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  1. df1.loc[5] = ["baby","shanghai",80]

(2)将同字段的DataFrame添加进来

  1. #示例数据:
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  6. df1_1 = pd.DataFrame({"name":["faker","lucy"],
  7. "city":["guangzhou","shenzhen"],
  8. "score":[70,75]},
  9. columns=["name","city","score"])
  1. #append做法
  2. df1.append(df1_1,ignore_index=True)
  1. #concat做法
  2. pd.concat([df1,df1_1],ignore_index=True)

2.删除行

  1. #示例数据:
  2. df_concat = pd.concat([df1,df1_1],ignore_index=True)
  1. #删除第7行,也即是索引号为6的这一行
  2. df_concat.drop(6)
  1. #删除第4行和第6行
  2. df_concat.drop([3,5])

通过drop我们可以删除掉不需要的特定行,但其实更多的时候我们可能只需要直接筛选需要的行或者过滤掉一些不需要的行即可,关于如何选择和过滤在下期会详细讲。

3.修改行

若要修改行,则要先选出需要修改的一行或多行,再重新赋值

  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  1. #将第一行的ray修改成demon,hangzhou改成wenzhou,10改成35
  2. df1.loc[0] = ["demon","hangzhou",35]
  1. #多行修改
  2. df1.loc[0:2] = [["d","j","l"],["h","b","h"],[40,50,60]]

4.增加列

(1)在末尾插入列
DataFrame["new coloum name"] = value

  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  1. #末尾增加一列:gender(性别)
  2. df1["gender"] = ["male","male","female","male","female"]

(2)在任意位置插入新列
DataFrame.insert(index,"new coloum name",value)

  1. #在第2列的位置插入新的一列:height(身高)
  2. df1.insert(1,"height",[170,165,172,180,169])
  3. #第1个参数表示索引号即插入的位置,第2个参数填列的名称,第3个参数填值

5.删除列

  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])

(1)del DataFrame["colname"]
删除特定行,并且直接从原DataFrame中修改过来

  1. #删除score列
  2. del df1["score"]

(2)DataFrame.drop(["colname"],axis = 1)
axis=1表示按列删除,默认是返回一个新的DataFrame,如果需要直接修改原DataFrame,则添加参数inplace=True,如果要删除多列,则在方括号[]内继续添加列名,如果只需要删除单列,方括号[]可加可不加

  1. #删除city列
  2. df1.drop(["city"],axis=1)

6.修改列

若要修改列,则要先选出需要修改的一列或多列,再重新赋值

  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  1. #修改score列
  2. df1["score"] = 50
  1. #修改city和score列
  2. df1[["city","score"]] = [["hz","bj","hz","cd","sz"],60]

二.数据集的合并

merge

Q1:什么是merge?
通过一个键或多个键将不同的DataFrame中的行连接起来

Q2:如何merge?(连接方式)

1.将列作为键来合并

  1. #示例数据:
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  6. df2 = pd.DataFrame({"name":["ray","lucy","demon"],
  7. "age":[15,17,16]},
  8. columns=["name","age"])

(1)inner连接(交集)

  1. pd.merge(df1,df2,on="name",how="inner")

只保留df1和df2公有的name,以及这些name对应的其他列

这种写法等价于

  1. pd.merge(df1,df2)

因为merge在没有指定某列作为键的情况下,会默认将重叠列的列名作为键,即name列作为键;且默认的连接方式是inner。

(2)outer连接(并集)

  1. pd.merge(df1,df2,on="name",how="outer")

保留df1和df2所有的name,以及name对应的其他列,name匹配不到的列值显示NaN

(3)left连接(保左加右)

  1. pd.merge(df1,df2,on="name",how="left")

保留左边的数据集df1的所有列,用df1的name去匹配df2的name,把df2中能匹配得上的name所对应的其他列都加进来。换言之,也即是在df1的基础上加上df2中能够匹配的数据,不能匹配的数据舍弃掉。简而言之就是保左加右。

(4)right连接(保右加左)

  1. pd.merge(df1,df2,on="name",how="right")

和left连接的原理一样,保留右边的数据即第二个参数df2,在这个基础上加上左边能匹配上的数据即可。

一些可能遇到的问题:
Q1:如果两个数据指定列的列名不一样怎么办?
比如创建一个df3,和df2数据一样,只是name列名改成name2:

  1. df3 = df2.rename(columns={"name":"name2"})

现在需要按照df1的name列和df3的name2列来进行合并

  1. pd.merge(df1,df3,left_on="name",right_on="name2",how="inner")

Q2:如果需要多个键来进行合并怎么办呢?
先修改下源数据df1和df2

  1. df1.loc[5] = ["ray","wuhan",80] #给df1增加新的一行,名称为已出现的ray
  2. df2["city"] = ["hangzhou","hangzhou","heilongjiang"] #给df2增加新的一列city

如果只用name来作为键来匹配是无法成功的,因为有2个ray,他们是同名的不同人,所以需要更多的约束条件来精确匹配,用name和city这两列共同作为键来精确匹配:

  1. pd.merge(df1,df2,on=["name","city"],how="left")

2.将索引作为键来合并

  1. #示例数据
  2. left1 = pd.DataFrame({"key":list("acba"),"value":range(4)})
  3. right1 = pd.DataFrame({"value2":[10,20]},index=["a","b"])

合并left1和right1的交集,通过left1的"key"列和right1的索引列作为键来合并

  1. pd.merge(left1,right1,left_on="key",right_index=True,how="inner")

三.数据的轴向连接

concat

axis=0:表示在横轴上工作,所谓横轴也即是行,而行的方向是上下,因此你可以理解为在上下方向执行操作
axis=1:表示在纵轴上工作,所谓纵轴也即是列,而列的方向是左右,因此你可以理解为在左右方向直行操作
那么数据的轴向连接也就是指:当axis=0时,将两份或多份数据按照上下方向拼接起来;当axis=1时,将两份或多份数据按照左右方向拼接起来。

(1)横轴上的连接,axis=0时(concat默认axis=0)

  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  6. df2 = pd.DataFrame({"name":["faker","fizz"],
  7. "city":["wenzhou","shanghai"],
  8. "score":[55,80]},
  9. columns=["name","city","score"])
  1. #按横轴连接df1和df2
  2. pd.concat([df1,df2],ignore_index=True)
  1. #示例数据
  2. df1 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "city":["hangzhou","beijing","hangzhou","chengdu","suzhou"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","city","score"])
  6. df2 = pd.DataFrame({"name":["faker","fizz"],
  7. "city":["wenzhou","shanghai"],
  8. "gender":["male","female"]},
  9. columns=["name","city","gender"])
  1. #按横轴连接df1和df2
  2. pd.concat([df1,df2],ignore_index=True)
  3. #会得到这两份数据的并集,没有的值会以NaN的方式填充
  1. df_concat = pd.concat([df1,df2],keys=["df1","df2"])

(2)纵轴上的连接,axis=1时

  1. #按纵轴方向合并df1和df2
  2. pd.concat([df1,df2],axis=1)

四.合并重叠数据

combine_first

  1. #示例数据
  2. data1 = pd.DataFrame({"score":[60,np.nan,75,80],
  3. "level":[np.nan,"a",np.nan,"f"],
  4. "cost":[1000,1500,np.nan,1200]})
  5. data2 = pd.DataFrame({"score":[34,58,np.nan],
  6. "level":[np.nan,"c","s"]})
  1. data1.combine_first(data2)

data1和data2有索引重叠的部分:即level列和score列的前三行。那么对于data1中的数据,如果data1已有数据,则继续用data1的数据,如果data1中有缺失数据,那么对于缺失数据用参数里的对象data2中的对应值来补充

五.数据分组

示例数据:

  1. df = pd.read_csv("pokemon_data.csv",encoding="gbk")

1.np.where

  1. #Q1
  2. df["攻击强弱度"] = np.where(df["攻击力"] >= 79,"强攻","弱攻")
  1. #Q2
  2. df["划分级"] = np.where((df["攻击力"] >= 79) & (df["防御力"] >= 100),"S级" ,"A级")

2.loc

  1. #Q1
  2. df.loc[df["攻击力"] >= 79,"攻击强弱度"] = "强攻"
  3. df.loc[df["攻击力"] < 79,"攻击强弱度"] = "弱攻"
  1. #Q2
  2. df.loc[(df["攻击力"] >=79) & (df["防御力"] >=100),"划分级"] = "S级"
  3. df.loc[~((df["攻击力"] >=79) & (df["防御力"] >=100)),"划分级"] = "A级"

np.where和loc分组的区别:
如果你想按照条件判断分两组,那么就使用np.where
如果你想按照条件判断分为多个组,就使用loc

六.数据分列与合并

1.分列

split

  1. #示例数据
  2. df4 = pd.DataFrame({"name":["ray","jack","lucy","bob","candy"],
  3. "h&w":["175-70","180-80","168-74","177-72","182-90"],
  4. "score":[10,30,20,15,50]},
  5. columns=["name","h&w","score"])

需求:现在需要将h&w这一列的数据拆分开,分为身高列和体重列,并且拆分后的列需要在源数据里呈现出来

  1. #对h&w列的值依次进行分列,并创建数据表df_split,索引值为df4的索引,列名位height和weight
  2. df_split = pd.DataFrame([x.split("-") for x in df4["h&w"]],index=df4.index,columns=["height","weight"])
  3. #将分列后的数据表df_split与原df4数据表进行匹配
  4. df4 = pd.merge(df4,df_split,right_index=True,left_index=True)

2.合并

需求:现在需要将name和score合并为一列,形式如ray:10,合并后的新列需要在源数据里呈现出来

  1. df4["name:score"] = df4["name"] + ":" + df4["score"].apply(str)

apply可以将一个函数用于DataFrame的所有行或者所有列,也可以应用在DataFrame的一行或者是一列。

七.排序

  1. #示例数据
  2. df = pd.read_csv("pokemon_data.csv",encoding="gbk")

1.根据值排序

sort_values

(1)升序排列(由小到大)

  1. df.sort_values(by="总计") #默认就是升序排列

(2)降序排列(由大到小)

  1. df.sort_values(by="总计",ascending=False) #添加ascending=False参数来降序排列

(3)多级关键词排序

  1. #如果都做升序排列
  2. df.sort_values(by=["总计","攻击力"])
  1. #如果都做降序排列
  2. df.sort_values(by=["总计","攻击力"],ascending=False)
  1. #如果按照总计列做升序排列,但是总计相等的行按照攻击力做降序排列
  2. df.sort_values(by=["总计","攻击力"],ascending=[True,False])

(4)如果希望直接在修改源数据的基础上进行排序,添加参数inplace=True即可

  1. #以升序排列为例
  2. df.sort_values(by="总计",inplace=True)

2.根据索引排序

sort_index

(1)升序排列

  1. df.sort_index(inplace=True)

(2)降序排列

  1. df.sort_index(ascending=False,inplace=True)

八.排名

rank

  1. #示例数据
  2. df = pd.read_csv("pokemon_data.csv",encoding="gbk")
  3. df_rank = df.head(10)

做个中国式排名:希望通过总计列的大小进行排名,并新增一个排名列,值大的排名靠前,值小的排名靠后,如果值一样则占同一个排名位,排名位是连续的,不存在间断。

  1. df_rank["rank"] = df_rank["总计"].rank(ascending=False,method="dense")

九.转置

T

  1. #示例数据
  2. df5 = pd.DataFrame({"cost":[800,1000,1200],"sale":[1000,2000,2500]},index=["1月","2月","3月"])
  1. #行索引和列索引进行转置
  2. df5 = df5.T
  3. df5

REMA(F(WS@VU{KCN86QU3CS.png-265.8kB

添加新批注
在作者公开此批注前,只有你和作者可见。
回复批注