Python For Data Analysis-八章第二节
《Python For Data Analysis》的第八章主要讨论的是数据的组合、合并和变形等问题。而第八章的第二节主要讨论的是合并数据问题, 这个有点儿像数据库的表的连接问题: 内连接、外连接、左连接、右连接、全连接等。主要涉及两个函数:merge和concat, merge函数依据key(两个dataframe想重合的列或关联的列)实现类似数据库的join操作;concat函数类似于简单的拼接操作。 为了更好的理解合并数据,这里先设计两个DataFrame:假设一个卖水果的店主手上有两张表,一张表记录供应商和联系电话的供应商信息表supplierInfo, 另一张表记录正在卖的水果和该水果的供应商的热卖水果表saleInfo。如何构建两个dataframe来存储这样的两张表呢?
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
执行结果:
1rd DataFrame : Supplier Info 600 records
----------------------------------------
Telephone supplier
0 13421972855 200
1 13427373491 201
2 13639747726 202
3 13563018819 203
4 13463388117 204
5 13397690432 205
6 13428386545 206
7 13516691855 207
8 13514679589 208
9 13583503843 209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit supplier
0 coconut 396
1 lemon 718
2 apple 268
3 lemon 703
4 peach 275
5 lemon 327
6 peach 448
7 strawberry 373
8 apple 574
9 banana 329 # saleInfo
----------------------------------------
由于页面限制,每张表只打印前10条数据。第一张表supplierInfo记录者600个供应商的电话和id编号,共600条记录。
第二张表saleInfo记录正在热卖的水果和供应商信息,共33条记录。
18.1 merge实现内连接inner
merge函数合并两个dataframe的时候默认是依据两个dataframe的共有的列,或者指定每个dataframe关联的列,默认是内连接两个dataframe即求两个dataframe的交集(intersection)。利用上边的两张表,提出一个问题,如何得到一个新的dataframe,即含热卖水果、该水果的供应商还有供应的电话呢?可以直接使用默认的merge的功能,即依据共同列和内连接。供应商表supplierInfo的供应商有600个,而热卖水果表saleInfo里只有33条记录,且有可能一个供应商提供多个水果都热卖,利用merge可以基于两表共有列supplier来提取供应商表里供应商的电话信息。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
inner = pd.merge(saleInfo,supplierInfo)
print inner, "# merge by inner default"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 600 records
----------------------------------------
Telephone supplier
0 13397314405 200
1 13640069730 201
2 13452263985 202
3 13365776382 203
4 13372081583 204
5 13362805386 205
6 13691396902 206
7 13472919613 207
8 13630580967 208
9 13572586629 209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit supplier
0 cherry 744
1 peach 530
2 watermelon 720
3 banana 753
4 watermelon 362
5 lemon 236
6 pear 536
7 watermelon 682
8 cherry 539
9 cherry 378 # saleInfo
----------------------------------------
fruit supplier Telephone
0 cherry 744 13512653004
1 peach 530 13451525038
2 watermelon 720 13506123412
3 banana 753 13585478535
4 watermelon 362 13602478941
5 lemon 236 13448737006
6 pear 536 13615857657
7 watermelon 682 13305228132
8 cherry 539 13320374321
9 cherry 378 13617634073
10 peach 696 13555357057
11 pear 230 13547184365
12 watermelon 365 13439731283
13 pear 533 13640788128
14 pineapple 791 13609052279
15 peach 520 13449112307
16 lemon 437 13366315828
17 cherry 524 13465646212
18 pineapple 466 13370749700
19 apple 380 13687984821
20 pear 434 13334299009
21 strawberry 367 13489740422
22 watermelon 644 13614544558
23 pineapple 491 13419549293
24 pineapple 675 13381339073
25 apple 366 13408400428
26 pear 444 13551163610
27 pear 507 13588231571
28 watermelon 702 13427105077
29 apple 505 13560167229
30 banana 725 13622088077
31 strawberry 785 13582343547
32 cherry 582 13574155943 # merge by inner default
----------------------------------------
程序运行无误的原因是两个表都有supplier共有列且名字一致,但是如果两个表的关联列的名字不一致,可以使用merge函数的left_on
和right_on
参数来指定关联列。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"sup":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
inner = pd.merge(saleInfo,supplierInfo, left_on = "sup", right_on = "supplier")
print inner, "# merge by inner default"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 600 records
----------------------------------------
Telephone supplier
0 13419790746 200
1 13551084291 201
2 13463359987 202
3 13655480653 203
4 13457048691 204
5 13521227468 205
6 13452934190 206
7 13678163518 207
8 13587003496 208
9 13363122472 209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit sup
0 pineapple 376
1 cherry 672
2 peach 384
3 pineapple 247
4 coconut 592
5 cherry 448
6 lemon 529
7 banana 615
8 lemon 374
9 strawberry 741 # saleInfo
----------------------------------------
fruit sup Telephone supplier
0 pineapple 376 13619661261 376
1 cherry 672 13659543874 672
2 peach 384 13586651620 384
3 pineapple 247 13490523210 247
4 coconut 592 13417280009 592
5 cherry 448 13652830401 448
6 lemon 529 13662448776 529
7 banana 615 13679672834 615
8 lemon 374 13349513320 374
9 strawberry 741 13458734929 741
10 lemon 249 13410741189 249
11 cherry 264 13359359933 264
12 lemon 787 13469956160 787
13 strawberry 727 13644866542 727
14 coconut 426 13522184540 426
15 orange 659 13589132728 659
16 peach 402 13538576857 402
17 coconut 235 13374021445 235
18 cherry 202 13463359987 202
19 cherry 764 13544761364 764
20 banana 287 13610190258 287
21 banana 355 13503122927 355
22 orange 455 13586231983 455
23 coconut 319 13463046713 319
24 coconut 796 13501732894 796
25 pear 350 13405649819 350
26 orange 431 13693748090 431
27 pear 250 13318097140 250
28 pineapple 712 13658847104 712
29 pear 224 13636139762 224
30 apple 327 13573669416 327
31 cherry 513 13651095441 513
32 orange 580 13534756074 580 # merge by inner default
----------------------------------------
程序变化之处是两个dataframe的列名字有所不同:
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
和
saleInfo = pd.DataFrame({"fruit":fid,"sup":sup})
那么在用merge时由于两个dataframe没有同名的列,需要用left_on
和right_on
参数来指定关联列,即:
inner = pd.merge(saleInfo,supplierInfo, left_on = "sup", right_on = "supplier")
18.2 merge实现外连接outer
merge函数默认是内连接,可以使用how
形参来指定非内连接,例如外连接outer,外连接实际类似于数学上的集合的并集。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
outer = pd.merge(saleInfo,supplierInfo, how = "outer")
print outer, "# merge by outer"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 600 records
----------------------------------------
Telephone supplier
0 13543440652 200
1 13433706661 201
2 13549572735 202
3 13619468030 203
4 13573591940 204
5 13606770156 205
6 13360419930 206
7 13665511872 207
8 13628056017 208
9 13637414407 209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit supplier
0 orange 507
1 cherry 372
2 strawberry 514
3 lemon 663
4 orange 727
5 pineapple 479
6 strawberry 315
7 peach 439
8 strawberry 454
9 pear 240 # saleInfo
----------------------------------------
fruit supplier Telephone
0 orange 507 13439505961
1 cherry 372 13303155133
2 strawberry 514 13525833383
3 lemon 663 13590968869
4 orange 727 13417794792
....
28 apple 485 13614700517
29 pear 519 13601530338
.. ... ... ...
570 NaN 770 13641022823
571 NaN 771 13602055526
...
598 NaN 798 13610828633
599 NaN 799 13699702573
[600 rows x 3 columns] # merge by outer
----------------------------------------
18.3 merge实现左右连接
为了更好的展示用merge实现左右连接,将两张表的数据缩减一些。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = pd.Series(np.arange(200, 210))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 10))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")
print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, how = "left")
print left, "# merge by left"
print "-" * 40
print "\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, how = "right")
print right, "# merge by right"
print "-" * 40
print "\n"
print "*" * 40
print "supplierInfo is left table, saleInfo is right table"
print "-" * 40
left = pd.merge(supplierInfo,saleInfo, how = "left")
print left, "# merge by left"
print "-" * 40
print "\n"
print "*" * 40
print "supplierInfo is left table, saleInfo is right table"
print "-" * 40
right = pd.merge(supplierInfo,saleInfo, how = "right")
print right, "# merge by right"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 10 records
----------------------------------------
Telephone supplier
0 13600789375 200
1 13660182483 201
2 13685123194 202
3 13654834211 203
4 13639572499 204
5 13329426145 205
6 13558906512 206
7 13457094656 207
8 13484579614 208
9 13314452594 209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit supplier
0 banana 201
1 orange 204
2 banana 203
3 orange 206
4 pineapple 205
5 orange 202 # saleInfo
----------------------------------------
****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
fruit supplier Telephone
0 banana 201 13660182483
1 orange 204 13639572499
2 banana 203 13654834211
3 orange 206 13558906512
4 pineapple 205 13329426145
5 orange 202 13685123194 # merge by left
----------------------------------------
****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
fruit supplier Telephone
0 banana 201 13660182483
1 orange 204 13639572499
2 banana 203 13654834211
3 orange 206 13558906512
4 pineapple 205 13329426145
5 orange 202 13685123194
6 NaN 200 13600789375
7 NaN 207 13457094656
8 NaN 208 13484579614
9 NaN 209 13314452594 # merge by right
----------------------------------------
****************************************
supplierInfo is left table, saleInfo is right table
----------------------------------------
Telephone supplier fruit
0 13600789375 200 NaN
1 13660182483 201 banana
2 13685123194 202 orange
3 13654834211 203 banana
4 13639572499 204 orange
5 13329426145 205 pineapple
6 13558906512 206 orange
7 13457094656 207 NaN
8 13484579614 208 NaN
9 13314452594 209 NaN # merge by left
----------------------------------------
****************************************
supplierInfo is left table, saleInfo is right table
----------------------------------------
Telephone supplier fruit
0 13660182483 201 banana
1 13685123194 202 orange
2 13654834211 203 banana
3 13639572499 204 orange
4 13329426145 205 pineapple
5 13558906512 206 orange # merge by right
----------------------------------------
1).左连接的意思是以merge函数里的第一个dataframe为基准(saleInfo),如果第二个dataframe和第一个dataframe有共同列(关联列)的数据被合并成一个新dataframe,没有NaN;
语句left = pd.merge(saleInfo,supplierInfo, how = "left")
的左表为saleInfo
,右表为supplierInfo
,函数merge的how指定了左连接,那么左表全输出,右表其他非共同或关联列如果有数据则输出数据,没有用NaN填充。
----------------------------------------
fruit supplier Telephone
0 banana 201 13660182483
1 orange 204 13639572499
2 banana 203 13654834211
3 orange 206 13558906512
4 pineapple 205 13329426145
5 orange 202 13685123194 # merge by left
----------------------------------------
2). 右连接则是以第二个dataframe为基准,列出第一个dataframe里有与第二个dataframe能关联(共同)上的列的数据。
语句right = pd.merge(saleInfo,supplierInfo, how = "right")
的左表为saleInfo
,右表为supplierInfo
,函数merge的how指定了右连接,
那么以’supplierInfo‘为基准并全部输出,左表saleInfo
如果有和右表supplierInfo
共同列数据则输出其他非共同列数据,没有用NaN填充。
----------------------------------------
fruit supplier Telephone
0 banana 201 13660182483
1 orange 204 13639572499
2 banana 203 13654834211
3 orange 206 13558906512
4 pineapple 205 13329426145
5 orange 202 13685123194
6 NaN 200 13600789375
7 NaN 207 13457094656
8 NaN 208 13484579614
9 NaN 209 13314452594 # merge by right
----------------------------------------
18.4 Merging on Index
"Merging on Index"的意思是可以一个表的某列和另一个表的行索引index作为两个表的关联连接的依据,也可两个表都用index来关联连接。
- 先演示一个表用index和另一个表用列来关联连接,为此需要对上边的两个表做一下简单的调整,对于记录供应商电话信息的supplierInfo表调整成这样,用供应商的id作为行索引,电话作为值。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print ""
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")
print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "right")
print right, "# merge by key and index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "left")
print left, "# merge by key and index , left join"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 10 records
----------------------------------------
telephone
200 13414429305
201 13324093133
202 13603734461
203 13691986220
204 13518758252
205 13631127768
206 13535125903
207 13544359176
208 13316553234
209 13455143637 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
fruit supplier
0 orange 203
1 orange 202
2 banana 200
3 orange 208
4 orange 207
5 orange 209 # saleInfo
----------------------------------------
****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
fruit supplier telephone
0 orange 203 13691986220
1 orange 202 13603734461
2 banana 200 13414429305
3 orange 208 13316553234
4 orange 207 13544359176
5 orange 209 13455143637
5 NaN 201 13324093133
5 NaN 204 13518758252
5 NaN 205 13631127768
5 NaN 206 13535125903 # merge by key and index , right join
----------------------------------------
fruit supplier telephone
0 orange 203 13691986220
1 orange 202 13603734461
2 banana 200 13414429305
3 orange 208 13316553234
4 orange 207 13544359176
5 orange 209 13455143637 # merge by key and index , left join
----------------------------------------
由于supplierInfo用供应商的id作为行索引是Index,而saleInfo里的supplier列也是供应商的id,那么如果想匹配出(连接)热卖水果供应商的电话,需要用supplierInfo的index和saleInfo的列进行连接操作即语句right = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "right")
原因是saleInfo是左表,用了supplier
列需用形参left_on="supplier"
和右表supplierInfo的index进行连接,需要用形参right_index = True
来完成右连接或左连接。总之,这条语句的意思就是左表用列、右表用index行索引。
- 左右表均用index来关联连接。需要使用
left_index=True, right_index = True
形参。
import pandas as pd
import numpy as np
print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print ""
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame(fid.values, index = sup)
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")
print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "right")
print right, "# merge by both index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "left")
print left, "# merge by both index , left join"
print "-" * 40
执行结果:
1rd DataFrame : Supplier Info 10 records
----------------------------------------
telephone
200 13453805505
201 13461207097
202 13619022642
203 13658147818
204 13611310274
205 13545485123
206 13685924267
207 13562572868
208 13636868229
209 13385445647 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
0
202 pineapple
209 pineapple
200 pineapple
208 banana
201 pineapple
204 orange # saleInfo
----------------------------------------
****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
0 telephone
200 pineapple 13453805505
201 pineapple 13461207097
202 pineapple 13619022642
203 NaN 13658147818
204 orange 13611310274
205 NaN 13545485123
206 NaN 13685924267
207 NaN 13562572868
208 banana 13636868229
209 pineapple 13385445647 # merge by both index , right join
----------------------------------------
0 telephone
202 pineapple 13619022642
209 pineapple 13385445647
200 pineapple 13453805505
208 banana 13636868229
201 pineapple 13461207097
204 orange 13611310274 # merge by both index , left join
----------------------------------------
- 这种左右均用index的merge函数使用方式可以用join函数来简化。
import pandas as pd
import numpy as np
#print "1rd DataFrame : Supplier Info 10 records"
#print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
#print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
#print ""
#print "2rd DataFrame : Sales price info"
#print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame(fid.values, index = sup)
#print saleInfo, "# saleInfo"
#print "-" * 40
saleInfo.to_csv("saleInfo.csv")
print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "right")
print right, "# merge by both index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "left")
print left, "# merge by both index , left join"
print "-" * 40
print saleInfo.join(supplierInfo), "# saleInfo.join(supplierInfo)"
print "-" * 40
print supplierInfo.join(saleInfo), "# supplierInfo.join(saleInfo)"
print "-" * 40
执行结果:
****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
0 telephone
200 NaN 13499398039
201 NaN 13472434376
202 NaN 13469093351
203 banana 13474856911
204 orange 13329906610
205 orange 13349143149
206 pineapple 13353763211
207 banana 13333693459
208 NaN 13305384269
209 banana 13531020571 # merge by both index , right join
----------------------------------------
0 telephone
204 orange 13329906610
206 pineapple 13353763211
205 orange 13349143149
203 banana 13474856911
207 banana 13333693459
209 banana 13531020571 # merge by both index , left join
----------------------------------------
0 telephone
204 orange 13329906610
206 pineapple 13353763211
205 orange 13349143149
203 banana 13474856911
207 banana 13333693459
209 banana 13531020571 # saleInfo.join(supplierInfo)
----------------------------------------
telephone 0
200 13499398039 NaN
201 13472434376 NaN
202 13469093351 NaN
203 13474856911 banana
204 13329906610 orange
205 13349143149 orange
206 13353763211 pineapple
207 13333693459 banana
208 13305384269 NaN
209 13531020571 banana # supplierInfo.join(saleInfo)
----------------------------------------
18.5 concat拼接数据
第八章的第二节除了介绍了可以实现数据库连接的merge函数外,还介绍了另一个函数就是concat函数,concat函数可以拼接数据,例如将series拼接成dataframe,将dataframe拼接成更长、更宽的dataframe数据。
18.5.1 concat函数对Series的拼接
- 将Series拼接成DataFrame,这个比较有用。concat函数在没有指定轴的情况下默认是0轴,对于series而言竖向连接。
import pandas as pd
import numpy as np
ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40
sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40
执行结果:
----------------------------------------
a 2
e 3
g 4
b 5
f 0
c 6
dtype: int64 # s0
----------------------------------------
g 5
d 2
b 1
e 4
dtype: int64 # s1
----------------------------------------
g 6
d 5
a 4
e 3
c 2
dtype: int64 # s2
----------------------------------------
a 2
e 3
g 4
b 5
f 0
c 6
g 5
d 2
b 1
e 4
g 6
d 5
a 4
e 3
c 2
dtype: int64 # concat axis = 0
----------------------------------------
解释一下程序的各个部分:
1). 创建一个a~g的Series。
ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
2). 先将作为index的索引乱序,通过permutation函数,然后通过sample函数随机选取若干个样本构建s0、s1、s2三个Series。
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40
3). 使用concat连接这三个Series,得到一个更长的Series。
sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40
- 横向连接Series可以组成DataFrame数据类型,需要注意的是DataFrame是有列索引和行索引的,在concat函数里需要使用
axis = 1
来告诉concat函数横向连接Series,使用keys
参数来指定各个列的名字。
import pandas as pd
import numpy as np
ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40
sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40
df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"])
print df, "# concat axis = 1"
print "-" * 40
执行结果:
----------------------------------------
d 4
f 3
e 5
g 0
c 6
b 2
dtype: int64 # s0
----------------------------------------
a 1
g 2
c 6
b 3
dtype: int64 # s1
----------------------------------------
d 1
f 2
b 0
g 6
c 5
dtype: int64 # s2
----------------------------------------
d 4
f 3
e 5
g 0
c 6
b 2
a 1
g 2
c 6
b 3
d 1
f 2
b 0
g 6
c 5
dtype: int64 # concat axis = 0
----------------------------------------
col_a col_b col_c
a NaN 1.0 NaN
b 2.0 3.0 0.0
c 6.0 6.0 5.0
d 4.0 NaN 1.0
e 5.0 NaN NaN
f 3.0 NaN 2.0
g 0.0 2.0 6.0 # concat axis = 1
----------------------------------------
从执行结果可以看出用Series使用concat横向连接可以组成(创建)DataFrame数据类型,结果有点儿像数据库里的outer连接方式,列会出现NaN数据,很自然想到有inner内连接么?
- concat函数通过
join
形参给出连接的方式,默认是外连接outer。
import pandas as pd
import numpy as np
ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40
sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40
df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"])
print df, "# concat axis = 1"
print "-" * 40
df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"], join = "inner")
print df, "# concat axis = 1, inner join"
print "-" * 40
执行结果:
----------------------------------------
g 6
d 5
c 0
b 3
a 1
e 4
dtype: int64 # s0
----------------------------------------
c 3
d 2
g 6
e 4
dtype: int64 # s1
----------------------------------------
a 1
b 6
d 5
g 2
e 3
dtype: int64 # s2
----------------------------------------
g 6
d 5
c 0
b 3
a 1
e 4
c 3
d 2
g 6
e 4
a 1
b 6
d 5
g 2
e 3
dtype: int64 # concat axis = 0
----------------------------------------
col_a col_b col_c
a 1 NaN 1.0
b 3 NaN 6.0
c 0 3.0 NaN
d 5 2.0 5.0
e 4 4.0 3.0
g 6 6.0 2.0 # concat axis = 1
----------------------------------------
col_a col_b col_c
g 6 6 2
d 5 2 5
e 4 4 3 # concat axis = 1, inner join
----------------------------------------
18.5.2 concat函数拼接DataFrame
同样concat函数也可以作用于dataframe数据类型。如果不指定concat函数的axis,axis默认为0,即竖向连接得到更长的dataframe,以outer方式连接,即没有的列上数据填充NaN。
import pandas as pd
import numpy as np
val1 = np.arange(24).reshape((6,4))
col1 = list("abcd")
ind1 = list("opqrst")
df1 = pd.DataFrame(val1, columns = col1, index = ind1)
val2 = np.arange(25).reshape((5,5))
col2 = list("bxced")
ind2 = list("ijkst")
df2 = pd.DataFrame(val2, columns = col2, index = ind2)
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print pd.concat([df1, df2]), "# concat axis = 0, outer"
print "-" * 40
执行结果:
----------------------------------------
a b c d
o 0 1 2 3
p 4 5 6 7
q 8 9 10 11
r 12 13 14 15
s 16 17 18 19
t 20 21 22 23 # df1
----------------------------------------
b x c e d
i 0 1 2 3 4
j 5 6 7 8 9
k 10 11 12 13 14
s 15 16 17 18 19
t 20 21 22 23 24 # df2
----------------------------------------
a b c d e x
o 0.0 1 2 3 NaN NaN
p 4.0 5 6 7 NaN NaN
q 8.0 9 10 11 NaN NaN
r 12.0 13 14 15 NaN NaN
s 16.0 17 18 19 NaN NaN
t 20.0 21 22 23 NaN NaN
i NaN 0 2 4 3.0 1.0
j NaN 5 7 9 8.0 6.0
k NaN 10 12 14 13.0 11.0
s NaN 15 17 19 18.0 16.0
t NaN 20 22 24 23.0 21.0 # concat axis = 0, outer
----------------------------------------
可以修改这些参数,例如横向:
import pandas as pd
import numpy as np
val1 = np.arange(24).reshape((6,4))
col1 = list("abcd")
ind1 = list("opqrst")
df1 = pd.DataFrame(val1, columns = col1, index = ind1)
val2 = np.arange(25).reshape((5,5))
col2 = list("bxced")
ind2 = list("ijkst")
df2 = pd.DataFrame(val2, columns = col2, index = ind2)
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print pd.concat([df1, df2]), "# concat axis = 0, outer"
print "-" * 40
print pd.concat([df1, df2], axis = 1), "# concat axis = 1, outer"
print "-" * 40
print pd.concat([df1, df2], axis = 1, join = "inner"), "# concat axis = 1, inner"
print "-" * 40
执行结果:
----------------------------------------
a b c d
o 0 1 2 3
p 4 5 6 7
q 8 9 10 11
r 12 13 14 15
s 16 17 18 19
t 20 21 22 23 # df1
----------------------------------------
b x c e d
i 0 1 2 3 4
j 5 6 7 8 9
k 10 11 12 13 14
s 15 16 17 18 19
t 20 21 22 23 24 # df2
----------------------------------------
a b c d e x
o 0.0 1 2 3 NaN NaN
p 4.0 5 6 7 NaN NaN
q 8.0 9 10 11 NaN NaN
r 12.0 13 14 15 NaN NaN
s 16.0 17 18 19 NaN NaN
t 20.0 21 22 23 NaN NaN
i NaN 0 2 4 3.0 1.0
j NaN 5 7 9 8.0 6.0
k NaN 10 12 14 13.0 11.0
s NaN 15 17 19 18.0 16.0
t NaN 20 22 24 23.0 21.0 # concat axis = 0, outer
----------------------------------------
a b c d b x c e d
i NaN NaN NaN NaN 0.0 1.0 2.0 3.0 4.0
j NaN NaN NaN NaN 5.0 6.0 7.0 8.0 9.0
k NaN NaN NaN NaN 10.0 11.0 12.0 13.0 14.0
o 0.0 1.0 2.0 3.0 NaN NaN NaN NaN NaN
p 4.0 5.0 6.0 7.0 NaN NaN NaN NaN NaN
q 8.0 9.0 10.0 11.0 NaN NaN NaN NaN NaN
r 12.0 13.0 14.0 15.0 NaN NaN NaN NaN NaN
s 16.0 17.0 18.0 19.0 15.0 16.0 17.0 18.0 19.0
t 20.0 21.0 22.0 23.0 20.0 21.0 22.0 23.0 24.0 # concat axis = 1, outer
----------------------------------------
a b c d b x c e d
s 16 17 18 19 15 16 17 18 19
t 20 21 22 23 20 21 22 23 24 # concat axis = 1, inner
----------------------------------------
对于两个dataframe均有的列,均保留,合并后原来的dataframe行数没数据填充NaN。
18.6 合并重叠数据
本节第一个例子介绍了numpy的where函数,这个之前已经讲解过了,可以查看Numpy的where函数,本节介绍的另一个函数是pandas的combine_frist函数,比较有用:
a.combine_first(b)
作用是用b的数据填补a的缺失值。
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print df1.combine_first(df2), "# df1.combine_first(df2)"
print "-" * 40
print df2.combine_first(df1), "# df2.combine_first(df1)"
print "-" * 40
执行结果:
----------------------------------------
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14 # df1
----------------------------------------
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0 # df2
----------------------------------------
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN # df1.combine_first(df2)
----------------------------------------
a b c
0 5.0 NaN 2.0
1 4.0 3.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN # df2.combine_first(df1)
----------------------------------------
语句df1.combine_first(df2)
的执行结果:
----------------------------------------
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN # df1.combine_first(df2)
----------------------------------------
df1的‘a’列1行是NaN,df2的‘a’列1行是4,combine_first函数用df2的‘a’列1行是数据4填充df1的‘a’列1行,所以结果里的‘a’列1行是4。
感谢Klang(金浪)智能数据看板klang.org.cn鼎力支持!