
How to Use Pandas - Series and DataFrame [Basics]

Mainyoung 2022. 1. 15. 15:35

๋ฐ์ดํ„ฐ ๋ถ„์„์„ ํ•˜๋‹ค ๋ณด๋ฉด Pandas ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ•„์ˆ˜์ ์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. Pandas๋ž€ ๋ฐ์ดํ„ฐ ๋ถ„์„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ธ๋ฐ ์œ„ํ‚ค๋ฐฑ๊ณผ์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ๋‹ค.

Pandas๋Š” ๋ฐ์ดํ„ฐ ์กฐ์ž‘ ๋ฐ ๋ถ„์„์„ ์œ„ํ•œ Python ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด ์šฉ์œผ๋กœ ์ž‘์„ฑ๋œ ์†Œํ”„ํŠธ์›จ์–ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž…๋‹ˆ๋‹ค. ํŠนํžˆ ์ˆซ์ž ํ…Œ์ด๋ธ”๊ณผ ์‹œ๊ณ„์—ด ์„ ์กฐ์ž‘ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ ์™€ ์—ฐ์‚ฐ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

https://pandas.pydata.org/

 

pandas - Python Data Analysis Library


I never studied Pandas on its own either; I just picked up bits and pieces while doing data analysis. I am taking this opportunity to organize its main features.

 


1. Series - one-dimensional data (integers, floats, strings, etc.)

A Series represents one-dimensional data.

>>> import pandas as pd
>>> series1 = pd.Series([-20,-10,10,20])
>>> print(series1)


0   -20
1   -10
2    10
3    20
dtype: int64

A simple Series can be created as shown above. Once the object exists, it can represent data such as the average temperature from January to April. The index can also be set explicitly.

 

>>> series1 = pd.Series([-20,-10,10,20], index = ['Jan', 'Feb', 'Mar', 'Apr'])
>>> print(series1)

Jan   -20
Feb   -10
Mar    10
Apr    20
dtype: int64
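Once a Series has labels, individual values can be looked up either by label or by position. The following is a minimal sketch of that idea using the same temperature data; the variable names (jan_temp, first_temp, spring, labels) are chosen here for illustration only.

```python
import pandas as pd

# Same data as above: average temperatures keyed by month label.
series1 = pd.Series([-20, -10, 10, 20], index=['Jan', 'Feb', 'Mar', 'Apr'])

jan_temp = series1['Jan']          # label-based lookup, similar to a dict
first_temp = series1.iloc[0]       # position-based lookup
spring = series1.loc['Mar':'Apr']  # label slices include both endpoints
labels = list(series1.index)       # the index labels as a plain list

print(jan_temp, first_temp)
print(spring)
print(labels)
```

Using .loc for labels and .iloc for positions keeps the two kinds of lookup unambiguous, which matters once the labels themselves are integers.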

 

The constructor signature of the Pandas Series class is as follows.

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

Pandas์˜ Series๋ฅผ ์ƒ์„ฑํ•  ๋•Œ dtype์„ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์•„๊นŒ์™€ ๊ฐ™์€ ํ‰๊ท  ์˜จ๋„ ๊ฐ’์„ ์ •์ˆ˜๊ฐ€ ์•„๋‹Œ string ํ˜•ํƒœ๋กœ ์„ค์ •ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜ํƒ€๋‚œ๋‹ค.

 

>>> import pandas as pd
>>> temp = pd.Series([-20,-10,10,20],dtype = 'str')
>>> print(temp)

0    -20
1    -10
2     10
3     20
dtype: object

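In the Pandas version used here, strings are stored with the object dtype, so "dtype: object" above means the values are strings. More generally, object can hold any Python object. Don't get confused!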
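To make the point about object concrete, here is a small sketch, assuming the pandas 1.x behaviour shown above (the dtype reported for string data may differ in newer pandas releases). It checks the type of one stored element and shows that a Series of mixed Python objects is also reported as object. The names numbers, as_text, and mixed are illustrative.

```python
import pandas as pd

numbers = pd.Series([-20, -10, 10, 20])
as_text = numbers.astype(str)       # every element becomes a Python str
mixed = pd.Series([1, 'a', 3.5])    # mixed Python objects are stored as object

first_item = as_text.iloc[0]
print(type(first_item))             # a Python str, not an int
print(mixed.dtype)                  # object, although only one element is a string

# Converting back to integers restores numeric behaviour such as sum().
total = as_text.astype(int).sum()
print(total)
```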

 

๊ฐ parameter์— ๋Œ€ํ•œ ์„ค๋ช…์€ ๊ณต์‹ Document๋ฅผ ์ฐธ๊ณ ํ•˜์ž!

https://pandas.pydata.org/docs/reference/api/pandas.Series.html

 

pandas.Series — pandas 1.3.5 documentation


 


2. DataFrame - two-dimensional data (a collection of Series)

A DataFrame is two-dimensional data: a collection of Series, typically two or more, that share one index.

 

If, in addition to the January-April temperatures from before, we also have data such as humidity, weather, and sunrise time, we can build a DataFrame from them.
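As a minimal sketch of the "collection of Series" idea, two Series that share the same month labels can be combined into one DataFrame. The humidity values and the variable names here are illustrative assumptions; the column labels 온도 and 습도 mean temperature and humidity.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr']
temperature = pd.Series([-20, -10, 10, 20], index=months)
humidity = pd.Series([64, 32, 53, 12], index=months)  # illustrative values

# Each Series becomes one column; rows are aligned on the shared index.
weather = pd.DataFrame({'온도': temperature, '습도': humidity})
print(weather)

# Pulling a column back out gives a Series again.
column = weather['습도']
print(type(column).__name__)
```

In this layout the months are the rows and each measurement is a column, which is the orientation most pandas operations assume.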

 

>>> # Row labels: 온도 = temperature, 습도 = humidity, 날씨 = weather (맑음 = sunny, 흐림 = cloudy, 비 = rain)
>>> temp = pd.DataFrame([[-20,-10,10,20],[64,32,53,12],['맑음','흐림','비','맑음']], columns = ['Jan','Feb','Mar','Apr'],index = ['온도','습도','날씨'])
>>> temp

     Jan  Feb  Mar  Apr
온도   -20  -10   10   20
습도    64   32   53   12
날씨    맑음   흐림    비   맑음

๊ฐ์ฒด๋ฅผ ์„ค์ •ํ•˜๊ณ , ์›ํ•˜๋Š” ๊ฐ’๋“ค์„ 2์ฐจ์› ๋ฐฐ์—ด ํ˜•ํƒœ๋กœ ์ž…๋ ฅํ•˜์—ฌ DataFrame์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค. ํ˜น์€ ์‚ฌ์ „(dictionary) ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ์ƒ์„ฑํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

>>> data = {
    '์ด๋ฆ„' : ['์ฑ„์น˜์ˆ˜', '์ •๋Œ€๋งŒ', '์†กํƒœ์„ญ', '์„œํƒœ์›…', '๊ฐ•๋ฐฑํ˜ธ', '๋ณ€๋•๊ทœ', 'ํ™ฉํƒœ์‚ฐ', '์œค๋Œ€ํ˜‘'],
    'ํ•™๊ต' : ['๋ถ์‚ฐ๊ณ ', '๋ถ์‚ฐ๊ณ ', '๋ถ์‚ฐ๊ณ ', '๋ถ์‚ฐ๊ณ ', '๋ถ์‚ฐ๊ณ ', '๋Šฅ๋‚จ๊ณ ', '๋Šฅ๋‚จ๊ณ ', '๋Šฅ๋‚จ๊ณ '],
    'ํ‚ค' : [197, 184, 168, 187, 188, 202, 188, 190],
    '๊ตญ์–ด' : [90, 40, 80, 40, 15, 80, 55, 100],
    '์˜์–ด' : [85, 35, 75, 60, 20, 100, 65, 85],
    '์ˆ˜ํ•™' : [100, 50, 70, 70, 10, 95, 45, 90],
    '๊ณผํ•™' : [95, 55, 80, 75, 35, 85, 40, 95],
    '์‚ฌํšŒ' : [85, 25, 75, 80, 10, 80, 35, 95],
    'SWํŠน๊ธฐ' : ['Python', 'Java', 'Javascript', '', '', 'C', 'PYTHON', 'C#']
}
>>> df = pd.DataFrame(data)
>>> print(df)

[Figure: the resulting DataFrame]

๋‹ค์Œ๊ณผ ๊ฐ™์ด DataFrame์ด ์ƒ์„ฑ๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ถ”๊ฐ€๋กœ ์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ DataFrame ๋‚ด์˜ Data์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋‹ค.

df['이름']
df[['이름','키']]  # for one column, pass just its name; for two or more, wrap the names in a list

To fetch a single column, write its name inside [] after df; to fetch several columns, wrap their names in a list first. If you don't like the numbers 0~7 on the left, you can change the index.
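As a quick check of the column selection described above, the sketch below inspects what each form returns. It uses a trimmed-down, three-row version of the table so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# A trimmed-down version of the table above (three rows only).
df = pd.DataFrame({
    '이름': ['채치수', '정대만', '송태섭'],   # name
    '키': [197, 184, 168],                  # height
    '국어': [90, 40, 80],                   # Korean score
})

one_column = df['이름']           # single label -> Series
two_columns = df[['이름', '키']]  # list of labels -> DataFrame
still_a_frame = df[['이름']]      # one-element list -> one-column DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(still_a_frame.shape)
```

The outer brackets are the indexing operator and the inner brackets are an ordinary Python list, which is why a one-element list still yields a DataFrame rather than a Series.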

# Create the DataFrame object with an explicit index ('1번' = "No. 1", and so on)
df = pd.DataFrame(data,index = ['1번','2번','3번','4번','5번','6번','7번','8번'])
df
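With custom index labels in place, rows can be selected by label. The following is a minimal sketch, assuming a shortened three-row table so that the block is self-contained; the names row, height, and same_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'이름': ['채치수', '정대만', '송태섭'], '키': [197, 184, 168]},
    index=['1번', '2번', '3번'],   # "No. 1", "No. 2", "No. 3"
)

row = df.loc['2번']            # whole row as a Series, selected by label
height = df.loc['2번', '키']   # single cell: row label, then column label
same_row = df.iloc[1]          # the same row, selected by position

print(row)
print(height)
```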

index์˜ ์ด๋ฆ„์„ ์„ค์ •ํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด ?

##index ์ด๋ฆ„ ์„ค์ •
df.index.name = '์ง€์›๋ฒˆํ˜ธ'
df

์ธ๋ฑ์Šค์˜ ์ด๋ฆ„์ด ์ง€์›๋ฒˆํ˜ธ๋กœ ์„ค์ •๋˜์—ˆ๋‹ค. ๊ทผ๋ฐ ์ด์ „์ฒ˜๋Ÿผ 0~7์˜ ์ˆซ์ž๋ฅผ ๋‹ค์‹œ ๋ถ™์ด๊ณ  ์‹ถ๋‹ค๋ฉด?

# Reset the index
df.reset_index()

๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ธ๋ฑ์Šค๊ฐ€ ์ดˆ๊ธฐํ™” ๋˜์—ˆ๋‹ค. reset_index์˜ ๋ณ€์ˆ˜ ์ค‘์—๋Š” drop์ด ์žˆ๋Š”๋ฐ, drop์„ True๋กœ ์„ค์ •ํ•˜๋ฉด '์ง€์›๋ฒˆํ˜ธ'์—ด์ด ์‚ญ์ œ๋œ๋‹ค.

df.reset_index(drop=True, inplace=False) #์›๋ž˜ ์“ฐ๋˜ '์ง€์›๋ฒˆํ˜ธ' ์ธ๋ฑ์Šค ์‚ญ์ œ -> ์‹ค์ œ ๋ฐ์ดํ„ฐ์—๋Š” ๋ฐ˜์˜์ด ์•ˆ๋จ. ๋”ฐ๋ผ์„œ ์ง€๊ธˆ df๋ฅผ ์ถœ๋ ฅํ•˜๋ฉด ์œ„์— ๊ฒฐ๊ณผ๋ž‘ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ

ํ˜„์žฌ๋Š” '์ง€์›๋ฒˆํ˜ธ'๊ฐ€ ์‚ญ์ œ๋œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด์ง€๋งŒ, df๋ฅผ ์ถœ๋ ฅํ•ด๋ณด๋ฉด ๊ทธ๋Œ€๋กœ ์žˆ๋‹ค. ์‚ญ์ œ๋œ ์ƒํƒœ๋ฅผ df์— ์ €์žฅํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด inplace๋ฅผ True๋กœ ํ•ด์ฃผ๋ฉด ๋œ๋‹ค.
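The difference between keeping and dropping the old index, and the fact that df is left untouched unless the result is stored, can be sketched as follows. This assumes a shortened two-row table; the names kept, dropped, and name_before are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'이름': ['채치수', '정대만'], '키': [197, 184]},
    index=['1번', '2번'],
)
df.index.name = '지원번호'   # "applicant number"

kept = df.reset_index()               # old index moves into a regular column
dropped = df.reset_index(drop=True)   # old index is discarded

print(list(kept.columns))
print(list(dropped.columns))

name_before = df.index.name           # df itself is unchanged so far
print(name_before)

# Assigning the result back is an alternative to inplace=True.
df = df.reset_index(drop=True)
print(list(df.index))
```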

 

ํ˜น์€ ์ž๊ธฐ๊ฐ€ ์›ํ•˜๋Š” ์—ด์„ ์ธ๋ฑ์Šค๋กœ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

df.set_index('์ด๋ฆ„',inplace=True)

์ด์ œ๋Š” ์‚ฌ๋žŒ ๋ณ„๋กœ ์ •๋ณด๋ฅผ ์•Œ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ •๋ ฌ์ด ์•ˆ๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ.. ์ด๋ฅผ ์ •๋ ฌํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด sort_index๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

df.sort_index(ascending=True)  #์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ
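The sketch below, assuming a shortened four-row table, shows sort_index in both directions and confirms that the original DataFrame keeps its order unless the result is stored. The names ascending and descending are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'이름': ['채치수', '정대만', '송태섭', '강백호'], '키': [197, 184, 168, 188]}
).set_index('이름')

ascending = df.sort_index()                  # ascending=True is the default
descending = df.sort_index(ascending=False)  # reverse order

print(list(ascending.index))
print(list(descending.index))
print(list(df.index))   # original order; sort_index returned new objects
```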

 
