
Data Exploration and Data Processing & Text Data Processing (Part 2)

Date: 2021-07-01 10:21:17

Part 1: Data Exploration and Data Processing

Data processing

# Copy the original data
df3 = df.copy()
df3.info()

RangeIndex: 3004 entries, 0 to 3003
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          3004 non-null   int64  
 1   name        3004 non-null   object 
 2   gender      2868 non-null   object 
 3   age         2904 non-null   float64
 4   edu         1073 non-null   object 
 5   custom_amt  3004 non-null   object 
 6   order_date  3004 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 164.4+ KB
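# Convert the numeric id to string type
df3["id"] = df3["id"].astype("str")

# Convert the amount from string to float
df3["custom_amt"] = df3["custom_amt"].str.strip("¥").astype("float")  # strip the ¥ sign first

# Convert the order date from string to datetime
df3["order_date"] = pd.to_datetime(df3["order_date"], format="%Y年%m月%d日")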
df3.info()

RangeIndex: 3004 entries, 0 to 3003
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   id          3004 non-null   object        
 1   name        3004 non-null   object        
 2   gender      2868 non-null   object        
 3   age         2904 non-null   float64       
 4   edu         1073 non-null   object        
 5   custom_amt  3002 non-null   float64       
 6   order_date  3004 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(4)
memory usage: 164.4+ KB
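
The three conversions above can be tried on a small stand-alone frame. This is only a sketch: the column names follow the source, but the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the source, values are made up
toy = pd.DataFrame({
    "id": [890, 2391],
    "custom_amt": ["¥2177.94", "¥849.79"],
    "order_date": ["2018年12月25日", "2017年05月24日"],
})

toy["id"] = toy["id"].astype("str")                                    # integers -> text
toy["custom_amt"] = toy["custom_amt"].str.strip("¥").astype("float")   # "¥2177.94" -> 2177.94
toy["order_date"] = pd.to_datetime(toy["order_date"], format="%Y年%m月%d日")

print(toy.dtypes)
```

In the output above, the non-null count of custom_amt drops from 3004 to 3002 after the conversion. One possible cause, not confirmed by the source, is that `.str` methods return NaN for elements of an object column that are not strings.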
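df3.dtypes  # dtypes is an attribute, so no parentheses; a method such as info() is a function bound to an object and is called with ()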
id                    object
name                  object
gender                object
age                  float64
edu                   object
custom_amt           float64
order_date    datetime64[ns]
dtype: object
df3.duplicated().sum()
4
df3.drop_duplicates()  # only drops the duplicate rows; the index labels are not renumbered

     id       name  gender   age  custom_amt  order_date
0   890  李小胆李l  female  43.0     2177.94  2018-12-25
1  2391      881xt    male  52.0     2442.18  2017-05-24
2  2785      haoah    male  39.0      849.79  2018-05-15
3  1361      snaen  female  26.0     2482.22  2018-05-16
4   888    sue女少  female  61.0     2027.90  2018-01-21

3000 rows × 7 columns
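
The duplicate count and the "index unchanged" remark can be checked on a tiny frame. The sketch below uses a hypothetical frame `toy`; `reset_index(drop=True)` is one way to renumber the rows afterwards, which the source itself does not do.

```python
import pandas as pd

# Hypothetical frame with one fully duplicated row (labels 1 and 2 are identical)
toy = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})

n_dupes = toy.duplicated().sum()              # counts rows identical to an earlier row
deduped = toy.drop_duplicates()               # returns a new frame; toy itself is unchanged
renumbered = deduped.reset_index(drop=True)   # optional: renumber the rows from 0

print(n_dupes)                  # 1
print(list(deduped.index))      # [0, 1, 3]
print(list(renumbered.index))   # [0, 1, 2]
```

Because `drop_duplicates()` returns a new DataFrame, the result has to be assigned to a name (or `inplace=True` passed) for the change to persist.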
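df4 = df3.drop_duplicates().copy()  # assumption: df4 is the de-duplicated frame; its definition is not shown in the source
df4["gender"] = df4["gender"].str.title()  # capitalize the first letter of every value; missing values stay NaN
df4["gender"] = df4["gender"].map({"Female": 0, "Male": 1})  # encode the categorical variable as 0/1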
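
The gender column contains missing values (2868 non-null out of 3004), and calling `x.title()` inside `apply` raises an AttributeError on a missing value. A minimal sketch of a missing-value-safe variant, using a hypothetical Series rather than the original data:

```python
import pandas as pd

# Hypothetical values, including a missing entry and inconsistent casing
gender = pd.Series(["female", "male", None, "FEMALE"])

titled = gender.str.title()                      # the missing value passes through as NaN
encoded = titled.map({"Female": 0, "Male": 1})   # values not in the dict (and NaN) become NaN

print(encoded.tolist())
```

Since the result contains NaN, pandas stores the codes as floats; filling or dropping the missing values first would keep them as integers.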
# Remove special symbols from the data
df["custom_amt"].str.replace("¥","")
0       2177.94
1       2442.18
2        849.79
3       2482.22
4        2027.9
         ...   
2999     542.02
3000    2593.38
3001     139.68
3002     670.89
3003     118.37
Name: custom_amt, Length: 3004, dtype: object
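
As the dtype line shows, the result of `str.replace` is still text (object), and the original column is not modified. A short sketch with made-up amounts shows the remaining step to get numbers; the Series name `amt` is hypothetical, and `regex=False` is added here to make the literal match explicit.

```python
import pandas as pd

# Hypothetical amounts stored as text with a currency symbol
amt = pd.Series(["¥2177.94", "¥849.79", "¥2027.9"])

cleaned = amt.str.replace("¥", "", regex=False)   # literal replacement; returns a new Series
as_float = cleaned.astype("float")                # text -> float64

print(cleaned.tolist())
print(as_float.dtype)
```

Unlike `str.strip("¥")`, which only removes the symbol at the ends of each string, `str.replace` removes it wherever it occurs.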