量化投资03---小市值轮动因子---准备工作01
对于每只股票,除了行情,从哪里可以获得市值的数据呢?能不能有一份自己的数据,自己整理?数据整理,会碰到哪些问题,有哪些工作,复杂么?经常听他说,数据要复权,为什么?什么是前收盘?不进行复权,对股票数据的计算有什么影响?复权计算怎么做?计算量大么?如何才能把处理好的数据保存下来?
import akshare as ak
import pandas as pd
#找出历史所有的股票(代码)
# 所有退市的股票代码及名称
df_stop = ak.stock_zh_a_stop_em()
d1 = df_stop[['代码','名称']]
col=['code','name']
d1.columns=col
d1.set_index('code',inplace=True)
#这里要特别注意,不管你把那一列设置为索引之后
# 索引这一列的数据类型为
print(type(d1.index))
# 取得这个类型的字符串值需要这样处理
print(d1.index.str)
d1 = d1.sort_index()
d1
#详细的查看退市的股票有哪些
#d1 = d1[~(d1.index.str.startswith('400'))]
#d1 = d1[~(d1.index.str.startswith('420'))]
#d1.shape
# 经过查看,都只是4开头的,这是为啥?
<class 'pandas.core.indexes.base.Index'>
<pandas.core.strings.accessor.StringMethods object at 0x000001F9D341B640>
| name | |
|---|---|
| code | |
| 400002 | 长 白 5 |
| 400005 | 海国实 5 |
| 400006 | 京中兴1 |
| 400007 | 华 凯 1 |
| 400008 | 水仙A 5 |
| ... | ... |
| 420103 | *ST舜喆B |
| 420108 | 退市鹏起B |
| 420120 | 绿庭B5 |
| 420140 | 东海B3 |
| 420153 | 海创B3 |
154 rows × 1 columns
#获得A股分钟级历史股票数据
# 跟日历史数据不同
# 这里有昨收这个价格,而没有前收盘价,对于实时数据来讲是对的
# 这里有最新价,而没有收盘价,深市有收盘集合竞价
# 今开 不一定 跟做收对应,开盘有集合竞价
df_now = ak.stock_zh_a_spot_em()
df_now.columns
Index(['序号', '代码', '名称', '最新价', '涨跌幅', '涨跌额', '成交量', '成交额', '振幅', '最高', '最低',
'今开', '昨收', '量比', '换手率', '市盈率-动态', '市净率', '总市值', '流通市值', '涨速', '5分钟涨跌',
'60日涨跌幅', '年初至今涨跌幅'],
dtype='object')
# 上市的股票代码及名称
d2 = df_now[['代码','名称']]
col=['code','name']
d2.columns=col
d2.set_index('code',inplace=True)
# 注意:一旦code被设置成索引,则dataframe的cloumns中将不包含code
d2 = d2.sort_index()
#股票代码规则请收看
#https://blog.csdn.net/turui/article/details/127113419
#找出我们需要的数据,上证,60,68开头(600,601,603,605,688),深证00,30开头(000,002,300,301)
# 605 是上海主板的新号段
# 001×××国债现货
# 003特殊时期股票
# 400 420 403 新三板的企业
# 689科创板的特殊股票
# 82,83,87以及新三板都是北交所的股票代码
#d2 = d2[~(d2.index.str.startswith('600'))]
#d2 = d2[~(d2.index.str.startswith('601'))]
#d2 = d2[~(d2.index.str.startswith('603'))]
#d2 = d2[~(d2.index.str.startswith('605'))]
#d2 = d2[~(d2.index.str.startswith('688'))]
#d2 = d2[~(d2.index.str.startswith('000'))]
#d2 = d2[~(d2.index.str.startswith('002'))]
#d2 = d2[~(d2.index.str.startswith('300'))]
#d2 = d2[~(d2.index.str.startswith('301'))]
#不玩以下的股票
d2 = d2[~(d2.index.str.startswith('001'))]
d2 = d2[~(d2.index.str.startswith('003'))]
d2 = d2[~(d2.index.str.startswith('400'))]
d2 = d2[~(d2.index.str.startswith('420'))]
d2 = d2[~(d2.index.str.startswith('430'))]
d2 = d2[~(d2.index.str.startswith('689'))]
d2 = d2[~(d2.index.str.startswith('83'))]
d2 = d2[~(d2.index.str.startswith('87'))]
d2.columns
Index(['name'], dtype='object')
d2.index
Index(['000001', '000002', '000003', '000004', '000005', '000006', '000007',
'000008', '000009', '000010',
...
'688786', '688787', '688788', '688789', '688793', '688798', '688799',
'688800', '688819', '688981'],
dtype='object', name='code', length=4940)
# 本来想把退市的股票跟上市的股票一起整理来的
# 发现退市的股票都是400开头,则这里不再额外整理
#df_all = pd.concat([d1,d2],axis=0)
#df_all = df_all.sort_index()
# 需要生成 股票代码 股票名称 交易日期 开盘价 最高价 最低价 收盘价 前收盘价 成交量 成交额 流通市值 总市值
# 当前数据中有的 代码 名称 今开 最高 最低 最新价 昨收 成交量 成交额 流通市值 总市值
# 收盘后可以这样计算当前总股本数
#d3 = df_now.loc[df_now['代码'].isin(d2.index),['代码','名称','交易日期','开盘','收盘','昨收','流通市值','总市值']]
d3 = df_now.loc[df_now['代码'].isin(d2.index),['代码']]
d3['股票名称'] = df_now.loc[df_now['代码'].isin(d2.index),['名称']]
d3['交易日期'] = pd.Timestamp.today().strftime('%Y-%m-%d') #datetime.now()
d3['开盘价'] = df_now.loc[df_now['代码'].isin(d2.index),['今开']]
d3['收盘价'] = df_now.loc[df_now['代码'].isin(d2.index),['最新价']]
d3['昨收'] = df_now.loc[df_now['代码'].isin(d2.index),['昨收']]
d3['流通市值'] = df_now.loc[df_now['代码'].isin(d2.index),['流通市值']]
d3['总市值'] = df_now.loc[df_now['代码'].isin(d2.index),['总市值']]
col=['code','name','date_trade','open','close','close_yesterday','value_liquidity','value_total']
d3.columns=col
d3.set_index('code',inplace=True)
d3 = d3.sort_index()
#新加两列'股本'
d3['equity_liquidity'] = d3['value_liquidity'] / d3['close']
d3['equity_total'] = d3['value_total'] / d3['close']
d3
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | |
|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 |
| 000002 | 万 科A | 2022-10-10 | 17.54 | 17.83 | 17.15 | 1.732605e+11 | 2.073755e+11 | 9.717361e+09 | 1.163071e+10 |
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN |
| 000004 | ST国华 | 2022-10-10 | 8.36 | 8.44 | 8.36 | 9.818123e+08 | 1.121237e+09 | 1.163285e+08 | 1.328480e+08 |
| 000005 | ST星源 | 2022-10-10 | 1.67 | 1.68 | 1.67 | 1.777350e+09 | 1.778342e+09 | 1.057946e+09 | 1.058537e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688798 | 艾为电子 | 2022-10-10 | 91.73 | 90.40 | 91.80 | 8.604724e+09 | 1.500640e+10 | 9.518500e+07 | 1.660000e+08 |
| 688799 | 华纳药厂 | 2022-10-10 | 31.24 | 31.48 | 31.23 | 1.741253e+09 | 2.952824e+09 | 5.531300e+07 | 9.380000e+07 |
| 688800 | 瑞可达 | 2022-10-10 | 136.00 | 130.00 | 136.00 | 9.227140e+09 | 1.471042e+10 | 7.097800e+07 | 1.131571e+08 |
| 688819 | 天能股份 | 2022-10-10 | 34.12 | 32.90 | 34.51 | 4.279480e+09 | 3.198209e+10 | 1.300754e+08 | 9.721000e+08 |
| 688981 | 中芯国际 | 2022-10-10 | 37.51 | 37.78 | 37.56 | 7.395067e+10 | 2.992704e+11 | 1.957403e+09 | 7.921399e+09 |
4940 rows × 9 columns
#处理NAN的数据
# 会看到 value_market(市值)为NaN,即当天没有市值,即已经退市的一些股票
# 这些股票的历史数据也是有意义的,所以股票代码保持全量,但是退市前的市值,股本取起来是个麻烦事情
# 有一些是退市的股票,有些是未上市股票
# 有昨收的是退市的,没昨收的是将上市的
# 有Nan的,总计198支,其中退市189,其中188只有昨收,000508 这只股票没有昨收但已经退市,即将上市9只
# 即将上市的股票代码取历史日线数据会出错
dnan = d3.loc[d3['value_total'].isnull()]
#dnan = dnan[~(dnan.index.str.startswith('000'))]
#dnan = dnan[~(dnan.index.str.startswith('002'))]
#dnan = dnan[~(dnan.index.str.startswith('300'))]
#dnan = dnan[~(dnan.index.str.startswith('301'))]
#dnan = dnan[~(dnan.index.str.startswith('600'))]
#dnan = dnan[~(dnan.index.str.startswith('601'))]
#dnan = dnan[~(dnan.index.str.startswith('603'))]
#dnan = dnan[~(dnan.index.str.startswith('688'))]
dnan
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | |
|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN |
| 000013 | *ST石化A | 2022-10-10 | NaN | NaN | 2.47 | NaN | NaN | NaN | NaN |
| 000015 | PT中浩A | 2022-10-10 | NaN | NaN | 6.85 | NaN | NaN | NaN | NaN |
| 000018 | 神城A退 | 2022-10-10 | NaN | NaN | 0.27 | NaN | NaN | NaN | NaN |
| 000024 | 招商地产 | 2022-10-10 | NaN | NaN | 40.50 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 603996 | 退市中新 | 2022-10-10 | NaN | NaN | 0.39 | NaN | NaN | NaN | NaN |
| 688152 | 麒麟信安 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688291 | 金橙子 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688372 | 伟测科技 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688426 | 康为世纪 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
198 rows × 9 columns
#新上市股票代码
df_newstock = dnan[dnan['value_total'].isnull() & dnan['close_yesterday'].isnull()]
#df_newstock = df_newstock[~df_newstock.index.str.startswith("000508")]
# PD的索引类型可以跟字符串比较是否相等,却不能跟时间类型直接比较
df_newstock = df_newstock[df_newstock.index != '000508']
df_newstock
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | |
|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||
| 301223 | 中荣股份 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 301230 | 泓博医药 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 301367 | 怡和嘉业 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 301379 | 天山电子 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 301389 | 隆扬电子 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688152 | 麒麟信安 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688291 | 金橙子 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688372 | 伟测科技 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 688426 | 康为世纪 | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# 退市股票代码
# 这里不能分两步写,否则要用到conact
df_closed = dnan[(~dnan['close_yesterday'].isnull())|(dnan.index == '000508')]
df_closed
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | |
|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN |
| 000013 | *ST石化A | 2022-10-10 | NaN | NaN | 2.47 | NaN | NaN | NaN | NaN |
| 000015 | PT中浩A | 2022-10-10 | NaN | NaN | 6.85 | NaN | NaN | NaN | NaN |
| 000018 | 神城A退 | 2022-10-10 | NaN | NaN | 0.27 | NaN | NaN | NaN | NaN |
| 000024 | 招商地产 | 2022-10-10 | NaN | NaN | 40.50 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 601268 | *ST二重 | 2022-10-10 | NaN | NaN | 2.35 | NaN | NaN | NaN | NaN |
| 601299 | 中国北车 | 2022-10-10 | NaN | NaN | 29.98 | NaN | NaN | NaN | NaN |
| 601558 | 退市锐电 | 2022-10-10 | NaN | NaN | 0.25 | NaN | NaN | NaN | NaN |
| 603157 | 退市拉夏 | 2022-10-10 | NaN | NaN | 0.59 | NaN | NaN | NaN | NaN |
| 603996 | 退市中新 | 2022-10-10 | NaN | NaN | 0.39 | NaN | NaN | NaN | NaN |
189 rows × 9 columns
# 取出子集,包括了退市的189支+4742支上市的,合计4931支
# 不包含新股票9只
d4 = d3.loc[~d3.index.isin(df_newstock.index)]
d4 = d4.sort_index()
d4
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | |
|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 |
| 000002 | 万 科A | 2022-10-10 | 17.54 | 17.83 | 17.15 | 1.732605e+11 | 2.073755e+11 | 9.717361e+09 | 1.163071e+10 |
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN |
| 000004 | ST国华 | 2022-10-10 | 8.36 | 8.44 | 8.36 | 9.818123e+08 | 1.121237e+09 | 1.163285e+08 | 1.328480e+08 |
| 000005 | ST星源 | 2022-10-10 | 1.67 | 1.68 | 1.67 | 1.777350e+09 | 1.778342e+09 | 1.057946e+09 | 1.058537e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688798 | 艾为电子 | 2022-10-10 | 91.73 | 90.40 | 91.80 | 8.604724e+09 | 1.500640e+10 | 9.518500e+07 | 1.660000e+08 |
| 688799 | 华纳药厂 | 2022-10-10 | 31.24 | 31.48 | 31.23 | 1.741253e+09 | 2.952824e+09 | 5.531300e+07 | 9.380000e+07 |
| 688800 | 瑞可达 | 2022-10-10 | 136.00 | 130.00 | 136.00 | 9.227140e+09 | 1.471042e+10 | 7.097800e+07 | 1.131571e+08 |
| 688819 | 天能股份 | 2022-10-10 | 34.12 | 32.90 | 34.51 | 4.279480e+09 | 3.198209e+10 | 1.300754e+08 | 9.721000e+08 |
| 688981 | 中芯国际 | 2022-10-10 | 37.51 | 37.78 | 37.56 | 7.395067e+10 | 2.992704e+11 | 1.957403e+09 | 7.921399e+09 |
4931 rows × 9 columns
#增加一列,是否退市的标记
# 这可以说是一份基本完整的快照数据了
df_all = d4
df_all['is_closed'] = False
df_all['is_closed'] = df_all['value_total'].isnull()
df_all
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|
| code | ||||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 | False |
| 000002 | 万 科A | 2022-10-10 | 17.54 | 17.83 | 17.15 | 1.732605e+11 | 2.073755e+11 | 9.717361e+09 | 1.163071e+10 | False |
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN | True |
| 000004 | ST国华 | 2022-10-10 | 8.36 | 8.44 | 8.36 | 9.818123e+08 | 1.121237e+09 | 1.163285e+08 | 1.328480e+08 | False |
| 000005 | ST星源 | 2022-10-10 | 1.67 | 1.68 | 1.67 | 1.777350e+09 | 1.778342e+09 | 1.057946e+09 | 1.058537e+09 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688798 | 艾为电子 | 2022-10-10 | 91.73 | 90.40 | 91.80 | 8.604724e+09 | 1.500640e+10 | 9.518500e+07 | 1.660000e+08 | False |
| 688799 | 华纳药厂 | 2022-10-10 | 31.24 | 31.48 | 31.23 | 1.741253e+09 | 2.952824e+09 | 5.531300e+07 | 9.380000e+07 | False |
| 688800 | 瑞可达 | 2022-10-10 | 136.00 | 130.00 | 136.00 | 9.227140e+09 | 1.471042e+10 | 7.097800e+07 | 1.131571e+08 | False |
| 688819 | 天能股份 | 2022-10-10 | 34.12 | 32.90 | 34.51 | 4.279480e+09 | 3.198209e+10 | 1.300754e+08 | 9.721000e+08 | False |
| 688981 | 中芯国际 | 2022-10-10 | 37.51 | 37.78 | 37.56 | 7.395067e+10 | 2.992704e+11 | 1.957403e+09 | 7.921399e+09 | False |
4931 rows × 10 columns
# 找出三个典型例子观察下
df_all.loc[df_all.index.isin(['000001','000003','000508'])] # 三种情况:正常,退市,000508这个特殊的额存在
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|
| code | ||||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 | False |
| 000003 | PT金田A | 2022-10-10 | NaN | NaN | 2.71 | NaN | NaN | NaN | NaN | True |
| 000508 | 琼民源A | 2022-10-10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | True |
# 因为显式索引为code,那么直接使用df_all[0]会报错
# 根据数组下标(隐式索引),获得股票代码(显式索引)
df_all.index[0]
'000001'
#带上列取值
df_all.loc[df_all.index[0],'name']
'平安银行'
# 按照数组下标取得一行则为
# 注意 dataframe 的 loc 后面跟的是[]
df_all.loc[df_all.index[0]]
name 平安银行
date_trade 2022-10-10
open 11.87
close 11.84
close_yesterday 11.86
value_liquidity 229761675888.0
value_total 229766071464.0
equity_liquidity 19405546950.0
equity_total 19405918197.972973
is_closed False
Name: 000001, dtype: object
#df_all.loc[df_all.index[0],'close_yesterday']
df_all.iloc[0]['close_yesterday']
11.86
# 把收盘价补回来(目前只有一个000508)
# 这里有个取单元格(close_yesterday)的值得方法(想了俺半天)
# 注意看打印
for i in range(0,len(df_all)):
if pd.isna(df_all.iloc[i]['close_yesterday']):
# 从另一个接口,取出所有的历史日线,放入 df_day
code_full = 'sz'+ df_all.index[i]
df_day = ak.stock_zh_index_daily(symbol=code_full)
df_day.sort_values(by='date')
print(df_day.sort_values(by='date').tail(3))
print(df_all.iloc[i]['close_yesterday'])
# df_day 最后一个的收盘价用来赋值
df_all.loc[df_all.index[i],'close_yesterday'] = df_day.sort_values(by='date').tail(1)['close'].iloc[0]
print(df_all.iloc[i]['close_yesterday'])
#df_all.loc[df_all.index[i],'close_yesterday'] = 23.5
# 这里是个谜,只能用loc这个热方法来赋值,目前还不清楚iloc方法不能赋值问题出在哪里
#df_all.iloc[i]['close_yesterday'] = df_day.sort_values(by='date').tail(1)['close'].iloc[0]
#df_all.iloc[i]['close_yesterday'] = 23.5
#df_all[~df_all['close_yesterday'].isnull()]
date open high low close volume
937 1997-02-26 24.15 25.48 24.12 25.41 10165400
938 1997-02-27 26.00 26.18 24.90 25.22 10688600
939 1997-02-28 24.80 24.80 22.70 23.50 56362000
nan
23.5
df_all.loc['000508']
d_test = df_all.loc[df_closed.index]
#d_test = df_all.loc[df_closed.index,['name','close_yesterday']]
#d_test = df_all.loc[df_closed.index,'close_yesterday'] #换成这一行会报错
# 这里还发现iloc 的另外一个问题,当只有一列的时候,会报错
d_test.iloc[0]['close_yesterday']
2.71
# 针对已经退市的股票,想对比一下
# 从实时行情得到的昨收df_all['close_yesterday']
# 从各支股票最后一条数据的收盘价
# 是否一样
d_test = df_all.loc[df_closed.index]
for i in range(0,len(d_test)):
code_now = d_test.index[i]
code_full = ''
if code_now.startswith('6'):
code_full = 'sh' + code_now
else:
code_full = 'sz' + code_now
df_day = ak.stock_zh_index_daily(symbol=code_full)
if(d_test.loc[d_test.index[i]]['close_yesterday'] != df_day.sort_values(by='date').tail(1)['close'].iloc[0]):
print(d_test.loc[d_test.index[i]])
#else:
#print(d_test.iloc[i])
print('Nice!')
#运行这模块要等待一段时间,毕竟请求了189次股票的数据
#验证完毕这里的数据都是对的
Nice!
#网易的数据接口,没有复权,有总市值
#sz000003 这只退市的股票为例,请求为
#http://quotes.money.163.com/service/chddata.html?code=1000003&start=19910103&end=20020426&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP
#http://quotes.money.163.com/service/chddata.html?code=1000003
#函数:网易数据接口获得某只股票的历史日线(行情)数据
def get_daily_163(code,start='19900101',end=''):
if code.startswith('6'):
code = '0'+code
else:
code = '1'+code
url_mod="http://quotes.money.163.com/service/chddata.html?code=%s&start=%s&end=%s"
url=url_mod%(code,start,end)
df=pd.read_csv(url, encoding = 'gb2312')
#修复数据,上市第一天,前开盘价肯定是1,原始数据为0
df.loc[df.index[-1],['前收盘']] = 1
#赋值的老疑惑问题,以下两句有警告
#df.tail(1)['前收盘'] = 1
#df.iloc[-1]['前收盘'] = 1
# 这里还是一定要转换成时间类型的
# 否则用字符串类型作为索引的话很多赋值切片工作无法完成
df.index = pd.to_datetime(df.日期)
df.sort_index() # 这里排序很重要哦,关系到下面的数据填充
df = df[['股票代码','名称','日期','开盘价','收盘价','前收盘','流通市值','总市值']]
df['股票代码'] = code[1:]
df['流通股本'] = df['流通市值'] / df['收盘价']
df['总股本'] = df['总市值'] / df['收盘价']
df['未开张'] =~((df['开盘价'] > 0) & (df['收盘价'] > 0)) #针对某只股票当前是指它是否退市,针对某只股票某天它是否交易
#print(~df['未开张'])
#向下填充,处理未开市的日子
#df.replace(to_replace=np.inf,value=np.nan) #为什么这里用replace不起作用?搞不懂
df.loc[df['未开张'],'流通股本'] = np.nan
df.loc[df['未开张'],'总股本'] = np.nan
df = df.fillna(method='ffill')
df = df.rename(columns={'名称':'股票名称'})
df = df.rename(columns={'前收盘':'前收盘价'})
df = df.sort_index()
df.rename(columns={'股票代码':'code',
'股票名称':'name',
'日期' :'date_trade',
'开盘价' :'open',
'收盘价' :'close',
'前收盘价':'close_before',
'流通市值':'value_liquidity',
'总市值' :'value_total',
'流通股本':'equity_liquidity',
'总股本' :'equity_total',
'未开张' :'is_closed'},
inplace=True)
return df
#一只股票为例,处理数据
df_his = get_daily_163('000003')
df_his
| code | name | date_trade | open | close | close_before | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 日期 | |||||||||||
| 1991-01-02 | 000003 | 深金田A | 1991-01-02 | 189.66 | 189.66 | 1.00 | 1.147986e+10 | 1.894926e+10 | 6.052864e+07 | 9.991174e+07 | False |
| 1991-01-03 | 000003 | 深金田A | 1991-01-03 | 188.25 | 188.25 | 189.66 | 1.139452e+10 | 1.880839e+10 | 6.052864e+07 | 9.991174e+07 | False |
| 1991-01-04 | 000003 | 深金田A | 1991-01-04 | 187.31 | 187.31 | 188.25 | 1.133762e+10 | 1.871447e+10 | 6.052864e+07 | 9.991174e+07 | False |
| 1991-01-09 | 000003 | 深金田A | 1991-01-09 | 0.00 | 0.00 | 187.31 | 1.133762e+10 | 1.871447e+10 | 6.052864e+07 | 9.991174e+07 | True |
| 1991-01-10 | 000003 | 深金田A | 1991-01-10 | 0.00 | 0.00 | 187.31 | 1.133762e+10 | 1.871447e+10 | 6.052864e+07 | 9.991174e+07 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2002-04-22 | 000003 | PT金田A | 2002-04-22 | 0.00 | 0.00 | 2.58 | 4.953879e+08 | 6.687277e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 2002-04-23 | 000003 | PT金田A | 2002-04-23 | 0.00 | 0.00 | 2.58 | 4.953879e+08 | 6.687277e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 2002-04-24 | 000003 | PT金田A | 2002-04-24 | 0.00 | 0.00 | 2.58 | 4.953879e+08 | 6.687277e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 2002-04-25 | 000003 | PT金田A | 2002-04-25 | 0.00 | 0.00 | 2.58 | 4.953879e+08 | 6.687277e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 2002-04-26 | 000003 | PT金田A | 2002-04-26 | 2.71 | 2.71 | 2.58 | 5.203493e+08 | 7.024233e+08 | 1.920108e+08 | 2.591968e+08 | False |
2770 rows × 11 columns
# 做一个空的df,存放所有退市的股票的,最后一个交易日(历史数据中)的数据
#df_all.drop(df_all.index,axis=0,inplace=True) #两句的效果应该是一样的
df_snapshot = df_all.iloc[:0]
df_snapshot
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|
| code |
#获得所有退市的股票的总市值,流通市值数据(历史数据的最后一行)
for i in range(0,len(df_closed)):
d = get_daily_163(df_closed.index[i])
d = d.loc[~d['is_closed']].tail(1)
#print(d)
d.set_index('code',inplace=True)
#print(d)
df_snapshot = pd.concat([df_snapshot,d])
# 这里请求了189次数据,运算需要一定的时间
df_snapshot['is_closed'] = True
df_snapshot = df_snapshot.sort_index()
df_snapshot
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | close_before | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| code | |||||||||||
| 000003 | PT金田A | 2002-04-26 | 2.71 | 2.71 | NaN | 5.203493e+08 | 7.024233e+08 | 1.920108e+08 | 2.591968e+08 | True | 2.58 |
| 000013 | *ST石化A | 2004-04-28 | 2.45 | 2.47 | NaN | 1.267703e+08 | 6.683696e+08 | 5.132400e+07 | 2.705950e+08 | True | 2.50 |
| 000015 | PT中浩A | 2001-06-08 | 6.85 | 6.85 | NaN | 1.798988e+08 | 9.273079e+08 | 2.626260e+07 | 1.353734e+08 | True | 6.80 |
| 000018 | 神城A退 | 2020-01-06 | 0.27 | 0.27 | NaN | 3.721804e+08 | 4.585262e+08 | 1.378446e+09 | 1.698245e+09 | True | 0.27 |
| 000024 | 招商地产 | 2015-12-07 | 40.60 | 40.50 | NaN | 4.158946e+10 | 1.043260e+11 | 1.026900e+09 | 2.575951e+09 | True | 39.50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 601268 | *ST二重 | 2014-04-28 | 2.22 | 2.35 | NaN | 3.971500e+09 | 5.389606e+09 | 1.690000e+09 | 2.293450e+09 | True | 2.24 |
| 601299 | 中国北车 | 2015-05-06 | 32.88 | 29.98 | NaN | 3.035800e+11 | 3.675482e+11 | 1.012608e+10 | 1.225978e+10 | True | 32.60 |
| 601558 | 退市锐电 | 2020-06-23 | 0.25 | 0.25 | NaN | 1.507650e+09 | 1.507650e+09 | 6.030600e+09 | 6.030600e+09 | True | 0.25 |
| 603157 | 退市拉夏 | 2022-05-17 | 0.65 | 0.59 | NaN | 1.962358e+08 | 3.231263e+08 | 3.326030e+08 | 5.476716e+08 | True | 0.64 |
| 603996 | 退市中新 | 2022-05-17 | 0.40 | 0.39 | NaN | 1.170585e+08 | 1.170585e+08 | 3.001500e+08 | 3.001500e+08 | True | 0.39 |
189 rows × 11 columns
# 用快照填充,当日K线中,这些退市的股票的,市值,股本相关的数据
df_all.loc[df_snapshot.index] = df_snapshot
df_all
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|
| code | ||||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 | False |
| 000002 | 万 科A | 2022-10-10 | 17.54 | 17.83 | 17.15 | 1.732605e+11 | 2.073755e+11 | 9.717361e+09 | 1.163071e+10 | False |
| 000003 | PT金田A | 2002-04-26 | 2.71 | 2.71 | NaN | 5.203493e+08 | 7.024233e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 000004 | ST国华 | 2022-10-10 | 8.36 | 8.44 | 8.36 | 9.818123e+08 | 1.121237e+09 | 1.163285e+08 | 1.328480e+08 | False |
| 000005 | ST星源 | 2022-10-10 | 1.67 | 1.68 | 1.67 | 1.777350e+09 | 1.778342e+09 | 1.057946e+09 | 1.058537e+09 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 688798 | 艾为电子 | 2022-10-10 | 91.73 | 90.40 | 91.80 | 8.604724e+09 | 1.500640e+10 | 9.518500e+07 | 1.660000e+08 | False |
| 688799 | 华纳药厂 | 2022-10-10 | 31.24 | 31.48 | 31.23 | 1.741253e+09 | 2.952824e+09 | 5.531300e+07 | 9.380000e+07 | False |
| 688800 | 瑞可达 | 2022-10-10 | 136.00 | 130.00 | 136.00 | 9.227140e+09 | 1.471042e+10 | 7.097800e+07 | 1.131571e+08 | False |
| 688819 | 天能股份 | 2022-10-10 | 34.12 | 32.90 | 34.51 | 4.279480e+09 | 3.198209e+10 | 1.300754e+08 | 9.721000e+08 | False |
| 688981 | 中芯国际 | 2022-10-10 | 37.51 | 37.78 | 37.56 | 7.395067e+10 | 2.992704e+11 | 1.957403e+09 | 7.921399e+09 | False |
4931 rows × 10 columns
# 查看一下快照中的几个有特点的值,集合是乱序的,结果是按照index(代码)排序的
df_all.loc[df_all.index.isin(['000001','000003','000508','000015','603157','300264'])]
| name | date_trade | open | close | close_yesterday | value_liquidity | value_total | equity_liquidity | equity_total | is_closed | |
|---|---|---|---|---|---|---|---|---|---|---|
| code | ||||||||||
| 000001 | 平安银行 | 2022-10-10 | 11.87 | 11.84 | 11.86 | 2.297617e+11 | 2.297661e+11 | 1.940555e+10 | 1.940592e+10 | False |
| 000003 | PT金田A | 2002-04-26 | 2.71 | 2.71 | NaN | 5.203493e+08 | 7.024233e+08 | 1.920108e+08 | 2.591968e+08 | True |
| 000015 | PT中浩A | 2001-06-08 | 6.85 | 6.85 | NaN | 1.798988e+08 | 9.273079e+08 | 2.626260e+07 | 1.353734e+08 | True |
| 000508 | 琼民源A | 1997-02-28 | 24.80 | 23.50 | NaN | 4.404452e+09 | 1.314976e+10 | 1.874235e+08 | 5.595642e+08 | True |
| 300264 | 佳创视讯 | 2022-10-10 | 4.98 | 4.90 | 4.98 | 1.746294e+09 | 2.024190e+09 | 3.563866e+08 | 4.131000e+08 | False |
| 603157 | 退市拉夏 | 2022-05-17 | 0.65 | 0.59 | NaN | 1.962358e+08 | 3.231263e+08 | 3.326030e+08 | 5.476716e+08 | True |
#函数:akshare新浪接口获得某一只股票的历史日线(行情)数据
#没有股本,没有市值
def get_daily_sina(code):
code_full = ''
if code.startswith('6'):
code_full = 'sh' + code
else:
code_full = 'sz' + code
df_day = ak.stock_zh_index_daily(symbol=code_full)
df_day['date'] = pd.to_datetime(df_day['date'])
df_day.set_index('date',inplace=True)
df_day = df_day.sort_index()
#path_file = './'+code_full+'.csv'
#df_day.to_csv(path_file)
return df_day
# 数据比对任务(有关复权的处理练习)
# 1、用当天的行情数据的,总市值,流通市值,最新价(收盘价),计算总股本,流通股本
# 2、获得有转增、送股的 分红信息
# 3、跟历史的行情数据中的对比(总市值,流通市值,收盘价,计算总股本,流通股本)
# 4、主要对比 股本变化前日,当日的历史数据,以及根据快照计算的数据,两者是否一致
# 5、股本变化后上市的价格是复权后的价格,所以这里的市值计算对比,一定要处理复权
#pandas中pd.NaT表示 not a time。
#如果要判断一个时间是不是pd.NaT可以使用pd.isna()、pd.notna()等方法。
#pandas中pd.NaN表示不是一个数字
# 去掉 有 nan 的索引项
#print(s1.dropna())
# 是否为空 pd.isnull notnull
# 是否为空字符串 == ''
# 是否为0 == 0
# 以一只股票为例
code = '000003'
df_fenhong = ak.stock_history_dividend_detail(symbol=code)
df_fenhong
| 公告日期 | 送股 | 转增 | 派息 | 进度 | 除权除息日 | 股权登记日 | 红股上市日 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1996-07-10 | 2 | 0 | 0 | 实施 | 1996-07-27 | 1996-07-26 | 1996-07-31 |
| 1 | 1995-07-28 | 2 | 0 | 1 | 实施 | 1995-08-21 | 1995-08-18 | NaT |
| 2 | 1994-05-06 | 4 | 0 | 1 | 实施 | 1994-05-20 | 1994-05-19 | NaT |
df_fenhong = df_fenhong[~((df_fenhong['送股'] == 0) & (df_fenhong['转增'] == 0))]
#df_fenhong[pd.notna(df_fenhong['除权除息日'])]
#df_fenhong[pd.isna(df_fenhong['红股上市日'])].index
# 本来,想将所有没有红股上市日的,都填写为 除权除息日,结果发现根本不行
# 以000003 的第一条为例
i = df_fenhong[pd.isna(df_fenhong['红股上市日'])].index
print(i)
df_fenhong.loc[i,'红股上市日'] = df_fenhong.loc[i,'除权除息日'] #稍作处理统一用红股上市日
df_fenhong = df_fenhong.sort_values(by=['红股上市日'],ascending=[False])
df_fenhong
Int64Index([1, 2], dtype='int64')
| 公告日期 | 送股 | 转增 | 派息 | 进度 | 除权除息日 | 股权登记日 | 红股上市日 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1996-07-10 | 2 | 0 | 0 | 实施 | 1996-07-27 | 1996-07-26 | 1996-07-31 |
| 1 | 1995-07-28 | 2 | 0 | 1 | 实施 | 1995-08-21 | 1995-08-18 | 1995-08-21 |
| 2 | 1994-05-06 | 4 | 0 | 1 | 实施 | 1994-05-20 | 1994-05-19 | 1994-05-20 |
d_compare_a = pd.DataFrame()
d_compare_a['日期'] = df_fenhong['红股上市日']
d_compare_a['股票代码'] = code
d_compare_a['股票名称'] = ''
d_compare_a['送股'] = df_fenhong['送股']
d_compare_a['转增'] = df_fenhong['转增']
d_compare_a['收盘价'] = 0
d_compare_a['前收盘价'] = 0
d_compare_a['流通市值'] = 0 #查,根据股票代码,时间,直接查
d_compare_a['总市值'] = 0
d_compare_a['流通股本'] = 0
d_compare_a['总股本'] = 0
#df = df.drop(labels='日期', axis=1)
#d_compare_a.index = pd.to_datetime(df_fenhong['红股上市日'])
d_compare_a.set_index('日期',inplace=True)
d_compare_a
| 股票代码 | 股票名称 | 送股 | 转增 | 收盘价 | 前收盘价 | 流通市值 | 总市值 | 流通股本 | 总股本 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 日期 | ||||||||||
| 1996-07-31 | 000003 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1995-08-21 | 000003 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1994-05-20 | 000003 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
code = '000003'
df_his = get_daily_163(code)
print(type(d_compare_a.index))
print(type(df_his.index))
print(d_compare_a.index)
print(df_his.index)
#DatetimeIndex 还是要必须的,要不后面处理数据好麻烦
<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Index([1996-07-31, 1995-08-21, 1994-05-20], dtype='object', name='日期')
DatetimeIndex(['1991-01-02', '1991-01-03', '1991-01-04', '1991-01-09',
'1991-01-10', '1991-01-11', '1991-01-14', '1991-01-15',
'1991-01-16', '1991-01-17',
...
'2002-04-15', '2002-04-16', '2002-04-17', '2002-04-18',
'2002-04-19', '2002-04-22', '2002-04-23', '2002-04-24',
'2002-04-25', '2002-04-26'],
dtype='datetime64[ns]', name='日期', length=2770, freq=None)
#查出来的一只股票在配股转增当日的情况
d_compare_a['股票名称'] = df_his.loc[d_compare_a.index,'name']
d_compare_a['收盘价'] = df_his.loc[d_compare_a.index,'close']
d_compare_a['前收盘价'] = df_his.loc[d_compare_a.index,'close_before']
d_compare_a['流通市值'] = df_his.loc[d_compare_a.index,'value_liquidity']
d_compare_a['总市值'] = df_his.loc[d_compare_a.index,'value_total']
d_compare_a['流通股本'] = df_his.loc[d_compare_a.index,'equity_liquidity']
d_compare_a['总股本'] = df_his.loc[d_compare_a.index,'equity_total']
d_compare_a
| 股票代码 | 股票名称 | 送股 | 转增 | 收盘价 | 前收盘价 | 流通市值 | 总市值 | 流通股本 | 总股本 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 日期 | ||||||||||
| 1996-07-31 | 000003 | 深金田A | 2 | 0 | 5.90 | 6.18 | 1.131883e+09 | 1.529261e+09 | 191844499.0 | 259196782.0 |
| 1995-08-21 | 000003 | 深金田A | 2 | 0 | 4.42 | 5.55 | 4.710848e+08 | 6.773085e+08 | 106580275.0 | 153237208.0 |
| 1994-05-20 | 000003 | 深金田A | 4 | 0 | 7.82 | 11.16 | 4.733339e+08 | 7.813098e+08 | 60528637.0 | 99911741.0 |
# 第0行 :根据df_all(快照)得到的股本数 跟日期最近的一条是否一致
# 总股本数一定是一致的,流通股本数有可能会不一致哦
# 其他行 : 根据上一行的送股、转增方案,计算得来
for j in range(len(d_compare_a)):
row = d_compare_a.iloc[j]
if j == 0:
print( (df_all.loc[row['股票代码'],'equity_total'] - row['总股本']) <= 0.000001)
#print( df_all.loc[row['股票代码'],'equity_liquidity'] -row['流通股本'] <= 0.000001)
else:
last = d_compare_a.iloc[j-1]
print(row['总股本'] - last['总股本']/(1+last['送股']/10+last['转增']/10) <= 0.000001)
True
True
True
#关于复权,这里看一个问题
其中的第一条,日期为 1996-07-31的数据如下:
日期: 1996-07-31
收盘价: 5.9
流通市值:1131882544.1
总市值: 1529261013.8 //经过计算得到下面两组数据
流动股本:192010804.0
总股本: 259196782.0
我们来计算一下1995-08-21 这天的数据
看看查出来的数据跟算出来的数据是否一样呢?
1996-07-31做了10 送2 在这之后,再就没查到分红配股信息了
我们认为这之后的股本总数是不变的,则分红前:
流动股本:192010804.0 / (1+送股/10+转增/10)= 192010804.0 / 1.2= 160009003.33333334
总股本:259196782.0/ (1+送股/10+转增/10)= 259196782.0/ 1.2= 215997318.33333337
计算 1995-08-21的 收盘价为 4.42 则
流动市值 = 4.42160009003.33333334 = 707239794.7333333
总市值 = 4.42215997318.33333337 = 954708147.0333335
整理如下:
日期: 1995-08-21
收盘价: 4.42
流通市值:707239794.73
总市值: 954708147.03
流动股本:160009003.33
总股本: 215997318.33
这个跟表中查询出来的结果相差也太多了(流通市值4.710848e+08,总市值:6.773085e+08)
更多推荐


所有评论(0)