Data download and processing
We'll start by downloading the ticker lists from Wikipedia. This uses the powerful pd.read_html
method we saw in Chapter 4, Long/Short Methodologies: Absolute and Relative:
web_df = pd.read_html(website)[0]
tickers_list = list(web_df['Symbol'])
tickers_list = tickers_list[:]
print('tickers_list',len(tickers_list))
web_df.head()
tickers_list can be truncated by filling in numbers in the bracket section of tickers_list[:].
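For instance, a minimal sketch of the truncation (the number 50 is purely illustrative, not part of the original code): to run the engine on only the first 50 constituents while testing, fill in the stop value of the slice:

tickers_list = tickers_list[:50]   # hypothetical cap: keep only the first 50 tickers
print('tickers_list', len(tickers_list))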
Now, this is where the action happens. There are a few nested loops in the engine room.
- Batch download: this is the high-level loop. OHLCV is downloaded into a multi-index dataframe in a succession of batches. The number of iterations is a function of the length of the tickers list and the batch size: 505 constituents divided by a batch size of 20 gives 26 batches (the last batch being 5 tickers long), as illustrated in the sketch after this list.
- Drop level loop: this breaks the multi-index dataframe into single ticker OHLCV dataframes. The number of iterations equals the batch size. Regimes are processed at this level.
- Absolute/relative process: There are 2 passes. The first pass processes data in the absolute series. Variables are reset to the relative series at the end and then processed accordingly in the second pass. There is an option to save the ticker information as a CSV file. The last row dictionary is created at the end of the second pass.
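As a quick, standalone sketch of the batch arithmetic (the 505 and 20 are just the S&P 500 example from above, not new inputs):

import math

num_tickers = 505                                    # e.g. S&P 500 constituents
batch_size = 20
num_batches = math.ceil(num_tickers / batch_size)    # 26 batches
last_batch = num_tickers - (num_batches - 1) * batch_size
print(num_batches, last_batch)                       # 26 batches, the last one 5 tickers long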
Next, let's go through the process step-by-step:
- Benchmark: download the closing price and apply the currency adjustment. This needs to be done only once, so it is placed at the beginning of the sequence.
- Instantiate the dataframes and lists.
- Loop size: the number of iterations necessary to loop over the tickers_list.
- Outer loop, batch download:
    - m, n: indexes along the batch_list.
    - batch_download: download using yfinance.
    - Print the batch tickers, with a Boolean if you want to see the ticker names.
    - Download the batch.
    - try/except: append to the failed list.
- Second loop, single-stock drop level loop (a sketch of a drop-level helper follows this list):
    - Drop level down to the ticker level.
    - Calculate swings and regime: abs/rel.
- Third loop, absolute/relative series:
    - Process regimes in the absolute series.
    - Reset variables to the relative series and process regimes a second time.
    - Boolean to provide a save_ticker_df option.
    - Create a dictionary with the last row values.
- Append the dictionary to the list of rows.
- Create a dataframe last_row_df from the list of dictionaries. The score column is a lateral sum of the regime methods in absolute and relative.
- Join last_row_df with web_df.
- Boolean save_regime_df option.
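Before the full listing, here is a minimal sketch of what a drop-level helper in the spirit of yf_droplevel might look like. It assumes the batch was downloaded with group_by='column', so the columns form a MultiIndex of (field, ticker); the helper used elsewhere in this book may differ in detail:

def yf_droplevel(batch_download, ticker):
    # keep only the columns that belong to this ticker (second level of the column MultiIndex)
    df = batch_download.iloc[:, batch_download.columns.get_level_values(1) == ticker]
    # drop the ticker level, leaving a plain single-ticker OHLCV dataframe
    df.columns = df.columns.droplevel(1)
    return df.dropna()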
Let's publish the code and give further explanations afterwards:
# Appendix: The Engine Room
bm_df = pd.DataFrame()
bm_df[bm_col] = round(yf.download(tickers=bm_ticker, start=start, end=end, interval="1d",
                                  group_by='column', auto_adjust=True, prepost=True,
                                  threads=True, proxy=None)['Close'], dgt)
bm_df[ccy_col] = 1
print('benchmark', bm_df.tail(1))

regime_df = pd.DataFrame()
last_row_df = pd.DataFrame()
last_row_list = []
failed = []

loop_size = int(len(tickers_list) // batch_size) + 2
for t in range(1, loop_size):  # outer loop: batch download
    m = (t - 1) * batch_size
    n = t * batch_size
    batch_list = tickers_list[m:n]
    if show_batch:
        print(batch_list, m, n)
    try:
        batch_download = round(yf.download(tickers=batch_list, start=start, end=end,
                                           interval="1d", group_by='column', auto_adjust=True,
                                           prepost=True, threads=True, proxy=None), dgt)
        for flat, ticker in enumerate(batch_list):  # second loop: drop level to single ticker
            df = yf_droplevel(batch_download, ticker)
            df = swings(df, rel=False)
            df = regime(df, lvl=3, rel=False)
            df = swings(df, rel=True)
            df = regime(df, lvl=3, rel=True)
            _o, _h, _l, _c = lower_upper_OHLC(df, relative=False)
            for a in range(2):  # third loop: absolute pass, then relative pass
                df['sma' + str(_c)[:1] + str(st) + str(lt)] = regime_sma(df, _c, st, lt)
                df['bo' + str(_h)[:1] + str(_l)[:1] + str(slow)] = regime_breakout(df, _h, _l, window)
                df['tt' + str(_h)[:1] + str(fast) + str(_l)[:1] + str(slow)] = turtle_trader(df, _h, _l, slow, fast)
                _o, _h, _l, _c = lower_upper_OHLC(df, relative=True)  # reset variables for the relative pass
            try:
                last_row_list.append(last_row_dictionary(df))
            except:
                failed.append(ticker)
    except:
        failed.append(ticker)

last_row_df = pd.DataFrame.from_dict(last_row_list)
if save_last_row_df:
    last_row_df.to_csv('last_row_df_' + str(last_row_df['date'].max()) + '.csv', date_format='%Y%m%d')
print('failed', failed)
last_row_df['score'] = last_row_df[regime_cols].sum(axis=1)
regime_df = web_df[web_df_cols].set_index('Symbol').join(
    last_row_df[last_row_df_cols].set_index('Symbol'), how='inner').sort_values(by='score')
if save_regime_df:
    regime_df.to_csv('regime_df_' + str(last_row_df['date'].max()) + '.csv', date_format='%Y%m%d')
last_row_list.append(last_row_dictionary(df)) happens at the end of the third loop, once every individual ticker has been fully processed. This list automatically updates for every ticker and every batch. Once the three loops are finished, we create the last_row_df dataframe from this list of dictionaries using pd.DataFrame.from_dict(last_row_list). Creating a list of dictionaries and rolling it up into a dataframe is marginally faster than appending rows directly to a dataframe. The score column is a lateral sum of all the regime methodologies. The last row dataframe is then sorted by score in ascending order, and there is an option to save a datestamped version. The regime dataframe is created by joining the Wikipedia web dataframe with the last row dataframe. Note that the Symbol column is set as the index. Again, there is an option to save a datestamped version.
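To make the list-of-dictionaries pattern concrete, here is a small self-contained sketch with made-up tickers, sectors, and regime columns (none of these names come from the engine room code):

import pandas as pd

rows = []
for ticker, bo, tt, sma in [('AAA', 1, 1, -1), ('BBB', -1, -1, -1)]:
    # one dictionary per ticker, appended as the loops run
    rows.append({'Symbol': ticker, 'bo': bo, 'tt': tt, 'sma': sma})

demo_last_row_df = pd.DataFrame.from_dict(rows)
demo_last_row_df['score'] = demo_last_row_df[['bo', 'tt', 'sma']].sum(axis=1)  # lateral sum

demo_web_df = pd.DataFrame({'Symbol': ['AAA', 'BBB'], 'GICS Sector': ['Tech', 'Energy']})
demo_regime_df = demo_web_df.set_index('Symbol').join(
    demo_last_row_df.set_index('Symbol'), how='inner').sort_values(by='score')
print(demo_regime_df)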
Next, let's visualize what the market is doing with a few heatmaps.