`get_bars_async` failing to connect

I am trying to download one year of history for the entire universe of stocks available at Alpaca. I filter the list of assets to contain only those that are 'active', ignoring any that trade on BATS or OTC. This results in 9,462 stocks.

When I attempt to use the `get_bars_async` function I get 3,879 "Cannot connect to host data.alpaca.markets:443" errors. My code is a trimmed down version of this example: alpaca-trade-api-python/historic_async.py at master · alpacahq/alpaca-trade-api-python · GitHub. When I loop over the tickers using the vanilla `get_bars` I don't encounter any issues.

We've had a similar problem with the .NET SDK, but after switching to HTTP/2 transport the problem went away. I'm not sure about the Python SDK, but maybe this information will help.

Hi @pablo.mitchell, could you add the code you are trying to run here?
I will try to help you debug your problem.
I wrote the historic_async module.

import asyncio
import os
import sys
import time

import pandas as pd

import alpaca_trade_api as tradeapi
from alpaca_trade_api.rest import TimeFrame
from alpaca_trade_api.rest_async import gather_with_concurrency, AsyncRest

NY = 'America/New_York'

async def get_historic_bars(
        symbols,
        start,
        end,
        timeframe: TimeFrame,
):
    major = sys.version_info.major
    minor = sys.version_info.minor

    if major < 3 or (major == 3 and minor < 6):
        raise Exception('asyncio is not supported by your python version')

    print('Getting bars:')
    print(f'\t n_symbols={len(symbols)}')
    print(f'\t timeframe={timeframe}')
    print(f'\t start={start}')
    print(f'\t end={end}')

    tasks = []

    for symbol in symbols:
        args = [symbol, start, end, timeframe]
        tasks.append(rest.get_bars_async(*args))

    if (major, minor) >= (3, 8):
        results = await asyncio.gather(*tasks, return_exceptions=True)
    else:
        results = await gather_with_concurrency(500, *tasks)

    n_errors = 0
    n_bad_requests = 0

    for response in results:
        if isinstance(response, Exception):
            n_errors += 1
            print(f"Got an error: {response}")
        elif not len(response[1]):
            n_bad_requests += 1
            print(f'bad response: {response}')
        else:
            # print(response)
            pass

    print('Showing results:')
    print(f'\t n_bars={len(results)}')
    print(f'\t n_errors={n_errors}')
    print(f'\t n_bad_requests={n_bad_requests}')

async def main(symbols):
    start = pd.Timestamp('2020-09-29', tz=NY).date().isoformat()
    end = pd.Timestamp('2021-09-29', tz=NY).date().isoformat()
    timeframe: TimeFrame = TimeFrame.Day
    await get_historic_bars(symbols, start, end, timeframe)

if __name__ == '__main__':
    key_id = os.environ.get('APCA_API_KEY_ID')
    secret_key = os.environ.get('APCA_API_SECRET_KEY')
    base_url = os.environ.get('APCA_API_BASE_URL')

    feed = "sip"  # ???

    rest = AsyncRest(key_id=key_id, secret_key=secret_key)
    api = tradeapi.REST(key_id=key_id, secret_key=secret_key, base_url=base_url)

    symbols = [
        asset.symbol for asset in api.list_assets(status='active') if
        asset.exchange not in ('BATS', 'OTC') and
        asset.tradable
    ]
    # symbols = symbols[-10:]

    start_time = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(symbols))
    print(f"took {time.time() - start_time:.0f} sec")

Hi @pablo.mitchell
I ran your example code and these are the results I get:

For 10 symbols, 1 year daily data

Getting bars:
	 n_symbols=10
	 timeframe=1Day
	 start=2020-09-29
	 end=2021-09-29
bad response: ('ZWS', Empty DataFrame
Columns: []
Index: [])
Showing results:
	 n_bars=10
	 n_errors=0
	 n_bad_requests=1
took 1 sec

For 100 symbols, 1 year daily data

Getting bars:
	 n_symbols=100
	 timeframe=1Day
	 start=2020-09-29
	 end=2021-09-29
bad response: ('WOLF', Empty DataFrame
Columns: []
Index: [])
...
bad response: ('ZWS', Empty DataFrame
Columns: []
Index: [])
Showing results:
	 n_bars=100
	 n_errors=0
	 n_bad_requests=4
took 1 sec

Process finished with exit code 0

For 1000 symbols, 1 year daily data

Getting bars:
	 n_symbols=1000
	 timeframe=1Day
	 start=2020-09-29
	 end=2021-09-29
bad response: ('RRX', Empty DataFrame
Columns: []
Index: [])
...
bad response: ('ZWS', Empty DataFrame
Columns: []
Index: [])
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=14
took 5 sec

Process finished with exit code 0

For all (9,516) symbols, 1 year daily data
Getting bars:
	 n_symbols=9516
	 timeframe=1Day
	 start=2020-09-29
	 end=2021-09-29
...
Showing results:
	 n_bars=9516
	 n_errors=530
	 n_bad_requests=127
took 303 sec

Process finished with exit code 0

I do see the errors you refer to when trying to get data for ~9,000 stocks, but the errors occur for roughly 600 out of the 9,000 stocks.

An important thing to note is the time it took for 9,000 stocks: 5 minutes!
You could never get those results from the plain REST module (it would take hours to achieve that).

What can you do?
Every API has limitations and doesn't allow a user to pull endless data at once. BUT, we can work with that.

Split your requests into segments of 1,000 stocks each.
When I did that I got this result:

Getting bars:
	 n_symbols=9516
	 timeframe=1Day
	 start=2020-09-29
	 end=2021-09-29
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=7
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=12
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=16
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=14
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=20
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=19
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=15
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=7
Showing results:
	 n_bars=1000
	 n_errors=0
	 n_bad_requests=10
Showing results:
	 n_bars=516
	 n_errors=0
	 n_bad_requests=11
took 49 sec

Process finished with exit code 0

Roughly 130 stocks with bad data out of ~9,000. Not bad. AND it only took 50 seconds!

I will add the modified code in the next comment

Here’s the modified code that achieves these results:

I hope that helps :slight_smile:

async def get_historic_bars(
        symbols,
        start,
        end,
        timeframe: TimeFrame,
):
    major = sys.version_info.major
    minor = sys.version_info.minor

    if major < 3 or (major == 3 and minor < 6):
        raise Exception('asyncio is not supported by your python version')

    print('Getting bars:')
    print(f'\t n_symbols={len(symbols)}')
    print(f'\t timeframe={timeframe}')
    print(f'\t start={start}')
    print(f'\t end={end}')

    step_size = 1000

    for i in range(0, len(symbols), step_size):
        n_bad_requests = 0
        n_errors = 0
        tasks = []
        for symbol in symbols[i:i+step_size]:
            args = [symbol, start, end, timeframe]
            tasks.append(rest.get_bars_async(*args))

        if (major, minor) >= (3, 8):
            results = await asyncio.gather(*tasks, return_exceptions=True)
        else:
            results = await gather_with_concurrency(500, *tasks)



        for response in results:
            if isinstance(response, Exception):
                n_errors += 1
                # print(f"Got an error: {response}")
            elif not len(response[1]):
                n_bad_requests += 1
                # print(f'bad response: {response}')
            else:
                # print(response)
                pass

        print('Showing results:')
        print(f'\t n_bars={len(results)}')
        print(f'\t n_errors={n_errors}')
        print(f'\t n_bad_requests={n_bad_requests}')

Thank you for taking the time to confirm the bug.

Pardon me if I disagree with the approach you took to remedying it. I don't think your hack is the solution; diagnosing the bug and fixing it should be the solution. Also, the existing serial version of `get_bars` does not exhibit this issue. Moreover, if I wrap `get_bars` in a `ThreadPoolExecutor` I achieve comparable download speeds. So for the time being I'm going to stick with `get_bars`.
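For context, the thread-pool fallback I mention can be sketched roughly like this. This is not my exact code, just a minimal illustration assuming the same `api = tradeapi.REST(...)` client from the earlier snippet; the `max_workers` value is a guess to tune:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_bars(api, symbols, start, end, timeframe, max_workers=50):
    """Download bars for many symbols by running the blocking
    get_bars call in a thread pool. Returns a dict mapping each
    symbol to its DataFrame, or to the exception that was raised."""

    def fetch_one(symbol):
        try:
            return symbol, api.get_bars(symbol, timeframe, start, end).df
        except Exception as exc:
            return symbol, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_one, symbols))
```

Because each `get_bars` call blocks on I/O, a modest pool of threads keeps many requests in flight while naturally capping concurrency at `max_workers`.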

Yes, of course, choose your preferred method; both are valid.
If the thread approach works for you, great!
How long does it take you to get the same result? Would you mind adding a code snippet?

Just to point something out: this is not a hack, nor a bug.
No server in the world will let you open 9,000 simultaneous connections from the same IP address, and that is exactly what the unsegmented code does: fire all X requests at the server at once (in this case X=9,000).
By splitting it into segments of 1,000 you get all results in less than a minute.
Having said that, you are of course free to select your preferred approach.

I disagree again. If the function fails as implemented, then it is a bug. You should build throttling into it if that's the issue, not expect consumers of your code to do it.