Invalid bars around 8am EST shuffle for TQQQ

I'm currently migrating from Polygon to Alpaca’s paid data offering, and I noticed some incorrect bars around the 8 AM EST spike that occurs most mornings.

For example, this morning at 8am EST, I get the following data from Alpaca. Note the difference in price, and especially volume/trade count. The volume/trade count is only around 10% of what’s being reported from Polygon or TradingView. Am I doing something wrong?

Code Sample

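import pandas as pd
from alpaca_trade_api.rest import REST, TimeFrame

# settings holds the API credentials (defined elsewhere)
api = REST(settings.ALPACA_API_KEY, settings.ALPACA_API_SECRET)

# Fetch raw (unadjusted) one-minute bars for the day
alpaca_prices = api.get_bars('TQQQ', TimeFrame.Minute, '2022-05-04', '2022-05-04', adjustment='raw').df
# Convert the index to Eastern time for easier reading
alpaca_prices.index = pd.to_datetime(alpaca_prices.index).tz_convert('US/Eastern')
alpaca_prices.loc['2022-05-04 07:59:00':]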

ALPACA
08:00:00 2022-05-04
Symbol TQQQ
Open 38.7012
High 38.7400
Low 38.6900
Close 38.7400
Volume 19,790.0000
Trades 110

POLYGON
08:00:00 2022-05-04
Symbol TQQQ
Open 38.70
High 38.88
Low 38.37
Close 38.74
Volume 238,236.0000
Trades 1082

TRADINGVIEW
08:00:00 2022-05-04
Symbol TQQQ
Open 38.70
High 38.88
Low 38.37
Close 38.74
Volume 225,201.0000
Trades not available

@CapitalMastery You could troubleshoot the discrepancy by checking the trades which occurred during those minutes. Here is the code to get trades from Alpaca for a given minute. You could do the same for Polygon. Not sure how TradingView would work.

# Get all the trades between a start and end time
# (api_data is assumed to be an alpaca_trade_api REST instance, like `api` above)
symbol = 'AMZN'
start_time = pd.to_datetime('2022-04-14 11:19:00').tz_localize('America/New_York')
end_time = pd.to_datetime('2022-04-14 11:20:00').tz_localize('America/New_York')

trades = api_data.get_trades(symbol,
                             start=start_time.isoformat(),
                             end=end_time.isoformat()).df

# Convert to market time for easier reading
trades = trades.tz_convert('America/New_York')

That may give you a hint as to what is going on.

Hey @Dan_Whitnable_Alpaca thanks for the great tip. I took your advice and dug in, and what I found is that Alpaca seems to be using the exchange timestamp for aggregates, whereas Polygon and TradingView appear to be using the SIP timestamp. Said another way, because Alpaca uses the exchange timestamp instead of the SIP timestamp, the trades are being reported in a different order and land in different minute bars. However, if you sum the volume from 7AM EST to 8AM EST, the total volume is almost the same between providers.
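To make the effect concrete, here is a minimal sketch with made-up trades. The timestamps and sizes are invented for illustration, and the column names `participant_ts` and `sip_ts` are my own labels, not fields taken from either vendor's API. It buckets the same trades into minutes twice, once per timestamp, and shows that the per-minute volume differs while the total over the window does not.

```python
import pandas as pd

# Hypothetical trades: each has an execution (participant) time and a later SIP time.
trades = pd.DataFrame({
    "participant_ts": pd.to_datetime([
        "2022-05-04 07:58:10", "2022-05-04 07:59:30",
        "2022-05-04 07:59:55", "2022-05-04 08:00:20",
    ]),
    "sip_ts": pd.to_datetime([
        "2022-05-04 08:00:01", "2022-05-04 08:00:02",
        "2022-05-04 08:00:03", "2022-05-04 08:00:21",
    ]),
    "size": [5000, 3000, 2000, 100],
})

def volume_by_minute(df, ts_col):
    """Sum trade size per minute, bucketing on the chosen timestamp column."""
    return df.groupby(df[ts_col].dt.floor("min"))["size"].sum()

by_participant = volume_by_minute(trades, "participant_ts")
by_sip = volume_by_minute(trades, "sip_ts")

print(by_participant)  # volume spread over 07:58, 07:59 and 08:00
print(by_sip)          # all volume lands in the 08:00 bar
print(by_participant.sum() == by_sip.sum())  # totals over the window agree
```

With this toy data the 08:00 bar holds 100 shares when bucketed on the participant time and 10,100 shares when bucketed on the SIP time, which is the same kind of gap as in the bars above.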

Here’s an example where Alpaca is aggregating a trade into the prior minute bar (7:59) because it’s using the exchange timestamp, but Polygon and TradingView are aggregating into the 8AM bar because they’re using the SIP timestamp.

Unfortunately this leads to a large discrepancy around 8AM EST; likely due to the huge influx of orders that flow into the market at this time, which I believe has something to do with many brokers coming online at this time. Either way, it’s a problem. Specifically for my strategies, this difference leads to trade signals that don’t occur on Polygon/TradingView, and thus I’m hesitant to just deal with it.

Why is Alpaca not using SIP timestamps like the other vendors? It’s not just Polygon and TradingView, but my other brokers as well, e.g. InteractiveBrokers and TradeZero.

TQQQ on 5/4/2022

Alpaca Trade ID: 271

Polygon Trade ID: 271

Is there any chance Alpaca will change to the SIP timestamp for reporting trades/aggregates? If not, is there any way to get the SIP timestamp from Alpaca? If so, I could resolve this issue on my end and aggregate my own bars.

Any help is much appreciated. I really feel Alpaca is so close to being the ideal one stop shop solution for quants, but with these data issues, I’m stuck having to glue together various providers. Not to mention the additional costs.

@CapitalMastery Thank you for diving into the timestamp issue. You are 100% correct that Alpaca aggregates on the ‘Participant Timestamp’. This is the time the execution venue reports as the time the trade was executed. During market hours, the ‘participant’ sends the trade details to a Trade Reporting Facility as soon as practical but no later than 10 seconds after the execution. However, trades executed before or after market hours are not subject to 10-second reporting. “Specifically, trades executed between midnight and 8:00 am must be reported by 8:15 am Eastern Time on trade date. Trades executed between the close of the Facility (6:30 pm for the ADF and 8:00 pm for the TRFs and the ORF) and midnight must be reported on an “as/of” basis by 8:15 am Eastern Time the following business day” (see FINRA Trade Reporting FAQ). Those after-hours trades can get ‘bunched up’ and reported around 8:00 am even though they executed some time before.

Each ‘handoff’ technically has a timestamp. The Trade Reporting Facility is effectively the Securities Information Processor (SIP), which then forwards the data to its subscribers (such as Alpaca), who in turn forward the data to their clients (such as our data subscribers). All the timestamps after the initial ‘participant timestamp’ are rather arbitrary. The initial venue may delay reporting or buffer its trades, or there may have been some delay or error in transmission. Those timestamps are almost random times between the initial ‘participant timestamp’ and when the final end user gets the data.

It could be debated which is the ‘appropriate’ timestamp to aggregate on. Alpaca feels the only ‘absolute’ timestamp is the participant timestamp which is when the trade was executed and therefore aggregates on those times. The SIP timestamp is simply a random amount of time (potentially up to 10 seconds during market hours or much longer after hours) after the trade was executed and when it was reported. The reason I say it could be debated falls back to the question “If a tree falls in a forest and no one is around to hear it, does it make a sound?” If a trade occurs but nobody is aware of it, does it affect anything? I’ll leave that to the philosophers.
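As a rough illustration of that gap, the sketch below computes the reporting lag per trade as SIP time minus participant time and flags anything slower than 10 seconds. The data is invented, and the column names `participant_ts` and `sip_ts` are hypothetical labels for this example only, not fields from the Alpaca API.

```python
import pandas as pd

# Hypothetical trades: two pre-market executions and one regular-hours execution.
trades = pd.DataFrame({
    "participant_ts": pd.to_datetime([
        "2022-05-04 07:30:00", "2022-05-04 07:59:58", "2022-05-04 09:31:00",
    ]),
    "sip_ts": pd.to_datetime([
        "2022-05-04 08:00:02", "2022-05-04 08:00:03", "2022-05-04 09:31:01",
    ]),
})

# Reporting lag: how long after execution the trade showed up on the tape.
trades["report_lag"] = trades["sip_ts"] - trades["participant_ts"]

# Trades reported more than 10 seconds after they executed.
late = trades[trades["report_lag"] > pd.Timedelta(seconds=10)]

print(trades[["participant_ts", "report_lag"]])
print(len(late))
```

Only the 07:30 execution is flagged here. The 07:59:58 trade has a lag of just 5 seconds, yet it still crosses the minute boundary, so even a small lag can move a trade into a different bar.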

There are a lot of subtleties like this in how data is captured and aggregated that traders should be aware of. Market data, like most ‘big data’, isn’t as black and white as one would wish it to be.


@Dan_Whitnable_Alpaca this response is incredible. Thank you so much for taking the time to break it down. The clarity of your thinking and understanding on this topic gives me comfort regarding the quality of Alpaca’s data feed. I had a feeling that something like this was happening.

Here’s my concern/issue… All my models have been developed using data from vendors like Polygon, TradingView, Interactive Brokers, etc. So you could say that they’ve been optimized to the shape of those vendors' data handling practices. Fortunately they all seem to follow a similar method, so there’s an illusion of consistency there. To that end, I can certainly see this as a use-case-specific problem: the signal my models seem to pick up on comes from all the trades since 8PM being aggregated and dumped on the market at 8AM EST.

I guess I only have two additional questions…

  1. Is there any chance that Alpaca could provide the SIP timestamp?
  2. If Alpaca would provide the SIP timestamp, would those timestamps align with the other vendors? I assume you guys are largely sourcing your data from the same vendors, but maybe not?

Either way, I think if Alpaca could provide the SIP timestamps, it would provide additional flexibility to us traders. In situations like mine, we could help ourselves. We could easily store all the data in TimescaleDB or similar and let it generate bars on the SIP timestamp vs the Participant Timestamp.
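For what it's worth, here is a minimal sketch of what that self-service aggregation could look like in pandas rather than TimescaleDB. It assumes a trade table with a per-trade SIP timestamp, which Alpaca may or may not expose; the column names (`participant_ts`, `sip_ts`, `price`, `size`), the `build_bars` helper, and the sample trades are all hypothetical.

```python
import pandas as pd

def build_bars(trades, ts_col, freq="1min"):
    """Aggregate trades into OHLCV bars keyed on the chosen timestamp column."""
    # Stable sort so trades sharing a timestamp keep their original order.
    t = trades.sort_values(ts_col, kind="mergesort").set_index(ts_col)
    bars = t["price"].resample(freq).ohlc()
    bars["volume"] = t["size"].resample(freq).sum()
    bars["trade_count"] = t["price"].resample(freq).count()
    # Drop minutes in which no trades occurred.
    return bars.dropna(subset=["open"])

# Hypothetical trades around the 8 AM boundary.
trades = pd.DataFrame({
    "participant_ts": pd.to_datetime([
        "2022-05-04 07:59:55", "2022-05-04 08:00:05", "2022-05-04 08:00:30",
    ]),
    "sip_ts": pd.to_datetime([
        "2022-05-04 08:00:01", "2022-05-04 08:00:06", "2022-05-04 08:00:31",
    ]),
    "price": [38.70, 38.74, 38.72],
    "size": [200, 100, 50],
})

participant_bars = build_bars(trades, "participant_ts")
sip_bars = build_bars(trades, "sip_ts")

print(participant_bars)  # two bars: 07:59 and 08:00
print(sip_bars)          # one bar: 08:00
```

The same function serves both conventions, so switching between them is just a matter of which timestamp column gets passed in. A real implementation would probably also need to handle trade conditions and corrections, which this sketch ignores.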

I can appreciate that this may not be as simple as adding an additional field to the API nor may it even align with Alpaca’s objectives, but any insight you can share on the potential of the SIP timestamp being provided in Alpaca’s data feed would be much appreciated.

Thanks again for your excellent help!