Clarification about streaming trade data

hello,

Could someone clarify what “size” represents from streaming trade data? (see the image below)

Is it the number of shares that were just traded? Or something else? When I add them up for a 1 minute window it does NOT equal the volume data shown by either google or yahoo finance for that same time period. So what does “size” actually mean?

One of these, I think, depending on the stream:

  • Trade Size: The number of shares traded in a single transaction.
  • Quote Size: The number of shares available for sale (ask size) or purchase (bid size) at a given price level.
  • Message Size: The amount of data in a single message or data packet transmitted via the API.

I thought it was Trade Size also, but as I mentioned above, adding all the trade sizes in a 1 minute window does not equal the volume you get by calling:

api.get_bars_iter("NVDA", TimeFrame.Minute, "2024-06-03", "2024-06-03", adjustment='raw')

for that same 1 minute period. If it was the shares traded, then it should equal the volume for the same time. As you can see in the image above, the trades are all small, e.g. 15, 5 and a bunch of 1’s.

Why are they not equal? What am I missing here?

That's above me, but I've heard of phantom stock shares.

@Artic The size attribute is the quantity of shares in the trade. Summing them over a minute will equal the minute bar volume. There are a couple of caveats: 1) ensure you group and sum by the trade timestamps and not by when you receive them, and 2) three trade conditions (M, Q, and 9) are excluded from the bar volume calculation, so don't include trades with those conditions in your sum. Additionally, be aware that trades are sometimes 'corrected'. Those corrections are reflected in the bar calculations but typically not in the trades.

Below is some sample code to calculate volumes from trades vs bar volumes. In this particular case the volumes match. Because of the occasional updated trade, you may see small variances if you resample other time periods.


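import pandas as pd
from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest, StockTradesRequest
from alpaca.data.timeframe import TimeFrame

# Placeholder credentials; replace with your own API key and secret.
client = StockHistoricalDataClient('YOUR_API_KEY', 'YOUR_SECRET_KEY')

symbol = 'NVDA'
start = pd.to_datetime('2024-06-03 10:30:00').tz_localize('America/New_York')
end = pd.to_datetime('2024-06-03 11:00:00').tz_localize('America/New_York')

# Minute bars for the window, indexed by timestamp in market time.
bars = (client.get_stock_bars(StockBarsRequest(
                                  symbol_or_symbols = symbol,
                                  timeframe = TimeFrame.Minute,
                                  start = start,
                                  end = end))
                                  .df
                                  .tz_convert('America/New_York', level='timestamp')
                                  .reset_index('symbol'))

# Every individual trade in the same window.
trades = (client.get_stock_trades(StockTradesRequest(
                                  symbol_or_symbols = symbol,
                                  start = start,
                                  end = end))
                                  .df
                                  .tz_convert('America/New_York', level='timestamp')
                                  .reset_index('symbol'))

# Sum trade sizes per minute by trade timestamp ('1min' replaces the deprecated '1T' alias).
calculated_sizes = trades.resample('1min')['size'].sum()

# Index alignment pairs each per-minute sum with its bar.
bars['calculated_size'] = calculated_sizes
bars['calculated_diff'] = bars.calculated_size - bars.volume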

Here’s the result. Note the calculated_diff column is all 0.

Not sure why your calculations do not match other sources, but they should be identical to Alpaca unless they are not filtering the trades properly. In that case you would see lower volume on other sources.
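
The sample above sums every trade it receives. As a rough illustration of the two caveats (bucket by the trade's own timestamp, and skip trades with conditions M, Q, or 9), here is a minimal standard-library sketch. The dict layout with `timestamp`, `size`, and `conditions` keys and the sample trades are assumptions made up for illustration, not actual API output.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Condition codes the post above says are left out of bar volume.
EXCLUDED_CONDITIONS = {"M", "Q", "9"}

def minute_volumes(trades):
    """Sum trade sizes per minute, keyed by each trade's own timestamp."""
    volumes = defaultdict(float)
    for trade in trades:
        # Skip any trade carrying one of the excluded condition codes.
        if EXCLUDED_CONDITIONS & set(trade.get("conditions", [])):
            continue
        # Bucket by the trade timestamp, not by when the message arrived.
        minute = trade["timestamp"].replace(second=0, microsecond=0)
        volumes[minute] += trade["size"]
    return dict(volumes)

# Hypothetical trades (assumed shape, invented values).
sample_trades = [
    {"timestamp": datetime(2024, 6, 3, 14, 30, 5, tzinfo=timezone.utc),
     "size": 15, "conditions": ["@"]},
    {"timestamp": datetime(2024, 6, 3, 14, 30, 40, tzinfo=timezone.utc),
     "size": 5, "conditions": ["@", "I"]},
    {"timestamp": datetime(2024, 6, 3, 14, 30, 59, tzinfo=timezone.utc),
     "size": 100, "conditions": ["Q"]},  # excluded from the sum
    {"timestamp": datetime(2024, 6, 3, 14, 31, 2, tzinfo=timezone.utc),
     "size": 1, "conditions": []},
]

per_minute = minute_volumes(sample_trades)
for minute, volume in sorted(per_minute.items()):
    print(minute.isoformat(), volume)
```

With streamed trades, the same function could be fed the timestamp, size, and conditions of each trade message as it arrives.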

Thanks for your helpful response. However, I was referring to the streaming data. If you sum up the sizes from the streaming trade data for a 1 minute window, you will NOT get a volume equal to the actual volume reported by the bar data, even after accounting for timestamps. The volume from streaming the trades is significantly lower. I am using SIP data and the summed-up volume is nowhere near the real volume. It's almost as if huge chunks of data are missing. The same thing happens if you use:

trades = (client.get_stock_latest_trade(StockLatestTradeRequest(
                                  symbol_or_symbols = symbol)))

and add up the sizes. You will get much lower volumes as compared to bar data for the same 1 minute window.

Can you test on your end? Why are the volumes different and how do I access the missing data?

@Artic The streamed trades will be all trades. Could you provide an example of trades you get from your stream which you feel are incomplete? Regarding fetching trades with get_stock_latest_trade, that can definitely miss a lot of trades. It is simply a snapshot of the current latest trade. If one calls it again, even 20ms later, there is no guarantee that other trades have not occurred in between and therefore gone unreported. The best way to get trades is to use the get_stock_trades method (if using the alpaca-py SDK).
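
To see why polling a latest-trade snapshot undercounts volume, here is a small simulation in plain Python. It makes no API calls; the trade list, the 20 ms polling interval, and the helper `latest_trade_at` are all hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical trades as (milliseconds into the minute, size).
trades = [(0, 15), (3, 5), (7, 1), (12, 1), (18, 1), (40, 100), (41, 2)]

def latest_trade_at(now_ms):
    """Return the most recent trade at or before now_ms, as a snapshot would."""
    seen = [t for t in trades if t[0] <= now_ms]
    return seen[-1] if seen else None

# Poll the snapshot every 20 ms and count each distinct trade once.
polled = []
for now_ms in range(0, 60, 20):
    trade = latest_trade_at(now_ms)
    if trade is not None and trade not in polled:
        polled.append(trade)

polled_volume = sum(size for _, size in polled)
true_volume = sum(size for _, size in trades)
print(f"polled: {polled_volume}, all trades: {true_volume}")
```

Every trade that falls between two polls and is then superseded by a later one never shows up in the polled total.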

Thank you! I reran my code and indeed it does give all the trades. But it worked AFTER I removed the get_bar function call. I didn’t realize calling another function within the streaming async function would have such a detrimental hit on performance. Yikes! Now I’ll have to rework some of my logic but I can make some progress again. Thanks!
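
For anyone hitting the same slowdown: one possible way to keep a slow call out of the streaming path is to have the trade handler only enqueue each trade, and let a separate task do the slow work in a thread. This is a minimal standard-library sketch, not code from the thread. `slow_lookup`, `on_trade`, `worker`, and the fake feed are hypothetical stand-ins and are not part of the Alpaca SDK; `asyncio.to_thread` needs Python 3.9 or later.

```python
import asyncio
import time

def slow_lookup(symbol):
    """Hypothetical stand-in for a blocking call such as a bar request."""
    time.sleep(0.05)
    return f"bars for {symbol}"

async def on_trade(queue, trade):
    # The handler does no slow work; it only hands the trade off.
    queue.put_nowait(trade)

async def worker(queue, results):
    while True:
        trade = await queue.get()
        if trade is None:  # sentinel: no more trades
            break
        # Run the blocking call in a thread so the event loop stays free.
        bars = await asyncio.to_thread(slow_lookup, trade["symbol"])
        results.append((trade["size"], bars))

async def main():
    queue = asyncio.Queue()
    results = []
    task = asyncio.create_task(worker(queue, results))
    # Fake feed: five trades arriving back to back.
    for size in (15, 5, 1, 1, 1):
        await on_trade(queue, {"symbol": "NVDA", "size": size})
    await queue.put(None)
    await task
    return results

results = asyncio.run(main())
print(sum(size for size, _ in results))
```

The handler returns immediately for every trade, so none are held up while the slow lookups run in the background.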

@Artic Glad to hear you got it working!