Best Practice for Subscribing to 6k+ Symbols in Trades Stream

Hi,

I’m planning to subscribe to around 6,000 symbols through WebSocket, and I’d like to clarify the best approach. Specifically, I’m comparing two strategies:

  1. Server-side filtering: Subscribing by explicitly specifying symbols (e.g., "trades": ["AAPL","MSFT", …]).

  2. Client-side filtering: Subscribing with a wildcard (e.g., "trades": ["*"]) and filtering on my side (both subscribe messages are sketched below).
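
For reference, assuming the v2 market data stream (the one at stream.data.alpaca.markets/v2/...), the raw subscribe messages for the two approaches would look roughly like this:

  Server-side filtering:  {"action": "subscribe", "trades": ["AAPL", "MSFT"]}
  Client-side filtering:  {"action": "subscribe", "trades": ["*"]}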

My main question is:

  • Does providing a large symbol list (≈6k) during subscription introduce any additional processing overhead, delays, or throttling on your end compared to using the wildcard approach?

Thanks for your help!

I am no expert, but I advise you to go over this link, in which @Dan_Whitnable_Alpaca has clearly explained the limits of order/data API calls.

I am not sure whether even the wildcard subscription would work. Even subscribing to 6,000 symbols directly can easily put you over the data request limit. You may want to split it into multiple requests with smaller sets of symbols to ensure you stay within the limits.

Thanks


@guillem You asked about best practices for streaming trades. Generally, only subscribe to the trades you need. The single biggest issue with streaming data is whether the client (i.e., the algo) can process the trades fast enough. Processing trades only to filter them out is a waste of CPU time.

The biggest issues we see with websocket implementations are 1) the client disconnects but then ‘silently’ reconnects (and therefore misses data), 2) the client doesn’t keep up with the streamed data and simply processes messages more slowly than they are received, and 3) the client doesn’t properly allocate time for the required underlying ping-pong protocol.

Issue 1 “silently disconnecting”. This is a direct result of the websocket implementations in some of the SDKs, which automatically reconnect when a connection is lost. If one is implementing their own code, this shouldn’t be an issue: presumably the code knows when a connection is lost, so the reconnect won’t be ‘silent’.
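
For anyone rolling their own client, here is a minimal sketch of that idea using the third-party websockets package (the auth/subscribe handshake is simplified and handle_message is just a placeholder for your own logic):

import asyncio
import json
import websockets  # third-party 'websockets' package

URL = "wss://stream.data.alpaca.markets/v2/sip"

def handle_message(msg):
    # placeholder for your own parsing / filtering logic
    print(msg)

async def run_stream(api_key, secret_key, symbols):
    while True:
        try:
            async with websockets.connect(URL) as ws:
                await ws.send(json.dumps({"action": "auth", "key": api_key, "secret": secret_key}))
                await ws.send(json.dumps({"action": "subscribe", "trades": symbols}))
                async for raw in ws:
                    handle_message(json.loads(raw))
        except websockets.ConnectionClosed as exc:
            # log loudly before reconnecting so the gap in data is never 'silent'
            print(f"Connection lost ({exc}); reconnecting, expect a data gap")
            await asyncio.sleep(1)

# asyncio.run(run_stream('xxxx', 'xxxx', ['IBM', 'SPY']))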

Issue 2 “slow message processing”. If messages aren’t pulled from the websocket stream fast enough, they are buffered. The Alpaca servers buffer streamed messages but, if that buffer fills, the server will disconnect. The buffer is quite large, so many minutes of messages can get queued. Algos can get 30 minutes (or more) behind processing messages without realizing they are working with ‘stale’ data.
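
If you are writing your own client, one common pattern is to decouple receiving from processing with a queue and to watch the queue depth as an early warning that you are falling behind. A rough sketch (the threshold and names here are just for illustration, not from the SDK):

import asyncio

async def receiver(ws, queue):
    # keep the receive path cheap: just enqueue the raw messages
    async for raw in ws:
        queue.put_nowait(raw)
        if queue.qsize() > 10_000:
            # a growing backlog means processing is slower than the stream
            print(f"Warning: {queue.qsize()} messages waiting to be processed")

async def worker(queue):
    while True:
        raw = await queue.get()
        # parse / filter / run the real logic here
        queue.task_done()

# run both sides together, e.g.:
#   queue = asyncio.Queue()
#   await asyncio.gather(receiver(ws, queue), worker(queue))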

Issue 3 “no pong message”. The client is expected to send a ‘keep alive’ pong message to the server at least every 20 seconds (this is handled transparently by most websocket packages). The algo must ensure it doesn’t get so busy processing messages that it fails to leave time to send these pongs. If the server doesn’t receive them, it assumes the client went away and disconnects.
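
If the heavy part of your handler is blocking (database writes, heavy parsing, etc.), one option is to push that work onto an executor so the event loop stays free to service the ping-pong. A rough sketch; heavy_processing and the handler name are placeholders, not part of the SDK:

import asyncio

def heavy_processing(trade):
    # blocking work: database writes, heavy parsing, etc.
    pass

async def trade_data_handler(trade):
    loop = asyncio.get_running_loop()
    # run the blocking part off the event loop so pings/pongs still get serviced
    await loop.run_in_executor(None, heavy_processing, trade)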

One’s algo should always compare the timedelta between when a message is received and the timestamp of the message. For quotes this should typically be under 25ms. Trades should also be around 25ms, except that trades can be reported up to 10 seconds after execution, so there may be outliers with roughly 10-second timedeltas. If the algo ever sees many trades with timedeltas over 10 seconds, or if timedeltas are steadily increasing, it’s a sign the algo is processing messages too slowly.

Below is some sample Python code, using the alpaca-py SDK, implementing a timedelta check.

import pandas as pd

from alpaca.data.live import StockDataStream
from alpaca.data.enums import DataFeed

ALPACA_API_KEY = 'xxxx'
ALPACA_API_SECRET_KEY = 'xxxx'

stream = StockDataStream(ALPACA_API_KEY, ALPACA_API_SECRET_KEY, feed=DataFeed.SIP)
MY_SYMBOLS = ['IBM', 'SPY', 'NVDA']
MAX_TIME_DIFFERENCE = pd.Timedelta(seconds=1)

def timestamp_difference(timestamp):
    # how long ago (by our local clock) the trade was timestamped
    return pd.Timestamp.utcnow() - timestamp

# async function to handle stream data
async def trade_data_handler(data):
    # each trade arrives here as it is streamed
    time_delta = timestamp_difference(data.timestamp)

    if time_delta < MAX_TIME_DIFFERENCE:
        # do whatever your logic is here
        print(f"delta in ms: {time_delta.total_seconds() * 1000:.1f} ({time_delta})")

    else:
        # skip processing and log the skipped data
        # this should clear the buffer in most cases
        print(f"Algo not keeping up. Skipping data. {pd.Timestamp.utcnow()} {data}")

stream.subscribe_trades(trade_data_handler, *MY_SYMBOLS)
stream.run()

If you will be streaming ~6k symbols, I would recommend starting with just a few symbols. Get a baseline timedelta, then incrementally add more symbols and ensure your algo keeps up.
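
Continuing the example above, one way to ramp up is to subscribe to a slice of your universe and grow it between sessions once the timedeltas look healthy. Replacing the last two lines of the sample with something like this (the numbers are purely illustrative):

ALL_SYMBOLS = ['IBM', 'SPY', 'NVDA']  # replace with your full ~6k symbol list
RAMP_COUNT = 500                      # illustrative; grow this as the baseline holds up

stream.subscribe_trades(trade_data_handler, *ALL_SYMBOLS[:RAMP_COUNT])
stream.run()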

Another thing to remember is that not all trades are equal. You will typically want to filter out trades with any of these trade conditions: [‘B’, ‘C’, ‘G’, ‘H’, ‘I’, ‘M’, ‘N’, ‘P’, ‘Q’, ‘R’, ‘U’, ‘V’, ‘W’, ‘Z’, ‘4’, ‘7’, ‘9’].
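
A sketch of that filter, assuming the trade objects expose a conditions list (as alpaca-py’s Trade model does):

EXCLUDED_CONDITIONS = {'B', 'C', 'G', 'H', 'I', 'M', 'N', 'P', 'Q', 'R',
                       'U', 'V', 'W', 'Z', '4', '7', '9'}

def is_regular_trade(trade):
    # exclude any trade carrying one of the unwanted condition codes
    conditions = trade.conditions or []
    return not (set(conditions) & EXCLUDED_CONDITIONS)

# inside trade_data_handler:
#   if not is_regular_trade(data):
#       return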

Hope that helps.


@Dan_Whitnable_Alpaca Thanks a lot for your explanation! So far I’ve successfully tested subscribing to all symbols (trades only) and then filtering on my end. I measured my client-side end-to-end latency from message reception until fully processed (parsing, filtering, and other logic), and I’m seeing an average of ~150 microseconds with the 95th percentile around 500 microseconds, with no message loss or reconnections.

When I tested subscribing only to the ~6k symbols I actually need today, the end-to-end latency was slightly lower, but nothing significant. That’s why I was unsure which approach would be better, since I didn’t really have a clear metric to compare. With your point about checking the timedelta from the trade timestamp to reception, I think I’ll have a much better reference to benchmark against.

Just to confirm: using a wildcard subscription versus explicitly listing ~6k symbols shouldn’t increase that timedelta on Alpaca’s server side, correct?

@guillem Either using a wildcard or explicitly specifying symbols will stream at the same rate. There isn’t any significant overhead on the Alpaca servers.

If you would, post back with the timedeltas you see between the trade timestamps and your local client. I’m always interested in what type of latency end users experience. Also, I forgot to mention, ensure your local clock is synchronized to NIST time so it aligns with the trade timestamps. There’s info on how to do that here.
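
If you want a quick programmatic sanity check of the clock offset, something like this works (it uses the third-party ntplib package; the server name is just an example):

import ntplib  # pip install ntplib

def clock_offset_seconds(server='time.nist.gov'):
    # offset (in seconds) between the NTP server's clock and the local clock
    return ntplib.NTPClient().request(server, version=3).offset

print(f"Local clock offset: {clock_offset_seconds() * 1000:.1f} ms")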


@Dan_Whitnable_Alpaca I implemented the timedelta measurement you shared and ran it for about an hour (15:15 to 16:15 NY time) to get some stats, subscribing to the trades channel only with the wildcard. I built it in Go using a custom client and parser. Here are the results:

---------------------------------------------------------------------
MSGS: 1424494 | TRADES: 20896473 | MIN: 4.7ms MAX: 175.7ms AVG: 9.0ms
PERCENTILES: P25: 6.5ms P50: 7.2ms P75: 8.2ms P95: 14.8ms P99: 33.2ms
---------------------------------------------------------------------

Those numbers seem surprisingly good, with the 95th percentile well below the 25ms you mentioned, but I double-checked everything and it looks correct to me. I’m wondering if it might be due to the market hour or the small sample… I will run the same tests tomorrow at different times to see if the results vary, and will also benchmark in Python. The server sits ~1.5ms of network latency from Alpaca and the clock is synchronized with NIST time.

@guillem Which endpoint are you streaming from, wss://stream.data.alpaca.markets/v2/iex or wss://stream.data.alpaca.markets/v2/sip? If you are connecting to the /iex endpoint, those are only trades executed on the IEX exchange and represent only 2-3% of the full market. But if you are connecting to the SIP endpoint, those are very good numbers.

I’m using the SIP endpoint. I will monitor the numbers again tomorrow during different market conditions, but if they remain consistent, I’m quite happy with them…