Which combination of fields guarantees uniqueness in historical trade records?

Hi!

I’m building a historical trade database using Alpaca’s API and need to understand the uniqueness guarantees so I can define a proper primary key.

Question: Which combination of fields guarantees uniqueness in historical trade records?

My understanding:

  • Trade ID is unique per exchange and day
  • However, I’ve observed that exchange ‘D’ (TRF/FINRA) has repeated trade_id values for the same date

What I need to know: What’s the minimal field combination that guarantees uniqueness?

  • (symbol, date, exchange, trade_id) ?
  • (symbol, date, exchange, trade_id, timestamp) ?
  • Or do I need additional fields like price/size/updates?
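
For reference, this is roughly how I’m testing each candidate key with pandas. It’s just a sketch: df_trades is an assumed dataframe with one row per trade and the candidate key fields as columns.

import pandas as pd

# Sketch: assumes a dataframe df_trades with one row per trade
# and the candidate key fields as columns
def is_unique_key(df, fields):
    # True if no two rows share the same values across the given fields
    return not df.duplicated(subset=fields).any()

candidate_keys = [
    ['symbol', 'date', 'exchange', 'trade_id'],
    ['symbol', 'date', 'exchange', 'trade_id', 'timestamp'],
]
for key in candidate_keys:
    print(key, 'unique:', is_unique_key(df_trades, key))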

Thanks for any clarification on the official uniqueness constraints!

@guillem The (symbol, date, exchange, trade_id) combination should guarantee unique trades. If you are seeing duplicates, please provide an example and how you are fetching the trades (i.e. the REST call used or the streaming endpoint). I can investigate.

@Dan_Whitnable_Alpaca Here is an example for ticker AMZN on 2024-04-04. I used that date because I had data from another provider to compare against, but the issue is consistent every day across all tickers…

This is data from Alpaca. As you can see, there are 12k duplicates for ticker AMZN on that single date alone, fetched from the https://data.alpaca.markets/v2/stocks/trades endpoint:

This is data from another provider for the same day; as you can see, it has a field for the TRF. It looks like exchange ‘D’ covers several different TRFs, so with this particular exchange the unique combination should be (symbol, date, exchange, trf, trade_id). But the TRF is missing from Alpaca’s data…
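
To illustrate the pattern (again a sketch, reusing a df_trades dataframe like the one above, with the trade id in a trade_id column): if the TRF theory is right, every duplicated row should land on exchange ‘D’.

# Sketch: if the duplicates come from TRF aggregation, they should all be on 'D'
dupes = df_trades[df_trades.duplicated(
    subset=['symbol', 'date', 'exchange', 'trade_id'], keep=False)]
print(dupes['exchange'].value_counts())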

Thanks!

@guillem First, you are correct: the “D” exchange (FINRA ADF) actually aggregates trades from several Trade Reporting Facilities (TRFs). Alpaca doesn’t break out the TRF; the trade_ids are simply offset with a number for each.

I can’t replicate your issue. I fetched all AMZN trades for 2024-04-04 and found no duplicates. How are you fetching trades? Below is my entire code.

import pandas as pd
import requests

# Alpaca API key and Secret key
API_KEY = "xxxx"
SECRET_KEY = "xxxx"


def get_trades(symbols, start, end):

  BASE_URL = 'https://data.alpaca.markets'
  ENDPOINT = '/v2/stocks/trades'
  HEADERS = {'APCA-API-KEY-ID':API_KEY,
          'APCA-API-SECRET-KEY':SECRET_KEY,
          'Content-Type': 'application/json'
          }

  params = {'symbols':symbols,
            'start':start.isoformat(),
            'end':end.isoformat(),
            'feed':'sip',
            'limit':10000,
            'page_token':None,
            }

  # initialize things
  trades = {}
  first_pass = True

  while first_pass or page_token is not None:
    first_pass = False
    response = requests.get(BASE_URL+ENDPOINT, headers=HEADERS, params=params)
    
    # merge this page's trades into the running dict
    # (note: dict | keeps the right-hand value for duplicate keys)
    json_data = response.json()['trades']
    trades = trades | json_data
    
    # check for a next_page_token
    page_token = response.json()['next_page_token']
    params['page_token'] = page_token

  return trades


def convert_json_trades_to_df(json_data, time_zone='America/New_York'):
  # map columns
  column_names = {'symbol':'symbol',
            't':'timestamp',
            'p':'price',
            's':'size',
            'x':'exchange',
            'z':'tape',
            'i':'index',
            'c':'conditions',
              }
  # add symbol to each json record and convert to dataframe
  records = [dict(record, symbol=stock) for stock, recordset in json_data.items() for record in recordset]
  df = pd.DataFrame.from_records(records, columns=column_names.keys())

  # rename columns, set timezone, set index
  df.rename(columns=column_names, inplace=True)
  df['timestamp'] = pd.to_datetime(df.timestamp).dt.tz_convert(time_zone)
  df.set_index(['symbol', 'timestamp'], inplace=True)

  return df

# get trades and convert to dataframe
trades = get_trades(symbols='AMZN', 
           start=pd.to_datetime('2024-04-04 00:00').tz_localize('America/New_York'),
           end=pd.to_datetime('2024-04-04 23:00').tz_localize('America/New_York'),
           )

trades_df = convert_json_trades_to_df(trades)

# find duplicates
duplicate_index = trades_df.reset_index().duplicated(subset=['symbol', 'exchange', 'index'])
duplicates = trades_df.reset_index().loc[duplicate_index]
display(duplicates)

The result for duplicates is an empty dataframe.

What timezone are you specifying? Are you by chance fetching the evening of one day and the morning of the next (because of the timezone)?
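
For example (a quick sketch), a New York calendar day straddles two UTC dates, so date strings interpreted as UTC can pull in the prior evening or the following morning:

import pandas as pd

# 2024-04-04 in New York is UTC-4 (EDT), so the NY day runs from
# 04:00 UTC on the 4th to 04:00 UTC on the 5th
start_ny = pd.Timestamp('2024-04-04 00:00', tz='America/New_York')
end_ny = pd.Timestamp('2024-04-04 23:59:59', tz='America/New_York')
print(start_ny.tz_convert('UTC'))  # 2024-04-04 04:00:00+00:00
print(end_ny.tz_convert('UTC'))    # 2024-04-05 03:59:59+00:00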

Hi @Dan_Whitnable_Alpaca Thanks for sharing your code. I can see why you could not replicate it. On my side I’m not using the timestamp to detect duplicates, I’m using the date (correctly extracted from the timestamp using NY time), as we discussed earlier. I also ran your code, and it doesn’t seem to fetch all trades for the day: the line trades = trades | json_data replaces the symbol’s trade list on every page (dict | keeps the right-hand value for duplicate keys), so only the last page survives. I’m sharing this code, which should reproduce the exact issue I referred to:

import pandas as pd
import requests

API_KEY = "***"
SECRET_KEY = "***"

def get_trades_df(symbol, start, end, api_key=API_KEY, secret_key=SECRET_KEY):

    BASE_URL = "https://data.alpaca.markets"
    ENDPOINT = "/v2/stocks/trades"
    headers = {
        "APCA-API-KEY-ID": api_key,
        "APCA-API-SECRET-KEY": secret_key
    }

    trades = []
    page_token = None

    while True:
        params = {
            "symbols": symbol,
            "start": start,
            "end": end,
            "limit": 10000,
            "page_token": page_token
        }

        resp = requests.get(BASE_URL + ENDPOINT, headers=headers, params=params)
        resp.raise_for_status()
        data = resp.json()

        trades.extend(data.get("trades", {}).get(symbol, []))

        page_token = data.get("next_page_token")
        if not page_token:
            break

    if not trades:
        return pd.DataFrame()

    df = pd.DataFrame(trades)

    column_names = {
        "t": "timestamp",
        "p": "price",
        "s": "size",
        "x": "exchange",
        "z": "tape",
        "i": "index",
        "c": "conditions",
        "u": "updates"
    }
    
    df.rename(columns=column_names, inplace=True)

    # Not every trade carries the optional 'u' (updates) field, so make sure
    # the column exists before it is selected below
    if "updates" not in df.columns:
        df["updates"] = None
    
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["symbol"] = symbol

    # Convert temporarily to New York timezone to correctly extract the date
    df["timestamp_ny"] = df["timestamp"].dt.tz_convert("America/New_York")
    # Extract date based on NY timezone
    df["date"] = df["timestamp_ny"].dt.date

    return df[["symbol", "date", "timestamp", "conditions", "updates", "price", "size", "exchange", "tape", "index"]]

df_trades = get_trades_df("AMZN", "2024-04-04", "2024-04-04")

# Find duplicates by 'symbol', 'date', 'exchange' and 'index'
duplicates = df_trades[df_trades.duplicated(subset=["symbol", "date", "exchange", "index"], keep=False)]

# Count how many duplicates exist
print("Total duplicates found:", len(duplicates))
display(duplicates.sort_values(by='index'))

Here is my output:

Thanks!

Another potential issue that concerns me a bit regarding historical trade data is that, IMHO, Alpaca’s feed creates hindsight bias. The historical endpoint tags trades as ‘corrected’ or ‘canceled’, which is nice, but in practice I’ve observed these corrections often arrive minutes after the original trade was printed. So, during live trading over the websocket, the algorithm would process the bad trade and make decisions based on it, only learning about the correction much later, whereas in backtesting it would see a cleaned feed… Of course I can simply backtest while ignoring trade updates, but it would be ideal to know when each update was actually delivered.
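
What I’m considering as a workaround is sketched below: stamp every live message with its local receipt time, so a backtest can replay corrections with the same delay the live algorithm saw. (The field names follow the shape of Alpaca’s v2 stream trade messages but should be verified; on_trade and store_record are hypothetical names, not part of any API.)

import datetime as dt

def on_trade(trade):
    # Hypothetical live-stream handler: persist both the SIP event timestamp
    # and the local wall-clock receipt time, so corrections/cancels can be
    # applied in a backtest only after their recorded receipt time
    record = {
        'symbol': trade['S'],
        'trade_id': trade['i'],
        'exchange': trade['x'],
        'event_ts': trade['t'],                           # SIP timestamp
        'received_ts': dt.datetime.now(dt.timezone.utc),  # local receipt time
    }
    store_record(record)  # hypothetical persistence function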