Which combination of fields guarantees uniqueness in historical trade records?

Hi!

I’m building a historical trade database using Alpaca’s API and need to understand the uniqueness guarantees so I can define a proper primary key.

Question: Which combination of fields guarantees uniqueness in historical trade records?

My understanding:

  • Trade ID is unique per exchange and day
  • However, I’ve observed that exchange ‘D’ (TRF/FINRA) has repeated trade_id values for the same date

What I need to know: What’s the minimal field combination that guarantees uniqueness?

  • (symbol, date, exchange, trade_id) ?
  • (symbol, date, exchange, trade_id, timestamp) ?
  • Or do I need additional fields like price/size/updates?
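
For reference, this is roughly how I’m testing each candidate key with pandas. It’s just a sketch: df_trades is an assumed dataframe with one row per trade and the candidate key fields as columns.

import pandas as pd

# Sketch: assumes a dataframe df_trades with one row per trade
# and the candidate key fields as columns
def is_unique_key(df, fields):
    # True if no two rows share the same values across the given fields
    return not df.duplicated(subset=fields).any()

candidate_keys = [
    ['symbol', 'date', 'exchange', 'trade_id'],
    ['symbol', 'date', 'exchange', 'trade_id', 'timestamp'],
]
for key in candidate_keys:
    print(key, 'unique:', is_unique_key(df_trades, key))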

Thanks for any clarification on the official uniqueness constraints!

@guillem The (symbol, date, exchange, trade_id) combination should guarantee unique trades. If you are seeing duplicates, please provide an example and how you are fetching the trades (i.e. the REST call used or the streaming endpoint). I can investigate.

@Dan_Whitnable_Alpaca Here is an example for ticker AMZN on 2024-04-04. I used that date because I had data from another provider to compare against, but the issue is consistent every day across all tickers…

This is data from Alpaca. As you can see, there are 12k duplicates for ticker AMZN on that single date alone, fetched from the https://data.alpaca.markets/v2/stocks/trades endpoint:

This is data from another provider for the same day; as you can see, it has a field for the TRF. It looks like exchange ‘D’ covers several different TRFs, so with this particular exchange the unique combination should be (symbol, date, exchange, trf, trade_id). But the TRF is missing from Alpaca’s data…
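
To illustrate the pattern (again a sketch, reusing a df_trades dataframe like the one above, with the trade id in a trade_id column): if the TRF theory is right, every duplicated row should land on exchange ‘D’.

# Sketch: if the duplicates come from TRF aggregation, they should all be on 'D'
dupes = df_trades[df_trades.duplicated(
    subset=['symbol', 'date', 'exchange', 'trade_id'], keep=False)]
print(dupes['exchange'].value_counts())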

Thanks!

@guillem First, you are correct: the “D” exchange (FINRA ADF) actually aggregates trades from several Trade Reporting Facilities (TRFs). Alpaca doesn’t break out the TRF; the trade_ids are simply offset with a number for each.

I can’t replicate your issue. I fetched all AMZN trades for 2024-04-04 and found no duplicates. How are you fetching trades? Below is my entire code.

import pandas as pd
import requests

# Alpaca API key and Secret key
API_KEY = "xxxx"
SECRET_KEY = "xxxx"


def get_trades(symbols, start, end):

  BASE_URL = 'https://data.alpaca.markets'
  ENDPOINT = '/v2/stocks/trades'
  HEADERS = {'APCA-API-KEY-ID':API_KEY,
          'APCA-API-SECRET-KEY':SECRET_KEY,
          'Content-Type': 'application/json'
          }

  params = {'symbols':symbols,
            'start':start.isoformat(),
            'end':end.isoformat(),
            'feed':'sip',
            'limit':10000,
            'page_token':None,
            }

  # initialize things
  trades = {}
  first_pass = True

  while first_pass or page_token is not None:
    first_pass = False
    response = requests.get(BASE_URL+ENDPOINT, headers=HEADERS, params=params)
    
    # merge this page's trades into the running dict
    # (note: dict | keeps the right-hand value for duplicate keys)
    json_data = response.json()['trades']
    trades = trades | json_data
    
    # check for a next_page_token
    page_token = response.json()['next_page_token']
    params['page_token'] = page_token

  return trades


def convert_json_trades_to_df(json_data, time_zone='America/New_York'):
  # map columns
  column_names = {'symbol':'symbol',
            't':'timestamp',
            'p':'price',
            's':'size',
            'x':'exchange',
            'z':'tape',
            'i':'index',
            'c':'conditions',
              }
  # add symbol to each json record and convert to dataframe
  records = [dict(record, symbol=stock) for stock, recordset in json_data.items() for record in recordset]
  df = pd.DataFrame.from_records(records, columns=column_names.keys())

  # rename columns, set timezone, set index
  df.rename(columns=column_names, inplace=True)
  df['timestamp'] = pd.to_datetime(df.timestamp).dt.tz_convert(time_zone)
  df.set_index(['symbol', 'timestamp'], inplace=True)

  return df

# get trades and convert to dataframe
trades = get_trades(symbols='AMZN', 
           start=pd.to_datetime('2024-04-04 00:00').tz_localize('America/New_York'),
           end=pd.to_datetime('2024-04-04 23:00').tz_localize('America/New_York'),
           )

trades_df = convert_json_trades_to_df(trades)

# find duplicates
duplicate_index = trades_df.reset_index().duplicated(subset=['symbol', 'exchange', 'index'])
duplicates = trades_df.reset_index().loc[duplicate_index]
display(duplicates)

The result for duplicates is an empty dataframe.

What timezone are you specifying? Are you by chance fetching the evening of one day and the morning of the next (because of the timezone)?
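
For example (a quick sketch), a New York calendar day straddles two UTC dates, so date strings interpreted as UTC can pull in the prior evening or the following morning:

import pandas as pd

# 2024-04-04 in New York is UTC-4 (EDT), so the NY day runs from
# 04:00 UTC on the 4th to 04:00 UTC on the 5th
start_ny = pd.Timestamp('2024-04-04 00:00', tz='America/New_York')
end_ny = pd.Timestamp('2024-04-04 23:59:59', tz='America/New_York')
print(start_ny.tz_convert('UTC'))  # 2024-04-04 04:00:00+00:00
print(end_ny.tz_convert('UTC'))    # 2024-04-05 03:59:59+00:00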

Hi @Dan_Whitnable_Alpaca Thanks for sharing your code. I can see why you could not replicate it. On my side I’m not using the timestamp to detect duplicates, I’m using the date (correctly extracted from the timestamp using NY time), as we discussed earlier. I also ran your code, and it doesn’t seem to fetch all trades for the day: the line trades = trades | json_data replaces the symbol’s trade list on every page (dict | keeps the right-hand value for duplicate keys), so only the last page survives. I’m sharing this code, which should reproduce the exact issue I referred to:

import pandas as pd
import requests

API_KEY = "***"
SECRET_KEY = "***"

def get_trades_df(symbol, start, end, api_key=API_KEY, secret_key=SECRET_KEY):

    BASE_URL = "https://data.alpaca.markets"
    ENDPOINT = "/v2/stocks/trades"
    headers = {
        "APCA-API-KEY-ID": api_key,
        "APCA-API-SECRET-KEY": secret_key
    }

    trades = []
    page_token = None

    while True:
        params = {
            "symbols": symbol,
            "start": start,
            "end": end,
            "limit": 10000,
            "page_token": page_token
        }

        resp = requests.get(BASE_URL + ENDPOINT, headers=headers, params=params)
        resp.raise_for_status()
        data = resp.json()

        trades.extend(data.get("trades", {}).get(symbol, []))

        page_token = data.get("next_page_token")
        if not page_token:
            break

    if not trades:
        return pd.DataFrame()

    df = pd.DataFrame(trades)

    column_names = {
        "t": "timestamp",
        "p": "price",
        "s": "size",
        "x": "exchange",
        "z": "tape",
        "i": "index",
        "c": "conditions",
        "u": "updates"
    }
    
    df.rename(columns=column_names, inplace=True)

    # Not every trade carries the optional 'u' (updates) field, so make sure
    # the column exists before it is selected below
    if "updates" not in df.columns:
        df["updates"] = None
    
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["symbol"] = symbol

    # Convert temporarily to New York timezone to correctly extract the date
    df["timestamp_ny"] = df["timestamp"].dt.tz_convert("America/New_York")
    # Extract date based on NY timezone
    df["date"] = df["timestamp_ny"].dt.date

    return df[["symbol", "date", "timestamp", "conditions", "updates", "price", "size", "exchange", "tape", "index"]]

df_trades = get_trades_df("AMZN", "2024-04-04", "2024-04-04")

# Find duplicates by 'symbol', 'date', 'exchange' and 'index'
duplicates = df_trades[df_trades.duplicated(subset=["symbol", "date", "exchange", "index"], keep=False)]

# Count how many duplicates exist
print("Total duplicates found:", len(duplicates))
display(duplicates.sort_values(by='index'))

Here is my output:

Thanks!

Another potential issue that concerns me a bit regarding historical trade data is that, IMHO, Alpaca’s feed creates hindsight bias. The historical endpoint tags trades as ‘corrected’ or ‘canceled’, which is nice, but in practice I’ve observed these corrections often arrive minutes after the original trade was printed. So, during live trading over the websocket, the algorithm would process the bad trade and make decisions based on it, only learning about the correction much later, whereas in backtesting it would see a cleaned feed… Of course I can simply backtest while ignoring trade updates, but it would be ideal to know when each update was actually delivered.
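
What I’m considering as a workaround is sketched below: stamp every live message with its local receipt time, so a backtest can replay corrections with the same delay the live algorithm saw. (The field names follow the shape of Alpaca’s v2 stream trade messages but should be verified; on_trade and store_record are hypothetical names, not part of any API.)

import datetime as dt

def on_trade(trade):
    # Hypothetical live-stream handler: persist both the SIP event timestamp
    # and the local wall-clock receipt time, so corrections/cancels can be
    # applied in a backtest only after their recorded receipt time
    record = {
        'symbol': trade['S'],
        'trade_id': trade['i'],
        'exchange': trade['x'],
        'event_ts': trade['t'],                           # SIP timestamp
        'received_ts': dt.datetime.now(dt.timezone.utc),  # local receipt time
    }
    store_record(record)  # hypothetical persistence function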