Polygon historic trades data outliers

Polygon historic_trades_v2 returns many outliers. In the chart below, they always spike down to a certain price, which doesn’t look natural. Any idea how to deal with these?

The chart is for AAPL on 2021-01-21 EST.


Excellent question. The historic_trades data contains a record for every ‘trade’ which occurred on or off an exchange. It includes corrections, negotiated price trades, late trades, and a host of other records which one may not consider a typical ‘trade’. It’s much more granular and inclusive than the historic_agg data which cleans the data up before aggregating it.

The key to dealing with historic_trades is to look at the conditions field and only include trades with specific conditions. Trades such as those executed at an ‘average price’ (2) or traded as ‘Seller’s Option’ (29) where the seller can deliver the shares anytime 2-60 days after the sale probably aren’t representative of ‘typical trades’ and should be excluded. Polygon has a list of the conditions and which ones are included in their price calculations here. There is also glossary explaining some of the terms here.

So, how to deal with outliers is to filter by the condition field. Let’s start with getting some data. For example AAPL on 1-21-2021.

trades_df = api.polygon.historic_trades_v2('AAPL', '2021-01-21').df

The plot looks like this with outliers similar to the original example.

Now, let’s create a helper function and use it to add a column called is_valid_trade to the dataframe. This simply checks the conditions field and includes only those trades with certain conditions and excludes those with other conditions. I’ve included mostly those which Polygon uses to update ‘Consolidated Prices’ but one could maybe change these. For example, this doesn’t include odd lot orders but maybe one does want to capture these? Anyway, here’s the code and the same data but filtered by is_valid_trade

# Create a function to test the conditions in the trade condition list
# Many are NaN which (I assume) are regular trades

def is_valid_trade(condition_list):
  # Lists of conditions to include and exclude
  # Must include at least one condition from the include list
  # Cannot have any conditions from the exclude list
  include_list = [0, 1, 3, 4, 8, 9, 11, 14, 15, 23, 25, 27, 28, 30, 34, 36]
  exclude_list = [2, 5, 7, 10, 12, 37]

  if type(condition_list) is not list: 
    # Assume none is a regular trade?
    include = True
    # There are one or more conditions in the list
    has_include_condition = any(condition in condition_list for condition in include_list)
    has_exclude_condition = any(condition in condition_list for condition in exclude_list)
    include = has_include_condition and not has_exclude_condition

  return include

trades_df['is_valid_trade'] = trades_df.conditions.apply(is_valid_trade)


The plot above is the same set of trades except now filtered by the conditions above. Notice how most of the outliers have disappeared.

It’s imperative to consider the conditions when looking at the raw trades data. One reason to maybe use aggregate data?

Good luck.

Very well explained. Thanks.