What is the source of the historical data?

I don't really understand where the data is coming from.
Alpaca works with several providers for its market data. Additionally, it has many exchanges with different prices for OHLC. For example, when you subscribe to any symbol over the WebSocket, the response sometimes contains 3 different OHLC values.

What I would like to know is which exchange is most similar to the historical data from the REST API. Or do I have to somehow combine the data from several exchanges and take the average to most closely resemble the historical data?

@sirname Market data can be confusing for sure. First, there are three types of data:

  • quotes (more precisely “protected quotes”)
  • trades
  • aggregated trades (commonly referred to as ‘bars’)

The collection and dissemination of quotes and trade executions is required, and enforced, by the Securities and Exchange Commission (SEC) and is spelled out in Regulation NMS. Every provider (including Alpaca) presents identical quotes and trades (which is one of the primary goals of Regulation NMS). Providers, however, vary in how they aggregate trades. There are industry guidelines, but they are not complete and not uniformly followed by all providers.

Understanding the distinction between those three types of data is necessary to understand “where the data comes from”. Technically, Alpaca sources quote and trade data from a single ‘aggregator’ (so the statement “Alpaca works with several providers” is incorrect). Our aggregator simply streams us the data it receives from the two Securities Information Processors or SIPs (Unlisted Trading Privileges or UTP, and Consolidated Tape Association or CTA). Those two SIPs are the source of all quote and trade data for every provider (such as Alpaca, Polygon, Yahoo, etc). The quote and trade data is identical between providers.

The aggregated “bar” data however is calculated by each provider based upon trades. This is the “Open High Low Close” data one sees. While every provider starts with identical trade data, the specific trades they choose to include in their calculations are not identical. It’s important to understand there are many types of trades other than the familiar “market” orders placed by most retail traders. The “Conditions” field indicates the type of trade. Not all trades are equal. For example, a trade with a condition “R” indicates a ‘Seller’s Option’ transaction, which gives the seller the right to deliver the security at any time between 2 and 60 days from the transaction. This is different from a regular trade, which must be settled in two trading days. For this reason these trades are typically executed at prices much different than regular trades and are therefore excluded from most bar calculations.

The SIPs publish guidelines for which trades to include in which bar calculation. That guidance is on page 43 of the UTP Specification and page 64 of the CTA Specification. A glaring omission in the guidelines is that they give no guidance on Open data. This leads to every provider choosing slightly different trades for their “Open”.
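To make the idea of “choosing which trades to include” concrete, here is a minimal sketch of building one-minute bars from trades while skipping certain condition codes. It is only an illustration, not how Alpaca (or any provider) actually computes bars. The `Trade` class, the `minute_bars` function, the sample prices, and the choice to exclude only condition “R” are all assumptions made up for this example. The real SIP guidance is also finer grained than a single include/exclude list (it is given per bar field), which this sketch ignores.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: only "R" (Seller's Option, mentioned above) is excluded here.
# A real provider follows a much longer, per-field rule set.
EXCLUDED_CONDITIONS = {"R"}


@dataclass
class Trade:
    """Hypothetical trade record; not an Alpaca SDK type."""
    timestamp: datetime
    price: float
    size: int
    conditions: tuple = ()


def minute_bars(trades, excluded=EXCLUDED_CONDITIONS):
    """Aggregate trades into one-minute OHLCV bars keyed by minute start."""
    bars = {}
    for t in sorted(trades, key=lambda t: t.timestamp):
        # Skip any trade carrying at least one excluded condition code.
        if excluded.intersection(t.conditions):
            continue
        minute = t.timestamp.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            # First eligible trade of the minute sets every field.
            bars[minute] = {"open": t.price, "high": t.price,
                            "low": t.price, "close": t.price,
                            "volume": t.size}
        else:
            bar["high"] = max(bar["high"], t.price)
            bar["low"] = min(bar["low"], t.price)
            bar["close"] = t.price
            bar["volume"] += t.size
    return bars


# Made-up sample data: the third trade is a Seller's Option print far
# from the prevailing price.
t0 = datetime(2023, 1, 3, 14, 30, tzinfo=timezone.utc)
sample = [
    Trade(t0.replace(second=1), 100.0, 100),
    Trade(t0.replace(second=5), 100.5, 200),
    Trade(t0.replace(second=9), 90.0, 50, ("R",)),
    Trade(t0.replace(second=30), 100.2, 100),
]

print(minute_bars(sample)[t0])
print(minute_bars(sample, excluded=set())[t0])
```

With the “R” trade excluded, the low of the bar is 100.0. With nothing excluded, the same trades give a low of 90.0. Two providers starting from identical trades can therefore publish different bars purely because of which conditions they filter.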

So, to answer the question “I would like to know which exchange is most similar to the historical data from the REST API”: one typically wouldn’t look at the ‘exchange’. The whole point of Regulation NMS is to unify the many exchanges and other trading venues into a single consolidated market. Alpaca includes all exchanges in our bar calculations (though there are a few exceptions if using the Free data). As far as “data most similar to the historical data” goes, the streamed data is identical to the historical quotes, trades, and bars presented by the REST APIs. Again, there are a few differences when using the Free data, but one should only be using the Free data for general algo debugging and not for actual bar construction and trading.

The other part of the question, “…it has many exchanges with different prices for ohlc”, seems to imply one is looking at the consolidated bar data (and not quotes or trades). Bar data is the only data which has OHLC. Consolidated bar data, however, by definition doesn’t have an ‘exchange’. It is trade data consolidated from all exchanges and trading venues.
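As a rough illustration of why a consolidated bar has no ‘exchange’, the sketch below computes OHLC once per exchange and once across all trades. The exchange codes and prices are made up for the example, and the `ohlc` helper is hypothetical, not part of any Alpaca library.

```python
from collections import defaultdict


def ohlc(prices):
    """OHLC for a time-ordered, non-empty list of prices."""
    return {"open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1]}


# Made-up (exchange code, price) pairs in time order within one bar interval.
trades = [("V", 10.00), ("P", 10.02), ("V", 10.05), ("Q", 9.98), ("P", 10.01)]

# Per-exchange view: each venue only sees its own trades.
by_exchange = defaultdict(list)
for exchange, price in trades:
    by_exchange[exchange].append(price)
per_exchange = {ex: ohlc(prices) for ex, prices in by_exchange.items()}

# Consolidated view: one bar over every trade, regardless of venue.
consolidated = ohlc([price for _, price in trades])

print(per_exchange)
print(consolidated)
```

Each per-exchange bar differs from the others, and none of them matches the consolidated bar, which takes its open from one venue, its high from the same venue, its low from a second, and its close from a third. Averaging the per-exchange bars would not reproduce it either.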

I feel this may not have exactly answered the question. If you are trying to calculate your own bars based upon streamed trades, please indicate that and, if so, provide your rationale for not simply streaming bars. If you are wondering how data physically gets from when a trade is executed to being displayed by a data provider, that is a whole (fascinating) topic in itself but best left for a separate post.

Yes, I think the three types of data are necessary to include in any content.
Like http://pageimage.epizy.com/valuation.html