How It Works

Data Extraction

The Process

You might be interested in how we extract our data and how the database is searched. While we can't go into full detail, the sections below give a high-level overview of the process.

[Figure: data pipeline]

Scrape Cases

The process starts with scraping case documents from the European Commission's Competition Case Search website. We then convert the PDFs into text documents; at this stage, most cases without a market definition are excluded. The resulting collection of text is referred to as the text corpus.

[Figure: web scrape]
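
The exclusion step can be illustrated with a minimal sketch. It assumes the converted documents are already available as plain strings and uses a hypothetical keyword pattern to decide whether a case mentions a market definition; the actual exclusion criteria are not described here, and the case identifiers are made up.

```python
import re

# Hypothetical heuristic: the criteria actually used to exclude cases are not
# published, so this pattern is only an illustration.
MARKET_PATTERN = re.compile(
    r"\b(relevant|product|geographic)\s+markets?\b", re.IGNORECASE
)


def build_corpus(documents: dict[str, str]) -> dict[str, str]:
    """Keep only the documents whose text appears to discuss a market definition."""
    return {
        case_id: text
        for case_id, text in documents.items()
        if MARKET_PATTERN.search(text)
    }


# Made-up case identifiers and text, for illustration only.
docs = {
    "case-A": "IV. RELEVANT MARKETS\nThe relevant product market is ...",
    "case-B": "The Commission has decided not to oppose the notified operation.",
}
corpus = build_corpus(docs)
```

A keyword filter like this would let some cases without a real definition slip through, which fits the description that only most such cases are excluded at this stage.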

Section Extraction

After building the text corpus, we run it through a fine-tuned Google Gemini model with a specially designed prompt that extracts the section containing the market definitions. Any remaining cases without market definitions are removed at this point.

[Figure: data input]
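
As a rough sketch of this step, the loop below sends each document to a model and drops the cases for which the model reports no market definition section. The prompt wording, the `NONE` convention, and the `generate` callable are all assumptions made for illustration; `fake_generate` is a local stand-in so the example runs without calling any real API.

```python
from typing import Callable

# Hypothetical prompt: the real prompt is not published.
PROMPT_TEMPLATE = (
    "Return only the section of the decision that defines the relevant markets. "
    "If there is no such section, reply with NONE.\n\n{text}"
)


def extract_sections(
    corpus: dict[str, str], generate: Callable[[str], str]
) -> dict[str, str]:
    """Ask the model for the market definition section of every case.

    Cases for which the model answers NONE (or nothing) are dropped.
    """
    sections = {}
    for case_id, text in corpus.items():
        reply = generate(PROMPT_TEMPLATE.format(text=text)).strip()
        if reply and reply.upper() != "NONE":
            sections[case_id] = reply
    return sections


def fake_generate(prompt: str) -> str:
    """Offline stand-in for the model call (assumption, not the real model)."""
    document = prompt.split("\n\n", 1)[1]
    marker = "RELEVANT MARKETS"
    if marker not in document:
        return "NONE"
    return document[document.index(marker):]


corpus = {
    "case-A": "Introduction.\nRELEVANT MARKETS\nBeer production.",
    "case-B": "No such part in this document.",
}
sections = extract_sections(corpus, fake_generate)
```

Passing the model call in as a parameter keeps the filtering logic independent of whichever client library is used to reach the model.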

Definition Extraction

Next, we run the extracted market definition sections through a second fine-tuned Google Gemini model, which extracts each individual definition. The definitions are then aggregated into a single JSON file.

[Figure: data filtering]
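
The aggregation into a single JSON file might look like the sketch below. The record layout (`case_id`, `definition`), the file name, and the sample definitions are hypothetical; the actual schema of the JSON file is not described here.

```python
import json
import tempfile
from pathlib import Path


def aggregate_definitions(per_case: dict[str, list[str]], out_path: Path) -> int:
    """Flatten per-case definitions into one list of records and write it as JSON.

    Returns the number of records written.
    """
    records = [
        {"case_id": case_id, "definition": definition}  # hypothetical schema
        for case_id, definitions in per_case.items()
        for definition in definitions
    ]
    out_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(records)


# Example run with made-up definitions, written to a temporary directory.
per_case = {
    "case-A": ["Production and supply of beer", "Wholesale distribution of beverages"],
    "case-B": ["Retail supply of electricity"],
}
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "definitions.json"
    count = aggregate_definitions(per_case, out_file)
    loaded = json.loads(out_file.read_text(encoding="utf-8"))
```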

Searching For Market Definitions

Embedding and Indexing

Once the database is ready, each entry is parsed to extract the relevant text fields. These text chunks are then embedded using the Pinecone API, transforming them into high-dimensional vectors that capture their semantic meaning. These vectors are stored remotely in a Pinecone index.

[Figure: embedding]
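
To make the embed-and-index idea concrete, here is a toy sketch that runs entirely in memory. It is not the production setup: `embed` is a hashed bag-of-words stand-in for a real embedding model, the `index` dictionary stands in for the remote Pinecone index, and the entry fields (`id`, `topic`, `definition`) are hypothetical.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def entry_text(entry: dict) -> str:
    """Join the text fields of one database entry (field names are hypothetical)."""
    return " ".join(str(entry[f]) for f in ("topic", "definition") if entry.get(f))


def upsert(index: dict, entries: list[dict]) -> None:
    """Store one vector per entry, keyed by id, together with the entry as metadata."""
    for entry in entries:
        index[entry["id"]] = {"values": embed(entry_text(entry)), "metadata": entry}


index: dict = {}
upsert(index, [
    {"id": "def-1", "topic": "beer", "definition": "Production and supply of beer"},
    {"id": "def-2", "topic": "energy", "definition": "Retail supply of electricity"},
])
```

Unlike this word-count toy, a real embedding model places texts with similar meaning close together even when they share no words, which is what makes semantic search possible.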

Queries

When a user submits a query, the OpenAI API converts it into a vector using the same embedding model that was used for the database entries. This query vector is then used to search the Pinecone index, which returns the closest matches based on semantic similarity rather than exact keywords.

[Figure: query]
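
The nearest-match lookup can be sketched with plain cosine similarity over a small in-memory index. This is an assumption-laden illustration: the three-dimensional vectors are hand-written rather than produced by an embedding model, the `search` function stands in for a query against the hosted index, and cosine similarity is assumed as the metric.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    index: dict[str, list[float]], query_vector: list[float], top_k: int
) -> list[tuple[str, float]]:
    """Return the top_k (id, score) pairs, most similar first."""
    scored = [(item_id, cosine(vec, query_vector)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hand-written toy vectors standing in for real embeddings.
index = {
    "beer": [1.0, 0.0, 0.0],
    "cider": [0.8, 0.6, 0.0],
    "steel": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]
results = search(index, query_vector, top_k=2)
```

Here the query vector points almost in the same direction as `beer`, so that entry ranks first, followed by `cider`; `steel` is orthogonal to the query and is left out of the top two.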

Matches

The returned matches are then narrowed down according to the filters you selected, so you may receive anywhere from 0 to 20 matches. Each match is given a score that represents how similar it is to the meaning of your query (input). The closer the score is to 1.000, the more similar the match.

[Figure: matches]
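
A minimal sketch of the filtering step is shown below. The metadata field used as a filter (`year`), the match layout, and the point at which the 20-match cap is applied are assumptions for illustration; only the cap of 20 and the three-decimal score come from the description above.

```python
MAX_MATCHES = 20  # upper bound on the number of matches shown


def apply_filters(matches: list[dict], filters: dict) -> list[dict]:
    """Keep matches whose metadata equals every chosen filter value.

    Results are sorted by score, highest first, and capped at MAX_MATCHES.
    An empty filter dict keeps every match.
    """
    kept = [
        m for m in matches
        if all(m["metadata"].get(key) == value for key, value in filters.items())
    ]
    kept.sort(key=lambda m: m["score"], reverse=True)
    return kept[:MAX_MATCHES]


# Made-up matches; "year" is a hypothetical filter field.
matches = [
    {"id": "def-1", "score": 0.912, "metadata": {"year": 2019}},
    {"id": "def-2", "score": 0.954, "metadata": {"year": 2019}},
    {"id": "def-3", "score": 0.871, "metadata": {"year": 2021}},
]
results = apply_filters(matches, {"year": 2019})
lines = [f"{m['id']}  {m['score']:.3f}" for m in results]
```

Because filtering can remove every candidate, an empty result list is a normal outcome rather than an error.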