How It Works
Data Extraction
The Process
You might be interested in how we extract our data and search the database. While we can’t go into full detail, we give a high-level overview below.

Scrape Cases
The process starts with scraping definition texts from the European Commission's Competition Case Search website. We then convert the PDFs into text documents; at this point, most cases without a market definition are excluded. The scraped text is known as the text corpus.
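
As a rough illustration of the exclusion step, the sketch below keeps only converted documents that mention a market definition. The phrase list, the function names and the sample documents are all assumptions made for this example; the actual exclusion criteria are not described here.

```python
# Hypothetical marker phrases; the real exclusion criteria are not documented.
MARKER_PHRASES = ("market definition", "relevant market", "relevant product market")


def mentions_market_definition(text):
    """Return True if the text contains any of the marker phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in MARKER_PHRASES)


def build_corpus(documents):
    """Keep only the documents (case ID -> text) that mention a market definition."""
    return {
        case_id: text
        for case_id, text in documents.items()
        if mentions_market_definition(text)
    }


# Made-up sample documents standing in for converted PDFs.
documents = {
    "CASE-1": "IV. RELEVANT MARKETS\nThe relevant product market is the supply of widgets.",
    "CASE-2": "The notified operation does not raise serious doubts.",
}
corpus = build_corpus(documents)
```

A keyword check like this is cheap but imprecise, which fits the description above: most, not all, cases without a market definition are excluded at this stage.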

Section Extraction
After extracting the text corpus, we run it through a fine-tuned Google Gemini model with a specially designed prompt that extracts the section containing the market definitions. Any remaining cases without market definitions are removed at this point.

Definition Extraction
We then run the extracted market definition sections through another fine-tuned Google Gemini model to extract each individual definition. The definitions are aggregated into a single JSON file.
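
The aggregation step could look roughly like the sketch below, which flattens per-case definitions into one list of records and serializes it as JSON. The record fields (case_id, index, definition), the function name and the sample data are assumptions for illustration, not the actual schema.

```python
import json


def aggregate_definitions(per_case):
    """Flatten {case_id: [definition, ...]} into a list of records."""
    records = []
    for case_id, definitions in sorted(per_case.items()):
        for index, text in enumerate(definitions):
            records.append(
                {"case_id": case_id, "index": index, "definition": text.strip()}
            )
    return records


# Made-up extraction output for two cases.
per_case = {
    "CASE-B": ["  Retail supply of widgets in the EEA. "],
    "CASE-A": ["Wholesale supply of widgets.", "Widget maintenance services."],
}
records = aggregate_definitions(per_case)

# A single JSON document holding every definition.
database_json = json.dumps(records, indent=2, ensure_ascii=False)
```

Keeping one record per definition, rather than one per case, means each definition can later be embedded and matched on its own.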

Searching For Market Definitions
Embedding and Indexing
Once the database is ready, each entry is parsed to extract the relevant text fields. These text chunks are then embedded using the Pinecone API, transforming them into high-dimensional vectors that capture their semantic meaning. These vectors are stored remotely in a Pinecone index.

Queries
When a user submits a query, the OpenAI API converts the query into its own vector representation using the same embedding model. This query vector is then used to search the Pinecone index, returning the closest matches based on semantic similarity rather than exact keywords.
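
To make the idea of "closest matches" concrete, here is a minimal local sketch of a similarity search over a handful of toy vectors. It assumes cosine similarity as the metric, which is not stated above, and it does not call the Pinecone or OpenAI APIs; the entry IDs and three-dimensional vectors are made up, whereas real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=20):
    """Return up to top_k (score, entry_id) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vector, vector), entry_id)
        for entry_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy index standing in for the stored definition vectors.
index = {
    "def-1": [1.0, 0.0, 0.0],
    "def-2": [0.8, 0.6, 0.0],
    "def-3": [0.0, 0.0, 1.0],
}
results = search([1.0, 0.1, 0.0], index)
```

Here def-1 ranks first because its vector points in almost the same direction as the query vector, regardless of whether the underlying texts share any exact words.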

Matches
The returned matches are then narrowed down according to the filters you selected, so you may receive anywhere from 0 to 20 matches. Each match is given a score representing how similar it is to the meaning of your query. The closer the score is to 1.000, the more similar the match.
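
The filtering step can be pictured with the sketch below. The match structure (id, score, metadata), the filter keys (year, sector) and the sample values are hypothetical and chosen only for illustration; whether filtering happens before or after the 20-match limit is also an assumption.

```python
def filter_matches(matches, filters, limit=20):
    """Keep matches whose metadata equals every chosen filter value."""
    kept = [
        match
        for match in matches
        if all(match["metadata"].get(key) == value for key, value in filters.items())
    ]
    # Highest score first, capped at the maximum number of matches shown.
    kept.sort(key=lambda match: match["score"], reverse=True)
    return kept[:limit]


# Made-up matches as they might come back from the similarity search.
matches = [
    {"id": "def-1", "score": 0.912, "metadata": {"year": 2019, "sector": "energy"}},
    {"id": "def-2", "score": 0.874, "metadata": {"year": 2021, "sector": "energy"}},
    {"id": "def-3", "score": 0.951, "metadata": {"year": 2019, "sector": "telecoms"}},
]
energy_2019 = filter_matches(matches, {"year": 2019, "sector": "energy"})
no_results = filter_matches(matches, {"year": 1999})
```

With no filters chosen, every match passes and only the limit applies; with strict filters, the result can be empty, which is why 0 matches is a possible outcome.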
