Entity Resolution

Identity Graph

83K person nodes. 621K introduction edges. A graph database that maps who knows whom and finds the shortest path to any target contact.

83K person nodes
621K CAN_INTRO edges
Linked data sources
7-sheet intro map deliverable

The Problem

Business relationships exist in fragments. A person appears as a LinkedIn connection, a row in a CRM, a name on a Secretary of State filing, a contributor to a GitHub organization, and a co-investor on a Crunchbase deal — five records that a human would recognize as the same person but a database treats as five strangers.

The higher-order problem is relationship mapping: given a target person you want to reach, who in your existing network knows them, and what is the strongest introduction path? This is the problem that drives warm outreach in PE deal sourcing, partnership development, and enterprise sales.

Architecture

The system is built on Neo4j (graph database) running on the DGX, with a FastAPI intelligence API (port 8010) and a Next.js visualization UI.

Data Model

The graph has two primary node types and multiple edge types:

Person nodes: Name, title, company, LinkedIn URL, email, location, and source provenance
Company nodes: Name, domain, industry, size, location, and registration data
CAN_INTRO edges: Weighted relationship links between Person nodes. 621K of these create the introduction network.
WORKS_AT edges: Person-to-Company employment relationships
CO_INVESTED edges: Shared investment activity between investors
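The data model above can be sketched as plain Python types. This is a minimal illustration, not the project's schema: the field names, the `Edge` class, and the `EDGE_ENDPOINTS` table are assumptions, and treating CO_INVESTED as a Person-to-Person edge is a guess based on the phrase "between investors".

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    name: str
    title: Optional[str] = None
    company: Optional[str] = None
    linkedin_url: Optional[str] = None
    email: Optional[str] = None
    location: Optional[str] = None
    sources: list = field(default_factory=list)  # source provenance


@dataclass
class Company:
    name: str
    domain: Optional[str] = None
    industry: Optional[str] = None
    size: Optional[str] = None
    location: Optional[str] = None
    registration: Optional[dict] = None


# Assumed endpoint types per edge kind (hypothetical, for illustration only).
EDGE_ENDPOINTS = {
    "CAN_INTRO": (Person, Person),
    "WORKS_AT": (Person, Company),
    "CO_INVESTED": (Person, Person),
}


@dataclass
class Edge:
    kind: str
    source: object
    target: object
    weight: float = 1.0

    def __post_init__(self) -> None:
        # Reject unknown edge kinds and mismatched endpoint types.
        if self.kind not in EDGE_ENDPOINTS:
            raise ValueError(f"unknown edge kind: {self.kind}")
        src_type, dst_type = EDGE_ENDPOINTS[self.kind]
        if not isinstance(self.source, src_type) or not isinstance(self.target, dst_type):
            raise TypeError(f"{self.kind} expects {src_type.__name__} -> {dst_type.__name__}")


ana = Person(name="Ana Ruiz", company="Acme", sources=["crm"])
acme = Company(name="Acme", domain="acme.example")
employment = Edge(kind="WORKS_AT", source=ana, target=acme)
```

Validating endpoint types at construction time mirrors what a graph schema constraint would enforce in the database.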

Entity Resolution Pipeline

The resolution pipeline uses a two-pass approach:

Deterministic matching (pass 1): Exact matches on email address, LinkedIn URL, or (full name + company) tuple. These are high-confidence merges that can run without human review.
Fuzzy matching (pass 2): Probabilistic matching using name similarity (Jaro-Winkler distance), company name normalization, title semantic similarity, and location proximity. Matches above a confidence threshold merge automatically; borderline cases queue for review.
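The two passes can be sketched as follows. This is a simplified illustration under stated assumptions: records are plain dicts, the thresholds (`auto=0.92`, `review=0.80`) and the 0.7/0.3 weighting are invented for the example, and the fuzzy score covers only name and company similarity, leaving out the title and location signals described above.

```python
from __future__ import annotations


def norm(value) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join(value.lower().split()) if value else ""


def deterministic_keys(rec: dict) -> set:
    """Pass 1: exact-match keys on email, LinkedIn URL, or (name, company)."""
    keys = set()
    if rec.get("email"):
        keys.add(("email", norm(rec["email"])))
    if rec.get("linkedin_url"):
        keys.add(("linkedin", norm(rec["linkedin_url"]).rstrip("/")))
    if rec.get("name") and rec.get("company"):
        keys.add(("name_company", norm(rec["name"]), norm(rec["company"])))
    return keys


def jaro(s1: str, s2: str) -> float:
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def classify(a: dict, b: dict, auto: float = 0.92, review: float = 0.80) -> str:
    """Return 'merge', 'review', or 'distinct' for a pair of records."""
    if deterministic_keys(a) & deterministic_keys(b):
        return "merge"  # pass 1: high-confidence exact match
    # Pass 2: weighted similarity (weights are assumptions).
    score = (0.7 * jaro_winkler(norm(a.get("name")), norm(b.get("name")))
             + 0.3 * jaro_winkler(norm(a.get("company")), norm(b.get("company"))))
    if score >= auto:
        return "merge"
    return "review" if score >= review else "distinct"
```

Checking the deterministic keys first keeps the more expensive fuzzy comparison off pairs that already match exactly.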

Data Sources

Source	Data Contributed	Ingestion
LinkedIn Sales Navigator	Professional profiles, connections, mutual connections	Primary source
CRM Records	Contact details, interaction history, deal associations	Imported via CSV/API
Secretary of State Filings	Business registrations, officer/director names, registered agents	State-specific scrapers
Crunchbase	Investment relationships, co-investor data, funding rounds	Via web scraping
Web Domains	Team pages, about pages, leadership bios	Via internal scraper

Introduction Path Discovery

The core deliverable is an introduction map: given a list of target contacts, the system finds the shortest and strongest paths through the relationship graph from the user's network to each target.
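The fewest-hops part of that search can be illustrated with a multi-source breadth-first search. This is a sketch over an in-memory adjacency dict rather than the Neo4j graph itself. The function name, the dict layout, and the reading of CAN_INTRO as a directed edge are all assumptions.

```python
from collections import deque


def shortest_intro_path(graph, network, target):
    """Return the fewest-hop path from anyone in `network` to `target`.

    graph maps a person ID to the IDs they can introduce (CAN_INTRO).
    Returns None when no path exists, which would feed a gap analysis.
    """
    parent = {person: None for person in network}
    queue = deque(network)
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parent:  # first visit is the shortest route
                parent[neighbor] = node
                queue.append(neighbor)
    return None


graph = {
    "you": ["ana", "cy"],
    "ana": ["ben"],
    "ben": ["target"],
    "cy": ["target"],
}
path = shortest_intro_path(graph, ["you"], "target")
```

Here `path` is `["you", "cy", "target"]`, the two-hop route. Breadth-first search ignores edge weights, so it answers "shortest" but not "strongest".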

Path strength is computed from edge weights that factor in:

Recency: When was the relationship last active?
Reciprocity: Is it a one-way or mutual connection?
Proximity: How many hops separate the source from the target?
Shared context: Do the connector and target share industry, location, or co-investment history?
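One way these factors could combine is sketched below. The formula is an assumption, not the system's actual scoring: recency is modeled as exponential decay with a one-year half-life, reciprocity and shared context as multipliers, and path strength as the product of edge weights, so every extra hop lowers the score and proximity is penalized implicitly. The strongest path is then found with Dijkstra's algorithm over `-log(weight)`.

```python
import heapq
import math
from datetime import date


def edge_weight(last_active, mutual, shared_contexts, today, half_life_days=365.0):
    """Combine recency, reciprocity, and shared context into a weight in (0, 1]."""
    age_days = max((today - last_active).days, 0)
    recency = 0.5 ** (age_days / half_life_days)      # halves every half-life
    reciprocity = 1.0 if mutual else 0.6              # one-way links count less
    context = min(1.0, 0.7 + 0.1 * shared_contexts)   # industry, location, deals
    return recency * reciprocity * context


def strongest_path(graph, source, target):
    """Return (path, strength) maximizing the product of edge weights.

    graph maps node -> {neighbor: weight}, with every weight in (0, 1].
    """
    best = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], math.exp(-cost)
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost - math.log(weight)  # non-negative since weight <= 1
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None, 0.0


weighted = {
    "you": {"ana": 0.9, "target": 0.1},
    "ana": {"target": 0.8},
}
path, strength = strongest_path(weighted, "you", "target")
```

In this example the two-hop route through `ana` (strength about 0.72) beats the weak direct link (0.1), which is the case a hop count alone would get wrong.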

The output is a 7-sheet Excel workbook: target list with match status, ranked connectors per target, full path details, connector profiles, gap analysis (targets with no path found), data quality summary, and methodology notes.

Semantic Search

ChromaDB embeddings (sentence-transformers) enable natural language queries against the graph: “find healthcare investors in the Midwest who co-invested with firm X.” The embedding layer converts unstructured queries into graph traversal parameters, combining the flexibility of semantic search with the precision of structured graph queries.

Results

83,000 person nodes resolved from overlapping data sources
621,000 CAN_INTRO edges mapping the relationship network
Introduction maps generated for investor outreach campaigns
Average path length to target: 2.3 hops (within “warm intro” range)

Stack

Neo4j, Python, FastAPI, Next.js, ChromaDB, sentence-transformers, pandas, openpyxl
