Entity Resolution
Identity Graph
83K person nodes. 621K introduction edges. A graph database that maps who knows whom and finds the shortest path to any target contact.
- 83K Person nodes
- 621K CAN_INTRO edges
- 5+ linked data sources
- 7-sheet intro map deliverable
The Problem
Business relationships exist in fragments. A person appears as a LinkedIn connection, a row in a CRM, a name on a Secretary of State filing, a contributor to a GitHub organization, and a co-investor on a Crunchbase deal: five records that a human would recognize as the same person but that a database treats as five strangers.
The higher-order problem is relationship mapping: given a target person you want to reach, who in your existing network knows them, and what is the strongest introduction path? This is the problem that drives warm outreach in PE deal sourcing, partnership development, and enterprise sales.
Architecture
The system is built on Neo4j (graph database) running on the DGX, with a FastAPI intelligence API (port 8010) and a Next.js visualization UI.
Data Model
The graph has two primary node types and multiple edge types:
- Person nodes: Name, title, company, LinkedIn URL, email, location, and source provenance
- Company nodes: Name, domain, industry, size, location, and registration data
- CAN_INTRO edges: Weighted relationship links between Person nodes. 621K of these create the introduction network.
- WORKS_AT edges: Person-to-Company employment relationships
- CO_INVESTED edges: Shared investment activity between investors
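As an illustration of the data model above, the node and edge types can be sketched as plain Python dataclasses. This is not the Neo4j schema itself. The field names, ID formats, and sample values are assumptions chosen to mirror the attributes listed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    person_id: str                      # hypothetical internal ID
    name: str
    title: Optional[str] = None
    company: Optional[str] = None
    linkedin_url: Optional[str] = None
    email: Optional[str] = None
    location: Optional[str] = None
    sources: list[str] = field(default_factory=list)  # source provenance


@dataclass
class Company:
    company_id: str                     # hypothetical internal ID
    name: str
    domain: Optional[str] = None
    industry: Optional[str] = None
    size: Optional[str] = None
    location: Optional[str] = None
    registration: dict[str, str] = field(default_factory=dict)


@dataclass
class Edge:
    kind: str            # "CAN_INTRO", "WORKS_AT", or "CO_INVESTED"
    src: str             # ID of the source node
    dst: str             # ID of the destination node
    weight: float = 1.0  # only meaningful for weighted kinds such as CAN_INTRO


# Made-up sample data for illustration only.
alice = Person("p1", "Alice Ng", company="Acme", sources=["crm", "linkedin"])
acme = Company("c1", "Acme", domain="acme.example")
edges = [
    Edge("WORKS_AT", alice.person_id, acme.company_id),
    Edge("CAN_INTRO", "p1", "p2", weight=0.8),
]
```

Keeping provenance as a list on each Person means a merged node can still record every source that contributed to it.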
Entity Resolution Pipeline
The resolution pipeline uses a two-pass approach:
- Deterministic matching (pass 1): Exact matches on email address, LinkedIn URL, or (full name + company) tuple. These are high-confidence merges that can run without human review.
- Fuzzy matching (pass 2): Probabilistic matching using name similarity (Jaro-Winkler), company name normalization, title semantic similarity, and location proximity. Matches above a confidence threshold merge automatically; borderline cases queue for review.
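The two passes can be sketched as follows, under several assumptions. Records are plain dicts, merges are tracked with a union-find structure, and the thresholds and score weights are hypothetical. To stay within the standard library, the sketch substitutes `difflib.SequenceMatcher` for Jaro-Winkler and leaves out title and location signals.

```python
from difflib import SequenceMatcher
from itertools import combinations

AUTO_MERGE = 0.90  # hypothetical auto-merge threshold
REVIEW = 0.75      # hypothetical lower bound for the review queue


def norm(value):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((value or "").lower().split())


def deterministic_keys(rec):
    """Keys on which an exact match is treated as the same person."""
    keys = []
    if rec.get("email"):
        keys.append(("email", norm(rec["email"])))
    if rec.get("linkedin_url"):
        keys.append(("linkedin", norm(rec["linkedin_url"]).rstrip("/")))
    if rec.get("name") and rec.get("company"):
        keys.append(("name_company", norm(rec["name"]), norm(rec["company"])))
    return keys


class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def sim(a, b):
    """String similarity in [0, 1]; 0 if either side is missing."""
    a, b = norm(a), norm(b)
    return SequenceMatcher(None, a, b).ratio() if a and b else 0.0


def fuzzy_score(a, b):
    # Stand-in for the Jaro-Winkler based score; the weights are assumptions.
    return 0.7 * sim(a.get("name"), b.get("name")) + 0.3 * sim(
        a.get("company"), b.get("company")
    )


def resolve(records):
    uf = UnionFind(len(records))
    seen = {}
    # Pass 1: exact-key matches merge without review.
    for i, rec in enumerate(records):
        for key in deterministic_keys(rec):
            if key in seen:
                uf.union(i, seen[key])
            else:
                seen[key] = i
    # Pass 2: score the remaining pairs; merge, queue for review, or ignore.
    review_queue = []
    for i, j in combinations(range(len(records)), 2):
        if uf.find(i) == uf.find(j):
            continue
        score = fuzzy_score(records[i], records[j])
        if score >= AUTO_MERGE:
            uf.union(i, j)
        elif score >= REVIEW:
            review_queue.append((i, j, round(score, 2)))
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values()), review_queue


records = [
    {"name": "Jane Doe", "email": "jane@acme.example", "company": "Acme Corp"},
    {"name": "Jane A. Doe", "email": "JANE@acme.example"},
    {"name": "Jane Doe", "company": "Acme Corporation"},
    {"name": "Raj Patel", "company": "Globex"},
]
clusters, review_queue = resolve(records)
```

In this made-up sample, records 0 and 1 merge in pass 1 on the normalized email, and record 2 joins them in pass 2 because the name matches exactly and the company names are close. The all-pairs comparison is quadratic, so at the scale of tens of thousands of records a real pipeline would need some form of blocking to limit the candidate pairs.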
Data Sources
| Source | Data Contributed | Role / Ingestion Method |
|---|---|---|
| LinkedIn Sales Navigator | Professional profiles, connections, mutual connections | Primary source |
| CRM Records | Contact details, interaction history, deal associations | Imported via CSV/API |
| Secretary of State Filings | Business registrations, officer/director names, registered agents | State-specific scrapers |
| Crunchbase | Investment relationships, co-investor data, funding rounds | Via web scraping |
| Web Domains | Team pages, about pages, leadership bios | Via internal scraper |
Introduction Path Discovery
The core deliverable is an introduction map: given a list of target contacts, the system finds the shortest and strongest paths through the relationship graph from the user's network to each target.
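The "shortest path" half of this can be illustrated with a multi-source breadth-first search over an adjacency list. This is a minimal sketch, not the production Neo4j query. The graph, the node names, and the function name `shortest_intro_path` are hypothetical.

```python
from collections import deque


def shortest_intro_path(graph, sources, target):
    """Multi-source BFS: fewest-hop path from any source to target, or None."""
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # no path: the target belongs in the gap analysis


# Made-up CAN_INTRO adjacency list for illustration.
graph = {
    "you": ["ana", "ben"],
    "ana": ["carla"],
    "ben": ["dev"],
    "carla": ["target"],
    "dev": ["erin"],
    "erin": ["target"],
}
path = shortest_intro_path(graph, ["you"], "target")
# path is ["you", "ana", "carla", "target"]: three hops, beating the four-hop route via ben
```

A target for which the function returns `None` has no path from the user's network.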
Path strength is computed from edge weights that factor in:
- Recency: When was the relationship last active?
- Reciprocity: Is it a one-way or mutual connection?
- Proximity: How many hops separate the source from the target?
- Shared context: Do the connector and target share industry, location, or co-investment history?
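One way these factors could combine is sketched below. The source does not give the actual scoring formula, so every part of this one is an assumption: recency as exponential decay with a one-year half-life, a fixed penalty for one-way connections, a small boost per shared context, and a per-hop decay applied at the path level.

```python
import math
from datetime import date


def edge_weight(last_active, mutual, shared_contexts, today, half_life_days=365):
    """Score one relationship edge in (0, 1]. All constants are hypothetical."""
    recency = 0.5 ** ((today - last_active).days / half_life_days)
    reciprocity = 1.0 if mutual else 0.5
    context = min(1.0, 0.6 + 0.2 * shared_contexts)
    return recency * reciprocity * context


def path_strength(weights, hop_decay=0.8):
    """Product of edge weights, discounted once for each hop beyond the first."""
    return math.prod(weights) * hop_decay ** (len(weights) - 1)


today = date(2025, 1, 1)
fresh = edge_weight(date(2024, 12, 1), mutual=True, shared_contexts=2, today=today)
stale = edge_weight(date(2021, 1, 1), mutual=False, shared_contexts=0, today=today)

# Rank two hypothetical candidate paths to the same target.
candidates = {
    "via ana": [fresh, fresh],
    "via ben": [fresh, stale, fresh],
}
ranked = sorted(candidates, key=lambda k: path_strength(candidates[k]), reverse=True)
```

Multiplying the weights means a single weak link drags down the whole path, which matches the intuition that an introduction chain is only as strong as its weakest relationship.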
The output is a 7-sheet Excel workbook: target list with match status, ranked connectors per target, full path details, connector profiles, gap analysis (targets with no path found), data quality summary, and methodology notes.
Semantic Search
ChromaDB embeddings (sentence-transformers) enable natural language queries against the graph: “find healthcare investors in the Midwest who co-invested with firm X.” The embedding layer converts unstructured queries into graph traversal parameters, combining the flexibility of semantic search with the precision of structured graph queries.
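The query-to-parameters step can be illustrated with a toy example. It uses neither ChromaDB nor sentence-transformers: a bag-of-words vector stands in for a real embedding, and the facet names, descriptions, and threshold are all assumptions. Only the idea carries over, which is to match the query against known facet values by vector similarity and emit structured parameters for a graph query.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-transformers embedding."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical facet vocabulary: each value has a short text description.
FACETS = {
    "industry": {
        "healthcare": "healthcare health medical hospitals biotech",
        "fintech": "fintech payments banking lending",
    },
    "region": {
        "midwest": "midwest chicago ohio michigan illinois",
        "northeast": "northeast boston new york",
    },
}


def query_to_params(query, min_score=0.1):
    """Pick the best-matching value per facet, if it clears the threshold."""
    q = embed(query)
    params = {}
    for facet, options in FACETS.items():
        best, score = max(
            ((name, cosine(q, embed(desc))) for name, desc in options.items()),
            key=lambda pair: pair[1],
        )
        if score >= min_score:
            params[facet] = best
    return params


params = query_to_params("find healthcare investors in the Midwest")
# params == {"industry": "healthcare", "region": "midwest"}
```

The resulting dict could then be passed as parameters to a parameterized graph query, so the free-text part of the request only selects filter values and does not generate the query itself.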
Results
- 83,000 person nodes resolved from overlapping data sources
- 621,000 CAN_INTRO edges mapping the relationship network
- Introduction maps generated for investor outreach campaigns
- Average path length to target: 2.3 hops (within “warm intro” range)