Entity Resolution

Identity Graph

83K person nodes. 621K introduction edges. A graph database that maps who knows whom and finds the shortest path to any target contact.

  • 83K Person nodes
  • 621K CAN_INTRO edges
  • 5+ linked data sources
  • 7-sheet intro map deliverable

The Problem

Business relationships exist in fragments. A person appears as a LinkedIn connection, a row in a CRM, a name on a Secretary of State filing, a contributor to a GitHub organization, and a co-investor on a Crunchbase deal — five records that a human would recognize as the same person but a database treats as five strangers.

The higher-order problem is relationship mapping: given a target person you want to reach, who in your existing network knows them, and what is the strongest introduction path? This is the problem that drives warm outreach in PE deal sourcing, partnership development, and enterprise sales.

Architecture

The system is built on Neo4j (graph database) running on the DGX, with a FastAPI intelligence API (port 8010) and a Next.js visualization UI.

Data Model

The graph has two primary node types and multiple edge types:

  • Person nodes: Name, title, company, LinkedIn URL, email, location, and source provenance
  • Company nodes: Name, domain, industry, size, location, and registration data
  • CAN_INTRO edges: Weighted relationship links between Person nodes. 621K of these create the introduction network.
  • WORKS_AT edges: Person-to-Company employment relationships
  • CO_INVESTED edges: Shared investment activity between investors
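To make the shape of the model concrete, here is a minimal Python sketch of the node and edge types listed above. The field names, the identifier scheme, and the single generic `Edge` class are assumptions for illustration; the actual Neo4j property names are not given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    name: str
    title: Optional[str] = None
    company: Optional[str] = None
    linkedin_url: Optional[str] = None
    email: Optional[str] = None
    location: Optional[str] = None
    # Source provenance: which datasets contributed to this node.
    sources: List[str] = field(default_factory=list)


@dataclass
class Company:
    name: str
    domain: Optional[str] = None
    industry: Optional[str] = None
    size: Optional[str] = None
    location: Optional[str] = None
    registration: Optional[str] = None


@dataclass
class Edge:
    kind: str    # "CAN_INTRO", "WORKS_AT", or "CO_INVESTED"
    source: str  # id of the start node (hypothetical key scheme)
    target: str  # id of the end node
    weight: float = 1.0  # in this sketch, only CAN_INTRO edges use the weight
```

Keeping provenance as a list on each Person is one simple way to record that a merged node came from several sources.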

Entity Resolution Pipeline

The resolution pipeline uses a two-pass approach:

  1. Deterministic matching (pass 1): Exact matches on email address, LinkedIn URL, or (full name + company) tuple. These are high-confidence merges that can run without human review.
  2. Fuzzy matching (pass 2): Probabilistic matching using name similarity (Jaro-Winkler), company name normalization, title semantic similarity, and location proximity. Matches above a confidence threshold merge automatically; borderline cases queue for review.
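The two passes can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's code: records are plain dicts, the thresholds and the 0.7/0.3 blend are made-up values, and the standard library's `difflib.SequenceMatcher` stands in for Jaro-Winkler. Title and location signals are left out for brevity.

```python
from difflib import SequenceMatcher

AUTO_MERGE = 0.90  # hypothetical threshold: merge without review
REVIEW = 0.75      # hypothetical threshold: queue for human review


def norm(value):
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((value or "").lower().split())


def similarity(x, y):
    """String similarity in [0, 1]; missing values never count as a match."""
    x, y = norm(x), norm(y)
    if not x or not y:
        return 0.0
    # Stand-in for Jaro-Winkler, which the standard library does not provide.
    return SequenceMatcher(None, x, y).ratio()


def deterministic_match(a, b):
    """Pass 1: exact match on email, LinkedIn URL, or (name, company)."""
    for key in ("email", "linkedin_url"):
        if norm(a.get(key)) and norm(a.get(key)) == norm(b.get(key)):
            return True
    name_a, comp_a = norm(a.get("name")), norm(a.get("company"))
    name_b, comp_b = norm(b.get("name")), norm(b.get("company"))
    return bool(name_a and comp_a) and (name_a, comp_a) == (name_b, comp_b)


def fuzzy_score(a, b):
    """Pass 2: weighted blend of name and company similarity."""
    name_sim = similarity(a.get("name"), b.get("name"))
    comp_sim = similarity(a.get("company"), b.get("company"))
    return 0.7 * name_sim + 0.3 * comp_sim


def resolve(a, b):
    """Return 'merge', 'review', or 'distinct' for a pair of records."""
    if deterministic_match(a, b):
        return "merge"
    score = fuzzy_score(a, b)
    if score >= AUTO_MERGE:
        return "merge"
    if score >= REVIEW:
        return "review"
    return "distinct"
```

Running the deterministic pass first means the fuzzy scorer only sees pairs that have no shared hard identifier, which keeps the review queue smaller.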

Data Sources

Source | Data Contributed | Volume
LinkedIn Sales Navigator | Professional profiles, connections, mutual connections | Primary source
CRM Records | Contact details, interaction history, deal associations | Imported via CSV/API
Secretary of State Filings | Business registrations, officer/director names, registered agents | State-specific scrapers
Crunchbase | Investment relationships, co-investor data, funding rounds | Via web scraping
Web Domains | Team pages, about pages, leadership bios | Via internal scraper

Introduction Path Discovery

The core deliverable is an introduction map: given a list of target contacts, the system finds the shortest and strongest paths through the relationship graph from the user's network to each target.
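As a minimal sketch of the "shortest path" half of this, the snippet below runs a breadth-first search over an in-memory adjacency dict. It assumes the graph has been exported from Neo4j into plain Python structures; the function name and the data layout are hypothetical, and a production system would more likely query the database directly.

```python
from collections import deque


def shortest_intro_path(graph, sources, target):
    """Breadth-first search from the user's network to a target.

    graph: dict mapping a person id to an iterable of ids they can introduce.
    sources: ids in the user's own network (hop 0).
    Returns the ids along one shortest path, or None if no path exists.
    """
    parents = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk parent links back to the starting node.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None
```

A `None` result corresponds to a target with no path found.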

Path strength is computed from edge weights that factor in:

  • Recency: When was the relationship last active?
  • Reciprocity: Is it a one-way or mutual connection?
  • Proximity: How many hops separate the source from the target?
  • Shared context: Do the connector and target share industry, location, or co-investment history?
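One way these factors could combine is sketched below. The constants and the multiplicative formula are illustrative assumptions, not the system's scoring. Path strength is taken as the product of edge weights, so every extra hop lowers the score, which is how proximity enters. The strongest path is found with Dijkstra's algorithm on `-log(weight)`.

```python
import heapq
import math


def edge_weight(days_since_active, mutual, shared_contexts):
    """Combine recency, reciprocity, and shared context into a weight in (0, 1].

    All constants are illustrative assumptions.
    """
    recency = math.exp(-days_since_active / 365.0)   # decays over about a year
    reciprocity = 1.0 if mutual else 0.5             # one-way links count half
    context = min(1.0, 0.6 + 0.2 * shared_contexts)  # capped bonus per shared attribute
    return recency * reciprocity * context


def strongest_path(graph, source, target):
    """Dijkstra on -log(weight), which maximizes the product of edge weights.

    graph: dict of node -> list of (neighbor, weight) with 0 < weight <= 1.
    Returns (strength, path), or (0.0, None) if the target is unreachable.
    """
    best = {source: 0.0}
    heap = [(0.0, source, [source])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return math.exp(-cost), path
        if cost > best.get(node, math.inf):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor, weight in graph.get(node, ()):
            new_cost = cost - math.log(weight)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor, path + [neighbor]))
    return 0.0, None
```

Because weights are at most 1, `-log(weight)` is never negative, which is the condition Dijkstra's algorithm needs to be correct.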

The output is a 7-sheet Excel workbook: target list with match status, ranked connectors per target, full path details, connector profiles, gap analysis (targets with no path found), data quality summary, and methodology notes.

Semantic Search

ChromaDB embeddings (sentence-transformers) enable natural language queries against the graph: “find healthcare investors in the Midwest who co-invested with firm X.” The embedding layer converts unstructured queries into graph traversal parameters, combining the flexibility of semantic search with the precision of structured graph queries.

Results

  • 83,000 person nodes resolved from overlapping data sources
  • 621,000 CAN_INTRO edges mapping the relationship network
  • Introduction maps generated for investor outreach campaigns
  • Average path length to target: 2.3 hops (within “warm intro” range)

Stack

Neo4j, Python, FastAPI, Next.js, ChromaDB, sentence-transformers, pandas, openpyxl