Your Knowledge Gaps Are the Most Intimate Data You Have

Ken Ruto

In February 2012, the New York Times Magazine published a piece about how Target's data science team had identified a teenage girl as pregnant before her own father knew. The model was based on purchasing behavior: unscented lotion bought in the second trimester, a larger bag of cotton balls, hand sanitizer and washcloths. The signature was indirect but specific. It predicted a condition the customer hadn't announced.

The story became canonical in discussions of data privacy, reproduced in business school cases and TED talks and startup pitch decks. The lesson usually drawn is about purchasing behavior as a proxy for life state — that what you buy reveals more than you intended to reveal.

But purchasing behavior is, in the hierarchy of intimate data, not the most sensitive category. It is what you do in the world. It is behavioral, observable, often shared with others who were present for the same transaction. There is a more intimate category that is rarely named as a category at all: what you know and what you don't.

I want to argue that epistemic data — data about the boundaries of your knowledge — is more intimate than location data, more sensitive than purchase history, and more revealing than almost anything else you generate online. And that the tool category being built to help people read and understand complex material sits at the exact center of this.

What Epistemic Data Is

Epistemic data is the record of your knowledge state. It includes:

What questions you asked when you didn't understand something
What terms you looked up and how often
Which parts of which documents you read versus skipped
What you knew before you asked and what you knew after
The domains where you are confident and the domains where you are lost
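As a rough illustration of how little structure such a record needs, the first three items in the list above could be captured as plain events, with the last two (what changed after you asked, and where you are confident or lost) inferred from them. This is a minimal sketch: the type name, field names, and the lookupCounts helper are hypothetical, not a schema used by any real product.

```typescript
// Hypothetical shape of one epistemic event. Every field name here is an
// illustrative assumption, not taken from any real tool.
type EpistemicEvent =
  | { kind: "question"; documentId: string; text: string; askedAt: string }
  | { kind: "lookup"; term: string; askedAt: string }
  | { kind: "reading"; documentId: string; section: string; read: boolean };

// Count how often each term was looked up. A term that keeps reappearing
// marks a gap that has not closed yet.
function lookupCounts(events: EpistemicEvent[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.kind !== "lookup") continue; // only lookups carry a term
    const term = e.term.trim().toLowerCase(); // normalize spelling variants
    counts.set(term, (counts.get(term) ?? 0) + 1);
  }
  return counts;
}
```

Even this trivial tally turns a pile of events into a statement about a person: the terms with the highest counts are the places where understanding has not yet settled.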

This is not the same as search history, though it overlaps. A search history includes everything you looked for — product research, directions, news, entertainment, health symptoms. Most of what you search for is instrumental. You need a plumber. You're looking for a restaurant. You want to know who won a match.

Epistemic data is the subset of information-seeking behavior that maps the edges of your understanding. It's what you searched for because you were confused, not because you needed a fact. It is the record of where your knowledge runs out.

The boundary between epistemic and non-epistemic search is not always clean. "What time does the pharmacy close?" is instrumental. "What is the difference between a statin and an ACE inhibitor?" could be either — you might be a medical student doing revision, or a patient who just received a new prescription and doesn't understand why. But the directional claim holds: a significant subset of search and lookup behavior reveals the edges of your understanding, and that subset is meaningfully different from the rest.

The Intimacy Spectrum of Data

It is useful to think about data intimacy as a spectrum, not a binary. Different data types reveal different aspects of a person, at different depths.

Data type	What it reveals	Who observes it	Intimacy level
Location data	Where you were, when	Carriers, apps, advertisers	High — behavioral
Purchase history	What you bought, how often	Retailers, card networks, data brokers	High — behavioral + proxy for life state
Health records	Medical conditions, prescriptions	Providers, insurers, sometimes employers	Very high — physiological
Search history (general)	What you looked for	Search engines, advertisers	High — intentional signals
Epistemic data (what you don't know)	The map of your knowledge and ignorance	Currently: nobody (private lookups) or the tool builder	Extremely high — cognitive + professional

What makes epistemic data categorically different from the others is that it maps your internal cognitive state, not your external behavior. Location data records where your body was. Purchase data records what your wallet did. Epistemic data records what your mind was doing — specifically, where it ran into limits.

This is not a trivial distinction. Your knowledge gaps are deeply professional in a way that your location or purchase history usually isn't. A cardiologist who keeps looking up drug interaction tables is revealing something about the limits of her clinical memory that she would not share with a recruiter. A policy analyst who searches for the definition of "crowding-out effect" in the middle of a budget analysis is revealing something about her economics training that she would not broadcast to her colleagues. A journalist who looks up the same historical detail three times across a six-month period is revealing that he doesn't fully understand the context he's reporting on — which is fine, that's how learning works, but it is not something he'd expose voluntarily.

Your knowledge gaps are the negative space of your competence. They tell anyone who can read them exactly where your expertise ends, how quickly you learn, what domains you're moving into, and what you're working on hard enough to be confused by.

The History of Library Privacy

The library profession has understood the sensitivity of reading behavior for over a century. In the United States, library circulation records have been protected by statute in most states since the 1970s. The American Library Association's Code of Ethics explicitly treats "what library users read, reference questions, [and] circulations records" as confidential.

The reason isn't simply respect for privacy as a value. It is recognition that what you are reading is a signal of what you are thinking — and that the freedom to think requires the freedom to read without being watched. Librarians developed this ethic partly in response to McCarthyism, when the FBI pressured libraries to report on readers who checked out "subversive" material. The profession resisted, and the resistance was codified.

Confidentiality exists when a library is in possession of personally identifiable information about users and keeps that information private on their behalf.

— American Library Association, Privacy: An Interpretation of the Library Bill of Rights

The interesting thing about this history is that libraries protected reading records before the data existed in a form that was easy to analyze. Circulation records were paper. The analytical threat was hypothetical. Librarians made the ethical commitment in advance.

The digital reading environment has done the opposite: collected enormously rich behavioral data about reading, at massive scale, across millions of users, before anyone had a serious conversation about whether that was appropriate. The Kindle knows which passages you highlighted. E-reader platforms can record your reading speed, your finish rate, which parts you re-read. A publishing platform like Medium can see which paragraphs you spent time on. A browser extension with broad permissions can see every tab you open. The data collection preceded the ethics by a decade, and the ethics are still catching up.

Epistemic data — data about what you understood and what you didn't — is the next frontier of this. It doesn't exist yet as a collected category, because the tools that would collect it don't exist yet at scale. But they are being built. And the question of who owns that data, who can access it, and what they can do with it is exactly as live as any other privacy question — more live, in my view, because epistemic data is more intimate than most of what people currently debate.

The BYOK Architecture as Ethical Position

Bring Your Own Key is usually described as a product feature. You connect the tool to your own API account, you pay for your own inference, the tool doesn't bear the cost of model usage. This is accurate but it undersells what BYOK actually is.

BYOK is an architectural commitment: your queries go directly from your browser to the model provider, without transiting any server that the tool builder controls. The tool builder never sees the content of your questions. They never see which terms confused you. They never see which documents you were reading when you asked. The query is yours, the response is yours, and the entire exchange is invisible to the intermediary.
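The direct path described above can be sketched in a few lines. This is an illustration under stated assumptions, not any particular product's implementation: the provider URL, the tool-builder host, the request body shape, and the function names are all hypothetical placeholders, and real providers each define their own endpoints and payloads.

```typescript
// Hypothetical placeholders: a model provider endpoint and the host of the
// tool builder's own backend. Neither refers to a real service.
const PROVIDER_URL = "https://api.model-provider.example/v1/chat";
const TOOL_BUILDER_HOST = "backend.reading-tool.example";

interface OutboundRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the request the reader's browser would send. The user's key and
// question exist only in this object, which is addressed to the provider.
function buildByokRequest(
  userApiKey: string,
  question: string,
  excerpt: string
): OutboundRequest {
  return {
    url: PROVIDER_URL,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${userApiKey}`,
    },
    // Assumed body shape, for illustration only.
    body: JSON.stringify({ question, excerpt }),
  };
}

// Extract the host from an http(s) URL using plain string handling.
function hostOf(url: string): string {
  const withoutScheme = url.replace(/^https?:\/\//, "");
  const [host = ""] = withoutScheme.split("/");
  return host;
}

// True when the request never touches the tool builder's servers.
function bypassesToolBuilder(req: OutboundRequest): boolean {
  return hostOf(req.url) !== TOOL_BUILDER_HOST;
}
```

In a browser, the resulting object would be handed to fetch(req.url, { method: req.method, headers: req.headers, body: req.body }). The design point is that the tool builder's host appears nowhere in the outbound path, so there is nothing on the tool builder's side to log.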

This is not what most "AI reading tools" do. The typical architecture routes your query through the product's backend, where it is logged, stored, and used at minimum for debugging and billing and at maximum for model training, product analytics, and data partnership arrangements that may be buried in a terms of service document you didn't read.

The distinction between "your API key, direct to provider" and "our backend handles the model call" is invisible to most users. Both look like a box where you type a question and text appears. The difference is whether a third party sits between your question and the answer, and what they can see and retain. This is exactly the kind of architectural choice that is invisible until it matters — and for epistemic data, it matters more than for most.

For a researcher asking about the methodology section of a paper on Kenyan water governance, BYOK means that nobody other than the researcher and the model provider knows what the question was, why it was asked, or what documents prompted it. For a journalist asking about the legal structure of a company they're investigating, BYOK means a subpoena served on the tool builder turns up nothing, because the tool builder holds no record of what the journalist was curious about. For an analyst at a financial institution asking about a term in a regulatory document, BYOK means the tool builder has no log of what the analyst found confusing about compliance documentation.

These are not edge cases. They are the use cases that matter most.

What the Data Would Reveal, At Scale

Let me push this further, because the individual-level sensitivity is the beginning of the argument, not the end of it.

At scale, a platform with access to epistemic data from a large population of knowledge workers would have something genuinely unprecedented: a map of what educated, informed people don't know. Not a survey. Not self-report. A behavioral log of where comprehension breaks down, across millions of document-reader pairings.

This data would be extraordinarily valuable in ways that have nothing to do with serving the individual user. It would tell an education technology company which concepts have the highest confusion rates across which populations. It would tell a media company which assumptions in their articles readers can't follow. It would tell a pharmaceutical company which clinical terms physicians look up most frequently — which is a competitive intelligence signal about prescribing patterns and knowledge gaps. It would tell a government which parts of proposed legislation are least understood by the policy analysts reviewing it, which is a signal about where regulatory capture is most likely.

None of this requires identifying individual users. Aggregate epistemic data is commercially useful without being individually attributable, which is precisely why the argument "we anonymize everything" is insufficient as a privacy protection for this category of data. Anonymization strips the names; the value lies in the pattern, and the pattern survives.
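To make the point about aggregates concrete, here is a minimal sketch of a log that carries no user identifier at all and still yields a ranking of what a population finds confusing. The AnonymousLookup shape, the population labels, and the topConfusingTerms function are hypothetical and exist only for illustration.

```typescript
// Hypothetical anonymized log entry: no user id, no timestamp, only a
// coarse population label and the term that was looked up.
interface AnonymousLookup {
  population: string;
  term: string;
}

// Rank the most frequently looked-up terms for one population.
// Ties are broken alphabetically so the result is deterministic.
function topConfusingTerms(
  log: AnonymousLookup[],
  population: string,
  n: number
): string[] {
  const counts = new Map<string, number>();
  for (const entry of log) {
    if (entry.population !== population) continue;
    counts.set(entry.term, (counts.get(entry.term) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n)
    .map(([term]) => term);
}
```

Nothing in the input identifies a person, yet the output is exactly the commercially useful signal: a ranked list of where a given group's comprehension breaks down.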

The only architecturally sound position is to not collect the data in the first place. To not have the server in the middle. To route the query directly to the model provider and let the provider's retention policy govern what is retained. Several major providers state that they do not train on API traffic by default, but retention terms vary by provider and account tier, and they are worth reading before you rely on them.

The Library Precedent, Applied Forward

The library profession made a commitment to reading privacy before the analytical threat was serious, before the data was digital, and before anyone was trying to monetize reading behavior. The commitment held because it was codified in professional ethics and, eventually, in law.

The comprehension tool category is at the same moment. The tools are just becoming technically viable. The data collection norms haven't been established. The regulatory frameworks don't exist. The first generation of products will set the defaults that subsequent products inherit — which is always how defaults work.

The default that matters most is whether epistemic data is treated as belonging to the reader or to the platform. A tool that asks for your API key and routes your queries directly to the provider is asserting that the data belongs to you. A tool that routes through its own backend is asserting the opposite, whatever its privacy policy says.

Ownership is the foundational question. Everything else — licensing, access, use — follows from it.

— Kenneth Crews, Copyright and Your Job: The Truth About Works Made for Hire

The parallel applies precisely. Ownership of epistemic data is the foundational question. If the data belongs to the reader, the architecture must make that structurally true, not just claim it in a terms of service. BYOK is how you make it structurally true.

This is not idealism. It is product design. The users who care most about comprehension tools — researchers, senior analysts, investigative journalists, policy professionals — are exactly the users who have thought about data sovereignty and who will ask the right questions before they adopt a tool that sits in the middle of their reading. BYOK is not a barrier to adoption for this population. It is a signal that the tool was built by people who understood the stakes.

The web never built the comprehension layer. The comprehension tools that are now being built have a choice to make about data ownership that the web never had to make, because the web collected behavioral data incidentally rather than as a core product function.

A tool designed for comprehension sits closer to a library than to a browser. It should be designed with library ethics, not browser economics.