Visualizing the Influence of Federal Funding on the AI Boom

Apr 18, 2025

The Transformer was invented at Google. Reinforcement Learning from Human Feedback (RLHF) was also invented in industry labs and is most commonly associated with OpenAI. There is a growing perception that university research was cut out of the AI boom and no longer has a role in the development of AI.

Agencies like the US National Science Foundation and DARPA have plenty of general stats about how their funding has spurred innovation and contributed to the community. NSF will gladly tell the story of how Google spun out of NSF-funded research. Can we draw a tighter connection between federal funding and the state-of-the-art AI deployed by industry?

One way to track the impact of federal funding on industry AI labs is to look at industry papers. Industry labs are starting to curtail their publications, but for a time they were keen to openly publish according to scientific standards of rigor. Scientific papers reference the sources of ideas and claims that have been previously published. While not perfect, these references give us a sense of what researchers were looking at while conducting their own research. We can look at the papers referenced by industry AI researchers and then see if any of those papers resulted from federal funding. Researchers are highly encouraged, if not outright required, to acknowledge their sources of support.

I took seven industry AI papers that I feel trace the evolution of what we currently understand to be the Large Language Model and RLHF, including “Attention is All You Need”. I gathered the papers that are referenced by these seven papers. I then used a very basic technique to search for acknowledgements of federal funding. Specifically, I looked for “NSF”, “National Science Foundation”, “DARPA”, etc. in those texts. I present more of the methodology for selecting and parsing the papers below.

The following visualization shows a bubble for each paper analyzed. The seven black bubbles are the seven influential industry AI papers. The other bubbles represent papers referenced by these seven papers. The lines trace which papers are cited by which. Yellow bubbles are referenced papers that acknowledge US federal funding.
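As a rough illustration of the figure's structure (not the code that produced it), here is a minimal sketch of how a similar colored citation graph could be drawn with networkx and matplotlib; the key_papers, references, and color_for inputs are placeholders rather than the actual data structures in my notebook.

```python
# Sketch only: draws a bubble-and-line citation graph like the one described above.
# `key_papers` is a list of the seven key paper IDs, `references` maps each key
# paper ID to the IDs of the papers it cites, and `color_for(ref)` returns
# "yellow", "blue", "cyan", or "gray" for a referenced paper. All three are
# assumed inputs.
import networkx as nx
import matplotlib.pyplot as plt

def draw_citation_graph(key_papers, references, color_for):
    G = nx.Graph()
    for paper in key_papers:
        G.add_node(paper, color="black")                   # key industry papers
        for ref in references[paper]:
            if ref not in G:
                G.add_node(ref, color=color_for(ref))      # referenced papers
            G.add_edge(paper, ref)                         # line: paper cites ref

    pos = nx.spring_layout(G, seed=42)                     # bubble layout
    node_colors = [G.nodes[n]["color"] for n in G.nodes]
    nx.draw(G, pos, node_color=node_colors, node_size=60, edge_color="lightgray")
    plt.show()
```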

We can quickly inspect the visualization and see a lot of yellow dots. The researchers behind each of the “key” papers were citing federally funded research.

U.S. Universities, Federal Funding, and State Funding

In the next visualization, blue dots show papers by US academic researchers that do not explicitly acknowledge federal funding.

Why is this important to note? In the US, most computer science researchers at top universities have received federal funding at some point from NSF, DARPA, or other federal funding agencies. This means that their careers were sustained at some point by federal funding, even if the paper being cited did not (as far as we can tell from parsing the text) directly result from federal funding.

More than that, most PhD students at top research universities in the US have been supported by federal funding at some point. At most top research universities, PhD students are paid employees. They receive a (low) salary (called a stipend) and tuition coverage. You get paid to get a PhD. The money to cover PhD student stipends and tuition very often comes from external grants awarded to professors.

(Imagine this: you are hired to be a professor. You are told your university will only pay for 9 months of your salary. If you want 12 months of salary like a normal person you must raise 3 months of your own salary through grants. You are also told that you will be fired in 6 years if you are not super-productive. So you want research assistants. The university tells you that you also have to raise money to hire your research assistants. Good luck! Why would any sane person take that deal? The answer is tenure.)

These PhD students — graduate research assistants — get trained and become research scientists in industry AI labs. They might not have been hired without their advisor having some form of external funding. Their advisor might not even have had a career, and thus been in a position to hire PhD students, without federal funding.

Going deeper, public universities in the US derive a significant portion of their funding from state taxes. Thus research conducted by faculty at public universities is partially subsidized by state governments.

It is quite reasonable to say that US academic computer science research would not exist without some form of state or federal government support.

The Rest of the World

What about the rest of the world? Here is a visualization where cyan dots are all papers that do not have industry authors (minus those already colored yellow or blue).

Most countries in the world support university research with government funding. In Europe and Canada, there is greater involvement of government funding agencies in supporting research (for example, in Canada faculty receive 12 months of salary from universities, and the government is more involved in making the economics of that work).

Taking the yellow, blue, and cyan dots together, we can see that much of the work directly relevant to the Transformer and RLHF was either known to be federally funded or highly likely to have been directly or indirectly supported by some federal or state government somewhere.

The remaining gray dots are papers with only industry researchers. As noted above, many of the authors on these papers were beneficiaries of funding from some government or another. Whereas entrepreneurs can be self-made (although the myth of the self-made man is highly overstated even then), PhDs don’t make themselves.

Some Simple Statistics

We can count the yellow, blue, and cyan dots as a proportion of the total number of dots (a minimal counting sketch follows the list below).

  • ~18% of papers referenced by these seven industry papers acknowledge US federal funding. As discussed in the methodology below, this is likely an under-count.
  • ~24% of papers referenced have US university authors.
  • ~20% of papers referenced are industry lab-authored.
  • ~42% of papers referenced do not have any industry authors.
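The percentages above are just category counts over the total number of referenced papers. A minimal sketch, assuming each referenced paper has already been assigned one of the color labels used in the visualizations:

```python
# Sketch only: compute each color category's share of all referenced papers.
# `labels` is an assumed list like ["yellow", "gray", "cyan", ...], one entry
# per referenced paper.
from collections import Counter

def category_shares(labels):
    counts = Counter(labels)
    total = len(labels)
    return {category: round(100 * count / total, 1) for category, count in counts.items()}
```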

Methodology

In this section, I describe my methodology for selecting the key papers, how my data processing pipeline works, and its limitations.

The key papers are:

  1. Attention is All You Need (2017), by Vaswani et al. (Google). This paper introduces the Transformer architecture, upon which all LLMs are now built.
  2. Deep Reinforcement Learning from Human Preferences (2017), by Christiano et al. (OpenAI and DeepMind). OpenAI and DeepMind originally embarked on their quests for AGI by pursuing reinforcement learning in games and virtual embodied agents. This paper introduces an early version of Reinforcement Learning from Human Feedback to teach virtual embodied agents to perform skills like backflips. It lays out the general recipe for RLHF that appears later with language models.
  3. Fine-Tuning Language Models from Human Preferences (2019), by Ziegler et al. (OpenAI). This paper is an early attempt at learning a reward model of stylistic text preferences and fine-tuning a language model to generate text that matches those preferences. The RLHF recipe is starting to come together.
  4. Training Language Models to Follow Instructions with Human Feedback (2022), by Ouyang et al. (OpenAI). This paper shows the RLHF pipeline as used on GPT-3 around the time that ChatGPT was released.
  5. Scalable Agent Alignment via Reward Modeling: A Research Direction (2018), by Leike et al. (DeepMind). A wide-ranging paper arguing for value alignment via reward modeling of the kind ultimately used in papers 3 and 4 above. DeepMind and OpenAI were not the only ones thinking and writing about value alignment, but this captures a snapshot of what was going on in a then-growing community of value alignment researchers.
  6. Language Models are Unsupervised Multitask Learners (2019), by Radford et al. (OpenAI). This is the GPT-2 paper.
  7. Improving Language Understanding by Generative Pre-Training (2018), by Radford et al. (OpenAI). This is the GPT-1 paper, often overlooked.

(I don’t include the GPT-3 paper, which basically shows that once models reach a certain scale, in-context prompting as we know it starts to work. The Ouyang paper (#4 above) is about how RLHF made GPT-3 practical via instruction tuning and alignment, and I consider that the more important paper.)

I used the Semantic Scholar API to look up the references for the seven key papers. I then used the Semantic Scholar API to get the URLs for the PDFs of each referenced paper. I downloaded the PDFs for every referenced paper that was on ArXiv or in the ACL Anthology. The ACL Anthology is the online repository for papers published in the major NLP conferences organized by the Association for Computational Linguistics. If the PDF was not hosted by ArXiv or the ACL Anthology, I did not include it in the final results.
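For illustration, here is a minimal sketch of that reference-gathering step against the Semantic Scholar Graph API; it is a simplified stand-in for the notebook, and the field selection and filtering shown are assumptions rather than the exact code I ran.

```python
# Sketch only: fetch the references of a paper from the Semantic Scholar Graph
# API and keep the PDF URLs of references hosted on ArXiv or the ACL Anthology.
import requests

API = "https://api.semanticscholar.org/graph/v1/paper"

def get_references(paper_id):
    """Return the papers referenced by `paper_id`, with external IDs and PDF info."""
    fields = "title,externalIds,openAccessPdf"
    resp = requests.get(f"{API}/{paper_id}/references", params={"fields": fields, "limit": 1000})
    resp.raise_for_status()
    return [item["citedPaper"] for item in resp.json()["data"]]

def pdf_url_if_arxiv_or_acl(paper):
    """Return the open-access PDF URL if the paper is on ArXiv or in the ACL Anthology."""
    external_ids = paper.get("externalIds") or {}
    pdf = paper.get("openAccessPdf") or {}
    if "ArXiv" in external_ids or "ACL" in external_ids:
        return pdf.get("url")
    return None
```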

To determine funding acknowledgement, I converted the PDFs to text and searched for “NSF”, “National Science Foundation”, “DARPA”, “Defense Advanced Research Project Agency”, “IARPA”, “Intelligence Advanced Research Project Agency”, “ONR”, and “Naval Research”. The rationale is that most papers would not use these terms unless explicitly talking about government funding sponsorship. If US federal funding was acknowledged in a different way, or if federal funding agencies not in the above list were acknowledged, those papers were missed. Furthermore, while federal agencies ask that papers acknowledge the funding source and scientific authorship norms encourage acknowledgement of funding support, not all papers acknowledge funding. Thus, the federally funded papers (yellow dots) are likely an under-count.
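As a rough sketch of that check, using pypdf for the PDF-to-text step as one option (my notebook may use a different converter; the keyword list mirrors the one above):

```python
# Sketch only: flag a paper as acknowledging US federal funding if any of the
# keywords appears anywhere in the extracted text of its PDF.
from pypdf import PdfReader

FUNDING_TERMS = [
    "NSF", "National Science Foundation",
    "DARPA", "Defense Advanced Research Project Agency",
    "IARPA", "Intelligence Advanced Research Project Agency",
    "ONR", "Naval Research",
]

def acknowledges_us_federal_funding(pdf_path):
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return any(term in text for term in FUNDING_TERMS)
```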

To determine US academic authorship and corporate authorship, I parsed the text documents for author email addresses ending in “.edu” or “.com”. Frustratingly, industry research papers are increasingly not putting author email contact information in the papers. But that is okay, because those papers will be lumped into the gray dot category, where they should belong. However, this means it is possible that (a) I may be under-counting US academic authorship, and (b) I may be over-counting industry authorship.
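A simplified sketch of that email heuristic follows; the regex and the fallback to the gray category are illustrative assumptions, not the exact parsing in the notebook.

```python
# Sketch only: classify a paper from the top-level domains of author emails
# found in its text. Papers with no detectable emails fall through to "gray",
# matching the behavior described above.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.(edu|com)\b", re.IGNORECASE)

def classify_authorship(paper_text):
    domains = {m.group(1).lower() for m in EMAIL_RE.finditer(paper_text)}
    if "edu" in domains:
        return "us_academic"   # at least one US university author email found
    if "com" in domains:
        return "industry"      # only corporate author emails found
    return "gray"              # no emails found; lumped with industry-only papers
```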

My code is in a Jupyter notebook hosted on GitHub.

Conclusions

In the crucial time period between 2017 and 2023, when our current understanding of LLMs and RLHF was falling into place, industry researchers were openly publishing. Following academic publication standards, those industry papers referenced related work and the sources of concepts that were integral to their work. By looking at the funding acknowledgements of referenced papers and the affiliations of referenced authors, we get a picture of how the building blocks of modern state-of-the-art AI techniques were supported by government funding.

The AI boom did not happen in an industry vacuum. As with all research, it was an accumulation of knowledge, much of which was generated in university settings. There is a growing narrative that academia isn’t important to AI anymore and that US federal funding has no role in the AI boom. It’s more correct to say that the AI boom could not have happened without US federal funding.

As is often the case: when government is functioning properly it is invisible. That makes it easy to see successful companies, whether Tesla, Google, or others, as pulling themselves up by their bootstraps. In reality, the federal government does a lot to support the economic well-being of its citizens and companies, as a healthy, functioning government should. It is difficult to see the direct ties between government activity and the amazing scientific and technical progress underway, and I hope this provides some insight into a small slice of how the AI boom came to be.

Written by Mark Riedl

AI for storytelling, games, explainability, safety, ethics. Professor @GeorgiaTech. Associate Director @MLatGT. Time travel expert. Geek. Dad. he/him
