3 Approaches to Natural Language Generation & When To Apply Them

I’ve worked in AI for fifteen years, with most of my energy focused on getting computers to generate written language. It’s been a lot of fun the whole time, but the last few years have been especially exciting as interest and progress in the AI + human language space (officially known as Natural Language Processing, or NLP) have increased dramatically. There’s a widespread sense that the 2020s will see massive improvements in machines’ ability to understand and interact with human language.

Unfortunately, hype and confusion around NLP have increased right alongside interest. The applications and approaches of AI and language are seemingly endless, and the whole field can often feel like an impenetrable mess of acronyms and applications.

That’s why I decided to write this blog to lay out the different approaches and applications straightforwardly so that you can understand the tradeoffs and make appropriate decisions for your projects.

Approaches for implementing NLG

Natural language generation (NLG) is a sub-area within natural language processing (NLP). While NLP is a broad category that includes areas like natural language understanding, NLG is all about computers actually generating language.

At the highest level, there are three different approaches to implementing NLG:

Templates – Templates are familiar to any software engineer or to anyone who has spent time writing marketing emails (“Hello, $firstName!”) or using mail merge in MS Office.

Structured – With a structured approach, the AI follows a writing process that is (very) roughly similar to what a person does. First, the AI decides what to say, then how to outline and structure the story, and finally chooses appropriate words and language. Structured approaches typically rely on knowledge of the domain. For instance, a structured approach to generating weather forecasts would have a lot of knowledge about weather.

Deep learning – This is the approach used by models like GPT-3 that have been generating a lot of the hype recently. These models, most of which are built on the “Transformer” architecture, are also referred to as large language models (LLMs) or “foundation” models. They are all trained on huge amounts of text, and then work by predicting the next word. If you’ve ever played the “One Word Story” game where a group of people makes up a silly story one word at a time, you’ve done something similar.

None of these approaches are inherently better than the others, but they definitely have tradeoffs that make them more or less suitable depending on the application.
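To make the structured approach a bit more concrete, here is a minimal sketch of the three stages described above (decide what to say, structure the story, choose the words) for a weather forecast. Everything in it is an assumption made for illustration: the `Forecast` fields, the 30% rain threshold, and the phrasing are hypothetical, and real structured systems encode far more domain knowledge.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """Hypothetical input data for one day's forecast."""
    high_f: int
    low_f: int
    rain_chance: float  # probability between 0.0 and 1.0


def decide_what_to_say(forecast):
    """Stage 1: choose the facts worth reporting, using domain knowledge."""
    facts = [("high", forecast.high_f), ("low", forecast.low_f)]
    # Domain rule (an assumption for this sketch): only mention rain above 30%.
    if forecast.rain_chance > 0.3:
        facts.append(("rain", round(forecast.rain_chance * 100)))
    return facts


def structure_story(facts):
    """Stage 2: order the facts so that rain, if present, leads the story."""
    # sorted() is stable, so the remaining facts keep their original order.
    return sorted(facts, key=lambda fact: fact[0] != "rain")


def choose_words(plan):
    """Stage 3: express each planned fact in language."""
    phrasing = {
        "rain": "There is a {}% chance of rain.",
        "high": "Expect a high of {}°F.",
        "low": "The low will be {}°F.",
    }
    return " ".join(phrasing[kind].format(value) for kind, value in plan)


def generate(forecast):
    """Run the three stages in order."""
    return choose_words(structure_story(decide_what_to_say(forecast)))


print(generate(Forecast(high_f=68, low_f=51, rain_chance=0.6)))
```

Every number in the output is read directly from the input data, which is why this style of system cannot state a temperature that is not in the forecast.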

Now let’s explore some common applications of NLG and outline how each approach handles the different applications.

If you’re in a hurry, here’s a cheat sheet that summarizes:

Snippet generation

While it’s not super exciting, probably the most common application of NLG today is “snippet generation”. In this application of NLG, the designer of the system knows exactly what text they want the system to produce — the computer just needs to “fill in the blanks.” There may be some basic business logic or conditions that control what language is generated, but the text that’s generated is generally tightly constrained.

Common examples include:

  • Basic UI elements (e.g. “You have 19 unread messages”)
  • Responses from Siri, Alexa, etc. (“Timer set for 15 minutes”)
  • Marketing emails (“Dear Nate, we missed you…”)

For these kinds of applications, the templated approach works best. That’s because templates are conceptually simple, easy to implement, and give consistent output. It can be a hassle to maintain and manage them, but you’re generally best served by templates if you want really consistent output.
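As a minimal sketch of what “fill in the blanks” plus a little business logic can look like, the snippet below picks one of three templates based on a count and then substitutes the values. It uses Python’s standard `string.Template`; the function name `unread_message_snippet` and the exact wording are hypothetical and not taken from any particular product.

```python
from string import Template

# Hypothetical templates for a UI snippet; the wording is illustrative.
NO_MESSAGES = Template("You have no unread messages, $first_name.")
ONE_MESSAGE = Template("You have 1 unread message, $first_name.")
MANY_MESSAGES = Template("You have $count unread messages, $first_name.")


def unread_message_snippet(first_name: str, count: int) -> str:
    """Pick a template with simple business logic, then fill in the blanks."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count == 0:
        template = NO_MESSAGES
    elif count == 1:
        template = ONE_MESSAGE
    else:
        template = MANY_MESSAGES
    # substitute() raises KeyError if a placeholder has no value;
    # keyword arguments the template does not use are ignored.
    return template.substitute(first_name=first_name, count=count)


print(unread_message_snippet("Nate", 19))
```

The designer controls every word of the output here, which is exactly the consistency that makes templates a good fit for snippets.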

A structured or deep learning approach would be overkill for this kind of application. There’s no reason to make things harder for yourself by causing the machine to “think” about what it will say when you know ahead of time exactly what it should say.


Autocomplete and suggestions

Another application of NLG that’s becoming more common is autocomplete/suggestions. Here, the AI is trying to make things more convenient for the user by predicting what they will type.

Often, these suggestions are very short. Google suggests a few words at a time in Gmail and Drive (where it’s branded “Smart Compose”), and iOS and Android each try to predict the next word you’ll type and make it easy to tap.

We’re also starting to see instances of “autocomplete” that are much bigger and feel more like actual generation. GitHub Copilot is one example, where the human user defines a function and the AI attempts to write the actual implementation. (Because Copilot produces code, it doesn’t technically count as “natural language generation” but the idea is the same.) Authors are also beginning to experiment with advanced “autocomplete” to help flesh out scenes and characters in their books.

In general, deep-learning approaches work best for building autocomplete functionality because they internally model all of natural language generation as an “autocomplete” problem.

Deep-learning approaches all start with a prompt—some natural language input, often written by a human—and then work by repeatedly predicting the next word. The predictions for the next word are based on all the other pieces of text the model has been trained on. For instance, you may give a deep-learning model like GPT-3 a prompt of “Once upon a time”. Based on the enormous amounts of text the model has been trained on, it begins “filling in” the rest of the story, one word at a time: “Once upon a time, there was a brave princess…”
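The “repeatedly predict the next word” loop can be illustrated with a toy sketch. To be clear about the assumptions: this is not how GPT-3 works internally (that model is a very large neural network). The sketch swaps in a simple table that counts which word followed each pair of words, and the three-sentence corpus is made up. Only the generation loop mirrors the description above.

```python
from collections import Counter, defaultdict

# A made-up, tiny "training corpus"; real models train on vastly more text.
CORPUS = (
    "once upon a time there was a brave princess . "
    "once upon a time there was a slimy toad . "
    "long ago there was a brave princess ."
)


def train(text):
    """Count which word follows each pair of consecutive words."""
    follows = defaultdict(Counter)
    words = text.split()
    for first, second, nxt in zip(words, words[1:], words[2:]):
        follows[(first, second)][nxt] += 1
    return follows


def complete(prompt, follows, max_words=10):
    """Extend the prompt one word at a time, always taking the likeliest word."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(tuple(words[-2:]))
        if not candidates:
            break  # nothing in the corpus ever followed this context
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
        if next_word == ".":
            break  # stop at the end of the sentence
    return " ".join(words)


model = train(CORPUS)
print(complete("Once upon a time", model))
```

This version always picks the most frequent continuation, so it produces the princess every time. Replacing that greedy choice with weighted random sampling would sometimes produce the toad instead.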

There’s natural alignment between the approach and application when you’re using deep-learning models to implement autocomplete functionality. Autocomplete is also a great use-case for deep-learning models because it sidesteps one of the biggest problems with deep-learning models: their unpredictability.

Humans inherently have limited control over the output of these systems, and the output can vary dramatically. The system may continue the story once as “Once upon a time, there was a brave princess…” and the next time as “Once upon a time, there was a slimy toad…”. Worse, because these systems are all trained on huge amounts of text from the internet, the output can include offensive or objectionable content (“Once upon a time, there were evil doctors who put microchips in vaccines…”)

Because of this, deep-learning approaches are typically unsuitable for generating customer-facing language. But when used for autocompleting, there is often little issue with being wrong. After all, the user can simply ignore the suggestion.

In summary, building autocomplete with deep-learning models is a great fit that takes advantage of how deep-learning models work while minimizing the risks they bring.

Template and structured approaches are much less useful for building autocomplete, because they each start with some structured data or information that the system then expresses in language. For the autocomplete application, though, there is no structured data — there is just e.g. the first few words of the user’s email. While it’s possible to do some basic “autocompleting” with templates (e.g. changing “ty” to “thank you”), the overall value that they can provide to autocompletion is low.
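For a sense of how limited template-style “autocompleting” is, here is a minimal sketch of the shorthand expansion mentioned above. The lookup table is hypothetical (only “ty” comes from the text), and the function simply swaps known whole words for fixed expansions.

```python
# Hypothetical shorthand table; only "ty" comes from the text above.
EXPANSIONS = {"ty": "thank you", "omw": "on my way"}


def expand_shorthand(text: str) -> str:
    """Replace whole-word shorthand with its fixed expansion."""
    # Words are split on whitespace, so "ty," with punctuation is left alone.
    return " ".join(EXPANSIONS.get(word.lower(), word) for word in text.split())


print(expand_shorthand("ty for the update"))
```

The function can only return what someone already typed into the table, so it has nothing to offer for text it has not seen before.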


Conversational interactions

Conversational interactions are another common application of NLG, and the best approach depends a lot on the type of conversation.

Templates work well for things like customer service bots. These bots often help with low-value interactions: checking shipping status, finding help, etc. Because the topics are constrained and they are customer-facing, the tight control provided by templated approaches works well.

Structured approaches work well for conversations with virtual domain experts (e.g. a financial planner). Because the conversation tends to stay on one particular topic (e.g. retirement planning), it’s possible to build out the underlying domain knowledge necessary for structured approaches to be effective. Structured approaches also deliver on the trust and accuracy necessary to make these kinds of interactions useful.

Finally, deep-learning approaches are best for true “chat bots”. Because deep-learning approaches can write about any domain, they are better able to scale across the different areas a chat may cover. And because it’s an informal conversation, it’s not a big problem if the NLG system produces text that’s not true.


Narrative generation

The application of NLG that I’m personally most excited by is narrative generation. “Narrative” does not have a super tight definition, but it means what you probably think it means. To be a little more formal, I like to use the Outline Test: is there already, or could one create, a reasonable, tree-shaped outline for the text? If so, it’s a narrative. Any text that passes the Outline Test must have an overarching flow and structure, held together by consistent characters and ideas.

Narratives can, of course, be fiction or nonfiction, and there are different NLG approaches that are more suitable for each.

Fictional narratives

AI generating fiction has been an area of research for almost as long as NLG has been around, and it’s still far from a solved problem. While AI-generated fiction has historically been confined to university labs, we’re starting to see instances of it popping up in the real world.

One example that got a lot of buzz a few years ago is the short film Sunspring. The movie was acted, shot, scored, and edited by people, but the script was written by an AI. The film is fun to watch, but it’s clearly carried by a bunch of talented folks papering over a basically incoherent script.

Sunspring | 2016

Another application of fictional NLG that has a lot of folks worried is generating “fake news”. The concern here is that, although any particular fictional news story could be fact-checked and dismissed, NLG systems could create a flood of fictional-but-true-seeming news stories that further muddy the waters of the reality we consume online.

Deep-learning approaches are typically used to generate these kinds of fictional content because they can write about arbitrary subjects and can capture particular tones or styles with some fidelity. GPT-3 can, for instance, generate text about a haunted house in the style of Edgar Allan Poe or about space aliens in the style of Isaac Asimov.

Unfortunately, the limitations of the deep-learning approaches are apparent in the output: the generated fiction just doesn’t make a lot of sense. While any particular paragraph or excerpt is generally coherent, the output as a whole doesn’t pass the Outline Test. Because deep-learning approaches generate the story a word at a time without any planning or thinking ahead, any overarching structure or plot is a happy accident. At a glance, the output may look terrific, but the more you read, the more you notice things that are “off”.

Ultimately, we don’t know a better approach yet for generating true fictional narratives. Templates don’t scale well for longer-form content. And structured approaches generally work by using knowledge of the underlying domain. For instance, encoded knowledge of consumer packaged goods could enable a structured approach to generating product descriptions.

The “underlying domain” for fictional narratives, however, is often human experiences and relationships. Encoded knowledge of human experience and relationships has long been a goal of AI, but we haven’t made much progress and we’re not going to crack that nut anytime soon.

So for now at least, we have no good approaches for generating fictional narratives. It remains an exciting area of research though, and Mark Riedl has a great write-up of the current state of the art.

Non-fiction narratives

We are also beginning to see applications of NLG generating non-fiction narratives. One early example is the STOP system. STOP was an anti-smoking program that used a structured approach to generate personalized narratives that encouraged individuals to stop smoking.

Another primary example of nonfiction narrative generation is our own product, Lexio. Lexio uses a structured approach to generate non-fiction narratives (“data stories”) that explain what’s happening in a user’s business data.

Augmented Data Storytelling in Lexio

There’s a completely different set of tradeoffs between the approaches for generating fiction and non-fiction narratives.

First, deep-learning approaches are deeply unsuitable for generating non-fiction, because they make no effort to generate true things. They try to mimic existing text from the internet, not accurately convey some data or information. For instance, a deep-learning NLG system is just as happy to say “The high today is 42°”, “The high today is 68°”, or “The high today is 98°”, because they all seem equally plausible. There’s no way for these systems to wire up a weather forecast API to get the actual predicted temperature. Deep-learning approaches may generate output that looks like non-fiction, but there’s no reason to believe or trust that it’s actually true.

In contrast, structured approaches work really well for generating non-fiction. Because they have a more constrained, formalized approach, you can guarantee that they won’t “go off the rails” or inadvertently say untrue things.

Non-fiction narratives also often have a more specific domain for which you can develop the necessary knowledge to power a structured approach. STOP had domain knowledge of smoking cessation, and Lexio has a lot of knowledge about BI and how people interact with it, as well as common business objects that show up again and again (profit, conversion rate, regions, product lines, etc.)

Taken together, Lexio’s knowledge of BI and its structured approach to natural language generation allow it to write deep, accurate data stories that make it easy for readers to understand and act on what’s happening in the data. We’ll continue to push the envelope of possibilities with structured approaches, and we’re eager to see how others advance in their own applications of NLG.

This is a great time to be working in AI generally, and NLG in particular. We’re in the sweet spot where big advances are being made, but we’re nowhere close to solving NLG in the general case.

In the meantime, there are a lot of different ways of doing NLG, a lot of reasons why you might want to apply NLG, and a lot of balances that must be struck between the two. My hope is that this post has shed some light on the various tradeoffs.

Feel free to follow or contact me on LinkedIn if you want to learn more.

At Narrative Science, we’re continuing to make big bets on the structured approach because our stories have to accurately reflect what’s in the data. If pushing the NLG envelope is interesting to you, check out some of the open positions we have on our team!