How to Access and Parse Gutenberg Blocks with PHP

In this post we’ll look at how to parse a post’s Gutenberg blocks and extract specific blocks to make something else. We’ll look at WordPress PHP functions for parsing, extracting and rendering chosen blocks.

One benefit of the new Gutenberg editor in WordPress is more structured data for post content. In the older days everything was stored as HTML and there was no way to extract specific pieces of content without some very complex regular expressions. But with Gutenberg each piece of content, be it a paragraph, heading, image, video, button, or a column that has other blocks inside it, is stored with information that tells us what this piece of content is.

With built-in WordPress functions it is very easy to fetch all blocks in a post’s content in an array with all their information. This opens up for loads of useful features for theme developers. Just to mention a few ideas:

Dynamically generate a table of contents by fetching all the headings (tutorial below).
Fetch all videos, images, or quotes used in all posts to gather and list them all out on another page.
Extract a post’s first paragraph and use that as excerpt in archives (tutorial below).
Get an overview of the use of specific blocks and their position. For example say you have a custom Ad block and you need to know how often it’s used across your posts and how far down in the content they appear.
Render a post’s blocks but exclude specific block types.
Check if a post content starts with a video and use that video instead of the featured image in archives.
If you use a custom block that contains technical specs on products you can easily create a page that displays and compares the technical specs across product posts.
Dynamically gather all single images used in a post, and generate a gallery of them at the end or elsewhere.

Let’s just jump right into it!

Parse a post’s blocks

To parse blocks we use the WordPress function parse_blocks(). As parameter you need to provide a string of a post’s content. If you’re inside a loop or have access to a post object simply provide $post->post_content as parameter to the function.

$post_id = 1;
$post = get_post($post_id);
$blocks = parse_blocks($post->post_content);

The return from parse_blocks() is an array where each array element is a block. For each block element you have information such as type of block (key ‘blockName‘), block attributes (key ‘attrs‘), inner blocks for nested blocks such as Columns (key ‘innerBlocks‘), and two elements for the actual block content (keys ‘innerHTML‘ and ‘innerContent‘). The element innerHTML is a string of HTML content, while innerContent is an array of HTML strings.

And that’s all! Loop through the returned array from parse_blocks() to do your thing. We’ll look at more code examples of this further down in this post.

A note on classic (non-Gutenberg) posts

You might be working on an older WordPress site that has created posts before upgrading to Gutenberg (or used a Disable Gutenberg-plugin). In this case these posts won’t have structured post content, but rather the entire post content is inside a “Classic editor” block.

The array returned from the function parse_blocks() will on these kind of posts return one block array element with blockName set to null. The post’s full HTML content is inside this element’s innerHTML string.

We can safely assume that if a post’s return on parse_blocks() has one element and its blockName is null, we are dealing with a “pre-Gutenberg” post. Otherwise blockName will always be populated. This is a good check to keep in mind when writing code for parsing posts’s blocks and you want to handle older WordPress content.

Render a block

WordPress also offers a function to render a specific block with render_block(). As parameter you need to provide an array for a block, just like one of those returned from parse_blocks() above. The function returns a string of rendered HTML that you can simply echo straight out.

$post_id = 1;
$post = get_post($post_id);
$blocks = parse_blocks($post->post_content);
foreach ($blocks as $block) {
	echo render_block($block);
}

The above code will render all of the post’s blocks, just like it would normally when rendering the post’s content. The fun part comes when we start adding code to exclude or include specific blocks that we’re interested in.

For example, the code below will only print out the post’s paragraph blocks:

foreach ($blocks as $block) {
	if ($block['blockName'] == 'core/paragraph') {
		echo render_block($block);
	}
}

And this will render all blocks, but exclude all shortcode blocks:

foreach ($blocks as $block) {
	if ($block['blockName'] != 'core/shortcode') {
		echo render_block($block);
	}
}

Block names

When parsing a post’s blocks you will most likely need to check the block’s type. They are pretty simple to guess. For example a Paragraph block is, well you guessed it, paragraph. However keep in mind that all Gutenberg blocks in WordPress are prefixed with a namespace. For WordPress core (default) blocks, they are all prefixed with “core/“. The exception are the Embed blocks which are prefixed with “core-embed/“. So for example a paragraph block will have the block name core/paragraph.

To avoid wild guessing here’s an overview of the default blocks provided by WordPress (at the time of writing this):

Common blocks

Paragraph: core/paragraph
Image: core/image
Heading: core/heading
Gallery: core/gallery
List: core/list
Quote: core/quote
Audio: core/audio
Cover: core/cover
File: core/file
Video: core/video

Formatting

Preformatted: core/preformatted
Code: core/code
Classic: core/freeform
(but for non-Gutenberg posts it’ll be null, see note about non-Gutenberg posts)
Custom HTML: core/html
Pullquote: core/pullquote
Table: core/table
Verse: core/verse

Layout

Button: core/button
Columns: core/columns
More: core/more
Page Break: core/nextpage
Separator: core/separator
Spacer: core/spacer
Media & Text: core/media-text

Widgets

Shortcode: core/shortcode
Archives: core/archives
Categories: core/categories
Latest Comments: core/latest-omments
Latest Posts: core/latest-posts

Embeds

Embed: core/embed
Twitter: core-embed/twitter
YouTube: core-embed/youtube
Facebook: core-embed/facebook
Instagram: core-embed/instagram
WordPress: core-embed/wordpress
SoundCloud: core-embed/soundcloud
Spotify: core-embed/spotify
Flickr: core-embed/flickr
Vimeo: core-embed/vimeo
Animoto: core-embed/animoto
Cloudup: core-embed/cloudup
Crowdsignal: core-embed/crowdsignal
Dailymotion: core-embed/dailymotion
Hulu: core-embed/hulu
Imgur: core-embed/imgur

Issuu: core-embed/issuu
Kickstarter: core-embed/kickstarter
Meetup.com: core-embed/meetup-com
Mixcloud: core-embed/mixcloud
Reddit: core-embed/reddit
ReverbNation: core-embed/reverbnation
Screencast: core-embed/screencast
Scribd: core-embed/scribd
Slideshare: core-embed/slideshare
SmugMug: core-embed/smugmug
Speaker Deck: core-embed/speaker
TED: core-embed/ted
Tumblr: core-embed/tumblr
VideoPress: core-embed/videopress
WordPress.tv: core-embed/wordpress-tv
Amazon Kindle: core-embed/amazon-kindle

Code example: Fetch a post’s first paragraph as excerpt

A well-authored post should start with a paragraph that introduces what the post is about and tempt people to keep on reading. These are perfect to use as excerpts instead of relying on the automatic excerpt function in WordPress!

This is a function you can add to your theme in functions.php that will return a post’s first paragraph. If no post was provided it will refer to the global post object. It also accommodates non-Gutenberg posts by returning the actual WordPress excerpt for those.

function awp_get_excerpt($post=false) {
	if (!$post) { 
		global $post;
	}
	if (!$post) { return ''; }
	$excerpt = '';
	$blocks = parse_blocks($post->post_content);
	if (count($blocks) == 1 && $blocks[0]['blockName'] == null) {  // Non-Gutenberg posts
		$excerpt = get_the_excerpt($post->ID);
	} else {
		foreach ($blocks as $block) {
			if ($block['blockName'] == 'core/paragraph') {
				$excerpt = strip_tags($block['innerHTML']);
				break;
			}
		}
	}
	return "<div class='excerpt'>$excerpt</div>";
}

After the function’s parse_blocks() call we check whether or not the post has invalid block data (post was created before Gutenberg) and return post excerpt if that’s the case. Otherwise we loop through the post’s blocks, find the first paragraph block, and return its innerHTML. At the very end we return a string with (this is optional) a <div> around it.

To use this function all you need to do is call:

echo awp_get_excerpt();

Assuming the function call is placed somewhere there is a global $post object, for example inside a loop. If you want to specify a post, provide the post object as parameter to the function call:

$post_id = 1;
$post = get_post($post_id);
echo awp_get_excerpt($post);

Example: Create a table of contents from a post’s headings

You can automatically and dynamically generate a table of contents based on a post’s heading blocks. The process is simple enough; parse the post’s blocks and find all blocks of type core/heading. But we can take it one step further and incorporate levels; e.g. putting h3 as subheading under h2 and so on.

A heading block contains information about its level in the attribute array element (key ‘attrs‘). Inside the attrs array it would be an array element with the key ‘level‘ and a integer signifying the level. A h3 would have ‘level‘ value of 3, a h4 would have ‘level‘ of 4, and so on.

However, take note that attrs for heading blocks can be empty! This happens when the author has not changed the heading type away from the defaults in the block’s settings. To go around this we need to make some assumptions. As default headings will be h2 (unless you’ve changed this in your theme). So we can assume that if a heading block doesn’t have a level attribute, it’s a h2. Otherwise we get the level information from the attributes.

If you’re really up to the challenge I invite you to improve the code below. The issue of generating a proper structured ol list is that we can’t control how the author structure their titles. They might very well go crazy and start with a h4 heading and skip right to a h2 heading right after. And suddenly they mix a h1 in the middle. My solution therefore involves generating a flat ol list and provide the level information in the list item’s classes. Then with some clever CSS you can indent the list items accordingly with some left padding.

The code

Here’s the table of contents function:

function awp_table_of_contents($post=false) {
	if (!$post) {
		global $post;
	}
	if (!$post) { return ''; }
	
	$headings = [];
	$blocks = parse_blocks($post->post_content);
	if (count($blocks) == 1 && $blocks[0]['blockName'] == null) {  // Non-Gutenberg posts
		return '';
	} else {
		foreach ($blocks as $block) {
			if ($block['blockName'] == 'core/heading') {
				$level = (isset($block['attrs']['level'])) ? $block['attrs']['level'] : 2;  // h2 as default
				$headings[] = ['title' => wp_strip_all_tags($block['innerHTML']), 'level' => $level];
			}
		}
	}

	if (empty($headings)) {  // No headings found in post
		return '';
	}

	$toc = '<ol class="table-of-contents">';
	foreach ($headings as $heading) {
		$toc .= '<li class="heading-level-' . $heading['level'] . '">' . $heading['title'] . '</li>';
	}
	$toc .= '</ol>';
	return $toc;
}

The function starts with handling the post and parsing its blocks. At line #9 we accommodate for non-Gutenberg posts. The function continues to loop through all blocks and whenever it finds a heading block it gets added to our $headings array. We use wp_strip_all_tags() to strip the HTML tags from the titles. We also add the level information to our array, where the default is set to 2 if attributes is empty.

After the block loop the $headings array should contain all the post’s headings, in order. We can then simply generate a HTML string and output its content. As mentioned I generate a class name on each element with information about the heading’s level so we can create the illusion of a structured list with CSS.

Like with the excerpt function above, we can call this function when we’re inside the loop like so:

echo awp_table_of_contents();

Or if we are outside the loop or want to specify a post;

$post_id = 1;
$post = get_post($post_id);
echo awp_table_of_contents($post);

This will generate a list looking something like this:

Conclusion

As we have seen with structured rich post content made possible with Gutenberg, we can very easily find and extract specific parts of posts’s content. Refer back to the list of examples I mentioned in the beginning of the post. There are no limits to what you as a theme developer can do. It just depends on what your theme or WordPress site has a need for (or what would simply just be cool).