Creating Simple, Semantic HTML Markup from a Google Doc

Google Docs is great for letting people build a document collaboratively: they can suggest (and discuss) changes, view revisions, and make use of a variety of other useful features. At present, I'm sad to admit, CollaboraOffice isn't quite at the same level (although it's catching up quickly!).

The Problem

A big problem with Google Docs (shared, incidentally, with Microsoft Word and even LibreOffice/CollaboraOffice) is that it's almost impossible to convert a collaboratively created document into a form suitable for publishing on the web.

Yes, you can go to File->Download as...->Web Page (html, zipped)... but that results in an HTML document that is chock full of badness. Google Docs (and the other word processing packages) all assume you want your document to render on the web exactly like it does in their app. In my experience, it is impossible to get clean, semantic HTML out of any word processing platform without all the nasty in-line styling and formatting cruft that these applications assume you want.

Anyone (like me) who's had to work with non-technical people wanting to manage their own websites, via a content management system (CMS) like WordPress or Drupal or any one of hundreds of others, can attest to the fact that the average person's first tool for creating content is always their word processor. This, of course, is probably the least appropriate tool, because it

  1. creates an expectation of control and formatting that is invalid on the web, and
  2. tries to control the content that is copied and pasted into the CMS' editing interface, which is entirely counter-productive.

All you really want is very simple, semantically marked-up content, without any styling whatsoever. The styling for CMS content is typically applied by the theme, rather than the editor. The purpose behind that is to ensure that all the site's content is consistent in its look-and-feel, and that the whole lot can be changed with simple tweaks to that theme. Problem is, most people neither understand that nor care. In fact, few people will even know what I mean by "semantically marked-up". Put simply, it's the idea that the mark-up only says what type of content something is, e.g. a title, a paragraph, a list, a table, something requiring emphasis (usually rendered in italics) or strong presentation (usually rendered in bold).

An Illustrated Example

Here's an example of a document:

[Screenshot: the Google Doc used in this example, containing some text with basic formatting.]

Here's the simple, clean semantic mark-up we want to represent this content:

<h1>Privacy notice</h1>
<p>The Open Education Resource universitas (OERu) privacy notice provides a simple and concise summary to explain our treatment of personal information from OERu learners and users of the websites hosted by the OER Foundation (OERF).</p>
<p>This privacy notice complies with our Privacy Policy and Terms of Service Policy and applies to you when you decide to use our services.</p>
<p>We collect information when users visit websites hosted by the OERF. This includes IP addresses which are unique numbers that identify specific internet connections, and sometimes specific computers and devices that visit our sites.</p>
<h2>Definitions</h2>
<p>In the context of this policy</p>
<ul>
<li><p>Personal data refers to information related to an identified or identifiable natural person used for data processing by the OERF.</p></li>
</ul>

Here's the mark-up we get if we copy and paste from Google Docs (you'll get similar results copy-and-pasting from any word processor):

<p>&nbsp;</p>

<h1 dir="ltr" style="line-height:1.38;margin-top:20pt;margin-bottom:6pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:20pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Privacy notice</span></b></h1>

<p>&nbsp;</p>

<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">The Open Education Resource universitas (OERu) privacy notice provides a simple and concise summary to explain our treatment of personal information from OERu learners and users of the websites hosted by the OER Foundation (OERF).</span></b></p>

<p>&nbsp;</p>

<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">This privacy notice complies with our Privacy Policy and Terms of Service Policy and applies to you when you decide to use our services. </span></b></p>

<p>&nbsp;</p>

<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">We collect information when users visit websites hosted by the OERF. This includes IP addresses which are unique numbers that identify specific internet connections, and sometimes specific computers and devices that visit our sites.</span></b></p>

<h2 dir="ltr" style="line-height:1.38;margin-top:18pt;margin-bottom:6pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:16pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Definitions</span></b></h2>

<p dir="ltr" style="line-height:1.38;margin-top:0pt;margin-bottom:0pt;"><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">In the context of this policy </span></b></p>

<p><b id="docs-internal-guid-614306f4-8a93-ba9b-7e04-91586b612f23" style="font-weight:normal;"><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:700;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Personal data</span><span style="font-size:11pt;font-family:Arial;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;"> refers to information related to an identified or identifiable natural person used for data processing by the OERF. </span></b></p>

<p>&nbsp;</p>

<p>&nbsp;</p>

<p>&nbsp;</p>

As you can see, it's fairly impenetrable (it's not written for human eyes :) ). It doesn't even use proper semantic tags to represent the content. Instead it uses "ad hoc" styling to achieve a superficial resemblance to what we want to see, and in doing so, it also generates huge amounts of totally awful mark-up.

The (Free and Open Source) Solution

Luckily, after a lot of dives down rabbit holes, I have emerged with a free and open source software (FOSS) solution. It involves an incredible software tool called "Pandoc" that can be installed on any Windows, macOS, or Linux computer. Because I use FOSS wherever possible, I'll describe how I achieved a solution on Linux.

In addition to providing clean semantic HTML, preserving links and structure (including tables), Pandoc can be asked to build a "Table of Contents" for the HTML, too, based on headings (H1, H2, etc.) to the level desired.

The steps:

  1. download a "Web Page (html, zipped)" version of the Google Document and unzip it into a file, in my case, PrivacyNotice.html
  2. install Pandoc on your computer: sudo apt install pandoc
  3. in the directory containing PrivacyNotice.html, run

    pandoc -r html -w docbook --email-obfuscation=none -S -s PrivacyNotice.html | pandoc -r docbook -w html --toc --toc-depth=2 --email-obfuscation=none -S -s -o PrivacyNotice-semantic.html -


    This produces a semantically marked-up version of the file, PrivacyNotice-semantic.html, in this case with a linked Table of Contents automatically created from headings down to a depth of "2" (i.e. H1 and H2).

    If you prefer not to have the Table of Contents, run this instead:

    pandoc -r html -w docbook --email-obfuscation=none -S -s PrivacyNotice.html | pandoc -r docbook -w html --email-obfuscation=none -S -s -o PrivacyNotice-semantic.html -


    Note: the trailing "-" is important! It tells the second pandoc command to read its input from standard input, which here is the output of the first pandoc command, passed along by the "pipe" character, "|".
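The "-" convention is not specific to Pandoc: many command-line tools treat a lone "-" as "read from standard input". As a small illustration that needs nothing beyond a standard shell, here is the same idea using cat. The function name show_piped_input is made up for this sketch.

```shell
# Sketch only: show_piped_input is a made-up name for illustration.
# "cat -" reads from standard input, just as the second pandoc does above.
show_piped_input() {
    cat -
}

# Whatever the command before the "|" prints becomes the input
# of the command after it.
printf '<h1>Privacy notice</h1>\n' | show_piped_input
```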

What this Pandoc process does is convert the nasty HTML provided by the Google Doc "Web Page" export into a different (and very semantically structured) format called DocBook, which strips out all the unnecessary styling. We then convert it back to HTML, preserving the semantic structure, and removing the stuff we don't want.
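One rough way to check that the round trip really has stripped the in-line styling is to count the style="..." attributes in a file before and after conversion. The helper below is only a sketch: count_inline_styles is a hypothetical name, and it assumes a grep that supports the -o option (GNU and BSD grep both do).

```shell
# Sketch only: count_inline_styles is a hypothetical helper name.
# Prints how many style="..." attributes occur in the given HTML file.
count_inline_styles() {
    # grep -o prints every match on its own line, so wc -l counts
    # matches rather than matching lines; tr removes the padding
    # that some versions of wc add in front of the number.
    grep -o 'style="' "$1" | wc -l | tr -d ' '
}
```

Running count_inline_styles PrivacyNotice.html and then count_inline_styles PrivacyNotice-semantic.html should show a far lower number for the second file.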

To simplify matters, you can also write a script or shell function which achieves the same thing in a single command. A plain bash alias won't do here, because an alias can't take an argument like $1, so I use shell functions instead. For example, here's what goes in my ~/.bashrc (the hidden bash shell configuration file in my home directory, "~"):

tosemtoc() { pandoc -r html -w docbook --email-obfuscation=none -S -s "$1" | pandoc -r docbook -w html --toc --toc-depth=2 --email-obfuscation=none -S -s -o "$1" -; }

tosem() { pandoc -r html -w docbook --email-obfuscation=none -S -s "$1" | pandoc -r docbook -w html --email-obfuscation=none -S -s -o "$1" -; }

(if you add this to your .bashrc, you can make your bash session aware of it either by logging out and in again, or by running . ~/.bashrc at your command prompt)

With these two definitions I can achieve the same thing as the complex command above like this (in this case the semantic HTML output is written back into the file I started with, rather than to a separate file; no big loss on that Google Doc-produced HTML travesty!):

With Table of Contents (down to H2): tosemtoc PrivacyNotice.html

Without the Table of Contents: tosem PrivacyNotice.html

Pretty straightforward. And it takes less than a second to run. Note that the result is a "well formed" complete HTML document, so you can view it happily in your browser, e.g. type file:///path/to/your/PrivacyNotice.html into your browser's location bar as a URL.
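Because the shortcuts above overwrite the file they read, a failed conversion could in principle leave you with a damaged file. If that worries you, a more cautious variant might work in temporary files and only replace the original when both Pandoc passes succeed. The following is a sketch under a few assumptions, not the commands I actually use: tosem_safe is a hypothetical name, it runs the two passes one after the other instead of through a pipe, and it leaves out -S on the assumption that you have Pandoc 2.0 or later, where that option was removed.

```shell
# Sketch only: tosem_safe is a hypothetical name, not part of Pandoc.
# Converts an HTML file to clean semantic HTML in place, but only
# replaces the original if both Pandoc passes succeed.
tosem_safe() {
    src=$1
    if [ ! -f "$src" ]; then
        echo "tosem_safe: no such file: $src" >&2
        return 1
    fi

    # Work in temporary files so a failed run cannot clobber the source.
    docbook=$(mktemp) || return 1
    html=$(mktemp) || { rm -f "$docbook"; return 1; }

    # Pass 1: exported HTML to DocBook. Pass 2: DocBook back to HTML.
    # -S is omitted here (assumption: Pandoc 2.0 or later).
    if pandoc -r html -w docbook --email-obfuscation=none -s -o "$docbook" "$src" &&
       pandoc -r docbook -w html --email-obfuscation=none -s -o "$html" "$docbook"
    then
        # Copy the result over the original, keeping its permissions.
        cat "$html" > "$src"
        status=0
    else
        echo "tosem_safe: conversion failed, $src left untouched" >&2
        status=1
    fi

    rm -f "$docbook" "$html"
    return "$status"
}
```

It is used the same way as the shortcuts above: tosem_safe PrivacyNotice.html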

If you want to see the final result of this, have a look at our new GDPR-aware Privacy Notice!

Hope this saves someone a lot of frustration and unnecessary remedial mark-up editing!

 
