I’m helping a friend convert some HTML-formatted posts from his blog into an ePub. I wanted to use the great Markdown editor Ulysses to produce the ePub, because it’s a nice program and because being able to edit in Ulysses would give me some easy control over the formatting of the ePub.
This required two steps.
The first was to get the HTML files into a format suitable for importing into Ulysses and outputting into an ePub.
This was a bit tricky, because I wouldn’t want to just do a straightforward conversion from HTML to Markdown. I needed to retain some HTML so that it could be styled by the CSS sheet I’d use in Ulysses and look correct. So I needed to be able to selectively remove formatting while getting it Markdown-ish enough for Ulysses to be able to output to ePub.
The second step would be to actually output the ePub with the desired styling. This would involve customizing a CSS style sheet in Ulysses.
Step two was basically trivial: I modified Jennifer Mack’s excellent KBasic style sheet, essentially copying and pasting (with modification) some of the relevant CSS from my friend’s blog. No prob.
So this post will focus on the first step, the Ruby program and associated gems I used to get the HTML files in shape for Ulysses, since that was the interesting part.
Step by step
This bit loads the Ruby gems the project needs:
|
require 'nokogiri'
require 'open-uri'
require 'sanitize'
require 'upmark'
|
I wanted the script to iterate through a whole directory of HTML files. That’s what this next bit does:
|
Dir.glob("*.html") do |my_html_file|
puts "working on: #{my_html_file}..."
|
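To see what the loop picks up, here is a minimal sketch that builds a throwaway directory and globs it. The file names are made up for illustration, and the explicit sort is my own addition, since older Ruby versions don't promise any particular order from Dir.glob.

```ruby
require 'tmpdir'
require 'fileutils'

# Build a temporary directory with two HTML files and one other file,
# then collect only the HTML files, the way the loop above does.
found = []
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'first-post.html'))
  FileUtils.touch(File.join(dir, 'second-post.html'))
  FileUtils.touch(File.join(dir, 'notes.txt'))

  Dir.chdir(dir) do
    # Sort explicitly so the processing order is predictable.
    Dir.glob('*.html').sort.each do |my_html_file|
      found << my_html_file
      puts "working on: #{my_html_file}..."
    end
  end
end
```

The notes.txt file is skipped because it doesn't match the `*.html` pattern.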
The next few steps use Nokogiri, an HTML/XML parser, to pull the information we want from the HTML file we’re currently working with.
This initializes a variable that will let us work with the content of the HTML file in Nokogiri:
|
doc = Nokogiri::HTML(open(my_html_file))
|
This initializes a variable for the content of the blog post we are gonna turn into an ePub:
|
post_body = doc.css('div.post-body')
|
In contrast with what I do below for the blog post’s title, I am not using Nokogiri’s .text method to extract the text of the article. Why? Cuz I need the HTML formatting of the article for various purposes (to convert to Markdown and to have properly formatted blockquotes). Nokogiri’s css method lets you extract elements from a page using CSS selectors.
We want to put the body of the post into a string for further manipulation, so we do that next:
|
body_of_article = post_body.to_s
|
This initializes a variable for the title of the article. We just want the text here, so we’ll use the .text Nokogiri method:
|
post_title = doc.css('.post-title').text
|
This prepends “# ” to the title so that it will be a proper Markdown heading, which automates the process of turning blog posts into individual chapters in an ePub:
|
post_title = "# " + post_title.strip
|
(The strip method removes the whitespace around the title, which keeps the title from ending up on a different line from the “#”.)
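A quick sketch of why strip matters here. The raw_title value is invented, but it is typical of text pulled out of an indented HTML heading.

```ruby
# Text pulled out of an HTML heading often carries the newlines and
# indentation that surrounded it in the source file.
raw_title = "\n  My Post\n"

without_strip = "# " + raw_title        # the "#" and the title end up on different lines
with_strip    = "# " + raw_title.strip  # "# My Post"

puts without_strip.inspect
puts with_strip.inspect
```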
Next we’re gonna combine the post title and body. First we initialize an empty string:
|
string_of_article = ""
|
Then we concatenate the strings we’ve got for the title and body, with the title being added first of course:
|
string_of_article << post_title
string_of_article << body_of_article
|
This next bit puts blockquote tags on their own lines, which helps Ulysses handle blockquotes correctly when making ePubs:
|
string_of_article = string_of_article.gsub(/(<\/?blockquote.*?>)/) { "\n" + $1 + "\n" }
|
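Here is a small sketch of putting blockquote tags on their own lines with gsub, run against a made-up one-line fragment. The pattern matches both opening tags (with any attributes) and closing tags, and wraps each in newlines.

```ruby
# Hypothetical fragment with a blockquote squeezed onto one line.
fragment = '<p>Intro</p><blockquote class="q">Quoted</blockquote><p>After</p>'

# Wrap every opening or closing blockquote tag in newlines.
separated = fragment.gsub(/(<\/?blockquote.*?>)/) { "\n" + $1 + "\n" }

puts separated
```

After the substitution, the opening tag, the quoted text, and the closing tag each sit on a separate line.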
This converts <h3> tags to Markdown-appropriate format, which Upmark wasn’t handling well for some reason:
|
string_of_article = string_of_article.gsub(/<h3>(.*)<\/h3>/) {"\n" + "### " + $1 + "\n"}
|
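One thing worth knowing about the h3 pattern: `.*` is greedy, so if two headings ever share a line, it swallows everything between the first opening tag and the last closing tag. The sketch below compares it with a non-greedy `.*?` variant. That variant is my own suggestion, not part of the script, and the sample string is invented.

```ruby
# Hypothetical input with two headings on the same line.
html = "<h3>One</h3><p>text</p><h3>Two</h3>"

# Pattern as used in the script (greedy).
greedy = html.gsub(/<h3>(.*)<\/h3>/) { "\n" + "### " + $1 + "\n" }

# Non-greedy alternative: stops at the first closing tag.
lazy = html.gsub(/<h3>(.*?)<\/h3>/) { "\n" + "### " + $1 + "\n" }

puts greedy.inspect
puts lazy.inspect
```

When every heading sits on its own line, as it usually does in blog HTML, both patterns give the same result, because `.` doesn't match newlines.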
Sanitize “cleans up” HTML by removing anything that’s not on the whitelist. I had to build my own custom whitelist to get curi blog posts to work correctly. I basically modified an example whitelist by adding a couple of things:
|
sanitized_article = Sanitize.clean(string_of_article,
  :elements => [
    'a', 'abbr', 'blockquote', 'br', 'cite', 'code', 'dd', 'dfn',
    'dl', 'dt', 'kbd', 'li', 'mark', 'ol', 'p', 'pre', 'q', 's',
    'samp', 'small', 'strike', 'sub', 'sup', 'time', 'ul', 'var',
  ],
  :attributes => {
    :all => ['class', 'dir', 'hidden', 'id', 'lang', 'style', 'tabindex', 'title', 'translate'],
    'a' => ['href'],
    'abbr' => ['title'],
    'blockquote' => ['cite', 'color', 'nested'],
    'dfn' => ['title'],
    'q' => ['cite'],
    'time' => ['datetime', 'pubdate']
  },
  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  },
  :protocols => {
    'a' => {'href' => ['ftp', 'http', 'https', 'mailto', :relative]},
    'blockquote' => {'cite' => ['http', 'https', :relative]},
    'q' => {'cite' => ['http', 'https', :relative]}
  },
  :css => {
    :protocols => ['http', 'https', :relative],
    :properties => [
      'background-color',
      'color',
    ]
  }
)
|
Upmark is a Ruby gem for converting HTML to Markdown.
|
markdown_article = Upmark.convert(sanitized_article)
|
(NOTE: I’ve manually modified the version of the Upmark gem I’m running (specifically the markdown.rb script) by commenting out the portion that handles <br> tags. Leaving the break tags in, as opposed to replacing them with newlines, keeps the formatting correct inside blockquotes.)
This last bit saves our file with a Markdown extension and closes the loop we opened way up top:
|
File.open("#{my_html_file}.md", 'w') { |file| file.write(markdown_article)}
end
|
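Note that the output name is the full input name plus “.md”, so first-post.html becomes first-post.html.md. If a plain .md name is preferred, File.basename can drop the old extension first. This is an optional variation rather than what the script does, and the file name and content below are placeholders.

```ruby
require 'tmpdir'

my_html_file     = 'first-post.html'               # hypothetical input name
markdown_article = "# First Post\n\nBody text.\n"  # placeholder content

as_written  = "#{my_html_file}.md"                           # name the script produces
alternative = "#{File.basename(my_html_file, '.html')}.md"   # old extension dropped

# Write to a temporary directory and read the file back to check it.
written_back = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, alternative)
  File.open(path, 'w') { |file| file.write(markdown_article) }
  written_back = File.read(path)
end

puts as_written
puts alternative
```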
The Script
Here’s the whole script, for reference:
|
require 'nokogiri'
require 'open-uri'
require 'sanitize'
require 'upmark'

# Process every HTML file in the current directory.
Dir.glob("*.html") do |my_html_file|
  puts "working on: #{my_html_file}..."
  doc = Nokogiri::HTML(open(my_html_file))

  # Keep the body as HTML; take only the text of the title.
  post_body = doc.css('div.post-body')
  body_of_article = post_body.to_s
  post_title = doc.css('.post-title').text
  post_title = "# " + post_title.strip

  # Title first, then body.
  string_of_article = ""
  string_of_article << post_title
  string_of_article << body_of_article

  # Put blockquote tags on their own lines and convert <h3> to Markdown.
  string_of_article = string_of_article.gsub(/(<\/?blockquote.*?>)/) { "\n" + $1 + "\n" }
  string_of_article = string_of_article.gsub(/<h3>(.*)<\/h3>/) {"\n" + "### " + $1 + "\n"}

  # Strip everything that is not on the custom whitelist.
  sanitized_article = Sanitize.clean(string_of_article,
    :elements => [
      'a', 'abbr', 'blockquote', 'br', 'cite', 'code', 'dd', 'dfn',
      'dl', 'dt', 'kbd', 'li', 'mark', 'ol', 'p', 'pre', 'q', 's',
      'samp', 'small', 'strike', 'sub', 'sup', 'time', 'ul', 'var',
    ],
    :attributes => {
      :all => ['class', 'dir', 'hidden', 'id', 'lang', 'style', 'tabindex', 'title', 'translate'],
      'a' => ['href'],
      'abbr' => ['title'],
      'blockquote' => ['cite', 'color', 'nested'],
      'dfn' => ['title'],
      'q' => ['cite'],
      'time' => ['datetime', 'pubdate']
    },
    :add_attributes => {
      'a' => {'rel' => 'nofollow'}
    },
    :protocols => {
      'a' => {'href' => ['ftp', 'http', 'https', 'mailto', :relative]},
      'blockquote' => {'cite' => ['http', 'https', :relative]},
      'q' => {'cite' => ['http', 'https', :relative]}
    },
    :css => {
      :protocols => ['http', 'https', :relative],
      :properties => [
        'background-color',
        'color',
      ]
    }
  )

  # Convert the remaining HTML to Markdown and save it next to the source file.
  markdown_article = Upmark.convert(sanitized_article)
  File.open("#{my_html_file}.md", 'w') { |file| file.write(markdown_article)}
end
|