Building This Site

When I decided to update my web site, I wanted a static site for ease of setup and maintenance, but I also wanted to be able to write using markdown. My first idea was to write a script to process the markdown and build a site, but then I ran across this post. Using Flask and a few other plug-ins, I now have a really nice, easy-to-maintain script for building my site. This post is a basic code walk-through of the script. The complete script can be downloaded here.

I am using twitter bootstrap with a few modifications for my CSS.

The basic folder structure for the site is:

+ site
|
---- build
|
---- env 
|
---- src
      |
      ---- pages 
      |
      ---- static 
      |
      ---- templates

where env is the folder for a virtual environment that holds the Python packages. The packages I am using are listed in the imports, along with the __future__ with statement import.

from __future__ import with_statement

import sys
import os.path
import shutil
import zipfile
import itertools
import yaml
import markdown
import PyRSS2Gen

from filecmp import dircmp
from datetime import datetime
from flask import Flask, render_template,abort,make_response
from flask_frozen import Freezer
from HTMLParser import HTMLParser

The most important one is Frozen-Flask, which walks the Flask application, following its url_for function calls, and saves the site into the build folder. Unlike the post I based this project on, I didn't use FlatPages; instead I recreated the parts of that module that I needed.

The script provides a number of commands:

  • build - builds the site into the build folder
  • static - runs the freezer and then hosts the site from that folder
  • public - runs the Flask app as a publicly reachable web server, without freezing
  • default - with no command, runs the Flask app as a web server on localhost, without freezing

Each time I build the site, I save the old copy, and create an upload folder to hold files that have changed.

DEBUG = True
BUILD_FOLDER = '../build'
OLD_BUILD_FOLDER = '../build_old'
UPLOAD_FOLDER = '../upload'+datetime.strftime(datetime.utcnow(), "_%Y_%b_%d_%H_%M_%S")
FREEZER_DESTINATION = BUILD_FOLDER

As with any Flask application, we create an application object and, in this case, hook the Freezer extension up to it.

app = Flask(__name__)
app.config.from_object(__name__)
freezer = Freezer(app)

I keep track of all the files I have seen for the live version of this script.

file_cache = {}

One of the things you may have noticed is that the front page, and post list, shows the first part of each post. This summary is calculated by removing all of the tags from the HTML for the post using the following code:

#
# From http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python
#
class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
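
The stripper above targets Python 2. As a minimal sketch, assuming Python 3 (where HTMLParser lives in html.parser), the same idea could look like the following. The names TagStripper, strip_tags_py3 and summarize are made up for this illustration and are not part of the site script.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True in Python 3
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags_py3(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text the parser is still buffering
    return stripper.get_text()


def summarize(html, limit=280):
    """Plain-text summary cut to `limit` characters, mirroring parsePage."""
    return strip_tags_py3(html)[:limit] + ' ...'
```

Calling close() matters in this version because the Python 3 parser may hold back trailing text until it knows no more input is coming.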

The content for each page is contained in a markdown file. The file contains some YAML at the top with a description of the page. For example, this page has the following header:

title: Building This Site
date: 2013-2-20
tags: [python,flask,twitter bootstrap, html, javascript]

The code I wrote based on Flat Pages walks the pages folder and processes each page. The pages are stored in an object with properties for the:

  • path - the path to the page, as the site will see it
  • meta_yaml - the YAML header, up to the first blank line
  • content - the raw content for the page
  • html - generated from the content using markdown
  • summary - calculated using the MLStripper code above, and then truncated to at most 280 characters
  • meta - a dictionary of the values in the YAML

These are calculated in the parsePage function.

def parsePage(string, path):
    lines = iter(string.split(u'\n'))
    extensions = ['codehilite']
    page = {}

    page['path'] = path
    page['meta_yaml'] = u'\n'.join(itertools.takewhile(unicode.strip, lines))
    page['content'] = u'\n'.join(lines)
    page['meta'] = yaml.safe_load(page['meta_yaml'])
    page['html'] = markdown.markdown(page['content'], extensions)
    page['summary'] = strip_tags(page['html'])[:280]+' ...' #first set of characters

    # treat a missing or empty tags entry as "no tags"
    if page['meta'].get('tags') is None:
        page['meta']['tags'] = []

    return page
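
The header split is the least obvious part of parsePage, so here is a small stand-alone sketch of just that step. It assumes Python 3 (str.strip instead of unicode.strip) and leaves out the YAML and markdown calls; split_header and the sample text are hypothetical names used only for this example.

```python
import itertools


def split_header(text):
    """Return (header, body); the header is every line before the first blank line."""
    lines = iter(text.split('\n'))
    # takewhile stops at the first line that strips to '' and consumes it,
    # so the blank separator line ends up in neither part.
    header = '\n'.join(itertools.takewhile(str.strip, lines))
    # The same iterator now yields only what follows the separator.
    body = '\n'.join(lines)
    return header, body


sample = "title: Building This Site\ndate: 2013-2-20\n\nFirst paragraph.\n\nSecond."
header, body = split_header(sample)
```

Because both joins share one iterator, the file is walked only once and later blank lines stay part of the body.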

The page object also stores the latest modification time and the file path for each markdown file.

def processFile(path, filepath):
    mtime = os.path.getmtime(filepath)
    with open(filepath) as fd:
        content = fd.read().decode('utf8')
    page = parsePage(content, path)
    page['mtime'] = mtime
    page['filepath'] = filepath
    return page

Folders are processed looking for .md files that do not start with _. Files that start with underscore are ignored, allowing me to work on files while building the site for other updates.

def processFolder(directory, path_prefix=(),pages=None):

    if pages is None:
        pages = {} #avoid sharing one default dict between calls

    for name in os.listdir(directory):
        full_name = os.path.join(directory, name)
        if os.path.isdir(full_name):
            processFolder(full_name, path_prefix + (name,),pages)
        elif name.endswith('.md') and not name.startswith('_'):
            name_without_extension = name[:-len('.md')]
            new_name = name_without_extension+'.html'
            path = u'/'.join(path_prefix+(new_name,))
            pages[path] = processFile(path, full_name)
    return pages
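
For comparison, the same folder scan can be sketched with os.walk instead of explicit recursion. This is an alternative written for Python 3, not the code the site uses; find_pages is a hypothetical name and it only maps site paths to file paths rather than parsing the pages.

```python
import os


def find_pages(root):
    """Map site paths like 'posts/site.html' to their markdown file paths."""
    found = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            if not name.endswith('.md') or name.startswith('_'):
                continue  # skip non-markdown files and underscore drafts
            full_name = os.path.join(directory, name)
            relative = os.path.relpath(full_name, root)
            site_path = relative[:-len('.md')] + '.html'
            # site paths always use forward slashes, whatever the OS separator
            found[site_path.replace(os.sep, '/')] = full_name
    return found
```

With a pages folder holding posts/a.md and posts/_draft.md, only posts/a.html would appear in the result.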

When a page is requested, I check whether its file has changed on disk. If it has, the page is re-parsed; otherwise the cached page is returned for display.

def getPage(path,pages, default=None):
    page = None
    try:
        page = pages[path]
        filepath = page['filepath']
        mtime = os.path.getmtime(filepath)
        if(page['mtime']!=mtime):
            page = processFile(path,filepath)
    except KeyError:
        page = default

    return page
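
The modification-time check can be isolated into a small helper. The sketch below is written for Python 3 and differs from getPage in one way that I want to call out: it stores the refreshed entry back into the cache, so the file is only re-read once per change. get_cached and the loader callback are hypothetical names for this example.

```python
import os


def get_cached(path, cache, loader):
    """Return cache[path], reloading it when the file's mtime has changed."""
    entry = cache.get(path)
    if entry is None:
        return None  # unknown path, the caller decides what to do (e.g. a 404)
    mtime = os.path.getmtime(entry['filepath'])
    if entry['mtime'] != mtime:
        # loader(path, filepath) must return a dict that includes 'filepath'
        entry = loader(path, entry['filepath'])
        entry['mtime'] = mtime
        cache[path] = entry  # keep the fresh copy so the next call is a hit
    return entry
```

Any extra keys attached to the old entry after loading, such as a back link, would need to be copied over by the loader.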

Dates in the YAML can be in one of four formats:

  1. circa year
  2. archive
  3. year-month-day
  4. raw date string

These will be used during sorting.

def getPageDate(page):

    dateString = str(page['meta']['date'])

    if dateString.startswith('circa'):
        return dateString.split(' ')[1]
    elif dateString == '' or page['meta']['date'] is None:
        return -1
    elif dateString == 'archive':
        return "2012" #rank these above circa; returned as a string, as circa years are
    else:
        return dateString.split(' ')[0]
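
To see how these keys order pages, here is a small sketch that works on raw date values instead of page objects. It is written for Python 3, where an integer such as -1 cannot be compared with strings during sorting, so this variant returns an empty string for undated pages. That substitution is my assumption, not what the script does, and page_sort_key is a hypothetical name.

```python
def page_sort_key(raw_date):
    """Normalize a YAML date value to a string key; '' sorts as the oldest."""
    if raw_date is None:
        return ''
    text = str(raw_date)
    if text.startswith('circa'):
        return text.split(' ')[1]  # 'circa 2005' -> '2005'
    if text == 'archive':
        return '2012'
    return text.split(' ')[0]  # drop any time part after the date


dates = ['circa 2005', None, '2013-02-20', 'archive', '2012-11-03']
newest_first = sorted(dates, key=page_sort_key, reverse=True)
```

The comparison is plain string ordering, so it only matches calendar order when months and days are zero-padded.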

Using these functions I process all of the files in the pages folder and then separate them into posts for the notes section of the site and posts for the projects section of the site.

I am using a hard coded link on each page to track a parent_url which I can use for a back link. On this page you can see the back link at the top and bottom of the page.

pages = processFolder(os.path.join(app.root_path,u'pages'))

projects = []
posts = []

for path,page in pages.items():
    if path.startswith('projects'):
        page['parent_url'] = '/portfolio/'
        page['parent'] = 'Back to portfolio'
        projects.append(page)
    elif path.startswith('posts'):
        page['parent_url'] = '/notes/'
        page['parent'] = 'Back to notes'
        posts.append(page)

Once all the posts are identified, I can build an RSS XML file using the PyRSS2Gen library.

rssItems = []

for page in posts:
    dateString = getPageDate(page)
    dt = None

    if isinstance(dateString, str) or isinstance(dateString, unicode):
        if len(dateString)==4:
            dt = datetime.strptime(dateString, "%Y")
        else:
            dt = datetime.strptime(dateString, "%Y-%m-%d")

    item = PyRSS2Gen.RSSItem(
         title = page['meta']['title'],
         link = 'http://www.sasbury.com'+'/'+page['path'],
         description = page['summary'],
         guid = PyRSS2Gen.Guid('/'+page['path']),
         pubDate = dt)

    rssItems.append(item)

rss = PyRSS2Gen.RSS2(
    title = "sasbury.com feed",
    link = "http://www.sasbury.com",
    description = "sasbury.com RSS feed",
    docs = '',
    lastBuildDate = datetime.utcnow(),
    items = rssItems
    )

rssFeed = rss.to_xml('utf-8')
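
The date handling inside the loop can be pulled out into a helper, which makes it easier to try on its own. This is a minimal Python 3 sketch, assuming the same two string shapes the loop expects (a four-digit year, or year-month-day); to_pub_date is a hypothetical name.

```python
from datetime import datetime


def to_pub_date(date_string):
    """Convert '2012' or '2013-2-20' style strings to a datetime, else None."""
    if not isinstance(date_string, str):
        return None  # e.g. the -1 placeholder used for undated pages
    fmt = "%Y" if len(date_string) == 4 else "%Y-%m-%d"
    return datetime.strptime(date_string, fmt)
```

A string in any other shape raises ValueError from strptime, which is also what the loop above would do.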

Finally, the Flask app defines its routes. The index is a real page, defined as a template. It displays a short list of posts and a short list of projects, which I get by taking the first three entries of each full list. The lists are sorted in reverse order by date.

@app.route('/')
def index():
    sorted_posts = sorted(posts, reverse=True, key=getPageDate)
    sorted_projects = sorted(projects, reverse=True, key=getPageDate )
    return render_template('index.html', posts=sorted_posts[:3],projects=sorted_projects[:3]
                            ,projCount=len(projects),postCount=len(posts))

The notes page is another template which displays a full list of the posts. The list is sorted in reverse order by date.

@app.route('/notes/')
def notes():
    sorted_posts = sorted(posts, reverse=True, key=getPageDate)
    return render_template('notes.html', posts=sorted_posts)

The RSS route just returns the XML we calculated above.

@app.route('/notes/sasbury_rss.xml')
def rss():
    return rssFeed, 200, {'Content-Type': 'application/xml; charset=utf-8'}

The portfolio page is another template which displays a full list of the projects. The list is sorted in reverse order by date.

@app.route('/portfolio/')
def portfolio():
    sorted_projects = sorted(projects, reverse=True, key=getPageDate )
    return render_template('portfolio.html', projects=sorted_projects)

Tags for each page are defined in the YAML. The route for each tag calculates the appropriate pages, sorts them and then displays them.

@app.route('/tag/<string:tag>/')
def tag(tag):

    tagged = []

    for path,page in pages.items():
        tags = page['meta']['tags']
        if tag in tags:
            tagged.append(page)

    sorted_pages = sorted(tagged, reverse=True, key=getPageDate)
    return render_template('tag.html', pages=sorted_pages, tag=tag)
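
The route above scans every page on each request, which is fine for a small site. One alternative, sketched here under the assumption that pages do not change while the app runs, is to build a tag index once. build_tag_index and the sample pages are hypothetical and only mimic the shape of the real page dictionaries.

```python
from collections import defaultdict


def build_tag_index(pages):
    """Map each tag to the sorted list of page paths that carry it."""
    index = defaultdict(list)
    for path, page in sorted(pages.items()):
        # a missing or empty tags entry counts as no tags
        for tag in page['meta'].get('tags') or []:
            index[tag].append(path)
    return dict(index)


sample_pages = {
    'posts/site.html': {'meta': {'tags': ['python', 'flask']}},
    'posts/notes.html': {'meta': {'tags': None}},
    'projects/app.html': {'meta': {'tags': ['python']}},
}
tag_index = build_tag_index(sample_pages)
```

The keys of such an index would also give a complete list of tags, which could be handy when listing tag URLs.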

Other pages are all grouped into a single route that renders the pages using the final template.

@app.route('/<path:path>')
def page(path):
    page = getPage(path,pages)

    if not page:
        abort(404)

    return render_template('page.html', page=page)

When I run the freezer, I want to include several static pages, so those are included manually.

@freezer.register_generator
def books_url_generator():
    # URLs as strings
    yield '/books/ejava.html'
    yield '/books/ejava2.html'
    yield '/books/jfc.html'
    yield '/books/lxatwork.html'

As I mentioned at the top of this post, I save the files that changed on each build into a new upload folder to minimize the size of each update. At some point I want to automate the upload, but unfortunately my current ISP doesn't provide an easy way for me to do this.

def processDiff(dcmp):

    #left is old build, right is new build

    #first cmp won't have any left only, but later cmps might
    if len(dcmp.left_only) > 0:

        #something was deleted here, so upload the whole new folder
        src = dcmp.right
        print "Files or folders were deleted from %s, copying full folder" % (dcmp.left)
        dest = src.replace(BUILD_FOLDER,UPLOAD_FOLDER)
        shutil.copytree(src,dest)
    else:
        to = dcmp.left.replace(OLD_BUILD_FOLDER,UPLOAD_FOLDER)

        if not os.path.exists(to) and len(dcmp.diff_files)>0:
            os.mkdir(to)

        #copy new files
        for name in dcmp.right_only:

            src = os.path.join(dcmp.right,name)
            dest = src.replace(BUILD_FOLDER,UPLOAD_FOLDER)

            if os.path.isdir(src):
                shutil.copytree(src,dest)
            else:
                parent = os.path.dirname(dest)

                if not os.path.exists(parent):
                    print "creating %s" % (parent)
                    os.makedirs(parent)

                shutil.copy(src,dest)

        #copy changed files
        for name in dcmp.diff_files:
            src = os.path.join(dcmp.right,name)
            dest = src.replace(BUILD_FOLDER,UPLOAD_FOLDER)
            shutil.copy(src,dest)

        #recurse
        for sub_dcmp in dcmp.subdirs.values():
            processDiff(sub_dcmp)
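
If you have not used filecmp.dircmp before, the three attributes that processDiff relies on are easiest to understand with a tiny experiment. The sketch below is written for Python 3 and builds two throwaway folders; write and summarize_diff are hypothetical helpers and the file names are invented.

```python
import os
import tempfile
from filecmp import dircmp


def write(folder, name, text):
    with open(os.path.join(folder, name), 'w') as fd:
        fd.write(text)


def summarize_diff(old, new):
    """Return (deleted, added, changed) file names at the top level."""
    dcmp = dircmp(old, new)  # left is the old build, right is the new build
    return sorted(dcmp.left_only), sorted(dcmp.right_only), sorted(dcmp.diff_files)


with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, 'build_old')
    new = os.path.join(root, 'build')
    os.mkdir(old)
    os.mkdir(new)
    write(old, 'index.html', 'old front page')
    write(new, 'index.html', 'a longer, new front page')
    write(old, 'about.html', 'same')
    write(new, 'about.html', 'same')
    write(old, 'gone.html', 'removed in the new build')
    write(new, 'fresh.html', 'added in the new build')
    deleted, added, changed = summarize_diff(old, new)
```

Here left_only corresponds to deletions, right_only to new files, and diff_files to files present on both sides with different content. Subfolders are not compared until you walk dcmp.subdirs, which is why processDiff recurses.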

Finally, the app provides the commands. build creates the appropriate folders and freezes the site into the build folder before calculating the diff and filling the upload folder.

if __name__ == '__main__':

    if len(sys.argv) > 1 and sys.argv[1] == "build":

        if os.path.exists(OLD_BUILD_FOLDER):
            print "Removing build backup"
            shutil.rmtree(OLD_BUILD_FOLDER)

        if os.path.exists(BUILD_FOLDER):
            print "Moving previous build to build backup"
            os.rename(BUILD_FOLDER,OLD_BUILD_FOLDER)
        else:
            print "Creating placeholder build backup"
            os.mkdir(OLD_BUILD_FOLDER)

        if not os.path.exists(BUILD_FOLDER):
            print "Creating build folder"
            os.mkdir(BUILD_FOLDER)

        print "Compiling and saving site"
        freezer.freeze()

        print "Create diff of build with previous build"
        dcmp = dircmp(OLD_BUILD_FOLDER, BUILD_FOLDER)

        ##check if we deleted anything, if so, we should copy the whole folder and be done
        if len(dcmp.left_only) > 0:
            print "Files or folders were deleted, copying full build"
            shutil.copytree(BUILD_FOLDER,UPLOAD_FOLDER)
        else:
            print "Creating temporary upload folder"
            os.mkdir(UPLOAD_FOLDER)
            processDiff(dcmp)

        print "Creating zip of upload folder"
        zf = zipfile.ZipFile(UPLOAD_FOLDER+".zip", 'w', zipfile.ZIP_DEFLATED)
        for base, dirs, files in os.walk(UPLOAD_FOLDER):
            for name in files:
                fn = os.path.join(base, name)
                zf.write(fn, fn[len('../'):])#keep the upload folder name but not the ../ part
        zf.close()#finish writing the archive before the upload folder is removed

        if os.path.exists(UPLOAD_FOLDER):
            print "Removing temporary upload folder"
            shutil.rmtree(UPLOAD_FOLDER)

        if os.path.exists(OLD_BUILD_FOLDER):
            print "Removing build backup"
            shutil.rmtree(OLD_BUILD_FOLDER)

    elif len(sys.argv) > 1 and sys.argv[1] == "static":

        freezer.run(port=8080)

    elif len(sys.argv) > 1 and sys.argv[1] == "public":

        app.run(host='0.0.0.0',port=8080)

    else:

        app.run(port=8080)
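
The zip step in the build command depends on the upload folder path starting with ../ so that the prefix can be sliced off. As a sketch of a variant that works for any folder location, assuming Python 3, the archive names can be computed with os.path.relpath instead. zip_folder is a hypothetical helper, not part of the script.

```python
import os
import zipfile


def zip_folder(folder, zip_path):
    """Zip `folder`, storing names relative to the folder's parent directory."""
    parent = os.path.dirname(os.path.abspath(folder))
    # the with block closes the archive, which writes its central directory
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for base, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(base, name)
                # keep the folder's own name as the top-level entry in the zip
                archive.write(full, os.path.relpath(full, parent))
    return zip_path
```

The zip file should be written outside the folder being walked, otherwise the archive could try to include itself.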