How to Convert Your Blog From SubText to WordPress

No matter what anyone tells you, there’s more to changing blog engines than just clicking a few buttons and importing data. It tends to be a fair amount more complicated than that. In my last post, I highlighted some of the reasons, both rational and not so rational, for my decision to change from SubText to WordPress.

It’s only been about a week or so, but I’ve learned a huge amount of “stuff”, for lack of a better term, and thus far I am rather pleased with the transition to WordPress. I thought that for the benefit of others, I would document my move to WordPress 2.7 from SubText 1.9.3 so that if others decide they’d like to try it, they can hopefully retrace my steps and experience a little bit less pain than I did.

When undertaking this sort of project, there’s really only one viable place to start looking for information on how to do this. Google. I found a number of links that I thought might be helpful and used them as a general starting point for deciding what to do, and how to do it.

Subtext to WordPress:
http://www.ageektrapped.com/blog/subtext-to-wordpress-converting-blog-engines/
http://www.copyandwaste.com/2008/09/15/hello-goodbye-subtext-to-wordpress/
http://blog.digitaltinder.net/2008/12/exporting-blogml-from-subtext-21-and-importing-blogml-into-wordpress-27/

Blogger to WordPress:
http://www.aaronlerch.com/blog/2007/08/23/breaking-up-moving-blog-engines/

DasBlog to WordPress:
http://www.kavinda.net/2008/10/23/migrating-from-dasblog-to-wordpress.html

WordPress to SubText:
http://betterthaneveryone.com/archive/2007/11/04/wordpress-to-subtext-done.aspx

Obviously the first set of links was much more helpful than the others, but for reference purposes, seeing how others dealt with converting to WordPress or reasons they moved away was somewhat enlightening. After running through the instructions on each of the first set of links, I came to one conclusion: that none of their sets of instructions were going to work for me directly due to the myriad of problems I was running into that they did not address. I felt there was more hand waving than hand holding. I like to hold hands, so here’s how I converted from SubText 1.9.3 to WordPress 2.7.1.

At a high level, the idea is to export from SubText to some intermediate format, and then import that into WordPress. In this case, the intermediate format which is probably the most straightforward to use is a BlogML XML file. The rationale is that you want to be able to keep all of your content and save time doing the conversion. The last blog engine transition I did was from CityDesk to SubText and it involved a lot of copy/pasting. Not fun.

BlogML is supposed to be a standard of some sort for moving your blog content from one platform to another. Unfortunately, the development is somewhat stagnant and not much of anything has gone on in quite some time. Their roadmap as of today indicates that version 3.0 is expected to be released in mid-2008. It’s early 2009 and version 2.5 is the only thing out there.

Another potential sticking point is that WordPress does not ship with a BlogML import module. Fortunately, a fellow blogger named Aaron Lerch built one and there are several variations floating around which fix a few different bugs. I’ll be offering up my own version in order to fix a couple more.

So, to reiterate, the idea is to export from SubText to BlogML, then import the XML into WordPress. Easier said than done.

Problem #1: Exporting to BlogML.
The BlogML exporter in SubText doesn’t appear to work. At least it didn’t at first and in the version of SubText I was using. For the record, I was using version 1.9.3. Fortunately, I discovered almost by accident that the BlogML export feature for SubText doesn’t work if you instruct it to include embedded content, which is the default. I’m not sure specifically what that is meant to be, but from doing a bit of research, I gather that embedded content includes things like Flash files, YouTube videos, or maybe even local images. In any case, including the embedded content caused it to fail. I unchecked the box to include embedded content, and voilà. My BlogML XML file was ready to download.

Apologies to those of you using embedded content, but I really didn’t look too far into this. The cursory research on what embedded content was led me to believe that I didn’t have any on my blog and could probably safely ignore it. Your mileage may vary.
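Before moving on to the import, it can be worth confirming that the exported file is well-formed XML and contains roughly the number of posts you expect. What follows is a minimal Python sketch for that, not part of SubText or the importer. It assumes the export uses elements named post with a title child, and it ignores XML namespaces so that it does not depend on a particular BlogML version. The function name summarize_blogml is made up for this example.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace like '{uri}post' down to 'post'."""
    return tag.rsplit("}", 1)[-1]


def summarize_blogml(xml_text):
    """Return (post_count, titles) for a BlogML document given as a string.

    Assumption: posts are <post> elements with a <title> child.
    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    posts = [el for el in root.iter() if local_name(el.tag) == "post"]
    titles = []
    for post in posts:
        for child in post:
            if local_name(child.tag) == "title":
                titles.append((child.text or "").strip())
                break
        else:
            # Keep the list aligned with the posts even if a title is missing
            titles.append("")
    return len(posts), titles
```

To use it, read your downloaded export into a string and pass it to summarize_blogml. If the count is far below what your old blog shows, the export probably did not finish cleanly.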

Problem #2: Using Aaron Lerch’s BlogML Importer for WordPress.
This seemed flaky at first and it wasn’t clear at all why it just wasn’t working. I’d get the file upload textbox like the instructions stated, I’d attempt to upload my file, and then the fields would disappear and my browser would act as if nothing was wrong and it was done doing what it was supposed to do. I tried a few different browsers and got the same result with Firefox, IE 7, and Google Chrome.

It turns out that the BlogML import seems to use a fair amount of memory. My BlogML XML file was about 1.6MB. After digging through the Apache error logs on my web server, I found that the web page was requesting about 32MB of memory to parse the XML file and the web server was denying that request, as it was limited to much less in terms of memory.

I really don’t have any idea why it requires so much memory to parse the BlogML file. A quick estimate puts the required memory at about 20 times the size of your BlogML file. In my case, this was about 32MB of RAM. If my BlogML file were 5MB, I would likely need around 100MB of memory.

The quick fix to this issue was to add the following line of code to the blogml.php file:

ini_set("memory_limit", "64M");

You could always bump it up to 128M or higher, if needed. The alternative is to modify your php.ini file and alter the memory_limit for the entire Apache instance, but I felt that this blog import was only going to be done once, so there was no need to allocate additional resources if it wasn’t really necessary. The machine has them to spare, but no point in wasting them.
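If you want a starting value for memory_limit before you try the import, the 20x rule of thumb is easy to turn into a small calculation. The Python sketch below does that. The 1.5x headroom factor and the rounding up to a power of two are assumptions added for this example, as is the function name suggest_memory_limit. Treat the result as a first guess, not a guarantee.

```python
MEMORY_MULTIPLIER = 20   # rule of thumb from above: ~20x the BlogML file size
MIB = 1024 * 1024


def suggest_memory_limit(blogml_size_bytes):
    """Suggest a PHP memory_limit string such as '64M' for a BlogML file.

    Assumptions (mine, for illustration): 1.5x headroom on top of the
    20x estimate, rounded up to the next power of two in megabytes.
    """
    # Integer math: size * 20 * 1.5
    needed_bytes = blogml_size_bytes * MEMORY_MULTIPLIER * 3 // 2
    limit_mb = 1
    while limit_mb * MIB < needed_bytes:
        limit_mb *= 2
    return f"{limit_mb}M"
```

For a file of about 1.6MB this suggests 64M, which matches the value that worked here. For a 5MB file it suggests 256M.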

You can download the XPath.class.php and blogml.php files that I used from here.

Problem #3: File upload problems
Once the BlogML importer seemed to be working, I immediately ran into a permissions issue. The BlogML Importer was unable to save my uploaded file to the web server due to a permissions error. I poked around a lot and the “fix” most often recommended was to change the permissions on the /wp-content/uploads directory to 777. Forgive me for working in the security field, or even being remotely security minded at all, but that’s the single most ridiculous suggestion I’ve ever heard.

If it was only made by one person, I could possibly dismiss this as just ignorance, but numerous people were suggesting that this approach was not only common, but was the recommended fix. Sorry folks. It’s not. I found that the most straightforward approach was to provide ownership of the uploads directory to apache. Immediately the problem went away, and nothing had to be made writeable by world.
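If you inherited a server where someone already followed the 777 advice, it is easy to check. This short Python sketch reports whether a directory is writable by everyone. The function name is made up for this example, and the path you pass in depends on where your WordPress install lives.

```python
import os
import stat


def is_world_writable(path):
    """Return True if 'other' users have write permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IWOTH)
```

Calling is_world_writable on your wp-content/uploads directory should return False once the directory is owned by the web server user and the permissions are back to something sane such as 755.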

Problem #4: Link redirection
This one could have been a total nightmare, but wasn’t nearly as bad as it could have been. If you have a blog that you’ve been running for any length of time, the hope is that other people have linked to your blog. Even better, there’s a steady stream of traffic headed your way. Well, to keep that traffic from drying up quickly, you’re going to need to set up URL redirection using a .htaccess file on your new web server, thus redirecting pages from
/2008/12/25/its-christmas-time.aspx to something like /2008/12/25/its-christmas-time/.

That means that you need to know exactly what every single internal link on your site is, and exactly where it goes. Once you know where all your links are, then you add a RewriteRule to your .htaccess file for each of them. This RewriteRule will perform a redirection at the web server level, sending the browser an HTTP redirect status code to indicate that the page has moved. Strictly speaking, 301 is the status code for a permanent move; the 302 that I used in my rules tells browsers and search engines that the move is temporary.

This should have been easier than it was, but I wasn’t using pretty URLs in SubText, so I had to suffer through this part of it. It didn’t take long before I came up with what I felt was an adequate solution. I poked around a lot using Google and Yahoo, looking for web tools that would crawl my site and find all of my page links for me, but I didn’t find anything that was terribly helpful. Finally, I gave up and decided to roll my own.

Using my trusty Perl skills, I wrote a website crawler which I pointed at my original blog. After reading in the main page, it parsed the page for every link on the page. If the link was local to my domain, it would retrieve the contents of that page and recursively continue to do so until it had followed every single link on my website which pointed back to itself. I ignored image references and relative URLs. I also ignored any link that was to an external website, as I have no control over those links anyway.
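For anyone who would rather not read Perl, the same crawling idea can be sketched in Python. To be clear, this is an illustration of the approach described above and not a port of subTextCrawler.pl. The class and function names, the list of image extensions, and the pluggable fetch_page argument are all choices made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".ico")  # assumed list


class LinkAndTitleParser(HTMLParser):
    """Collect every <a href> and the text of <title> from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch):
    """Follow every absolute same-domain link and return {url: page title}."""
    domain = urlparse(start_url).netloc
    titles = {}
    pending = [start_url]
    while pending:
        url = pending.pop()
        if url in titles:
            continue  # already visited
        parser = LinkAndTitleParser()
        parser.feed(fetch_page(url))
        titles[url] = parser.title.strip()
        for link in parser.links:
            parsed = urlparse(link)
            # Skip relative URLs, external sites, and image references
            if parsed.scheme not in ("http", "https") or parsed.netloc != domain:
                continue
            if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
                continue
            pending.append(link)
    return titles
```

The sketch uses an explicit list of pending URLs instead of recursion, which avoids hitting Python’s recursion limit on a blog with a few hundred pages. Passing your own fetch_page function lets you try it against saved pages before pointing it at a live site.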

Given that there was a page in SubText containing a list of archived links, this solution worked really well. I was able to capture every single link on the page and for each URL, I was able to obtain the title of the page. This made building my .htaccess file pretty hassle free. It was still a little tedious, but for a few hundred links it only took a couple of hours to search the content of my new blog for the titles that I captured and match them up to the original URLs.

Here is a link to the Perl code that I used for this. Feel free to hack away and use it for whatever you want. I’m releasing it under the GPL 3 license. Do with it what you will. To use it, simply install the Perl libraries (assuming you don’t have them) and set the primaryURL and the TLD variables. Run the subTextCrawler.pl file, and it will spit out a bunch of half-written RewriteRule lines for your .htaccess file.

The assumption is that you have your WordPress site up and running and have imported your BlogML file. I used a temporary domain pointer for this, so I was able to take the title printed on each RewriteRule line and search for the corresponding URL on my new WordPress site.

I could have gotten much fancier and searched my WordPress site using the title from the SubText site and completely automated it, but I’ll leave that for someone else to do. I’m just trying to get you most of the way there.
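Once you have matched an old path to its new WordPress path, turning the pair into a finished rule is mechanical. Here is a small Python sketch of that last step. It is a hypothetical helper, not something the Perl script does, and the function names are invented for this example. It defaults to the 302 status used in my own rules; pass 301 if you want the redirect to be treated as permanent.

```python
import re


def escape_for_rewrite(path):
    """Backslash-escape regex metacharacters (such as '.') in a URL path."""
    return re.sub(r"([.^$*+?()\[\]{}|\\])", r"\\\1", path)


def make_rewrite_rule(old_path, new_path, status=302):
    """Build one RewriteRule line redirecting old_path to new_path.

    In a .htaccess file the pattern is matched without the leading slash,
    so it is stripped from old_path here.
    """
    pattern = escape_for_rewrite(old_path.lstrip("/"))
    return f"RewriteRule ^{pattern}$ {new_path} [R={status},L,NC]"
```

For example, make_rewrite_rule("/archive/2005/08.aspx", "/2005/08/") produces a line in the same shape as the rules shown in the .htaccess section further down.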

subTextCrawler.pl

Problem #5: Learning how to actually build a .htaccess file.
I was pretty stupid the first time I was working with my .htaccess file. It turns out that there are two things you need to keep in mind when using WordPress. The first is that WordPress expects to be able to modify this file, so making .htaccess owned by apache solved the first issue. The second issue I had here was that the .htaccess file is automatically filled in with a set of rules that are dictated by your Permalink preferences. Whenever you browse to the Permalink preferences page within WordPress, this file is read, parsed, and then rewritten. All without clicking a save button.

It’s pretty irritating to put all of your RewriteRule lines in there, only to find they don’t work for some reason and not realize that it’s because the file is being overwritten whenever you browse to a specific admin page in WordPress. Your .htaccess file should look something like this:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

The trick to adding your own RewriteRule options is to add another set of instructions above the ones created by WordPress. Once you do that, your changes will not be lost whenever you browse to the Permalinks page. I’m a fan of examples, so here’s some of what I ended up with:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^archive/2005/08\.aspx$ /2005/08/ [R=302,L,NC]
RewriteRule ^archive/2005/08/21/1\.aspx$ /2005/08/21/day-11-starting-a-new-business/ [R=302,L,NC]
RewriteRule ^archive/2005/08/22/2\.aspx$ /2005/08/22/day-12-the-website/ [R=302,L,NC]
# ...
# one rule per line of subTextCrawler.pl output
# ...
</IfModule>

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
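If you end up regenerating your rules a few times, pasting them in by hand gets old. The Python sketch below shows one way to place a block of custom rules above the section that WordPress manages. It assumes the WordPress section starts with the "# BEGIN WordPress" marker shown above, and the function name add_custom_rules is made up for this example. It works on the file contents as a string, so reading and writing the actual .htaccess file is left to you.

```python
WORDPRESS_MARKER = "# BEGIN WordPress"


def add_custom_rules(htaccess_text, rewrite_rules):
    """Return htaccess_text with a custom rule block inserted above the
    WordPress-managed section, leaving that section untouched."""
    custom_block = "\n".join(
        ["<IfModule mod_rewrite.c>", "RewriteEngine On", "RewriteBase /"]
        + list(rewrite_rules)
        + ["</IfModule>", "", ""]
    )
    position = htaccess_text.find(WORDPRESS_MARKER)
    if position == -1:
        # No WordPress section found: put the custom rules at the top
        position = 0
    return htaccess_text[:position] + custom_block + htaccess_text[position:]
```

Make a backup copy of your .htaccess file before writing the result back, since a typo in a rule can take the whole site down with a 500 error.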

Conclusion:

Hopefully, someone out there finds this retelling of my experience useful and can save themselves a great deal of time and effort. Between the links above where people explained their processes, and my retelling of the problems that I ran into, you should at least have some answers as to how to tackle some of the problems you might run into.

Eventually, the Google Blog Converters project may help to allow your data to migrate between blog engines a little easier but it’s really just not there yet. Good luck!
