Archiving WordPress MU blogs to browsable format

 Mikael Willberg

 4.5.2010 · Projects · Hacking

I faced a situation in which I had to archive a few hundred blogs in browsable format from a WordPress MU (multi-user) server. The server was configured to allow only secure https connections. This seemed like an easy task, but reality was a bit different...

The natural solution was to use the wget utility, but I just ended up with files that indicated a need to authenticate to every blog. I also rechecked that the blogs were set to be public. I think this is the same issue that Andrew Harvey experienced: Saving the WordPress.com Export File and The Linked Media Files (and wget’s strictness). I also tried some Windows-based tools like HTTrack, but they produced the same result or did not have SSL support.

I think the problem is that wget obeys the RFC to the letter, which complicates things a lot. In this case the reason was that the authentication cookie was "locked" to a certain path even though it worked on every path on the server. I did not want to recompile wget to remove that annoyance, and at last I figured out a way to do this, after a lot of frustration if I might add.

Solution

The following procedure should also work with a standard WordPress site.

1. Create a dummy user on the WPMU installation.

2. Log in with that user.

3. Extract the "wordpressuser" and "wordpresspass" cookie values. Remember to check the server and path, as there might be a lot of these.

Firefox users can also use the Cookie Exporter plugin to dump all cookies to a file and copy the relevant values from there.
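If you prefer the command line, here is a minimal sketch of logging in and grabbing the cookies with curl. It assumes the standard wp-login.php form fields (log, pwd) and the test cookie WordPress of that era expects; USERNAME, PASSWORD and WEBSITE are placeholders for your own values:

curl -k -c cookies.txt -b "wordpress_test_cookie=WP Cookie check" -d "log=USERNAME&pwd=PASSWORD&testcookie=1&wp-submit=Log+In" https://WEBSITE/wp-login.php
grep wordpress cookies.txt

The second command just picks the wordpressuser and wordpresspass lines out of the saved cookie jar.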

4. Use wget to do the actual archiving. Here is the complete command.

wget -m -nv -x -k -K -e robots=off -E --ignore-length --no-check-certificate -np --cookies=off --header "Cookie: wordpressuser=USERNAME;wordpresspass=ENCODEDPASSWORD" -U "Mozilla/4.0 (compatible; MSIE 6.0; Microsoft Windows NT 5.1)" https://WEBSITE/PATH/

You can use the following table to decipher the options I used.

-m: Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps ftp directory listings. It is currently equivalent to ‘-r -N -l inf --no-remove-listing’.
-nv: Turn off verbose without being completely quiet (use ‘-q’ for that), which means that error messages and basic information still get printed.
-x: The opposite of ‘-nd’; create a hierarchy of directories, even if one would not have been created otherwise.
-k: After the download is complete, convert the links in the document to make them suitable for local viewing.
-K: When converting a file, back up the original version with a ‘.orig’ suffix.
-e robots=off: Specify whether the norobots convention is respected by Wget, “on” by default. This switch controls both the /robots.txt and the ‘nofollow’ aspect of the spec.
-E: If a file of type ‘application/xhtml+xml’ or ‘text/html’ is downloaded and the URL does not end with the regexp ‘\.[Hh][Tt][Mm][Ll]?’, this option will cause the suffix ‘.html’ to be appended to the local filename.
--ignore-length: With this option, Wget will ignore the Content-Length header, as if it never existed.
--no-check-certificate: Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate.
-np: Do not ever ascend to the parent directory when retrieving recursively.
--cookies=off: Disable the use of cookies.
--header: Send header-line along with the rest of the headers in each HTTP request.
-U: Identify as agent-string to the HTTP server.
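With a few hundred blogs to archive you probably do not want to run the command by hand for each one. A rough sketch of wrapping it in a shell loop, assuming a hypothetical blogs.txt file with one blog URL per line and the same USERNAME and ENCODEDPASSWORD placeholders as above:

while read BLOG; do
  wget -m -nv -x -k -K -e robots=off -E --ignore-length --no-check-certificate -np --cookies=off --header "Cookie: wordpressuser=USERNAME;wordpresspass=ENCODEDPASSWORD" -U "Mozilla/4.0 (compatible; MSIE 6.0; Microsoft Windows NT 5.1)" "$BLOG"
done < blogs.txt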

Problems

If the command completes and there are still files like "wp-login.php@redirect_to=%2Ffoo%2Fbar%2F.html" left over, the cookie credentials you defined are incorrect, at least for that path. Maybe that path should be archived separately using different cookie values.
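A quick way to spot those leftover files is to search the mirror directory for them, for example:

find . -name 'wp-login.php*'

Each hit points to a path that still needs a working cookie.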