Archiving ain’t easy: bringing old one-off WP sites into WPMu

Every summer we try to both update and archive some of the old projects we have set up on our various Bluehost accounts over the last 3 or 4 years. It is a painstaking process, and when you have anywhere from 50 to 100 one-off WordPress blogs, MediaWiki installs, Drupal sites, and phpBB forums out in the wild for a number of years, the possibility of kipple haunts the archivist’s soul. So, for starters, I’ve been trying to use UMW Blogs as a space to archive the numerous WordPress blogs I set up over the last few years. It’s a logical layup: import all the data, throw up a 301 redirect to the new URL, and wham, bam, one update ma’am. Sounds simple enough, and for WordPress blogs with one author it actually is quite easy.

However, when you try archiving group blogs with numerous authors into a WPMu install, the plot thickens. So, I’m going to take you through my process for archiving a group blog from back in Fall 2007, in those fertile, heady days before UMW Blogs; hell, that was even before ELS Blogs. This particular course site was centered around a directed study on Poetic Sequence led by professors Mara Scanlon and Claudia Emerson. The blog had fifteen students, all independently tracing the work they were doing over the course of the semester. The group blog became a hub for sharing ideas, assignments, project progress, and their finished works. It was an interesting model for me given that they only met as a group in an actual classroom a few times over the course of the semester. In fact, it is the closest I have ever come to designing a space for a predominantly online learning environment. A fully online classroom is still something I am very interested in trying out with some of the designs we have come up with over the years.

I’m focusing on the archiving of this site in particular because I still think it is one of the best early projects I worked on, and it was also a very particular setup that clearly illustrates the challenges of importing group blogs on a single WP install into WPMu. So, here we go with the play-by-play:

Importing a One-Off WP Group Blog into WPMu

The User Conundrum

The challenge of simply exporting and importing a group blog from a one-off WP install into WPMu has everything to do with assigning authors. Therein lies the root of about 90% of the issues when importing a one-off group blog into WPMu. See, the thing is that all the students who were part of the directed study group blog back in Fall 2007 had graduated by the time we got UMW Blogs up and running, which means they were not users on UMW Blogs. Moreover, even if a few of them were, I would have to track down each of their email addresses and usernames, add them as users to this one particular blog, and wait for them to accept my invitation so I could then map them to the posts they wrote. (Although I found a way to force add users as an über-admin: go to Site Admin -> Blogs, find the blog you are looking for, and click the edit link; from this administrative/backend screen you can force add any user you want to a blog.) So, even if they were in the system, their email would be long defunct (a huge issue with using UMW emails in UMW Blogs, which I am re-visiting thanks to D’Arcy), and adding them in this manner would prove futile.

So, to combat this issue I thought I had come up with a great idea, and when I saw Ron’s new plugin that allows you to decide what elements of a blog you want to export, I figured it was high time to try out my idea. My plan was to use FeedWordPress to pull in all the posts from the original Poetic Sequence blog (which you can see here) to the new blog on UMW Blogs (which you can see here). Why? Well, because FeedWordPress pulls in all the authors and immediately creates accounts for them. It makes my job simple: all I would need to do after that is import the pages and copy the theme into the UMW Blogs system.

A piece of cake, right? Well, kinda. It did pull in all the posts and create the authors as expected, and Ron’s Advanced Export plugin did a fine job with importing just the pages. Alas, once again the problem was with syndicating comments. What a nightmare. It’s always the comments with FeedWordPress!!!! I had no way to import the comments cleanly: there is no special way to do it with Advanced Export, and FeedWordPress, as I have noted extensively over the last year or two, doesn’t syndicate comments. I even tried to import the comments table into the archival blog on UMW Blogs, but the post IDs were thrown off and the comments did not associate themselves with any posts….fail!
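The failed comments-table import comes down to an ID mismatch: each comment row carries the post ID from the old database, while the imported posts get fresh IDs on the new blog. What follows is a minimal Python sketch of how such a remapping could work, not the method used for this archive. It assumes you have already read the posts from both blogs, and the old comments, into lists of dicts. The keys `ID`, `post_name`, and `comment_post_ID` are standard WordPress column names, but the helper functions and the idea of matching posts by slug are hypothetical illustrations.

```python
def build_id_map(old_posts, new_posts):
    """Map old post IDs to new post IDs by matching post slugs (post_name)."""
    new_by_slug = {p["post_name"]: p["ID"] for p in new_posts}
    return {
        p["ID"]: new_by_slug[p["post_name"]]
        for p in old_posts
        if p["post_name"] in new_by_slug
    }


def remap_comments(comments, id_map):
    """Return (remapped, orphaned) comment rows without mutating the input.

    remapped: copies of rows whose comment_post_ID now points at the new post.
    orphaned: rows whose old post has no match on the new blog.
    """
    remapped, orphaned = [], []
    for row in comments:
        new_id = id_map.get(row["comment_post_ID"])
        if new_id is None:
            orphaned.append(row)
        else:
            remapped.append({**row, "comment_post_ID": new_id})
    return remapped, orphaned
```

Keeping the orphaned rows in a separate list, rather than dropping them, makes it easy to see which comments would still need to be matched by hand.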

So, I had to delete the syndicated posts and bring them in again through the import file. But the cool thing I discovered is that FeedWordPress had created the users from the one-off WP blog, and they were still there, so I could now map the authors appropriately from the import file. So, my idea kind of worked, though it adds an extra step.

The Theme and Plugins

I did eventually get all the posts and comments in and assigned to the proper authors, so then I turned to preserving the original theme. This is one of the things that works beautifully in WPMu, and one of the things I love about it. All I had to do was copy the theme into the wp-content/themes folder and make it available to everyone for a split second. After that, I enabled Userthemes for that blog, and the theme was automatically copied into the upload directory for that blog (something like blogs.dir/2477/themes). After that, I can delete the theme from the wp-content/themes directory and still have an archived version of an old theme, which I can edit to make it match the original site perfectly. Do your own comparison between the two here and here.

With plugins, I had very few on the original blog: podPress and a quotes plugin called Yarq (which is long defunct). We have podPress on UMW Blogs (although I hate it and want to get rid of it), so I simply grabbed the mp3 URLs from the associated fields on posts with audio (there was only one in this instance) and copied them into the post directly. Why? Well, one of the great benefits of Anarchy Media Player is that it will convert any URL ending with mp3, mp4, etc. into a Flash media player automatically, no strings attached. For the random quotes in the sidebar, I had to download the table created by the Yarq plugin and copy and paste the quotes into the slick Quotes Manager plugin we have in UMW Blogs.

Blogroll


Blogroll links, or just sidebar links in general, are always a special case. To import a blogroll you have to go to Tools -> Import and add the suffix wp-links-opml.php to the URL of the original blog. So, for example, I grabbed http://poetic-sequence.elsweb.org/blog/wp-links-opml.php, and wham, it’s all imported.
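For anyone who wants to check what a blogroll export contains before importing it, here is a small Python sketch. The function names are made up for illustration, and the sample OPML in the usage note is invented; the sketch simply assumes the export is OPML in which each link is an outline element carrying an htmlUrl attribute.

```python
import xml.etree.ElementTree as ET


def blogroll_opml_url(blog_url):
    """Append wp-links-opml.php to a blog URL, avoiding a doubled slash."""
    return blog_url.rstrip("/") + "/wp-links-opml.php"


def list_blogroll_links(opml_text):
    """Return (title, url) pairs for every outline element that has an htmlUrl."""
    root = ET.fromstring(opml_text)
    return [
        (node.get("text", ""), node.get("htmlUrl"))
        for node in root.iter("outline")
        if node.get("htmlUrl")
    ]
```

Calling blogroll_opml_url("http://poetic-sequence.elsweb.org/blog/") gives the same address used above. The OPML text itself would need to be downloaded separately and passed to list_blogroll_links as a string.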

Links and Files

Now, to add another dimension to the archiving, there were a whole bunch of uploaded files in the wp-content/uploads directory of the one-off WordPress blog that were linked to from within a number of posts. To make the links cleaner, I downloaded the uploads directory and copied its contents (not the uploads directory itself) into the files folder for that blog on the WPMu install, in this case blogs.dir/2477/files/. Once you do this, you can change the existing links from …/wp-content/uploads/2006/09/image.jpg to the following path for WPMu: …/files/2006/09/image.jpg
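The copy step can also be scripted. Below is a rough Python sketch, assuming you are working on a local copy of both directory trees; the function name is hypothetical, and the paths you pass in are whatever your own setup uses. Note that files with the same relative path in the destination would be overwritten.

```python
import shutil
from pathlib import Path


def copy_uploads_contents(uploads_dir, blog_files_dir):
    """Copy everything inside uploads_dir (not the directory itself) into blog_files_dir.

    Returns the sorted relative paths of all files now present in blog_files_dir.
    """
    src = Path(uploads_dir)
    dst = Path(blog_files_dir)
    # dirs_exist_ok (Python 3.8+) merges into an existing files/ folder
    # instead of failing; same-named files are overwritten.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())
```

Because the year and month folders are copied as-is, a file that lived at uploads/2006/09/image.jpg ends up at files/2006/09/image.jpg, which is what makes the simple link rewrite possible.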

Now, Shannon Hauser did all the legwork on changing the various links in the Student Projects section for the site, and this is still far too laborious. I should have done a find and replace in the exported XML file before importing it, but I forgot this step.
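The find and replace that got skipped could look something like the following Python sketch. It is only an illustration: the function name is hypothetical, and the old and new base URLs in the usage note are placeholders rather than the real addresses. It swaps the wp-content/uploads prefix for the WPMu files prefix across the whole export text and reports how many links it touched.

```python
def rewrite_upload_links(export_xml, old_base, new_base):
    """Rewrite old upload links in an export file to the WPMu files path.

    Returns (rewritten_text, number_of_links_replaced).
    """
    old_prefix = old_base.rstrip("/") + "/wp-content/uploads/"
    new_prefix = new_base.rstrip("/") + "/files/"
    count = export_xml.count(old_prefix)
    return export_xml.replace(old_prefix, new_prefix), count
```

For example, with old_base "http://old.example.org/blog" and new_base "http://new.example.org/poetic", a link ending in /wp-content/uploads/2006/09/image.jpg would come out ending in /files/2006/09/image.jpg. Checking the returned count against a quick search of the original file is a cheap sanity check before importing.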

MediaWiki

This class also had a MediaWiki install, though it was not very successful during the course of the semester. It was used as a space to build a bibliography of primary and secondary sources, and that was just one page. However, one student wrote two of his papers in the MediaWiki for the class. Rather than trying to preserve the whole MediaWiki, I just copied the three pages into the UMW Blogs wiki here, which will allow me to get rid of the original and all its ugly spam. One cool side effect of this is that, using Wiki Append (or the plugin formerly known as Wiki INC), I was able to pull the student’s wiki papers into blog pages in the Student Projects section. (You can see the wikified papers here and here, and see them here and here as blog pages being pulled in with Wiki Append.)

Conclusion

This process is still far too laborious and difficult. It needs to be far, far easier than this if we are going to preserve some of the stuff we have done over the years.

This entry was posted in UMW Blogs, WordPress, wordpress multi-user, wpmu.

17 Responses to Archiving ain’t easy: bringing old one-off WP sites into WPMu

  1. Tom says:

    Don’t know if this helps or adds confusion but to get the uploaded files in I manually put them in a parallel folder structure on WPMU at wp-content/uploads/whatever (if the blog had >2007>11 or whatever I put it there in that way).

    Then on import I hit the download and import files option and things seem to end up working just fine even when the original blog is killed off.

    Seems like based on how easy it is to change the url of the WPMU blog you could do this and skip the manual file move step that I do and leave the old blog up on import then delete it. Maybe I’m wrong.

    I didn’t do it that way originally because I thought I had to get rid of the original to take over the original URL.

    Appreciate seeing how you think through this. My own thought process is clearly muddled.

    Fecundidly yours,

    Tom

  2. Joss Winn says:

    Jim,

    JISC have just funded a project by the British Library and University of London Computing Centre to look at archiving blogs using FeedWordPress.

    http://archivepress.ulcc.ac.uk/

    I guess you should all talk! 🙂

  3. Reverend says:

    @Tom,

    That’s a clean solution, but I wonder if it will work on export? I think the export will look for the files in the blogs.dir, but I have to check on that. Your solution is quite fecund with possibilities 🙂

    Joss,

    Wow, I think Tony pointed me to that site, and I do like what they are doing. The comments are the real killer in that solution though, and if they figure out how to link comments to posts in FeedWordPress, I will worship them. I guess I will look them up, thanks Joss!

  4. Andrea_R says:

    I just skimmed and will be back later, but again – nice work, great content. 😀

    Glad you found Ron’s plugin useful. I know *I* did.

  5. Tom says:

    I believe I’m doing the exact thing you’re doing and it works for me.

    I’ve got a bunch of one offs at urls that I want to replace but not lose the link. So I’m exporting them to WPMU at the same URL. When I do the import as described it works just fine. Not sure why exactly, I figured it was because of the way the htaccess stuff (or whatever) redirects.

  6. Ron says:

    Glad the plugin did the trick for you 🙂

  7. Reverend says:

    @Tom,

    Yeah, I definitely see the value of creating the wp-content/uploads path, but what happens when you do this for several blogs? Do you have conflicts with the year and month directories? Or do you put all those files from the disparate blogs in the same directory? Maybe I am missing something, demonstrate your kung-fu 🙂

    @Andrea and Ron:
    Plugin works great, just need to figure out exporting comments separately from posts, or at least program the fix for FeedWordPress 🙂

  8. Tom says:

    I put them all together so if I have two 6 directories I just put the stuff from both inside one 6 directory. I’d have minor trouble if I then had duplicate file names w/in there but haven’t hit that so far. Be a fairly easy change to make though and I’d get the “You really wanna replace this?” error to warn me.

    I know no fu. I practice random acts of computer violence without rhyme or reason.

  9. Hi Jim

    They say great minds think alike… and fools seldom differ!

    I’m working on the ArchivePress project, and hope we can keep in touch. Our chief aim is to find an alternative to the crawling/spidering approach of much web archiving, that treats blogs as static pages, and ignores the potential of treating them as the dynamic data sources they really are: the Post, not the Page, is the atom of blogging.

    At the moment we are trying out scenarios much like yours using a single WP installation, but once we’ve got our basic concepts proved, WPMU is the logical next step.

    As you say, importing Comments is the real killer in making this approach an acceptable solution; and ideally we want to be non-invasive of the original blog (not needing login, API key etc). My current idea is to emulate the functionality of FeedWordPress, automatically seeding the aggregator with the URL of the Comments RSS-feed for each Post imported – as long as we can reliably find it in the feed item entry. If you come across anyone who manages to make any progress with that before we do, please let me know!

    Better get back to write something for our blog now…!

  10. Reverend says:

    Hi Richard,

    I’ve seen your work and it is a brilliant idea; I’ll definitely be in touch. And the idea of allowing comments to show up on both the original and aggregated blog post is really the issue. A cross-aggregation mechanism should be fairly straightforward for WordPress given they now link comments directly to posts. The real issue, though, comes down to the fact that all the different services like Blogger, LiveJournal, TypePad, etc. have very different ways of dealing with comments, making this a half-baked solution. I think the solution may lie in larger identity management for comments across the internet, a place where people control their comments much like they control their blog space. But that is beyond me, and Stephen Downes has talked about this in some depth in regards to identity management, and it makes sense. But for now I, like you, am searching for the quick fix.

  11. Joss Winn says:

    I don’t know if this will contribute to solving your problem, but the JISCPress project will release a WordPress plugin which will produce RSS feeds for comment authors, based on their email address. Feeds for individual comment authors are not currently available on WordPress.

    http://code.google.com/p/jiscpress/issues/detail?id=8

    We hope to have this ready for the end of July.

  12. @Jim I’m hoping our use case will allow us to sidestep the cross-commenting conundrum: we really don’t want comments direct on archived blog posts (or comments)

    @Joss Will await this with interest, in case there’s some ideas we can borrow. Though as I mentioned, our centre-of-attention is the Post – but I see the Comment feed is quite central to your plans.

  13. Reverend says:

    Richard,

    Out of curiosity, how would comments not be key to an archive of a medium like blogs? I really think that without archiving the comments, you lose much of the conversation. I have to admit, I can see a ton of use cases where the comments are not vital, but for blogs I would think they would be crucial.

  14. Perhaps not making myself clear – archiving comments from the live blog is essential – I just meant it doesn’t need to be a two-way thing (like I think you discussed here). In effect, on the aggregating/archiving blog, “comments are closed” – though they can continue to be imported from the live blog.

    BTW, this citation plugin is nice too – just the kind of unobtrusive scholarly/academic/library accoutrements that could add value to the collection for the researcher-of-the-future.

  15. Reverend says:

    Richard,

    Sorry for my confusion. Treating that as a kind of reflection of the comments on the original blog makes a lot of sense, and feedburner already does that for wordpress.com blogs, which is a nice feature.

    I also like that citation tool. We’re planning on playing with Zotero now that it has a shared online database and feeds; using it to pull in bibliographies based on an assortment of feeds might prove interesting.

    There is a good example of this explained here.

  16. Pingback: ArchivePress » Blog Archive » Data and views and views of data

  17. Pingback: Data and views and views of data « ArchivePress + APrints
