Archiving Twitter

Thanks to this post on the OL Daily, and this subsequent toot from Grant Potter reminding me to do it, I spent time converting my recently acquired Twitter archive into markdown. “Why?” you ask. Well, Matthias Ott covers that beautifully:

Once your archive is on your machine, you will have a browsable HTML archive of your tweets, direct messages, and moments including media like images, videos, and GIFs. This is nice, but it also has a few flaws. For one, you can’t easily copy your Tweets somewhere else, for example, into your website because they are stored in a complex JSON structure. But even more dangerous: your links are all still t.co links. This hides the original URL you shared and redirects all traffic over Twitter’s servers. But this is not only inconvenient, it is also dangerous. Just imagine what happens when t.co ever goes down: all URLs you ever shared are now unretrievable.

I like the idea of media links being relative to wherever I host my archive. If you’re gonna get a personal archive, you might as well have one that doesn’t point media back to the original media source. The Tweet permalinks still link back to the originals, and I’m going to leave my Twitter account up and keep the archive on hand in the unlikely event that Twitter goes away any time soon. It just feels more complete to have an accessible archive collecting dust on my hard drive.

python3 parser.py

So, if you have a recent version of Python 3 installed on your computer, this process is just a small script away thanks to Tim Hutton’s Twitter Archive Parser. Just run the above command from the unzipped Twitter archive directory on your machine, and let it ride!

The script takes the images in your archive and rewrites their URLs to point to local files rather than back to Twitter’s t.co links. You can browse this archive, local images and all, using the TweetArchive.html file that is now in your directory, alongside a media directory (which, that’s right, holds all your images, videos, etc.) and markdown files of your tweets and DMs.

Image of Twitter Archive Parser script working through my archive to download best possible available versions of the media


I had 5120 media files, and this script was able to get the best available versions online for all but 12 of them. That’s right, I’m only missing 12 media files of a possible 5120 (or so I believe). It kept trying and retrying to find the best available images for a bit, and in the end a grand total of 12 came back as inaccessible. I may be misreading something here, but if this is true it is reassuring to think I have all these images I may never reference again 🙂

I also realized that all my direct message conversations were there, although the other person’s handle is not immediately identifiable in the original HTML version. That said, the conversations are definitely accessible, and the parser breaks the DMs down by the folks you chatted with. Given DMs were meant to be private, you will want to take them out of your archive if it is to be publicly accessible. This also points to an interesting discussion around DMs in Mastodon, which are not encrypted, so the instance administrator can find and read them. There might be a push to encrypt those messages, which is definitely something server admins will have to think through, given this is a space where potentially sensitive data can be laid bare to others. It might also be a good reminder not to put sensitive data in DMs, which is my takeaway from seeing all this data bundled up in a neat little zip ball 🙂
