This post has actual graphs with numbers and data, so I kinda feel like Tony Hirst right now; I just won’t be as smart. Zach Davis did kernel updates on the ds106 student server as well as on the main ds106 site server. One of the reasons we’ve been having so many issues with that site is not necessarily traffic, because ds106.us is not a highly trafficked site by any means. Rather, it’s a resource-intensive site because of how many blogs it is syndicating in on a regular basis. Fact is, the syndication bus using FeedWordPress is a resource-intensive affair.
Exhibit A, this first graph is the server that hosts all ds106 students (maybe 75-100 low traffic blogs) that have gotten hosting with Cast Iron Coding:
As Zach notes: “The CPU consumption on this server is what you’d expect. The spikes probably represent backups, but otherwise it just kinda hums along, never getting above 50% of CPU usage.”
Now, look at the DS106 server that we just moved to:
To quote Zach again:
Woah! See how this server is repeatedly using up 100% of CPU? This is why we had to move you to a dedicated server, because the site is a crazy resource hog. I don’t think it’s a resource hog because of traffic. I think that the problem is probably your feed import script….needs some kind of throttling built into it so that it works better.
And that is exactly right. What I am gonna try and do with any of the extra funds, assuming there still are some, is see if the developer of FeedWordPress (amongst other things we need from him) might consider integrating a few ways to make the syndication less resource-intensive, so we can scale this kind of aggregation hub up a bit without peaking out on resources. Anyway, I found it interesting when Zach shared it with me, and given he doesn’t blog because he is in Portland and that city is basically one, big organic blog in the new flesh, I figured I’d just share it out here :)
So this is probably a pointless rathole, but it came to mind…;-) Time series analysis…
I’ve not done much TSA myself (limited pretty much to first fumbling steps in http://blog.ouseful.info/2011/01/15/matplotlib-detrending-time-series-data/ ) though I do keep meaning to work through this R based tutorial… http://www.stat.pitt.edu/stoffer/tsa2/R_time_series_quick_fix.htm
Yep. Aggregation kills websites. For my site (downes.ca and MOOCs) I set up a queue system. Cron fires once a minute. Each time, the next feed in the queue is harvested and processed (processing is pretty CPU-intensive; that’s more likely what your spikes are). The queue is calculated dynamically, by a ‘last harvested’ date that is updated when a feed is harvested. This means that harvesting is a constant low-level activity taking place in the background rather than an intense burst of activity taking place periodically.
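To make the queue idea above concrete, here is a minimal sketch in Python. It is not the code that runs on downes.ca or in FeedWordPress; the `Feed` class, `next_feed`, `harvest_one`, and the `fetch` callback are all hypothetical names invented for illustration. It assumes something like cron calls `harvest_one` once a minute.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Feed:
    # Hypothetical record: one syndicated blog and when it was last harvested.
    url: str
    last_harvested: datetime


def next_feed(feeds):
    """Pick the feed that has waited longest since its last harvest."""
    return min(feeds, key=lambda f: f.last_harvested)


def harvest_one(feeds, fetch, now=None):
    """Harvest exactly one feed per call (e.g. one cron tick per minute).

    `fetch` is a caller-supplied function that downloads and processes
    a single feed URL; it stands in for the real import step.
    """
    if not feeds:
        return None
    feed = next_feed(feeds)
    try:
        fetch(feed.url)
    finally:
        # Update the timestamp even if the fetch failed, so one broken
        # feed cannot sit at the front of the queue forever.
        feed.last_harvested = now or datetime.now(timezone.utc)
    return feed
```

Because the "queue" is just a sort on the timestamp, there is no separate queue to maintain: with 90 feeds and one call a minute, each feed would come around roughly every 90 minutes, and the load stays flat instead of spiking.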
Another issue is Magpie, the PHP RSS aggregator. http://magpierss.sourceforge.net/ Like most parsers, it takes the entire file and builds a single data structure, which is then processed. So each time you harvest a feed, the entire feed is stored in memory (and if you’re harvesting 90 feeds at once, well…). I prefer to process RSS serially; processing it line by line. It’s harder to code (and gRSShopper still needs some improvements, like reading it from a file, not a string variable) but you can process anything (including whatever non-RSS data may have been stuffed into a feed) and it’s a lot more efficient.
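For a rough picture of what processing a feed piece by piece looks like, here is a small Python sketch. It is not how gRSShopper or Magpie works: it swaps in the standard library's event-based `xml.etree.ElementTree.iterparse` rather than true line-by-line reading, and the function name `iter_item_titles` is made up for the example. It also assumes a well-formed, un-namespaced RSS 2.0 feed, so unlike the approach described above it would raise an error on non-XML junk stuffed into a feed.

```python
import io
import xml.etree.ElementTree as ET


def iter_item_titles(source):
    """Yield each <item>'s title as soon as that item has been read.

    `source` is a file path or a binary file-like object.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title", default="")
            # Discard the item's children and text once handled; the
            # emptied element stays attached to its parent, but the
            # bulky content is freed.
            elem.clear()
```

Called on an open file or HTTP response, this handles one item at a time and releases its content before moving on, instead of building the whole feed as one structure first.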