Day 33: Choose your languages wisely.

STATUS: Welp, I wasted an entire day trying to write multi-threaded PowerShell code. In other words, I moved two steps forward and two steps back.
MOOD: I actually feel surprisingly alright with this.

——————

Let’s get technical for a second.

I am using PowerShell to create the MVP for thinglistr’s backend because it’s the language that I know most extensively right now. I’m also a C# and F# developer (I miss writing in F# A LOT), but I needed to get this thing in the air as quickly as possible, and PowerShell is much easier to write and test in an SSH session than F# (despite its console, `fsi`) or C#, so I chose it. (Before you say “Y U NO PYTHON,” my Python knowledge pales hard in comparison to my PowerShell knowledge, and I wanted to spend more time building than learning the semantics of a language.) I’m also pretty set on using C# or F# for production (F# is great for transactions and multi-threaded code; immutability by default helps a lot).

This setup has worked pretty well so far. Working with RESTful queries and caching things (primitively) is pretty straightforward, and it does the backend stuff well enough to satisfy my needs right now. That didn’t stop me from wondering whether I could make things a little faster, though. It would be much more efficient for the business polling and review parsing code to be multi-threaded, so I spent some time attempting to parallelize the business polling code.
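To give a sense of what “caching things (primitively)” can look like in PowerShell, here’s a minimal sketch: a script-scoped hashtable in front of `Invoke-RestMethod`. The function name, URL, and cache shape are all illustrative assumptions, not the actual thinglistr backend code.

```powershell
# A script-scoped hashtable acting as a primitive in-memory cache.
$script:cache = @{}

function Get-BusinessData {
    param([string]$BusinessId)

    # Serve from the cache if we've already fetched this ID.
    if ($script:cache.ContainsKey($BusinessId)) {
        return $script:cache[$BusinessId]
    }

    # Otherwise hit the (hypothetical) REST endpoint and remember the result.
    $result = Invoke-RestMethod -Uri "https://api.example.com/businesses/$BusinessId"
    $script:cache[$BusinessId] = $result
    return $result
}
```

Nothing fancy — no expiry, no size limit — but for a single long-lived session it cuts out repeat API calls.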

Big mistake. Sort of.

Unlike Python, PowerShell doesn’t have any native support for .NET threading. I’m guessing that this is because many of the underlying language primitives and/or the runspace within which PowerShell session state is saved aren’t thread-safe, and weren’t designed to be consumed as raw .NET threads to begin with. In their place, PowerShell uses “Jobs,” which are essentially scriptblocks or scripts executed within forked PowerShell sessions whose results are streamed back to their parent PowerShell session. While they’re better than having nothing at all, they are quite limiting in a few ways:

– PowerShell sessions usually consume about 20MB of RAM (or more, depending on your $PROFILE and Modules directory), whereas .NET threads take up about 1MB.
– Every PowerShell session loads your $PROFILE, which can be highly, highly wasteful depending on its size and can delay initialization.
– Scriptblocks sent to every child session run within isolated runspaces, so sharing objects or state from the parent session is impossible without some modification.
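The isolation point is easy to demonstrate. A sketch (the variable name is just for illustration): a parent-session variable is invisible inside a job’s runspace unless you explicitly pass it in, at which point it gets serialized across the boundary.

```powershell
# Child jobs run in isolated runspaces, so parent variables aren't visible.
$greeting = "hello from the parent"

# Inside the job, $greeting is undefined -- it never crosses the boundary.
Start-Job -ScriptBlock { "Job sees: [$greeting]" } |
    Wait-Job | Receive-Job    # -> "Job sees: []"

# State has to be handed over explicitly via -ArgumentList.
Start-Job -ScriptBlock { param($msg) "Job sees: [$msg]" } -ArgumentList $greeting |
    Wait-Job | Receive-Job    # -> "Job sees: [hello from the parent]"
```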

(I know that PowerShell 4.0+ supports workflows, but they are also based on the Jobs engine, have the same set of problems plus a different set of quirks, and I know how to work with Jobs better.)

Knowing these limitations, I decided to spend some time creating a custom job queue implementation for PowerShell that:

– Creates the concept of job pools by creating jobs that share common identifiers,
– Throttles the addition of jobs into job pools automatically when they go above a certain threshold,
– Reloads dependencies in the runspaces for every job,
– Serializes function calls and their arguments so that the consumer of the pool doesn’t have to, and
– Outputs errors and results from jobs within the pool as a pscustomobject for easy manipulation.
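For flavor, here’s a rough sketch of the pooling-and-throttling idea: treat a shared name prefix as the “pool,” block new submissions while too many pool jobs are running, and fold results and errors into pscustomobjects. This is an assumption-laden toy, not my actual implementation — the function names, the threshold, and the GUID-suffix convention are all made up for illustration.

```powershell
function Add-PooledJob {
    param(
        [string]$PoolName,
        [scriptblock]$ScriptBlock,
        [object[]]$ArgumentList,
        [int]$MaxConcurrent = 3
    )

    # Throttle: wait until the pool has a free slot.
    while (@(Get-Job -Name "$PoolName*" |
            Where-Object State -eq 'Running').Count -ge $MaxConcurrent) {
        Start-Sleep -Milliseconds 250
    }

    # The "pool" is just a naming convention over ordinary jobs.
    Start-Job -Name "$PoolName-$([guid]::NewGuid())" `
              -ScriptBlock $ScriptBlock `
              -ArgumentList $ArgumentList
}

function Receive-PoolResult {
    param([string]$PoolName)

    # Fold each job's output and errors into a pscustomobject.
    Get-Job -Name "$PoolName*" | Wait-Job | ForEach-Object {
        [pscustomobject]@{
            Job    = $_.Name
            State  = $_.State
            Output = Receive-Job -Job $_ -ErrorVariable jobErrors 2>$null
            Errors = $jobErrors
        }
    }
}
```

Even this toy version shows where the memory problem comes from: every slot in the pool is a whole forked PowerShell session, not a thread.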

It took >10 infuriating hours for me to get this working right, but I eventually got it…and found it to be WORSE than what I had before for a few reasons:

– My AWS instance is very memory-constrained (it only has 1GB of RAM), so when taking OS overhead into account, it only took a few concurrent PowerShell sessions to send my machine into swap city (10 PowerShell sessions = ~200 threads = 200MB).
– PowerShell jobs don’t have access to the parent host console because of the whole isolated-runspace thing, so there’s no way for me to easily debug issues within jobs when they come up until AFTER they’re done executing. This could be problematic if issues arise while retrieving business data from Google, since their API request quota is really low for server-side requests (2000 requests/day).
– God it was so fucking slow.

What was funny was that I wasn’t that upset about this. I was much more upset whenever I ran into a bug and thought that I’d have to spend my entire weekend on this. Now that I know that this approach sucks (for now; I think it’ll be much more useful when I move this onto a real programming language), and since I still have the original code that does this serially, I can spend today focusing on the real Goliath: event discovery within Twitter.

Onwards!
