2010 08 18

Feature Cast for 2010-08-18

(00:17) Intro

(01:58) Inner Chapter: Scalability

  • In recent years, I have spent an increasing amount of my time on scalability
    • Building features is pretty straightforward
    • Usually the hardest part is figuring out what, exactly, is needed
    • This is consistent with my previous discussion about program speed and performance
    • Even the most complicated feature only needs to be built once
    • Sometimes it takes multiple iterations to get it right, of course
    • I've had good success with my current employer
      • Using an agile methodology to better cope with unclear requirements
    • I worked on a quota system for our software as a service
      • When storage is costly, being able to keep any one tenant
        • From taking more than their fair share is critical
      • Unfortunately, no one had a good sense for what was fair to customers
        • And sustainable for the business in terms of the cost of resources
      • There was not a lot of data available to inform guesses
      • The version of the product that incurs disk storage & bandwidth costs for the business
        • Is still very new, with only a handful of early adopters using it
      • No one was confident that the numbers we could crunch for these users
        • Were really representative of the majority of customers who will upgrade in coming months
      • Each time I delivered an iteration on the last thinking
        • We got closer, with better feedback
      • Even though it took three attempts to dial in the feature
        • Once it was finished, it didn't need any more work
    • Scalability defies the idea of being able to solve it as a problem once
    • New features change how a user base stresses an application
    • Dealing with the issues that are unique to applications at scale
      • Is often the price of building a successful offering
    • It also is the kind of problem you want to try to be ahead of
    • The skill and knowledge required usually take you
      • Deeper into your application platform of choice
    • To build features, you need a sense of the capabilities of your platform
      • In the sense of what is possible and what APIs to use
      • For instance, how you store and retrieve data
      • Or how you make a request from some other server
    • That functional, working knowledge is not enough for scaling an application
    • You have to start to understand what goes on under the hood
    • Doing so will reveal what resources are shared and how
      • And other complexities that may arise
      • When you have many, many users running through your code at once
  • I talked about performance, but scalability is different
    • http://thecommandline.net/2010/01/06/speed_performance/
    • The problems and techniques I discussed
      • Are applicable but assume a single thread of execution
    • Usually that means a single user
      • And the problems don't vary between a single user alone on a system
      • And that same user alongside some amount of activity by other users
    • Scalability is all about behavior with increasingly large amounts of users
      • Against constrained resources, whether those are managed as is
        • Or using some design to stretch them further like resource pools
    • It is an unavoidable aspect of most modern software
      • As more and more of it is increasingly exposed through the web
    • For traditional performance, problems are somewhat predictable
    • That is why I spent some time on algorithmic complexity
    • Cost of operations for algorithms is ideally a function of input size alone
    • That is to say, for a specific algorithm, if you know how much data flows in
      • You can figure out how long it will run, how much CPU it will utilize
        • And the amount of memory and disk storage it will need
    • The calculation may be non-trivial, but it is possible
      • And even close approximations are useful in terms
        • Of gauging the relative efficiency of comparable algorithms
    • Multi-user systems certainly are informed by the orders of their algorithms
    • There are many shared resources though that are much harder to characterize
    • Large scale server applications are inherently parallel
    • New problems start to arise due to the timing of more than one user
      • Trying to access some limited resource
    • In an ideal environment, you would scale up all of the components
      • So that you could treat each user as the only one on the server
    • That simply isn't practical
    • First, users may not make consistent use of say a database connection
    • They may pause to look over some data or any other reason
      • During which time a connection would be idle and wasted
    • Part of scaling, then, is to figure out how to better match
      • What is available to actual demand under normal usage
    • What makes this more difficult is normal usage is anything but normal
    • More users almost always means a different load demand
      • Not just more of what you've seen
    • I'll give you an example that I think illustrates this from my own experience
    • I worked on a real time auction system
      • Where performance during an event was acceptable for about ten active bidders
    • More than that and the system unaccountably started to slow down
    • I was able to run tests and gather data that ruled out the database and the front end code
    • After some thought, it turned out to be some session data
    • We had a heavyweight authentication certificate
      • Carrying the identity of the logged in user and some related data
    • It was naively configured as a remotable object
    • For ten or so users, the cost of making remote calls between the web and application servers
      • Wasn't even noticeable
    • Beyond that, the calls got slower and the rate of slowdown accelerated rapidly
    • One solution would have been to tune the remote connection
      • Perhaps increasing the file handles on either end
      • Or beefing up memory
    • In this case, the data never changed once a user logged in
    • Sending the entire data once to the web server made much more sense
      • And broke the scalability issue so events could scale up to hundreds of users
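  The fix translates to any stack, even though that system was Java with a remotable session object. Below is a minimal, hypothetical Python sketch of the before and after; the function names and the simulated 5ms hop are made up for illustration. The point is replacing a remote call on every request with a one-time copy of the immutable login data into the web tier.

    # Hypothetical sketch: replace a per-request remote lookup with a
    # one-time copy of immutable login data into the web tier's session.
    import time

    REMOTE_LATENCY = 0.005  # pretend each hop to the app server costs ~5ms

    def fetch_principal_remote(user_id):
        """Stand-in for a remote call to the application server."""
        time.sleep(REMOTE_LATENCY)
        return {"user_id": user_id, "roles": ["bidder"]}

    # Before: every request reaches back to the app server for the principal.
    def handle_request_remote(user_id):
        return fetch_principal_remote(user_id)["roles"]

    # After: copy the never-changing principal once at login, read locally.
    local_sessions = {}

    def login(user_id):
        local_sessions[user_id] = fetch_principal_remote(user_id)

    def handle_request_local(user_id):
        return local_sessions[user_id]["roles"]  # no network hop per request

    if __name__ == "__main__":
        users = range(50)
        for u in users:
            login(u)
        start = time.time()
        for _ in range(10):
            for u in users:
                handle_request_remote(u)
        print("remote per request: %.2fs" % (time.time() - start))
        start = time.time()
        for _ in range(10):
            for u in users:
                handle_request_local(u)
        print("copied once at login: %.2fs" % (time.time() - start))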
  • BREAK
  • Load is difficult to simulate effectively
    • Doing so naively is pretty easy
    • You can easily make up some activity that sounds reasonable
    • For example (a toy script along these lines is sketched at the end of this section)
      • Log into a web application
      • View some data
      • Add or change data
      • Then log out
    • Small differences in the timing of users doing the same thing
      • Can have a profound effect on when threads working on their behalf
      • Access the same resources within the system
    • Worse, as those threads get bound up on something
      • It will start to vary the response times back to the user
      • Which then affects the elapsed time before they start the next step
    • Users are rarely so obliging as to do the same thing at the same time anyway
    • Their activity is as likely to be governed by random distractions
      • As accomplishing some task anticipated by designers and programmers
    • If your application has any kind of notifications, like an inbox
      • Or offers different ways to navigate into the same screen
      • Or to accomplish the same or similar tasks
      • Then your users are likely to show more random seeming activity
    • One solution I've seen pursued is to try to replay logs
    • For this to be effective, the test system must match whatever generated the logs
    • Data storage solutions behave differently at different scales
    • In relational databases, indexes tend to degrade as they get larger
    • Even before they degrade noticeably
      • The database server may decide to use different indexes
      • Depending on how effective it predicts each may be for the actual data stored
    • Setting up a realistic data store is itself a challenge, then
    • The best way would be to copy a production database
      • But that raises privacy concerns which are tricky to sort out
    • If you anonymize the data, you need to make sure doing so
      • Doesn't invalidate the logs you want to play back
    • I've never worked on a project that actually managed to replay production logs
    • The closest I have gotten is analyzing logs to come up with a simpler simulation
      • Then hand crafting playback through some software that mimicked network devices
    • I got some good results with this technique
      • And was able to dial it in further by comparing the logs from the test system
      • To the production logs captured when a scalability problem was experienced
    • The risk of working backwards like this is that you may miss the real cause
    • Reproducing the symptoms with a test environment
      • Doesn't guarantee that you reproduced the same cause
    • I've deployed fixes worked out from simulations
      • That didn't completely solve customer complaints
      • Leading me back into the simulations, usually with some further user observations for guidance
    • If you think about it, it makes sense that load will also vary
      • Based on the kind of application you are building
    • There are even some terms to roughly group activity
    • This is where you will hear discussions of read vs. write loads
      • Or more formally of decision support vs. transaction processing respectively
    • Even understanding these basic categories and where your application falls
      • Can help in terms of getting even a first approximation simulation
      • As well as leading you to some appropriate starting points to research solutions
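  As a concrete illustration of the naive scenario above, here is a deliberately simple load script in Python, with randomized think time between steps. The base URL and paths are hypothetical; a script like this should only ever be pointed at a test instance.

    # Naive "log in, view, edit, log out" load script with random think time.
    # BASE and STEPS are hypothetical; point this at a test system only.
    import random
    import threading
    import time
    import urllib.request

    BASE = "http://test-app.example.com"
    STEPS = ["/login", "/view", "/edit", "/logout"]

    def one_user(user_id, results):
        timings = []
        for path in STEPS:
            start = time.time()
            try:
                urllib.request.urlopen(BASE + path, timeout=30).read()
                timings.append((path, time.time() - start, None))
            except Exception as exc:
                timings.append((path, None, str(exc)))
            # Random "think time": small shifts like this change when the
            # server-side threads collide on shared resources.
            time.sleep(random.uniform(0.5, 5.0))
        results[user_id] = timings

    def run(concurrent_users=25):
        results = {}
        threads = [threading.Thread(target=one_user, args=(i, results))
                   for i in range(concurrent_users)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    if __name__ == "__main__":
        for user, timings in sorted(run().items()):
            print(user, timings)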
  • I have found very few resources to help
    • Understand the kinds of problems you will encounter
      • And some general strategies or ways of thinking about solutions
      • To gauge whether they will improve scaling or not
    • Tools exist for gauging specific resource consumption and load
    • They can also be built relatively easily if you are familiar with instrumenting your code
    • Such specific implements are best used when you have a good framework
      • To guide your reasoning about their application
    • More and more people are writing about page load times
      • But not so much attention paid to back end resources
    • The best I've been able to turn up are guides
      • On very specific components like server threading and databases
    • These are far from a complete application, though
    • Even for the same part of a stack, like a database
      • Two different projects can make very different use
    • Often you have to try to generalize out some technique
      • Like how to generate a query execution plan
      • And read what it means with regards to what a database query is trying to do
      • And what data is actually being stored
    • An execution plan is a common feature in relational databases
      • It just tells you how the server is going to try to find the data you want
      • Usually given some estimated costs for each of the steps (see the sketch at the end of this section)
    • You can use this as a guide for how to adjust your queries
      • In order to reduce their cost
      • Or to optimize the database itself to reduce unavoidable costs
    • Such a plan is very specific to relational databases
    • For document, key-value or object databases
      • I don't know if there is a comparable tool
      • Or you have to roll your own instrumentation
      • To figure out the cause of some slow down
    • Since the stack used by any two applications is going to differ
      • Even when they are in the same space, like enterprise Java or LAMP
      • I guess it makes sense there are few more comprehensive guides
      • As much as I'd like to be able to just pick up a book from one of my favorite tech publishers
    • Every time I have worked on scalability for a new project
      • I have had to start researching almost from scratch
    • If I am very lucky, I'll have some applicable knowledge, like of relational databases
      • But it will have to be adapted to fit the current software's model
      • And the very unique sort of load and activity it experiences
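  The exact command and output differ from database to database (EXPLAIN in MySQL and PostgreSQL, for instance), but here is a tiny self-contained example of reading a plan before and after adding an index, using SQLite from Python only because it ships in the standard library. The table and column names are invented.

    # Reading a query execution plan before and after adding an index.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bids (id INTEGER PRIMARY KEY, auction_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO bids (auction_id, amount) VALUES (?, ?)",
                     [(i % 100, i * 1.5) for i in range(10000)])

    QUERY = "SELECT amount FROM bids WHERE auction_id = 42"

    print("-- without an index --")
    for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
        print(row)  # expect a full table scan

    conn.execute("CREATE INDEX idx_bids_auction ON bids (auction_id)")

    print("-- with an index --")
    for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
        print(row)  # expect a search using idx_bids_auction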
  • Experimentation may be the only effective technique
    • I mean rigorous testing with good observations
    • Instrument the components of your own stack and application (a minimal timing sketch follows this list)
    • If you can, get data from production
      • While users are experiencing scalability issues
    • That may help you find more specific resources
      • In the absence of general guides for making your application more scalable
    • That is how I've accumulated my knowledge of scaling relational databases
    • I am hopeful the technique will help with Cassandra
    • You may be able to come up with your own workarounds and solutions
    • Good observation lends itself well to theories
    • Any theory you can test might lead to some hack
      • That alleviates resource contention or minimizes delay deeper in your code
    • As with other classes of troubleshooting
      • I urge documenting your findings and solutions online
    • It doesn't matter if this is on your own blog
      • Or a forum site like Stack Overflow
    • As long as what you write up is findable by someone encountering a similar issue
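  As a minimal sketch of the kind of home-grown instrumentation I mean, here is a Python context manager that times a named block and logs the result, so the same measurements can be compared between a test rig and production. The block names are illustrative.

    # Time named blocks of code and log the results for later comparison.
    import logging
    import time
    from contextlib import contextmanager

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    log = logging.getLogger("timing")

    @contextmanager
    def timed(name):
        """Log how long the wrapped block takes, tagged with a name."""
        start = time.time()
        try:
            yield
        finally:
            log.info("%s took %.3fs", name, time.time() - start)

    if __name__ == "__main__":
        # In a real application these would wrap calls into the data store,
        # the cache, remote services and so on.
        with timed("fake-database-query"):
            time.sleep(0.12)
        with timed("fake-cache-lookup"):
            time.sleep(0.01)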
  • BREAK
  • Adding more computer power seems to be the basic idea
    • You want to start by saturating the hardware you have
    • That is where load testing can help and researching tuning of your stack
    • Once the simplest deployment of your application is approaching full capacity
      • Usually measured in terms of CPU and memory utilization at the OS level
      • You'll want to either beef up the existing server hardware
        • To get a linear boost in how many more users you can handle
      • Or start exploring ways to add more hardware alongside the existing servers
    • You can start with simple load sharing
      • This is often done by having multiple full copies of your stack
      • You can put a DNS round robin in front to randomly send users to each instance
      • More often a smart load balancer is used
        • That can use data from the running instances to send users to less loaded servers
        • As well as making sure that users stick to the same servers once they are passed along
      • If each user is completely independent of every other user
        • This is probably all that your application needs
      • More servers can be added in pretty easily and cheaply
      • Unfortunately it is a rare application where each user
        • Doesn't share some data with other users on the system
      • Also, if each server has its own data store
        • Then your load balancing has to be smart enough
        • To send the same user to the same hardware every time
        • Otherwise it will look like their carefully crafted, saved data is lost
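  A real load balancer handles this affinity for you, along with health checks and failover, but the core idea is just a stable mapping from a user's session token to a backend. A toy sketch, with made-up server names:

    # Toy session affinity: hash a stable session token so the same user
    # always lands on the same backend. The server list is made up.
    import hashlib

    BACKENDS = ["app1.internal", "app2.internal", "app3.internal"]

    def pick_backend(session_id):
        """Map a session id to one backend, consistently across requests."""
        digest = hashlib.sha1(session_id.encode("utf-8")).hexdigest()
        return BACKENDS[int(digest, 16) % len(BACKENDS)]

    if __name__ == "__main__":
        for sid in ("alice-session", "bob-session", "carol-session"):
            # Repeated calls return the same backend for the same session.
            print(sid, "->", pick_backend(sid), pick_backend(sid))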
    • The next change most people make is some shared data storage
      • Although that is not always the case
      • If the layers of your application are very loosely coupled
        • Then you can scale each of those layers separately
      • It used to be very common that you'd load balance a bunch of web and application servers
        • With a shared database server
      • The database can then be beefed up, using a much more powerful server
      • Doing so ensures everyone is using the same database
      • Bouncing around to different web and app servers is much easier
      • If the biggest cost is executing some business logic in the middle
        • This strategy makes sense but has a drawback
      • You've re-introduced a single point of failure
      • Also it is harder to scale up a traditional database
      • Most of the solutions I've seen for this are unreliable
      • Replication, automated copying of data between servers
        • Often differs in configuration and capability
        • From database package to database package
    • Database sharding is another solution
      • Bear in mind that it is a relatively new technique in practice
      • Like traditional scaling techniques, not all implementations
        • Are necessarily equal, so your mileage may vary
      • Like that first model where each server has its own database
        • You partition user data by some application aspect
      • A dead simple rule would be to take some user identity number in the database modulo the number of shards (sketched in code after this list)
      • The schemes for breaking your data into horizontal shards
        • Differ as widely as actual applications do
      • The point is you then need some logic in the middle
        • To send user data to the right shard and read it back from the same shard
      • At the expense of some complexity in the middle
        • You now have a more tunable data storage system
      • You can increase the number of shards as the data grows
        • To keep up with the demand for more horsepower
      • Increasingly, modern storage solutions offer sharding out of the box
      • Many argue that the popular NoSQL movement is defined by starting with sharding
      • It is possible to shard a relational database
        • But it is difficult because these are built assuming all data is present everywhere
      • Newer systems starting with the assumption that sharding is necessary
        • Can better prepare application developers for the necessary tradeoffs
        • Or build in compensating capabilities
      • I am currently working with Cassandra which transparently shards data
      • On write, it automatically figures out where to send data in the cluster
        • Even if you are physically connected to a different node
        • Than where your data will be stored
      • It does something similar on read, forwarding a request for data
        • To wherever in the cluster your data resides without bouncing your connection around
      • Optimizing finding data works very, very differently in a sharding environment
      • I am still learning exactly how to do this
        • But roughly you don't want to be trying to read related data
        • From disparate nodes all over your cluster
      • More thought is required in designing your storage and your indexes for finding data
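  Here is a minimal sketch of the dead-simple modulo rule mentioned above; the shard entries are placeholders for real database connections. The caveat in the final comment is part of why newer systems like Cassandra place data with consistent hashing instead.

    # Dead-simple horizontal sharding: route each user's data by
    # user_id modulo the number of shards. The "dsn" values are placeholders.
    SHARDS = [
        {"name": "shard0", "dsn": "db0.internal/app"},
        {"name": "shard1", "dsn": "db1.internal/app"},
        {"name": "shard2", "dsn": "db2.internal/app"},
    ]

    def shard_for(user_id):
        """All reads and writes for a user must go through the same shard."""
        return SHARDS[user_id % len(SHARDS)]

    def save_profile(user_id, profile):
        shard = shard_for(user_id)
        # Placeholder for an INSERT/UPDATE against shard["dsn"].
        print("writing %r for user %d to %s" % (profile, user_id, shard["name"]))

    def load_profile(user_id):
        shard = shard_for(user_id)
        # Placeholder for a SELECT against shard["dsn"].
        print("reading user %d from %s" % (user_id, shard["name"]))

    if __name__ == "__main__":
        for uid in (1, 2, 3, 101, 102):
            save_profile(uid, {"name": "user-%d" % uid})
            load_profile(uid)
        # Caveat: changing len(SHARDS) reassigns most users to new shards,
        # which is why real systems prefer consistent hashing or a directory
        # so the cluster can grow without mass data moves.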
    • Caching is often employed to add virtual horse power
      • The idea is that going all the way to disk somewhere is probably expensive
      • If you have a simple scheme for looking up known data
        • Then you can just hang onto that data in memory for a while
      • Many systems already include caching, like enterprise Java and Django
        • Even web servers like Apache and Nginx can be configured to cache
      • Data stores, both relational and non-relational, usually rely heavily on caching
      • The way that is most evident is when tuning
        • For storage, usually all you can tweak is the size of the cache
      • Caching isn't only applied to info in RAM, either
      • Often the cost of fetching data from a data store and computing with it is expensive
      • If the result doesn't change often, then even caching to disk can be a big speed up
      • Squid and other reverse proxies are also often used towards this end on a request basis
      • Whatever kinds of caches or cache components you are dealing with
        • You have to have a way to invalidate entries in a cache
        • Otherwise your users will start to see data
        • That is increasingly inconsistent with what they are saving to the data store
      • Lower level caches often tuck these details away, managing their own staleness
        • Where stale data is old data, info that can likely be dropped to make room in the cache
          • With minimal effect on the application and what the user sees
      • In a scaled up environment, caching can be a problem across servers
      • If caches are siloed, then a user bouncing around through a load balancer
        • Will see different data as he works
        • Or may see different data than another user sitting right next to him
      • There are distributed caching systems that help solve this
        • Look at memcached, one of the most popular, or the commercial outfit Terracotta
        • But they add a bit of complexity and overhead
        • Much like sharding a data store does
      • In my experience, caching helps more with saturating your existing hardware
      • It uses up memory that otherwise would be unused
        • To pull some load off of your data store
      • In memory hierarchy, RAM is always faster than disk
        • So it makes sense to saturate RAM as much as possible for speed
        • Including figuring out how to just make it more available
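  A minimal sketch of the read-through, invalidate-on-write pattern described above, with a plain dict standing in for memcached or another shared cache; the _db functions are placeholders for real data store calls. With multiple servers, the dict would have to become a shared cache, or each server's copy goes stale independently.

    # Read-through cache with invalidation on write.
    FAKE_DB = {}
    CACHE = {}

    def get_user_from_db(user_id):
        return FAKE_DB.get(user_id)   # placeholder for a real query

    def save_user_to_db(user_id, record):
        FAKE_DB[user_id] = record     # placeholder for a real write

    def get_user(user_id):
        """Read-through: serve from cache, fall back to the data store."""
        if user_id in CACHE:
            return CACHE[user_id]
        record = get_user_from_db(user_id)
        CACHE[user_id] = record
        return record

    def update_user(user_id, record):
        """Write to the store, then invalidate so readers never see stale data."""
        save_user_to_db(user_id, record)
        CACHE.pop(user_id, None)      # a delete() call on a shared cache

    if __name__ == "__main__":
        update_user(1, {"name": "alice"})
        print(get_user(1))   # cache miss, then cached
        update_user(1, {"name": "alice b."})
        print(get_user(1))   # invalidation keeps the read consistent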
  • BREAK
  • The urge to treat data storage as zero cost is often a hurdle
    • I haven't used a storage system yet that scales without some effort
    • Programmers, not surprisingly, just want to throw data into a store
    • Where all they need to do is get data back out by some unique identity
      • If that were all it took, you could just use plain file storage
    • The problem is that retrieval by a unique key is rarely enough
    • Unavoidably some feature will require a sophisticated condition to be met
    • The most common form this takes is a project framework
      • That includes some so-called dynamic query mechanism
    • The idea is that less experienced developers can just add some conditions to be met
    • The framework code mechanically turns that into the query logic that is actually run (a toy version is sketched at the end of this section)
    • Using an approach like this robs you of the ability to optimize queries
    • To be general enough for anyone to use
      • The query logic it produces is of necessity very naive
    • At my current job, I did considerable work to optimize our dynamic query system
    • I think it falls into the same class of work
      • As building optimization logic into language compilers
    • You have to have a thorough and deep understanding of both
      • Your inputs in the form of what conditions can be set
      • And of your output, the actual query logic that will be expressed against your data store
    • For SQL, I think this is a bit easier as SQL is well documented and understood
    • Of course, at scale, counterintuitive behavior starts to creep into relational data
    • Some of the optimization you have to be able to apply
      • Needs to be statically derived and applied independently
      • Such as in the forms of indexes on tables
    • For non-relational data, it is more difficult, because to optimize fetches
      • You often have to be able to alter the way data is stored
    • By way of example, you need to be able to determine
      • That some dynamic query is going to fetch data from multiple places
      • Places that need to be kept together across a sharding or partitioning scheme
    • I am anticipating a great deal of headache with Cassandra
      • And our dynamic query system
    • Either the rest of the team is going to have to learn how to optimize fetches
      • Or the tool is going to have to get even closer to the sophistication of an optimizing compiler
    • I never took any deep computer science courses so if I had my druthers
      • The rest of the team would pick up some of the burden
    • It will be interesting--good, bad or otherwise--to see how this particular instance winds up
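  To make the shape of the problem concrete, here is a toy version of such a dynamic query mechanism: arbitrary (column, operator, value) conditions turned mechanically into SQL. The table and columns are invented. The generated WHERE clause knows nothing about indexes, selectivity, or how data is laid out across shards, which is exactly what makes it hard to optimize.

    # Toy dynamic query builder: general-purpose by design, naive of necessity.
    ALLOWED_OPS = {"=", "<", ">", "<=", ">=", "LIKE"}

    def build_query(table, conditions):
        # In real code the table and column names would need whitelisting too.
        clauses, params = [], []
        for column, op, value in conditions:
            if op not in ALLOWED_OPS:
                raise ValueError("unsupported operator: %r" % op)
            clauses.append("%s %s ?" % (column, op))
            params.append(value)
        where = " AND ".join(clauses) if clauses else "1=1"
        return "SELECT * FROM %s WHERE %s" % (table, where), params

    if __name__ == "__main__":
        sql, params = build_query("orders", [
            ("status", "=", "open"),
            ("created_at", ">", "2010-01-01"),
            ("customer_name", "LIKE", "%smith%"),
        ])
        print(sql)
        print(params)
        # An optimizing layer would have to reorder or rewrite conditions
        # based on indexes and data distribution, much as a compiler's
        # optimizer rewrites the code it is handed.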
  • The challenge of scaling applications may explain the appeal of cloud computing
    • If the simplest solution is to throw more horsepower at an application
      • Then masses of commodity, virtualized servers would seem to fit the bill
    • In a nutshell, that is one large appeal of the cloud
    • Elastic computing platforms like Amazon's or the new OpenStack
      • Ease the effort of adding in more real hardware
      • Into a platform that applications can readily use right away
    • Once you cross the chasm of single server to some sort of distribution
      • Whether that is shared data and shared cache
        • Or a full-on, savvy clustering system
      • Then in a cloud, you should be able to scale horizontally
        • Just by spinning up and starting the meter on more nodes
    • That is what really drives my interest in stories about cloud computing
    • It isn't the applications, as a developer, that fascinate me
      • But the possibility of evolving the systems I build
      • To take advantage of seamlessly scaling hardware
    • There is still quite a burden on developers to take full advantage
    • There are cloud platforms that combine the programming and the scaling
    • Google App Engine requires you design your data storage in specific ways
      • And that you code up your application logic as stateless
      • But promises you can just pay for more horsepower
      • If your application follows the strictures of using the platform
    • I am sure there are other examples that work the same
    • Not surprisingly, I am most interested in experiments in open and portable clouds
    • Heavyweight virtualization still makes the most sense
      • As you can have the same stack on development workstations as in the cloud
    • I wonder if it limits scalability, though, because of the overhead
      • Of setting up the whole stack per instance, from the OS on up
      • As opposed to more app development centric offerings, like App Engine
    • I am optimistic we'll see more positive experimentation
    • The idea of portable packaging for server applications isn't new
    • Say what you will about Java, it does a pretty good job here
      • With standardized means of describing resources and deployments
      • And single file archives for containing all of the pieces of an application
      • That in theory can just be dropped into a compatible server
    • In practice, vendors have diluted this to drive lock-in
      • But there is still some good thought to mine for more experimentation
      • Especially with more free and open platforms like Django and Rails

(32:25) Outro
