Chapter 1. Reliable, Scalable, and Maintainable Applications
The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was human-made. When was the last time a technology with a scale like that was so error-free?
Alan Kay, in interview with Dr Dobb's Journal (2012)
Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
- Store data so that they, or another application, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
If that sounds painfully obvious, that's just because these data systems are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn't dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job.
But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone.
This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.
In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We'll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application.
Thinking About Data Systems
We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.
So why should we lump them all together under an umbrella term like data systems?
Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories [1]. For example, there are datastores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Apache Kafka). The boundaries between the categories are becoming blurred.
Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code.
For example, if you have an application-managed caching layer (using Memcached or similar), or a full-text search server (such as Elasticsearch or Solr) separate from your main database, it is normally the application code's responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 gives a glimpse of what this may look like (we will go into detail in later chapters).
When you combine several tools in order to provide a service, the service's interface or application programming interface (API) usually hides those implementation details from clients. Now you have essentially created a new, special-purpose data system from smaller, general-purpose components. Your composite data system may provide certain guarantees: e.g., that the cache will be correctly invalidated or updated on writes so that outside clients see consistent results. You are now not only an application developer, but also a data system designer.
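To make that division of responsibility concrete, here is a minimal sketch of the kind of glue code that keeps a cache and a search index in sync with the primary database on writes. Plain Python dictionaries stand in for the real systems (e.g., a relational database, Memcached, Elasticsearch); the names and the ordering of steps are illustrative assumptions, not a prescribed recipe.

database = {}       # system of record: user_id -> profile text
cache = {}          # cache of recently read profiles
search_index = {}   # keyword -> set of user_ids, a toy stand-in for full-text search

def update_profile(user_id, profile):
    # 1. Write to the system of record first.
    database[user_id] = profile
    # 2. Invalidate the cached copy so readers don't see stale data.
    cache.pop(user_id, None)
    # 3. Re-index the profile so the change becomes searchable.
    #    (Stale index entries from the old profile are ignored here for brevity.)
    for word in profile.lower().split():
        search_index.setdefault(word, set()).add(user_id)

def get_profile(user_id):
    # Read path: consult the cache first, fall back to the database on a miss.
    if user_id not in cache:
        cache[user_id] = database[user_id]
    return cache[user_id]

update_profile(42, "distributed systems enthusiast")
print(get_profile(42))

Even in a sketch this small, the tricky questions of the next paragraph are visible: what happens if the process crashes between steps 1 and 2, or if two updates to the same user run concurrently?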
If you are designing a data system or service, a lot of tricky questions arise. How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization's tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.
In this book, we focus on three concerns that are important in most software systems:
- Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See "Reliability".
- Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See "Scalability".
- Maintainability: Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See "Maintainability".
These words are often cast around without a clear understanding of what they mean. In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability, and maintainability. Then, in the following chapters, we will look at various techniques, architectures, and algorithms that are used in order to achieve those goals.
Reliability
Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include:
- The application performs the function that the user expected.
- It can tolerate the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- The system prevents any unauthorized access and abuse.
If all those things together mean "working correctly," then we can understand reliability as meaning, roughly, "continuing to work correctly, even when things go wrong."
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved. So it only makes sense to talk about tolerating certain types of faults.
Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.
Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. The Netflix Chaos Monkey [4] is an example of this approach.
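As a rough illustration of deliberately induced faults, the following is a minimal sketch in the spirit of (but not equivalent to) Chaos Monkey: it randomly terminates one worker in a pool of processes so that the restart logic is exercised continually. The worker function, the kill interval, and the restart-in-place policy are all assumptions made for the example; a real system would rely on its supervisor or orchestrator to restart killed processes.

import random
import time
from multiprocessing import Process

def worker(worker_id):
    # Placeholder workload; in a real system this would be a long-running service process.
    while True:
        time.sleep(1)

def run_with_chaos(num_workers=4, kill_interval_seconds=10):
    processes = {i: Process(target=worker, args=(i,)) for i in range(num_workers)}
    for p in processes.values():
        p.start()
    while True:
        time.sleep(kill_interval_seconds)
        # Deliberately inject a fault: kill one randomly chosen worker without warning.
        victim = random.choice(list(processes.keys()))
        processes[victim].terminate()
        processes[victim].join()
        # The "fault-tolerance machinery" here is a trivial restart of the dead worker.
        processes[victim] = Process(target=worker, args=(victim,))
        processes[victim].start()

if __name__ == "__main__":
    run_with_chaos()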
Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.
Hardware Faults
When we think of causes of system failure, hardware faults quickly come to mind. Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [5, 6]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
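To make the arithmetic concrete, here is the back-of-the-envelope estimate as a tiny sketch; the 30-year MTTF is just an assumed value within the reported 10–50 year range.

# Expected disk failures per day in a cluster, assuming independent failures.
mttf_years = 30          # assumed; reported values range from about 10 to 50 years
num_disks = 10_000

failures_per_day = num_disks / (mttf_years * 365)
print(failures_per_day)  # ~0.9, i.e., roughly one disk failure per day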
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.
However, as data volumes and applications' computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, in some cloud platforms such as Amazon Web Services (AWS) it is fairly common for virtual machine instances to become unavailable without warning [7], as the platforms are designed to prioritize flexibility and elasticity¹ over single-machine reliability.
Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; see Chapter 4).
Software Errors
We usually think of hardware faults as being random and independent from each other: one machine's disk failing does not imply that another machine's disk is going to fail. There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.
Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:
- A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel [9].
- A runaway process that uses up some shared resource—CPU time, memory, disk space, or network bandwidth.
- A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
- Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults [10].
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [11].
There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running and raise an alert if a discrepancy is found [12].
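As an illustration of that kind of self-check, here is a minimal sketch of an in-process queue that counts messages in and out and raises an alert when its invariant is violated. The counting scheme and the alert callback are assumptions made for the example, not any particular library's API.

from collections import deque

class SelfCheckingQueue:
    """Toy queue that continually verifies a simple invariant:
    messages_in == messages_out + messages still waiting in the queue."""

    def __init__(self, alert):
        self.queue = deque()
        self.messages_in = 0
        self.messages_out = 0
        self.alert = alert  # callback invoked when the invariant is violated

    def send(self, message):
        self.queue.append(message)
        self.messages_in += 1
        self._check()

    def receive(self):
        message = self.queue.popleft()
        self.messages_out += 1
        self._check()
        return message

    def _check(self):
        if self.messages_in != self.messages_out + len(self.queue):
            self.alert(f"discrepancy: in={self.messages_in}, "
                       f"out={self.messages_out}, queued={len(self.queue)}")

# The alert just prints here; in production it might page an operator instead.
q = SelfCheckingQueue(alert=print)
q.send("hello")
print(q.receive())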
Human Errors
Humans design and build software systems, and the operators who keep the systems running are also human. Even when they have the best intentions, humans are known to be unreliable. For example, one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [13].
How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches:
- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing." However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.
- Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
- Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
- Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. (Once a rocket has left the ground, telemetry is essential for tracking what is happening, and for understanding failures [14].) Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue.
- Implement good management practices and training—a complex and important aspect, and beyond the scope of this book.
How Important Is Reliability?
Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of ecommerce sites can have huge costs in terms of lost revenue and damage to reputation.
Even in "noncritical" applications we have a responsibility to our users. Consider a parent who stores all their pictures and videos of their children in your photo application [15]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin)—but we should be very conscious of when we are cutting corners.
Scalability
Even if a system is working reliably today, that doesn't mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before.
Scalability is the term we use to describe a system's ability to cope with increased load. Note, however, that it is not a one-dimensional label that we can attach to a system: it is meaningless to say "X is scalable" or "Y doesn't scale." Rather, discussing scalability means considering questions like "If the system grows in a particular way, what are our options for coping with the growth?" and "How can we add computing resources to handle the additional load?"
Describing Load
First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
To make this idea more concrete, let's consider Twitter as an example, using data published in November 2012 [16]. Two of Twitter's main operations are:
- Post tweet: A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
- Home timeline: A user can view tweets posted by the people they follow (300k requests/sec).
Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy. However, Twitter's scaling challenge is not primarily due to tweet volume, but due to fan-out²—each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations:
- Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time). In a relational database like in Figure 1-2, you could write a query such as:
SELECT tweets.*, users.*
  FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user
- Maintain a cache for each user's home timeline—like a mailbox of tweets for each recipient user (see Figure 1-3). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time. (A minimal sketch of this approach follows the list.)
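The following is a minimal in-memory sketch of approach 2, with plain dictionaries standing in for Twitter's real storage systems; the data structures and function names are illustrative assumptions only.

from collections import defaultdict

followers = defaultdict(set)        # followers[user_id] = set of users who follow user_id
home_timelines = defaultdict(list)  # home_timelines[user_id] = precomputed timeline, newest first

def post_tweet(sender_id, text):
    # Fan-out on write: push the new tweet into every follower's home timeline cache.
    for follower_id in followers[sender_id]:
        home_timelines[follower_id].insert(0, (sender_id, text))

def read_home_timeline(user_id, limit=50):
    # Reading is now cheap: the result was computed ahead of time, at write time.
    return home_timelines[user_id][:limit]

followers["alice"].update({"bob", "carol"})
post_tweet("alice", "hello, world")
print(read_home_timeline("bob"))

The cost structure described in the next paragraphs falls directly out of the loop in post_tweet: with about 75 followers on average, 4.6k posts per second turn into roughly 345k timeline-cache writes per second, and a user with 30 million followers turns a single post into 30 million writes.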
The first version of Twitter used approach 1, but the systems struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, and so in this case it's preferable to do more work at write time and less at read time.
However, the downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers. This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely way—Twitter tries to deliver tweets to followers within five seconds—is a significant challenge.
In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load. Your application may have very different characteristics, but you can apply similar principles to reasoning about its load.
The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user's home timeline when it is read, as in approach 1. This hybrid approach is able to deliver consistently good performance. We will revisit this example in Chapter 12 after we have covered some more technical ground.
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Both questions require performance numbers, so let's look briefly at describing the performance of a system.
In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.³ In online systems, what's usually more important is the service's response time—that is, the time between a client sending a request and receiving a response.
Latency and response time
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service [17].
Even if you only make the same request over and over again, you'll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.
In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you'd think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.
It's common to see the average response time of a service reported. (Strictly speaking, the term "average" doesn't refer to any particular formula, but in practice it is usually understood as the arithmetic mean: given n values, add up all the values, and divide by n.) However, the mean is not a very good metric if you want to know your "typical" response time, because it doesn't tell you how many users actually experienced that delay.
Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.
This makes the median a good metric if you want to know how long users typically have to wait: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
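A minimal sketch of computing these percentiles from a list of measured response times; the nearest-rank method used here is just one of several reasonable definitions, and the sample data is invented for illustration.

def percentile(sorted_times, p):
    # Nearest-rank percentile over a sorted list (one of several common definitions).
    index = max(0, int(round(p / 100 * len(sorted_times))) - 1)
    return sorted_times[index]

response_times_ms = sorted([32, 35, 41, 44, 48, 52, 60, 75, 120, 1500])

print(percentile(response_times_ms, 50))  # p50, the median
print(percentile(response_times_ms, 95))  # p95
print(percentile(response_times_ms, 99))  # p99, dominated by the outlier

# A page that issues n independent backend requests has probability 1 - 0.5**n
# that at least one of them is slower than the median response time.
print(1 - 0.5 ** 20)  # ~0.999999 for a page making 20 backend calls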
High percentiles of response times, also known as tail latencies, are important because they directly affect users' experience of the service. For example, Amazon describes response time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they're the most valuable customers [19]. It's important to keep those customers happy by ensuring the website is fast for them: Amazon has also observed that a 100 ms increase in response time reduces sales by 1% [20], and others report that a 1-second slowdown reduces a customer satisfaction metric by 16% [21, 22].
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon's purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing.
For example, percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200 ms and a 99th percentile under 1 s (if the response time is longer, it might as well be down), and the service may be required to be up at least 99.9% of the time. These metrics set expectations for clients of the service and allow customers to demand a refund if the SLA is not met.
Queueing delays often account for a large part of the response time at high percentiles. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].
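A minimal sketch of such an open-loop load generator, assuming a hypothetical send_request function: each request is fired on its own schedule (here, in its own thread), so the next request is not held back by a slow response.

import threading
import time

def send_request():
    # Placeholder for the real request (e.g., an HTTP call); assumed for the example.
    time.sleep(0.2)

def open_loop_load(requests_per_second, duration_seconds):
    interval = 1.0 / requests_per_second
    deadline = time.time() + duration_seconds
    while time.time() < deadline:
        # Fire and forget: requests are sent on schedule even if earlier ones are
        # still in flight, so server-side queueing delays are not hidden.
        threading.Thread(target=send_request, daemon=True).start()
        time.sleep(interval)

open_loop_load(requests_per_second=50, duration_seconds=5)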
Approaches for Coping with Load
Now that we have discussed the parameters for describing load and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance even when our load parameters increase by some amount?
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase—or perhaps even more often than that.
People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can't avoid scaling out. In reality, good architectures usually involve a pragmatic mixture of approaches: for example, using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises (see "Rebalancing Partitions").
While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.
As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of applications. It is conceivable that distributed data systems will become the default in the future, even for use cases that don't handle large volumes of data or traffic. Over the course of the rest of this book we will cover many kinds of distributed data systems, and discuss how they fare not just in terms of scalability, but also ease of use and maintainability.
The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.
For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it's usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.
Maintainability
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
Yet, unfortunately, many people working on software systems dislike maintenance of so-called legacy systems—perhaps it involves fixing other people's mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. Every legacy system is unpleasant in its own way, and so it is hard to give general recommendations for dealing with them.
However, we can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:
- Operability: Make it easy for operations teams to keep the system running smoothly.
- Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
- Evolvability: Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
As previously with reliability and scalability, there are no easy solutions for achieving these goals. Rather, we will try to think about systems with operability, simplicity, and evolvability in mind.
Operability: Making Life Easy for Operations
It has been suggested that "good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations" [12]. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it's working correctly.
Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more [29]:
- Monitoring the health of the system and quickly restoring service if it goes into a bad state
- Tracking down the cause of problems, such as system failures or degraded performance
- Keeping software and platforms up to date, including security patches
- Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
- Anticipating future problems and solving them before they occur (e.g., capacity planning)
- Establishing good practices and tools for deployment, configuration management, and more
- Performing complex maintenance tasks, such as moving an application from one platform to another
- Maintaining the security of the system as configuration changes are made
- Defining processes that make operations predictable and help keep the production environment stable
- Preserving the organization's knowledge about the system, even as individual people come and go
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:
- Providing visibility into the runtime behavior and internals of the system, with good monitoring
- Providing good support for automation and integration with standard tools
- Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
- Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen")
- Providing good default behavior, but also giving administrators the freedom to override defaults when needed
- Self-healing where appropriate, but also giving administrators manual control over the system state when needed
- Exhibiting predictable behavior, minimizing surprises
Simplicity: Managing Complexity
Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a big ball of mud [30].
There are various possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been said on this topic already [31, 32, 33].
When complexity makes maintenance difficult, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.
Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.
One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.
For example, high-level programming languages are abstractions that hide machine code, CPU registers, and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.
However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level.
Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.
Evolvability: Making Change Easy
It's extremely unlikely that your system's requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc.
In terms of organizational processes, Agile working patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns that are helpful when developing software in a frequently changing environment, such as test-driven development (TDD) and refactoring.
Most discussions of these Agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics. For example, how would you "refactor" Twitter's architecture for assembling home timelines ("Describing Load") from approach 1 to approach 2?
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [34].
Summary
In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail.
An application has to meet various requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail.
Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user.
Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter's home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load.
Maintainability has many facets, but in essence it's about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system's health, and having effective ways of managing it.
There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals.
Later in the book, in Part III, we will look at patterns for systems that consist of several components working together, such as the one in Figure 1-1.
Footnotes
1. Defined in "Approaches for Coping with Load".
2. A term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate's output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.
3. In an ideal world, the running time of a batch job is the size of the dataset divided by the throughput. In practice, the running time is often longer, due to skew (data not being spread evenly across worker processes) and needing to wait for the slowest task to complete.
References
[1] Michael Stonebraker and Uğur Çetintemel: "'One Size Fits All': An Idea Whose Time Has Come and Gone," at 21st International Conference on Data Engineering (ICDE), April 2005.
[2] Walter L. Heimerdinger and Charles B. Weinstock: "A Conceptual Framework for System Fault Tolerance," Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
[3] Ding Yuan, Yu Luo, Xin Zhuang, et al.: "Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems," at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
[4] Yury Izrailevsky and Ariel Tseitlin: "The Netflix Simian Army," netflixtechblog.com, July 19, 2011.
[5] Daniel Ford, François Labelle, Florentina I. Popovici, et al.: "Availability in Globally Distributed Storage Systems," at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.
[6] Brian Beach: "Hard Drive Reliability Update – Sep 2014," backblaze.com, September 23, 2014.
[7] Laurie Voss: "AWS: The Good, the Bad and the Ugly," blog.awe.sm, December 18, 2012.
[8] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: "What Bugs Live in the Cloud?," at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986
[9] Nelson Minar: "Leap Second Crashes Half the Internet," somebits.com, July 3, 2012.
[10] Amazon Web Services: "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region," aws.amazon.com, April 29, 2011.
[11] Richard I. Cook: "How Complex Systems Fail," Cognitive Technologies Laboratory, April 2000.
[12] Jay Kreps: "Getting Real About Distributed System Reliability," blog.empathybox.com, March 19, 2012.
[13] David Oppenheimer, Archana Ganapathi, and David A. Patterson: "Why Do Internet Services Fail, and What Can Be Done About It?," at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
[14] Nathan Marz: "Principles of Software Engineering, Part 1," nathanmarz.com, April 2, 2013.
[15] Michael Jurewitz: "The Human Impact of Bugs," jury.me, March 15, 2013.
[16] Raffi Krikorian: "Timelines at Scale," at QCon San Francisco, November 2012.
[17] Martin Fowler: Patterns of Enterprise Application Architecture. Addison Wesley, 2002. ISBN: 978-0-321-12742-6
[18] Kelly Sommers: "After all that run around, what caused 500ms disk latency even when we replaced physical server?" twitter.com, November 13, 2014.
[19] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: "Dynamo: Amazon's Highly Available Key-Value Store," at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
[20] Greg Linden: "Make Data Useful," slides from presentation at Stanford University Data Mining class (CS345), December 2006.
[21] Tammy Everts: "The Real Cost of Slow Time vs Downtime," slideshare.net, November 5, 2014.
[22] Jake Brutlag: "Speed Matters," ai.googleblog.com, June 23, 2009.
[23] Tyler Treat: "Everything You Know About Latency Is Wrong," bravenewgeek.com, December 12, 2015.
[24] Jeffrey Dean and Luiz André Barroso: "The Tail at Scale," Communications of the ACM, volume 56, number 2, pages 74–80, February 2013. doi:10.1145/2408776.2408794
[25] Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: "Forward Decay: A Practical Time Decay Model for Streaming Systems," at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.
[26] Ted Dunning and Otmar Ertl: "Computing Extremely Accurate Quantiles Using t-Digests," github.com, March 2014.
[27] Gil Tene: "HdrHistogram," hdrhistogram.org.
[28] Baron Schwartz: "Why Percentiles Don't Work the Way You Think," solarwinds.com, November 18, 2016.
[29] James Hamilton: "On Designing and Deploying Internet-Scale Services," at 21st Large Installation System Administration Conference (LISA), November 2007.
[30] Brian Foote and Joseph Yoder: "Big Ball of Mud," at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.
[31] Frederick P Brooks: "No Silver Bullet – Essence and Accident in Software Engineering," in The Mythical Man-Month, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
[32] Ben Moseley and Peter Marks: "Out of the Tar Pit," at BCS Software Practice Advancement (SPA), 2006.
[33] Rich Hickey: "Simple Made Easy," at Strange Loop, September 2011.
[34] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: "Analyzing Software Evolvability," at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50