Designing Data-Intensive Applications PDF Download






















But tutorials aren't enough. They don't teach Go's idioms, so developers end up recreating patterns that don't make sense in a Go context. This practical guide provides the essential background you need to write clear and idiomatic Go. No matter your level of experience, you'll learn how to think like a Go developer. Author Jon Bodner reveals design patterns that experienced Go developers have adopted and the rationale for them. You'll learn how to structure your project and choose the proper tools and libraries to create successful software.

- Learn how to write idiomatic code in Go and design a Go project
- Understand the reasons for the design decisions in Go
- Set up a Go development environment for a solo developer or a team
- Learn how and when to use reflection, unsafe, and cgo
- Learn how Go's features allow the language to run efficiently
- Know which Go features you should use sparingly, or not at all
- Learn about the future of Go, including generics

This book showcases cutting-edge research papers from the 6th International Conference on Research into Design (ICoRD) — the largest conference of its kind in India — written by eminent researchers from across the world on design processes, technologies, methods and tools, and their impact on innovation, in support of design for communities.

While design traditionally focused on the development of products for the individual, the emerging consensus on working towards a more sustainable world demands greater attention to designing for and with communities, so as to promote their sustenance and harmony - within each community and across communities.

The special features of the book are the insights into the product and system innovation process, and the host of methods and tools from all major areas of design research for the enhancement of the innovation process.

The main benefit of the book for researchers in various areas of design and innovation is access to the latest quality research in this area, with the largest collection of research from India. For practitioners and educators, it is exposure to an empirically validated suite of theories, models, methods and tools that can be taught and practiced for design-led innovation.

The contents of this volume will be of use to researchers and professionals working in the areas of industrial design, manufacturing, consumer goods, and industrial management.

What do Docker, Kubernetes, and Prometheus have in common? All of these cloud native technologies are written in the Go programming language. This practical book shows you how to use Go's strengths to develop cloud native services that are scalable and resilient, even in an unpredictable environment.

You'll explore the composition and construction of these applications, from lower-level features of Go to mid-level design patterns to high-level architectural considerations. Each chapter builds on the lessons of the last, walking intermediate to advanced developers through Go to construct a simple but fully featured distributed key-value store.

You'll learn best practices for adopting Go as your development language for solving cloud native management and deployment issues.

- Learn how cloud native applications differ from other software architectures
- Understand how Go can solve the challenges of designing scalable, distributed services
- Leverage Go's lower-level features, such as channels and goroutines, to implement a reliable cloud native service
- Explore what "service reliability" is and what it has to do with "cloud native"
- Apply a variety of patterns, abstractions, and tooling to build and manage complex distributed systems

World-renowned leaders in the field provide an accessible introduction to the use of Generalized Stochastic Petri Nets (GSPNs) for the performance analysis of diverse distributed systems. Divided into two parts, it begins with a summary of the major results in GSPN theory. The second section is devoted entirely to application examples which demonstrate how GSPN methodology can be used in different arenas.

A simple version of the software tool used to analyse GSPN models is included with the book and a concise manual for its use is presented in the later chapters.


The panel recommended a new approach for federal statistical programs that would combine diverse data sources from government and private sector sources and the creation of a new entity that would provide the foundational elements needed for this new approach, including legal authority to access data and protect privacy.

This second of the panel's two reports builds on the analysis, conclusions, and recommendations in the first one. This report assesses alternative methods for implementing a new approach that would combine diverse data sources from government and private sector sources, including describing statistical models for combining data from multiple sources; examining statistical and computer science approaches that foster privacy protections; evaluating frameworks for assessing the quality and utility of alternative data sources; and various models for implementing the recommended new entity.

Together, the two reports offer ideas and recommendations to help federal statistical agencies examine and evaluate data from alternative sources and then combine them as appropriate to provide the country with more timely, actionable, and useful information for policy makers, businesses, and individuals.

The papers presented at the HCII conference were carefully reviewed and selected from the submissions received. These papers address the latest research and development efforts and highlight the human aspects of design and use of computing systems. The papers accepted for presentation thoroughly cover the entire field of Human-Computer Interaction, addressing major advances in knowledge and effective use of computers in a variety of application areas. The contributions included in the DUXU proceedings were carefully reviewed and selected for inclusion in this three-volume set.

Many of the technologies described in this book fall within the realm of the Big Data buzzword.

However, the term Big Data is so over-used and under-defined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems. This book has a bias towards free and open source software (FOSS), because reading, modifying and executing source code is a great way to understand how something works in detail.

Open platforms also reduce the risk of vendor lock-in. The text, figures and examples are a work in progress, and several chapters are yet to be written. We are releasing the book before it is finished because we hope that it is already useful in its current form, and because we would love your feedback in order to create the best possible finished product.

The Big Picture

Chapter 1. Reliable, Scalable and Maintainable Applications

Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing. A data-intensive application is typically built from standard building blocks which provide commonly needed functionality.

But reality is not that simple. There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on.

When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. Sometimes it can be hard to combine several tools when you need to do something that a single tool cannot do alone.

This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics. Although a database and a message queue have some superficial similarity—both store data for some time—they have very different access patterns, which means different performance characteristics, and thus very different implementations.

So why should we lump them all together under an umbrella term like data systems? Many new tools for data storage and processing have emerged in recent years. They are optimized for a variety of different use cases, and they no longer neatly fit into traditional categories such as databases, caches and message queues. Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs.

Instead, the work is broken down into tasks that can be performed efficiently on a single tool, and those different tools are stitched together using application code. The figure below gives a glimpse of what this may look like (we will go into detail in later chapters). Now you have essentially created a new, special-purpose data system from smaller, general-purpose components.

Your composite data system may provide certain guarantees, e.g. that the cache will be correctly invalidated or updated on writes, so that outside clients see consistent results. You are now not only an application developer, but also a data system designer. If you are designing a data system or service, a lot of tricky questions arise.

How do you ensure that the data remains correct and complete, even when things go wrong internally? How do you provide consistently good performance to clients, even when parts of your system are degraded?

How do you scale to handle an increase in load? What does a good API for the service look like?

Figure: One possible architecture for a data system that combines several components.

The factors that influence the design of a data system depend very much on the situation. In this book, we focus on three concerns that are important in most software systems:

Reliability: The system should continue to work correctly (performing the correct function at the desired performance) even in the face of adversity (hardware or software faults, and even human error). See Reliability.

Scalability: As the system grows (in data volume, traffic volume or complexity), there should be reasonable ways of dealing with that growth. See Scalability.

Maintainability: Over time, many different people will work on the system, and they should all be able to work on it productively. See Maintainability.

These words are often cast around without a clear understanding of what they mean.

In the interest of thoughtful engineering, we will spend the rest of this chapter exploring ways of thinking about reliability, scalability and maintainability. Then, in the following chapters, we will look at various techniques, architectures and algorithms that are used in order to achieve those goals.

Reliability

Everybody has an intuitive idea of what it means for software to be reliable or unreliable. The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant. The term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible (if the entire planet Earth, and all servers on Earth, were sucked into a black hole, tolerance of that fault would require web hosting in space—good luck getting that budget item approved).

So it only makes sense to talk about tolerance of certain types of fault. Note that a fault is not the same as a failure—see [[2]] for an overview of the terminology. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault tolerance mechanisms that prevent faults from causing failures.

In this book we cover several techniques for building reliable systems from unreliable parts. Counter-intuitively, in such fault-tolerant systems it can make sense to increase the rate of faults by triggering them deliberately.

Software that deliberately causes faults—for example, randomly killing individual processes without warning—can be used to test a system's fault-tolerance; a well-known example is Netflix's Chaos Monkey [[3]]. It ensures that the fault-tolerance machinery is continually exercised and tested, so that we can be confident that faults will be handled correctly when they occur naturally. Although we generally prefer tolerating faults over preventing them, there are cases where prevention is better than cure, because no cure exists. This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone.

However, this book mostly deals with the kinds of fault that can be cured, as described in the following sections.

Hardware faults

When we think of causes of system failure, hardware faults quickly come to mind.

Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large data centers can tell you that these things happen all the time when you have a lot of machines. Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years [[4]]. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
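As a back-of-the-envelope check on that estimate (a minimal sketch: the 10,000-disk cluster is the one mentioned above, and the 30-year MTTF is an assumed midpoint of the quoted range):

```python
# Rough failure-rate estimate for a large disk cluster (illustrative numbers).
MTTF_YEARS = 30          # assumed midpoint of the quoted 10-50 year MTTF range
CLUSTER_DISKS = 10_000   # cluster size from the estimate above

failures_per_year = CLUSTER_DISKS / MTTF_YEARS   # expected disk failures per year
failures_per_day = failures_per_year / 365

print(f"~{failures_per_year:.0f} failures/year, ~{failures_per_day:.1f} per day")
# With these assumptions: ~333 failures/year, i.e. roughly one disk death per day.
```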

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and data centers may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced. This approach cannot completely prevent hardware problems from causing failures, but it is well understood, and can often keep a machine running uninterrupted for years.

Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic in most applications. Thus, multi-machine redundancy was only required by a small number of applications for which high availability was absolutely essential.

However, as applications have come to use larger and larger numbers of machines, the rate of hardware faults they encounter rises proportionally, and so there is a move towards systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference to hardware redundancy. We usually think of hardware faults as random and independent of each other: there may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

Another class of fault is a systematic error within the system. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [[4]].

For example, consider the leap second on June 30, 2012 that caused many applications to hang simultaneously, due to a bug in the Linux kernel. The bugs that cause these kinds of software fault often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and whilst that assumption is usually true, it eventually stops being true for some reason.

There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system, thorough testing, measuring, monitoring and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself while it is running, and raise an alert if a discrepancy is found [[8]].
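As a sketch of that kind of self-checking, here is a hypothetical in-memory queue wrapper that tracks incoming and outgoing message counts and raises an alert when the invariant is violated; the class and its API are illustrative assumptions, not something from the text:

```python
import logging

class AuditedQueue:
    """Wraps a simple in-memory queue and continually checks an invariant:
    messages_in should equal messages_out plus the messages still queued."""

    def __init__(self):
        self._items = []
        self.messages_in = 0
        self.messages_out = 0

    def put(self, item):
        self._items.append(item)
        self.messages_in += 1
        self._audit()

    def get(self):
        item = self._items.pop(0)
        self.messages_out += 1
        self._audit()
        return item

    def _audit(self):
        # "Raise an alert" is modeled here as an error-level log message.
        if self.messages_in != self.messages_out + len(self._items):
            logging.error("queue invariant violated: in=%d out=%d queued=%d",
                          self.messages_in, self.messages_out, len(self._items))
```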

Human errors

Even when they have the best intentions, humans are known to be unreliable. How do we make our systems reliable, in spite of unreliable humans? The best systems combine several approaches. Design systems in a way that minimizes opportunities for error: well-designed abstractions, APIs and admin interfaces make it easy to do the right thing and discourage the wrong thing. However, if the interfaces are too restrictive, people will work around them, negating their benefit, so this is a tricky balance to get right. Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully-featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation. Allow quick and easy recovery from human errors, to minimize the impact in the case of failure: for example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small subset of users), and provide tools to recompute data in case it turns out that the old computation was incorrect.

Set up detailed and clear monitoring, such as performance metrics and error rates. The term telemetry is not often applied to software, but it is very apt, and we can learn about good telemetry from other engineering disciplines, such as aerospace [[9]]. Monitoring can show us early warning signals, and allow us to check whether any assumptions or constraints are being violated.

When a problem occurs, metrics can be invaluable in diagnosing the issue. How important is reliability? Reliability is not just for nuclear power stations and air traffic control software—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks, if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in terms of lost revenue and reputation.

Consider a parent who stores all pictures and videos of their children in your photo application [[10]]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

Scalability

Even if a system is working reliably today, that does not mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million.

Perhaps it is processing much larger volumes of data than it did before. Scalability is the term we use to describe a system's ability to cope with increased load. It is not a one-dimensional label, however: it is meaningless to say that a system is simply scalable or not scalable. Rather, discussing scalability means discussing the question: if the system grows in a particular way, what are our options for coping with the growth?

Describing load

First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.

To make this idea more concrete, consider Twitter's two main operations: posting a tweet and reading the home timeline. Simply handling 12,000 writes per second (the peak rate for posting tweets) would be fairly easy; the harder problem is fan-out, because each user follows many people and is followed by many people. There are broadly two ways of implementing these operations: 1. Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all recent tweets for each of those users, and merge them (sorted by time).

In a relational database like the one in the figure below, this would be a query that joins the users, tweets and follows tables. 2. Maintain a cache for each user's home timeline. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
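The following is a minimal in-memory sketch of the two approaches; the data structures and function names are hypothetical, and a real implementation would of course use a database and distributed caches rather than Python dicts:

```python
from collections import defaultdict

tweets = []                          # global collection of (sender, text, timestamp)
following = defaultdict(set)         # user -> users they follow
followers = defaultdict(set)         # user -> users who follow them
timeline_cache = defaultdict(list)   # user -> precomputed home timeline (approach 2)

# Approach 1: posting is cheap; reading the home timeline does all the work.
def post_tweet_v1(sender, text, ts):
    tweets.append((sender, text, ts))

def home_timeline_v1(user):
    recent = [t for t in tweets if t[0] in following[user]]
    return sorted(recent, key=lambda t: t[2], reverse=True)

# Approach 2: fan out on write; reading the home timeline is a cheap cache lookup.
def post_tweet_v2(sender, text, ts):
    tweets.append((sender, text, ts))
    for follower in followers[sender]:
        timeline_cache[follower].append((sender, text, ts))

def home_timeline_v2(user):
    return sorted(timeline_cache[user], key=lambda t: t[2], reverse=True)
```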

Figure: Simple relational schema for implementing a Twitter home timeline.

However, the downside of approach 2 is that posting a tweet now requires a lot of extra work.

On average, a tweet is delivered to about 75 followers, so 4.6k tweets posted per second turn into 345k writes per second to the home timeline caches. But this average hides the fact that the number of followers per user varies wildly, and some users have over 30 million followers.

This means that a single tweet may result in over 30 million writes to home timelines! Doing this in a timely manner—Twitter tries to deliver tweets to followers within 5 seconds—is a significant challenge. In the example of Twitter, the distribution of followers per user (maybe weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load.

Your application may have very different characteristics, but you can apply similar principles to reasoning about its load. The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines when they are posted, but tweets from users with a very large number of followers are excepted from this fan-out. Instead, when the home timeline is read, the tweets from those celebrity accounts followed by the user are fetched separately and merged with the rest of the home timeline, as in approach 1.

This hybrid approach is able to deliver consistently good performance.

Describing performance

Once you have described the load on your system, you can investigate what happens when the load increases.

In a batch-processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.

In practice, in a system handling a variety of requests, the latency per request can vary a lot. We therefore need to think of latency not as a single number, but as a probability distribution. In the figure, each gray bar represents a request to a service, and its height shows how long that request took.

Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g. because they process more data. It is common to report the average (arithmetic mean) response time of a service, but the mean is not a very good metric if you want to know your typical response time, because it does not tell you how many users actually experienced that delay. Usually it is better to use percentiles. If you take your list of response times and sort it, from fastest to slowest, then the median is the half-way point: for example, if your median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that. This makes the median a good metric if you want to know how long users typically have to wait.

The median is also known as the 50th percentile, and sometimes abbreviated as p50. In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th and 99.9th are commonly used. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more.
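As a minimal sketch, here is one way to compute such percentiles from a list of measured response times, using a simple nearest-rank method (the sample values are made up):

```python
def percentile(sorted_times, p):
    """Nearest-rank percentile of an already-sorted list (p between 0 and 100)."""
    if not sorted_times:
        raise ValueError("no samples")
    rank = max(1, round(p / 100 * len(sorted_times)))
    return sorted_times[rank - 1]

# Hypothetical response-time samples in milliseconds.
response_times_ms = sorted([32, 40, 41, 45, 48, 55, 60, 75, 120, 980])
print("p50:", percentile(response_times_ms, 50))   # the median
print("p95:", percentile(response_times_ms, 95))   # tail latency
print("p99:", percentile(response_times_ms, 99))
```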

This is illustrated in the figure. Now, when testing a system at various levels of load, you can track the median and higher percentiles of response times in order to get a quick measure of the performance.

Percentiles in Practice

You may wonder whether the high percentiles are worth worrying about—if just 1 in 1,000 requests is unacceptably slow for the end user, and the other 999 are fast enough, you may still consider the overall level of service to be acceptable.

High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. As it takes just one slow call to make the entire end-user request slow, rare slow calls to the backend become much more frequent at the end-user request level (see the figure below). See [[14]] for a discussion of approaches to solving this problem.
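A small worked example of this amplification effect, assuming (purely for illustration) that each backend call is independently slow 1% of the time:

```python
# Probability that at least one of N parallel backend calls is slow,
# assuming each call is independently slow with probability p (illustrative figure).
p_slow = 0.01
for n_calls in (1, 10, 100):
    p_at_least_one_slow = 1 - (1 - p_slow) ** n_calls
    print(f"{n_calls:>3} backend calls -> "
          f"{p_at_least_one_slow:.0%} of user requests see a slow call")
# 1 call -> 1%, 10 calls -> ~10%, 100 calls -> ~63%
```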

If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last ten minutes. Every minute, you calculate the median and various percentiles over the values in that window, and plot those metrics on a graph. If that is too inefficient for you, there are algorithms which give a good approximation of percentiles at minimal CPU and memory cost, such as forward decay [[15]], which has been implemented in Java [[16]] and Ruby [[17]].
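A naive sketch of that rolling-window approach (the window length and the choice of percentiles are assumptions for illustration; forward decay or a histogram-based method would replace the full sort in practice):

```python
import time
from collections import deque

WINDOW_SECONDS = 600             # keep roughly the last ten minutes of samples
window = deque()                 # (timestamp, response_time_ms) pairs

def record(response_time_ms, now=None):
    """Add a sample and drop samples that have aged out of the window."""
    now = time.time() if now is None else now
    window.append((now, response_time_ms))
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()

def current_percentiles():
    """Recompute p50/p95/p99 over the current window (called e.g. once a minute)."""
    times = sorted(t for _, t in window)
    if not times:
        return {}
    def pick(p):
        return times[min(len(times) - 1, int(p / 100 * len(times)))]
    return {"p50": pick(50), "p95": pick(95), "p99": pick(99)}
```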

Figure: When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.

Approaches for coping with load

Now that we have discussed the parameters for describing load, and metrics for measuring performance, we can start discussing scalability in earnest: how do we maintain good performance, even when our load parameters increase by some amount?

An architecture that is appropriate for one level of load is unlikely to cope with ten times that load. If you are working on a fast-growing service, it is therefore likely that you will need to re-think your architecture on every order of magnitude load increase—perhaps even more often than that.

People often talk of a dichotomy between scaling up (vertical scaling: using a single, powerful machine) and scaling out (horizontal scaling: distributing the load across multiple smaller machines).

In reality, good architectures usually involve a pragmatic mixture of approaches. While distributing stateless services across multiple machines is fairly straightforward, taking stateful data systems from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom until recently was to keep your database on a single node (scale up) until scaling cost or high-availability requirements forced you to make it distributed.

As the tools and abstractions for distributed systems get better, this common wisdom may change, at least for some kinds of application. The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the latency requirements, the access patterns, or (usually) some mixture of all of these plus many more issues.

For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput.
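The arithmetic behind that comparison, as a quick sanity check using the figures from the sentence above:

```python
# Both systems move roughly the same number of bytes per second,
# even though their request patterns are completely different.
many_small = 100_000 * 1_000            # 100,000 requests/sec * 1 kB each
few_large = 3 * 2_000_000_000 / 60      # 3 requests/min * 2 GB each, per second
print(f"{many_small / 1e6:.0f} MB/s vs {few_large / 1e6:.0f} MB/s")
# ~100 MB/s in both cases.
```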

An architecture that scales well for a particular application is built around assumptions of which operations will be common, and which will be rare—the load parameters.

If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counter-productive. However, whilst being specific to a particular application, scalable architectures are usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.

Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.

Every legacy system is unpleasant in its own way, and so it is difficult to give general recommendations for dealing with them. However, we can and should design software in a way that minimizes pain during maintenance, and thus avoids creating legacy software ourselves. To this end, we will pay particular attention to three design principles for software systems:

Operability: Make it easy for operations teams to keep the system running smoothly.

Simplicity: Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note that this is not the same as simplicity of the user interface.)

Plasticity: Make it easy for engineers in future to make changes to the system, adapting it for unanticipated use cases as requirements change.

Plasticity is also known as extensibility, modifiability or malleability. As previously with reliability and scalability, there are no quick answers to achieving these goals. Rather, we will try to think about systems with operability, simplicity and plasticity in mind.

Operability: making life easy for operations

Operations teams are vital to keeping a software system running smoothly.

Simplicity: managing complexity

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand.

This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. There are many possible symptoms of complexity: explosion of the state space, tight coupling of modules, tangled dependencies, inconsistent naming and terminology, hacks aimed at solving performance problems, special-casing to work around issues elsewhere, and many more. Much has been written on this topic already—to mention just two articles, No Silver Bullet [[18]] is a classic, and its ideas are further developed in Out of the Tar Pit [[19]].

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences and unexpected interactions are more easily overlooked.

Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build. Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [[19]] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users), but arises only from the implementation. One of the best tools we have for removing accidental complexity is abstraction.

A good abstraction can also be used for a wide range of different applications. For example, high-level programming languages are abstractions that hide machine code, CPU registers and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes.

Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about it.

However, finding good abstractions is very hard. In the field of distributed systems, although there are many good algorithms, it is much less clear how we should be packaging them into abstractions that help us keep the complexity of the system at a manageable level. Throughout this book, we will keep our eyes open for good abstractions that allow us to extract parts of a large system into well-defined, reusable components.

Plasticity: making change easy

It is extremely unlikely that your system's requirements will remain unchanged forever. Much more likely, they are in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc. In terms of organizational processes, agile working patterns provide a framework for adapting to change.

The agile community has also developed technical tools and patterns that are helpful when developing software in a frequently-changing environment, such as test-driven development (TDD) and refactoring. Most discussions of these agile techniques focus on a fairly small, local scale (a couple of source code files within the same application). In this book, we search for ways of increasing agility on the level of a larger data system, perhaps consisting of several different applications or services with different characteristics.

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: plasticity. These principles will guide us through the rest of the book, when we dive into deep technical detail.

Summary

An application has to meet various requirements in order to be useful.

There are functional requirements (what it should do, e.g. allowing data to be stored, retrieved, searched and processed in various ways) and nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility and maintainability). In this chapter we discussed reliability, scalability and maintainability in detail.

Reliability means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault tolerance techniques can hide certain types of fault from the end user. Scalability means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively.

Maintainability is, in essence, about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. There is unfortunately no quick answer to making applications reliable, scalable or maintainable. However, there are certain patterns and techniques which keep re-appearing in various different kinds of application.

In the next few chapters we will take a look at some examples of data systems, and analyze how they work towards those goals. Later in the book (in chapters to come), we will look at patterns for systems that consist of several components working together, such as the one shown in the figure earlier in this chapter.

Footnote: The term fan-out is borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate's output; the output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

Footnote: In practice, the running time of a batch job is often longer than the dataset size divided by the throughput, due to skew (data not being spread evenly across worker processes) or waiting for the slowest task to complete.

Chapter 2. The Battle of the Data Models

The limits of my language mean the limits of my world. (Ludwig Wittgenstein)

Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it represented in terms of the next-lower layer?

For example:

1. As an application developer, you look at the real world (in which there are people, organizations, goods, actions, money flows, sensors, etc.) and you model it in terms of objects or data structures, and APIs that manipulate those data structures. Those structures are often specific to your application.

2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a relational database, or a graph model.

3. The engineers who built your database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated and processed in various ways.

4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.

In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people—for example, the engineers at the database vendor and the application developers using their database—to work together effectively.

There are many different kinds of data model, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward. Building software is hard enough, even when working with just one data model, and without worrying about its inner workings.

In this chapter, we will look at a range of general-purpose data models for data storage and querying (point 2 in the list of layers above). In Chapter 3 we will discuss how they are implemented (point 3).

Rivals of the Relational Model

The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [[21]]: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows).

The relational model was a theoretical proposal, and many people at the time doubted whether it could be implemented efficiently. However, by the mid-1980s, relational database management systems (RDBMS) and SQL had become the tool of choice for most people who needed to store and query data with some kind of regular structure. The dominance of relational databases has lasted around 25 to 30 years—an eternity in computing history. The roots of relational databases lie in business data processing, which was performed on mainframe computers in the 1960s and 70s.

Other databases at that time forced application developers to think a lot about the internal representation of the data in the database. The goal of the relational model was to hide that implementation detail behind a cleaner interface. Over the years, there have been many competing approaches to data storage and querying.

Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each competitor to the relational model generated a lot of hype in its time, but it never lasted.

And remarkably, relational databases turned out to generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases—be it online publishing, discussion, social networking, e-commerce, games, software-as-a-service productivity applications, or much more.

A number of interesting database systems are now associated with the NoSQL hashtag. It therefore seems likely that in the foreseeable future, relational databases will continue to be used alongside a broad variety of non-relational data stores—an idea that is sometimes called polyglot persistence [[23]].

The object-relational mismatch

Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows and columns.

The disconnect between the models is sometimes called an impedance mismatch [[25]]. Consider, for example, how a résumé (a LinkedIn profile) could be expressed in a relational schema. Fields like a person's name appear exactly once per user, but most people have had more than one job in their career (positions), varying numbers of periods of education, and any number of pieces of contact information. One option is to encode these one-to-many items as a JSON or XML document and store it in a single text column; in this setup, you typically cannot use the database to query for values inside that serialized column.

Document-oriented databases like MongoDB [[29]], RethinkDB [[30]], CouchDB [[31]] and Espresso [[32]] support this data model, and many developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer.

In the JSON representation, all the relevant information is in one place, and one simple query is sufficient. The one-to-many relationships from the user profile to its positions, educational history and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit.

Figure: Representing a LinkedIn profile using a relational schema.
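In contrast, the JSON representation keeps the whole profile in one self-contained document. Here is a hypothetical sketch as a JSON-style Python dict; all field names and values are illustrative and not taken from the text:

```python
# A self-contained document: the one-to-many items (positions, education,
# contact info) are nested inside the profile rather than split across tables.
profile = {
    "user_id": 251,
    "first_name": "Ada",
    "last_name": "Lovelace",
    "region": "Greater London",          # stored here as a plain string
    "industry": "Computing",
    "positions": [
        {"job_title": "Analyst", "organization": "Analytical Engines Ltd"},
        {"job_title": "Mathematician", "organization": "Independent"},
    ],
    "education": [
        {"school": "Home tutoring", "start": 1832, "end": 1835},
    ],
    "contact_info": {"website": "https://example.org/ada"},
}
```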

Figure: One-to-many relationships forming a tree structure.

If the user interface has free-text fields for entering the region and the industry, it makes sense to store them as plain strings. Alternatively, the region and the industry could be given as IDs that refer to standardized lists, which raises a question of duplication.

A database in which entities like region and industry are referred to by ID is called normalized [[33]], whereas a database that duplicates the names and properties of entities on each document is denormalized. Normalization is a popular topic of debate among database administrators. Note: duplication of data is appropriate in some situations and inappropriate in others, and it generally needs to be handled carefully. We discuss caching, denormalization and derived data in later chapters of this book.

In document databases, joins are not needed for one-to-many tree structures, and support for joins is often weak. If the database itself does not support joins, you have to emulate them in application code by making additional queries or keeping lookup data in memory. In this case, the lists of regions and industries are probably small and slow-changing enough that the application can simply keep them in memory. But nevertheless, the work of making the join is shifted from the database to the application code. Moreover, even if the initial version of an application fits well in a join-free document model, data has a tendency of becoming more interconnected as features are added to applications.
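As a sketch of what shifting the join into application code can look like, assuming the application keeps small, slow-changing lookup tables in memory (all names and IDs here are hypothetical):

```python
# In-memory lookup tables, loaded once at application startup (hypothetical data).
regions = {442: "Greater Seattle Area", 443: "Greater London"}
industries = {131: "Computing", 132: "Aerospace"}

def render_profile(profile_doc):
    """Resolve ID references in a stored document; the 'join' happens in app code."""
    return {
        **profile_doc,
        "region": regions.get(profile_doc["region_id"], "unknown"),
        "industry": industries.get(profile_doc["industry_id"], "unknown"),
    }

doc = {"user_id": 251, "first_name": "Ada", "region_id": 442, "industry_id": 131}
print(render_profile(doc)["region"])   # "Greater Seattle Area"
```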

For example, consider some changes that could be made to the résumé data model. The organization a user worked at and the school where they studied are currently just strings; perhaps they should be references to entities instead? Recommendations: say you want to add a new feature where one user can write a recommendation for another user, shown together with the recommender's name and photo. If the recommender updates their photo, any recommendations they have written need to reflect the new photo.

These new features require many-to-many relationships, as illustrated in the figure. The data within each dotted rectangle can be grouped into one document, but the references to organizations, schools and other users need to be represented as references, and require joins when queried. The company name is not just a string, but a link to a company entity.

Historical interlude

While many-to-many relationships and joins are routinely used in relational databases without thinking twice, document databases and NoSQL reopened the debate on how best to represent such relationships in a database.

This debate is much older than NoSQL—in fact, it goes back to the very earliest computerized database systems. Early hierarchical databases worked well for one-to-many relationships, but made many-to-many relationships difficult; these problems of the 1960s and 70s were very much like the problems that developers are running into with document databases today. Various solutions were proposed, and the two most prominent were the relational model (which became SQL, and took over the world) and the network model (which initially had a large following but eventually faded into obscurity).

In the tree structure of the hierarchical model, every record has exactly one parent; in the network model, a record can have multiple parents. For example, there could be one record for the "Greater Seattle Area" region, and every user who lives in that region could be its parent.

This allows many-to-one and many-to-many relationships to be modeled. The links between records in the network model are not foreign keys, but more like pointers in a programming language (while still being stored on disk). The only way of accessing a record was to follow a path from a root record along these chains of links.

This was called an access path. In the simplest case, an access path could be like the traversal of a linked list: start at the head of the list, and look at one record at a time, until you find the one you want. But in a world of many-to-many relationships, several different paths can lead to the same record, and a programmer working with the network model had to keep track of these different access paths in their head.
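As a rough illustration of what following such an access path meant, here is a linked-list-style traversal in Python; the record structure is a made-up stand-in, not actual CODASYL-era code:

```python
class Record:
    def __init__(self, name, next_record=None):
        self.name = name
        self.next = next_record   # pointer-like link to the next record on disk

def find(head, wanted):
    """Follow the chain of links from a root record until the wanted record is found."""
    current = head
    while current is not None:
        if current.name == wanted:
            return current
        current = current.next
    return None   # the record is not reachable along this access path

# Build a tiny chain: root -> "positions" -> "education", then traverse it.
head = Record("root", Record("positions", Record("education")))
print(find(head, "education").name)
```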


