Data Structures

Build the main data structures from the ground up. Learn when to use an array vs. a linked list vs….

Computer science in plain English

To really understand how data structures work, we're going to derive each of them from scratch. Starting with bits.

Don't worry – we'll skip the convoluted academic jargon and proofs.

We'll cover:

  • Random Access Memory
  • Binary Numbers
  • Fixed-Width Integers
  • Arrays
  • Strings
  • Pointers
  • Dynamic Arrays
  • Linked Lists
  • Hash Tables

Random Access Memory (RAM)

When a computer is running code, it needs to keep track of variables (numbers, strings, arrays, etc.).

Variables are stored in random access memory (RAM). We sometimes call RAM "working memory" or just "memory."

RAM is not where mp3s and apps get stored. In addition to "memory", your computer has storage (sometimes called "persistent storage" or "disc"). While memory is where we keep the variables our functions allocate as they crunch data for us, storage is where we keep files like mp3s, videos, Word documents, and even executable programs or apps.

Memory (or RAM) is faster but has less space, while storage (or "disc") is slower but has more space. A modern laptop might have 500GB of storage but only 16GB of RAM.

Think of RAM like a really tall bookcase with a lot of shelves. Like, billions of shelves.

It just keeps going down. Again, picture billions of these shelves.

The shelves are numbered.

We call a shelf's number its address.

Each shelf holds 8 bits. A bit is a tiny electrical switch that can be turned "on" or "off." But instead of calling it "on" or "off" we call it 1 or 0.

8 bits is called a byte. So each shelf has one byte (8 bits) of storage.

Of course, we also have a processor that does all the real work inside our computer:

It's connected to a memory controller. The memory controller does the actual reading and writing to and from RAM. It has a direct connection to each shelf of RAM.

That direct connection is important. It means we can access address 0 and then immediately access 918,873 without having to "climb down" our massive bookshelf of RAM.

That's why we call it Random Access Memory (RAM) – we can Access the bits at any Random address in Memory right away.

Spinning hard drives don't have this "random access" superpower, because there's no direct connection to each byte on the disc. Instead, there's a reader – called a head – that moves along the surface of a spinning storage disc (like the needle on a record player). Reading bytes that are far apart takes longer because you have to wait for the head to physically move along the disc.

Even though the memory controller can jump between far-apart memory addresses quickly, programs tend to access memory that's nearby. So computers are tuned to get an extra speed boost when reading memory addresses that are close to each other. Here's how it works:

The processor has a cache where it stores a copy of stuff it's recently read from RAM.

Actually, it has a series of caches. But we can picture them all lumped together as one cache like this.

This cache is much faster to read from than RAM, so the processor saves time whenever it can read something from cache instead of going out to RAM.

When the processor asks for the contents of a given memory address, the memory controller also sends the contents of a handful of nearby memory addresses. And the processor puts all of it in the cache.

So if the processor asks for the contents of address 951, then 952, then 953, then 954…it'll go out to RAM once for that first read, and the subsequent reads will come straight from the super-fast cache.

But if the processor asks to read address 951, then address 362, then address 419…then the cache won't help, and it'll have to go all the way out to RAM for each read.

So reading from sequential memory addresses is faster than jumping around.

Binary numbers

Let's put those bits to use. Let's store some stuff. Starting with numbers.

The number system we usually use (the one you probably learned in elementary school) is called base 10, because each digit has ten possible values (1, 2, 3, 4, 5, 6, 7, 8, 9, and 0).

But computers don't have digits with ten possible values. They have bits with two possible values. So they use base 2 numbers.

Base 10 is also called decimal. Base 2 is also called binary.

To understand binary, let's take a closer look at how decimal numbers work. Take the number "101" in decimal:

Notice we have two "1"s here, but they don't mean the same thing. The leftmost "1" means 100, and the rightmost "1" means 1. That's because the leftmost "1" is in the hundreds place, while the rightmost "1" is in the ones place. And the "0" between them is in the tens place.

So this "101" in base 10 is telling us we have "1 hundred, 0 tens, and 1 one."

Notice how the places in base 10 (ones place, tens place, hundreds place, etc.) are sequential powers of 10:

  • 10^0 = 1
  • 10^1 = 10
  • 10^2 = 100
  • 10^3 = 1000
  • etc.

The places in binary (base 2) are sequential powers of 2:

  • 2^0 = 1
  • 2^1 = 2
  • 2^2 = 4
  • 2^3 = 8
  • etc.

So let's take that same "101" but this time let's read it as a binary number:

Reading this from right to left: we have a 1 in the ones place, a 0 in the twos place, and a 1 in the fours place. So our total is 4 + 0 + 1, which is 5.

Here's how we'd count up to 12 in binary:

Decimal Binary
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001
10 1010
11 1011
12 1100
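The table above can be reproduced with a few lines of Python. This `to_binary` helper is our own sketch (not a standard function) – it checks each place, highest power of 2 first:

```python
def to_binary(n, width=4):
    """Build the binary string for n by checking each power of 2,
    highest place first (mirroring the table above)."""
    bits = ""
    for place in reversed(range(width)):
        if n >= 2 ** place:
            bits += "1"       # this place's power of 2 fits in n
            n -= 2 ** place
        else:
            bits += "0"
    return bits

print(to_binary(5))   # "0101" -- 4 + 0 + 1
print(to_binary(12))  # "1100" -- 8 + 4 + 0 + 0
```

Python's built-in `bin()` does the same job, minus the leading zeros.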

So far we've been talking about unsigned integers ("unsigned" means non-negative, and "integer" means a whole number, not a fraction or decimal). Storing other numbers isn't hard though. Here's how some other numbers could be stored:

  • Fractions – Store two numbers: the numerator and the denominator.
  • Decimals – Also two numbers:
    • the number with the decimal point taken out, and
    • the position where the decimal point goes (how many digits over from the leftmost digit).
  • Negative Numbers – Reserve the leftmost bit for expressing the sign of the number. 0 for positive and 1 for negative.

In reality we usually do something slightly fancier for each of these. But these approaches work, and they show how we can express some complex stuff with just 1s and 0s.

We've talked about base 10 and base 2…you may have also seen base 16, also called hexadecimal or hex.

In hex, our possible values for each digit are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e and f. Hex numbers are often prefixed with "0x" or "#".

In CSS, colors are sometimes expressed in hex. Interview Cake's signature blue color is #5bc0de.

Fixed-width integers

How many different numbers can we express with 1 byte (8 bits)?

2^8 = 256 different numbers. How did we know to take 2^8? Each of the 8 bits has 2 possible values, and the bits are independent of each other, so we multiply: 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^8.

What happens if we have the number 255 in an 8-bit unsigned integer (1111 1111 in binary) and we add 1? The answer (256) needs a 9th bit (1 0000 0000). But we only have 8 bits!

This is called an integer overflow. At best, we might just get an error. At worst, our computer might compute the correct answer but then just throw out the 9th bit, giving us zero (0000 0000) instead of 256 (1 0000 0000)! (Python actually notices that the result won't fit and automatically allocates more bits to store the larger number.)
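We can simulate that 8-bit wraparound in Python. The `add_8bit` helper is our own sketch – Python's built-in integers won't overflow on their own, so we throw away the 9th bit by hand:

```python
# Python integers grow as needed, but we can simulate a fixed
# 8-bit unsigned integer by keeping only the lowest 8 bits.
def add_8bit(a, b):
    return (a + b) & 0b1111_1111  # mask off anything past the 8th bit

print(add_8bit(254, 1))  # 255 -- still fits in 8 bits
print(add_8bit(255, 1))  # 0   -- overflow! the 9th bit is lost
print(255 + 1)           # 256 -- plain Python just allocates more bits
```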

The 256 possibilities we get with 1 byte are pretty limiting. So we usually use 4 or 8 bytes (32 or 64 bits) for storing integers.

  • 32-bit integers have 2^32 possible values – more than 4 billion;
  • 64-bit integers have 2^64 possible values – more than 10 billion billion (10^19).

"How come I've never had to think about how many bits my integers are?" Maybe you have but just didn't know it.

Have you ever noticed how in some languages (like Java and C) sometimes numbers are Integers and sometimes they're Longs? The difference is the number of bits (in Java, Integers are 32 bits and Longs are 64).

Ever created a table in SQL? When you specify that a column will hold integers, you have to specify how many bytes: 1 byte (tinyint), 2 bytes (smallint), 4 bytes (int), or 8 bytes (bigint).

When is 32 bits not enough? When you're counting views on a viral video. YouTube famously ran into trouble when the Gangnam Style video hit over 2^31 views, forcing them to upgrade their view counts from 32-bit to 64-bit signed integers.

Most integers are fixed-width or fixed-length, which means the number of bits they take up doesn't change.

It's usually safe to assume an integer is fixed-width unless you're told otherwise. Variable-size numbers exist, but they're only used in special cases.

If we have a 64-bit fixed-length integer, it doesn't matter if that integer is 0 or 193,457 – it still takes up the same amount of space in RAM: 64 bits.

Are you familiar with big O notation? It's a tool we use for talking about how much time an algorithm takes to run or how much space a data structure takes up in RAM. It's pretty simple:

O(1) or constant means the time or space stays about the same even as the dataset gets bigger and bigger.

O(n) or linear means the time or space grows proportionally as the dataset grows.

So O(1) space is much smaller than O(n) space. And O(1) time is much faster than O(n) time.

That's all you need for this piece. But if you're curious, you can read our whole big O explainer here .

In big O notation, we say fixed-width integers take up constant space or O(1) space.

And because they have a constant number of bits, most simple operations on fixed-width integers (addition, subtraction, multiplication, division) take constant time (O(1) time).

So fixed-width integers are very space efficient and time efficient.

But that efficiency comes at a cost – their values are limited. Specifically, they're limited to 2^n possibilities, where n is the number of bits.

So there's a tradeoff. As we'll see, that's a trend in data structures – to get a nice property, we'll often have to lose something.

Arrays

Ok, so we know how to store individual numbers. Let's talk about storing several numbers.

That's right, things are starting to heat up.

Suppose we wanted to keep a count of how many bottles of kombucha we drink every day.

Let's store each day's kombucha count in an 8-bit, fixed-width, unsigned integer. That should be plenty – we're not likely to get through more than 256 (2^8) bottles in a single day, right?

And let's store the kombucha counts right next to each other in RAM, starting at memory address 0:

Bam. That's an array. RAM is basically an array already.

Just like with RAM, the elements of an array are numbered. We call that number the index of the array element (plural: indices). In this example, each array element's index is the same as its address in RAM.

But that's not usually true. Suppose another program like Spotify has already stored some information at memory address 2:

We'd have to start our array below it, for example at memory address 3. So index 0 in our array would be at memory address 3, and index 1 would be at memory address 4, etc.:

Suppose we wanted to get the kombucha count at index 4 in our array. How do we figure out what address in memory to go to? Simple math:

Take the array's starting address (3), add the index we're looking for (4), and that's the address of the item we're looking for. 3 + 4 = 7. In general, for getting the nth item in our array:

address of nth item in array = address of array start + n

This works out nicely because the size of the addressed memory slots and the size of each count are both 1 byte. So a slot in our array corresponds to a slot in RAM.

But that's not always the case. In fact, it's usually not the case. We usually use 64-bit integers.

So how do we build an array of 64-bit (8 byte) integers on top of our 8-bit (1 byte) memory slots?

We simply give each array index 8 address slots instead of 1:

So we can still use simple math to grab the start of the nth item in our array – just gotta throw in some multiplication:

address of nth item in array = address of array start + (n * size of each item in bytes)

Don't worry – adding this multiplication doesn't really slow us down. Remember: addition, subtraction, multiplication, and division of fixed-width integers takes O(1) time. So all the math we're using here to get the address of the nth item in the array takes O(1) time.
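Here's that address math as a quick Python sketch (the function name is ours, just for illustration):

```python
def item_address(array_start, n, item_size_bytes):
    """Address of the nth item: start + n * size. All O(1) arithmetic."""
    return array_start + n * item_size_bytes

# 1-byte kombucha counts starting at address 3: index 4 lives at 3 + 4 = 7
print(item_address(3, 4, 1))   # 7
# 8-byte (64-bit) integers starting at address 0: index 3 starts at byte 24
print(item_address(0, 3, 8))   # 24
```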

And remember how we said the memory controller has a direct connection to each slot in RAM? That means we can read the stuff at any given memory address in O(1) time.

Together, this means looking up the contents of a given array index is O(1) time. This fast lookup capability is the most important property of arrays.

But the formula we used to get the address of the nth item in our array only works if:

  1. Each item in the array is the same size (takes up the same number of bytes).
  2. The array is uninterrupted (contiguous) in memory. There can't be any gaps in the array…like if we tried to "skip over" a memory slot Spotify was already using.

These things make our formula for finding the nth item work because they make our array predictable. We can predict exactly where in memory the nth element of our array will be.

But they also constrain what kinds of things we can put in an array. Every item has to be the same size. And if our array is going to store a lot of stuff, we'll need a bunch of uninterrupted free space in RAM. Which gets hard when most of our RAM is already occupied by other programs (like Spotify).

That's the tradeoff. Arrays have fast lookups (O(1) time), but each item in the array needs to be the same size, and you need a big block of uninterrupted free memory to store the array.

Strings

Okay, let's store some words.

A series of characters (letters, punctuation, etc.) is called a string.

We already know one way to store a series of things – arrays. But how can an array store characters instead of numbers?

Easy. Let's define a mapping between numbers and characters. Let's say "A" is 1 (or 0000 0001 in binary), "B" is 2 (or 0000 0010 in binary), etc. Bam. Now we have characters.

This mapping of numbers to characters is called a character encoding. One common character encoding is "ASCII". Here's how the alphabet is encoded in ASCII:

Table 1: ASCII
A: 01000001 S: 01010011 k: 01101011
B: 01000010 T: 01010100 l: 01101100
C: 01000011 U: 01010101 m: 01101101
D: 01000100 V: 01010110 n: 01101110
E: 01000101 W: 01010111 o: 01101111
F: 01000110 X: 01011000 p: 01110000
G: 01000111 Y: 01011001 q: 01110001
H: 01001000 Z: 01011010 r: 01110010
I: 01001001 a: 01100001 s: 01110011
J: 01001010 b: 01100010 t: 01110100
K: 01001011 c: 01100011 u: 01110101
L: 01001100 d: 01100100 v: 01110110
M: 01001101 e: 01100101 w: 01110111
N: 01001110 f: 01100110 x: 01111000
O: 01001111 g: 01100111 y: 01111001
P: 01010000 h: 01101000 z: 01111010
Q: 01010001 i: 01101001  
R: 01010010 j: 01101010  

You get the idea. So since we can express characters as 8-bit integers, we can express strings as arrays of 8-bit numbers.
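You can poke at this mapping in Python with the built-in `ord` and `chr` functions (Python uses Unicode, which matches ASCII for these characters):

```python
# ord() gives the number behind a character; chr() goes the other way.
print(ord("A"))                  # 65, which is 0100 0001 in binary
print(format(ord("A"), "08b"))   # '01000001' -- matching the table above
print(chr(0b01101011))           # 'k'

# A string really is just an array of these numbers:
print([format(ord(c), "08b") for c in "cake"])
```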

Pointers

Remember how we said every item in an array had to be the same size? Let's dig into that a little more.

Suppose we want to store a bunch of ideas for baby names. Because we've got some really cute ones.

Each name is a string. Which is really an array. And now we want to store those arrays in an array. Whoa.

Now, what if our baby names have different lengths? That'd violate our rule that all the items in an array need to be the same size!

We could put our baby names in arbitrarily large arrays (say, 13 characters each), and just use a special character to mark the end of the string within each array…

"Wigglesworth" is a cute baby name, right?

But look at all that wasted space after "Bill". And what if we wanted to store a string that was more than 13 characters? We'd be out of luck.

There's a better way. Instead of storing the strings right inside our array, let's just put the strings wherever we can fit them in memory. Then we'll have each element in our array hold the address in memory of its corresponding string. Each address is an integer, so really our outer array is just an array of integers. We can call each of these integers a pointer, since it points to another spot in memory.

The pointers are marked with a * at the beginning.

Pretty clever, right? This fixes both the disadvantages of arrays:

  1. The items don't have to be the same length – each string can be as long or as short as we want.
  2. We don't need enough uninterrupted free memory to store all our strings next to each other – we can place each of them separately, wherever there's space in RAM.
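Incidentally, Python lists already work this way under the hood – each slot holds a reference (a pointer) to an object stored elsewhere in memory. A quick sketch (in CPython, `id()` happens to be the object's memory address):

```python
# A Python list stores pointers (references), not the strings themselves,
# which is why its items can have different lengths.
baby_names = ["Bill", "Jane", "Wigglesworth"]

for name in baby_names:
    print(id(name), name)  # addresses are scattered; lengths can differ

# Assigning a list item elsewhere copies the pointer, not the string:
copied = baby_names[0]
print(copied is baby_names[0])  # True -- same object, no copy made
```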

We fixed it! No more tradeoffs. Right?

Nope. Now we have a new tradeoff:

Remember how the memory controller sends the contents of nearby memory addresses to the processor with each read? And how the processor caches them? So reading sequential addresses in RAM is faster because we can get most of those reads right from the cache?

Our original array was very cache-friendly, because everything was sequential. So reading from the 0th index, then the 1st index, then the 2nd, etc. got an extra speedup from the processor cache.

But the pointers in this array make it not cache-friendly, because the baby names are scattered randomly around RAM. So reading from the 0th index, then the 1st index, etc. doesn't get that extra speedup from the cache.

That's the tradeoff. This pointer-based array requires less uninterrupted memory and can accommodate elements that aren't all the same size, but it's slower because it's not cache-friendly.

This slowdown isn't reflected in the big O time cost. Lookups in this pointer-based array are still O(1) time.

Dynamic arrays

Let's build a very simple word processor. What data structure should we use to store the text as our user writes it?

Strings are stored as arrays, right? So we should use an array?

Here's where that gets tricky: when we allocate an array in a low-level language like C or Java, we have to specify upfront how many indices we want our array to have.

There's a reason for this – the computer has to reserve space in memory for the array and commit to not letting anything else use that space. We can't have some other program overwriting the elements in our array!

The computer can't reserve all its memory for a single array. So we have to tell it how much to reserve.

But for our word processor, we don't know ahead of time how long the user's document is going to be! So what can we do?

Just make an array and program it to resize itself when it runs out of space! This is called a dynamic array, and it's built on top of a normal array.

Python, Ruby, and JavaScript use dynamic arrays for their default array-like data structures. In Python, they're called "lists." Other languages have both. For example, in Java, array is a static array (whose size we have to define ahead of time) and ArrayList is a dynamic array.

Here's how it works:

When you allocate a dynamic array, your dynamic array implementation makes an underlying static array. The starting size depends on the implementation – let's say our implementation uses 10 indices:

Say you append 4 items to your dynamic array:

At this point, our dynamic array contains 4 items. It has a length of 4. But the underlying array has a length of 10.

We'd say this dynamic array's size is 4 and its capacity is 10.

The dynamic array stores an end_index to keep track of where the dynamic array ends and the extra capacity begins.

If you keep appending, at some point you'll use up the full capacity of the underlying array:

Next time you append, the dynamic array implementation will do a few things under the hood to make it work:

  1. Make a new, bigger array. Usually twice as big.

Why not just extend the existing array? Because that memory might already be taken. Say we have Spotify open and it's using a handful of memory addresses right after the end of our old array. We'll have to skip that memory and reserve the next 20 uninterrupted memory slots for our new array:

2. Copy each element from the old array into the new array.

3. Free up the old array. This tells the operating system, "you can use this memory for something else now."

4. Append your new item.

We could call these special appends "doubling" appends since they require us to make a new array that's (usually) double the size of the old one.
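Those four steps can be sketched as a toy Python class. The class and attribute names here are our own, purely for illustration – real implementations (like Python's list, written in C) are fancier about their growth factor:

```python
class DynamicArray:
    """Toy dynamic array over a fixed-capacity Python list
    (a stand-in for a real static array)."""

    def __init__(self, capacity=10):
        self._array = [None] * capacity   # underlying "static" array
        self._capacity = capacity
        self._size = 0                    # number of items actually stored

    def append(self, item):
        if self._size == self._capacity:        # out of room: doubling append
            bigger = [None] * (self._capacity * 2)  # 1. make a bigger array
            for i in range(self._size):             # 2. copy each element -- O(n)
                bigger[i] = self._array[i]
            self._array = bigger                    # 3. the old array gets freed
            self._capacity *= 2
        self._array[self._size] = item              # 4. append the new item
        self._size += 1

arr = DynamicArray(capacity=2)
for ch in "cake":
    arr.append(ch)        # the third append triggers a doubling
print(arr._size, arr._capacity)  # 4 4
```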

Appending an item to an array is usually an O(1) time operation, but a single doubling append is an O(n) time operation since we have to copy all n items from our array.

Does that mean an append operation on a dynamic array is always worst-case O(n) time? Yes. So if we make an empty dynamic array and append n items, does that have some crazy time cost like O(n^2) or O(n!)?! Not quite.

While the time cost of each special O(n) doubling append doubles each time, the number of O(1) appends you get until the next doubling append also doubles. This kind of "cancels out," and we can say each append has an average cost or amortized cost of O(1).

Given this, in industry we usually wave our hands and say dynamic arrays have a time cost of O(1) for appends, even though strictly speaking that's only true for the average case or the amortized cost.
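We can check that amortized claim by counting copies in a quick Python sketch (our own toy model, starting at capacity 1):

```python
# Count the copies done while appending n items to a doubling array.
def total_copies(n):
    copies = 0
    capacity = 1
    size = 0
    for _ in range(n):
        if size == capacity:
            copies += size      # a doubling append copies every item
            capacity *= 2
        size += 1
    return copies

n = 1000
print(total_copies(n))      # 1023 -- always less than 2n total copies
print(total_copies(n) / n)  # ~1 copy per append: amortized O(1)
```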

In an interview, if we were worried about that O(n)-time worst-case cost of appends, we might try to use a normal, non-dynamic array.

The advantage of dynamic arrays over arrays is that you don't have to specify the size ahead of time, but the disadvantage is that some appends can be expensive. That's the tradeoff.

But what if we wanted the best of both worlds…

Linked lists

Our word processor is definitely going to need fast appends – appending to the document is like the main thing you do with a word processor.

Can we build a data structure that can store a string, has fast appends, and doesn't require you to say how long the string will be ahead of time?

Let's focus first on not having to know the length of our string ahead of time. Remember how we used pointers to get around length issues with our array of baby names?

What if we pushed that idea even further?

What if each character in our string were a two-index array with:

  1. the character itself
  2. a pointer to the next character

We would call each of these two-item arrays a node and we'd call this series of nodes a linked list.

Here's how we'd actually implement it in memory:

Notice how we're free to store our nodes wherever we can find two open slots in memory. They don't have to be next to each other. They don't even have to be in order:

"But that's not cache-friendly," you may be thinking. Good point! We'll get to that.

The first node of a linked list is called the head, and the last node is usually called the tail.

Confusingly, some people prefer to use "tail" to refer to everything after the head of a linked list. In an interview it's fine to use either definition. Briefly say which definition you're using, just to be clear.

It's important to have a pointer variable referencing the head of the list – otherwise we'd be unable to find our way back to the start of the list!

We'll also sometimes keep a pointer to the tail. That comes in handy when we want to add something new to the end of the linked list. In fact, let's try that out:

Suppose we had the string "LOG" stored in a linked list:

Suppose we wanted to add an "S" to the end, to make it "LOGS". How would we do that?

Easy. We just put it in a new node:

And tweak some pointers:

1. Grab the last letter, which is "G". Our tail pointer lets us do this in O(1) time.

2. Point the last letter's next pointer at the letter we're appending ("S").

3. Update the tail pointer to point to our new last letter, "S".

That's O(1) time.

Why is it O(1) time? Because the runtime doesn't get bigger if the string gets bigger. No matter how many characters are in our string, we still just have to tweak a couple pointers for any append.

为什么是O(1)时间? 因为运行时不会随着字符串变大而变大。 无论我们的字符串中有多少个字符,我们都只需要为任何附加项调整两个指针。
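As a rough sketch of those three steps (assuming a hypothetical Node class with value and next fields, and that we track both the head and the tail):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def append(head, tail, value):
    """O(1) append; returns the new (head, tail) pair."""
    new_node = Node(value)    # put the new letter in a node
    if tail:
        tail.next = new_node  # old last letter points to the new one
    else:
        head = new_node       # list was empty: new node is the head too
    return head, new_node     # new node becomes the tail

# Build "LOG", then append "S" to get "LOGS"
head = tail = None
for ch in "LOG":
    head, tail = append(head, tail, ch)
head, tail = append(head, tail, "S")
```

No matter how long the list is, append only ever touches the tail node.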

Now, what if instead of a linked list, our string had been a dynamic array? We might not have any room at the end, forcing us to do one of those doubling operations to make space:

现在,如果我们的字符串不是一个链表,而是一个动态数组呢?我们可能在最后没有任何空间,迫使我们采取加倍操作来留出空间。

So with a dynamic array, our append would have a worst-case time cost of O(n).

因此,对于动态数组,我们的追加操作在最坏情况下的时间成本为 O(n)。

Linked lists have worst-case O(1)-time appends, which is better than the worst-case O(n) time of dynamic arrays.

链表有最坏情况 O(1) 时间追加,这比动态数组的最坏情况 O(n) 时间要好。

That worst-case part is important. The average case runtime for appends to linked lists and dynamic arrays is the same: O(1).

最坏的情况很重要,附加到链表和动态数组的平均情况运行时是相同的:O(1)。

Now, what if we wanted to prepend something to our string? Let's say we wanted to put a "B" at the beginning.

现在,如果我们想在字符串前加一些东西呢?假设我们想在开头添加字母“B”。

For our linked list, it's just as easy as appending. Create the node:

对于我们的链表,它就像附加一样简单。创建一个节点:

And tweak some pointers:

  1. Point "B"'s next to "L".
  2. Point the head to "B".

调整一些指针:

  1. 把 “B” 的 *next 指针指向 “L”;
  2. *head 指针指向 “B”。

Bam. O(1) time again.

时间成本仍然为 O(1) 。
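Sketched the same way (same hypothetical Node class with value and next fields):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def prepend(head, value):
    """O(1) prepend; returns the new head."""
    new_node = Node(value)  # create the node for "B"
    new_node.next = head    # point "B"'s next to the old head ("L")
    return new_node         # "B" is the new head
```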

But if our string were a dynamic array…

但是如果我们的字符串是一个动态数组……

And we wanted to add in that "B":

我们想要添加 “B”:

Eep. We have to make room for the "B"!

我们必须为 “B” 留出空间!

We have to move each character one space down:

我们必须将每个字符向下移动一个空格:

Now we can drop the "B" in there:

现在我们可以把 “B” 放进去了:

What's our time cost here?

我们在这里花费的时间是多少?

It's all in the step where we made room for the first letter. We had to move all n characters in our string. One at a time. That's O(n) time.

这是我们为第一个字母腾出空间的所有步骤,我们必须移动字符串中的所有 n 个字符,一次一个,是 O(n)时间。
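That shifting step might look like this (a sketch using a plain Python list to stand in for a fixed-size array that happens to have one spare slot at the end):

```python
def prepend_into_array(arr, length, value):
    # Shift each of the `length` elements one slot to the right -- that's
    # O(n) moves -- then drop the new value into the freed-up first slot.
    for i in range(length, 0, -1):
        arr[i] = arr[i - 1]
    arr[0] = value

buf = ["L", "O", "G", "S", None]  # one spare slot at the end
prepend_into_array(buf, 4, "B")   # buf is now ["B", "L", "O", "G", "S"]
```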

So linked lists have faster prepends (O(1) time) than dynamic arrays (O(n) time).

因此,链表的前置时间(O(1))比动态数组(O(n))快。

No "worst case" caveat this time – prepends for dynamic arrays are always O(n) time. And prepends for linked lists are always O(1) time.

这次没有“最坏情况”的限定 – 动态数组的前置操作总是 O(n) 时间,而链表的前置操作始终是 O(1) 时间。

These quick appends and prepends for linked lists come from the fact that linked list nodes can go anywhere in memory. They don't have to sit right next to each other the way items in an array do.

链表的这些快速追加和前置添加是由于链表节点可以在内存中的任何位置,它们不必像数组中的项那样彼此相邻。

So if linked lists are so great, why do we usually store strings in an array? Because arrays have O(1)-time lookups. And those constant-time lookups come from the fact that all the array elements are lined up next to each other in memory.

如果链表这么好,为什么我们要把字符串存储在数组中呢?因为数组有 O(1) 时间查找。这些恒定时间的查找来自于这样一个事实,即所有的数组元素在内存中都是相邻排列的。

Lookups with a linked list are more of a process, because we have no way of knowing where the i th node is in memory. So we have to walk through the linked list node by node, counting as we go, until we hit the i th item.

使用链表进行查找是一个过程,因为我们无法知道第 i 个节点在内存中的位置,所以我们必须一个节点一个节点地遍历链表,直到找到第 i 项。

def get_ith_item_in_linked_list(head, i):
    if i < 0:
        raise ValueError("i can't be negative: %d" % i)

    current_node = head
    current_position = 0
    while current_node:
        if current_position == i:
            # Found it!
            return current_node

        # Move on to the next node
        current_node = current_node.next
        current_position += 1

    raise ValueError('List has fewer than i + 1 (%d) nodes' % (i + 1))        

That's i + 1 steps down our linked list to get to the ith node (we made our function zero-based to match indices in arrays). So linked lists have O(i)-time lookups. Much slower than the O(1)-time lookups for arrays and dynamic arrays.

那就是 i + 1 在链接列表中下移到第 i 个节点(我们使函数从零开始,以匹配数组中的索引)。 因此,链接列表具有 O(i)时间查找,比数组和动态数组的 O(1)时间查找慢得多。

Not only that – walking down a linked list is not cache-friendly. Because the next node could be anywhere in memory, we don't get any benefit from the processor cache. This means lookups in a linked list are even slower.

不仅如此 – 遍历一个链表不是缓存友好的。因为下一个节点可能在内存中的任何位置,所以我们无法从处理器缓存中获得任何好处,这意味着在链表中查找更慢。

So the tradeoff with linked lists is they have faster prepends and faster appends than dynamic arrays, but they have slower lookups.

因此,链表的折衷之处在于,它比动态数组具有更快的前置和追加速度,但查找速度较慢。

Hash tables

i.e. 哈希表

Quick lookups are often really important. For that reason, we tend to use arrays (O(1)-time lookups) much more often than linked lists (O(i)-time lookups).

快速查找通常非常重要。因此,我们倾向于使用数组(O(1)时间查找),而不是链表(O(i)时间查找)。

For example, suppose we wanted to count how many times each ASCII character appears in Romeo and Juliet. How would we store those counts?

例如,假设我们要计算每个 ASCII 字符在《罗密欧与朱丽叶》中出现的次数,我们将如何存储这些计数?

We can use arrays in a clever way here. Remember – characters are just numbers. In ASCII (a common character encoding) 'A' is 65, 'B' is 66, etc.

我们可以在这里巧妙地使用数组。记住 – 字符只是数字,在ASCII(通用字符编码)中,“A”为65,“B”为66,依此类推。

So we can use the character (or rather, its number value) as the index in our array, and store the count for that character at that index in the array:

因此,我们可以使用字符(的 数字值 )作为数组中的索引,并将该字符的计数存储在数组中的该索引处:

With this array, we can look up (and edit) the count for any character in constant time. Because we can access any index in our array in constant time.

使用此数组,我们可以在恒定时间内查找(和编辑)任何字符的计数, 因为我们可以在恒定时间内访问数组中的任何索引。
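As a sketch in Python (using ord to get a character's ASCII number value):

```python
def count_ascii_chars(text):
    counts = [0] * 128        # one slot per ASCII character
    for ch in text:
        counts[ord(ch)] += 1  # the character's number value is the index
    return counts

counts = count_ascii_chars("Romeo")
# Looking up a character's count is a constant-time array access:
# counts[ord('o')] == 2
```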

Something interesting is happening here – this array isn't just a list of values. This array is storing two things: characters and counts. The characters are implied by the indices.

这里发生了有趣的事情 – 这个数组不仅仅是值列表,它存储两类:字符和计数,字符由索引隐含。

So we can think of an array as a table with two columns…except you don't really get to pick the values in one column (the indices) – they're always 0, 1, 2, 3, etc.

因此,我们可以将数组视为具有两列的表……只不过你无法真正选择其中一列(索引)的值 – 它们始终是 0、1、2、3 等。

But what if we wanted to put any value in that column and still get quick lookups?

但是,如果我们想在那个列中放入任何值,并且仍然能够快速查找,那该怎么办呢?

Suppose we wanted to count the number of times each word appears in Romeo and Juliet. Can we adapt our array?

假设我们想数一数《罗密欧与朱丽叶》中每个单词出现的次数,我们能调整数组吗?

Translating a character into an array index was easy. But we'll have to do something more clever to translate a word (a string) into an array index…

将字符转换为数组索引很容易,但要将一个单词(一个字符串)转换为数组索引,我们得做一些更巧妙的事情……

Here's one way we could do it:

这是我们可以做到的一种方法:

Grab the number value for each character and add those up.

获取每个字符的数值并将其相加。

The result is 429. But what if we only have 30 slots in our array? We'll use a common trick for forcing a number into a specific range: the modulus operator (%). Modding our sum by 30 ensures we get a whole number that's less than 30 (and at least 0):

结果是429。但是,如果数组中只有 30 个插槽,该怎么办? 我们将使用一个通用技巧将结果限制到特定范围内:模运算符(%)。 将我们的总和求余于 30 可确保我们得到的整数小于 30(最小为0):

429 % 30 = 9

Bam. That'll get us from a word (or any string) to an array index.

这将使我们从一个单词(或任何字符串)到一个数组索引。
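This simplified hashing function could be written as (using 30 slots, as above):

```python
def hash_word(word, num_slots=30):
    # Sum the characters' number values, then mod into the array's range.
    return sum(ord(ch) for ch in word) % num_slots

hash_word("lies")  # (108 + 105 + 101 + 115) % 30 = 429 % 30 = 9
```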

This data structure is called a hash table or hash map . In our hash table, the counts are the values and the words ("lies," etc.) are the keys (analogous to the indices in an array). The process we used to translate a key into an array index is called a hashing function .

这种数据结构称为 哈希表 或哈希映射。在我们的哈希表中,计数是值,单词(“lies”等)是键(类似于数组中的索引)。我们将键转换为数组索引的过程称为 哈希函数

The hashing functions used in modern systems get pretty complicated – the one we used here is a simplified example.

现代系统中使用的哈希函数非常复杂 – 这里我们使用的是一个简化的例子。

Note that our quick lookups are only in one direction – we can quickly get the value for a given key, but the only way to get the key for a given value is to walk through all the values and keys.

请注意,我们的快速查找仅在一个方向上 – 我们可以快速获取给定键的值,但是获取给定值的键的惟一方法是遍历所有的值和键。

Same thing with arrays – we can quickly look up the value at a given index, but the only way to figure out the index for a given value is to walk through the whole array.

数组也是如此 – 我们可以快速查找给定索引处的值,但是找出给定值的索引的唯一方法是遍历整个数组。

One problem – what if two keys hash to the same index in our array? Look at "lies" and "foes":

一个问题 – 如果两个键哈希映射到数组中的同一索引怎么办? 看一下“lies”和“foes”:

They both sum up to 429! So of course they'll have the same answer when we mod by 30:

它们加起来是 429!所以当对 30 取余时,答案是一样的。

429 % 30 = 9

So our hashing function gives us the same answer for "lies" and "foes." This is called a hash collision . There are a few different strategies for dealing with them.

因此,我们的哈希函数为“lies”和“foes”提供了相同的答案,这称为 哈希冲突 ,有几种不同的应对策略。

Here's a common one: instead of storing the actual values in our array, let's have each array slot hold a pointer to a linked list holding the counts for all the words that hash to that index:

一个常见的方法:不是将实际值存储在数组中,而是让每个数组插槽都包含一个指向链表的指针,该链表包含哈希到该索引的所有单词的计数:

One problem – how do we know which count is for "lies" and which is for "foes"? To fix this, we'll store the word as well as the count in each linked list node:

一个问题 – 我们如何知道哪个计数代表“lies”,哪个计数代表“foes”? 为了解决这个问题,我们将单词和计数存储在每个链接列表节点中:
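Putting the pieces together, a minimal separate-chaining hash table might be sketched like this (hypothetical class and method names; real implementations are far more involved):

```python
class BucketNode:
    def __init__(self, key, value):
        self.key = key      # the word, so we can tell colliding keys apart
        self.value = value  # the count
        self.next = None    # next node in this slot's linked list

class HashTable:
    def __init__(self, num_slots=30):
        self.slots = [None] * num_slots

    def _index(self, key):
        # Our simplified hashing function from above.
        return sum(ord(ch) for ch in key) % len(self.slots)

    def set(self, key, value):
        i = self._index(key)
        node = self.slots[i]
        while node:                   # key already present? update it
            if node.key == key:
                node.value = value
                return
            node = node.next
        new_node = BucketNode(key, value)
        new_node.next = self.slots[i]  # prepend to this slot's list
        self.slots[i] = new_node

    def get(self, key):
        node = self.slots[self._index(key)]
        while node:                   # walk the chain, comparing keys
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError(key)
```

Colliding keys like "lies" and "foes" land in the same slot but stay distinguishable, because each node remembers its own key.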

"But wait!" you may be thinking, "Now lookups in our hash table take O(n) time in the worst case, since we have to walk down a linked list." That's true! You could even say that in the worst case every key creates a hash collision, so our whole hash table degrades to a linked list.

“但是等等!”您可能会想,“在最坏的情况下,查找哈希表需要 O(n) 个时间,因为我们必须遍历一个链表。”这是真的!你甚至可以说,在最坏的情况下,每个键都会产生一个哈希冲突,所以我们的整个哈希表会退化为一个链表。

In industry though, we usually wave our hands and say collisions are rare enough that on average lookups in a hash table are O(1) time. And there are fancy algorithms that keep the number of collisions low and keep the lengths of our linked lists nice and short.

在实际过程中,我们通常会说哈希冲突是非常罕见的,以至于哈希表中的平均查找时间是 O(1),还有一些很神奇的算法可以降低冲突的次数并使链表的长度很短。

But that's sort of the tradeoff with hash tables. You get fast lookups by key…except some lookups could be slow. And of course, you only get those fast lookups in one direction – looking up the key for a given value still takes O(n) time.

但这是哈希表的折衷方案。您可以通过按键快速进行查找……只是某些查找可能很慢。 当然,您只能在一个方向上进行快速查找 – 查找给定值的键仍需要 O(n)时间。

Summary

Arrays have O(1) time lookups. But you need enough uninterrupted space in RAM to store the whole array. And the array items need to be the same size.

数组有O(1)时间查找,但是 RAM 中需要足够的不间断空间来存储整个数组,且数组项的大小必须相同。

But if your array stores pointers to the actual items (like we did with our list of baby names), you can get around both those weaknesses. You can store each array item wherever there's space in RAM, and the array items can be different sizes. The tradeoff is that now your array is slower because it's not cache-friendly.

但是,如果您的数组存储了指向实际项目的指针(就像我们对婴儿名列表所做的那样),那么您可以避开这两个弱点。您可以将每个数组项存储在 RAM 中任何有空间的地方,并且数组项可以是不同大小的。这样做的代价是现在您的数组变慢了,因为它对缓存不友好。

Another problem with arrays is you have to specify their sizes ahead of time. There are two ways to get around this: dynamic arrays and linked lists . Linked lists have faster appends and prepends than dynamic arrays, but dynamic arrays have faster lookups.

数组的另一个问题是必须提前指定它们的大小。有两种方法可以解决这个问题:动态数组和链表。链表具有比动态数组更快的追加和前置,而动态数组具有更快的查找。

Fast lookups are really useful, especially if you can look things up not just by indices (0, 1, 2, 3, etc.) but by arbitrary keys ("lies", "foes"…any string). That's what hash tables are for. The only problem with hash tables is they have to deal with hash collisions, which means some lookups could be a bit slow.

快速查找是非常有用的,特别是如果您不仅可以通过索引( 0、1、2、3 等),还可以通过任意键(“lies”、“foes”任何字符串)查找。这就是哈希表的作用,哈希表唯一的问题是它们必须处理哈希冲突,这意味着一些查找可能会有点慢。

Each data structure has tradeoffs. You can't have it all.

每种数据结构都有所妥协和折衷,你不可能拥有一切。

So you have to know what's important in the problem you're working on. What does your data structure need to do quickly? Is it lookups by index? Is it appends or prepends?

因此,您必须知道在解决的问题中什么是重要的。您的数据结构需要快速做什么? 它是按索引查找的吗? 它是追加还是前置?

Once you know what's important, you can pick the data structure that does it best.

一旦您知道什么是重要的,您就可以选择最合适的数据结构。

Date: 2020-01-29 Wed 12:45

Author: Jack Liu

Created: 2020-02-08 Sat 21:26
