
news 2026/4/8 16:32:45

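Adapted from "K8s Scheduling Framework Design and scheduler plugins Development and Deployment Example" (2024).

Scheduler plugin extension points at a glance

Pre-scheduling (queueing) phase

PreEnqueue
The Pod is at the "ready for scheduling" stage. Internals: sig-scheduling/scheduler_queues.md.
PreEnqueue plugins run before a Pod is put into the scheduling queue. They let you pre-check a Pod or make a decision about it before it is formally enqueued. If this step does not pass, the Pod never enters the scheduling queue, let alone the scheduling flow. Typical uses:
- Filter out unqualified Pods early. If a Pod's resource request clearly exceeds what the cluster can offer, it can be rejected here instead of wasting scheduler time in the queue.
- Run security checks or policy validation.
- Delay or cancel scheduling.

QueueSort
The scheduler picks the next Pod to schedule from the scheduling queue, and the QueueSort extension point defines the rule and order of that choice. By default the scheduler sorts by Pod priority and then by time in the queue (FIFO, first in first out); a custom QueueSort plugin can implement more elaborate ordering.

Scheduling phase

PreFilter (Pod pre-processing and checks; scheduling ends early if expectations are not met)
The subject is the Pod: the plugin pre-processes or checks the Pod.

Filter (rule out nodes that do not meet the requirements)
For each node, the scheduler runs the filter plugins in configured order. If any plugin returns failure, that node is excluded. The subject is the node: the per-plugin results for a node are merged, and a single failure rejects the node.

PostFilter
This stage runs only if the Filter stage eliminated every node; otherwise its plugins are not executed. Plugins run in order, and as soon as one of them makes the Pod schedulable, the remaining PostFilter plugins are skipped. Typical example: preemptiontoleration. After Filter() there is no usable node left, so this stage picks a pod/node and preempts its resources. In other words, once the preemption plugin has obtained resources for the Pod, nothing else needs to run.

PreScore
Prepares and pre-computes information ahead of scoring, for example evaluating hardware characteristics such as GPUs independently of the Pod's specific requirements. Suppose the policy scores nodes by hardware type: in PreScore the scheduler checks which nodes have GPU resources and assigns them a preliminary score, say 10 for GPU nodes and 0 for ordinary nodes.

Score
Evaluates how well each node fits, usually based on the Pod's specific requirements. Suppose the Pod needs a certain amount of CPU and memory and requires the node label app=web. The scheduler scores every node:
- Node A (app=web), enough CPU and memory: score 80.
- Node B (app=frontend), enough CPU but the label requirement is not met: score 20.
- Node C (app=web), insufficient resources: score 10.

NormalizeScore
Converts the scores to normalized values so that nodes can be compared fairly. All node scores are rescaled into the same range.

Reserve (informational: maintains plugin state)
There are two methods here. Plugins that maintain runtime state (aka "stateful plugins") use them to receive notifications from the scheduler.
- Reserve: prevents errors caused by race conditions while the scheduler waits for the bind operation to finish. The next stage is entered only after all Reserve plugins succeed; otherwise the scheduling cycle is aborted.
- Unreserve: executed when scheduling fails and this stage is rolled back. Unreserve() must be idempotent and must not fail. Idempotent means that running it several times gives the same result as running it once, so repeated calls cause no unexpected side effects.

Permit
This is the last extension point of the scheduling cycle. It can prevent or delay binding a pod to the candidate node. There are three outcomes:
- approve: once all Permit plugins approve, the pod enters the binding phase below.
- deny: if any Permit plugin denies, the pod cannot enter the binding phase. This triggers the Unreserve() method of the Reserve plugins.
- wait (with a timeout): if a Permit plugin returns "wait", the pod is placed in an internal list of "waiting" Pods.

Binding phase

WaitOnPermit
WaitOnPermit is the step at the start of the binding cycle where the framework waits for a Pod that a Permit plugin has put into the waiting state. The maximum wait is the timeout that the Permit plugin returned together with its "wait" status; it is not a separately configured parameter. Typical uses:
- Pod co-scheduling: several Pods sometimes need to be scheduled together or according to some coordination mechanism, for example a primary/replica architecture in a distributed application, or a dependency on the startup state of other Pods. A Permit plugin can make one Pod wait until the others meet the condition and then release them together.
- Resource locking: you may want to make sure some resources are not used before other Pods are ready. The Permit stage can implement such a lock, and the timeout bounds how long a Pod waits while the resources are held.
- Task queues: if some Pods need to queue up for processing, a Permit plugin can suspend them temporarily, with the timeout defining the longest time they may wait.

PreBind (pre-processing before Bind, for example mounting a volume on the node)
If any PreBind plugin fails, the pod is rejected and the Unreserve() method of the Reserve plugins is called.

Bind
Bind is entered only after all PreBind plugins have completed.
- All plugins run in configured order.
- Each plugin may choose whether to handle a given pod.
- If a plugin chooses to handle it, the remaining plugins are skipped, so at most one bind plugin actually binds the pod.

PostBind (informational: maintains plugin state)
This is an informational extension point: it cannot influence the scheduling decision and has no return value.
- Only pods that were bound successfully reach this stage.
- As the last stage of the binding cycle, it is generally used to clean up related resources or to perform other follow-up work, such as saving the node a pod was bound to into a CR.

1 Introduction

The K8s scheduling framework provides a plugin mechanism for extending scheduling, which is very useful when you need custom scheduling logic. If a pod spec does not set the schedulerName field, the default scheduler is used; if it does, the pod is handled by the corresponding scheduler or scheduler plugin. This article collects the relevant background and shows how to implement a simple sticky-node scheduling plugin in about 300 lines of code. The code is based on k8s v1.28.

1.1 Scheduling framework extension points

As shown in the figure below, the K8s scheduling framework defines a number of extension points.

Fig. Scheduling framework extension points.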
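To make the Filter, Score, and NormalizeScore rules described above more concrete, here is a minimal sketch in plain Go. It deliberately does not use the real scheduler framework packages: `Node`, `FilterFunc`, `ScoreFunc`, `feasible`, `normalize`, and `pick` are hypothetical names invented for this illustration, and the 0 to 100 score range is an assumption chosen for the example.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for the scheduler's node info.
type Node struct {
	Name    string
	FreeCPU int // free CPU in millicores
	Labels  map[string]string
}

// FilterFunc reports whether a node can run the pod (true means it passes).
type FilterFunc func(n Node) bool

// ScoreFunc returns a raw, not yet normalized score for a node.
type ScoreFunc func(n Node) int64

// feasible keeps only the nodes that pass every filter, mirroring the rule
// that one failing Filter plugin excludes the node.
func feasible(nodes []Node, filters []FilterFunc) []Node {
	var out []Node
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(n) {
				ok = false
				break // remaining filters are not run for this node
			}
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

// normalize rescales raw scores so that the highest raw score maps to 100.
func normalize(raw map[string]int64) map[string]int64 {
	var max int64
	for _, s := range raw {
		if s > max {
			max = s
		}
	}
	out := make(map[string]int64, len(raw))
	for name, s := range raw {
		if max == 0 {
			out[name] = 0
			continue
		}
		out[name] = s * 100 / max
	}
	return out
}

// pick runs filter, score, and normalize, then returns the best node name.
// It returns "" when no node is feasible, which is the situation in which
// the PostFilter stage would run in the real scheduler.
func pick(nodes []Node, filters []FilterFunc, score ScoreFunc) string {
	cands := feasible(nodes, filters)
	if len(cands) == 0 {
		return ""
	}
	raw := make(map[string]int64, len(cands))
	for _, n := range cands {
		raw[n.Name] = score(n)
	}
	norm := normalize(raw)
	best := cands[0].Name
	for _, n := range cands[1:] {
		if norm[n.Name] > norm[best] {
			best = n.Name
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{"node-a", 4000, map[string]string{"app": "web"}},
		{"node-b", 8000, map[string]string{"app": "frontend"}},
		{"node-c", 500, map[string]string{"app": "web"}},
	}
	filters := []FilterFunc{
		func(n Node) bool { return n.FreeCPU >= 1000 },        // enough CPU
		func(n Node) bool { return n.Labels["app"] == "web" }, // required label
	}
	score := func(n Node) int64 { return int64(n.FreeCPU) }
	fmt.Println(pick(nodes, filters, score))
}
```

In this toy run only node-a passes both filters, so scoring has a single candidate. The real framework differs in many ways (statuses instead of booleans, parallel evaluation, plugin weights), so treat this only as a mental model of the control flow.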
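Users can write their own scheduler plugins and register them at these extension points to implement the scheduling logic they want. Each extension point usually has several plugins, which run in registration order. Extension points fall into two categories depending on whether they influence the scheduling decision.

1.1.1 Extension points that influence the scheduling decision

Most extension points influence the scheduling decision. As shown later, the return value of these functions carries a success/failure field that decides whether the pod is allowed into the next stage. If any of these extension points fails, scheduling of the pod fails.

1.1.2 Extension points that do not influence the scheduling decision (informational)

A few extension points are informational. Their functions have no return value and therefore cannot influence the scheduling decision, but they can update pod/node related information or perform cleanup.

1.2 Categories of scheduler plugins

Plugins fall into two categories depending on whether they are maintained in the k8s code repository itself.

1.2.1 in-tree plugins

These are maintained under pkg/scheduler/framework/plugins in the k8s source tree and compiled together with the built-in scheduler. There are a dozen or so of them, most of which are common and in active use:

```
$ ll pkg/scheduler/framework/plugins
defaultbinder/
defaultpreemption/
dynamicresources/
feature/
imagelocality/
interpodaffinity/
names/
nodeaffinity/
nodename/
nodeports/
noderesources/
nodeunschedulable/
nodevolumelimits/
podtopologyspread/
queuesort/
schedulinggates/
selectorspread/
tainttoleration/
volumebinding/
volumerestrictions/
volumezone/
```

With the in-tree approach, every new or modified plugin requires changing the kube-scheduler code, then recompiling and redeploying kube-scheduler, which is fairly heavyweight.

1.2.2 out-of-tree plugins

Out-of-tree plugins are written and maintained by users and deployed independently, without any code or configuration change to k8s itself. Under the hood they are still compiled together with the kube-scheduler code, but the relevant kube-scheduler code has been factored out into a separate project, github.com/kubernetes-sigs/scheduler-plugins. Users only need to import this package, write their own scheduler plugin, and deploy it as an ordinary pod (other deployment methods, such as running the binary directly, also work). The build result is a combined scheduler program that contains the default scheduler plus all out-of-tree plugins, so it has the features of both. It can be used in two ways:

- Deployed alongside the existing scheduler, managing only specific pods.
- As a replacement for the existing scheduler, since it is functionally complete.

1.3 Which built-in plugins sit at each extension point

The built-in scheduler plugins and the extension points each one works at are listed in the official documentation. For example, node selectors and node affinity use the NodeAffinity plugin, and taint/toleration uses the TaintToleration plugin.

2 Pod scheduling process

The complete scheduling process of a pod consists of two phases:

- scheduling cycle: select a node for the pod, similar to querying and filtering in a database.
- binding cycle: carry out that choice, similar to handling everything associated with it and writing the result to the database.

For example, even if the scheduling cycle has selected a node for the pod, if creating the persistent volume for the pod on that node fails in the following binding cycle, the whole scheduling attempt counts as failed and must start over from the beginning. Together the two phases are called a scheduling context.

Before a pod enters a scheduling context there is also a scheduling queue. Users can write their own algorithm to sort the pods in the queue and decide which pods enter the scheduling flow first. The overall flow is shown in the figure below.

Fig.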
queuing/sorting and scheduling context

Each part is covered below.

2.1 Pre-scheduling (queueing) phase

2.1.1 PreEnqueue

The Pod is at the "ready for scheduling" stage. Internals: sig-scheduling/scheduler_queues.md. If this step does not pass, the Pod never enters the scheduling queue, let alone the scheduling flow.

Purpose and scenarios: PreEnqueue gives the scheduler a chance to check a Pod before it enters the scheduling loop. A PreEnqueue plugin only returns a status, so it gates entry into the queue rather than reordering it. It helps with the following:

- Filtering out unqualified Pods early. If there is a clear reason why a Pod should not be scheduled, PreEnqueue can quickly keep it out of the queue and avoid unnecessary scheduling overhead. Example: if a Pod's resource request clearly exceeds the cluster's limits, reject it here instead of letting it waste scheduler time in the queue.
- Letting important Pods through first. Pods of critical applications can be identified here and admitted immediately while less important ones are held back. The actual ordering inside the queue is decided by QueueSort, described next.
- Security checks or policy validation. Before a Pod is enqueued, checks can make sure it meets the cluster's security or policy requirements. Example: verify that the Pod complies with the cluster's network isolation or resource usage policy.
- Delaying or cancelling scheduling. PreEnqueue can decide that some Pods should not be scheduled right now. Example: if a Pod depends on an external service that is currently unavailable, keep it out of the queue until the service recovers.

2.1.2 QueueSort

Sorts the pods in the scheduling queue and so decides which pods are scheduled first. The scheduler picks the next Pod from the scheduling queue, and the QueueSort extension point defines the rule and order of that choice. By default the scheduler sorts by Pod priority and then by time in the queue (FIFO); a custom QueueSort plugin can implement more elaborate ordering.

Scenarios:

- Sort by priority. By default Pods are ordered by priority (PriorityClass), and higher-priority Pods are scheduled first. Example: give the Pod of a critical service a high priority so that it is scheduled ahead of lower-priority Pods.
- Custom ordering rules. If you need to order by Pod labels, resource requests, or some custom policy, implement a QueueSort plugin. Example: schedule Pods that need GPUs first.
- Fair scheduling. Schedule Pods of different users or queues fairly so that one queue's Pods do not take up too much of the scheduler. Example: order Pods according to each namespace's resource quota or the user's permissions so that no tenant monopolizes the scheduler.
- Sort by waiting time. Besides priority, Pods can be ordered by how long they have been waiting, so that long-waiting Pods get a chance. Example: if some Pods have been waiting because of resource shortage, prefer them to prevent starvation.

Implementing a QueueSort plugin: the plugin must follow the scheduling framework's plugin interface. A QueueSort plugin implements a single core function, Less, which compares two queued Pods. If it returns true, the first Pod is ordered before the second and is scheduled first. Only one QueueSort plugin can be enabled at a time.

```go
func (p *MyQueueSortPlugin) Less(pod1, pod2 *framework.QueuedPodInfo) bool {
	// Custom ordering logic goes here. As a placeholder, order by the time
	// the Pod entered the queue (FIFO).
	return pod1.Timestamp.Before(pod2.Timestamp)
}
```

Summary: QueueSort defines how Pods are ordered in the scheduling queue. Custom logic can order by priority, waiting time, resource demand, or other policies to meet specific scheduling needs.

2.2 Scheduling phase (scheduling cycle)

2.2.1 PreFilter: Pod pre-processing and checks; scheduling ends early if expectations are not met

Plugins here can pre-process the Pod or check conditions. The function signature is:

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349-L367

// PreFilterPlugin is an interface that must be implemented by "PreFilter" plugins.
// These plugins are called at the beginning of the scheduling cycle.
type PreFilterPlugin interface {
	// PreFilter is called at the beginning of the scheduling cycle. All PreFilter
	// plugins must return success or the pod will be rejected. PreFilter could optionally
	// return a PreFilterResult to influence which nodes to evaluate downstream. This is useful
	// for cases where it is possible to determine the subset of nodes to process in O(1) time.
	// When it returns Skip status, returned PreFilterResult and other fields in status are just ignored,
	// and coupled Filter plugin/PreFilterExtensions() will be skipped in this scheduling cycle.
	PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) (*PreFilterResult, *Status)

	// PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
	// or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
	// modify its pre-processed info. The framework guarantees that the extensions
	// AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
	// CycleState, and may call those functions more than once before calling
	// Filter again on a specific node.
	PreFilterExtensions() PreFilterExtensions
}
```

Input:
- p *v1.Pod is the pod to be scheduled.
- The second parameter, state, can hold state that later extension points, such as Filter(), can read back.

Output:
- If any plugin returns failure, scheduling of the pod fails.
- In other words, the pod moves on to the next step only after all registered PreFilter plugins succeed.

2.2.2 Filter: rule out all nodes that do not meet the requirements

Plugins here filter out nodes that do not meet the requirements (the equivalent of Predicates in a scheduling Policy). For each node, the scheduler runs the filter plugins in configured order, and if any plugin returns failure, the node is excluded.

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2

// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
	Plugin
	// Filter is called by the scheduling framework.
	// All FilterPlugins should return "Success" to declare that
	// the given node fits the pod. If Filter doesn't return "Success",
	// it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
	// For the node being evaluated, Filter plugins should look at the passed
	// nodeInfo reference for this particular node's information (e.g., pods
	// considered to be running on the node) instead of looking it up in the
	// NodeInfoSnapshot because we don't guarantee that they will be the same.
	// For example, during preemption, we may pass a copy of the original
	// nodeInfo object that has some pods removed from it to evaluate the
	// possibility of preempting them to schedule the target pod.
	Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}
```

Input:
- nodeInfo is the information of the node currently being evaluated; Filter() decides whether this node qualifies.

Output:
- Accept or reject.

A node passes filtering and becomes a candidate only if all Filter plugins return success for it.

2.2.3 PostFilter: remediation when no node is left after Filter

This stage runs only if the Filter stage eliminated every node; otherwise its plugins are not executed.

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L392C1-L407C2

// PostFilterPlugin is an interface for "PostFilter" plugins. These plugins are called after a pod cannot be scheduled.
type PostFilterPlugin interface {
	// A PostFilter plugin should return one of the following statuses:
	// - Unschedulable: the plugin gets executed successfully but the pod cannot be made schedulable.
	// - Success: the plugin gets executed successfully and the pod can be made schedulable.
	// - Error: the plugin aborts due to some internal error.
	//
	// Informational plugins should be configured ahead of other ones, and always return Unschedulable status.
	// Optionally, a non-nil PostFilterResult may be returned along with a Success status. For example,
	// a preemption plugin may choose to return nominatedNodeName, so that framework can reuse that to update the
	// preemptor pod's .spec.status.nominatedNodeName field.
	PostFilter(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status)
}
```

Plugins run in order. As soon as one plugin makes the pod schedulable (returns Success), the remaining PostFilter plugins are skipped. Typical example: preemptiontoleration. After Filter() there is no usable node left, so this stage picks a pod/node and preempts its resources.

2.2.4 PreScore

PreScore, Score, and NormalizeScore all serve to score nodes so that the most suitable node can be selected. They are not expanded here; the function signatures are in the same source file referenced above.

2.2.5 Score

The scoring plugins are called for each node in turn, producing a score.

2.2.6 NormalizeScore

2.2.7 Reserve: informational, maintains plugin state

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L444C1-L462C2

// ReservePlugin is an interface for plugins with Reserve and Unreserve
// methods. These are meant to update the state of the plugin. This concept
// used to be called 'assume' in the original scheduler. These plugins should
// return only Success or Error in Status.code. However, the scheduler accepts
// other valid codes as well. Anything other than Success will lead to
// rejection of the pod.
type ReservePlugin interface {
	// Reserve is called by the scheduling framework when the scheduler cache is
	// updated. If this method returns a failed Status, the scheduler will call
	// the Unreserve method for all enabled ReservePlugins.
	Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status

	// Unreserve is called by the scheduling framework when a reserved pod was
	// rejected, an error occurred during reservation of subsequent plugins, or
	// in a later phase. The Unreserve method implementation must be idempotent
	// and may be called by the scheduler even if the corresponding Reserve
	// method for the same plugin was not called.
	Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}
```

There are two methods here. Plugins that maintain runtime state (aka "stateful plugins") use them to receive notifications from the scheduler:

- Reserve: prevents errors caused by race conditions while the scheduler waits for the bind operation to finish. The next stage is entered only after all Reserve plugins succeed; otherwise the scheduling cycle is aborted.
- Unreserve: executed when scheduling fails and this stage is rolled back. Unreserve() must be idempotent and must not fail.

2.2.8 Permit: allow, deny, or wait before entering the binding cycle

This is the last extension point of the scheduling cycle. It can prevent or delay binding a pod to the candidate node.

```go
// PermitPlugin is an interface that must be implemented by "Permit" plugins.
// These plugins are called before a pod is bound to a node.
type PermitPlugin interface {
	// Permit is called before binding a pod (and before prebind plugins). Permit
	// plugins are used to prevent or delay the binding of a Pod. A permit plugin
	// must return success or wait with timeout duration, or the pod will be rejected.
	// The pod will also be rejected if the wait timeout or the pod is rejected while
	// waiting. Note that if the plugin returns "wait", the framework will wait only
	// after running the remaining plugins given that no other plugin rejects the pod.
	Permit(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (*Status, time.Duration)
}
```

There are three outcomes:

- approve: once all Permit plugins approve, the pod enters the binding phase below.
- deny: if any Permit plugin denies, the pod cannot enter the binding phase. This triggers the Unreserve() method of the Reserve plugins.
- wait (with a timeout): if a Permit plugin returns "wait", the pod is placed in an internal list of "waiting" Pods.

2.3 Binding phase (binding cycle)

Fig.
Scheduling framework extension points.

2.3.1 WaitOnPermit: waiting for a Pod that a Permit plugin put on hold

WaitOnPermit is the step at the start of the binding cycle where the framework waits for a Pod that received a "wait" status in the Permit stage. The maximum wait is the timeout the Permit plugin returned together with that status; it is not a separately configured parameter. The Pod stays in the waiting list until it is approved, rejected, or the timeout expires.

Scenarios:

- Pod co-scheduling: several Pods sometimes need to be scheduled together or according to some coordination mechanism, for example a primary/replica architecture in a distributed application, or a dependency on the startup state of other Pods. A Permit plugin can make one Pod wait until the others meet the condition and then release them together.
- Resource locking: you may want to make sure some resources are not used before other Pods are ready. The Permit stage can implement such a lock, and the timeout bounds how long a Pod waits while the resources are held.
- Task queues: if some Pods need to queue up for processing, a Permit plugin can suspend them temporarily, with the timeout defining the longest time they may wait.

Example: suppose a scheduler plugin uses the Permit extension point to control when a Pod is scheduled and requires some Pods to wait until other Pods reach a certain state:

```go
func (p *MyPermitPlugin) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	if shouldWait(pod) {
		// Put the Pod into the waiting state for at most 30 seconds.
		return framework.NewStatus(framework.Wait, "Waiting for other pods to be ready"), 30 * time.Second
	}
	return framework.NewStatus(framework.Success, "Pod can be scheduled"), 0
}
```

Here the plugin returns a 30-second timeout, so the framework waits at most 30 seconds at WaitOnPermit. If the condition is met and the Pod is approved within that time, it proceeds to binding. If 30 seconds pass without approval, scheduling of the Pod fails, and the scheduler retries it later or reports an error.

Summary: WaitOnPermit bounds how long the framework waits for a Pod to be permitted. It suits scenarios that must wait for a specific condition, such as Pod co-scheduling, resource coordination, or task queue management. Without approval within the timeout, scheduling times out and fails.

2.3.2 PreBind: pre-processing before Bind, for example mounting a volume on the node

For example, before binding a pod to a node, first mount a network volume for the pod on that node.

```go
// PreBindPlugin is an interface that must be implemented by "PreBind" plugins.
// These plugins are called before a pod being scheduled.
type PreBindPlugin interface {
	// PreBind is called before binding a pod. All prebind plugins must return
	// success or the pod will be rejected and won't be sent for binding.
	PreBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}
```

If any PreBind plugin fails, the pod is rejected and the Unreserve() method of the Reserve plugins is called.

2.3.3 Bind: associate the pod with the node

Bind is entered only after all PreBind plugins have completed.

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L497

// Bind plugins are used to bind a pod to a Node.
type BindPlugin interface {
	// Bind plugins will not be called until all pre-bind plugins have completed. Each
	// bind plugin is called in the configured order. A bind plugin may choose whether
	// or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the
	// remaining bind plugins are skipped. When a bind plugin does not handle a pod,
	// it must return Skip in its Status code. If a bind plugin returns an Error, the
	// pod is rejected and will not be bound.
	Bind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}
```

- All plugins run in configured order.
- Each plugin may choose whether to handle a given pod.
- If a plugin chooses to handle it, the remaining plugins are skipped, so at most one bind plugin actually binds the pod.

2.3.4 PostBind: informational, optional, performs cleanup

This is an informational extension point: it cannot influence the scheduling decision and has no return value.

- Only pods that were bound successfully reach this stage.
- As the last stage of the binding cycle, it is generally used to clean up related resources.

```go
// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L473

// PostBindPlugin is an interface that must be implemented by "PostBind" plugins.
// These plugins are called after a pod is successfully bound to a node.
type PostBindPlugin interface {
	// PostBind is called after a pod is successfully bound. These plugins are informational.
	// A common application of this extension point is for cleaning
	// up. If a plugin needs to clean-up its state after a pod is scheduled and
	// bound, PostBind is the extension point that it should register.
	PostBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}
```

3 Developing a minimal sticky node scheduler plugin (out-of-tree)

Using kubevirt sticky-node scheduling of VMs as the example, this section shows how to implement an out-of-tree scheduler plugin in a few hundred lines of code.

3.1 Design

3.1.1 Background

Some background [2,3]:

- VirtualMachine is a virtual machine CRD.
- A VirtualMachine corresponds to one VirtualMachineInstance, which is a running VirtualMachine.
- A VirtualMachineInstance corresponds to one Pod.

On failure, the VirtualMachineInstance and the Pod may be recreated and rescheduled, but the VirtualMachine stays the same. The relationship VirtualMachine <-> VirtualMachineInstance/Pod is similar to StatefulSet <-> Pod.

3.1.2 Business requirement

Once a VM has been created and scheduled to a node, it must always be scheduled to that node no matter what failure occurs, unless someone intervenes manually.

Possible scenario: the VM mounts a local disk of the host, so moving to another host would lose the data. In failure scenarios it is acceptable for the machine or container to be unavailable, because the microservice system handles instance health checks and traffic removal itself. The underlying infrastructure only has to guarantee that the host does not change, so that the data is still there after recovery.

In technical terms:

- After a user creates a VirtualMachine, it is scheduled normally to a node and created there.
- Afterwards, whatever happens (pod crash/eviction/recreate, node restart, ...), the VirtualMachine must be scheduled to that same machine.

3.1.3 Technical approach

1. After a user creates a VirtualMachine, the default scheduler assigns it a node, and the node information is saved on the VirtualMachine CR.
2. If the VirtualMachineInstance or Pod is deleted or recreated, the scheduler first finds the corresponding VirtualMachine CR. If the CR holds saved node information, that node is used; otherwise this must be the first scheduling, so go to step 1.

3.2 Implementation

Implementing this requires registering scheduling extension functions at three places:

- PreFilter
- Filter
- PostBind

The code is based on k8s v1.28.

3.2.1 PreFilter()

This mainly performs checks and preparation:

1. If the Pod is not ours, return success immediately and leave it to other plugins.
2. If it is ours, look up the associated VMI/VM CR. There are two cases:
   - Found: the VM has been scheduled before, and the pod may have been deleted and is being rescheduled. Parse out the original node for use in the later Filter() stage.
   - Not found: this is the first scheduling. Do nothing and let the default scheduler choose the initial node.
3. Save the pod and the node selected for it (empty if there is none) into a state context, which is passed on to the later Filter() stage.

```go
// PreFilter invoked at the preFilter extension point.
func (pl *StickyVM) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
	s := stickyState{false, ""}
	// Make the result available to Filter() and PostBind() on every return path.
	// Assumption: stickyState implements framework.StateData (not shown in the excerpt).
	defer state.Write(stateKey, &s)

	// Get pod owner reference
	podOwnerRef := getPodOwnerRef(pod)
	if podOwnerRef == nil {
		return nil, framework.NewStatus(framework.Success, "Pod owner ref not found, return")
	}

	// Get VMI
	vmiName := podOwnerRef.Name
	ns := pod.Namespace
	vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		return nil, framework.NewStatus(framework.Error, "get vmi failed")
	}

	vmiOwnerRef := getVMIOwnerRef(vmi)
	if vmiOwnerRef == nil {
		return nil, framework.NewStatus(framework.Success, "VMI owner ref not found, return")
	}

	// Get VM
	vmName := vmiOwnerRef.Name
	vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		return nil, framework.NewStatus(framework.Error, "get vm failed")
	}

	// Read the sticky node (if any) from the VM annotation
	s.node, s.nodeExists = vm.Annotations[stickyAnnotationKey]
	return nil, framework.NewStatus(framework.Success, "Check pod/vmi/vm finish, return")
}
```

3.2.2 Filter()

The scheduler first selects some candidate nodes for us based on the pod's nodeSelector and so on. It then iterates over these nodes and calls each plugin's Filter() method to check whether the node is suitable. In pseudocode:

```
// For a given pod
for node in selectedNodes:
    for pl in plugins:
        pl.Filter(ctx, customState, pod, node)
```

Our plugin logic:

1. Parse the state/pod/node information passed in.
2. If the state holds a saved node: return success if it is the node currently passed to Filter(), and return failure for every other node. The effect is that once the pod has been scheduled to a node, it keeps being scheduled to that node, which is exactly "sticky node scheduling".
3. If the state holds no saved node, this is the first scheduling, so also return success and let the default scheduler assign a node. The later PostBind stage saves this node to the VM annotation.

```go
func (pl *StickyVM) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	s, err := state.Read(stateKey)
	if err != nil {
		return framework.NewStatus(framework.Error, fmt.Sprintf("read preFilter state fail: %v", err))
	}

	r, ok := s.(*stickyState)
	if !ok {
		return framework.NewStatus(framework.Error, fmt.Sprintf("convert %+v to stickyState fail", s))
	}

	// First scheduling: no sticky node yet, accept every node.
	if !r.nodeExists {
		return nil
	}

	if r.node != nodeInfo.Node().Name {
		// returning "framework.Error" will prevent process on other nodes
		return framework.NewStatus(framework.Unschedulable, "already stick to another node")
	}

	return nil
}
```

3.2.3 PostBind()

Reaching this stage means a node has already been selected for the pod. We only need to check whether this node has been saved to the VM CR and save it if not.

```go
func (pl *StickyVM) PostBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
	s, err := state.Read(stateKey)
	if err != nil {
		return
	}

	r, ok := s.(*stickyState)
	if !ok {
		klog.Errorf("PostBind: pod %s/%s: convert failed", pod.Namespace, pod.Name)
		return
	}

	if r.nodeExists {
		klog.Errorf("PostBind: VM already has sticky annotation, return")
		return
	}

	// Get pod owner reference
	podOwnerRef := getPodOwnerRef(pod)
	if podOwnerRef == nil {
		return
	}

	// Get VMI owner reference
	vmiName := podOwnerRef.Name
	ns := pod.Namespace
	vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		return
	}

	vmiOwnerRef := getVMIOwnerRef(vmi)
	if vmiOwnerRef == nil {
		return
	}

	// Add sticky node to VM annotations, retrying on update conflicts
	retry.RetryOnConflict(retry.DefaultRetry, func() error {
		vmName := vmiOwnerRef.Name
		vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
		if err != nil {
			return err
		}

		if vm.Annotations == nil {
			vm.Annotations = make(map[string]string)
		}

		vm.Annotations[stickyAnnotationKey] = nodeName
		if _, err := pl.kubevirtClient.VirtualMachines(pod.Namespace).Update(ctx, vm, metav1.UpdateOptions{}); err != nil {
			return err
		}
		return nil
	})
}
```

As mentioned earlier, this stage is informational. It cannot influence the scheduling decision, so it has no return value.

3.2.4 Other notes

The above is the core code. Add some initialization code and the required scaffolding and it compiles and runs. The complete code (excluding dependency packages) is linked from the original article.

In real development, golang dependency issues can be troublesome and have to be resolved according to the k8s version, scheduler-plugins version, golang version, kubevirt version, and so on.

3.3 Deployment

Scheduling plugins differ from network CNI plugins. The latter are executables (binaries) that only need to be placed in a designated directory, whereas scheduling plugins are long-running services.

3.3.1 Configuration

Create a configuration for our StickyVM scheduler:

```yaml
$ cat ksc.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.kubeconfig"
profiles:
- schedulerName: stickyvm
  plugins:
    preFilter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodeResourcesFit
    filter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodePorts
      # - name: "*"
    reserve:
      disabled:
      - name: "*"
    preBind:
      disabled:
      - name: "*"
    postBind:
      enabled:
      - name: StickyVM
      disabled:
      - name: "*"
```

One KubeSchedulerConfiguration can describe multiple profiles, each of which runs as an independent scheduler. This configuration is for kube-scheduler, not for kube-apiserver:

```yaml
# content of the file passed to "--config"
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
```

So neither `k api-resources` nor `k get KubeSchedulerConfiguration` will find this resource.

A pod selects a profile by setting the corresponding schedulerName. If none is specified, default-scheduler is used.

3.3.2 Running

No configuration change to k8s is needed. Deploy and run it as an ordinary pod (with a suitable ClusterRole and so on). For convenience, here it is started directly from a development machine with the k8s cluster admin certificate, which suits fast iteration during development:

```
$ ./bin/stickyvm-scheduler --leader-elect=false --config ksc.yaml
Creating StickyVM scheduling plugin
Creating kubevirt clientset
Create kubevirt clientset successful
Create StickyVM scheduling plugin successful
Starting Kubernetes Scheduler version="v0.0.20231122"
Golang settings GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Serving securely on [::]:10259
Starting DynamicServingCertificateController
```

3.4 Testing

Only the scheduler name needs to be specified in the VM CR spec.

3.4.1 First creation of a VM

Workflow when a new VM is created:

1. The yaml specifies schedulerName: stickyvm.
2. The k8s default scheduling logic picks a node automatically.
3. StickyVM follows the owner references to the VMI and then the VM, and in the postbind hook adds the node to the VM annotation.

Log:

```
Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM has no sticky node, skip to write to scheduling context
Prefilter: finish
Filter: start
Filter: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp, sticky node not exist, got node-1, return success
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PostBind: annotating selected node node-1 to VM
PostBind: parent is VirtualMachineInstance kubevirt-smoke-fedora
PostBind: found corresponding VMI
PostBind: found corresponding VM
PostBind: annotating node node-1 to VM: kubevirt-smoke-fedora
```

3.4.2 Deleting the VMI/Pod and rescheduling

After the vmi or pod is deleted, the StickyVM plugin reads the node from the annotation in the prefilter stage and then checks it in the filter stage. It returns success only when the node being filtered is that node, which produces the sticky-node effect:

```
Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM already sticky to node node-1, write to scheduling context
Prefilter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-2
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, given node is sticky node node-1, return success
Filter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-3
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PostBind: VM already has sticky annotation, return
```

At this point the VM already has the annotation, so the postbind stage has nothing to do.

4 Summary

This article collected material on the k8s scheduling framework and its extension plugins, and walked through development and deployment with an example.

References

1. github.com/kubernetes-sigs/scheduler-plugins
2. Virtual Machines on Kubernetes: Requirements and Solutions (2023)
3. Spawn a Virtual Machine in Kubernetes with kubevirt: A Deep Dive (2023)
4. Scheduling Framework, kubernetes.io


http://www.w-s-a.com/news/287706/