Attacker Value
Unknown
(0 users assessed)
Exploitability
Unknown
(0 users assessed)
User Interaction
Unknown
Privileges Required
Unknown
Attack Vector
Unknown
0

CVE-2024-53054

Disclosure Date: November 19, 2024
Add MITRE ATT&CK tactics and techniques that apply to this CVE.

Description

In the Linux kernel, the following vulnerability has been resolved:

cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction

A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
“echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

This issue can be reproduced by the following pressuse test:

  1. A large number of cpuset cgroups are deleted.
  2. Set cpu on and off repeatly.
  3. Set watchdog_thresh repeatly.
    The scripts can be obtained at LINK mentioned above the signature.

The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:

  1. A large number of cpusets are deleted asynchronously, which puts a
    large number of cgroup_bpf_release works into system_wq. The max_active
    of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
    cgroup_bpf_release works, and many cgroup_bpf_release works will be put
    into inactive queue. As illustrated in the diagram, there are 256 (in
    the acvtive queue) + n (in the inactive queue) works.
  2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
    smp_call_on_cpu work into system_wq. However step 1 has already filled
    system_wq, ‘sscs.work’ is put into inactive queue. ‘sscs.work’ has
    to wait until the works that were put into the inacvtive queue earlier
    have executed (n cgroup_bpf_release), so it will be blocked for a while.
  3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
  4. Cpusets that were deleted at step 1 put cgroup_release works into
    cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
    When cgroup_metux is acqured by work at css_killed_work_fn, it will
    call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
    However, cpuset_css_offline will be blocked for step 3.
  5. At this moment, there are 256 works in active queue that are
    cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
    a result, all of them are blocked. Consequently, sscs.work can not be
    executed. Ultimately, this situation leads to four processes being
    blocked, forming a deadlock.

system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4)

2000+ cgroups deleted asyn
256 actives + n inactives

			__lockup_detector_reconfigure
			P(cpu_hotplug_lock.read)
			put sscs.work into system_wq

256 + n + 1(sscs.work)
sscs.work wait to be executed

			warting sscs.work finish
							percpu_down_write
							P(cpu_hotplug_lock.write)
							...blocking...
										css_killed_work_fn
										P(cgroup_mutex)
										cpuset_css_offline
										P(cpu_hotplug_lock.read)
										...blocking...

256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking…

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn’t create a large number of concurrent work
items. If something is going to generate >
—-truncated—-

Add Assessment

No one has assessed this topic. Be the first to add your voice to the community.

CVSS V3 Severity and Metrics
Base Score:
None
Impact Score:
Unknown
Exploitability Score:
Unknown
Vector:
Unknown
Attack Vector (AV):
Unknown
Attack Complexity (AC):
Unknown
Privileges Required (PR):
Unknown
User Interaction (UI):
Unknown
Scope (S):
Unknown
Confidentiality (C):
Unknown
Integrity (I):
Unknown
Availability (A):
Unknown

General Information

Additional Info

Technical Analysis