Discussion:
[PATCH 2/2] x86/modules: Make x86 module allocations flush before freeing
Rick Edgecombe
2018-11-28 00:07:54 UTC
Permalink
Change the module allocations to flush before freeing the pages.

Signed-off-by: Rick Edgecombe <***@intel.com>
---
arch/x86/kernel/module.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..1694daf256b3 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
--
2.17.1
Rick Edgecombe
2018-11-28 00:07:53 UTC
Permalink
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries pointing to freed pages that could get re-used.
This is undesirable for cases where the memory being freed has special
permissions, such as executable.

Having callers flush the TLB after calling vfree still leaves a window where
the pages are freed but the TLB entries remain. Also, the entire operation can
be deferred if vfree is called from an interrupt, in which case a TLB flush
after calling vfree would miss the operation entirely. So, to support this use
case, a new flag VM_IMMEDIATE_UNMAP is added that causes the free operation
to take place like this:
1. Unmap
2. Flush TLB/Unmap aliases
3. Free pages
In the deferred case these steps are all done by the work queue.

This implementation derives from two sketches from Dave Hansen and
Andy Lutomirski.

Suggested-by: Dave Hansen <***@intel.com>
Suggested-by: Andy Lutomirski <***@kernel.org>
Suggested-by: Will Deacon <***@arm.com>
Signed-off-by: Rick Edgecombe <***@intel.com>
---
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 13 +++++++++++--
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd61..cca6b6b83cf0 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -21,6 +21,7 @@ struct notifier_block; /* in notifier.h */
#define VM_UNINITIALIZED 0x00000020 /* vm_struct is not fully initialized */
#define VM_NO_GUARD 0x00000040 /* don't add guard page */
#define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */
+#define VM_IMMEDIATE_UNMAP 0x00000200 /* flush before releasing pages */
/* bits [20..32] reserved for arch specific ioremap internals */

/*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 97d4b25d0373..68766651b5a7 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1516,6 +1516,14 @@ static void __vunmap(const void *addr, int deallocate_pages)
debug_check_no_obj_freed(area->addr, get_vm_area_size(area));

remove_vm_area(addr);
+
+ /*
+ * Need to flush the TLB before freeing pages in the case of this flag.
+ * As long as that's happening, unmap aliases.
+ */
+ if (area->flags & VM_IMMEDIATE_UNMAP)
+ vm_unmap_aliases();
+
if (deallocate_pages) {
int i;

@@ -1925,8 +1933,9 @@ EXPORT_SYMBOL(vzalloc_node);

void *vmalloc_exec(unsigned long size)
{
- return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
- NUMA_NO_NODE, __builtin_return_address(0));
+ return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+ GFP_KERNEL, PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
}

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
--
2.17.1
Edgecombe, Rick P
2018-12-04 00:04:21 UTC
Permalink
It looks like this new flag is in linux-next now. As I read it, these
architectures have a module_alloc() that uses some sort of executable page
protection and do not use the default module_alloc(), which is already covered,
so they may need the flag plugged in:
arm
arm64
parisc
s390
unicore32

Thanks,

Rick
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries to freed pages that could get re-used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
Having callers flush the TLB after calling vfree still leaves a window where
the pages are freed, but the TLB entry remains. Also the entire operation can be
deferred if the vfree is called from an interrupt and so a TLB flush after
calling vfree would miss the entire operation. So in order to support this use
case, a new flag VM_IMMEDIATE_UNMAP is added, that will cause the free operation
to take place like this:
1. Unmap
2. Flush TLB/Unmap aliases
3. Free pages
In the deferred case these steps are all done by the work queue.
This implementation derives from two sketches from Dave Hansen and
Andy Lutomirski.
Nadav Amit
2018-12-04 01:43:11 UTC
Permalink
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries to freed pages that could get re-used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking place, by handling the kprobes & ftrace cases that I missed (thanks
again for pointing them out).

But all of a sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing it, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data, executable before freeing it???

In other words: disable_ro_nx() is called by free_module() before freeing
the memory. Wouldn’t inverting the logic make much more sense? I am
confused.

-- >8 --

From: Nadav Amit <***@vmware.com>
Subject: [PATCH] modules: disable_ro_nx() should enable nx

---
kernel/module.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 7cb207249437..e12d760ea3b0 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2029,14 +2029,13 @@ void set_all_modules_text_ro(void)

static void disable_ro_nx(const struct module_layout *layout)
{
+ frob_text(layout, set_memory_nx);
+
if (rodata_enabled) {
frob_text(layout, set_memory_rw);
frob_rodata(layout, set_memory_rw);
frob_ro_after_init(layout, set_memory_rw);
}
- frob_rodata(layout, set_memory_x);
- frob_ro_after_init(layout, set_memory_x);
- frob_writable_data(layout, set_memory_x);
}

#else
--
2.17.1
Will Deacon
2018-12-04 16:03:04 UTC
Permalink
Post by Nadav Amit
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.

If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?

Is it just nios2 that does something different?

Will
Edgecombe, Rick P
2018-12-04 20:02:03 UTC
Permalink
Post by Will Deacon
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Will
Yeah, it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well; in fact, that was what I first thought the solution should be,
until this was suggested. It's interesting that in the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But on the other architectures, I have since
learned, it is a bit different.

It looks like most architectures don't actually re-define set_memory_*, and so
all of the frob_* functions are just no-ops. In that case, allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So on those architectures, set_memory_nx won't solve it because it
will do nothing.

On x86 I think you cannot get rid of disable_ro_nx fully, because the
permissions on the direct map are changed as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this correctly.

The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do, which is prevent the pages from getting released too
early.

A clearer solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread, though, modules does a lot
more there than the other module_alloc callers. I think it may take some
thought to centralize this AND make it optimal for every
module_alloc/vmalloc_exec user and architecture.

But for now, with the change in vmalloc, we can block the re-use of freed
pages that had executable mappings.
Andy Lutomirski
2018-12-04 20:09:49 UTC
Permalink
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
Post by Edgecombe, Rick P
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most arch's don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
Exactly.

After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up on architectures that implement set_memory_ro(), but
that should be quite easy. If nothing else, it could fall back to
set_memory_ro() in the absence of a better implementation.
Edgecombe, Rick P
2018-12-04 23:52:58 UTC
Permalink
Post by Andy Lutomirski
Exactly.
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
With arch hooks, I guess we could remove disable_ro_nx then. I think you would
still have to flush twice on x86 to really have no W^X-violating window from
the direct map (I think x86 is the only one that sets permissions there?). But
this could be down from sometimes three flushes. You could also directly vfree
non-exec RO memory without set_memory_ calls, like in BPF.

The vfree deferred list would need to be moved, since it couldn't reuse the
allocations now that the vfreed memory might be RO. It could kmalloc, or look
up the vm_struct, so it would probably be a little slower in the interrupt
case. Is this ok?

Thanks,
Andy Lutomirski
2018-12-05 01:57:26 UTC
Permalink
Post by Edgecombe, Rick P
Post by Andy Lutomirski
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
Post by Edgecombe, Rick P
Post by Will Deacon
Post by Nadav Amit
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Since vfree will lazily flush the TLB, but not lazily free the underlying
pages,
it often leaves stale TLB entries to freed pages that could get re-
used.
This is
undesirable for cases where the memory being freed has special
permissions
such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks
again
for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most arch's don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
Exactly.
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
With arch hooks, I guess we could remove disable_ro_nx then. I think you would
still have to flush twice on x86 to really have no W^X violating window from the
direct map (I think x86 is the only one that sets permissions there?). But this
could be down from sometimes 3. You could also directly vfree non exec RO memory
without set_memory_, like in BPF.
Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the direct map.
Post by Edgecombe, Rick P
The vfree deferred list would need to be moved since it then couldn't reuse the
allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
the vm_struct. So would probably be a little slower in the interrupt case. Is
this ok?
I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.
Will Deacon
2018-12-05 11:41:49 UTC
Permalink
Post by Andy Lutomirski
Post by Edgecombe, Rick P
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
Exactly.
Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.
Post by Andy Lutomirski
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.

Will

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/core&id=c55191e96caa9d787e8f682c5e525b7f8172a3b4
Andy Lutomirski
2018-12-05 23:16:33 UTC
Permalink
Post by Will Deacon
Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.
Post by Andy Lutomirski
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.
Right, and Rick pointed out the same issue. Instead, we should set
the direct map not-present or its ARM equivalent, then do the flush,
then make it RW. I assume this also works on arm and arm64, although
I don't know for sure. On x86, the CPU won't cache not-present PTEs.
Ard Biesheuvel
2018-12-06 07:29:03 UTC
Permalink
Post by Andy Lutomirski
Right, and Rick pointed out the same issue. Instead, we should set
the direct map not-present or its ARM equivalent, then do the flush,
then make it RW. I assume this also works on arm and arm64, although
I don't know for sure. On x86, the CPU won't cache not-present PTEs.
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
Will Deacon
2018-12-06 11:10:58 UTC
Permalink
Post by Ard Biesheuvel
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
Right, that should be pretty straightforward. We're basically saying that
RO in the vmalloc area implies PROT_NONE in the linear map, so we could
just do this in our set_memory_ro() function.

Will
Andy Lutomirski
2018-12-06 18:53:50 UTC
Permalink
Post by Ard Biesheuvel
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.

RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.

(It seems like some people call it the linear map and some people call
it the direct map. Is there any preference?)
Tycho Andersen
2018-12-06 19:01:15 UTC
Permalink
Post by Andy Lutomirski
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.

Seems like maybe we should do this for rodata today?
Post by Andy Lutomirski
(It seems like some people call it the linear map and some people call
it the direct map. Is there any preference?)
...and some people call it the physmap :)

Tycho
Andy Lutomirski
2018-12-06 19:19:36 UTC
Permalink
Post by Tycho Andersen
Post by Andy Lutomirski
Post by Ard Biesheuvel
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.
I just read the code, and it looks like vmalloc() is already using
highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.

So I say we go for it. This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86. The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back. On x86, this should only
be a few lines of code.

What do you all think? This should solve most of the problems we have.

If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB. And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.

I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
Nadav Amit
2018-12-06 19:39:32 UTC
Permalink
Post by Andy Lutomirski
I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.

At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call set_all_modules_text_rw()
and override all the protections.
Andy Lutomirski
2018-12-06 20:17:13 UTC
Permalink
Post by Nadav Amit
Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.
At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call set_all_modules_text_rw()
and override all the protections.
I am far from an expert on exploit techniques, but here's a
potentially useful model: let's assume there's an attacker who can
write controlled data to a controlled kernel address but cannot
directly modify control flow. It would be nice for such an attacker
to have a very difficult time of modifying kernel text or of
compromising control flow. So we're assuming a feature like kernel
CET or that the attacker finds it very difficult to do something like
modifying some thread's IRET frame.

Admittedly, for the kernel, this is an odd threat model, since an
attacker can presumably quite easily learn the kernel stack address of
one of their tasks, do some syscall, and then modify their kernel
thread's stack such that it will IRET right back to a fully controlled
register state with RSP pointing at an attacker-supplied kernel stack.
So this threat model gives very strong ROP powers unless we have
either CET or some software technique to harden all the RET
instructions in the kernel.

I wonder if there's a better model to use. Maybe with stack-protector
we get some degree of protection? Or is all of this rather weak
until we have CET or a RAP-like feature?
Nadav Amit
2018-12-06 23:08:02 UTC
Permalink
Post by Andy Lutomirski
I wonder if there's a better model to use. Maybe with stack-protector
we get some degree of protection? Or is all of this rather weak
until we have CET or a RAP-like feature?
I believe that seeing the end-goal would make reasoning about patches
easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
up.

I’m not sure CET or other CFI would be enough even with this threat-model.
The page-tables (the very least) need to be write-protected, as otherwise
controlled data writes may just modify them. There are various possible
solutions I presume: write_rare for page-tables, hypervisor-assisted
security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
hardware enclave.

What do you think?
Edgecombe, Rick P
2018-12-07 03:06:22 UTC
Permalink
Post by Nadav Amit
Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.
At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call
set_all_modules_text_rw()
and override all the protections.
I am far from an expert on exploit techniques, but here's a
potentially useful model: let's assume there's an attacker who can
write controlled data to a controlled kernel address but cannot
directly modify control flow. It would be nice for such an attacker
to have a very difficult time of modifying kernel text or of
compromising control flow. So we're assuming a feature like kernel
CET or that the attacker finds it very difficult to do something like
modifying some thread's IRET frame.
Admittedly, for the kernel, this is an odd threat model, since an
attacker can presumably quite easily learn the kernel stack address of
one of their tasks, do some syscall, and then modify their kernel
thread's stack such that it will IRET right back to a fully controlled
register state with RSP pointing at an attacker-supplied kernel stack.
So this threat model gives very strong ROP powers, unless we have
either CET or some software technique to harden all the RET
instructions in the kernel.
I wonder if there's a better model to use. Maybe with stack-protector
we get some degree of protection? Or is all of this rather weak
until we have CET or a RAP-like feature?
I believe that seeing the end-goal would make reasoning about patches
easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
up.
I’m not sure CET or other CFI would be enough even with this threat-model.
The page-tables (the very least) need to be write-protected, as otherwise
controlled data writes may just modify them. There are various possible
solutions I presume: write_rare for page-tables, hypervisor-assisted
security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
hardware enclave.
What do you think?
I am not sure which issue you are talking about. I think there are actually two
separate issues whose discussions have merged because their fixes overlap
around the teardown W^X window.

For the W^X stuff I had originally imagined the protection was for when an
attacker has a limited bug that could write to a location in the module space,
but not other locations due to only having the ability to overwrite part of a
pointer or something like that. Then the module could execute the new code
as it ran normally after finishing loading. So that is why I was wondering about
the RW window during load. Still seems generally sensible to enforce W^X though.

I like your idea about something like text_poke to load modules. I think maybe
my modules KASLR patchset could help the above somewhat too since it loads at a
freshly randomized address.

Since the issue with the freed pages before flush (the original source of this
thread) doesn't require a write bug to insert the code, but does require a way
to jump to it, it's kind of the opposite model of the above. So that's why I
think they are different.

I am still learning lots on kernel exploits though, maybe Kees can provide some
better insight here?

Edgecombe, Rick P
2018-12-06 20:19:35 UTC
Permalink
Post by Andy Lutomirski
Post by Tycho Andersen
Post by Andy Lutomirski
Post by Ard Biesheuvel
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.
I just read the code, and it looks like vmalloc() is already using
highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.
So I say we go for it. This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86. The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back. On x86, this should only
be a few lines of code.
What do you all think? This should solve most of the problems we have.
If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB. And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.
I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
Hi,

Sorry, I've been having email troubles today.

I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations was readable or not. I couldn't see any
places where it would cause problems today though.

I was ready to assume that no TLBs cache NP (not-present) entries, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes. If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.

If you all think so I can indeed take a shot at it, I just don't see what the
problem was with the original solution, which seems less likely to break.
Andy Lutomirski
2018-12-06 20:26:18 UTC
Permalink
On Thu, Dec 6, 2018 at 12:20 PM Edgecombe, Rick P
Post by Edgecombe, Rick P
Post by Andy Lutomirski
Post by Tycho Andersen
Post by Andy Lutomirski
Post by Ard Biesheuvel
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.
I just read the code, and it looks like vmalloc() is already using
highmem (__GFP_HIGHMEM) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.
So I say we go for it. This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86. The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back. On x86, this should only
be a few lines of code.
What do you all think? This should solve most of the problems we have.
If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB. And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.
I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
Hi,
Sorry, I've been having email troubles today.
I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations were readable or not. I couldn't see any
places where it would cause problems today though.
I was ready to assume that all TLBs don't cache NP, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes.
Or the architecture just fixes up the spurious faults, I suppose. I'm
only well-educated on the x86 mmu.
Post by Edgecombe, Rick P
If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.
Hmm. The only reason I've proposed anything fancier was because I was
thinking of minimizing flushes, but I think I'm being silly. This
sequence ought to work optimally:

- vmalloc(..., VM_HAS_DIRECT_MAP_PERMS); /* no flushes */

- Write some data, via vmalloc's return address.

- Use some set_memory_whatever() functions to update permissions,
which will flush, hopefully just once.

- Run the module code!

- vunmap -- this will do a single flush that will fix everything.

This does require that set_pages_np() or set_memory_np() or whatever
exists and that it's safe to do that, then flush, and then
set_pages_rw(). So maybe you want set_pages_np_noflush() and
set_pages_rw_noflush() to make it totally clear what's supposed to
happen.

--Andy
Ard Biesheuvel
2018-12-06 19:04:09 UTC
Permalink
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Will Deacon
Post by Andy Lutomirski
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
Post by Edgecombe, Rick P
Post by Will Deacon
Post by Nadav Amit
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying
pages, it often leaves stale TLB entries to freed pages that could get re-used.
This is undesirable for cases where the memory being freed has special
permissions such as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most archs don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
Exactly.
Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.
Post by Andy Lutomirski
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up for the architectures that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.
Right, and Rick pointed out the same issue. Instead, we should set
the direct map not-present or its ARM equivalent, then do the flush,
then make it RW. I assume this also works on arm and arm64, although
I don't know for sure. On x86, the CPU won't cache not-present PTEs.
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
The crypto code shouldn't care, but I think it will probably break hibernate :-(
Post by Andy Lutomirski
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
(It seems like some people call it the linear map and some people call
it the direct map. Is there any preference?)
Either is fine with me.
Andy Lutomirski
2018-12-06 19:20:51 UTC
Permalink
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
Post by Ard Biesheuvel
Post by Andy Lutomirski
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
Ard Biesheuvel
2018-12-06 19:23:20 UTC
Permalink
Post by Andy Lutomirski
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
Post by Ard Biesheuvel
Post by Andy Lutomirski
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.
Will Deacon
2018-12-06 19:31:08 UTC
Permalink
Post by Ard Biesheuvel
Post by Andy Lutomirski
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
Post by Ard Biesheuvel
Post by Andy Lutomirski
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.
Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?

Will
Ard Biesheuvel
2018-12-06 19:36:28 UTC
Permalink
Post by Will Deacon
Post by Ard Biesheuvel
Post by Andy Lutomirski
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
Post by Ard Biesheuvel
Post by Andy Lutomirski
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.
Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?
As long as we can easily figure out whether a certain linear address
is mapped or not, having a special case like that for these mappings
doesn't sound unreasonable.
Nadav Amit
2018-12-04 20:36:44 UTC
Permalink
Post by Edgecombe, Rick P
Post by Will Deacon
Post by Nadav Amit
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries to freed pages that could get re-used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most archs don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do, which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.

All I did was ask - how come the PTEs are executable when they are cleared,
when in fact we manipulate them anyway when the module is removed?

I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
Edgecombe, Rick P
2018-12-04 23:51:38 UTC
Permalink
Post by Nadav Amit
Post by Edgecombe, Rick P
Post by Will Deacon
Post by Nadav Amit
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Since vfree will lazily flush the TLB, but not lazily free the underlying
pages, it often leaves stale TLB entries to freed pages that could get re-used.
This is undesirable for cases where the memory being freed has special
permissions such as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again
for pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most archs don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do, which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every
module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.
Ok, I think I understand. I have been pondering these same things after Masami
Hiramatsu's comments on this thread the other day.
Post by Nadav Amit
All I did was ask - how come the PTEs are executable when they are cleared,
when in fact we manipulate them anyway when the module is removed?
I think the directmap used to be RWX, so maybe historically it's trying to
return it to its default state? Not sure.
Post by Nadav Amit
I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
Totally agree there is overlap in the fixes and we should sync.

What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
with arch hooks? So the allocation goes into vfree fully set up, and vmalloc
frees it and on x86 resets the direct map.
Nadav Amit
2018-12-05 00:01:38 UTC
Permalink
Post by Edgecombe, Rick P
Post by Nadav Amit
Post by Edgecombe, Rick P
Post by Will Deacon
Post by Nadav Amit
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Since vfree will lazily flush the TLB, but not lazily free the underlying
pages, it often leaves stale TLB entries to freed pages that could get re-used.
This is undesirable for cases where the memory being freed has special
permissions such as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again
for pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.
If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?
Is it just nios2 that does something different?
Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most archs don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do, which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every
module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.
Ok, I think I understand. I have been pondering these same things after Masami
Hiramatsu's comments on this thread the other day.
Post by Nadav Amit
All I did is asking - how come the PTEs are executable when they are cleared
they are executable, when in fact we manipulate them when the module is
removed.
I think the direct map used to be RWX, so maybe historically it's trying to return
it to its default state? Not sure.
Post by Nadav Amit
I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
Totally agree there is overlap in the fixes and we should sync.
What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
with arch hooks? So the allocation goes into vfree fully set up, and vmalloc
frees it and, on x86, resets the direct map.
As long as you do it, I have no problem ;-)

You would need to consider all the callers of module_memfree(), and probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with all
that”.
Edgecombe, Rick P
2018-12-05 00:29:59 UTC
Permalink
Post by Nadav Amit
As long as you do it, I have no problem ;-)
You would need to consider all the callers of module_memfree(), and probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with all
that”.
I thought you were trying to prevent having any memory that at any time was W+X;
how does vfree help with the module load time issues, where it starts WRX on
x86?
Nadav Amit
2018-12-05 00:53:03 UTC
Permalink
Post by Edgecombe, Rick P
I thought you were trying to prevent having any memory that at any time was W+X,
how does vfree help with the module load time issues, where it starts WRX on
x86?
I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.

Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.

I suspect it will not be the last version anyhow.

[1] https://lkml.org/lkml/2018/11/21/305
Edgecombe, Rick P
2018-12-05 01:45:10 UTC
Permalink
Post by Nadav Amit
I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.
Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.
I suspect it will not be the last version anyhow.
[1] https://lkml.org/lkml/2018/11/21/305
That seems fine.

And not to make it any more complicated, but how different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.

I'll take a shot at what Andy suggested in the next few days.

Thanks,
Nadav Amit
2018-12-05 02:09:04 UTC
Permalink
Post by Edgecombe, Rick P
That seems fine.
And not to make it any more complicated, but how much different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.
I agree that a complete solution may require blocking any direct write to
a code page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).

I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.

Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed over it and checked later.

Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.

[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341
Andy Lutomirski
2018-12-04 18:56:03 UTC
Permalink
Post by Nadav Amit
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
All the code you're looking at is IMO a very awkward and possibly
incorrect way of doing what's actually necessary: putting the direct map
back the way it wants to be.

Can't we shove this entire mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?

Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Nadav Amit
2018-12-04 19:44:58 UTC
Permalink
Post by Andy Lutomirski
All the code you're looking at is IMO a very awkward and possibly
incorrect of doing what's actually necessary: putting the direct map
the way it wants to be.
Can't we shove this entirely mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?
Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Thanks for the explanation. I read it just after I realized that indeed the
whole purpose of this code is to get cpa_process_alias() to
update the corresponding direct mapping.

This thing (pageattr.c) indeed seems over-engineered and very unintuitive.
Right now I have a list of patch-sets that I owe, so I don’t have the time
to deal with it.

But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W^X for the direct mapping which corresponds with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Andy Lutomirski
2018-12-04 19:48:42 UTC
Permalink
Post by Nadav Amit
But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W+X of the direct-mapping which correspond with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.

--Andy
Nadav Amit
2018-12-04 22:48:40 UTC
Permalink
Post by Andy Lutomirski
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.
This piece of code causes me pain and misery.

So, it turns out that the direct map is *not* changed if you just change
the NX-bit. See change_page_attr_set_clr():

/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;

How many levels of abstraction are broken along the way? What would happen
if somebody tries to change the NX bit and some other bit in the PTE?
Luckily, I don’t think anyone does… at least for now.

So, again, I think the change I proposed makes sense. nios2 does not have
set_memory_x() and it will not be affected.

[ I can add a comment, although I don’t know if nios2 has an NX bit,
and I can’t find any code that defines its PTEs. Actually, where is pte_present()
for nios2 defined? Whatever. ]
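For what it’s worth, the `checkalias` line quoted above can be modeled in a few lines. This is an illustrative userspace sketch, not kernel code: the bit values below are made up (the real constants live in arch/x86/include/asm/pgtable_types.h), but the logic mirrors the check, and it also answers the "NX plus some other bit" question — such a change would in fact go through alias processing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pgprot bit values, for illustration only -- the real
 * constants live in arch/x86/include/asm/pgtable_types.h. */
#define _PAGE_RW 0x2ULL
#define _PAGE_NX 0x8000000000000000ULL

/* Mirrors the test in change_page_attr_set_clr(): alias (direct-map)
 * processing is skipped only when the *sole* bit being set or cleared
 * is _PAGE_NX. */
static bool cpa_checks_aliases(uint64_t mask_set, uint64_t mask_clr)
{
	return (mask_set | mask_clr) != _PAGE_NX;
}
```

So set_memory_x()/set_memory_nx() alone never touch the direct map, while any change that includes one more bit would.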
Andy Lutomirski
2018-12-04 23:27:35 UTC
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries to freed pages that could get re-used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
All the code you're looking at is IMO a very awkward and possibly
incorrect of doing what's actually necessary: putting the direct map
the way it wants to be.
Can't we shove this entirely mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?
Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Thanks for the explanation. I read it just after I realized that indeed the
whole purpose of this code is to get cpa_process_alias()
update the corresponding direct mapping.
This thing (pageattr.c) indeed seems over-engineered and very unintuitive.
Right now I have a list of patch-sets that I owe, so I don’t have the time
to deal with it.
But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W+X of the direct-mapping which correspond with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.
This piece of code causes me pain and misery.
So, it turns out that the direct map is *not* changed if you just change
/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
How many levels of abstraction are broken in the way? What would happen
if somebody tries to change the NX-bit and some other bit in the PTE?
Luckily, I don’t think someone does… at least for now.
So, again, I think the change I proposed makes sense. nios2 does not have
set_memory_x() and it will not be affected.
[ I can add a comment, although I don’t have know if nios2 has an NX bit,
and I don’t find any code that defines PTEs. Actually where is pte_present()
of nios2 being defined? Whatever. ]
At least rename the function, then. The last thing we need is for disable_ro_nx to *enable* NX.
Nadav Amit
2018-12-04 23:34:09 UTC
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Rick Edgecombe
Since vfree will lazily flush the TLB, but not lazily free the underlying pages,
it often leaves stale TLB entries to freed pages that could get re-used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
All the code you're looking at is IMO a very awkward and possibly
incorrect of doing what's actually necessary: putting the direct map
the way it wants to be.
Can't we shove this entirely mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?
Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Thanks for the explanation. I read it just after I realized that indeed the
whole purpose of this code is to get cpa_process_alias()
update the corresponding direct mapping.
This thing (pageattr.c) indeed seems over-engineered and very unintuitive.
Right now I have a list of patch-sets that I owe, so I don’t have the time
to deal with it.
But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W+X of the direct-mapping which correspond with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.
This piece of code causes me pain and misery.
So, it turns out that the direct map is *not* changed if you just change
/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
How many levels of abstraction are broken in the way? What would happen
if somebody tries to change the NX-bit and some other bit in the PTE?
Luckily, I don’t think someone does… at least for now.
So, again, I think the change I proposed makes sense. nios2 does not have
set_memory_x() and it will not be affected.
[ I can add a comment, although I don’t have know if nios2 has an NX bit,
and I don’t find any code that defines PTEs. Actually where is pte_present()
of nios2 being defined? Whatever. ]
At least rename the function, then. The last thing we need is for
disable_ro_nx to *enable* NX.
The code is so horrible right now (IMHO) that it will not make it much
worse. But, yes, I will of course change the name. I just want to finish
this text_poke() patch-set and W+X mappings keep popping up.

Thanks (as usual) for your help.
Edgecombe, Rick P
2018-12-05 01:09:18 UTC
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Since vfree will lazily flush the TLB, but not lazily free the
underlying pages,
it often leaves stale TLB entries to freed pages that could get re-
used. This is
undesirable for cases where the memory being freed has special
permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks
again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that
this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module
memory,
including its data executable before freeing it???
All the code you're looking at is IMO a very awkward and possibly
incorrect of doing what's actually necessary: putting the direct map
the way it wants to be.
Can't we shove this entirely mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?
Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Thanks for the explanation. I read it just after I realized that indeed the
whole purpose of this code is to get cpa_process_alias()
update the corresponding direct mapping.
This thing (pageattr.c) indeed seems over-engineered and very unintuitive.
Right now I have a list of patch-sets that I owe, so I don’t have the time
to deal with it.
But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W+X of the direct-mapping which correspond with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.
This piece of code causes me pain and misery.
So, it turns out that the direct map is *not* changed if you just change
/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
How many levels of abstraction are broken in the way? What would happen
if somebody tries to change the NX-bit and some other bit in the PTE?
Luckily, I don’t think someone does… at least for now.
So, again, I think the change I proposed makes sense. nios2 does not have
set_memory_x() and it will not be affected.
Hold on... so on architectures that don't have set_memory_* but do have something
like NX, won't the stale executable TLB entries continue to point to re-used
pages, so it doesn't fix the problem this patch is trying to address generally? I
see at least a couple archs use vmalloc and have an exec bit, but don't define
set_memory_*.
Nadav Amit
2018-12-05 01:45:04 UTC
Post by Edgecombe, Rick P
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
Post by Andy Lutomirski
Post by Nadav Amit
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Since vfree will lazily flush the TLB, but not lazily free the
underlying pages,
it often leaves stale TLB entries to freed pages that could get re-
used. This is
undesirable for cases where the memory being freed has special permissions such
as executable.
So I am trying to finish my patch-set for preventing transient W+X mappings
from taking space, by handling kprobes & ftrace that I missed (thanks again for
pointing it out).
But all of the sudden, I don’t understand why we have the problem that this
(your) patch-set deals with at all. We already change the mappings to make
the memory writable before freeing the memory, so why can’t we make it
non-executable at the same time? Actually, why do we make the module memory,
including its data executable before freeing it???
All the code you're looking at is IMO a very awkward and possibly
incorrect of doing what's actually necessary: putting the direct map
the way it wants to be.
Can't we shove this entirely mess into vunmap? Have a flag (as part
of vmalloc like in Rick's patch or as a flag passed to a vfree variant
directly) that makes the vunmap code that frees the underlying pages
also reset their permissions?
Right now, we muck with set_memory_rw() and set_memory_nx(), which
both have very awkward (and inconsistent with each other!) semantics
when called on vmalloc memory. And they have their own flushes, which
is inefficient. Maybe the right solution is for vunmap to remove the
vmap area PTEs, call into a function like set_memory_rw() that resets
the direct maps to their default permissions *without* flushing, and
then to do a single flush for everything. Or, even better, to cause
the change_page_attr code to do the flush and also to flush the vmap
area all at once so that very small free operations can flush single
pages instead of flushing globally.
Thanks for the explanation. I read it just after I realized that indeed the
whole purpose of this code is to get cpa_process_alias()
update the corresponding direct mapping.
This thing (pageattr.c) indeed seems over-engineered and very unintuitive.
Right now I have a list of patch-sets that I owe, so I don’t have the time
to deal with it.
But, I still think that disable_ro_nx() should not call set_memory_x().
IIUC, this breaks W+X of the direct-mapping which correspond with the module
memory. Does it ever stop being W+X?? I’ll have another look.
Dunno. I did once chase down a bug where some memory got freed while
it was still read-only, and the results were hilarious and hard to
debug, since the explosion happened long after the buggy code
finished.
This piece of code causes me pain and misery.
So, it turns out that the direct map is *not* changed if you just change
/* No alias checking for _NX bit modifications */
checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
How many levels of abstraction are broken in the way? What would happen
if somebody tries to change the NX-bit and some other bit in the PTE?
Luckily, I don’t think someone does… at least for now.
So, again, I think the change I proposed makes sense. nios2 does not have
set_memory_x() and it will not be affected.
Hold on...so on architectures that don't have set_memory_ but do have something
like NX, wont the executable stale TLB continue to live to re-used pages, and so
it doesn't fix the problem this patch is trying to address generally? I see at
least a couple archs use vmalloc and have an exec bit, but don't define
set_memory_*.
Again, this is not a replacement for your patch (the one in this thread).
And if you follow Andy’s suggestion, the patch I propose will not be needed.
However, in the meantime I see no reason to mark data as executable, even
for a brief period of time.
Nadav Amit
2018-11-28 01:06:05 UTC
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
There are sort of three categories of this which show up across modules, bpf,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
Interesting!

Note that this may cause conflict with "x86: avoid W^X being broken during
modules loading”, which I recently submitted.
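The ordering difference the cover letter describes (lazy vfree vs. the proposed unmap, flush, then free) can be sketched as a toy model. This is illustrative userspace C, not kernel code; the `toy_*` names and the `toy_page` struct are made up for the sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* A "page" here just records whether a stale executable TLB entry may
 * still reference it at the moment it is handed back to the allocator. */
struct toy_page {
	bool mapped;
	bool tlb_may_hold_entry;
	bool freed;
};

/* Lazy path: unmap and free the pages now, flush the TLB some time later. */
static void toy_vfree_lazy(struct toy_page *p)
{
	p->mapped = false;
	p->freed = true;		/* page is reusable... */
	p->tlb_may_hold_entry = true;	/* ...while the flush is deferred */
}

/* VM_IMMEDIATE_UNMAP path: unmap, flush, then free. */
static void toy_vfree_immediate(struct toy_page *p)
{
	p->mapped = false;
	p->tlb_may_hold_entry = false;	/* flush before the pages go back */
	p->freed = true;
}
```

In the lazy path a freed page can be re-used while an executable TLB entry still points at it; the immediate path closes that window by ordering the flush before the free.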
Nadav Amit
2018-11-28 01:21:08 UTC
Post by Nadav Amit
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
There are sort of three categories of this which show up across modules, bpf,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
Interesting!
Note that this may cause conflict with "x86: avoid W^X being broken during
modules loading”, which I recently submitted.
I actually have not looked at the vmalloc() code much recently, but it
seems… strange:

void vm_unmap_aliases(void)
{

...
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
flush_tlb_kernel_range(start, end);
mutex_unlock(&vmap_purge_lock);
}

Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
window between the release of the region and the TLB flush, in which the
area can be allocated for another purpose. This can result in a
(theoretical) correctness issue. No?
Will Deacon
2018-11-28 09:57:35 UTC
Post by Nadav Amit
Post by Nadav Amit
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
There are sort of three categories of this which show up across modules, bpf,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
Interesting!
Note that this may cause conflict with "x86: avoid W^X being broken during
modules loading”, which I recently submitted.
I actually have not looked on the vmalloc() code too much recent, but it
void vm_unmap_aliases(void)
{
...
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
flush_tlb_kernel_range(start, end);
mutex_unlock(&vmap_purge_lock);
}
Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
window between the release of the region and the TLB flush, in which the
area can be allocated for another purpose. This can result in a
(theoretical) correctness issue. No?
If __purge_vmap_area_lazy() returns false, then it hasn't freed the memory,
so we only invalidate the TLB if 'flush' is true in that case. If
__purge_vmap_area_lazy() returns true instead, then it takes care of the TLB
invalidation before the freeing.

Will
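The contract Will describes can be restated as a toy model (illustrative userspace C, not kernel code; the `toy_*` names and `outcome` struct are made up): either the purge does the flush itself before freeing, or it frees nothing and leaves the flush to the caller, so there is never a freed-but-unflushed window.

```c
#include <assert.h>
#include <stdbool.h>

struct outcome {
	bool freed;
	bool flushed;
};

/* Toy model of the contract: __purge_vmap_area_lazy() either does
 * nothing and returns false, or it flushes the TLB *before* freeing
 * the lazy areas and returns true.  The caller-side flush in
 * vm_unmap_aliases() only covers the "purged nothing, but a flush was
 * requested" case. */
static struct outcome toy_vm_unmap_aliases(bool have_lazy_areas, bool flush)
{
	struct outcome o = { false, false };
	bool purged = have_lazy_areas;	/* stands in for the return value */

	if (purged) {
		o.flushed = true;	/* flush happens inside the purge... */
		o.freed = true;		/* ...before the areas are freed */
	} else if (flush) {
		o.flushed = true;	/* nothing freed; flush on request */
	}
	return o;
}
```

In every case where pages are freed, a flush has already happened.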
Nadav Amit
2018-11-28 18:29:52 UTC
Post by Will Deacon
Post by Nadav Amit
Post by Nadav Amit
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
There are sort of three categories of this which show up across modules, bpf,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
Interesting!
Note that this may cause conflict with "x86: avoid W^X being broken during
modules loading”, which I recently submitted.
I actually have not looked on the vmalloc() code too much recent, but it
void vm_unmap_aliases(void)
{
...
mutex_lock(&vmap_purge_lock);
purge_fragmented_blocks_allcpus();
if (!__purge_vmap_area_lazy(start, end) && flush)
flush_tlb_kernel_range(start, end);
mutex_unlock(&vmap_purge_lock);
}
Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
window between the release of the region and the TLB flush, in which the
area can be allocated for another purpose. This can result in a
(theoretical) correctness issue. No?
If __purge_vmap_area_lazy() returns false, then it hasn't freed the memory,
so we only invalidate the TLB if 'flush' is true in that case. If
__purge_vmap_area_lazy() returns true instead, then it takes care of the TLB
invalidation before the freeing.
Right. Sorry for my misunderstanding.

Thanks,
Nadav
Andrew Morton
2018-11-28 23:11:45 UTC
Post by Rick Edgecombe
Change the module allocations to flush before freeing the pages.
...
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
Should any other architectures do this?
Edgecombe, Rick P
2018-11-29 00:02:15 UTC
Post by Andrew Morton
Post by Rick Edgecombe
Change the module allocations to flush before freeing the pages.
...
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
Should any other architectures do this?
I would think everything that has something like an NX bit and doesn't use the
default module_alloc implementation.

I could add the flag for every arch that defines PAGE_KERNEL_EXEC, but I don't
have a good way to test on all of those architectures.

Thanks,

Rick
Andy Lutomirski
2018-11-29 01:40:53 UTC
Post by Rick Edgecombe
Change the module allocations to flush before freeing the pages.
---
arch/x86/kernel/module.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..1694daf256b3 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
Hmm. How awful is the resulting performance for heavy eBPF users? I’m
wondering if the JIT will need some kind of cache to reuse
allocations.
Post by Rick Edgecombe
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
--
2.17.1
Edgecombe, Rick P
2018-11-29 06:14:20 UTC
Post by Andy Lutomirski
On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
Change the module allocations to flush before freeing the pages.
---
arch/x86/kernel/module.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b052e883dd8c..1694daf256b3 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -87,8 +87,8 @@ void *module_alloc(unsigned long size)
p = __vmalloc_node_range(size, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
- PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
- __builtin_return_address(0));
+ PAGE_KERNEL_EXEC, VM_IMMEDIATE_UNMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
Hmm. How awful is the resulting performance for heavy eBPF
users? I’m
wondering if the JIT will need some kind of cache to reuse
allocations.
I think it should have no effect for x86 at least. On allocation there
is only the setting of the flag. For freeing there is of course a new
TLB flush, but it happens in a way that should remove one elsewhere for
BPF.

On x86 today there are actually already 3 flushes for the operation
around a module_alloc JIT free. What's happening is there are two
allocations that are RO: the JIT and some data. When freeing, first the
JIT is set RW, then vfreed. So this causes 1 TLB flush from the
set_memory_rw, and there is now a lazy vmap area from the vfree. When
the data allocation is set to RW, vm_unmap_aliases() is called in
pageattr.c:change_page_attr_set_clr, so it will cause a flush from
clearing the lazy area, then there is the third flush as part of the
permissions change like usual.

Since now the JIT vfree will call vm_unmap_aliases(), it should not
trigger a TLB flush in the second permission change, so remain at 3.
Post by Andy Lutomirski
if (p && (kasan_module_alloc(p, size) < 0)) {
vfree(p);
return NULL;
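Rick's flush accounting above can be written down as a back-of-the-envelope model. This is not kernel code, just a counting sketch of the sequence he describes (JIT buffer plus a second RO data allocation being freed on x86), with the function names made up:

```c
#include <assert.h>

/* Flush accounting before the patch: the JIT vfree is deferred, so the
 * later permission change on the data allocation ends up purging the
 * JIT's lazy vmap area and flushing for it. */
static int flushes_before_patch(void)
{
	int flushes = 0;

	flushes++;	/* set_memory_rw() on the JIT buffer */
	/* vfree(jit): deferred, leaves a lazy vmap area, no flush yet */
	flushes++;	/* set_memory_rw() on the data: vm_unmap_aliases()
			 * purges the JIT's lazy area and flushes */
	flushes++;	/* the data permission change flushes as usual */
	return flushes;
}

/* With VM_IMMEDIATE_UNMAP the flush moves into the JIT vfree itself,
 * so the later vm_unmap_aliases() finds nothing left to flush. */
static int flushes_after_patch(void)
{
	int flushes = 0;

	flushes++;	/* set_memory_rw() on the JIT buffer */
	flushes++;	/* vfree(jit) flushes before freeing the pages */
	flushes++;	/* the data permission change flushes as usual */
	return flushes;
}
```

The flush is moved earlier rather than added, which is why the total stays at three either way.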
Masami Hiramatsu
2018-11-29 14:06:16 UTC
On Tue, 27 Nov 2018 16:07:52 -0800
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
Good catch!
There are sort of three categories of this which show up across modules, bpf,
For x86-64 kprobes, it sets the page NX and after that RW, and then releases it
via module_memfree(). So I'm not sure it really happens on kprobes. (Of course
the default memory allocator is simpler so it may happen on other archs) But
interesting fixes.

Thank you,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
2. When executable memory is set to RW right before being freed
In this case (on x86 and probably others) there will be a TLB flush when it's
set to RW, and since the pages are not touched between the flush and the
free, the entry should not be in the TLB in most cases. So this
category is not as big of a concern. However, technically there is still a
race where an attacker could try to keep it alive for a short window with a
well-timed out-of-bounds read or speculative read, so ideally this could be
blocked as well.
3. When executable memory is freed in an interrupt
At least one example of this is the freeing of init sections in the module
loader. Since vmalloc reuses the allocation for the work queue linked list
node for the deferred frees, the memory actually gets touched as part of the
vfree operation and so returns to the TLB even after the flush from resetting
the permissions.
I have only actually tested category 1, and identified 2 and 3 just from reading
the code.
To catch all of these, module_alloc for x86 is changed to use a new flag that
instructs the unmap operation to flush the TLB before freeing the pages.
If this solution seems good I can plug the flag in for other architectures that
define PAGE_KERNEL_EXEC.
vmalloc: New flag for flush before releasing pages
x86/modules: Make x86 allocs to flush when free
arch/x86/kernel/module.c | 4 ++--
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 13 +++++++++++--
3 files changed, 14 insertions(+), 4 deletions(-)
--
2.17.1
--
Masami Hiramatsu <***@kernel.org>
Edgecombe, Rick P
2018-11-29 18:49:26 UTC
Post by Masami Hiramatsu
On Tue, 27 Nov 2018 16:07:52 -0800
Sometimes when memory is freed via the module subsystem, an executable
permissioned TLB entry can remain to a freed page. If the page is re-used to
back an address that will receive data from userspace, it can result in user
data being mapped as executable in the kernel. The root of this behavior is
vfree lazily flushing the TLB, but not lazily freeing the underlying pages.
Good catch!
There are sort of three categories of this which show up across modules, bpf,
For x86-64 kprobe, it sets the page NX and after that RW, and then release
via module_memfree. So I'm not sure it really happens on kprobes. (Of course
the default memory allocator is simpler so it may happen on other archs) But
interesting fixes.
Yes, I think you are right, it should not leave an executable TLB entry in this
case. Ftrace actually does this on x86 as well.

Is there some other reason for calling set_memory_nx that should apply elsewhere
for module users? Or could it be removed in the case of this patch to centralize
the behavior?

Thanks,

Rick
Post by Masami Hiramatsu
Thank you,
1. When executable memory is touched and then immediately freed
This shows up in a couple error conditions in the module loader and BPF JIT
compiler.
2. When executable memory is set to RW right before being freed
In this case (on x86 and probably others) there will be a TLB flush when it's
set to RW, and since the pages are not touched between the flush and the
free, the entry should not be in the TLB in most cases. So this
category is not as big of a concern. However, technically there is still a
race where an attacker could try to keep it alive for a short window with a
well-timed out-of-bounds read or speculative read, so ideally this could be
blocked as well.
3. When executable memory is freed in an interrupt
At least one example of this is the freeing of init sections in the module
loader. Since vmalloc reuses the allocation for the work queue linked list node
for the deferred frees, the memory actually gets touched as part of the vfree
operation, and so it re-enters the TLB even after the flush from resetting the
permissions.
I have only actually tested category 1, and identified 2 and 3 just from reading
the code.
To catch all of these, module_alloc for x86 is changed to use a new flag that
instructs the unmap operation to flush the TLB before freeing the pages.
If this solution seems good I can plug the flag in for other architectures that
define PAGE_KERNEL_EXEC.
vmalloc: New flag for flush before releasing pages
x86/modules: Make x86 allocs to flush when free
arch/x86/kernel/module.c | 4 ++--
include/linux/vmalloc.h | 1 +
mm/vmalloc.c | 13 +++++++++++--
3 files changed, 14 insertions(+), 4 deletions(-)
--
2.17.1
Masami Hiramatsu
2018-11-29 23:19:13 UTC
Permalink
On Thu, 29 Nov 2018 18:49:26 +0000
Post by Edgecombe, Rick P
Post by Masami Hiramatsu
On Tue, 27 Nov 2018 16:07:52 -0800
Sometimes when memory is freed via the module subsystem, a TLB entry with
executable permissions can remain pointing to a freed page. If the page is
re-used to back an address that will receive data from userspace, it can result
in user data being mapped as executable in the kernel. The root of this
behavior is vfree lazily flushing the TLB, but not lazily freeing the
underlying pages.
Good catch!
There are sort of three categories of this which show up across modules, bpf,
For x86-64 kprobes, the page is set NX and after that RW, and then released
via module_memfree. So I'm not sure it really happens on kprobes. (Of course
the default memory allocator is simpler, so it may happen on other archs.) But
these are interesting fixes.
Yes, I think you are right, it should not leave an executable TLB entry in this
case. Ftrace actually does this on x86 as well.
Is there some other reason for calling set_memory_nx that should apply elsewhere
for module users? Or could it be removed in the case of this patch to centralize
the behavior?
According to commit c93f5cf571e7 ("kprobes/x86: Fix to set RWX bits correctly
before releasing trampoline"), if we release a read-only page via
module_memfree(), it causes a kernel crash. And at the moment, x86-64 sets the
trampoline page read-only because it is an executable page. Setting the NX bit
is a security measure, and it should be done before making the page writable.
So I think if you centralize setting the NX bit, it should be done before
setting the writable bit.

Thank you,
--
Masami Hiramatsu <***@kernel.org>