• Steven Rostedt (Red Hat)'s avatar
    ring-buffer: Always reset iterator to reader page · 651e22f2
    Steven Rostedt (Red Hat) authored
    When performing a consuming read, the ring buffer swaps out a
    page from the ring buffer with a empty page and this page that
    was swapped out becomes the new reader page. The reader page
    is owned by the reader and since it was swapped out of the ring
    buffer, writers do not have access to it (there's an exception
    to that rule, but it's out of scope for this commit).
    
    When reading the "trace" file, it is a non consuming read, which
    means that the data in the ring buffer will not be modified.
    When the trace file is opened, a ring buffer iterator is allocated
    and writes to the ring buffer are disabled, such that the iterator
    will not have issues iterating over the data.
    
    Although the ring buffer disabled writes, it does not disable other
    reads, or even consuming reads. If a consuming read happens, then
    the iterator is reset and starts reading from the beginning again.
    
    My tests would sometimes trigger this bug on my i386 box:
    
    WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
    Modules linked in:
    CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
    Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
     00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
     f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
     ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
    Call Trace:
     [<c18796b3>] dump_stack+0x4b/0x75
     [<c103a0e3>] warn_slowpath_common+0x7e/0x95
     [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
     [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
     [<c103a185>] warn_slowpath_fmt+0x33/0x35
     [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
     [<c10bed04>] trace_find_cmdline+0x40/0x64
     [<c10c3c16>] trace_print_context+0x27/0xec
     [<c10c4360>] ? trace_seq_printf+0x37/0x5b
     [<c10c0b15>] print_trace_line+0x319/0x39b
     [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
     [<c10c13b1>] s_show+0x192/0x1ab
     [<c10bfd9a>] ? s_next+0x5a/0x7c
     [<c112e76e>] seq_read+0x267/0x34c
     [<c1115a25>] vfs_read+0x8c/0xef
     [<c112e507>] ? seq_lseek+0x154/0x154
     [<c1115ba2>] SyS_read+0x54/0x7f
     [<c188488e>] syscall_call+0x7/0xb
    ---[ end trace 3f507febd6b4cc83 ]---
    >>>> ##### CPU 1 buffer started ####
    
    Which was the __trace_find_cmdline() function complaining about the pid
    in the event record being negative.
    
    After adding more test cases, this would trigger more often. Strangely
    enough, it would never trigger on a single test, but instead would trigger
    only when running all the tests. I believe that was the case because it
    required one of the tests to be shutting down via delayed instances while
    a new test started up.
    
    After spending several days debugging this, I found that it was caused by
    the iterator becoming corrupted. Debugging further, I found out why
    the iterator became corrupted. It happened with the rb_iter_reset().
    
    As consuming reads may not read the full reader page, and only part
    of it, there's a "read" field to know where the last read took place.
    The iterator, must also start at the read position. In the rb_iter_reset()
    code, if the reader page was disconnected from the ring buffer, the iterator
    would start at the head page within the ring buffer (where writes still
    happen). But the mistake there was that it still used the "read" field
    to start the iterator on the head page, where it should always start
    at zero because readers never read from within the ring buffer where
    writes occur.
    
    I originally wrote a patch to have it set the iter->head to 0 instead
    of iter->head_page->read, but then I questioned why it wasn't always
    setting the iter to point to the reader page, as the reader page is
    still valid.  The list_empty(reader_page->list) just means that it was
    successful in swapping out. But the reader_page may still have data.
    
    There was a bug report a long time ago that was not reproducible that
    had something about trace_pipe (consuming read) not matching trace
    (iterator read). This may explain why that happened.
    
    Anyway, the correct answer to this bug is to always use the reader page
    an not reset the iterator to inside the writable ring buffer.
    
    Cc: stable@vger.kernel.org # 2.6.28+
    Fixes: d769041f "ring_buffer: implement new locking"
    Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
    651e22f2
ring_buffer.c 129 KB