- 16 Jun, 2014 12 commits
-
-
Xufeng Zhang authored
[ Upstream commit 85350871 ] commit 813b3b5d (ipv4: Use caller's on-stack flowi as-is in output route lookups.) introduces another regression which is very similar to the problem of commit e6b45241 (ipv4: reset flowi parameters on route connect) wants to fix: Before we call ip_route_output_key() in sctp_v4_get_dst() to get a dst that matches a bind address as the source address, we have already called this function previously and the flowi parameters have been initialized including flowi4_oif, so when we call this function again, the process in __ip_route_output_key() will be different because of the setting of flowi4_oif, and we'll get a networking device which corresponds to the inputted flowi4_oif as the output device, this is wrong because we'll never hit this place if the previously returned source address of dst match one of the bound addresses. To reproduce this problem, a vlan setting is enough: # ifconfig eth0 up # route del default # vconfig add eth0 2 # vconfig add eth0 3 # ifconfig eth0.2 10.0.1.14 netmask 255.255.255.0 # route add default gw 10.0.1.254 dev eth0.2 # ifconfig eth0.3 10.0.0.14 netmask 255.255.255.0 # ip rule add from 10.0.0.14 table 4 # ip route add table 4 default via 10.0.0.254 src 10.0.0.14 dev eth0.3 # sctp_darn -H 10.0.0.14 -P 36422 -h 10.1.4.134 -p 36422 -s -I You'll detect that all the flow are routed to eth0.2(10.0.1.254). Signed-off-by:
Xufeng Zhang <xufeng.zhang@windriver.com> Signed-off-by:
Julian Anastasov <ja@ssi.bg> Acked-by:
Vlad Yasevich <vyasevich@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Toshiaki Makita authored
[ Upstream commit 30313a3d ] When bridge device is created with IFLA_ADDRESS, we are not calling br_stp_change_bridge_id(), which leads to incorrect local fdb management and bridge id calculation, and prevents us from receiving frames on the bridge device. Reported-by:
Tom Gundersen <teg@jklm.no> Signed-off-by:
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Kumar Sundararajan authored
[ Upstream commit 1c265854 ] When the ipv6 fib changes during a table dump, the walk is restarted and the number of nodes dumped are skipped. But the existing code doesn't advance to the next node after a node is skipped. This can cause the dump to loop or produce lots of duplicates when the fib is modified during the dump. This change advances the walk to the next node if the current node is skipped after a restart. Signed-off-by:
Kumar Sundararajan <kumar@fb.com> Signed-off-by:
Chris Mason <clm@fb.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
David Gibson authored
[ Upstream commit c53864fd ] Since 115c9b81 (rtnetlink: Fix problem with buffer allocation), RTM_NEWLINK messages only contain the IFLA_VFINFO_LIST attribute if they were solicited by a GETLINK message containing an IFLA_EXT_MASK attribute with the RTEXT_FILTER_VF flag. That was done because some user programs broke when they received more data than expected - because IFLA_VFINFO_LIST contains information for each VF it can become large if there are many VFs. However, the IFLA_VF_PORTS attribute, supplied for devices which implement ndo_get_vf_port (currently the 'enic' driver only), has the same problem. It supplies per-VF information and can therefore become large, but it is not currently conditional on the IFLA_EXT_MASK value. Worse, it interacts badly with the existing EXT_MASK handling. When IFLA_EXT_MASK is not supplied, the buffer for netlink replies is fixed at NLMSG_GOODSIZE. If the information for IFLA_VF_PORTS exceeds this, then rtnl_fill_ifinfo() returns -EMSGSIZE on the first message in a packet. netlink_dump() will misinterpret this as having finished the listing and omit data for this interface and all subsequent ones. That can cause getifaddrs(3) to enter an infinite loop. This patch addresses the problem by only supplying IFLA_VF_PORTS when IFLA_EXT_MASK is supplied with the RTEXT_FILTER_VF flag set. Signed-off-by:
David Gibson <david@gibson.dropbear.id.au> Reviewed-by:
Jiri Pirko <jiri@resnulli.us> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
David Gibson authored
[ Upstream commit 973462bb ] Without IFLA_EXT_MASK specified, the information reported for a single interface in response to RTM_GETLINK is expected to fit within a netlink packet of NLMSG_GOODSIZE. If it doesn't, however, things will go badly wrong, When listing all interfaces, netlink_dump() will incorrectly treat -EMSGSIZE on the first message in a packet as the end of the listing and omit information for that interface and all subsequent ones. This can cause getifaddrs(3) to enter an infinite loop. This patch won't fix the problem, but it will WARN_ON() making it easier to track down what's going wrong. Signed-off-by:
David Gibson <david@gibson.dropbear.id.au> Reviewed-by:
Jiri Pirko <jpirko@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Vlad Yasevich authored
[ Upstream commit b14878cc ] Currently, it is possible to create an SCTP socket, then switch auth_enable via sysctl setting to 1 and crash the system on connect: Oops[#1]: CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.1-mipsgit-20140415 #1 task: ffffffff8056ce80 ti: ffffffff8055c000 task.ti: ffffffff8055c000 [...] Call Trace: [<ffffffff8043c4e8>] sctp_auth_asoc_set_default_hmac+0x68/0x80 [<ffffffff8042b300>] sctp_process_init+0x5e0/0x8a4 [<ffffffff8042188c>] sctp_sf_do_5_1B_init+0x234/0x34c [<ffffffff804228c8>] sctp_do_sm+0xb4/0x1e8 [<ffffffff80425a08>] sctp_endpoint_bh_rcv+0x1c4/0x214 [<ffffffff8043af68>] sctp_rcv+0x588/0x630 [<ffffffff8043e8e8>] sctp6_rcv+0x10/0x24 [<ffffffff803acb50>] ip6_input+0x2c0/0x440 [<ffffffff8030fc00>] __netif_receive_skb_core+0x4a8/0x564 [<ffffffff80310650>] process_backlog+0xb4/0x18c [<ffffffff80313cbc>] net_rx_action+0x12c/0x210 [<ffffffff80034254>] __do_softirq+0x17c/0x2ac [<ffffffff800345e0>] irq_exit+0x54/0xb0 [<ffffffff800075a4>] ret_from_irq+0x0/0x4 [<ffffffff800090ec>] rm7k_wait_irqoff+0x24/0x48 [<ffffffff8005e388>] cpu_startup_entry+0xc0/0x148 [<ffffffff805a88b0>] start_kernel+0x37c/0x398 Code: dd0900b8 000330f8 0126302d <dcc60000> 50c0fff1 0047182a a48306a0 03e00008 00000000 ---[ end trace b530b0551467f2fd ]--- Kernel panic - not syncing: Fatal exception in interrupt What happens while auth_enable=0 in that case is, that ep->auth_hmacs is initialized to NULL in sctp_auth_init_hmacs() when endpoint is being created. After that point, if an admin switches over to auth_enable=1, the machine can crash due to NULL pointer dereference during reception of an INIT chunk. When we enter sctp_process_init() via sctp_sf_do_5_1B_init() in order to respond to an INIT chunk, the INIT verification succeeds and while we walk and process all INIT params via sctp_process_param() we find that net->sctp.auth_enable is set, therefore do not fall through, but invoke sctp_auth_asoc_set_default_hmac() instead, and thus, dereference what we have set to NULL during endpoint initialization phase. The fix is to make auth_enable immutable by caching its value during endpoint initialization, so that its original value is being carried along until destruction. The bug seems to originate from the very first days. Fix in joint work with Daniel Borkmann. Reported-by:
Joshua Kinard <kumba@gentoo.org> Signed-off-by:
Vlad Yasevich <vyasevic@redhat.com> Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Tested-by:
Joshua Kinard <kumba@gentoo.org> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Ivan Vecera authored
commit ba67b510 upstream. The patch fixes a problem with dropped jumbo frames after usage of 'ethtool -G ... rx'. Scenario: 1. ip link set eth0 up 2. ethtool -G eth0 rx N # <- This zeroes rx-jumbo 3. ip link set mtu 9000 dev eth0 The ethtool command set rx_jumbo_pending to zero so any received jumbo packets are dropped and you need to use 'ethtool -G eth0 rx-jumbo N' to workaround the issue. The patch changes the logic so rx_jumbo_pending value is changed only if jumbo frames are enabled (MTU > 1500). Signed-off-by:
Ivan Vecera <ivecera@redhat.com> Acked-by:
Michael Chan <mchan@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
dingtianhong authored
[ Upstream commit dc8eaaa0 ] When I open the LOCKDEP config and run these steps: modprobe 8021q vconfig add eth2 20 vconfig add eth2.20 30 ifconfig eth2 xx.xx.xx.xx then the Call Trace happened: [32524.386288] ============================================= [32524.386293] [ INFO: possible recursive locking detected ] [32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G O [32524.386302] --------------------------------------------- [32524.386306] ifconfig/3103 is trying to acquire lock: [32524.386310] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386326] [32524.386326] but task is already holding lock: [32524.386330] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386341] [32524.386341] other info that might help us debug this: [32524.386345] Possible unsafe locking scenario: [32524.386345] [32524.386350] CPU0 [32524.386352] ---- [32524.386354] lock(&vlan_netdev_addr_lock_key/1); [32524.386359] lock(&vlan_netdev_addr_lock_key/1); [32524.386364] [32524.386364] *** DEADLOCK *** [32524.386364] [32524.386368] May be due to missing lock nesting notation [32524.386368] [32524.386373] 2 locks held by ifconfig/3103: [32524.386376] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81431d42>] rtnl_lock+0x12/0x20 [32524.386387] #1: (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40 [32524.386398] [32524.386398] stack backtrace: [32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G O 3.14.0-rc2-0.7-default+ #35 [32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [32524.386414] ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8 [32524.386421] ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0 [32524.386428] 000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000 [32524.386435] Call Trace: [32524.386441] [<ffffffff814f68a2>] dump_stack+0x6a/0x78 [32524.386448] [<ffffffff810a35fb>] __lock_acquire+0x7ab/0x1940 [32524.386454] [<ffffffff810a323a>] ? __lock_acquire+0x3ea/0x1940 [32524.386459] [<ffffffff810a4874>] lock_acquire+0xe4/0x110 [32524.386464] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386471] [<ffffffff814fc07a>] _raw_spin_lock_nested+0x2a/0x40 [32524.386476] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0 [32524.386481] [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0 [32524.386489] [<ffffffffa0500cab>] vlan_dev_set_rx_mode+0x2b/0x50 [8021q] [32524.386495] [<ffffffff8141addf>] __dev_set_rx_mode+0x5f/0xb0 [32524.386500] [<ffffffff8141af8b>] dev_set_rx_mode+0x2b/0x40 [32524.386506] [<ffffffff8141b3cf>] __dev_open+0xef/0x150 [32524.386511] [<ffffffff8141b177>] __dev_change_flags+0xa7/0x190 [32524.386516] [<ffffffff8141b292>] dev_change_flags+0x32/0x80 [32524.386524] [<ffffffff8149ca56>] devinet_ioctl+0x7d6/0x830 [32524.386532] [<ffffffff81437b0b>] ? dev_ioctl+0x34b/0x660 [32524.386540] [<ffffffff814a05b0>] inet_ioctl+0x80/0xa0 [32524.386550] [<ffffffff8140199d>] sock_do_ioctl+0x2d/0x60 [32524.386558] [<ffffffff81401a52>] sock_ioctl+0x82/0x2a0 [32524.386568] [<ffffffff811a7123>] do_vfs_ioctl+0x93/0x590 [32524.386578] [<ffffffff811b2705>] ? rcu_read_lock_held+0x45/0x50 [32524.386586] [<ffffffff811b39e5>] ? __fget_light+0x105/0x110 [32524.386594] [<ffffffff811a76b1>] SyS_ioctl+0x91/0xb0 [32524.386604] [<ffffffff815057e2>] system_call_fastpath+0x16/0x1b ======================================================================== The reason is that all of the addr_lock_key for vlan dev have the same class, so if we change the status for vlan dev, the vlan dev and its real dev will hold the same class of addr_lock_key together, so the warning happened. we should distinguish the lock depth for vlan dev and its real dev. v1->v2: Convert the vlan_netdev_addr_lock_key to an array of eight elements, which could support to add 8 vlan id on a same vlan dev, I think it is enough for current scene, because a netdev's name is limited to IFNAMSIZ which could not hold 8 vlan id, and the vlan dev would not meet the same class key with its real dev. The new function vlan_dev_get_lockdep_subkey() will return the subkey and make the vlan dev could get a suitable class key. v2->v3: According David's suggestion, I use the subclass to distinguish the lock key for vlan dev and its real dev, but it make no sense, because the difference for subclass in the lock_class_key doesn't mean that the difference class for lock_key, so I use lock_depth to distinguish the different depth for every vlan dev, the same depth of the vlan dev could have the same lock_class_key, I import the MAX_LOCK_DEPTH from the include/linux/sched.h, I think it is enough here, the lockdep should never exceed that value. v3->v4: Add a huge array of locking keys will waste static kernel memory and is not a appropriate method, we could use _nested() variants to fix the problem, calculate the depth for every vlan dev, and use the depth as the subclass for addr_lock_key. Signed-off-by:
Ding Tianhong <dingtianhong@huawei.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Nicolas Dichtel authored
[ Upstream commit 54d63f78 ] It's possible to remove the FB tunnel with the command 'ip link del ip6gre0' but this is unsafe, the module always supposes that this device exists. For example, ip6gre_tunnel_lookup() may use it unconditionally. Let's add a rtnl handler for dellink, which will never remove the FB tunnel (we let ip6gre_destroy_tunnels() do the job). Introduced by commit c12b395a ("gre: Support GRE over IPv6"). CC: Dmitry Kozlov <xeb@mail.ru> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Mathias Krause authored
[ Upstream commit 05ab8f26 ] The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check for a minimal message length before testing the supplied offset to be within the bounds of the message. This allows the subtraction of the nla header to underflow and therefore -- as the data type is unsigned -- allowing far to big offset and length values for the search of the netlink attribute. The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is also wrong. It has the minuend and subtrahend mixed up, therefore calculates a huge length value, allowing to overrun the end of the message while looking for the netlink attribute. The following three BPF snippets will trigger the bugs when attached to a UNIX datagram socket and parsing a message with length 1, 2 or 3. ,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]-- | ld #0x87654321 | ldx #42 | ld #nla | ret a `--- ,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]-- | ld #0x87654321 | ldx #42 | ld #nlan | ret a `--- ,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]-- | ; (needs a fake netlink header at offset 0) | ld #0 | ldx #42 | ld #nlan | ret a `--- Fix the first issue by ensuring the message length fulfills the minimal size constrains of a nla header. Fix the second bug by getting the math for the remainder calculation right. Fixes: 4738c1db ("[SKFILTER]: Add SKF_ADF_NLATTR instruction") Fixes: d214c753 ("filter: add SKF_AD_NLATTR_NEST to look for nested..") Cc: Patrick McHardy <kaber@trash.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Mathias Krause <minipli@googlemail.com> Acked-by:
Daniel Borkmann <dborkman@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Julian Anastasov authored
[ Upstream commit 91146153 ] Extend commit 13378cad ("ipv4: Change rt->rt_iif encoding.") from 3.6 to return valid RTA_IIF on 'ip route get ... iif DEVICE' instead of rt_iif 0 which is displayed as 'iif *'. inet_iif is not appropriate to use because skb_iif is not set. Use the skb->dev->ifindex instead. Signed-off-by:
Julian Anastasov <ja@ssi.bg> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Wang, Xiaoming authored
[ Upstream commit b04c4619 ] Plug a group_info refcount leak in ping_init. group_info is only needed during initialization and the code failed to release the reference on exit. While here move grabbing the reference to a place where it is actually needed. Signed-off-by:
Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by:
Zhang Dongxing <dongxing.zhang@intel.com> Signed-off-by:
xiaoming wang <xiaoming.wang@intel.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
- 13 Jun, 2014 15 commits
-
-
Eric Dumazet authored
[ Upstream commit 30f78d8e ] Francois reported that setting big mtu on loopback device could prevent tcp sessions making progress. We do not support (yet ?) IPv6 Jumbograms and cook corrupted packets. We must limit the IPv6 MTU to (65535 + 40) bytes in theory. Tested: ifconfig lo mtu 70000 netperf -H ::1 Before patch : Throughput : 0.05 Mbits After patch : Throughput : 35484 Mbits Reported-by:
Francois WELLENREITER <f.wellenreiter@gmail.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Acked-by:
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Thomas Richter authored
[ Upstream commit db298686 ] Remove the bonding debug_fs entries when the module initialization fails. The debug_fs entries should be removed together with all other already allocated resources. Signed-off-by:
Thomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by:
Jay Vosburgh <j.vosburgh@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Florian Westphal authored
[ Upstream commit 6d39d589 ] In case of tcp, gso_size contains the tcpmss. For UFO (udp fragmentation offloading) skbs, gso_size is the fragment payload size, i.e. we must not account for udp header size. Otherwise, when using virtio drivers, a to-be-forwarded UFO GSO packet will be needlessly fragmented in the forward path, because we think its individual segments are too large for the outgoing link. Fixes: fe6cc55f ("net: ip, ipv6: handle gso skbs in forwarding path") Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by:
Tobias Brunner <tobias@strongswan.org> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Dmitry Petukhov authored
[ Upstream commit f34c4a35 ] When l2tp driver tries to get PMTU for the tunnel destination, it uses the pointer to struct sock that represents PPPoX socket, while it should use the pointer that represents UDP socket of the tunnel. Signed-off-by:
Dmitry Petukhov <dmgenp@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Daniel Borkmann authored
[ Upstream commit 1e1cdf8a ] In function sctp_wake_up_waiters(), we need to involve a test if the association is declared dead. If so, we don't have any reference to a possible sibling association anymore and need to invoke sctp_write_space() instead, and normally walk the socket's associations and notify them of new wmem space. The reason for special casing is that otherwise, we could run into the following issue when a sctp_primitive_SEND() call from sctp_sendmsg() fails, and tries to flush an association's outq, i.e. in the following way: sctp_association_free() `-> list_del(&asoc->asocs) <-- poisons list pointer asoc->base.dead = true sctp_outq_free(&asoc->outqueue) `-> __sctp_outq_teardown() `-> sctp_chunk_free() `-> consume_skb() `-> sctp_wfree() `-> sctp_wake_up_waiters() <-- dereferences poisoned pointers if asoc->ep->sndbuf_policy=0 Therefore, only walk the list in an 'optimized' way if we find that the current association is still active. We could also use list_del_init() in addition when we call sctp_association_free(), but as Vlad suggests, we want to trap such bugs and thus leave it poisoned as is. Why is it safe to resolve the issue by testing for asoc->base.dead? Parallel calls to sctp_sendmsg() are protected under socket lock, that is lock_sock()/release_sock(). Only within that path under lock held, we're setting skb/chunk owner via sctp_set_owner_w(). Eventually, chunks are freed directly by an association still under that lock. So when traversing association list on destruction time from sctp_wake_up_waiters() via sctp_wfree(), a different CPU can't be running sctp_wfree() while another one calls sctp_association_free() as both happens under the same lock. Therefore, this can also not race with setting/testing against asoc->base.dead as we are guaranteed for this to happen in order, under lock. Further, Vlad says: the times we check asoc->base.dead is when we've cached an association pointer for later processing. In between cache and processing, the association may have been freed and is simply still around due to reference counts. We check asoc->base.dead under a lock, so it should always be safe to check and not race against sctp_association_free(). Stress-testing seems fine now, too. Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Daniel Borkmann authored
[ Upstream commit 52c35bef ] SCTP charges chunks for wmem accounting via skb->truesize in sctp_set_owner_w(), and sctp_wfree() respectively as the reverse operation. If a sender runs out of wmem, it needs to wait via sctp_wait_for_sndbuf(), and gets woken up by a call to __sctp_write_space() mostly via sctp_wfree(). __sctp_write_space() is being called per association. Although we assign sk->sk_write_space() to sctp_write_space(), which is then being done per socket, it is only used if send space is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE is set and therefore not invoked in sock_wfree(). Commit 4c3a5bda ("sctp: Don't charge for data in sndbuf again when transmitting packet") fixed an issue where in case sctp_packet_transmit() manages to queue up more than sndbuf bytes, sctp_wait_for_sndbuf() will never be woken up again unless it is interrupted by a signal. However, a still remaining issue is that if net.sctp.sndbuf_policy=0, that is accounting per socket, and one-to-many sockets are in use, the reclaimed write space from sctp_wfree() is 'unfairly' handed back on the server to the association that is the lucky one to be woken up again via __sctp_write_space(), while the remaining associations are never be woken up again (unless by a signal). The effect disappears with net.sctp.sndbuf_policy=1, that is wmem accounting per association, as it guarantees a fair share of wmem among associations. Therefore, if we have reclaimed memory in case of per socket accounting, wake all related associations to a socket in a fair manner, that is, traverse the socket association list starting from the current neighbour of the association and issue a __sctp_write_space() to everyone until we end up waking ourselves. This guarantees that no association is preferred over another and even if more associations are taken into the one-to-many session, all receivers will get messages from the server and are not stalled forever on high load. This setting still leaves the advantage of per socket accounting in touch as an association can still use up global limits if unused by others. Fixes: 4eb701df ("[SCTP] Fix SCTP sendbuffer accouting.") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Oleg Nesterov authored
[ Upstream commit 008208c6 ] Add two trivial helpers list_next_entry() and list_prev_entry(), they can have a lot of users including list.h itself. In fact the 1st one is already defined in events/core.c and bnx2x_sp.c, so the patch simply moves the definition to list.h. Signed-off-by:
Oleg Nesterov <oleg@redhat.com> Cc: Eilon Greenstein <eilong@broadcom.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit 34f972d6 upstream. A number of older CMOTech modems are based on Qualcomm chips. The blacklisted interfaces are QMI/wwan. Reported-by:
Lars Melin <larsm17@gmail.com> Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit dd6b48ec upstream. Device interface layout: 0: ff/ff/ff - serial 1: ff/00/00 - serial AT+PPP 2: ff/ff/ff - QMI/wwan 3: 08/06/50 - storage Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit 533b3994 upstream. Device interface layout: 0: ff/ff/ff - serial 1: ff/ff/ff - serial AT+PPP 2: 08/06/50 - storage 3: ff/ff/ff - serial 4: ff/ff/ff - QMI/wwan Reported-by:
Julio Araujo <julio.araujo@wllctel.com.br> Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit bce4f588 upstream. Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit 70a3615f upstream. Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Bjørn Mork authored
commit a00986f8 upstream. Signed-off-by:
Bjørn Mork <bjorn@mork.no> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Johan Hovold authored
commit 5509076d upstream. During firmware download the device expects memory addresses in big-endian byte order. As the wIndex parameter which hold the address is sent in little-endian byte order regardless of host byte order, we need to use swab16 rather than cpu_to_be16. Also make sure to handle the struct ti_i2c_desc size parameter which is returned in little-endian byte order. Reported-by:
Ludovic Drolez <ldrolez@debian.org> Tested-by:
Ludovic Drolez <ldrolez@debian.org> Signed-off-by:
Johan Hovold <jhovold@gmail.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Denis Turischev authored
commit c09ec25d upstream. The same issue like with Panther Point chipsets. If the USB ports are switched to xHCI on shutdown, the xHCI host will send a spurious interrupt, which will wake the system. Some BIOS have work around for this, but not all. One example is Compulab's mini-desktop, the Intense-PC2. The bug can be avoided if the USB ports are switched back to EHCI on shutdown. This patch should be backported to stable kernels as old as 3.12, that contain the commit 638298dc "xhci: Fix spurious wakeups after S5 on Haswell" Signed-off-by:
Denis Turischev <denis@compulab.co.il> Signed-off-by:
Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
- 12 Jun, 2014 13 commits
-
-
Miao Xie authored
commit 1c70d8fb upstream. Currently, with inode cache enabled, we will reuse its inode id immediately after unlinking file, we may hit something like following: |->iput inode |->return inode id into inode cache |->create dir,fsync |->power off An easy way to reproduce this problem is: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt -o inode_cache,commit=100 dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync inode_id=`ls -i /mnt/data | awk '{print $1}'` rm -f /mnt/data i=1 while [ 1 ] do mkdir /mnt/dir_$i test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'` if [ $test1 -eq $inode_id ] then dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync echo b > /proc/sysrq-trigger fi sleep 1 i=$(($i+1)) done mount /dev/sdb /mnt umount /dev/sdb btrfs check /dev/sdb We fix this problem by adding unlinked inode's id into pinned tree, and we can not reuse them until committing transaction. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by:
Chris Mason <clm@fb.com> [ kamal: backport to 3.8-stable: context ] Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Johan Hovold authored
commit 10164c2a upstream. Fix driver new_id sysfs-attribute removal deadlock by making sure to not hold any locks that the attribute operations grab when removing the attribute. Specifically, usb_serial_deregister holds the table mutex when deregistering the driver, which includes removing the new_id attribute. This can lead to a deadlock as writing to new_id increments the attribute's active count before trying to grab the same mutex in usb_serial_probe. The deadlock can easily be triggered by inserting a sleep in usb_serial_deregister and writing the id of an unbound device to new_id during module unload. As the table mutex (in this case) is used to prevent subdriver unload during probe, it should be sufficient to only hold the lock while manipulating the usb-serial driver list during deregister. A racing probe will then either fail to find a matching subdriver or fail to get the corresponding module reference. Since v3.15-rc1 this also triggers the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc2 #123 Tainted: G W ------------------------------------------------------- modprobe/190 is trying to acquire lock: (s_active#4){++++.+}, at: [<c0167aa0>] kernfs_remove_by_name_ns+0x4c/0x94 but task is already holding lock: (table_lock){+.+.+.}, at: [<bf004d84>] usb_serial_deregister+0x3c/0x78 [usbserial] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (table_lock){+.+.+.}: [<c0075f84>] __lock_acquire+0x1694/0x1ce4 [<c0076de8>] lock_acquire+0xb4/0x154 [<c03af3cc>] _raw_spin_lock+0x4c/0x5c [<c02bbc24>] usb_store_new_id+0x14c/0x1ac [<bf007eb4>] new_id_store+0x68/0x70 [usbserial] [<c025f568>] drv_attr_store+0x30/0x3c [<c01690e0>] sysfs_kf_write+0x5c/0x60 [<c01682c0>] kernfs_fop_write+0xd4/0x194 [<c010881c>] vfs_write+0xbc/0x198 [<c0108e4c>] SyS_write+0x4c/0xa0 [<c000f880>] ret_fast_syscall+0x0/0x48 -> #0 (s_active#4){++++.+}: [<c03a7a28>] print_circular_bug+0x68/0x2f8 [<c0076218>] __lock_acquire+0x1928/0x1ce4 [<c0076de8>] lock_acquire+0xb4/0x154 [<c0166b70>] __kernfs_remove+0x254/0x310 [<c0167aa0>] kernfs_remove_by_name_ns+0x4c/0x94 [<c0169fb8>] remove_files.isra.1+0x48/0x84 [<c016a2fc>] sysfs_remove_group+0x58/0xac [<c016a414>] sysfs_remove_groups+0x34/0x44 [<c02623b8>] driver_remove_groups+0x1c/0x20 [<c0260e9c>] bus_remove_driver+0x3c/0xe4 [<c026235c>] driver_unregister+0x38/0x58 [<bf007fb4>] usb_serial_bus_deregister+0x84/0x88 [usbserial] [<bf004db4>] usb_serial_deregister+0x6c/0x78 [usbserial] [<bf005330>] usb_serial_deregister_drivers+0x2c/0x4c [usbserial] [<bf016618>] usb_serial_module_exit+0x14/0x1c [sierra] [<c009d6cc>] SyS_delete_module+0x184/0x210 [<c000f880>] ret_fast_syscall+0x0/0x48 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(table_lock); lock(s_active#4); lock(table_lock); lock(s_active#4); *** DEADLOCK *** 1 lock held by modprobe/190: #0: (table_lock){+.+.+.}, at: [<bf004d84>] usb_serial_deregister+0x3c/0x78 [usbserial] stack backtrace: CPU: 0 PID: 190 Comm: modprobe Tainted: G W 3.15.0-rc2 #123 [<c0015e10>] (unwind_backtrace) from [<c0013728>] (show_stack+0x20/0x24) [<c0013728>] (show_stack) from [<c03a9a54>] (dump_stack+0x24/0x28) [<c03a9a54>] (dump_stack) from [<c03a7cac>] (print_circular_bug+0x2ec/0x2f8) [<c03a7cac>] (print_circular_bug) from [<c0076218>] (__lock_acquire+0x1928/0x1ce4) [<c0076218>] (__lock_acquire) from [<c0076de8>] (lock_acquire+0xb4/0x154) [<c0076de8>] (lock_acquire) from [<c0166b70>] (__kernfs_remove+0x254/0x310) [<c0166b70>] (__kernfs_remove) from [<c0167aa0>] (kernfs_remove_by_name_ns+0x4c/0x94) [<c0167aa0>] (kernfs_remove_by_name_ns) from [<c0169fb8>] (remove_files.isra.1+0x48/0x84) [<c0169fb8>] (remove_files.isra.1) from [<c016a2fc>] (sysfs_remove_group+0x58/0xac) [<c016a2fc>] (sysfs_remove_group) from [<c016a414>] (sysfs_remove_groups+0x34/0x44) [<c016a414>] (sysfs_remove_groups) from [<c02623b8>] (driver_remove_groups+0x1c/0x20) [<c02623b8>] (driver_remove_groups) from [<c0260e9c>] (bus_remove_driver+0x3c/0xe4) [<c0260e9c>] (bus_remove_driver) from [<c026235c>] (driver_unregister+0x38/0x58) [<c026235c>] (driver_unregister) from [<bf007fb4>] (usb_serial_bus_deregister+0x84/0x88 [usbserial]) [<bf007fb4>] (usb_serial_bus_deregister [usbserial]) from [<bf004db4>] (usb_serial_deregister+0x6c/0x78 [usbserial]) [<bf004db4>] (usb_serial_deregister [usbserial]) from [<bf005330>] (usb_serial_deregister_drivers+0x2c/0x4c [usbserial]) [<bf005330>] (usb_serial_deregister_drivers [usbserial]) from [<bf016618>] (usb_serial_module_exit+0x14/0x1c [sierra]) [<bf016618>] (usb_serial_module_exit [sierra]) from [<c009d6cc>] (SyS_delete_module+0x184/0x210) [<c009d6cc>] (SyS_delete_module) from [<c000f880>] (ret_fast_syscall+0x0/0x48) Signed-off-by:
Johan Hovold <jhovold@gmail.com> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Liu Hua authored
commit 56b700fd upstream. For vmcore generated by LPAE enabled kernel, user space utility such as crash needs additional infomation to parse. So this patch add arch_crash_save_vmcoreinfo as what PAE enabled i386 linux does. Reviewed-by:
Will Deacon <will.deacon@arm.com> Signed-off-by:
Liu Hua <sdu.liu@huawei.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Xiangyu Lu authored
commit 80bb3ef1 upstream. In big-endian systems, "%1" get the most significant part of the value, cause the instruction to get the wrong result. When viewing ftrace record in big-endian ARM systems, we found that the timestamp errors: swapper-0 [001] 1325.970000: 0:120:R ==> [001] 16:120:R events/1 events/1-16 [001] 1325.970000: 16:120:S ==> [001] 0:120:R swapper swapper-0 [000] 1325.1000000: 0:120:R + [000] 15:120:R events/0 swapper-0 [000] 1325.1000000: 0:120:R ==> [000] 15:120:R events/0 swapper-0 [000] 1326.030000: 0:120:R + [000] 1150:120:R sshd swapper-0 [000] 1326.030000: 0:120:R ==> [000] 1150:120:R sshd When viewed ftrace records, it will call the do_div(n, base) function, which achieved arch/arm/include/asm/div64.h in. When n = 10000000, base = 1000000, in do_div(n, base) will execute "umull %Q0, %R0, %1, %Q2". Reviewed-by:
Dave Martin <Dave.Martin@arm.com> Reviewed-by:
Nicolas Pitre <nico@linaro.org> Signed-off-by:
Alex Wu <wuquanming@huawei.com> Signed-off-by:
Xiangyu Lu <luxiangyu@huawei.com> Signed-off-by:
Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Linus Torvalds authored
commit 1b17844b upstream. fixup_user_fault() is used by the futex code when the direct user access fails, and the futex code wants it to either map in the page in a usable form or return an error. It relied on handle_mm_fault() to map the page, and correctly checked the error return from that, but while that does map the page, it doesn't actually guarantee that the page will be mapped with sufficient permissions to be then accessed. So do the appropriate tests of the vma access rights by hand. [ Side note: arguably handle_mm_fault() could just do that itself, but we have traditionally done it in the caller, because some callers - notably get_user_pages() - have been able to access pages even when they are mapped with PROT_NONE. Maybe we should re-visit that design decision, but in the meantime this is the minimal patch. ] Found by Dave Jones running his trinity tool. Reported-by:
Dave Jones <davej@redhat.com> Acked-by:
Hugh Dickins <hughd@google.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Alex Deucher authored
commit e9a4099a upstream. Some newer PX laptops have the pci device class set to DISPLAY_OTHER rather than DISPLAY_VGA. This properly detects ATPX on those laptops. Based on a patch from: Pali Rohár <pali.rohar@gmail.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com> Cc: airlied@gmail.com Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Hans de Goede authored
commit 46a2986e upstream. We expect that all the Haswell series will need such quirks, sigh. The T431s seems to be T430 hardware in a T440s case, using the T440s touchpad, with the same min/max issue. The X1 Carbon 3rd generation name says 2nd while it is a 3rd generation. The X1 and T431s share a PnPID with the T540p, but the reported ranges are closer to those of the T440s. HdG: Squashed 5 quirk patches into one. T431s + L440 + L540 are written by me, S1 Yoga and X1 are written by Benjamin Tissoires. Hdg: Standardized S1 Yoga and X1 values, Yoga uses the same touchpad as the X240, X1 uses the same touchpad as the T440. Signed-off-by:
Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by:
Hans de Goede <hdegoede@redhat.com> Signed-off-by:
Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Dan Williams authored
commit 8a4aeec8 upstream. The AHCI spec allows implementations to issue commands in tag order rather than FIFO order: 5.3.2.12 P:SelectCmd HBA sets pSlotLoc = (pSlotLoc + 1) mod (CAP.NCS + 1) or HBA selects the command to issue that has had the PxCI bit set to '1' longer than any other command pending to be issued. The result is that commands posted sequentially (time-wise) may play out of sequence when issued by hardware. This behavior has likely been hidden by drives that arrange for commands to complete in issue order. However, it appears recent drives (two from different vendors that we have found so far) inflict out-of-order completions as a matter of course. So, we need to take care to maintain ordered submission, otherwise we risk triggering a drive to fall out of sequential-io automation and back to random-io processing, which incurs large latency and degrades throughput. This issue was found in simple benchmarks where QD=2 seq-write performance was 30-50% *greater* than QD=32 seq-write performance. Tagging for -stable and making the change globally since it has a low risk-to-reward ratio. Also, word is that recent versions of an unnamed OS also does it this way now. So, drives in the field are already experienced with this tag ordering scheme. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ed Ciechanowski <ed.ciechanowski@intel.com> Reviewed-by:
Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Tejun Heo <tj@kernel.org> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Jeff Layton authored
commit 3758cf7e upstream. ...otherwise the logic in the timeout handling doesn't work correctly. Spotted-by:
Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
J. Bruce Fields <bfields@redhat.com> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Thomas Gleixner authored
commit 01f8fa4f upstream. The current implementation of irq_set_affinity() refuses rightfully to route an interrupt to an offline cpu. But there is a special case, where this is actually desired. Some of the ARM SoCs have per cpu timers which require setting the affinity during cpu startup where the cpu is not yet in the online mask. If we can't do that, then the local timer interrupt for the about to become online cpu is routed to some random online cpu. The developers of the affected machines tried to work around that issue, but that results in a massive mess in that timer code. We have a yet unused argument in the set_affinity callbacks of the irq chips, which I added back then for a similar reason. It was never required so it got not used. But I'm happy that I never removed it. That allows us to implement a sane handling of the above scenario. So the affected SoC drivers can add the required force handling to their interrupt chip, switch the timer code to irq_force_affinity() and things just work. This does not affect any existing user of irq_set_affinity(). Tagged for stable to allow a simple fix of the affected SoC clock event drivers. Reported-and-tested-by:
Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Tomasz Figa <t.figa@samsung.com>, Cc: Daniel Lezcano <daniel.lezcano@linaro.org>, Cc: Kukjin Kim <kgene.kim@samsung.com> Cc: linux-arm-kernel@lists.infradead.org, Link: http://lkml.kernel.org/r/20140416143315.717251504@linutronix.deSigned-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Jeff Layton authored
commit f1c6bb2c upstream. A fl->fl_break_time of 0 has a special meaning to the lease break code that basically means "never break the lease". knfsd uses this to ensure that leases don't disappear out from under it. Unfortunately, the code in __break_lease can end up passing this value to wait_event_interruptible as a timeout, which prevents it from going to sleep at all. This makes __break_lease to spin in a tight loop and causes soft lockups. Fix this by ensuring that we pass a minimum value of 1 as a timeout instead. Cc: J. Bruce Fields <bfields@fieldses.org> Reported-by:
Terry Barnaby <terry1@beam.ltd.uk> Signed-off-by:
Jeff Layton <jlayton@redhat.com> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Theodore Ts'o authored
commit 6e6358fc upstream. We haven't taken i_mutex yet, so we need to use i_size_read(). Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-
Jan Kara authored
commit ec4cb1aa upstream. When heavily exercising xattr code the assertion that jbd2_journal_dirty_metadata() shouldn't return error was triggered: WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237 jbd2_journal_dirty_metadata+0x1ba/0x260() CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G W 3.10.0-ceph-00049-g68d04c9 #1 Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978 Call Trace: [<ffffffff816311b0>] dump_stack+0x19/0x1b [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0 [<ffffffff8119421e>] setxattr+0x13e/0x1e0 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b The reason for the warning is that buffer_head passed into jbd2_journal_dirty_metadata() didn't have journal_head attached. This is caused by the following race of two ext4_xattr_release_block() calls: CPU1 CPU2 ext4_xattr_release_block() ext4_xattr_release_block() lock_buffer(bh); /* False */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) } else { le32_add_cpu(&BHDR(bh)->h_refcount, -1); unlock_buffer(bh); lock_buffer(bh); /* True */ if (BHDR(bh)->h_refcount == cpu_to_le32(1)) get_bh(bh); ext4_free_blocks() ... jbd2_journal_forget() jbd2_journal_unfile_buffer() -> JH is gone error = ext4_handle_dirty_xattr_block(handle, inode, bh); -> triggers the warning We fix the problem by moving ext4_handle_dirty_xattr_block() under the buffer lock. Sadly this cannot be done in nojournal mode as that function can call sync_dirty_buffer() which would deadlock. Luckily in nojournal mode the race is harmless (we only dirty already freed buffer) and thus for nojournal mode we leave the dirtying outside of the buffer lock. Reported-by:
Sage Weil <sage@inktank.com> Signed-off-by:
Jan Kara <jack@suse.cz> Signed-off-by:
"Theodore Ts'o" <tytso@mit.edu> Signed-off-by:
Kamal Mostafa <kamal@canonical.com>
-