
Intel e1000, e1000e eth0 resetting/hang workarounds

by admin on January 20th, 2015

I got several interface resets when traffic was high (>500 Mbps) or when a large number of packets was sent through the interface.

You can see errors like the ones below in the logs, and the interface will be down for a couple of seconds each time.

#######################################################

Jan 20 13:02:56 R2 kernel: [65563.040441] e1000e 0000:0c:00.0: eth0: Detected Hardware Unit Hang:

Jan 20 13:02:56 R2 kernel: [65563.040442] TDH <16>
Jan 20 13:02:56 R2 kernel: [65563.040443] TDT <7a>
Jan 20 13:02:56 R2 kernel: [65563.040444] next_to_use <7a>
Jan 20 13:02:56 R2 kernel: [65563.040444] next_to_clean <15>
Jan 20 13:02:56 R2 kernel: [65563.040445] buffer_info[next_to_clean]:
Jan 20 13:02:56 R2 kernel: [65563.040445] time_stamp <100639422>
Jan 20 13:02:56 R2 kernel: [65563.040446] next_to_watch <19>
Jan 20 13:02:56 R2 kernel: [65563.040447] jiffies <100639560>
Jan 20 13:02:56 R2 kernel: [65563.040447] next_to_watch.status <0>
Jan 20 13:02:56 R2 kernel: [65563.040448] MAC Status <80383>
Jan 20 13:02:56 R2 kernel: [65563.040449] PHY Status <792d>
Jan 20 13:02:56 R2 kernel: [65563.040449] PHY 1000BASE-T Status <3c00>
Jan 20 13:02:56 R2 kernel: [65563.040450] PHY Extended Status <3000>
Jan 20 13:02:56 R2 kernel: [65563.040450] PCI Status <10>
Jan 20 13:03:00 R2 kernel: [65567.040385] e1000e 0000:0c:00.0: eth0: Detected Hardware Unit Hang:
Jan 20 13:03:00 R2 kernel: [65567.040386] TDH <16>
Jan 20 13:03:00 R2 kernel: [65567.040387] TDT <7a>
Jan 20 13:03:00 R2 kernel: [65567.040387] next_to_use <7a>
Jan 20 13:03:00 R2 kernel: [65567.040388] next_to_clean <15>
Jan 20 13:03:00 R2 kernel: [65567.040388] buffer_info[next_to_clean]:
Jan 20 13:03:00 R2 kernel: [65567.040389] time_stamp <100639422>
Jan 20 13:03:00 R2 kernel: [65567.040390] next_to_watch <19>
Jan 20 13:03:00 R2 kernel: [65567.040390] jiffies <1006396f0>
Jan 20 13:03:00 R2 kernel: [65567.040391] next_to_watch.status <0>
Jan 20 13:03:00 R2 kernel: [65567.040391] MAC Status <80383>
Jan 20 13:03:00 R2 kernel: [65567.040392] PHY Status <792d>
Jan 20 13:03:00 R2 kernel: [65567.040392] PHY 1000BASE-T Status <3c00>
Jan 20 13:03:00 R2 kernel: [65567.040393] PHY Extended Status <3000>
Jan 20 13:03:00 R2 kernel: [65567.040394] PCI Status <10>
Jan 20 13:03:00 R2 kernel: [65567.779227] ------------[ cut here ]------------
Jan 20 13:03:00 R2 kernel: [65567.779235] WARNING: at net/sched/sch_generic.c:257 dev_watchdog+0xf3/0x191()
Jan 20 13:03:00 R2 kernel: [65567.779248] Modules linked in: 8021q garp stp llc iptable_nat ip6table_filter ip6table_raw ip6_tables iptable_filter xt_NOTRACK xt_CT nfnetlink_cthelper nfnetlink iptable_raw nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_h323 nf_conntrack_h323 nf_nat_sip nf_conntrack_sip nf_nat_proto_gre nf_nat_tftp nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_ftp nf_conntrack mperf cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative ipv6 evdev dcdbas psmouse serio_raw i5k_amb pcspkr i5000_edac rng_core edac_core processor shpchp button pci_hotplug thermal_sys battery usb_storage ohci_hcd squashfs loop ext4 jbd2 crc16 raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod usbhid hid ses enclosure pata_acpi ata_generic ata_piix megaraid_sas e1000e tg3 bnx2 [last unloaded: scsi_wait_scan]
Jan 20 13:03:00 R2 kernel: [65567.779459] Pid: 11, comm: kworker/0:1 Not tainted 3.3.8-1-amd64-vyatta #1
Jan 20 13:03:00 R2 kernel: [65567.779463] Call Trace:
Jan 20 13:03:00 R2 kernel: [65567.779467] <IRQ> [<ffffffff8103b3c2>] ? warn_slowpath_common+0x78/0x8c
Jan 20 13:03:00 R2 kernel: [65567.779481] [<ffffffff8103b474>] ? warn_slowpath_fmt+0x45/0x4a
Jan 20 13:03:00 R2 kernel: [65567.779490] [<ffffffff8137dffd>] ? _raw_spin_lock+0x5/0x8
Jan 20 13:03:00 R2 kernel: [65567.779496] [<ffffffff812fe963>] ? netif_tx_lock+0x67/0x7c
Jan 20 13:03:00 R2 kernel: [65567.779503] [<ffffffff812fea6b>] ? dev_watchdog+0xf3/0x191
Jan 20 13:03:00 R2 kernel: [65567.779510] [<ffffffff81047628>] ? run_timer_softirq+0x1a4/0x278
Jan 20 13:03:00 R2 kernel: [65567.779517] [<ffffffff812fe978>] ? netif_tx_lock+0x7c/0x7c
Jan 20 13:03:00 R2 kernel: [65567.779523] [<ffffffff81040a24>] ? __do_softirq+0xc4/0x1a0
Jan 20 13:03:00 R2 kernel: [65567.779530] [<ffffffff8107414b>] ? clockevents_program_event+0x99/0xb8
Jan 20 13:03:00 R2 kernel: [65567.779537] [<ffffffff8105869e>] ? hrtimer_interrupt+0x10e/0x19f
Jan 20 13:03:00 R2 kernel: [65567.779544] [<ffffffff8138001c>] ? call_softirq+0x1c/0x30
Jan 20 13:03:00 R2 kernel: [65567.779551] [<ffffffff8101102b>] ? do_softirq+0x3f/0x79
Jan 20 13:03:00 R2 kernel: [65567.779557] [<ffffffff810407f9>] ? irq_exit+0x44/0xb1
Jan 20 13:03:00 R2 kernel: [65567.779564] [<ffffffff81027b3f>] ? smp_apic_timer_interrupt+0x85/0x93
Jan 20 13:03:00 R2 kernel: [65567.779570] [<ffffffff8137f61e>] ? apic_timer_interrupt+0x6e/0x80
Jan 20 13:03:00 R2 kernel: [65567.779574] <EOI> [<ffffffff8137e01e>] ? _raw_spin_lock_irqsave+0x1e/0x26
Jan 20 13:03:00 R2 kernel: [65567.779584] [<ffffffff8103b51d>] ? __call_console_drivers+0x10/0x86
Jan 20 13:03:00 R2 kernel: [65567.779590] [<ffffffff8103c27a>] ? vprintk+0x355/0x37e
Jan 20 13:03:00 R2 kernel: [65567.779596] [<ffffffff8137c588>] ? printk+0x40/0x48
Jan 20 13:03:00 R2 kernel: [65567.779603] [<ffffffff81254147>] ? dev_printk+0x42/0x47
Jan 20 13:03:00 R2 kernel: [65567.779607] [<ffffffff812e5577>] ? netdev_err+0x51/0x56
Jan 20 13:03:00 R2 kernel: [65567.779618] [<ffffffffa0062f25>] ? e1000_print_hw_hang+0x1a5/0x1b4 [e1000e]
Jan 20 13:03:00 R2 kernel: [65567.779626] [<ffffffff81061fec>] ? finish_task_switch+0x4f/0xc8
Jan 20 13:03:00 R2 kernel: [65567.779636] [<ffffffffa0062d80>] ? e1000_maybe_stop_tx+0x8b/0x8b [e1000e]
Jan 20 13:03:00 R2 kernel: [65567.779643] [<ffffffff810514d9>] ? process_one_work+0x1c2/0x2d8
Jan 20 13:03:00 R2 kernel: [65567.779649] [<ffffffff81051721>] ? worker_thread+0x132/0x250
Jan 20 13:03:00 R2 kernel: [65567.779655] [<ffffffff810515ef>] ? process_one_work+0x2d8/0x2d8
Jan 20 13:03:00 R2 kernel: [65567.779661] [<ffffffff810515ef>] ? process_one_work+0x2d8/0x2d8
Jan 20 13:03:00 R2 kernel: [65567.779667] [<ffffffff81054db1>] ? kthread+0x81/0x89
Jan 20 13:03:00 R2 kernel: [65567.779673] [<ffffffff8137ff24>] ? kernel_thread_helper+0x4/0x10
Jan 20 13:03:00 R2 kernel: [65567.779680] [<ffffffff81054d30>] ? kthread_freezable_should_stop+0x4e/0x4e
Jan 20 13:03:00 R2 kernel: [65567.779686] [<ffffffff8137ff20>] ? gs_change+0x13/0x13
Jan 20 13:03:00 R2 kernel: [65567.779690] ---[ end trace 74029b34e2d5927c ]---
Jan 20 13:03:00 R2 kernel: [65567.780833] e1000e 0000:0c:00.0: eth0: Reset adapter

##########################################################
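To get a feel for how often this happens, the log can be scanned for the two messages shown above. This is only a small sketch: `count_e1000e_events` is a helper name I made up, and the log path in the example is an assumption that depends on your distribution.

```shell
# count_e1000e_events: hypothetical helper, not part of the driver or ethtool.
# Reads kernel log text on stdin and prints how many hang reports and
# adapter resets it contains.
count_e1000e_events() {
    awk '
        /Detected Hardware Unit Hang/ { hangs++ }
        /Reset adapter/               { resets++ }
        END { printf "hangs=%d resets=%d\n", hangs, resets }
    '
}

# Example (the log path is an assumption, adjust it for your system):
#   count_e1000e_events < /var/log/messages
```

Several hang reports can precede a single reset, as in the log above, so the two counters will usually differ.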

One workaround is to disable TSO on the interface (and possibly GSO and GRO as well), even though this costs some performance, because segmentation work moves from the NIC to the CPU.

To see the current eth0 offload settings (TSO = tcp-segmentation-offload):

#ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
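The full list is long on newer ethtool versions, so it can help to filter it down to the three offloads that matter here. A minimal sketch, where `offload_summary` is a hypothetical helper name of mine:

```shell
# offload_summary: hypothetical helper that filters `ethtool -k` output
# (read from stdin) down to the TSO, GSO and GRO lines.
offload_summary() {
    grep -E '^[[:space:]]*(tcp-segmentation|generic-segmentation|generic-receive)-offload:'
}

# Example:
#   ethtool -k eth0 | offload_summary
```

Run it before and after changing a setting to confirm that the change took effect.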

To turn TSO off:

#ethtool -K eth0 tso off
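If the hangs continue with TSO off, GSO and GRO can be switched off the same way. The wrapper below is only a sketch: `disable_offloads` is a hypothetical name, while `tso`, `gso` and `gro` are the short feature names that `ethtool -K` accepts. It needs root to actually change anything.

```shell
# disable_offloads: hypothetical wrapper around `ethtool -K`.
# Usage: disable_offloads IFACE FEATURE...
disable_offloads() {
    iface="$1"
    shift
    for feature in "$@"; do
        # Turn each requested feature off and stop at the first failure.
        if ethtool -K "$iface" "$feature" off; then
            echo "$iface: $feature off"
        else
            echo "$iface: could not change $feature" >&2
            return 1
        fi
    done
}

# Example: start with TSO only, add GSO and GRO if the hangs continue.
#   disable_offloads eth0 tso
#   disable_offloads eth0 tso gso gro
```

Settings changed with `ethtool -K` are lost on reboot, so the command has to be reapplied at boot through whatever mechanism your system uses to configure interfaces.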

 

 
