Restore Database Resulting in Invalid Opcode: 0000

Restore Database Resulting in Invalid Opcode: 0000

Jeremy Lemaire
This appears to be a bug specific to the Linux kernel running on AMD64 hardware.  Nevertheless, so far it is only reproducible when doing a LucidDB restore, so I thought I would throw it out there.  Any input would be greatly appreciated.  Here is the procedure to reproduce:

  • Do a full CALL SYS_ROOT.BACKUP_DATABASE on Server 1 (the compressed backup is about 135GB).
  • Drop all schemas and tables from Server 2.
  • Do a RESTORE_DATABASE_WITHOUT_CATALOG or a RESTORE_DATABASE on Server 2.
  • Within 5 minutes the terminal locks up on Server 2 and the following syslog message is seen:

 Message from syslogd@adsdw02 at Jun  8 15:33:47 ... 
 kernel:[18468439.611224] 
 
 ------------[ cut here ]------------

Message from syslogd@adsdw02 at Jun  8 15:33:47 ...
 kernel:[18468439.611316] invalid opcode: 0000 [1] SMP
  
Read from remote host adsdw02: Connection timed out
Connection to adsdw02 closed.

Unfortunately the only way I have found to recover is to reboot the server.  Here is more detailed information about my setup:

LucidDb Version: 
luciddb-bin-linux64-0.9.2 

Server 1:

OS: Debian GNU/Linux 5.0.4
Kernel: Linux adsdw01 2.6.26-2-amd64 #1 SMP Tue Jan 12 22:12:20 UTC 2010 x86_64 GNU/Linux


adsdw01
    description: Rack Mount Chassis
    product: PowerEdge R905
    vendor: Dell Inc.
    serial: 7LL7WL1
    width: 64 bits
    capabilities: smbios-2.5 dmi-2.5 vsyscall64 vsyscall32
    configuration: boot=normal chassis=rackmount uuid=44454C4C-4C00-104C-8037-B7C04F574C31
  *-core
       description: Motherboard
       product: 0K552T
       vendor: Dell Inc.
       physical id: 0
       version: A01
       serial: ..CN708219C8006V.
     *-firmware
          description: BIOS
          vendor: Dell Inc.
          physical id: 0
          version: 4.0.3 (05/29/2009)
          size: 64KiB
          capacity: 960KiB
          capabilities: isa pci pnp upgrade shadowing cdboot bootselect edd int13floppytoshiba int13floppy360 int13floppy1200 int13floppy720 int9keyboard int14serial int10video acpi usb biosbootspecification netboot
     *-cpu:0
          description: CPU
          product: Quad-Core AMD Opteron(tm) Processor 8374 HE
          vendor: Advanced Micro Devices [AMD]
          physical id: 400
          bus info: cpu@0
          version: Quad-Core AMD Opteron(tm) Processor 8374 HE
          slot: CPU1
          size: 2200MHz
          capacity: 2800MHz
          width: 64 bits
          clock: 1GHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 3dnowext 3dnow constant_tsc rep_good pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt

Server 2:

OS: Debian GNU/Linux 5.0.4
Kernel: Linux adsdw02 2.6.26-2-amd64 #1 SMP Tue Jan 12 22:12:20 UTC 2010 x86_64 GNU/Linux

adsdw02
    description: Rack Mount Chassis
    product: PowerEdge R905
    vendor: Dell Inc.
    serial: 6LL7WL1
    width: 64 bits
    capabilities: smbios-2.5 dmi-2.5 vsyscall64 vsyscall32
    configuration: boot=normal chassis=rackmount uuid=44454C4C-4C00-104C-8037-B6C04F574C31
  *-core
       description: Motherboard
       product: 0K552T
       vendor: Dell Inc.
       physical id: 0
       version: A01
       serial: ..CN708219C8001X.
     *-firmware
          description: BIOS
          vendor: Dell Inc.
          physical id: 0
          version: 4.0.3 (05/29/2009)
          size: 64KiB
          capacity: 960KiB
          capabilities: isa pci pnp upgrade shadowing cdboot bootselect edd int13floppytoshiba int13floppy360 int13floppy1200 int13floppy720 int9keyboard int14serial int10video acpi usb biosbootspecification netboot
     *-cpu:0
          description: CPU
          product: Quad-Core AMD Opteron(tm) Processor 8374 HE
          vendor: Advanced Micro Devices [AMD]
          physical id: 400
          bus info: cpu@0
          version: Quad-Core AMD Opteron(tm) Processor 8374 HE
          slot: CPU1
          size: 2200MHz
          capacity: 2800MHz
          width: 64 bits
          clock: 1GHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 3dnowext 3dnow constant_tsc rep_good pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
 

------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
lucky parental unit.  See the prize list and enter to win:
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
luciddb-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/luciddb-users
Re: Restore Database Resulting in Invalid Opcode: 0000

John Sichi
Administrator
Hmmm, googling for {kernel invalid opcode: 0000 [1] SMP} turns up a lot
of hits, so it's hard to say which of the various
hardware/driver/filesystem/access-patterns is implicated.  I guess
RESTORE (particularly from a compressed backup) puts a lot of stress on
the system for both CPU and I/O (with a combination of both buffered and
direct I/O, since we use buffered for reading the archive but direct for
writing the DB pages).

Is it possible for you to test the restore part on a machine with a
later kernel version to see if that fixes it?

If it's easy to repro on even recent kernels, maybe someone on lkml
would be interested?

They would probably ask you to poke around more in dmesg and /var/log to
see if you could turn up any more diagnostics; you might find something
interesting from that.
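
For anyone wanting to follow the dmesg and /var/log suggestion, here is a minimal sketch of a log scan. It is only an illustration: the function name `find_oops_lines` is made up, and apart from "cut here" and "invalid opcode" (which appear in the syslog excerpt in this thread) the marker strings are assumptions about what a fuller oops report would typically contain.

```python
import re

# Markers that commonly show up in a Linux kernel oops report. "cut here" and
# "invalid opcode" come from the syslog excerpt in this thread; the others are
# assumptions about what a fuller trace would usually include.
OOPS_MARKERS = re.compile(
    r"cut here|invalid opcode|kernel BUG at|Call Trace|RIP:|Oops"
)


def find_oops_lines(lines, context=2):
    """Return (line_number, text) pairs for every marker line, plus
    `context` lines on either side, in file order and without duplicates."""
    lines = [line.rstrip("\n") for line in lines]
    keep = set()
    for i, line in enumerate(lines):
        if OOPS_MARKERS.search(line):
            first = max(0, i - context)
            last = min(len(lines), i + context + 1)
            keep.update(range(first, last))
    return [(i + 1, lines[i]) for i in sorted(keep)]
```

It could be fed an open log file, for example `find_oops_lines(open("/var/log/kern.log", errors="replace"))`, though the log path is an assumption and varies with the syslog configuration. The lines around the markers (register dump, call trace) are usually what kernel developers ask for first.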

JVS

Re: Restore Database Resulting in Invalid Opcode: 0000

Jeremy Lemaire
Although googling for {kernel invalid opcode: 0000 [1] SMP} turns up a lot of hits, none of them appear to point to a specifically related patch.  I agree that a newer kernel is worth a try, though, so we are planning to upgrade.

In the meantime I was looking for a quick work-around.  When the lockup occurred I was trying to restore onto a machine where we had been experimenting with various schemas and query optimizations, so it was not entirely "clean".  I did drop all schemas and tables and did a deallocate in an attempt to create a "clean" system, but it had seen some use.  So I thought I would try a clean install.

After a clean install the restore succeeded.  The 135GB compressed full backup took 3.8 hours to restore without the catalog.  A subsequent 2.7GB compressed differential restore took about 5 minutes.  The fact that this works brings up some interesting questions.

What will happen if I make changes to the schema on server 2 and then try to restore from a backup of server 1?  Is the restore expected to fail, should I expect some weird combination of data from server 1 and server 2, or should I expect a nice clean copy of server 1 on server 2 regardless of any changes made to server 2 in between restores?
Re: Restore Database Resulting in Invalid Opcode: 0000

John Sichi
Administrator
haawker wrote:
> After a clean install the restore succeeded.  The 135GB compressed full
> backup took 3.8 hours to restore w\out catalog.  A subsequent 2.7GB
> compressed differential restore took about 5 minutes to restore.  The fact
> that this works brings up some interesting questions now.

That is pretty slow.  When we did the design, our goal was to have
backup close to the speed of tar czf db.dat, and restore close to tar
xzf.  The numbers I have from the tests we did then were 25s to do a full
restore of 1.5GB of gzip-compressed no-catalog data (with tar xzf
taking the same amount of time).  Your performance is something like 6x
worse.  Maybe we had really fast drives :)
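
The "6x" figure can be checked with a bit of arithmetic on the numbers quoted in this thread. The sketch below assumes 1 GB = 1000 MB and treats the compressed archive size as the amount of data processed; the helper name is made up for illustration.

```python
def throughput_mb_per_s(size_gb, seconds):
    """Average throughput in MB/s, taking 1 GB = 1000 MB (an assumption)."""
    return size_gb * 1000.0 / seconds


# Figures quoted in the thread (compressed archive sizes, wall-clock times).
full_restore = throughput_mb_per_s(135, 3.8 * 3600)  # 135GB in 3.8 hours
diff_restore = throughput_mb_per_s(2.7, 5 * 60)      # 2.7GB in about 5 minutes
reference = throughput_mb_per_s(1.5, 25)             # 1.5GB in 25 seconds

print(f"full restore:         {full_restore:5.1f} MB/s")   # 9.9 MB/s
print(f"differential restore: {diff_restore:5.1f} MB/s")   # 9.0 MB/s
print(f"reference test:       {reference:5.1f} MB/s")      # 60.0 MB/s
print(f"slowdown vs. reference: {reference / full_restore:.1f}x")  # 6.1x
```

Both restores land at roughly 9 to 10 MB/s of compressed input, against 60 MB/s in the reference test, which is where the factor of about 6 comes from. The "about 5 minutes" figure is approximate, so the differential number is only a rough one.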

Is it CPU-bound or I/O-bound?
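
One rough way to answer that on Linux is to sample the aggregate `cpu` line of /proc/stat twice while the restore is running and compare how the elapsed time splits between user/system and iowait. The sketch below is an illustration under that assumption; the function names are made up, and only the first seven counters of the line are used.

```python
import time

# First seven counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of elapsed CPU time per category between two samples."""
    delta = {name: after[name] - before[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * value / total for name, value in delta.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the breakdown."""
    def read():
        with open(path) as fh:
            return parse_cpu_line(fh.readline())
    before = read()
    time.sleep(interval)
    return cpu_breakdown(before, read())
```

Calling `sample()` during the restore and seeing a large `iowait` share would point towards the disks; a large `user` share would point towards decompression. On a multi-core machine a single busy thread is only a small fraction of the aggregate, so the numbers need to be read with the core count in mind.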

> What will happen if I make changes to the schema on server 2 and then try to
> restore from a backup of server 1?  Is the restore expected to fail, should
> I expect some weird combination of data from server1 and server 2, or should
> I expect a nice clean copy of server 1 on server 2 regardless of any changes
> made to server 2 in between restores?  

You should expect the last one (a nice clean copy of server 1).  We
overwrite anything that was already there.

JVS
