May 2009 11

In May last year I updated our CPAN installation as I normally do every 3 months or so. The routine will seem familiar to all Perl developers out there. On the CPAN shell just run install Bundle::* for most core modules, test, be happy. This one particular time it got a little more complicated than that. It was my own Perl Chernobyl – a simple procedure that should have taken 30 minutes became a system meltdown.

Flashback
Since 1999 my trust in CPAN has been unshakable, the Perl community itself has been one of the biggest quality assurance teams ever grouped together(perhaps rivaled only by Linux, but that was a few years after Perl was out there). What’s most impressive is a complete absence of micro-management by Larry Wall. Perl is an entirely autonomous slightly organized anarchic community of hackers that do what they do because they love Perl, not because a czarish group leader is spiking them with a fork.

Larry Wall’s object encapsulation principle, that you shouldn’t go into other people’s houses because you weren’t invited, not because the door is locked and the owner has a gun, prevails in the Perl community as well.

I was involved with Hibernate from early 2006 up until late 2007 in many projects. When I finished my work with Java I went looking for similar ORM alternatives in Perl, to try out on my own projects. Heck, even Dr. Dobbs had me featured as a Hibernate developer in Brazil…so I desperately needed to get back to Perl.

I tested Rose::DB(fast!) and Class::DBI without much success, not that I have anything against these great modules, just that I was missing something…and even I didn’t know exactly what it was.

 My homegrown can of wORMs
I also forgot the “lazyness” virtue for a bit and tried my own solution, which unfortunately ended up on a production system. My custom ORM was becoming a monster, its shadow pending over me scared me to heck and just the thought of maintaining my own ORM code freaked me out. It was dirty, clever code, full of hacks, cryptic variables, and too many specifics for the system it was running under(a large tourism system that grouped years and years of heterogenous data), so I couldn’t release it as a CPAN module to share my development nor reuse it on my other projects. All in all, the lesson was taken: lazyness IS a programmer virtue. Cultivate it. Time to find a Perl ORM on CPAN.

So…to make a long story short, fast forward and for over a year now I’ve been in love with the DBIx::Class ORM framework.

It doesn’t feature “too much magic”, it lets you hack some SQL in there when you need it, it reuses other CPAN modules for the SQL generation and is overall a very granular, very well architected system. All this without losing the Perl attitude, it feels like Perl and it just works, period.

I went full throttle with DBIx::Class, or DBIC as they call it. Since I couldn’t find any books on the subject(are there any yet?) I read all I could read about it online. The DBIC documentation has matured with time, as did my own projects using the modules. Slowly I’ve been migrating thousands and thousands of lines of code from my older projects to DBIC code. Right now I’ve developed this natural feel for DBIC. It’s consistent, works a dandy. The DBIC community is alive and active, giving support and kindly answering all kinds of questions on the dbix-class@lists.scsys.co.uk mailing list.

Back to May 2008 and my CPAN upgrade
So, as I settled with DBIx::Class, I got used to upgrading the system every so often. DBIC is still under 1.0, so it’s changing constantly. I kept my routine upgrades blindly, religiously, happening until May 2008.

That’s when an otherwise tranquil CPAN upgrade went sour. It was the first time a CPAN module in release status broke my system.

After the CPAN ritual was over everything seemed fine until screens went blank on my browser. No data was returned from my HTTP requests. I went over to check the Apache server logs and this is the war zone I found:
[Tue May 27 06:01:16 2008] [notice] child pid 14013 exit signal Segmentation fault (11)
[Tue May 27 06:01:47 2008] [notice] child pid 13698 exit signal Segmentation fault (11)
[Tue May 27 06:02:04 2008] [notice] child pid 13678 exit signal Segmentation fault (11)
[Tue May 27 06:56:07 2008] [notice] child pid 13999 exit signal Segmentation fault (11)
[Tue May 27 06:57:28 2008] [notice] child pid 14000 exit signal Segmentation fault (11)
[Tue May 27 06:58:10 2008] [notice] child pid 14004 exit signal Segmentation fault (11)
[Tue May 27 06:59:48 2008] [notice] child pid 14008 exit signal Segmentation fault (11)
[Tue May 27 07:00:17 2008] [notice] child pid 14002 exit signal Segmentation fault (11)
(Actually, this sample of log entries was taken at the NEXT DAY from the upgrade…. I had been getting my butt kicked by this problem since the day before…if you got this far, then read on.)

I suspected my application first, went over several pieces of it and couldn’t replicate the problem. Sometimes it worked, sometimes it didn’t. Worst kind of bug…these have a way to freak our mental logic unit(worse, it blows the otherwise slo-blo patience fuse).

I added a forced DBI reconnect to every page and the problem seemed solved, until it happened yet again. Blank page, segmentation fault. The absence of some proper error messages on the main httpd log meant it crashed before the logging phase of the Apache request loop.

This couldn’t be good, but I hadn’t realised that much at this point.

At first I was sure it was the Linux instance of DLL hell …I thought maybe my latest DBD::mysql linked to some conflicting libmysqlclient version. Maybe I forgot to run ldconfig… But then ldconfig didn’t work… Slowly I tried all the simpler tricks, none worked.

So I recompiled MySQL. It didn’t matter really, it wasn’t a critical MySQL host. It was, though, the main Apache server front, and it was down… Pressure everywhere, customers, everyone asking when the system would be back up…and there I was recompiling Apache, MySQL and other 20 or so libraries, hoping the impedance would match at some point and everything would just work again.

Each full system recompilation took over 20 minutes. It’s a custom Apache server wholly patched by yours truly to work as an optimized(hopefully, when it works) application server for Perl apps.

Was it PHP that was messing around with mod_perl? Tried it. Was it mod_ssl meddling with the Apache guts and knocking me down? Tried it. Nothing seemed to work.

The dreaded gdb session
I finally realized I’d have to feed a httpd core into gdb and try to find out what was happening from the recovered black box. Hopefully the httpd process core would reveal the culprit.

Recompile Apache one more time, this time editing line 282 of the configure script, adding CFLAGS=-g to add debug symbols and information to the binary. (In the meantime the server is down, and the cell phone is on silent, 27 calls unanswered.)

So, a few minutes later we have a brand new httpd binary, nice and stuffed with debug symbols.

Adding
CoreDumpDirectory /usr/local/apache/coredump/
to our httpd.conf would guarantee we get some juicy cores dumped for us at that location.

We tell the kernel we want core dumps by using ulimit -c (not activated on the Fedora family by default, but if you’re lucky to use Slackware you’ll find the cores are always there).

ulimit -c 10240 (10 MB limit for core files)

And Bob’s your uncle. A few minutes later and we have a TON of cores under /usr/local/apache/coredump/ to examine. I go with the lowest PID first to examine the earliest possible casualties, then I chose some random PID’s to examine. Here we go…

# gdb ../bin/httpd -c core.1035
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “i686-pc-linux-gnu”…

[... snip ....]

Program terminated with signal 11, Segmentation fault.
[New process 1045]
#0 0×084b739c in Perl_av_undef ()

OK. What the heck is Perl_av_undef()?! I didn’t have much time to investigate. Phone calls were piling up as quickly as core files…and as curious as I was, I needed solutions, not overnight hacking sessions into the Perl source code…

I know DBIC uses some large data structures passed in and out of functions. Class instances can be HUGE data structures that reference a deeply nested spiderweb of related classes, on and on recursively.

Researching Perl_av_undef() on Google gives me the clue I need: it’s something related to deeply nested recursive data structures blowing up the Perl stack, AV probably means ArrayValue and so on.

So I restart the Apache server without a kernel stack limit:

# ulimit -s unlimited
(Actually, I chose 100MB for the original test. But I later found that when testing, you better exaggerate to really eliminate a possibility. This time we’re trying to rule out stack problems here.)

Restarted the httpd server, and for a few minutes I thought we’d solved the problem… Relief only lasted about 5 minutes…and the core files reappeared.

By now you could tell I was a pretty desperate man. The stack problem was only part of the overall grim picture.

I decided to check the core files and see what had changed.

# gdb /usr/local/apache/bin/httpd -c core.1047
[... snip ...]
Program terminated with signal 11, Segmentation fault.
[New process 1047]
#0 0×00b9e46b in mysql_ping () from /usr/local/mysql//lib/mysql/libmysqlclient.so.15

And that’s when things started to turn around. Mysql was my first suspect after all….

Back to gdb:
(gdb) where
#0 0×00b9e46b in mysql_ping () from /usr/local/mysql//lib/mysql/libmysqlclient.so.15
#1 0×004d99f2 in XS_DBD__mysql__db_ping (cv=0×94d2680) at mysql.xs:519

#2 0×001edbf4 in XS_DBI_dispatch () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/DBI/DBI.so
#3 0×084bfa43 in Perl_pp_entersub ()
#4 0×084be32e in Perl_runops_standard ()
#5 0×084ba15d in Perl_call_sv ()
#6 0×0043ae8f in EMBPERL2_CallStoredCV () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#7 0×00453cf0 in embperl_Execute () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#8 0×0045a008 in ProviderEpRun_GetContentIndex () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#9 0×0045891a in Cache_GetContentIndex () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#10 0×00434a63 in ProcessFile () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#11 0×004355da in embperl_RunRequest () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#12 0×00435b8c in embperl_ExecuteRequest () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#13 0×0043233f in XS_Embperl__Req_ExecuteRequest () from /usr/lib/perl5/site_perl/5.10.0/i686-linux/auto/Embperl/Embperl.so
#14 0×084bfa43 in Perl_pp_entersub ()
#15 0×084be32e in Perl_runops_standard ()
#16 0×084ba15d in Perl_call_sv ()
#17 0×08371d3b in perl_call_handler (sv=0×96a8d20, r=0×9541c14, args=0×0) at mod_perl.c:1668
#18 0×083723f9 in perl_run_stacked_handlers (hook=0×85ef03b “PerlHandler”, r=0×9541c14, handlers=0×96a8cb0)
at mod_perl.c:1381
#19 0×08373861 in perl_handler (r=0×9541c14) at mod_perl.c:904
#20 0×08395497 in ap_invoke_handler (r=0×9541c14) at http_config.c:476
#21 0×083acd2f in process_request_internal (r=0×9541c14) at http_request.c:1299
#22 0×083ad1a0 in ap_internal_redirect (new_uri=0×951427c “/index.html”, r=0×9512ae4)
at http_request.c:1440
#23 0×00386fe4 in mod_gzip_redir1_handler () from /usr/local/apache/modules/mod_gzip.so
#24 0×00385219 in mod_gzip_handler () from /usr/local/apache/modules/mod_gzip.so
#25 0×08395497 in ap_invoke_handler (r=0×9512ae4) at http_config.c:476
#26 0×083acd2f in process_request_internal (r=0×9512ae4) at http_request.c:1299
#27 0×083acd8c in ap_process_request (r=0×9512ae4) at http_request.c:1315
#28 0×083a2ef6 in child_main (child_num_arg=12) at http_main.c:4971
#29 0×083a3227 in make_child (s=0×87be05c, slot=12, now=1211894763) at http_main.c:5150
#30 0×083a32c2 in startup_children (number_to_start=23) at http_main.c:5178
#31 0×083a3a66 in standalone_main (argc=3, argv=0xbfe95934) at http_main.c:5525
#32 0×083a434f in main (argc=3, argv=0xbfe95934) at http_main.c:5883

The call stack trace is very clear: we get that the problem happens with the native mysql_ping() C function called from an XS stub.

Now Google is our friend and we quickly arrive at this bug report from May 20 2008, precisely a week before my fantastic upgrade voyage.

Suggested solution is regression to DBD::mysql version 4.006… How frustrating. CPAN only installs the latest version, so we gotta do this one manually.

# tar xzvf DBD-mysql-4.006.tar.gz
# cd DBD-mysql-4.006
# perl Makefile.PL
# make
# make install
# /usr/local/apache/bin/apachectl restart

It’s been a year since….and the bug is still there. A poster on the original bug report discussion thread says 4.010 fixes the issue. I haven’t tested that one yet.

3 Comments

  1. Bron Gondwana says:

    As you can see from that bug report, I hit the bug too!

    I also provided a test case which proved that the bug existed by making it easy for anyone else to recreate. Generally a useful response to a bug if you can manage it. I’ve done so for a nasty little mmap bug in the Linux kernel and got a similarly good result. Developers do love a repeatable test case!

    (er, well – of course it’s still in the versions that have the bug… I might give 4.010 a go myself sometime)

  2. admin says:

    Hi Bron, nice to meet you. You bet the repeatable case was most useful, it was only after your proof of concept that MySQL officially ack’ed the bug.
    Thanks for coming by.
    Best wishes, Ze

  3. [...] the Twitter API was so well documented, I felt like having a go at it. I broke the lazyness rule(again) at my own peril, so bug reports, suggestions and criticism are always [...]

Leave a Comment