new virtual domain creation FAILS at DNS setup portion!

This just started happening today. I noticed it right after installing the latest virtualmin 'wbt' update. Might be related, or not - just as a data point.

Upon trying to create a new virtual top-level domain, the system starts the creation script and then hangs at the DNS portion. The /etc/named.conf file never gets populated with the new data.

This is not unique to the example domain (d23.me); it also happens with other domains. Something appears to be stuck, either in the configuration or in permissions, but I can't troubleshoot it or work out what it is.

Please help!

Harry

Setting Up Virtual Server
In domain d23.me

Creating administration group d23.me .. .. done

Creating administration user d23.me .. .. done

Creating aliases for administration user .. .. done

Adding administration user to groups .. .. done

Creating home directory .. .. done

Creating mailbox for administration user .. .. done

Adding new DNS zone ..

Additional:

There are .lock files in the /var/named directory for each of the domains I attempted to create during this process. The .lock files contain only what appears to be a process number. I'm not sure deleting them would have any effect, since even brand-new domain creation behaves the same way.

Addendum - when creating a sub-domain, the same freeze happens, albeit at 'restarting slave DNS servers'

Setting Up Virtual Server
In domain d23.fizbin.com

Creating home directory .. .. done

Adding records to DNS zone fizbin.com .. .. done

Adding to email domains list .. .. done

Adding default mail aliases .. .. done

Adding new virtual website .. .. done

Performing other Apache configuration .. .. done

Setting up scheduled Webalizer reporting .. .. done

Setting up log file rotation .. .. done

Creating MySQL database d23 .. .. done

Setting up spam filtering .. .. done

Setting up virus filtering .. .. done

Creating status monitor for website .. .. done

Setting up AWstats reporting .. .. done

Setting up password protection for AWstats .. .. done

Adding DAV directives to website configuration .. .. done

Adding DAV account for server administrator .. .. done

Overriding proxying for path /dav/ .. .. done

Adding Mailman alias and redirects to website configuration .. .. done

Re-starting DNS server .. .. done

Re-starting slave DNS servers ..

Submitted by Joe on Wed, 01/26/2011 - 12:26 Pro Licensee

Are your DNS slaves live? (This shouldn't prevent things from working, and it'd be a bug we'd want to fix, but the fact that the problem always happens during DNS makes me think maybe one or more slaves are not responding and we're not handling that condition correctly.)

Anything relevant in /var/log/messages?

I'll try to log in and have a look shortly (I'm not on a machine that has my private key at the moment, but I think I have it set up on one of our remote servers, so I can probably go through one of those).

The first issue sounds like a locking problem.

Does /etc/named.conf.lock or /etc/bind/named.conf.lock exist, and if so what process's PID is in it?

/var/named/chroot/etc/named.conf.lock exists, and contains the same number as the domainName.lock files contained.

4087

Just /etc/named.conf.lock doesn't exist.

/etc/bind/named.conf.lock does not exist either, nor does /var/named/chroot/etc/named.conf.lock

process ID 4087 is this, by the way:

4087 ? S 0:00 /usr/libexec/webmin/bind8/mass_delete.cgi

I killed it.

I deleted the

/var/named/chroot/etc/named.conf.lock

I restarted named
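For anyone hitting the same symptom, the check described above (read the PID stored in each .lock file and see whether that process still exists) could be sketched roughly like this in Python. This is only an illustration, not Webmin's own locking code: the function names are made up here, and it assumes the lock files contain nothing but a PID, as reported above.

```python
import os
from pathlib import Path
from typing import List, Optional, Tuple


def read_lock_pid(lock_path: Path) -> Optional[int]:
    """Return the PID stored in a lock file, or None if it does not hold one."""
    try:
        text = lock_path.read_text().strip()
    except OSError:
        return None
    # Assumption: the file contains only a decimal PID.
    if text.isdigit() and int(text) > 0:
        return int(text)
    return None


def pid_alive(pid: int) -> bool:
    """Check whether a process with this PID exists (signal 0 delivers nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def scan_locks(directory: str) -> List[Tuple[str, int, bool]]:
    """List (lock file name, pid, alive) for every *.lock file in a directory."""
    results = []
    for lock_path in sorted(Path(directory).glob("*.lock")):
        pid = read_lock_pid(lock_path)
        if pid is not None:
            results.append((lock_path.name, pid, pid_alive(pid)))
    return results
```

Calling `scan_locks("/var/named")` would list each lock with its holder. Note that a live holder is not necessarily a healthy one: in this case the PID belonged to a mass_delete.cgi process that was still running but apparently stuck.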

Did killing that process fix the domain creation problem?

It does appear that there's a problem with my DNS slaves, as they don't contain any updates.

Crap! Trying to track down what's going on with those.

Though, as you said, that shouldn't lock up the virtualmin process.

update: when trying to delete the partially created test domains, now it gets stuck during the AWstats configuration file and Cron job process.

HELP! My system has just become unusable. Furthermore, I removed the DNS clusters / slaves, which made no difference, apparently.

update: It does eventually progress past that step, but here's the error - the problem seems to be a general failure of lock processing:

Delete Server
In domain testdomain.com

Deleting mail aliases .. .. done

Deleting AWstats configuration file and Cron job .. .. AWstats reporting failed! : virtualmin-awstats::feature_delete failed : Failed to lock file /etc/httpd/conf/httpd.conf after 5 minutes at /usr/libexec/webmin/web-lib-funcs.pl line 1340.

Removing DAV directives from website configuration ..

Submitted by Joe on Wed, 01/26/2011 - 19:49 Pro Licensee

First up: Don't panic. You'll do something stupid if you panic. ;-)

This is beginning to look like an actual system problem, and not mere configuration issues or bugs in Virtualmin.

Check for disk and memory problems first. Look in dmesg for disk I/O errors or OOM killer messages or segmentation faults or similar.

If nothing shows up there, check the SMART status of your disks. If SMART is not enabled, enable it and trigger a diagnostic. There is a Webmin module for SMART.

Also, check top to be sure you aren't simply running out of available memory (though the errors would probably be different if that were the case) or ending up swap thrashing to the point that things are timing out.

If it's not disk or memory, then we can proceed to checking other stuff.

Whenever I see a bunch of random seeming errors from software I know isn't generally random, I pretty much always assume something is wrong with the system itself. That may not be the case here, but we should check it before going further and possibly breaking other stuff.
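As a rough aid for the dmesg check suggested above, a small Python sketch can group kernel log lines by the kind of trouble they hint at. The search patterns are assumptions (kernel message wording varies between versions), and the function names are purely illustrative.

```python
import re
import subprocess
from typing import Dict, List

# Assumed patterns; adjust to the wording your kernel actually uses.
PATTERNS = {
    "disk I/O error": re.compile(r"I/O error", re.IGNORECASE),
    "OOM killer": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
    "segfault": re.compile(r"segfault|segmentation fault", re.IGNORECASE),
}


def classify_kernel_log(text: str) -> Dict[str, List[str]]:
    """Group log lines under each category whose pattern they match."""
    hits: Dict[str, List[str]] = {label: [] for label in PATTERNS}
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[label].append(line.strip())
    return hits


def read_dmesg() -> str:
    """Return the output of the dmesg command (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Running `classify_kernel_log(read_dmesg())` returns a dictionary with one list of matching lines per category. Empty lists everywhere do not prove the hardware is fine, so the SMART and top checks are still worth doing.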

The system's fine - I checked all the hardware.

Things appear to be working again, after I manually deleted all the leftovers from the various broken prior attempts and removed the DNS slaves (and then recreated one DNS slave - it appears my ns2 slave simply won't sync or allow connections - a problem for another day).

In summary - is it possible that this was caused by the virtualmin update, by any chance?

Submitted by Joe on Thu, 01/27/2011 - 03:12 Pro Licensee

Probably not caused by the update; I don't recall any changes in this version that would have touched any of the virtual server creation code (and we haven't had any similar bugs reported). It seems more likely it was triggered by the slave being offline, which would be a bug, but it could be a bug that's been around for ages.

I'll hand this one over to Jamie to see if he is able to reproduce it. It'd definitely be something we'd want to get fixed, if it's reproducible.

Maybe I can add a little thing here that might provide further insight.

On my experimental installation (hostname lyra), which I recently upgraded to 3.83, I saw a somewhat similar effect: not with virtual domain creation, but also related to DNS cluster slaves.

The action I performed was "moving to another cluster slave". Before that, I had used the same slave BIND (on host gemini) for my production and my experimental Virtualmin, and now I installed a little extra Linux VM (host ara, with just Webmin on it) as an experimental DNS slave.

Then on lyra I removed gemini from the "Webmin Servers Index" list. I added ara to that list, and tested the connection by "remote-controlling" ara's Webmin thru lyra. Worked okay.

Then, when trying to add ara as a DNS cluster slave in Webmin's BIND module, it simply kept timing out at the "adding slave" stage, as if the slave was not responding. Though said slave's Webmin did remote-control just fine, as well as respond to DNS queries.

Unfortunately there was no useful information in Webmin's logs, and after removing and re-adding ara in the server index list a few times, it finally worked.

Sorry that I can't give more detailed or "stable" information on this, since I didn't test it further (I mostly credited the effect to lyra being an experimental VM which has seen quite some stuff in its short life already. ;) )

Yet seeing this report now, I figured there might be some issue with DNS cluster slaves since the 3.83 update. I can certainly test this some more if desired.

So it looks like the domain creation hung at the step of notifying the remote slave DNS servers, leaving locks in place and leftover config file entries.

Does your slave system have ports 10000 - 10010 open on its firewall?
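One quick way to answer that from the master is to try a TCP connection to each port in the range. Here is a minimal Python sketch; the hostname in the usage note is a placeholder, and the function names are just for illustration.

```python
import socket
from typing import Iterable, List


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def open_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> List[int]:
    """Return the subset of ports that accept a TCP connection."""
    return [port for port in ports if port_open(host, port, timeout)]
```

For example, `open_ports("ns2.example.com", range(10000, 10011))` checks ports 10000 through 10010. A port only shows as open if something is listening on it, so a missing port can mean either a firewall block or simply no service bound there at the moment.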