now (Re: [Orange-tech] node 127 broken in a new way)

Toni Alatalo antont at kyperjokki.fi
Mon Mar 6 22:01:00 CET 2006


On Monday 06 March 2006 22:50, you wrote:
> Done. Ahhh .... the network response of xseed1 seems much better now.
> Try it again and let me know.

ok now there is a job running again and 2 in the queue .. we'll see how they 
succeed (in the morning i guess, i better go to sleep ASAP cause i am 
travelling very early in the morning back to Adam from home)

i wonder if it could be my system leaving pipes/inodes/sockets/something not 
properly closed, and hence harming the networking in the long run? i had a 
bug like that in an earlier version, which i then fixed, but when i had to change 
the whole pipe-handling lib i could not find a way to explicitly close them and 
assumed it does it automatically .. then again i have restarted the manager 
process quite often (when i have added features etc), so i guess the OS closes 
open files / pipes / sockets at the latest then .. or?
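
A minimal sketch of closing the pipes explicitly instead of trusting the lib to do it. It assumes the manager is written in Python and starts each render command through the standard subprocess module; the mail does not name the lib actually in use, and run_and_close is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_close(cmd):
    """Run cmd, collect its output, and make sure every pipe is closed."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout: one pipe less
    )
    try:
        # communicate() reads until EOF, waits for the child to exit and
        # closes the pipes it used.
        out, _ = proc.communicate()
    finally:
        # Safety net if communicate() was interrupted: close what is left.
        for pipe in (proc.stdin, proc.stdout):
            if pipe is not None and not pipe.closed:
                pipe.close()
        if proc.poll() is None:
            # Child still running after an error: stop and reap it.
            proc.kill()
            proc.wait()
    return proc.returncode, out


if __name__ == "__main__":
    # Stand-in for the real ssh + blender command line.
    code, output = run_and_close([sys.executable, "-c", "print('frame done')"])
    print(code, output)
```

When a process exits, the kernel releases all of its open file descriptors, so restarting the manager does clean up whatever it leaked. Explicit closing matters mostly for a manager that stays up across many jobs.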

~Toni
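
The delay and safeguard ideas from the quoted mail further down can be sketched roughly like this. It is a sequential simplification under stated assumptions: the real system starts the nodes in parallel over ssh, and dispatch, run, the node names and the default values are all hypothetical.

```python
import time


def dispatch(frames, nodes, run, delay=0.1, max_attempts=3):
    """Hand each frame to a node; if it fails, retry on the next node.

    run(node, frame) must return True on success.
    Returns (done, failed, failures): done maps frame -> node that
    rendered it, failed lists frames that never succeeded, failures
    counts failed attempts per node (usable as input for a blacklist).
    """
    done, failed, failures = {}, [], {}
    next_node = 0
    for frame in frames:
        for _ in range(max_attempts):
            node = nodes[next_node % len(nodes)]
            next_node += 1
            time.sleep(delay)  # small pause between consecutive commands
            if run(node, frame):
                done[frame] = node
                break
            failures[node] = failures.get(node, 0) + 1
        else:
            # Every attempt failed: leave the frame for a later fill-in pass.
            failed.append(frame)
    return done, failed, failures


if __name__ == "__main__":
    # Fake runner: node "n127" always fails, the others always succeed.
    result = dispatch(
        frames=[1, 2, 3, 4],
        nodes=["n126", "n127", "n128"],
        run=lambda node, frame: node != "n127",
        delay=0.0,
    )
    print(result)
```

The per-node failure count is kept separate from the retry loop so that a node failing again and again can be blacklisted without changing how single frames are retried.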

>
> Toni Alatalo wrote:
> >On Monday 06 March 2006 22:36, you wrote:
> >>OK - I'm standing by...
> >
> >great - now is good to boot!
> >
> >>Toni Alatalo wrote:
> >>>On Monday 06 March 2006 21:33, you wrote:
> >>>>The problem may be the head node - xseed1. Let me know when I can
> >>>> reboot it without messing you all up. If that doesn't fix it, I'll go
> >>>> right
> >>>
> >>>ok the urgent job is almost done - i hope you can boot soon (perhaps in
> >>>10mins), i'll send yet another mail when its ok
> >>>
> >>>>Mark
> >>>
> >>>~Toni
> >>>
> >>>>Toni Alatalo wrote:
> >>>>>On Monday 06 March 2006 20:00, Prof. Mark Matties wrote:
> >>>>>>Something is wrong with it. I rebooted it from my office and now I
> >>>>>>can't ssh in as root from xseed1. I'll try to get over there
> >>>>>> later this afternoon to take a look at it up close and personal.
> >>>>>
> >>>>>ok. there is also other probs now: seems that random nodes just
> >>>>>sometimes fail to start rendering.. i wonder if the networking within
> >>>>>the farm can be a bit unreliable?
> >>>>>
> >>>>>those are given the render command via ssh but just close the
> >>>>>stdin/stdout pipes without doing anything. it is possible that this is
> >>>>>also a bug in the libs i am using to read all the pipes.. or a
> >>>>>combination of how it works and how the networking behaves there, or
> >>>>>something.
> >>>>>
> >>>>>one idea i just got  would be adding a delay (of like 0.1sec) between
> >>>>>the consecutive commands to the diff nodes, if that helps the
> >>>>>connections survive. then again we did not have probs like this before
> >>>>>.. but perhaps it was introduced by the addition of more nodes.
> >>>>>
> >>>>>also i think i'll try to code a safeguard / check system, that retries
> >>>>> / gives jobs to other nodes if they fail on an attempt / on a node.
> >>>>>
> >>>>>this happens now with like 1-3 out of 160 nodes / job, so rendering is
> >>>>>still going strong and we can pretty efficiently fill the gaps in the
> >>>>>results afterwards (i do that on the farm too)
> >>>>>
> >>>>>>Mark
> >>>>>
> >>>>>~Toni
> >>>>>
> >>>>>>Toni Alatalo wrote:
> >>>>>>>all others work fully, and i added a blacklist to our render system,
> >>>>>>>so everything is going ok but hopefully you can sort this out
> >>>>>>> anyhow:
> >>>>>>>
> >>>>>>>xseed1:~/elephant/production/renderjobs/done orange$ /usr/bin/ssh -t
> >>>>>>>10.224.10.127 DYLD_LIBRARY_PATH="/nfs/xseed2/homes/orange"
> >>>>>>>blender.app/Contents/MacOS/blender
> >>>>>>>-b /nfs/xseed2/homes/orange/elephant/production/test/simple.blend -f 1
> >>>>>>>dyld: blender.app/Contents/MacOS/blender Undefined symbols:
> >>>>>>>blender.app/Contents/MacOS/blender undefined reference to _expf
> >>>>>>>expected to be defined in /usr/lib/libSystem.B.dylib
> >>>>>>>blender.app/Contents/MacOS/blender undefined reference to _sqrtf
> >>>>>>>expected to be defined in /usr/lib/libSystem.B.dylib
> >>>>>>>Connection to 10.224.10.127 closed.
> >>>>>>>
> >>>>>>>that happens only on that single node! (was showing as missing
> >>>>>>> frames in our jobs). eg this one works:
> >>>>>>>
> >>>>>>>xseed1:~/elephant/production/renderjobs/done orange$ /usr/bin/ssh -t
> >>>>>>>10.224.10.200 DYLD_LIBRARY_PATH="/nfs/xseed2/homes/orange"
> >>>>>>>blender.app/Contents/MacOS/blender
> >>>>>>>-b /nfs/xseed2/homes/orange/elephant/production/test/simple.blend -f 1
> >>>>>>>Using Python version 2.3
> >>>>>>>Fra:1 Mem:44.16M Sce: Scene Ve:8 Fa:6 La:1
> >>>>>>>Fra:1 Mem:49.22M  | Part 1-16
> >>>>>>>Fra:1 Mem:46.69M  | Part 2-16
> >>>>>>>Fra:1 Mem:49.22M  | Part 3-16
> >>>>>>>Fra:1 Mem:46.69M  | Part 4-16
> >>>>>>>Fra:1 Mem:49.22M  | Part 5-16
> >>>>>>>Fra:1 Mem:46.69M  | Part 6-16
> >>>>>>>Fra:1 Mem:49.22M  | Part 7-16
> >>>>>>>Fra:1 Mem:46.69M  | Part 8-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 10-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 11-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 12-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 13-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 14-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 15-16
> >>>>>>>Fra:1 Mem:50.92M  | Part 16-16
> >>>>>>>Fra:1 Mem:46.69M  | Part 9-16
> >>>>>>>Saved: /nfs/xseed2/homes/orange/renderout/test/simple/0001.exr
> >>>>>>>Saved: /nfs/xseed2/homes/orange/renderout/test/simple/0001.jpg
> >>>>>>>Time: 00:05.52
> >>>>>>>
> >>>>>>>~Toni
> >>>>>>>_______________________________________________
> >>>>>>>Orange-tech mailing list
> >>>>>>>Orange-tech at blender.org
> >>>>>>>http://projects.blender.org/mailman/listinfo/orange-tech

