[Orange-tech] node 127 broken in a new way

Toni Alatalo antont at kyperjokki.fi
Mon Mar 6 20:25:19 CET 2006


On Monday 06 March 2006 20:00, Prof. Mark Matties wrote:
> Something is wrong with it. I rebooted it from my office and now I can't
> ssh in as root from xseed1. I'll try to get over there later this
> afternoon to take a look at it up close and personal.

ok. there are also other problems now: it seems that random nodes sometimes 
just fail to start rendering.. i wonder if the networking within the farm is a 
bit unreliable?

those nodes are given the render command via ssh but just close the 
stdin/stdout pipes without doing anything. it is possible that this is also a 
bug in the libs i am using to read all the pipes.. or a combination of how 
they work and how the networking behaves there, or something.

one idea i just got would be to add a delay (of like 0.1 sec) between the 
consecutive commands to the different nodes, in case that helps the 
connections survive. then again, we did not have problems like this before .. 
but perhaps they were introduced by the addition of more nodes.
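
a rough sketch of the delay idea in python, just to make it concrete. this is 
not the actual render system code: dispatch_staggered and build_command are 
made-up names, and the launch / sleep parameters are only there so it can be 
tried without touching the farm.

```python
import subprocess
import time


def dispatch_staggered(nodes, build_command, delay=0.1,
                       launch=subprocess.Popen, sleep=time.sleep):
    """Start one command per node, pausing `delay` seconds between launches.

    build_command(node) is a hypothetical callback that returns the argument
    list to run for that node, e.g. ["/usr/bin/ssh", node, ...].
    Returns a dict mapping each node to whatever `launch` returned.
    """
    handles = {}
    for index, node in enumerate(nodes):
        if index > 0:
            sleep(delay)  # space out the connection attempts
        handles[node] = launch(build_command(node))
    return handles
```

with 160 nodes and a 0.1 sec delay this would add roughly 16 seconds to the 
start of each job, which seems acceptable if it actually helps.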

also i think i'll try to code a safeguard / check system that retries a job, 
or gives it to another node, if an attempt fails on a node.
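
something like this is what i have in mind for the safeguard, as a minimal 
sketch only. render_with_retry and run_on_node are hypothetical names, and 
run_on_node is assumed to return True when the node really produced the frame 
(how to detect that is left open here).

```python
def render_with_retry(frame, nodes, run_on_node, max_attempts=3,
                      blacklist=None):
    """Try `frame` on successive nodes until one of them succeeds.

    run_on_node(node, frame) is a hypothetical callback returning True on
    success. Nodes in `blacklist` are skipped. Returns the node that
    rendered the frame, or None if every attempt failed.
    """
    blacklist = set() if blacklist is None else blacklist
    candidates = [n for n in nodes if n not in blacklist]
    for node in candidates[:max_attempts]:
        if run_on_node(node, frame):
            return node
    return None
```

a failed attempt does not blacklist the node here, since the failures look 
random rather than tied to specific machines.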

this happens now with like 1-3 out of 160 nodes per job, so rendering is 
still going strong and we can pretty efficiently fill the gaps in the results 
afterwards (i do that on the farm too).
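
finding the gaps could be sketched like this. it assumes frames are saved 
under zero-padded four-digit names such as 0001.exr, as in the test output 
quoted below; missing_frames is a made-up helper name, not part of the render 
system.

```python
import os


def missing_frames(outdir, first, last, ext=".exr"):
    """Return the frame numbers in [first, last] with no output file.

    Assumes files are named like 0001.exr (four digits plus extension).
    """
    present = set(os.listdir(outdir))
    return [frame for frame in range(first, last + 1)
            if "%04d%s" % (frame, ext) not in present]
```

the returned list can then be fed back in as a new, smaller job.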

> Mark

~Toni

>
> Toni Alatalo wrote:
> > all others work fully, and i added a blacklist to our render system, so
> > everything is going ok but hopefully you can sort this out anyhow:
> >
> > xseed1:~/elephant/production/renderjobs/done orange$ /usr/bin/ssh -t
> > 10.224.10.127 DYLD_LIBRARY_PATH="/nfs/xseed2/homes/orange"
> > blender.app/Contents/MacOS/blender
> > -b /nfs/xseed2/homes/orange/elephant/production/test/simple.blend -f 1
> > dyld: blender.app/Contents/MacOS/blender Undefined symbols:
> > blender.app/Contents/MacOS/blender undefined reference to _expf expected
> > to be defined in /usr/lib/libSystem.B.dylib
> > blender.app/Contents/MacOS/blender undefined reference to _sqrtf expected
> > to be defined in /usr/lib/libSystem.B.dylib
> > Connection to 10.224.10.127 closed.
> >
> > that happens only on that single node! (was showing as missing frames in
> > our jobs). eg this one works:
> >
> > xseed1:~/elephant/production/renderjobs/done orange$ /usr/bin/ssh -t
> > 10.224.10.200 DYLD_LIBRARY_PATH="/nfs/xseed2/homes/orange"
> > blender.app/Contents/MacOS/blender
> > -b /nfs/xseed2/homes/orange/elephant/production/test/simple.blend -f 1
> > Using Python version 2.3
> > Fra:1 Mem:44.16M Sce: Scene Ve:8 Fa:6 La:1
> > Fra:1 Mem:49.22M  | Part 1-16
> > Fra:1 Mem:46.69M  | Part 2-16
> > Fra:1 Mem:49.22M  | Part 3-16
> > Fra:1 Mem:46.69M  | Part 4-16
> > Fra:1 Mem:49.22M  | Part 5-16
> > Fra:1 Mem:46.69M  | Part 6-16
> > Fra:1 Mem:49.22M  | Part 7-16
> > Fra:1 Mem:46.69M  | Part 8-16
> > Fra:1 Mem:50.92M  | Part 10-16
> > Fra:1 Mem:50.92M  | Part 11-16
> > Fra:1 Mem:50.92M  | Part 12-16
> > Fra:1 Mem:50.92M  | Part 13-16
> > Fra:1 Mem:50.92M  | Part 14-16
> > Fra:1 Mem:50.92M  | Part 15-16
> > Fra:1 Mem:50.92M  | Part 16-16
> > Fra:1 Mem:46.69M  | Part 9-16
> > Saved: /nfs/xseed2/homes/orange/renderout/test/simple/0001.exr
> > Saved: /nfs/xseed2/homes/orange/renderout/test/simple/0001.jpg Time: 00:05.52
> >
> > ~Toni
> > _______________________________________________
> > Orange-tech mailing list
> > Orange-tech at blender.org
> > http://projects.blender.org/mailman/listinfo/orange-tech

