What's the problem?

Let's start with a concrete example:

At the end of Lorax, many compositors were rendering their images at the same time, and the time estimates given by Nuke were sometimes surprising (3-4 hours left for a simple Nuke script). Watching a file being written, I realized its size was increasing very slowly: 100 KB, then 300 KB, then 400 KB after 10 minutes, and so on. I concluded the network was overloaded... :reflexionIntense:

I remembered that we had the same trouble on Tales Of The Night. Our infrastructure was certainly much smaller, but three After Effects renders were enough to hog the entire network. The solution we found at the time was to render the file locally and copy it once finished. The effect was immediate: no more network slowdowns. :sauteJoie:

So I tried rendering a CG artist's Nuke script locally to see whether it reduced the problem. Most of the data being read came from the source files (because there are many of them), and the final image was actually very light, so I didn't expect it to change anything. But once again the conclusion was clear: the render finished in 10-15 minutes (I'm not kidding) instead of 3-4 hours... :laClasse:

I thought the writing itself might be the problem. However, when I copied the new image sequence to the server, it turned out to be very fast. So I did the same for every artist, one by one, until the network bottleneck was gone and every render completed.

But this was a very poor fix. We had to understand why Nuke couldn't write its images faster over the network. After asking on the nuke-users mailing list (where I could see I was not alone, but The Foundry didn't seem able to fix it), I started to "profile" Nuke to see how it does the job (strace and inotifywatch are your friends :dentcasse: ).
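To give an idea of what that profiling can look like, here is a minimal sketch that counts write calls per file descriptor in a captured strace log. It assumes the usual one-syscall-per-line text output of strace (optionally prefixed by a PID); the function name and the sample lines are made up for illustration and are not the actual log I looked at.

```python
import re
from collections import Counter

# Matches lines such as 'write(5, "abcd"..., 4096) = 4096',
# optionally prefixed by '1234 ' or '[pid 1234] ' when child processes are followed.
WRITE_CALL = re.compile(r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:write|pwrite64)\((\d+),')

def count_writes_per_fd(strace_lines):
    """Return a Counter mapping file descriptor -> number of write calls."""
    counts = Counter()
    for line in strace_lines:
        match = WRITE_CALL.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Hypothetical sample, only to show the expected line shape.
sample = [
    'write(5, "abcd"..., 4096) = 4096',
    '[pid 4242] write(5, "efgh"..., 1024) = 1024',
    'read(4, "abc", 3) = 3',
]
counts = count_writes_per_fd(sample)
```

A high count on the descriptor of the output image, compared with its final size, is the kind of hint that many small writes are going over the network.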

Sometimes the conclusion seems obvious, but it's always good to check your assumptions:

With Nuke, if you write a 1920x1080 EXR with ZIP 1-line compression, Nuke performs a little under 900 write accesses on the file. With ZIP 16-line compression, it performs about 70 accesses (1080/16). And uncompressed, it really is 1080 accesses. :trollface:
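The arithmetic behind these figures can be sketched as follows, assuming (and it is only an assumption) that each block of scanlines is written with one access. This gives an upper bound; the figure I measured for ZIP 1 line was lower than this simple estimate.

```python
import math

def scanline_block_writes(height, lines_per_block):
    """Rough number of write accesses if each scanline block is written separately."""
    return int(math.ceil(height / float(lines_per_block)))

HEIGHT = 1080  # 1920x1080 frame

writes_per_mode = {
    "uncompressed": scanline_block_writes(HEIGHT, 1),
    "zip 1 line": scanline_block_writes(HEIGHT, 1),
    "zip 16 lines": scanline_block_writes(HEIGHT, 16),
}

for mode in sorted(writes_per_mode):
    print("%-14s %d" % (mode, writes_per_mode[mode]))
```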

In fact, ZIP 16-line compression is not efficient if the images are meant to be read back by Nuke. And depending on your network infrastructure, writing line by line can bring it down completely. It's hard to explain how so few Nuke renders can saturate a network, even a robust one. I suspect it is related to multithreading: Nuke reads images (often many at once) over the network while it is writing through it.

The most obvious solution is therefore to write the rendered image(s) to the local disk and copy them to the network disk in one go (one access). If you don't have technical resources (or just time), it's the simplest approach, but on larger projects it can quickly become daunting and (lest we forget) a source of errors.
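Done by hand, the idea looks roughly like the sketch below. The names are hypothetical: render_frame stands for whatever actually writes the image (it just has to accept an output path), and network_dir is the final destination.

```python
import os
import shutil
import tempfile

def render_locally_then_copy(render_frame, file_name, network_dir):
    """Render into a local temp directory, then copy the result to the network.

    render_frame is a hypothetical callable that writes the image to the
    path it receives.
    """
    local_dir = tempfile.mkdtemp(prefix="local_render_")
    try:
        local_path = os.path.join(local_dir, file_name)
        render_frame(local_path)                 # many small writes, but on the local disk
        network_path = os.path.join(network_dir, file_name)
        shutil.copy2(local_path, network_path)   # one sequential copy over the network
        return network_path
    finally:
        # Clean up the local copy whether or not the render succeeded.
        shutil.rmtree(local_dir, ignore_errors=True)
```

This works for a one-off render, but you have to wrap every render this way, which is exactly what becomes tedious on a larger project.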

There are several possible solutions; I worked on a prototype that I find interesting because it's easy to implement.

[Image: problems.jpg]

The principle

  • You launch a Python thread that watches a folder.
  • Every three seconds, the thread lists the files present in that folder and checks whether their names match a given regular expression (the "pattern" of your file names).
  • If a file looks like one you want to move once finished, the thread looks for the same file name with ".finished" appended (example: "toto.0001.exr.finished").
  • If that marker file exists, the thread moves the original file, removes the ".finished" one and starts the main loop again.
  • Once the render is done (so every image is finished), you ask the thread to stop.

As you can see, this method requires you to create a ".finished" file each time an image is finished, because the thread has no way of knowing when an image is actually complete. The ".finished" file can be created in a thousand different ways (for Maya, a simple "Post render frame" should do the job), so I won't go into the details. :siffle:
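Just to fix ideas, one of those thousand ways could be a tiny helper called once a frame is fully written. The function name mark_finished is hypothetical; only the ".finished" suffix comes from the method described above.

```python
import os

def mark_finished(image_path):
    """Create the empty '<image>.finished' marker next to a completed image."""
    if not os.path.isfile(image_path):
        raise IOError("Image not found, refusing to mark it finished: %s" % image_path)
    marker_path = image_path + ".finished"
    # An empty file is enough: only its existence is checked.
    with open(marker_path, "w"):
        pass
    return marker_path
```

The important point is to call it only after the renderer has closed the image file, otherwise a half-written frame could be moved.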

The code

Here it is:

import os, re, shutil, threading, time

class MoverThread( threading.Thread ) :

	def __init__( self, dirTocheck, dirToMoveIn, patternToCheck, force=False ) :
		threading.Thread.__init__( self )

		self._terminate = False
		self.dirTocheck = dirTocheck
		self.dirToMoveIn = dirToMoveIn
		self.force = force	# if True, overwrite files that already exist in the destination

		# regex pattern
		self.patternToCheck = patternToCheck
		self.rePattern = re.compile( patternToCheck )

		# sanity check
		if not os.path.isdir( self.dirTocheck ) :
			raise Exception( "The given directory (dirTocheck) is not a valid directory -> %s" % self.dirTocheck )

		if not os.path.isdir( self.dirToMoveIn ) :
			raise Exception( "The given directory (dirToMoveIn) is not a valid directory -> %s" % self.dirToMoveIn )

	def run( self ) :

		filesNotMoved = []

		while not self._terminate :

			# wait 3 seconds before doing anything
			time.sleep( 3 )

			# for every "entry" (file or folder) in the folder, check that it matches the pattern. If it does, look for a ".finished" file
			for entry in os.listdir( self.dirTocheck ) :

				# check the current entry is "compliant" with the given regex
				if not self.rePattern.match( entry ) :
					continue

				srcFilePath = os.path.join( self.dirTocheck, entry )
				dstFilePath = os.path.join( self.dirToMoveIn, entry )

				if os.path.isfile( srcFilePath+".finished" ) :

					# destination file already exists?
					if os.path.isfile( dstFilePath ) and not self.force :

						# don't add the entry if it is already in the list
						if entry not in filesNotMoved :
							filesNotMoved.append( entry )

						continue

					# move the file to its new location
					# (shutil.move also works across file systems, e.g. local disk to network disk, where os.rename fails)
					shutil.move( srcFilePath, dstFilePath )
					os.remove( srcFilePath+".finished" )

					print( "File %s moved to %s" % ( entry, self.dirToMoveIn ) )

					break	# restart the while loop rather than continuing through a listing that is now out of date

		print( "Terminated!" )

		for fileNotMoved in filesNotMoved :
			print( "Already exists: Can't move %s to %s" % ( fileNotMoved, self.dirToMoveIn ) )

	def join( self, timeout=None ) :

		# ask the main loop to stop, then wait for the thread to end
		self._terminate = True

		threading.Thread.join( self, timeout )

As you can see (or not), everything happens in a thread.

It's used like this:

import waitFinishAndCopy
myMoverThread = waitFinishAndCopy.MoverThread("/a/local/path/", "/a/network/path/", r"^toto\.[0-9]{4}\.exr$")
myMoverThread.start()
# start rendering, do rendering, end rendering.
myMoverThread.join()

And voila!
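One limitation worth knowing: in the class above, the loop exits as soon as join() sets the flag, so frames finished just before the call may stay in the local folder. A possible variant (not the prototype itself, just a sketch) replaces the flag with a threading.Event and does a last sweep before stopping. The class and attribute names are hypothetical, and the "already exists" check is left out for brevity, so existing files in the destination get overwritten.

```python
import os
import re
import shutil
import threading

class EventMoverThread(threading.Thread):
    """Hypothetical variant of MoverThread: interruptible wait plus a final sweep."""

    def __init__(self, src_dir, dst_dir, pattern, interval=3.0):
        threading.Thread.__init__(self)
        self.src_dir = src_dir
        self.dst_dir = dst_dir
        self.regex = re.compile(pattern)
        self.interval = interval
        self._stop_requested = threading.Event()

    def move_finished_files(self):
        """Move every matching file that has a '.finished' marker."""
        for entry in os.listdir(self.src_dir):
            src = os.path.join(self.src_dir, entry)
            marker = src + ".finished"
            if self.regex.match(entry) and os.path.isfile(marker):
                shutil.move(src, os.path.join(self.dst_dir, entry))
                os.remove(marker)

    def run(self):
        # wait() returns True as soon as stop() is called, False on timeout.
        while not self._stop_requested.wait(self.interval):
            self.move_finished_files()
        # Final sweep for frames finished just before stop().
        self.move_finished_files()

    def stop(self):
        """Ask the thread to stop and wait until the final sweep is done."""
        self._stop_requested.set()
        self.join()
```

It is used the same way as before, except that you call stop() instead of join() once the render is over.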

Conclusion

I hope this modest prototype will inspire you if you are experiencing delays on your network. :mechantCrash:

I also suggest profiling your core network applications, especially if they are used by many people. Their behavior is always interesting (and sometimes surprising).

Have a nice day!

Dorian

:marioCours: