SharePoint keeps all its data in a SQL Server database. SQL Server is a relational database, and there is nothing relational about storing documents in the content database. As your SharePoint deployment grows, more and more people store documents in it.
When versioning is enabled on a document library, each version of a document is kept as a separate full copy instead of keeping only the bits that differ.
With many users on the system, there is a high probability that the same document is stored in multiple places in SharePoint.
To alleviate these problems, we can store the documents outside SharePoint and keep only their attributes in the content database. This keeps the content database smaller and more manageable.
There are two options to achieve this:
- Using the External BLOB Storage API, the BLOBs (binary large objects, which is how documents are stored inside the SQL database) can be stored outside the SharePoint content database. However, this is a farm-wide configuration, and it requires manual management of orphaned BLOB files because the interface has no method to handle them. For more information, see: http://technet.microsoft.com/en-us/magazine/2009.06.insidesharepoint.aspx
- Using a stubbing mechanism, which I describe below.
Stubbing
A stub is defined as a short part of something that is left after the main part has been removed. Stubbing is the process of replacing the real file with a smaller file that contains only the information needed to retrieve the original. The real file can then be stored anywhere and in any format, as long as it can be restored unmodified and in a timely manner.
How can we implement this in SharePoint?
The way I envision it, you create a new Stubbing Document Library that allows the stubbing mechanism to be configured, while the physical location of the files stays transparent to end users.
If implemented correctly, not only are the files stored in an external location, but duplicates can be detected and better versioning capabilities become available, optimizing storage even further.
A few pieces are needed to make this work:
- Event receiver for Add, Update, Delete events that will replace the real file with the stub.
- A service that will handle the stubbing, versioning, binary diff, and any other management and reporting on the external storage.
- An HTTP module that will catch the stub just before it reaches the client and call the service to get the real file and return it to the client (event receivers have no Get event).
- Admin and configuration pages under document library settings.
- If versioning is implemented, it would be nice to have ECB (Edit Control Block) items for getting more information about a certain file.
The Event Receiver:
We need at least the following three events to handle the stubbing of a new item, updates to an existing item, and deletion of an item from SharePoint.
Item Added: We let SharePoint finish uploading the file into the document library, and then call the service for a new stub because we have a new file. We can now replace the file with the stub, keeping the filename and extension the same. The stub can be as simple as an XML file.
Item Deleted: We get the stub and call the web service to delete the file from external storage.
Item Updating: We call the service for a new stub that is a child of the previous stub, which enables versioning regardless of whether the document library has versioning enabled. In fact, enabling versioning for this document library complicates things a little, as we will need to handle more events.
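To make the "stub can be as simple as an XML file" idea concrete, here is a minimal sketch in Python (the real code would be .NET running inside SharePoint). The element and attribute names (stub, hash, size, parent) are assumptions made up for this example, not a defined format, and the parent is referenced by hash rather than embedded, which is a simplification.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_stub(content_hash: str, size: int, parent_hash: Optional[str] = None) -> bytes:
    """Serialize a stub as XML; every field name here is hypothetical."""
    root = ET.Element("stub", {"hash": content_hash, "size": str(size)})
    if parent_hash is not None:
        # An updated file points back at the stub it replaces.
        ET.SubElement(root, "parent", {"hash": parent_hash})
    return ET.tostring(root, encoding="utf-8")


def parse_stub(data: bytes) -> dict:
    """Read back the fields written by build_stub."""
    root = ET.fromstring(data)
    parent = root.find("parent")
    return {
        "hash": root.get("hash"),
        "size": int(root.get("size")),
        "parent": parent.get("hash") if parent is not None else None,
    }
```

The event receiver would write the bytes returned by build_stub over the uploaded file, and anything that later needs the real file would start from parse_stub.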
The Stubbing Service:
A Web Service implementing the following methods:
Create Stub:
Parameters: file stream
Creates a hash of the file and checks whether the file is already in the external store (this handles duplicates). If the file exists, a reference count is incremented and the existing stub is cloned and returned. If the file does not exist, the file is stored in the external storage and a new stub is created and returned.
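The duplicate check can be sketched like this in Python. It is only an illustration: the StubStore class, its in-memory dictionaries, the choice of SHA-256, and the shape of the returned stub are all assumptions, and it takes the file as bytes instead of a stream to keep the example short.

```python
import hashlib


class StubStore:
    """In-memory stand-in for the external storage and its metadata (hypothetical)."""

    def __init__(self) -> None:
        self.blobs = {}       # content hash -> file bytes
        self.ref_counts = {}  # content hash -> number of stubs pointing at the file

    def create_stub(self, data: bytes) -> dict:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blobs:
            # Duplicate upload: reuse the stored file and count one more reference.
            self.ref_counts[digest] += 1
        else:
            # First time this content is seen: store it with a single reference.
            self.blobs[digest] = data
            self.ref_counts[digest] = 1
        return {"hash": digest, "size": len(data)}
```

Uploading the same content twice returns two identical stubs and leaves a single copy in the store with a reference count of 2.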
Update Stub:
Parameters: new file stream, old stub
Gets the file referenced by the old stub, creates a binary diff between it and the new file, and stores the diff in the external storage. Then creates and returns a new stub with the old stub stored inside it, making a new version of the existing file.
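For a feel of what the binary diff could look like, here is a small Python sketch. It uses difflib from the standard library only because it needs no extra dependencies; it gets slow on large files, and a real service would more likely use a dedicated binary delta algorithm. The delta format (a list of "copy" and "data" operations) is an assumption for this example.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as byte ranges copied from `old` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))  # bytes that exist only in the new file
        # "delete" needs no entry: those old bytes are simply not copied
    return ops


def apply_delta(old: bytes, ops: list) -> bytes:
    """Rebuild the new file from the old file and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the list returned by make_delta needs to go to external storage; apply_delta(old, delta) gives back the new file byte for byte.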
Delete Stub:
Parameters: stub
Decrements the reference count for the specified stub; if the reference count reaches zero, the file can be deleted from the external storage.
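The reference counting on delete might look like the Python sketch below. One thing I am assuming here that is not spelled out above: a stored diff counts as one reference to its parent, so a base file is never removed while a diff still depends on it. The function name and the three dictionaries are hypothetical.

```python
def release(blobs: dict, ref_counts: dict, parents: dict, key: str) -> None:
    """Drop one reference to `key`; delete the file once nothing needs it."""
    ref_counts[key] -= 1
    if ref_counts[key] > 0:
        return  # other stubs (or diffs) still point at this file
    del blobs[key]
    del ref_counts[key]
    parent = parents.pop(key, None)
    if parent is not None:
        # The deleted entry was a diff that held a reference to its parent.
        release(blobs, ref_counts, parents, parent)
```

With this rule, deleting the last version of a document cascades down the chain and cleans up every diff and base file that nothing else references.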
Get:
Parameters: stub
Gets and returns the file from external storage. If the file is a binary diff, we recursively look up the parent and generate the real file before returning it.
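The recursive lookup can be sketched as follows in Python. The store layout is an assumption for illustration: each entry is either a full file or a diff that names its parent and carries a list of operations, where ("copy", start, end) reuses a byte range of the parent and ("data", bytes) adds literal bytes.

```python
def get_file(store: dict, key: str) -> bytes:
    """Return the full file for `key`, rebuilding it from diffs if needed."""
    entry = store[key]
    if entry["kind"] == "full":
        return entry["data"]
    # A diff entry: rebuild the parent first, then apply the recorded operations.
    base = get_file(store, entry["parent"])
    out = bytearray()
    for op in entry["ops"]:
        if op[0] == "copy":
            out += base[op[1]:op[2]]  # reuse a byte range of the parent
        else:
            out += op[1]              # literal bytes added in this version
    return bytes(out)
```

Every Get walks the whole chain back to the first full file, so retrieval gets slower as versions pile up; storing a full copy every so often would be one way to keep chains short.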
Most likely a database is needed to store the hash codes, reference counts, and other metadata used by the stubbing service.
HTTP Module:
This is required to replace the stub with the real file when the user requests the file from SharePoint.
We attach to the EndRequest event in the module's Init method.
This lets us check the content type and the request URL to determine whether the user is about to receive a stub, so we can call the web service and replace it with the real file.
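The decision the module has to make can be sketched in Python like this (the module itself would be a .NET IHttpModule). Everything specific here is an assumption: the library path, the "<stub " marker, and the rule that HTML responses are library pages rather than stubs.

```python
from urllib.parse import urlparse

# Hypothetical configuration: server-relative paths of the stubbing libraries.
STUB_LIBRARIES = ("/sites/team/StubbedDocs/",)
# Hypothetical marker: the opening of the stub's root XML element.
STUB_MARKER = b"<stub "


def should_swap(url: str, content_type: str, body: bytes) -> bool:
    """Decide whether a response body is a stub that must be replaced."""
    path = urlparse(url).path
    if not any(path.startswith(prefix) for prefix in STUB_LIBRARIES):
        return False
    # Assumption: list views and forms are served as HTML and are never stubs.
    if content_type.split(";")[0].strip().lower() == "text/html":
        return False
    # Only the first bytes are inspected, so real files pass through untouched.
    return STUB_MARKER in body[:256]
```

When should_swap returns True, the module would call the service's Get method with the stub and write the returned file to the response instead.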
Library Settings Pages:
This is where we can plug in a page to configure the stubbing web service URL and the external storage location.
ECB Items:
It would be nice to have a few items here that give the user some control over the stub and some information about it.
Un-Stub: Replaces the stub with the real file and marks the item so it is not stubbed again.
Re-Stub: Replaces the file with the stub and clears the un-stub flag.
Versions: Displays a page with all existing versions of the current item and lets the user restore the item to a previous version. It would also be nice to allow deleting old versions.