Thursday, February 18, 2010

Validate XML against schema files using XLINQ

XML is used in many applications, especially integration projects. I have often seen XML used for data transfer, which makes the integration very flexible, but there are some precautions you need to take.

Some integrations are very complex, and the data being sent may not be in the format you are expecting. So it is always a good idea to validate whatever you receive from the other end before you do anything with it. That is one of the best practices and will avoid a lot of problems.


Normally the schema of the XML is shared between the two parties, and the XML is generated according to that schema. Most often I have seen developers take the XML and start building logic around it directly. But what if the other party sends something wrong? This will break your program, and you will not be able to tell the other party that the input supplied is wrong.


Let’s see how we can achieve this easily.

I am going to create a sample schema file. Follow these steps in Visual Studio:
File -> New -> File, then select XML Schema.

Here is my schema

<?xml version="1.0" encoding="utf-16"?>
<xs:schema xmlns:b="http://schemas.microsoft.com/BizTalk/2003" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Document">
    <xs:complexType>
      <xs:sequence>
        <xs:element minOccurs="1" maxOccurs="10" name="DocumentIndex">
          <xs:complexType>
            <xs:attribute name="ID" type="xs:int" />
            <xs:attribute name="Value" type="xs:string" />
          </xs:complexType>
        </xs:element>
        <xs:element name="Attachment">
          <xs:complexType>
            <xs:attribute name="Type" type="xs:string" />
            <xs:attribute name="Size" type="xs:string" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="DocumentID" type="xs:int" />
    </xs:complexType>
  </xs:element>
</xs:schema>


According to this schema, my XML should look like the one below:


<Document DocumentID="12345">
  <DocumentIndex ID="1" Value="03457911" />
  <DocumentIndex ID="2" Value="03457911" />
  <Attachment Type="Doc" Size="4567" />
</Document>


We need to validate this XML against our schema. Here is how we do it.



// Requires: using System.Xml.Linq; using System.Xml.Schema;
// (MessageBox comes from System.Windows.Forms.)

// Load the schema into a schema set; "" means it has no target namespace
XmlSchemaSet schemas = new XmlSchemaSet();
schemas.Add("", "XMLSchema.xsd");

// Load the XML file we want to validate
XDocument objXDoc = XDocument.Load("testXMl.xml");

// Validate; the callback fires once per validation error
bool errors = false;
objXDoc.Validate(schemas, (o, e) =>
{
    MessageBox.Show(e.Message);
    errors = true;
});

MessageBox.Show("validation error status " + errors.ToString());



In the above code, if there is any problem with the XML, it shows the error; otherwise it just shows "validation error status False". The XML is valid only if errors is still false after execution. If validation fails, you can simply return the error message from your program and reject the request.
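If you would rather collect the messages and return them to the caller instead of popping up message boxes, a minimal sketch of such a helper could look like this (the ValidateXml method is my own illustration, not a framework API):

// A sketch of a reusable helper: collects every validation message
// instead of showing it. ValidateXml is a hypothetical name.
// Requires: using System.Collections.Generic; using System.Xml.Linq; using System.Xml.Schema;
static List<string> ValidateXml(string xmlPath, string schemaPath)
{
    XmlSchemaSet schemas = new XmlSchemaSet();
    schemas.Add("", schemaPath);

    XDocument doc = XDocument.Load(xmlPath);

    List<string> messages = new List<string>();
    doc.Validate(schemas, (o, e) => messages.Add(e.Message));

    return messages; // an empty list means the XML passed validation
}

The caller can check the list, reject the request if it is not empty, and send the messages back to the other party.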

Let us put our XML in testXMl.xml:



<Document DocumentID="12345">
  <DocumentIndex ID="1" Value="03457911" />
  <DocumentIndex ID="2" Value="03457911" />
  <Attachment Type="Doc" Size="1234" />
</Document>


So it validated, and the output shows there is no error. Now, what if we don't supply the Attachment element?


<Document DocumentID="12345">
  <DocumentIndex ID="1" Value="03457911" />
  <DocumentIndex ID="2" Value="03457911" />
</Document>


The error message shows: “The element 'Document' has incomplete content. List of possible elements expected: 'DocumentIndex, Attachment'.” According to our schema, we must have an Attachment element in our XML.

What if the XML contains all the attributes and elements and still has a problem? Consider the XML below:


<Document DocumentID="12345">
  <DocumentIndex ID="1a" Value="03457911" />
  <DocumentIndex ID="2" Value="03457911" />
  <Attachment Type="Doc" Size="1234" />
</Document>



Now when you run the program again, it pops up a different error:
"The 'ID' attribute is invalid - The value '1a' is invalid according to its datatype 'http://www.w3.org/2001/XMLSchema:int' - The string '1a' is not a valid Int32 value."
We expect the DocumentIndex ID attribute to be an int, but the XML supplies a value of the wrong type.


When you are working with a large XML document, it is hard to validate each element and attribute in your program code. Validating the XML against the defined schema makes this very easy. And if you change your schema later on, you don't need to change anything in your validation code.


So it is very easy to validate XML against a schema and make sure you are actually working with valid input. This will avoid a lot of errors caused by datatype mismatches or required data missing from the XML.

Monday, February 1, 2010

Some interesting factors when reviewing performance issues in SQL queries

There are many situations where you write queries just to get the output right, without thinking about performance. As a developer, I have seen many people take care of only the minimal performance requirements, like avoiding cursors, temp tables, etc. In some situations your query might look all right, but there will be hidden things affecting the performance that can be found only by detailed analysis.

We had a stored procedure that used a table with almost 2 billion records; the procedure joined this table six times with LEFT OUTER JOINs. We started reviewing whether there was any way to fine-tune it to reduce the IO cost. One of the best approaches I have found is analyzing the execution plans.

One of the DBAs gave me the idea of writing the same query another way, without referencing the table so many times. I liked the idea, but I found it works well only when we pull data from the same table. It gave a significant improvement when I had to take data from only one table, but when linked with other tables it had a serious problem.

I will try to explain that scenario with sample tables, show you the advantages and disadvantages of each form of the query, and show how you can measure the performance.

Here are my tables
  • Application
  • ApplicationContent
  • ApplicationAttachment

Script:

CREATE TABLE [Application](
      [ApplicationID] [int] NOT NULL,
      [ApplicationStatus] [varchar](1) NULL,
      [ApplicationDate] [date] NULL,
      [ApplicationType] [varchar](10) NULL
)

CREATE TABLE [ApplicationAttachment](
      [ApplicationID] [int] NOT NULL,
      [AttachPath] [varchar](50) NOT NULL,
      [AttachType] [varchar](50) NOT NULL,
      [AttachDate] [date] NOT NULL
)

CREATE TABLE [dbo].[ApplicationContent](
      [ApplicationID] [int] NOT NULL,
      [FieldID] [int] NOT NULL,
      [FieldValue] [varchar](100) NOT NULL
)



I have put a few rows of data in these tables.

[Screenshot: sample rows in the three tables]

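So that you can reproduce the examples below, here is some illustrative data. The FieldID meanings follow the column aliases used in the queries below (1 = FirstName, 2 = LastName, 3 = Gender, 4 = DOB), but the row values are my own assumptions, not the rows from the original screenshot.

-- Illustrative sample data (assumed values)
INSERT INTO [Application] (ApplicationID, ApplicationStatus, ApplicationDate, ApplicationType)
VALUES (1, 'A', '2010-01-05', 'Online'),
       (2, 'P', '2010-01-12', 'Paper')

-- FieldID mapping: 1 = FirstName, 2 = LastName, 3 = Gender, 4 = DOB
INSERT INTO ApplicationContent (ApplicationID, FieldID, FieldValue)
VALUES (1, 1, 'John'), (1, 2, 'Smith'), (1, 3, 'M'), (1, 4, '1980-04-21'),
       (2, 1, 'Jane'), (2, 2, 'Doe'), (2, 3, 'F'), (2, 4, '1975-11-02')

INSERT INTO ApplicationAttachment (ApplicationID, AttachPath, AttachType, AttachDate)
VALUES (1, 'D:\Attach\App1.pdf', 'PDF', '2010-01-06')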
Now look at the ApplicationContent table, which stores all the data for an application as one row per field. In some situations you might need these fields as columns. The most common approach is to write the query as below:


select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
FName.FieldValue as FirstName, LName.FieldValue as LastName,
Gen.FieldValue as Gender, DOB.FieldValue as DOB
from [Application] App
LEFT OUTER JOIN ApplicationContent FName on FName.ApplicationID = App.ApplicationID and FName.FieldID=1
LEFT OUTER JOIN ApplicationContent LName on LName.ApplicationID = App.ApplicationID and LName.FieldID=2
LEFT OUTER JOIN ApplicationContent Gen on Gen.ApplicationID = App.ApplicationID and Gen.FieldID=3
LEFT OUTER JOIN ApplicationContent DOB on DOB.ApplicationID = App.ApplicationID and DOB.FieldID=4


This gives the output below:

[Screenshot: result set of the first query]

Do you see any problem with this query? Imagine that the ApplicationContent table holds data for millions of applications. Each application can have many fields, possibly hundreds, so this table will be very large. Even with the indexes in place, the above query still hits the same table four times.

We can rewrite the same query in the form below:

Select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
MAX(CASE AContent.FieldID WHEN 1 THEN AContent.FieldValue ELSE NULL END) AS FirstName,
MAX(CASE AContent.FieldID WHEN 2 THEN AContent.FieldValue ELSE NULL END) AS LastName,
MAX(CASE AContent.FieldID WHEN 3 THEN AContent.FieldValue ELSE NULL END) AS Gender,
MAX(CASE AContent.FieldID WHEN 4 THEN AContent.FieldValue ELSE NULL END) AS DOB
From [Application] App
LEFT OUTER JOIN ApplicationContent AContent on AContent.ApplicationID = App.ApplicationID
GROUP BY App.ApplicationID, App.ApplicationStatus, App.ApplicationType


This query hits the ApplicationContent table only once, so the performance is much better.
The result of the second query:

[Screenshot: result set of the second query]

This is exactly the same result set as the first query.
You can see that the second query is much better by looking at the execution plans.

Execution plan for the first query:

[Screenshot: execution plan of the first query]

Let's look at the cost of the second query:

[Screenshot: execution plan and cost of the second query]

The first query has an overall cost of 0.022 and the second 0.0071, so the second query is roughly 68% cheaper. Both queries return the same result set. The main thing to notice is that the second query completely eliminates the repeated IO reads of the same table.
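Plan cost is only an estimate; you can also measure the IO directly. The SET options below are standard SQL Server commands (not specific to this example); they report logical reads and timings in the Messages tab, which makes the difference between the two queries easy to quantify.

-- Report IO and timing statistics for every statement that follows
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the first query here, then the second, and compare the
-- "logical reads" reported for the ApplicationContent table.

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;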

So can we blindly say this is the best way to write such queries? I will give you another example with the same queries, this time needing data from another table, and let you decide the answer.

Now let's say I need to get the attachment along with this data, and get the applications ordered by application date. You can update the query as below:


Select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
MAX(CASE AContent.FieldID WHEN 1 THEN AContent.FieldValue ELSE NULL END) AS FirstName,
MAX(CASE AContent.FieldID WHEN 2 THEN AContent.FieldValue ELSE NULL END) AS LastName,
MAX(CASE AContent.FieldID WHEN 3 THEN AContent.FieldValue ELSE NULL END) AS Gender,
MAX(CASE AContent.FieldID WHEN 4 THEN AContent.FieldValue ELSE NULL END) AS DOB,
AAttach.AttachPath, AAttach.AttachType
From [Application] App
LEFT OUTER JOIN ApplicationContent AContent on AContent.ApplicationID = App.ApplicationID
LEFT OUTER JOIN ApplicationAttachment AAttach on AAttach.ApplicationID = App.ApplicationID
GROUP BY App.ApplicationID, App.ApplicationStatus, App.ApplicationType,App.ApplicationDate, AAttach.AttachPath, AAttach.AttachType
Order by App.ApplicationDate



So if you think this query is still perfect, then you are wrong. Let's look at the execution plan now.

[Screenshot: execution plan of the updated query]

Here you can see a significant increase in the cost, and two "Sort" steps have been added, each contributing 33% of it.

[Screenshot: the two Sort steps in the execution plan]

These steps are added because of the GROUP BY clause.

Now let's take our first query and add the same columns:


select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
FName.FieldValue as FirstName, LName.FieldValue as LastName,
Gen.FieldValue as Gender, DOB.FieldValue as DOB,
AAttach.AttachPath, AAttach.AttachType
from [Application] App
LEFT OUTER JOIN ApplicationContent FName on FName.ApplicationID = App.ApplicationID and FName.FieldID=1
LEFT OUTER JOIN ApplicationContent LName on LName.ApplicationID = App.ApplicationID and LName.FieldID=2
LEFT OUTER JOIN ApplicationContent Gen on Gen.ApplicationID = App.ApplicationID and Gen.FieldID=3
LEFT OUTER JOIN ApplicationContent DOB on DOB.ApplicationID = App.ApplicationID and DOB.FieldID=4
LEFT OUTER JOIN ApplicationAttachment AAttach on AAttach.ApplicationID = App.ApplicationID
Order by App.ApplicationDate

Here is the execution plan for this query:

[Screenshot: execution plan of the extended first query]

You can see that only one table scan was added, and the plan has not changed much. The cost of the query is almost the same as the previous version, which did not have the attachment table scan or the ORDER BY.

Now try joining another table to both queries and see for yourself; a sketch of this experiment follows below. The query we felt was most efficient degrades drastically compared to the first one we discussed. The second form helps only when we have to pull different records from the same table; it does not help once you add more tables. So you can clearly see that when we use GROUP BY and bring in additional tables, it keeps adding extra burden on SQL Server. I would suggest you analyze the query thoroughly before drawing conclusions about performance. The query and the result may look perfect, but we might be missing some important factors.
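As a starting point for that experiment, here is a sketch that adds one more join to the grouped query. The ApplicationNotes table is hypothetical, invented purely for illustration.

-- Hypothetical table, used only to try the experiment
CREATE TABLE [ApplicationNotes](
      [ApplicationID] [int] NOT NULL,
      [NoteText] [varchar](200) NOT NULL
)

-- Every extra column pulled in must also be added to the GROUP BY,
-- so the sort/aggregate keeps getting wider.
Select App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
MAX(CASE AContent.FieldID WHEN 1 THEN AContent.FieldValue ELSE NULL END) AS FirstName,
MAX(CASE AContent.FieldID WHEN 2 THEN AContent.FieldValue ELSE NULL END) AS LastName,
AAttach.AttachPath, AAttach.AttachType, Notes.NoteText
From [Application] App
LEFT OUTER JOIN ApplicationContent AContent on AContent.ApplicationID = App.ApplicationID
LEFT OUTER JOIN ApplicationAttachment AAttach on AAttach.ApplicationID = App.ApplicationID
LEFT OUTER JOIN ApplicationNotes Notes on Notes.ApplicationID = App.ApplicationID
GROUP BY App.ApplicationID, App.ApplicationStatus, App.ApplicationType,
App.ApplicationDate, AAttach.AttachPath, AAttach.AttachType, Notes.NoteText
Order by App.ApplicationDate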

Overall, I just wanted to show that query performance depends on the situation and needs thorough analysis. There are many different ways to fine-tune queries.