Tuesday, May 1, 2012

DataView ToTable Vs Linq to find distinct

One of my past post i mention that how to find distinct from DataTable (http://dotnetstep.in/find-distinct-value-from-dataset-datatable-in-net). Most of the time need is simple and we need to find out one or two column for distinct column but when need is complex and data is not unique than all columns are used to find distinct value of datatable.

When data in datatable is small is count ( Upto certain thousand) and it contains many duplicate record in that case performance is ok for dataview.ToTable but when no. of row increases performance decrease and in case dataView.ToTable, it is noticeable decrease in performance.
Performance depend on total no of record and total of distinct record that dataset/datatable going to return.

What is the solution ?
If you are using .net framework 3.5 SP1 or later then Linq to DataTable comes to rescue. DataTable.AsEnumerable().Distinct() is much faster than DataView ToTable functionality.
Here is performance matrix on sample data. ( You can see code that i used to measure performance later in this post)
NoOfItem/Distinct Item DataView ToTable Time (Milliseconds) Linq Distinct Time (Milliseconds)
100000/2000 13675 117
200000/2000 27271 231
100000/20000 176101 113
200000/20000 412053 483
100000/80000 777402 179
200000/80000 2249655 207
Seeing above result it is clear that Linq gives much better performance.

C# Code to find distinct using linq.
var items = dt.AsEnumerable().Distinct(System.Data.DataRowComparer.Default).ToList();   

Code used to measure performance.
class Program
    {        
        static System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
        static void Main(string[] args)
        {
            Console.WriteLine(String.Format("{0},{1},{2},{3},{4}", "NoOfItem","Distinctcountwithdataview", "Distinctcountwithdataviewtime", 
                                                                   "WithoutdataviewdistinctCount", 
                                                                   "Withoutdataviewdistinctcounttime"));
            for (int i = 0; i < 1000; i+=100)
            {
                DataTable dtSample = GenerateTable(1000 * i, 80000);
                var withDataView = NormalDataSetWithDataView(dtSample);               
                var withoutDataView = NormalDataSetWithOutDataView(dtSample);
                Console.WriteLine(String.Format("{0},{1},{2},{3},{4}",1000*i,withDataView.Item1,withDataView.Item2,withoutDataView.Item1,withoutDataView.Item2));                
            }
            
            Console.ReadLine();
        }

        private static Tuple<int,long> NormalDataSetWithDataView(DataTable dt)
        {
            StopWatchEnable(watch, true);            
            DataView dv1 = new DataView(dt);
            DataTable dtNew1 = dv1.ToTable(true, "Id", "Name", "Role", "DataColumn1", "DataColumn2", "DataColumn3", "DataColumn4", "DataColumn5", "DataColumn6", "DataColumn7");           
            return new Tuple<int,long>(dtNew1.Rows.Count , StopWatchEnable(watch, false));
        }
        private static Tuple<int, long> NormalDataSetWithOutDataView(DataTable dt)
        {
            StopWatchEnable(watch, true);            
            var items = dt.AsEnumerable().Distinct(System.Data.DataRowComparer.Default).ToList();            
            return new Tuple<int, long>(items.Count, StopWatchEnable(watch, false));
        }

        private static DataTable GenerateTable(int noOfRows, int modules)
        {
            DataTable dt = new DataTable();
            dt.Columns.Add("Id", typeof(int));
            dt.Columns.Add("Name", typeof(string));
            dt.Columns.Add("Role", typeof(string));
            dt.Columns.Add("DataColumn1", typeof(string));
            dt.Columns.Add("DataColumn2", typeof(string));
            dt.Columns.Add("DataColumn3", typeof(string));
            dt.Columns.Add("DataColumn4", typeof(string));
            dt.Columns.Add("DataColumn5", typeof(string));
            dt.Columns.Add("DataColumn6", typeof(string));
            dt.Columns.Add("DataColumn7", typeof(string));
            for (int i = 0; i < noOfRows; i++)
            {
                int id = i % modules;
                dt.Rows.Add(new object[]
                              { id, 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString(), 
                                "Test" + id.ToString() });
            }
            return dt;
        }

        private static long StopWatchEnable(System.Diagnostics.Stopwatch watch, bool enable)
        {
            if (enable)
            {
                watch.Restart();
                return 0;
            }
            else
            {
                watch.Stop();
                return watch.ElapsedMilliseconds;
            }
        }
    }



Why DataView.ToTable is slow ?
- I found that ToTable functionality implemented generic functionality to compare each type of row and the way compare data and also it return new result set with distinct field.

Note:
Above code consider that need to find distinct from all column. If you want to find based on certain field then implement custom IEqualityComparer for datarow.I haven’t tested but PLinq might give some performance improvement in Linq operation.

Suggestion are must welcome.

No comments: