The below query works fine, until I run it against 300,000 records in the time_expense table. Then it takes about 5.5 seconds to return the results, which is about 5.5 seconds too long. I need to optimize this query, but am unsure how to do that.
SELECT
    ROUND(SUM(TE.hours) / TOTAL, 3) * 100 AS EFFICIENCY,
    DATEFORMAT(TE.slip_date, 'Mmm') + '-' + DATEFORMAT(TE.slip_date, 'yy') AS MyMonthName,
    MONTH(TE.slip_date) AS MyMonth,
    YEAR(TE.slip_date) AS MyYear,
    ( SELECT SUM(TE2.hours)
      FROM time_expense TE2
      WHERE TE2.ol_client_id = 1 AND
            MONTH(TE2.slip_date) = MyMonth AND
            YEAR(TE2.slip_date) = MyYear
    ) AS TOTAL
FROM
    time_expense TE
    LEFT JOIN project P
        ON TE.project_id = P.project_id
WHERE
    TE.ol_client_id = 1 AND
    TE.charge_status_id = 1 AND
    P.flag_internal IS NULL AND
    MyMonth IS NOT NULL
GROUP BY
    MyYear,
    MyMonth,
    MyMonthName
ORDER BY
    MyYear ASC,
    MyMonth ASC
Basically, the query is calculating an efficiency: the hours entered for internal projects divided by the total hours entered, for the same month and year. The problem is this part:
    ( SELECT SUM(TE2.hours)
      FROM time_expense TE2
      WHERE TE2.ol_client_id = 1 AND
            MONTH(TE2.slip_date) = MyMonth AND
            YEAR(TE2.slip_date) = MyYear
    ) AS TOTAL
since this nested query is run multiple times. I know there must be a better way to write this query, but I cannot seem to come up with one. Here is what the output looks like:
You could try replacing the inner select with an outer join:
SELECT
    ROUND(SUM(TE.hours) / TOTAL, 3) * 100 AS EFFICIENCY,
    DATEFORMAT(TE.slip_date, 'Mmm') + '-' + DATEFORMAT(TE.slip_date, 'yy') AS MyMonthName,
    MONTH(TE.slip_date) AS MyMonth,
    YEAR(TE.slip_date) AS MyYear,
    SUM(TE2.hours) AS TOTAL
FROM
    time_expense TE
    LEFT JOIN project P
        ON TE.project_id = P.project_id
    LEFT JOIN time_expense TE2
        ON TE2.ol_client_id = 1 AND
           MONTH(TE2.slip_date) = MyMonth AND
           YEAR(TE2.slip_date) = MyYear
WHERE
    TE.ol_client_id = 1 AND
    TE.charge_status_id = 1 AND
    P.flag_internal IS NULL AND
    MyMonth IS NOT NULL
GROUP BY
    MyYear,
    MyMonth,
    MyMonthName
ORDER BY
    MyYear ASC,
    MyMonth ASC
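One thing to double-check with a straight self-join like this is that every TE row matches every TE2 row in the same month, so both sums get inflated by the fan-out and the ratio can end up skewed. A related sketch, untested against this schema and assuming the SQL Anywhere style dialect the DATEFORMAT call suggests, is to pre-aggregate the monthly totals once in a derived table and join back on year and month, so the total is computed in a single pass instead of once per row:

SELECT
    ROUND(SUM(TE.hours) / MT.total_hours, 3) * 100 AS EFFICIENCY,
    DATEFORMAT(TE.slip_date, 'Mmm') + '-' + DATEFORMAT(TE.slip_date, 'yy') AS MyMonthName,
    MONTH(TE.slip_date) AS MyMonth,
    YEAR(TE.slip_date) AS MyYear,
    MT.total_hours AS TOTAL
FROM
    time_expense TE
    LEFT JOIN project P
        ON TE.project_id = P.project_id
    JOIN (
        -- one pass over time_expense: total hours per year/month
        SELECT YEAR(TE2.slip_date) AS yr,
               MONTH(TE2.slip_date) AS mth,
               SUM(TE2.hours) AS total_hours
        FROM time_expense TE2
        WHERE TE2.ol_client_id = 1
        GROUP BY YEAR(TE2.slip_date), MONTH(TE2.slip_date)
    ) MT
        -- rows with a NULL slip_date drop out here, matching the
        -- MyMonth IS NOT NULL filter in the original query
        ON MT.yr = YEAR(TE.slip_date) AND
           MT.mth = MONTH(TE.slip_date)
WHERE
    TE.ol_client_id = 1 AND
    TE.charge_status_id = 1 AND
    P.flag_internal IS NULL
GROUP BY
    MyYear,
    MyMonth,
    MyMonthName,
    MT.total_hours
ORDER BY
    MyYear ASC,
    MyMonth ASC

Because the derived table is already grouped to one row per month, the join cannot multiply the TE rows, and TOTAL stays correct.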
gdotzlaw:
The below query works fine, until I run it against 300,000 records in the time_expense table. Then it takes about 5.5 seconds to return the results, which is about 5.5 seconds too long. I need to optimize this query, but am unsure how to do that.
The fact that it is only taking 5.5 seconds is pretty amazing considering you’re running a separate query for every row returned. There is no “optimizing” this. Well, at least not significantly.
This is where a line needs to be drawn in the age-old transactional vs. reporting data discussion. This kind of output is not transactional, and expectations and approach should be adjusted accordingly.
The idea with reporting data is that you can live with a certain degree of real-time lag in order to gain an immediate response. To achieve this, fill a reporting table that munges together the complex data at regular intervals, and then run a much simpler/faster query against the reporting table to get your fast response. The data being returned is “old” by however long the refresh interval on your reporting table is.
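As a rough sketch of that (the table and column names here are just placeholders, and the syntax assumes the same SQL Anywhere style dialect as the queries above), the reporting table could hold one pre-summed row per month for this client, rebuilt on a schedule, so the report itself becomes a trivial scan:

-- Hypothetical reporting table: one pre-summed row per month.
CREATE TABLE monthly_efficiency (
    my_year        INTEGER NOT NULL,
    my_month       INTEGER NOT NULL,
    internal_hours NUMERIC(12,2) NOT NULL,
    total_hours    NUMERIC(12,2) NOT NULL,
    PRIMARY KEY (my_year, my_month)
);

-- Scheduled refresh: one pass over time_expense rebuilds the summary.
DELETE FROM monthly_efficiency;
INSERT INTO monthly_efficiency (my_year, my_month, internal_hours, total_hours)
SELECT
    YEAR(TE.slip_date),
    MONTH(TE.slip_date),
    SUM(CASE WHEN TE.charge_status_id = 1 AND P.flag_internal IS NULL
             THEN TE.hours ELSE 0 END),    -- numerator, from the original WHERE
    SUM(TE.hours)                          -- denominator, from the original subquery
FROM time_expense TE
    LEFT JOIN project P ON TE.project_id = P.project_id
WHERE TE.ol_client_id = 1
  AND TE.slip_date IS NOT NULL
GROUP BY YEAR(TE.slip_date), MONTH(TE.slip_date);

-- The report then reads the small summary table instead of 300,000 rows.
SELECT my_year, my_month,
       ROUND(internal_hours / total_hours, 3) * 100 AS efficiency
FROM monthly_efficiency
ORDER BY my_year, my_month;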
If the user absolutely needs a real-time snapshot now, you can additionally give them the option to pull in the real-time data – but now they are prepped that it will take a bit of time to fulfill their request.
Thanks David. I was beginning to think this would be the direction I would need to go. The historical data won’t change (sums for 2006, 2007, 2008 etc), and only new records for the current year (and recent months) are going to be added. So, if I did the bulk of the sum calculations and stored the results in a separate table, then I could just use a procedure to pull recent data from the time_expense data and update the sums in the results table. Might be a good use for a stored procedure that is triggered periodically (like once a day) just to keep the results for the last 6 months fresh. Sound like the right direction?
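In sketch form (again with hypothetical names, and assuming the monthly_efficiency table from the earlier sketch), that daily job could throw away and rebuild only the most recent months, leaving the historical rows untouched:

-- Hypothetical daily refresh: rebuild only the last 6 months of summaries.
-- (Could be wrapped in a stored procedure or called from a batch job.)
DELETE FROM monthly_efficiency
WHERE my_year * 12 + my_month >=
      YEAR(CURRENT DATE) * 12 + MONTH(CURRENT DATE) - 6;

INSERT INTO monthly_efficiency (my_year, my_month, internal_hours, total_hours)
SELECT
    YEAR(TE.slip_date),
    MONTH(TE.slip_date),
    SUM(CASE WHEN TE.charge_status_id = 1 AND P.flag_internal IS NULL
             THEN TE.hours ELSE 0 END),
    SUM(TE.hours)
FROM time_expense TE
    LEFT JOIN project P ON TE.project_id = P.project_id
WHERE TE.ol_client_id = 1
  AND TE.slip_date IS NOT NULL
  -- same month window as the DELETE above, so the two stay in step
  AND YEAR(TE.slip_date) * 12 + MONTH(TE.slip_date) >=
      YEAR(CURRENT DATE) * 12 + MONTH(CURRENT DATE) - 6
GROUP BY YEAR(TE.slip_date), MONTH(TE.slip_date);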
gdotzlaw:
Thanks David. I was beginning to think this would be the direction I would need to go. The historical data won’t change (sums for 2006, 2007, 2008 etc), and only new records for the current year (and recent months) are going to be added. So, if I did the bulk of the sum calculations and stored the results in a separate table, then I could just use a procedure to pull recent data from the time_expense data and update the sums in the results table. Might be a good use for a stored procedure that is triggered periodically (like once a day) just to keep the results for the last 6 months fresh.
Sound like the right direction?
Yup.
Note 1: An alternative to a stored procedure is a Servoy batch processor. I tend to lean towards keeping all business logic in the Servoy layer and keeping the database layer as dumb as possible, but sometimes a procedure just makes the most sense.
Note 2: Even if you go with a stored procedure, triggering procedures with a batch processor can give you some additional flexibility, like a runtime user interface to manage all timed events.
There is a way to achieve real-time results fast.
It requires you to update an additional table with cumulative results and a timestamp at every transaction. This will ultimately slow down your updates, but in many cases that delay is not significant.
This table containing the cumulative figures can now be queried for two records: the record corresponding with the beginning of the timeframe and the record corresponding with the end of the timeframe. Simply subtracting these two records will yield your result; it doesn’t require all the intermediate records to calculate the sum, so your query runs a lot faster. You would need to keep cumulative versions of all terms in the expressions you use in your calculations.
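A small sketch of that idea (hypothetical table, and assuming hours are never negative so the running total only grows): every insert into time_expense also appends a row carrying the new cumulative total, and the total for any window is then just the difference between its two boundary rows:

-- Hypothetical running-total table, appended to on every time_expense insert.
CREATE TABLE hours_cumulative (
    ol_client_id     INTEGER NOT NULL,
    entered_at       TIMESTAMP NOT NULL,      -- timestamp written at each transaction
    cumulative_hours NUMERIC(14,2) NOT NULL,  -- running SUM(hours) up to this row
    PRIMARY KEY (ol_client_id, entered_at)
);

-- Total for a timeframe = cumulative value at the end of the window
--                       - cumulative value at the start of the window.
-- MAX() picks the latest value; it is safe because the running total never decreases.
SELECT
    ( SELECT MAX(cumulative_hours) FROM hours_cumulative
      WHERE ol_client_id = 1 AND entered_at < '2008-02-01' )
  - COALESCE(
    ( SELECT MAX(cumulative_hours) FROM hours_cumulative
      WHERE ol_client_id = 1 AND entered_at < '2008-01-01' ), 0)
  AS hours_for_january_2008;

The trade-off mentioned above shows up on the write side: each insert into time_expense also has to read the previous cumulative value and write one extra row here.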