Friday, August 22, 2008

Mapping manual recovery entries back to the BPEL process

Looks complex, but is it?

Manual recovery of a BPEL process does not look like a trivial job at all.
At first glance, the manual recovery view in the BPEL Console does not seem to reveal much about which recovery entry corresponds to which instance of a BPEL process.

Why do instances end up in manual recovery?

Well, this has to do with how the BPEL engine handles incoming messages.
The BPEL delivery service does 2 things:
  1. Uses JMS to register the message to be processed
  2. Saves the message in the dehydration store
Once the BPEL thread picks up the JMS message, the instance goes into the UNRESOLVED state.
When the instance is complete, its entry in the dehydration store goes to the HANDLED state.

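If the server shuts down or crashes, or the engine times out, it may come back unable to find the JMS message, as it might already have been consumed. The message then remains UNRESOLVED in the dehydration store, which is how it ends up in manual recovery.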

What does Manual Recovery show?

Manual recovery shows all instances that are in UNRESOLVED state.

So it also shows instances that are on their way from UNRESOLVED to HANDLED in the dehydration store but are not yet marked as HANDLED.

From what I have seen on a production box, a lot of instances come into manual recovery and disappear after a short while.

Should I recover everything that I see in Manual Recovery?

You should not recover every message that you see in manual recovery.
Some of these could be genuine messages that are in flight.
Recover only those messages that have been in manual recovery for at least a duration X, where X is larger than the processing time expected in that particular enterprise system.
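
The age rule above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the BPEL console's own mechanism: the RecoveryEntry record, its field names, and the recovery_candidates function are hypothetical, and the 30-minute threshold stands in for whatever X suits your system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryEntry:
    """Hypothetical record for one row seen in manual recovery."""
    conversation_id: str
    state: str             # "UNRESOLVED" or "HANDLED", as described in this post
    received_at: datetime  # when the message arrived (assumed to be known)


def recovery_candidates(entries, now, min_age):
    """Return UNRESOLVED entries that have been waiting for at least min_age."""
    return [
        entry
        for entry in entries
        if entry.state == "UNRESOLVED" and now - entry.received_at >= min_age
    ]


# Usage: with X = 30 minutes, a 5-minute-old entry is treated as still in flight.
now = datetime(2008, 8, 22, 12, 0, 0)
entries = [
    RecoveryEntry("old-entry", "UNRESOLVED", now - timedelta(minutes=90)),
    RecoveryEntry("fresh-entry", "UNRESOLVED", now - timedelta(minutes=5)),
]
to_recover = recovery_candidates(entries, now, timedelta(minutes=30))
```

Picking X is the real decision here. It should be comfortably longer than the slowest normal instance, so that in-flight messages are never recovered by mistake.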

What and where do you recover?

Well, when you look at the manual recovery area in the BPEL console, you generally see 3 different tabs. Most often you would only recover from the first 2, which represent invokes and callbacks.

Generally I use the instance ID of the BPEL process to identify which BPEL instance a manual recovery instance corresponds to.
If there are instances in a stale state in BPEL, then there will be a corresponding manual recovery instance.
The conversation ID of this manual recovery instance will also contain the instance ID of the BPEL process, along with the other information from which the conversation ID is built.

What if there is no instance ID but something like MD5{xyz...}?

These are messages for which instances have not yet been created. They can be safely recovered.
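
If you have a list of stale instance IDs, the matching can be sketched as a simple string check. This is a rough illustration only: the classify_conversation_id function and its labels are hypothetical, and it assumes nothing about the conversation ID format beyond what is described here, namely that it either contains the instance ID somewhere or starts with MD5{.

```python
def classify_conversation_id(conversation_id, known_instance_ids):
    """Classify one manual recovery entry by its conversation ID.

    Returns a (label, instance_id) tuple:
      ("no_instance_yet", None)  the ID looks like MD5{...}
      ("matched", instance_id)   a known instance ID appears inside the ID
      ("unmatched", None)        none of the known instance IDs appear
    """
    if conversation_id.startswith("MD5{"):
        return ("no_instance_yet", None)

    # Check longer IDs first so that "12" does not shadow "120001".
    # Plain substring matching can still give false hits for very short IDs.
    for instance_id in sorted(known_instance_ids, key=len, reverse=True):
        if instance_id in conversation_id:
            return ("matched", instance_id)

    return ("unmatched", None)


# Usage with made-up values; real conversation IDs will look different.
stale_instances = ["120001", "120057"]
label, instance_id = classify_conversation_id("prefix-120057-suffix", stale_instances)
```

Entries that come back as "unmatched" are the ones worth checking by hand, since they belong neither to a known stale instance nor to the MD5 case.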

5 comments:

rashmialways said...

Hi,
I designed a BPEL process which, on invoke, submits a BPEL process for recovery, with the conversation ID as the input parameter.
I am looking at two tables, invoke_message and dlv_message, for that conversation ID, and I got a single record in the invoke_message table with state = 2.
What do these state numbers stand for?
I can see state values of 0, 1, 2, 3 and 4 in these two tables.

Is only state 0 retryable?

Kalidass Mookkaiah said...

I will create another post to explain the states.

hagi said...

Hi, nice post, thanks.

We are having this problem constantly in our BPEL process. Some of the instances I can recover, but most of them I cannot. When I do a server restart the problem goes away, but after a while we get it again.

Do you have any idea why this could happen so often?

Thanks.

Kalidass Mookkaiah said...

Need a proper analysis as to why this is happening.

Pick an instance that has leftover transient data in the MRA (manual recovery area) and identify where that instance is stuck.

Generally this happens if you have an out-of-sync async response, or a pick activity that times out before the instance is ready to receive data.

Your best bet is still to investigate the instance in the BPEL console and identify the stuck activity.

Be sure not to recover transactions that were created recently. Try to recover transactions that are a bit old relative to your average instance lifetime.

Hope this helps.

MK

Unknown said...

Hi

Very informative. Thanks!

I had a question.
We have quite a large number of messages stuck in the recovery console. These have accumulated over a period of time.
When we try to manually process these messages in the recovery console, we have observed that, at times, the messages are not processed. We have to restart the server to process them. My question is: does the large number of messages stuck in the recovery console contribute to the BPEL server getting stuck or becoming unresponsive and needing a restart?